feat(archive): Lua archive module for zip/tar with bounded memory#396
Merged
Conversation
Why: Wippy apps need to read large zip/tar archives from uploads and the fs, and to build archives, without loading them into memory — low-RAM servers would OOM. There is no archive support today. Pinning the SDK design first keeps the semantics from drifting before any code is written. What: Adds runtime/lua/modules/archive/spec.md specifying the archive module: a random reader for seekable sources (fs file / bytes), a sequential reader for forward-only upload streams, extract-to-fs, and a streaming writer (file or response stream). Defines a Go codec registry so new formats are added by implementing one interface, and pins a strict bounded-memory guarantee (O(window + buffer), guarded RAM-materializing paths) plus zip-slip and decompression-bomb defenses.
Why: The archive module needs a bounded-memory engine to read/write zip and tar archives without loading them into RAM. This is the format-agnostic core that the Lua bindings will sit on top of. What: Adds api/archive (Codec/RandomReadable/StreamReadable/Writable contracts + a registry with sniff/extension resolution) and system/archive with zip, tar, tar.gz and tar.zst codecs. Zip supports random access and forward-only streaming (local headers + deflate data descriptors, incl. zip64); tar supports random access via an offset index, streaming, and write; tar.gz/ tar.zst compose compression over tar for stream/write. Includes per-entry size caps and zip-slip path sanitization. Unit tests cover round-trips for all formats, sniff/resolve, limits, and the data-descriptor stream case.
Why: Lua apps need to read uploads/fs archives and build archives with bounded memory. This exposes the Go codec engine to Lua as the archive module. What: Adds the archive Lua module (open random reader, scan sequential walker, create streaming writer) over api/archive, plus its type manifest and boot wiring. Entries are exposed as stream.Stream so they pipe into fs:writefile; extract/extract_all/add* stream via io.CopyBuffer with zip-slip sanitation and per-entry/inline size caps. Adds fs.FS/fs.File Backend accessors so the archive module can use a Lua fs handle as a seekable source or destination. The wippy binary builds with the module registered.
Why: The streaming zip walker special-cased directory entries and returned early without consuming their trailing data descriptor. Any zip containing a directory entry (our add_dir emits deflate dir entries with data descriptors) desynced the stream, corrupting every entry read after the directory. Found via mutation testing; the existing stream tests had no directory entry. What: Directory entries are now drained like any other entry — their (empty) body is read to EOF and the data descriptor consumed — so the reader stays aligned to the next local file header. Locked in by TestStreamDirEntryNoDesync.
Why: The module must be proven correct and bounded-memory on multi-GB archives, with all Lua APIs exercised end to end and tests strong enough to kill regressions. What: Adds a Lua engine e2e covering create/add/add_file/add_dir, open/entries/ stat/read/stream, extract/extract_all, scan/walk, tar, and error guards; an env-gated giant-zip memory proof (16 GiB streamed at 6.5 MiB peak RSS; 4 GiB deflate stream at 7.9 MiB); and boundary/metadata/sniff/limit/sanitize tests that raise mutation efficacy to 89%. Notes the empirical memory numbers in the module spec. gofmt formatting only for tar.go/walker.go/writer.go.
Why: These system/archive test files were part of the validation work but missed in the previous commit's path spec. What: memory_proof_test.go: env-gated bounded-memory proofs (16 GiB random read at 6.5 MiB RSS; 4 GiB deflate stream at 7.9 MiB). mutation_kill_test.go: sniff/boundary/metadata/limit/sanitize tests raising mutation efficacy to 89%.
Why: The giant-zip memory proof measures peak RSS via syscall.Getrusage, which is not available on Windows; the test package failed to build on windows-latest. What: Adds a //go:build unix constraint to memory_proof_test.go. The proof is a heavy, env-gated local check; other archive tests remain cross-platform.
Why: An adversarial review of the PR surfaced several real issues: a walk over a large archive accumulated one resource-table entry per member (unbounded memory and stale streams reading the next entry's bytes), the per-file cap could hand back bytes past the limit, the sequential extract ignored buffer_bytes, and a few robustness/cleanup gaps. What: - walker: drop the previously yielded entry stream on each advance and on close, bounding the resource table to O(1) over a walk and making a stale stream error cleanly; thread Options through so extract_all honors buffer_bytes. - helpers: limitedReadCloser now trims its return to the cap, never delivering bytes beyond max_file_bytes. - tar: writer Close always closes the gzip/zstd wrapper (no leak on tar-close error); drop the unused reader closer field. - zip: require BOTH local-header sentinels before treating a data descriptor as zip64; stop the stream walker after rejecting a stored data-descriptor entry; drop the unused reader closer field. - tests: tar random access across GNU/PAX long-name and empty entries; cap trim precision.
Why: max_total_bytes was documented as a decompression-bomb defense, parsed, and defaulted, but never enforced — a limit that silently did nothing. Separately, the streaming zip path decoded out-of-range MS-DOS date/time fields into nonsensical normalized timestamps. What: - writeToFS now returns bytes written; extract_all (random and streaming) sums the uncompressed output and fails with max_total_bytes once the cap is crossed. Adds a maxTotalBytes helper applying the default. - msdosTime clamps month/day/hour/minute/second to valid ranges so a corrupt field yields a bounded time instead of a normalized surprise.
…cap trim Why: Lock in the review-driven fixes with regression tests. What: - walk() keeps the resource table bounded (asserts Table().Len() after a walk). - max_total_bytes rejects an over-budget extract_all and admits an in-budget one. - msdosTime clamps out-of-range fields; cap reader trims to the limit across multiple chunked reads without underflowing; the store+data-descriptor rejection stops the walker cleanly.
Why: The archive module had Go-level engine and binding tests but no Lua integration tests running on the real wippy runtime, including the dispatcher-backed entry stream path that the Go unit tests cannot exercise. What: Adds tests/app/src/test/archive/ (roundtrip, stream, scan, errors) following the app test conventions: create/open/read/extract round-trip for zip and tar; reading an entry as a real stream.Stream and piping it into fs:writefile; forward-only scan/walk and extract_all from a byte source; and error/limit handling (unknown format, max_inline_bytes, max_total_bytes, formats listing). Verified locally: archive suite 4/4 passing.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Wippy Lua apps need to work with large zip/tar archives — read them from uploads and from a named filesystem, and build them — without loading the whole archive into memory. The servers have little RAM, so a multi-GB archive read into memory would OOM. There was no archive (container) support in the runtime: only the in-memory
compressmodule.What
A new Lua
archivemodule, design-first and fully implemented, sitting on the existing seams (io/fs,stream.Stream, theresourcecleanup store) so it composes withfs,stream, andhttpwith no new infrastructure.Lua API
archive.open(source, …)— random reader over a seekable source (anfsfile, anfshandle + path, or bytes):entries(lazy iterator),stat,read(RAM-capped bymax_inline_bytes),stream(returns astream.Stream),extract,extract_all,close.archive.scan(source, …)— forward-only walker for upload streams (req:stream(), multipartfile:stream()):walk,extract_all,close.archive.create(dest, …)— streaming writer to anfsfile or a writable response stream:add,add_file,add_dir,close.archive.formats()— registered format ids.Entries are exposed as real
stream.Streamvalues, so they pipe straight intofs:writefile(path, entry);extract*/add*stream viaio.CopyBuffer.Architecture
api/archive— codec contracts + registry (Codec+RandomReadable/StreamReadable/Writable), with magic-byte and extension resolution. Adding a format is one interface, no Lua-API or core change.system/archive— built-in codecs: zip (random + forward-only streaming via local headers, deflate data descriptors, zip64), tar (random via offset index + streaming + write), tar.gz / tar.zst (streaming + write; random access is a documented fast-follow).runtime/lua/modules/archive— the Lua bindings + type manifest; boot wiring inboot/components/runtime/lua.fs.FS/fs.FilegainBackend()accessors so the module can use a Lua fs handle as a seekable source or destination.Safety & memory
bytessource,read()) are explicitly guarded.max_entries,max_total_bytes,max_file_bytes).archive.read/archive.write.Testing
Validated locally to completion (proof artifacts kept local, per repo policy):
tests/app)stream())golangci-lint(all new packages)runtime/lua/...,system/archive,api/archive, boot)An adversarial multi-agent code review was run on the diff; its real findings were fixed (bounded the
walk()resource-table growth, made the per-file cap exact, fixed a tar writer close leak, hardened zip64 data-descriptor detection). Earlier validation also fixed a directory-entry stream desync (the zip walker skipped dir entries' data descriptors, corrupting reads after any directory — found via mutation testing, regression-tested) and a Windows build break in the memory proof (syscall.Getrusageis Unix-only, now//go:build unix).To build and run:
make build-wippyfromruntime/producesdist/wippy-<goos>-<goarch>with the module registered. The giant-zip proof is env-gated:ARCHIVE_MEM_PROOF=1 go test -run TestGiantZip ./system/archive/.