Skip to content

feat(archive): Lua archive module for zip/tar with bounded memory#396

Merged
wolfy-j merged 12 commits into
mainfrom
docs/archive-module-spec
Jun 27, 2026
Merged

feat(archive): Lua archive module for zip/tar with bounded memory#396
wolfy-j merged 12 commits into
mainfrom
docs/archive-module-spec

Conversation

@skhaz

@skhaz skhaz commented Jun 26, 2026

Copy link
Copy Markdown
Member

Why

Wippy Lua apps need to work with large zip/tar archives — read them from uploads and from a named filesystem, and build them — without loading the whole archive into memory. The servers have little RAM, so a multi-GB archive read into memory would OOM. There was no archive (container) support in the runtime: only the in-memory compress module.

What

A new Lua archive module, design-first and fully implemented, sitting on the existing seams (io/fs, stream.Stream, the resource cleanup store) so it composes with fs, stream, and http with no new infrastructure.

Lua API

  • archive.open(source, …) — random reader over a seekable source (an fs file, an fs handle + path, or bytes):
    entries (lazy iterator), stat, read (RAM-capped by max_inline_bytes), stream (returns a stream.Stream), extract, extract_all, close.
  • archive.scan(source, …) — forward-only walker for upload streams (req:stream(), multipart file:stream()): walk, extract_all, close.
  • archive.create(dest, …) — streaming writer to an fs file or a writable response stream: add, add_file, add_dir, close.
  • archive.formats() — registered format ids.

Entries are exposed as real stream.Stream values, so they pipe straight into fs:writefile(path, entry); extract*/add* stream via io.CopyBuffer.

Architecture

  • api/archive — codec contracts + registry (Codec + RandomReadable / StreamReadable / Writable), with magic-byte and extension resolution. Adding a format is one interface, no Lua-API or core change.
  • system/archive — built-in codecs: zip (random + forward-only streaming via local headers, deflate data descriptors, zip64), tar (random via offset index + streaming + write), tar.gz / tar.zst (streaming + write; random access is a documented fast-follow).
  • runtime/lua/modules/archive — the Lua bindings + type manifest; boot wiring in boot/components/runtime/lua.
  • fs.FS / fs.File gain Backend() accessors so the module can use a Lua fs handle as a seekable source or destination.

Safety & memory

  • Bounded memory: peak RSS is O(decompression window + copy buffer), independent of archive size. The only RAM-materializing paths (bytes source, read()) are explicitly guarded.
  • Zip-slip path sanitization on every extract; decompression-bomb caps (max_entries, max_total_bytes, max_file_bytes).
  • Security-gated via archive.read / archive.write.

Testing

Validated locally to completion (proof artifacts kept local, per repo policy):

Check Result
Bounded-memory proof — random reader 16 GiB zip streamed, every entry CRC-verified, 6.5 MiB peak RSS (~2500x) — no OOM
Bounded-memory proof — streaming deflate 4 GiB decompressed (1000x ratio) at 7.9 MiB peak RSS
Lua engine e2e every public API exercised on a real fs handle — PASS
Lua integration suite (tests/app) roundtrip/stream/scan/errors on the real runtime binary — 4/4 PASS (incl. dispatcher-backed stream())
Go codec round-trips (zip/tar/tar.gz/tar.zst) random + stream + write, incl. deflate data descriptors — PASS
golangci-lint (all new packages) 0 issues
Mutation testing (gremlins) ~86% efficacy (91 killed)
Cross-platform green on ubuntu-latest and windows-latest
Regressions (runtime/lua/..., system/archive, api/archive, boot) none

An adversarial multi-agent code review was run on the diff; its real findings were fixed (bounded the walk() resource-table growth, made the per-file cap exact, fixed a tar writer close leak, hardened zip64 data-descriptor detection). Earlier validation also fixed a directory-entry stream desync (the zip walker skipped dir entries' data descriptors, corrupting reads after any directory — found via mutation testing, regression-tested) and a Windows build break in the memory proof (syscall.Getrusage is Unix-only, now //go:build unix).

To build and run: make build-wippy from runtime/ produces dist/wippy-<goos>-<goarch> with the module registered. The giant-zip proof is env-gated: ARCHIVE_MEM_PROOF=1 go test -run TestGiantZip ./system/archive/.

Why:
Wippy apps need to read large zip/tar archives from uploads and the fs,
and to build archives, without loading them into memory — low-RAM servers
would OOM. There is no archive support today. Pinning the SDK design first
keeps the semantics from drifting before any code is written.

What:
Adds runtime/lua/modules/archive/spec.md specifying the archive module:
a random reader for seekable sources (fs file / bytes), a sequential
reader for forward-only upload streams, extract-to-fs, and a streaming
writer (file or response stream). Defines a Go codec registry so new
formats are added by implementing one interface, and pins a strict
bounded-memory guarantee (O(window + buffer), guarded RAM-materializing
paths) plus zip-slip and decompression-bomb defenses.
@skhaz skhaz requested a review from wolfy-j June 26, 2026 15:29
skhaz added 5 commits June 26, 2026 12:45
Why:
The archive module needs a bounded-memory engine to read/write zip and tar
archives without loading them into RAM. This is the format-agnostic core
that the Lua bindings will sit on top of.

What:
Adds api/archive (Codec/RandomReadable/StreamReadable/Writable contracts +
a registry with sniff/extension resolution) and system/archive with zip,
tar, tar.gz and tar.zst codecs. Zip supports random access and forward-only
streaming (local headers + deflate data descriptors, incl. zip64); tar
supports random access via an offset index, streaming, and write; tar.gz/
tar.zst compose compression over tar for stream/write. Includes per-entry
size caps and zip-slip path sanitization. Unit tests cover round-trips for
all formats, sniff/resolve, limits, and the data-descriptor stream case.
Why:
Lua apps need to read uploads/fs archives and build archives with bounded
memory. This exposes the Go codec engine to Lua as the archive module.

What:
Adds the archive Lua module (open random reader, scan sequential walker,
create streaming writer) over api/archive, plus its type manifest and boot
wiring. Entries are exposed as stream.Stream so they pipe into fs:writefile;
extract/extract_all/add* stream via io.CopyBuffer with zip-slip sanitation
and per-entry/inline size caps. Adds fs.FS/fs.File Backend accessors so the
archive module can use a Lua fs handle as a seekable source or destination.
The wippy binary builds with the module registered.
Why:
The streaming zip walker special-cased directory entries and returned early
without consuming their trailing data descriptor. Any zip containing a
directory entry (our add_dir emits deflate dir entries with data descriptors)
desynced the stream, corrupting every entry read after the directory. Found
via mutation testing; the existing stream tests had no directory entry.

What:
Directory entries are now drained like any other entry — their (empty) body
is read to EOF and the data descriptor consumed — so the reader stays aligned
to the next local file header. Locked in by TestStreamDirEntryNoDesync.
Why:
The module must be proven correct and bounded-memory on multi-GB archives,
with all Lua APIs exercised end to end and tests strong enough to kill
regressions.

What:
Adds a Lua engine e2e covering create/add/add_file/add_dir, open/entries/
stat/read/stream, extract/extract_all, scan/walk, tar, and error guards; an
env-gated giant-zip memory proof (16 GiB streamed at 6.5 MiB peak RSS; 4 GiB
deflate stream at 7.9 MiB); and boundary/metadata/sniff/limit/sanitize tests
that raise mutation efficacy to 89%. Notes the empirical memory numbers in
the module spec. gofmt formatting only for tar.go/walker.go/writer.go.
Why:
These system/archive test files were part of the validation work but missed
in the previous commit's path spec.

What:
memory_proof_test.go: env-gated bounded-memory proofs (16 GiB random read at
6.5 MiB RSS; 4 GiB deflate stream at 7.9 MiB). mutation_kill_test.go:
sniff/boundary/metadata/limit/sanitize tests raising mutation efficacy to 89%.
@skhaz skhaz changed the title docs(archive): design spec for the Lua archive module feat(archive): Lua archive module for zip/tar with bounded memory Jun 26, 2026
skhaz and others added 6 commits June 26, 2026 13:20
Why:
The giant-zip memory proof measures peak RSS via syscall.Getrusage, which is
not available on Windows; the test package failed to build on windows-latest.

What:
Adds a //go:build unix constraint to memory_proof_test.go. The proof is a
heavy, env-gated local check; other archive tests remain cross-platform.
Why:
An adversarial review of the PR surfaced several real issues: a walk over a
large archive accumulated one resource-table entry per member (unbounded
memory and stale streams reading the next entry's bytes), the per-file cap
could hand back bytes past the limit, the sequential extract ignored
buffer_bytes, and a few robustness/cleanup gaps.

What:
- walker: drop the previously yielded entry stream on each advance and on
  close, bounding the resource table to O(1) over a walk and making a stale
  stream error cleanly; thread Options through so extract_all honors
  buffer_bytes.
- helpers: limitedReadCloser now trims its return to the cap, never delivering
  bytes beyond max_file_bytes.
- tar: writer Close always closes the gzip/zstd wrapper (no leak on tar-close
  error); drop the unused reader closer field.
- zip: require BOTH local-header sentinels before treating a data descriptor
  as zip64; stop the stream walker after rejecting a stored data-descriptor
  entry; drop the unused reader closer field.
- tests: tar random access across GNU/PAX long-name and empty entries; cap
  trim precision.
Why:
max_total_bytes was documented as a decompression-bomb defense, parsed, and
defaulted, but never enforced — a limit that silently did nothing. Separately,
the streaming zip path decoded out-of-range MS-DOS date/time fields into
nonsensical normalized timestamps.

What:
- writeToFS now returns bytes written; extract_all (random and streaming) sums
  the uncompressed output and fails with max_total_bytes once the cap is
  crossed. Adds a maxTotalBytes helper applying the default.
- msdosTime clamps month/day/hour/minute/second to valid ranges so a corrupt
  field yields a bounded time instead of a normalized surprise.
…cap trim

Why:
Lock in the review-driven fixes with regression tests.

What:
- walk() keeps the resource table bounded (asserts Table().Len() after a walk).
- max_total_bytes rejects an over-budget extract_all and admits an in-budget one.
- msdosTime clamps out-of-range fields; cap reader trims to the limit across
  multiple chunked reads without underflowing; the store+data-descriptor
  rejection stops the walker cleanly.
Why:
The archive module had Go-level engine and binding tests but no Lua
integration tests running on the real wippy runtime, including the
dispatcher-backed entry stream path that the Go unit tests cannot exercise.

What:
Adds tests/app/src/test/archive/ (roundtrip, stream, scan, errors) following
the app test conventions: create/open/read/extract round-trip for zip and
tar; reading an entry as a real stream.Stream and piping it into
fs:writefile; forward-only scan/walk and extract_all from a byte source; and
error/limit handling (unknown format, max_inline_bytes, max_total_bytes,
formats listing). Verified locally: archive suite 4/4 passing.
@wolfy-j wolfy-j merged commit fc6095f into main Jun 27, 2026
4 checks passed
@wolfy-j wolfy-j deleted the docs/archive-module-spec branch June 27, 2026 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants