feat(vfs): sparse writes with coverage-map cache (clean rewrite)#180
Open
XciD wants to merge 65 commits into
Open
feat(vfs): sparse writes with coverage-map cache (clean rewrite)#180XciD wants to merge 65 commits into
XciD wants to merge 65 commits into
Conversation
Switch from tracking the main branch to pinning a specific rev. The pinned commit includes the range_upload entry point that the sparse write feature relies on. Pinning makes builds reproducible and protects against accidental xet-core regressions slipping in via main.
Sparse-write flushes need to compose a new CAS file from an existing reconstruction plus a small set of modified byte ranges, without re- uploading the unchanged bytes. xet-core's upload_ranges does the heavy lifting; this commit wires it through XetOps with a clean API. The caller passes RangeSnapshot values containing the offset and a Bytes buffer of pre-read data. range_upload itself never touches the staging file and never holds an I/O lock, so concurrent writes during a flush cannot interleave with our reads and produce torn content. That responsibility moves to the caller, which reads from staging under the per-inode I/O lock before invoking range_upload. A BytesReader wraps each snapshot as an AsyncRead so xet-core's DirtyInput interface stays satisfied without round-tripping through disk again. Truncate-past-end is handled by appending a synthetic empty DirtyInput over the cut tail so the original bytes past the new size are dropped from the composition. MockXet implements the trait so tests can exercise the sparse upload path without a live CAS connection.
Track sparse-write staging files via two range vectors instead of the flag-soup that the previous design accumulated: - `coverage`: ranges of staging holding real bytes (either cached from CAS or written by the user). Bytes outside coverage are sparse holes that read paths must fill from CAS. - `dirty_ranges`: subset of coverage that the user modified since the last commit. Fed to range_upload at flush time to compose the new CAS file without re-uploading unchanged bytes. Invariant `dirty_ranges ⊆ coverage` is structural: track_write touches both, track_read_fill touches coverage only. No `staging_holds_full` boolean to keep in sync — staging completeness is `is_fully_covered(size)`, derived from coverage. apply_commit_sparse re-keys the state to the new hash with dirty cleared and coverage preserved: the new CAS file was just composed from the covered staging bytes, so staging still matches the new hash at those offsets. Reads outside coverage will fetch fresh from the new hash on demand, lazily growing coverage again. apply_noop_commit clears dirty + pending_deletes when the generation matches, but leaves coverage and sparse_write intact — the on-disk staging file is unchanged, so the existing cache is still valid for the unchanged hash. The InodeEntry gains a `sparse_write: Option<Arc<SparseWriteState>>` field, set to None on both existing constructors. No call site uses this yet — wiring follows in subsequent commits. Unit tests cover the helper functions (merge_range, clip_ranges, subtract_ranges) and the state-machine methods (track_write, track_read_fill, clip_to_size, holes_in, is_fully_covered).
Plumb sparse_writes through VfsConfig and VirtualFs, and add the per-inode sync I/O lock that the sparse read/write paths will use to serialize the (pread + coverage snapshot) and (pwrite + track_write) critical sections against each other and against range_upload's reads. The derivation `sparse_writes = config.sparse_writes && (advanced_writes || overlay)` ensures the feature can never be enabled without the staging substrate it requires; the CLI layer will assert the implication explicitly so users get a hard error rather than a silent downgrade. TestOpts gets a `sparse_writes` field; when set, the helper forces advanced_writes on so the underlying VfsConfig derivation does not drop the flag. No call site uses the new field/lock yet — wiring follows.
When `--sparse-writes` is enabled and the open targets a CAS-backed file (non-truncate, non-overlay, non-empty xet_hash), punch a sparse hole of `size` bytes instead of downloading the original content. Reads in [0, size) outside the dirty regions will fetch from CAS lazily and persist into staging via fill_sparse_holes — that part lands in the next commit. A SparseWriteState is installed under the inode-table write lock that guards `set_dirty`, so poll can never observe a (dirty, sparse_write = None) intermediate. The state is keyed to the snapshot xet_hash; if poll rotated `entry.xet_hash` between `open()` and here, the snapshot check fails and we skip the install — the inode is left clean (no sparse_write, no dirty) and the next write will re-prepare against the current hash. `staging_holds_original` distinguishes three open paths: reused staging that mirrors the current hash, fresh download into staging, and a fresh sparse hole. The first two use `SparseWriteState::new_full` (coverage covers everything; fill_sparse_holes is a no-op). The third uses `SparseWriteState::new` (empty coverage; reads fill on demand). For the simple `--advanced-writes` path (no sparse_writes), nothing changes: download-or-create-empty as before, no sparse_write installed.
Add fill_sparse_holes to the read path. After pread fills buf with
staging bytes (zeros for sparse holes, real bytes for covered ranges),
walk the holes recorded in the sparse_write snapshot:
1. Download the hole's bytes from CAS.
2. Overlay them into buf — the user-visible read result.
3. pwrite them into staging at the hole's offset and update coverage,
so subsequent reads of the same range hit staging instead of
re-downloading.
The pread + sparse_write snapshot run under the per-inode io_lock, so
buf and the snapshot describe a consistent point-in-time view of
staging. The CAS download runs without any lock held (a long blocking
fetch must not block writers). Before pwriting cached bytes back into
staging, re-acquire io_lock and re-check the live coverage: a
concurrent writer may have filled the hole in the meantime, and we
must never overwrite their dirty bytes with the original CAS content.
User-visible reads always reflect the snapshot at pread time. Even if
a writer fills the hole between snapshot and overlay, the read returns
the snapshot's bytes — POSIX permits any interleaving for concurrent
read/write, and the staging cache is updated only where it doesn't
collide with the writer.
Bytes past `original_size` are skipped entirely: they don't exist in
CAS, so pread already filled buf correctly (either zeros for a sparse
hole extension or the user's own written bytes).
Wrap the write path's pwrite + inode metadata update with the per-inode io_lock, and call track_write on the inode's sparse_write to extend both coverage and dirty_ranges by the just-written window. The io_lock makes the (pwrite + sparse_write update) pair atomic against the read path's (pread + sparse_write snapshot) pair. A concurrent reader sees either the pre-write state (zeros from a hole, plus no dirty_range at this offset) or the post-write state (real bytes in staging, plus matching dirty_range), never an inconsistent middle. When sparse_write is None (advanced mode without --sparse-writes, or clean inode), the lock is still taken but track_write is skipped — the overhead is one stdlib mutex per write, which the existing flush-batch GC and setattr-truncate paths already pay.
Three sparse-related cases in setattr(size): 1. Inode already has sparse_write (open installed it): the grow path calls track_write to mark the new tail as dirty (its zeros will be uploaded via range_upload), the shrink path calls clip_to_size to trim coverage and dirty_ranges past new_size. 2. Setattr arrives on a clean CAS-backed file with no prior open and no staging file. In sparse_writes mode, install sparse_write fresh keyed to the original hash, punch a sparse hole sized to new_size (no download), then track the grow/shrink. This is the only setattr install point — open is the other; both produce the same end state. 3. Inode has no sparse_write and sparse_writes is off, or has_remote_xet is false: existing behavior (download then set_len). Falls through unchanged. The set_len call is skipped on the sparse-install path because the freshly-created staging file is already exactly new_size bytes.
Split the flush_batch upload loop into two passes: - Pass A (per-item): for each FlushItem with a sparse_snapshot, read the dirty bytes from staging under the per-inode io_lock, package them as RangeSnapshot values, and call range_upload to compose the new CAS file from the existing reconstruction. The io_lock makes the read atomic against concurrent pwrites, so xet-core hashes a self-consistent snapshot — not a chimeric mix of pre- and post-write bytes. - Pass B (batched): non-sparse items go through the existing upload_files chunked path unchanged. FlushItem gains `file_size` and `sparse_snapshot` fields captured under the inode read lock at snapshot time. apply_commit_sparse is called on the inode after a successful sparse upload: it re-keys sparse_write to the new hash, clears dirty_ranges, and preserves coverage so subsequent reads through still-open handles can keep hitting the staging cache. The unchanged-files path (same hash, no pending_deletes) now uses apply_noop_commit instead of apply_commit, leaving coverage intact when nothing actually changed remotely. The snapshot_dirty_bytes helper handles the per-range read with a graceful short-read fallback for the setattr-shrink race window.
Expose the sparse-write feature through the CLI. The flag is off by default and marked experimental. The implication --sparse-writes -> --advanced-writes is enforced at the setup layer: setting just --sparse-writes turns advanced_writes on too. Without this enforcement, VfsConfig's derivation would silently downgrade sparse_writes to false when advanced_writes is missing, and the user would see the legacy write path behind their back. A startup warn! announces that the feature is experimental so users running it know to report issues.
update_remote_file already gated on !is_dirty (matching main); a pool-cached writable fh by itself does not block poll because the inode is clean between writes. The Codex finding about idle pool fhs freezing poll applied to the feat/append-write code, not this branch. Document that sparse_write deliberately stays keyed to its original_hash across a poll-driven xet_hash rotation: readers and writers keep operating in their snapshot's frame of reference, just like the non-sparse path's flush silently absorbs poll-detected changes. The sparse path doesn't make this worse.
Nine new integration tests pin the user-visible behaviors of --sparse-writes: - sparse_open_skips_download: open does not call download_to_file - sparse_read_returns_cas_bytes: read fetches the original from CAS - sparse_read_caches_into_staging: subsequent reads hit the staging cache, not CAS - sparse_read_uncached_region_hits_cas: cache fills lazily; reads outside the cached window still go to CAS - sparse_write_overlays_on_original: writes show through subsequent reads as overlays on the CAS-backed bytes - sparse_flush_commits_composed_file: range_upload produces the expected composition (original + dirty overlay) at flush time - sparse_write_past_eof_extends: writes past EOF extend the file with the gap zero-filled - sparse_setattr_shrink_truncates_at_flush: setattr-shrink lands the truncated content in CAS - sparse_noop_flush_preserves_hash_and_cache: open-read-release with no writes keeps the original hash and doesn't dirty anything These intentionally exercise observable behavior (read return values, final committed hash, CAS stream call counts) rather than implementation details (the SparseWriteState fields, lifecycle of internal flags). Unit-level coverage of the state-machine helpers lives in inode.rs.
fsx_paranoid is a paranoid fsx variant: after every mutation it closes the file, waits for the async flush to commit to CAS, then re-opens and reads back. This catches range_upload composition bugs that the canonical fsx test misses (which reads from local staging, not from the CAS round-trip). The mount uses `--sparse-writes --advanced-writes` so the test actually exercises the sparse path — without --sparse-writes, range_upload is never invoked and the test cannot fulfill its purpose. Wired into the same CI job as fsx so any range_upload regression breaks CI in the same place. Also fixes a clippy warning in flush.rs (useless .into_iter()).
Contributor
Benchmark Results |
Contributor
POSIX Compliance (pjdfstest) |
P1 — preserve dirty sparse_write on reopen (mod.rs): When can_reuse_staging is true and an existing sparse_write is already keyed to the same hash, keep it. Replacing it with a fresh empty state dropped the in-flight dirty_ranges and coverage from the first open, which made the next flush a no-op while the user's writes were still in staging. P1 — track zero gap in past-EOF writes (mod.rs): When pwrite starts past entry.size, the gap [prev_size, offset) is zero-filled by the file extension. Mark the whole [prev_size, new_end) range dirty (not just [offset, new_end)) so range_upload composes the zeros into the new CAS file. Without this the real xet-core would commit a truncated file; MockXet zero-pads to new_file_size and the existing test masked the bug. P2 — bail with EIO on xet_hash drift during open (mod.rs): The drift check moves before set_dirty. If poll rotated entry.xet_hash between the open() snapshot and here, the staging we just punched as a hole no longer matches reality. Previously we'd skip the sparse_write install but still set_dirty + return a writable handle, so the next flush would see (dirty, sparse_write = None) and take the regular full-staging upload path — committing zero-filled holes as real content. Tests: - sparse_write_past_eof_extends adds a whitebox dirty_ranges assertion so the mock can no longer mask a missing-zero-gap regression. - sparse_reopen_dirty_preserves_state exercises the reopen-before-flush path and pins the dirty_ranges invariant.
Address two test-quality gaps Codex flagged on the prior fix: - sparse_reopen_dirty_preserves_state was non-deterministic: it released fh1 then reopened, with an early-return guard if the background flush had already cleared the dirty flag. Hold fh1 open across the reopen instead — a held writable handle keeps the inode dirty regardless of debounce timing, so the reopen path is always exercised. Also assert the user's bytes read back through the reopened handle. - Add sparse_open_hash_drift_returns_eio: drives open_advanced_write directly with a stale snapshot hash to exercise the drift branch deterministically (no need to race the poll loop). Asserts EIO and that the inode is left clean with no sparse_write installed.
…ion, GC budget) Addresses the findings from the max code review of the sparse-write path. Correctness: - Fix silent data loss on O_TRUNC then write then reopen-without-truncate. O_TRUNC clears sparse_write but keeps xet_hash; a write then lands while sparse_write is None (untracked); the reopen used to install an empty-coverage SparseWriteState over the dirty staging, so reads overlaid the original CAS bytes and the flush no-oped back to the original hash, losing the write. The reopen now only installs empty coverage for a freshly punched hole, full coverage when staging mirrors the original, and otherwise falls back to the non-sparse full-staging upload. Adds a regression test. Performance / contention: - fill_sparse_holes no longer holds the global inode-table write lock across the blocking staging writes, and coalesces the per-hole CAS fetches into a single ranged stream (one reconstruction round-trip instead of N for a fragmented read). - read() skips the io_lock + coverage snapshot on non-sparse mounts and for whole-file-cache handles (new OpenFile::Local.staging_backed flag), restoring the previously lock-free local read path. - Sparse opens charge actual disk usage (a freshly punched hole is ~0 blocks) against the GC budget instead of the full logical size, so a small edit to a large CAS file no longer evicts other staging files; materialized bytes are accounted as reads fill holes. Lifecycle: - A whole-file-cache read handle no longer consults the inode's lingering sparse_write after a sparse commit, eliminating the EBADF pwrite-per-hole and redundant CAS refetch on read-only reopens. Cleanup / safety: - Replace the hand-rolled BytesReader with std::io::Cursor (tokio AsyncRead). - Extract the original_range composition mapping into the pure, unit-tested plan_original_ranges (the mock XetOps bypassed this logic entirely). - fill_sparse_holes uses the safe FileExt::write_at instead of raw libc::pwrite. - snapshot_dirty_bytes uses read_exact_at and fails loud on a short read (the staging lock makes it unreachable) instead of silently composing a wrong-length file. - Remove the dead is_fully_covered helper.
- Revert the sparse-hole GC accounting change. Charging a freshly punched hole at its actual disk usage (~0 blocks) while try_remove / write / setattr still credit the logical size left the global staging budget asymmetric: eviction over-subtracted bytes_used, saturating it toward 0 and disarming GC. Restore consistent logical-size accounting (a sparse hole is charged conservatively at its logical size). Making the staging budget truly block-based is a separate, whole-model change. - Mark the HTTP non-Xet download cache and freshly-created file handles as not staging-backed, so reads of them keep the lock-free fast path instead of taking the per-inode io_lock + coverage snapshot. Those handles never carry a sparse_write, so the sparse read path was pure overhead for them on sparse mounts (the lock regression the staging_backed flag set out to remove). - Add multi-snapshot plan_original_ranges tests covering the by-index reader-pairing that range_upload relies on (data entries before the synthetic truncate tail).
Quality-only cleanups from the /simplify pass; no behavior change. - Add SparseWriteState::resize_to(prev_size, new_size) and call it from both setattr sparse-install branches, replacing the verbatim-duplicated grow/shrink (track_write / clip_to_size) idiom. - Drop the redundant `staging_holds_original` local in open_advanced_write: at its only read site can_reuse_staging is already true, so it equals staging_is_current; use that directly. - Extract InodeEntry::touch_commit_clocks() for the mtime/ctime/last_revalidated tail triplicated across apply_commit / apply_commit_sparse / apply_noop_commit. - Skip the per-inode io_lock on the write path when sparse writes are disabled, matching the read fast-path: the lock only serializes against sparse-read snapshots, so non-sparse mounts keep their lock-free write path. - Remove the now-stale #[allow(dead_code)] on StagingCoordinator::io_lock (it has real callers).
Poll can rotate an inode's xet_hash on a clean inode (InodeTable::update_remote_file) while leaving a retained sparse_write keyed to the old hash. open_advanced_write already guards this drift, but setattr did not: it reused the stale sparse_write, so flush composed range_upload against the wrong base and silently overwrote the poll-discovered remote update. Read the current revision (cur_hash/cur_size) under the write lock and: - rebuild the sparse_write against the current xet revision when the hash drifted to a new xet hash (empty coverage, reads refill from the current CAS object); - on a grow, truncate staging to cur_size before extending so the extension is POSIX zero-fill, not stale tail bytes from the old (larger) revision; - when the current revision is no longer xet-backed (xet_hash None), keep the existing hash (last-writer-wins, the same accepted race as the non-sparse path) rather than dropping the state (which would upload the hole-staging zeros). Adds three regression tests covering hash-to-hash drift, grow after drift to a smaller revision, and grow after drift to a non-xet revision. The grow tests were verified to fail without the staging zero-fill. Also folds in small /simplify cleanups in the same area: drop a stale allow(dead_code) on the now-wired sparse_writes field and remove a doubly-negated need_set_len intermediate.
…ranoid Two findings from the sparse-write review: - open_local_readonly opened the per-inode sparse staging file read-only, but fill_sparse_holes writes fetched CAS bytes back into staging via write_at on that same fd. On a read-only fd every cache write failed with EBADF, so each read of a hole re-fetched from CAS and logged a warning instead of honoring the staging cache. Open staging-backed read handles with write permission (the FUSE handle stays read-only). Complete cache files keep the read-only fast path. - fsx_paranoid mounted without --direct-io, so the kernel page cache (FOPEN_KEEP_CACHE) could serve the read-back from the just-written pages and a range_upload that corrupted the committed CAS object would still match the reference. Add --direct-io so every verification read goes through the FUSE handler and, on the now-clean inode, reads the committed CAS revision. Adds a regression test that a read-only open of a dirty sparse file caches hole-fills into staging (second read hits the cache); verified to fail with the read-only fd.
Closes four findings from the high-effort code review on the sparse-write PR: - flush.rs Pass A error fan-out: a sparse range_upload failure used to attribute "range_upload failed" to every item in to_flush, including non-sparse items that Pass B had not even attempted yet. Now only the failing inode is marked errored; siblings commit independently. - flush.rs head-of-line blocking: Pass A early-returned on the first sparse failure, starving independent sparse items behind a sticky upstream error. The loop now continues; failed items leave their upload_results slot as None and the downstream commit-apply skips them so they retry on the next debounce. - mod.rs setattr sparse-install with cur_hash=None: the `cur_hash.or(prev_hash)` fallback built a SparseWriteState keyed to prev_hash with original_size=cur_size, mismatching what plan_original_ranges expects from the real xet upload_ranges (the mock papers it over by using original.len()). When poll rotates the inode to non-xet between snapshot and lock we now return EIO so the caller retries against the rotated state. - inode.rs apply_noop_commit: now promotes staging_is_current=true on the non-sparse path (upload_files rehashed staging back to the previous hash, so staging IS the committed content). Sparse path keeps the flag false because holes remain unmaterialized.
Regression coverage for the four fixes in f478a78: - flush_pass_a_failure_does_not_pollute_non_sparse_sibling: one sparse inode + one non-sparse inode in the same batch; the sparse Pass A failure must not contaminate flush_errors for the non-sparse sibling, which still commits via upload_files. - flush_pass_a_failure_does_not_block_independent_sparse_items: two sparse inodes in one batch with a one-shot upload failure; the sibling whose range_upload succeeds must end clean. Before the fix the early-return killed Pass A entirely so neither converged. Smoke tests for sparse edges that weren't covered: - sparse_write_at_eof_boundary_extends_cleanly: a write landing exactly at EOF (no gap, no overlap) must not leave a phantom seam between covered and dirty regions. - sparse_setattr_shrink_inside_dirty_range_keeps_clipped_writes: setattr-shrink that cuts through the middle of a dirty range must keep the clipped prefix, not drop the range entirely. - sparse_zero_length_write_is_a_noop: defensive — pwrite(0) must not create a zero-length entry in dirty_ranges nor change size.
Three P2 issues from Codex review of f478a78: - flush.rs Pass B abort fan-out (regression in f478a78): narrowing the error message to the failing chunk left sparse Pass A successes and earlier successful Pass B chunks silently dirty — fsync could return success on uncommitted content. Restore the "mark every aborted item" behavior, but use `or_insert_with` so the more specific Pass A range_upload / snapshot error is preserved for items that already failed there. - mod.rs setattr non-xet drift validation (regression in f478a78): the EIO return for `want_sparse_install && cur_hash.is_none()` ran AFTER `entry.size`, `mtime`, `ctime`, and `set_dirty()` had already been bumped. The setattr appeared to fail to the caller but left the inode mutated locally with no scheduled flush. Hoist the cur_hash validation to before any inode mutation; the staging hole punched earlier is harmlessly replaced on the next open. - mod.rs sparse drift to LARGER non-xet (pre-existing): when poll rotates a clean sparse inode to a non-xet revision whose live size exceeds the CAS base size, a subsequent setattr that lands inside the gap [sw.original_size, cur_size) extends the staging file with zeros from set_len but doesn't track them. range_upload then composes only [0, original_size) from the old base; bytes in [original_size, new_size) are silently dropped from the commit. Track the gap up to min(cur_size, new_size) before resizing. The mock zero-fills past original_size so this is invisible under unit tests — real xet upload_ranges produces a malformed commit; the new sparse_setattr_after_drift_to_larger_non_xet_tracks_gap test asserts the dirty_ranges shape directly instead of relying on composition output.
Codex follow-up on 247a56e: the EIO path for `want_sparse_install && cur_hash.is_none()` now skips the inode mutation, but it left the staging file created at File::create(staging_path) + set_len(new_size) just before the write-lock acquisition. That stale placeholder makes `local_backing_exists` true for the next setattr call. If the inode rotates back to xet-backed before then, the next setattr's `want_sparse_install` predicate fails (local_exists=true) and the file falls through to the non-sparse path, full-uploading the stale hole-of-zeros as the new content. Drop the staging file via `StagingDir::try_remove(ino)` on the EIO exit. The GC budget was not charged yet (resize_bytes happens further down, past the early return), so the cleanup is purely on-disk.
Codex P3 follow-up on e90f056: StagingDir::try_remove(ino) subtracts file_size(ino) from bytes_used via resize_bytes(size, 0). On the EIO rollback path the staging file was never charged (the only resize_bytes in this setattr branch sits below the early return), so try_remove silently undercounts unrelated staging usage — every drift retry would shift the global budget down by the placeholder's logical size. Use a raw std::fs::remove_file on this path instead. The cleanup is best-effort: a NotFound or transient error is benign because the next open replaces the staging file anyway.
Codex P2 (fresh review): fill_sparse_holes releases io_lock during the CAS stream await, then re-acquires it to persist still-hole bytes into staging. A concurrent setattr-shrink (which holds staging.lock + inode_table.write(), not io_lock) can clip entry.size below the snapshot's original_size during this window. The post-await still_holes computation used the original (pre-shrink) range, so write_at past the new EOF would re-extend the staging file with stale CAS bytes and track_read_fill would record coverage there. User-visible consequence: a subsequent setattr-grow back through the clipped region would expose the stale CAS bytes instead of POSIX zeros, and the staging file's on-disk length silently exceeds entry.size, breaking the invariant that staging[..entry.size] is the canonical view. Read entry.size from the same inode_table.read() guard that resolves sw, and clamp every hole's end to live_size before computing still holes. Holes whose start is already past the new EOF are dropped.
Codex P2 follow-up on f54af0e: clamping still_holes inside fill_sparse_holes only narrows the race — the lock is dropped before the write_at loop, and a second setattr-shrink can still slip in after the recheck and re-extend the staging file past the new EOF. Have the size-change branch of setattr take staging.io_lock(ino) before acquiring inode_table.write(), holding it across set_len and the entry.size / sparse_write update. Same lock order as read/write/ fill_sparse_holes (io_lock → inode_table). No awaits are held under io_lock — set_len and open_local_backing_file are sync syscalls. Updated the module-level lock-ordering comment to document io_lock's role and its position relative to staging.lock and inode_table.
Codex follow-up on 8201ac5 (P2 + P3): - write() used to skip io_lock when `sparse_writes` was false (an optimization for the non-sparse advanced-writes mode where readers also skip io_lock). With setattr's size branch now taking io_lock unconditionally, a non-sparse advanced-writes pwrite races against setattr's set_len: the truncate can land between the pwrite and the subsequent entry.size update, leaving inode.size larger than the on-disk staging length. Take io_lock for every WriteTarget::Local pwrite. - Updated the stale comment in fill_sparse_holes that claimed setattr doesn't take io_lock — that became false after 8201ac5.
Covers three of the four user-visible bugs surfaced by the Codex review: 1. flush_pass_b_failure_preserves_pass_a_sparse_success: builds a batch with a sparse item (Pass A range_upload succeeds) and a non-sparse item (Pass B upload_files fails via the new fail_next_upload_files mock hook). Asserts the sparse Pass A success has no flush_errors entry while the actual Pass B failure does surface for fsync. 2. apply_commit_sparse_mismatch_does_not_stamp_clocks: snapshots mtime/ctime/last_revalidated, then invokes apply_commit_sparse with a stale dirty_generation (mismatch path). Asserts all three clock fields are unchanged — without the fix last_revalidated would be bumped to now and let subsequent lookups skip HEAD revalidation while the inode is still dirty with the racing writer's bytes. 3. write_zero_bytes_past_eof_is_a_noop: issues a zero-byte pwrite past EOF and asserts entry.size is unchanged, then writes a real byte and verifies the inode flushes cleanly — without the fix the prior zero-byte write would have advanced entry.size past the on-disk staging length, stranding snapshot_dirty_bytes on UnexpectedEof. Also adds MockXet::fail_next_upload_files so Pass B failure can be exercised independently of Pass A. The fourth fix (per-inode lock map leak) doesn't get a dedicated test because the leak is only observable via long-running churn that existing GC tests cover via inode-eviction; the new forget_locks call runs unconditionally on drop_staging, so any path that evicts an inode exercises it for free.
…tion
The drift handling in setattr carried two code paths whose only purpose
was to react to a clean sparse inode being rotated to a non-Xet revision
by poll between snapshot and lock:
- The `want_sparse_install && cur_hash.is_none()` early-return that
cleaned up the punched staging file and returned EIO.
- The `_ =>` arm in the sparse-write drift rebuild that kept the state
keyed to the OLD hash and resized in place (last-writer-wins against
a None current revision).
That scenario can't happen on hf-mount's writable surface: buckets are
uniformly Xet, and non-Xet repos are read-only so writes never reach
setattr at all. The previous `cur_hash.expect("non-xet drift rejected
above")` was already trusting this invariant after the early return.
Collapse the rebuild path from a 3-arm match to a single if/else on
hash equality, replace the trust-the-early-return expect with one that
documents the invariant directly, and drop the two regression tests
(`sparse_setattr_grow_after_drift_to_non_xet_zero_fills_tail`,
`sparse_setattr_after_drift_to_larger_non_xet_tracks_gap`) that
simulated this rotation by calling update_remote_file(ino, None, ...)
directly. The remaining drift tests still cover Xet → other-Xet
rotation, which IS reachable.
With the Xet → non-Xet rotation paths gone, the gap-track inside the matching-hash arm of the sparse-write rebuild is also unreachable. The only way cur_size could exceed sw.original_size on this branch is via user writes (which already tracked them as dirty under io_lock + the inode write lock); Xet content-addressing forbids a poll rotation that keeps the same hash while changing size. Collapse the branch to a single resize_to call. The dirty bytes between sw.original_size and cur_size are guaranteed to already be in dirty_ranges from the originating write().
…efactor The test was added to validate plan_original_ranges on the snapshot shape that setattr's drift gap-track branch produced. With that branch gone (Xet → non-Xet is unreachable, and Xet → same-hash-different-size is impossible by content-addressing), no caller produces this specific snapshot shape anymore. The boundary case (start == original_size) is still trivially covered by write_past_eof_is_an_insert which exercises the same plan_original_ranges branch.
Three small simplifications surfaced by a /simplify pass: 1. `SparseWriteState::new_resized(hash, original_size, new_size)` helper. Setattr's sparse-install and drift-rebuild branches both built a fresh SparseWriteState with `new(hash, cur_size); resize_to(cur_size, new_size); Arc::new(sw)`. Factored into one constructor. 2. `track_write` now takes the prior EOF and handles the zero-fill gap internally instead of asking every caller to compute `effective_start = offset.min(prev_eof)`. write() drops its 4-line manual gap math, and the invariant "writes past EOF must dirty the zero gap" lives on SparseWriteState where it belongs (Codex altitude finding A3). track_write also reuses track_read_fill for the coverage merge instead of duplicating the merge_range call. 3. Collapse setattr's install/drift/in-place-resize trifurcation into a single `needs_rebuild` predicate: rebuild when we just punched a sparse-install hole OR the existing sparse_write is keyed to a stale hash; otherwise resize in place. Removes the duplicated drift-vs-install rebuild blocks and one copy of the `Xet → non-Xet unreachable` expect. Also tightens `new_full` to a direct struct construction instead of the spread `..Self::new(...)` pattern (which was allocating two empty Vecs just to override one of them). Adds a unit test for `track_write` past EOF to pin the gap-tracking behavior at the state-machine layer instead of relying on integration coverage in write()/setattr.
The `let snapshots_result = ...; let snapshots = match snapshots_result` pair had no reason to be two bindings — fold the match directly into the let. Saves one line, removes the indirection.
A multi-worker stress test that exercises the sparse-write state machine under concurrent ops on many files. Each worker owns its own file (no cross-worker shadow races), but flushes share a global FlushManager so Pass A/B batching, lock-map churn, drift handling, and concurrent staging lifecycles all get exercised together. Workload mix per worker is biased toward the edge cases unit tests cover individually but not in combinations: - in-file overwrite - write spanning EOF - write at exactly EOF - write past EOF (the zero-fill gap) - setattr-grow and setattr-shrink - reopen (sparse_write preservation + drift retry) - fsync (force apply_commit_sparse cycle) - mid-stream random-range read (verified against shadow) Validation is byte-exact: each worker maintains a shadow Vec<u8> mirror, and after release+wait_for_clean does a full-file CAS-level read that must match shadow byte for byte. A divergence fails the test with the offending offset and a reproducible seed. Default: 12 files × 200 ops, ~1.5 s wall-time, runs as part of the lib test suite. Override via STRESS_FILES / STRESS_OPS_PER_FILE / STRESS_SEED for heavier runs (verified 64 × 2000 = 128k ops in ~12 s, no divergence across 5 seeds). Surfaced one known race during development: `open()` can return EIO when xet_hash drifts between the pre-lock snapshot and the install under-lock — the code comment claims "the caller retries" but no retry sits anywhere above (FUSE adapter passes EIO straight through). The test mirrors the documented contract with a bounded retry loop in `open_with_retry`. Worth a separate fix to either handle the drift internally (re-snapshot under the lock) or to wire a retry into the FUSE adapter.
`open_advanced_write` used to capture `xet_hash` / `size` from the caller's pre-lock snapshot (taken in `open()` via `get_file_entry`) and surface EIO when those values disagreed with the inode under `staging.lock(ino)`. The comment claimed "the caller retries" but nothing above wires that retry — FUSE and NFS adapters pass EIO straight through to the user. The new sparse-write stress test reproduces the race deterministically: under heavy concurrent flushes, `apply_commit_sparse` rotates entry.xet_hash between `open()`'s snapshot and `open_advanced_write`'s install, returning EIO for what should be an in-flight reopen on otherwise clean state. Two changes: 1. `open_advanced_write` now re-reads xet_hash and size from the inode table under staging.lock (alongside the existing is_dirty / staging_is_current re-read). The caller's parameters become advisory (prefixed `_`) and the rest of the function uses the freshly-snapshotted values — punching staging at the live size, installing sparse_write keyed to the live hash, comparing against the live hash for the materializes-remote check. 2. `open()` wraps the advanced-write path in a bounded internal retry (4 attempts with exponential backoff 5/10/20 ms) so a flush that completes between the inode_table.read() and the inode_table.write() inside open_advanced_write — the narrow residual race window — also resolves without surfacing EIO. After the retry budget is exhausted any remaining EIO propagates so genuine IO errors still reach the caller. Updates the pre-existing drift test to reflect the new contract: `open_advanced_write` is now robust to a stale snapshot and installs a sparse_write keyed to the live hash instead of returning EIO.
…test Two changes paired with the open_advanced_write drift fix: 1. New `StressOp::RogueRotation` injects poll-induced remote rotations into the stress workload. When fired, the worker quiesces (fsync + release + wait_for_clean), registers a freshly-generated remote blob under a new CAS hash, calls `update_remote_file` on the inode, updates its shadow to the new content, and reopens. This exercises the drift-rebuild path in `open_advanced_write` for real (sparse_write keyed to the OLD committed hash, entry.xet_hash now pointing at the rotated remote revision) and validates that the reopen + subsequent writes still produce a correct final file (shadow byte-exact match at end-of-worker). ~4% of ops are rotations; 32 files × 1000 ops triggers dozens per run without blowing up wall time. Verified across multiple seeds at 48 files × 1500 ops (≈ 72 k ops) — no divergence. 2. The pre-existing `sparse_open_hash_drift_returns_eio` test exercised the OLD drift contract (EIO returned when a stale xet_hash snapshot disagreed with the live inode). With the production fix re-reading xet_hash/size under staging.lock, that contract no longer holds — a stale snapshot is simply ignored. Rename + rewrite the test as `sparse_open_stale_snapshot_uses_live_state` to assert the new behaviour: open succeeds and installs a sparse_write keyed to the live hash.
…l upload Mirrors the workload from https://huggingface.co/blog/delta-weight-sync: seed a checkpoint-sized file on a mounted HF bucket, then for N steps modify ~1% of bytes at scattered offsets (the wire-side equivalent of patching changed bf16 elements) and time each commit end-to-end. Two tests: - bench_sparse_delta_commits: --sparse-writes → range_upload composes only dirty bytes against the old CAS base. - bench_full_overwrite_commits: --advanced-writes only → every step re-uploads the whole file (baseline for the "ship the full checkpoint" approach the blog post compares against). Configurable via env: BENCH_FILE_SIZE_MB default 64 BENCH_STEPS default 5 BENCH_DELTA_RATE default 0.01 (1%) BENCH_AVG_RUN_LEN default 8 Run with HF_TOKEN set: cargo test --release --test sparse_delta_bench -- --nocapture --test-threads=1
Temporary instrumentation to diagnose the original_size mismatch reported by xet-core when the bench runs (109MB vs the 64MB the bench actually writes). Will be removed once the bug is identified.
The delta-weight-sync style bench surfaced a wire-side bug: after a non-sparse upload of a 64 MiB seed file, `XetFileInfo::file_size` returned 68943872 bytes (~65.75 MiB, +1.75 MiB inflated). The CAS content was still correctly 64 MiB (verified via xet-core's reconstruction info: `caller said original_size=68943872 but reconstruction info reports 67108864`), so the bug is purely in xet's returned metadata, not in the uploaded bytes. Root cause sits upstream in xet-core's `file_cleaner::finish_inner`: the returned `file_size` comes from `deduplication_metrics.total_bytes`, which over-counts when in-flight chunks match earlier bytes via local dedup. Until that's fixed upstream, trust the size we already know authoritatively: - `flush_batch`: use `item.file_size` (captured under `staging.lock` alongside the dirty_generation snapshot; equals the on-disk staging length at upload time since concurrent O_TRUNC/setattr-shrink also take staging.lock) when calling `apply_commit` / `apply_commit_sparse` and when logging the uploaded size. - `streaming_commit`: use `channel.bytes_written` (the atomic the write path increments per syscall — see WriteTarget::Streaming branch) as the committed size for `apply_commit`. Both sources are bit-exact with what we actually fed to xet-core, so they sidestep the over-counted return value entirely. Without this fix, a Pass B upload would set `entry.size` (and thus the next `sparse_write.original_size`) to the inflated value, and every subsequent `range_upload` would be rejected by xet-core with the original_size/reconstruction-size mismatch — the delta-weight-sync workflow was unusable end-to-end before this fix. Also tightens the bench: the previous diagnostic warn line becomes an `assert_eq!` so a future regression of the same shape fails the bench loudly instead of slipping through.
The previous bench simulated 'patch each changed bf16 element with its own 2-byte pwrite' (~53k scattered tiny writes per step). That hit a known quadratic cost in xet-core's upload_ranges window planner — single steps took 60+ minutes. It also doesn't match what the blog actually does: the delta-weight-sync TRL extension BATCHES the changed elements into a single sparse-safetensors blob and uploads that as one file. Rewrite the bench around the realistic batched pattern: - bench_sparse_delta_blob (default): one contiguous blob write per step, sized to delta_rate × file_size. This is the path our sparse-write design is actually optimized for — range_upload composes 1 dirty range against the old CAS base. - bench_full_overwrite (default): same blob workload, but on a non-sparse mount so the legacy upload_files re-uploads the full file every step. Baseline for the 'ship the whole checkpoint each step' approach the blog compares against. - bench_sparse_delta_scattered (opt-in via BENCH_RUN_SCATTERED=1): the old pathological N-tiny-pwrite pattern, kept as a regression marker for the upload_ranges window-planner cost. Off by default — single step takes 10+ minutes.
Bench at delta-weight-sync style workloads (1 contiguous blob per step, ~1% of file) shows our sparse-write loses to non-sparse + xet CDC dedup below ~1 GB: 64 MB: sparse 3.97s vs non-sparse 0.36s (~11x slower) 256 MB: sparse 4.57s vs non-sparse 0.59s (~8x slower) 1024 MB: sparse 0.88s vs non-sparse 1.35s (~1.5x faster) 1200 MB: sparse 1.45s vs non-sparse 1.37s (~par) xet-core's `upload_files` already wire-trims small files efficiently via CDC chunk dedup — only modified chunks go on the wire. The sparse path adds a reconstruction-info round-trip + per-window compose overhead that doesn't pay off until the file is large enough for local hashing + chunk-query traffic to dominate. Add a per-mount fallback threshold: files smaller than `sparse_min_size_bytes` (default 256 MiB, override via `HF_MOUNT_SPARSE_MIN_BYTES`) take the standard non-sparse path. Wired through `VfsConfig` so the CLI/setup layer owns the env-var read and the mock tests can set 0 to keep their tiny-file sparse semantics. No behavior change for files >= threshold; users who want sparse on all files unconditionally can set HF_MOUNT_SPARSE_MIN_BYTES=0.
When a sparse-write inode's `coverage` spans the full file (typical after a long-lived handle that read most of the file plus some dirty writes), the staging file already contains the committed content byte-for-byte. range_upload's per-window CAS re-fetches (`stream_cas_range` for old prefix/suffix bytes around each dirty range) then download data we already have locally — for a 12 MB edit in a 1.2 GB file with a 64 MB segment window, that's ~50 MB of waste per commit. Bypass range_upload for those items and route them through Pass B's `upload_files`. Same wire bytes — xet-core's CDC dedup re-uses existing xorbs for unchanged chunks — but no redundant CAS downloads to compose the new file. Gated on `coverage spans [0, file_size) && !dirty_ranges.is_empty()`: the second clause preserves the "no-op flush keeps the original hash" semantics. A flush with no dirty bytes but full coverage still hits range_upload's `dirty_inputs.is_empty()` short-circuit and returns the original hash unchanged. Forcing upload_files there would re-CDC into a new (content-equivalent but xet-different) hash and lose the cache. Apply path is unchanged: items with `sparse_snapshot.is_some()` still go through `apply_commit_sparse` (preserves the coverage map for the next session), regardless of which upload path produced the new hash.
…vings The warm-cache benches all show sparse and full + xet CDC dedup roughly tied at 1 GB+ scale on EC2: the staging file holds the seed content, so sparse's range_upload composes against a coverage-full state (which the fast path now routes through upload_files anyway) and the two paths become functionally equivalent. Where sparse actually wins is the cold-cache case: a fresh node opens a remote checkpoint, makes one edit, commits. The full-overwrite path has to download the entire checkpoint into local staging before it can open it for write; the sparse path punches a hole and composes against the existing CAS reconstruction, never materializing the unchanged bulk locally. Bench scenario: 1. Mount #1: seed the file, wait for the Hub commit, unmount + wipe cache. 2. Mount #2 (cold): same bucket, fresh cache_dir → no local state. 3. Write a single delta-sized blob, sync, verify; report the wall-time. Expected at 1.2 GB / 1% delta: sparse cold ~1-2 s, full cold ~12 s + tail (the 1.2 GB download dominates the full path).
Sparse items with full coverage take the upload_files fast path (uploaded as-is, no range_upload composition). The commit-apply dispatched on sparse_snapshot.is_some(), so these items went through apply_commit_sparse which leaves staging_is_current=false. Result: the next open re-downloaded the full file even though staging already matched the committed hash byte-for-byte. Track the fast-path decision per item and dispatch through apply_commit (drop sparse_write, flag staging_is_current) instead. Also flip the fast-path index store from Vec<usize> with linear .contains lookups to a Vec<bool> aligned with to_flush (O(1) per item).
…r install with open threshold Previously open_advanced_write returned EIO on a drift (xet_hash rotation between the open() snapshot and the install lock). The retry loop in open() caught any EIO, meaning a real disk error also paid the 4-attempt backoff (~75 ms) before surfacing. Switch the drift signal to EAGAIN (a dedicated internal sentinel; userspace never sees it because the loop upgrades a final EAGAIN to EIO). Real EIO now surfaces on the first attempt. The setattr sparse-install branch was also missing the sparse_min_size_bytes threshold check, so a small file getting its first edit via setattr would take the sparse path while the same file edited via open(O_TRUNC) would not. Mirror open_advanced_write's want_sparse predicate so both entry points agree on which files engage sparse. Also document why setattr's mid-mutation 'needs_rebuild requires Some(hash)' keeps the existing expect() rather than returning EIO: entry.size and set_dirty have already been applied above, so a mid-mutation EIO would leave a dirty inode with stale/empty sparse_write and the next flush would commit corrupt content. Failing loud is safer than failing silent.
…ck omission - snapshot_dirty_bytes: translate UnexpectedEof into a greppable invariant-violation message naming the offending range. Default Display is 'failed to fill whole buffer' which leaks zero context. - setattr sparse-install: document why no io_lock is needed (no prior staging file under !local_exists, so no Arc<File> from a previous open can race against File::create's O_TRUNC window; staging.lock(ino) already excludes concurrent open_advanced_write installs).
…nches exercise sparse The 256 MiB production default for HF_MOUNT_SPARSE_MIN_BYTES (added when small-file fallback landed) silently downgraded any test or bench whose files sat below that size, even when --sparse-writes was passed on the command line: - tests/fsx_paranoid.rs: MAX_SIZE = 1 MiB. The test's stated purpose is to catch range_upload composition regressions, but range_upload is only exercised on the sparse path. Without the override, this CI test ran the legacy download-then-upload path and would not have flagged sparse composition bugs. - tests/sparse_delta_bench.rs: default file size 64 MiB. The SPARSE_BLOB / SPARSE_SCATTERED / SPARSE_COLD benchmarks compared full-upload against full-upload rather than sparse against full unless the operator bumped BENCH_FILE_SIZE_MB above 256. Both call sites now set HF_MOUNT_SPARSE_MIN_BYTES=0 before spawning the mount child. Each tests/*.rs file compiles as its own test binary so the env mutation is contained, and the variable only matters for processes that also pass --sparse-writes — FULL baselines are unaffected. Found by codex review.
…f unsafe set_var cargo test runs tokio tests inside a binary concurrently, so calling std::env::set_var from a test setup races with sibling tests reading the parent env or spawning their own children. The race is benign for fsx (single test in the binary) but real for sparse_delta_bench, which has five parallel benchmark tests. Add mount_bucket_with_env that takes an &[(&str, &str)] of env pairs and forwards them via Command::env on the child only. The mutation never touches the parent's process env so the unsafe set_var precondition becomes moot. Also document the post-fast-path-fix consequence on the warm SPARSE_BLOB benchmark: the seed leaves staging fully covered, so subsequent delta opens install full coverage and flush picks the fast-path upload_files route — meaning warm SPARSE_BLOB now measures upload_files, not range_upload. The cold benchmark is the one that actually exercises range_upload composition. Found by codex follow-up review.
The existing cross-file stress exercises FlushManager batching and inter- inode locks but pins each worker to its own file, so the per-inode io_lock, sparse_write state machine, and dirty_generation guard never see real concurrent traffic. This new variant points N workers at a SINGLE inode — exactly the FUSE configuration where the per-inode synchronization is on the critical path. Workload mix per worker: 40% writes (overwrite / at-EOF / past-EOF-with- gap / random), 35% reads (heavy mix to thrash io_lock + sparse-fill), 15% setattr grow/shrink, 8% fsync, 5% reopen. RogueRotation is dropped (needs global quiesce — incompatible with concurrent siblings). Validation is fuzz-style rather than equivalence: overlapping concurrent writes make byte content indeterminate, so we assert (1) no panic / no poison / no surprise errno, (2) post-quiesce sparse_write invariants (sorted non-overlapping coverage and dirty_ranges, dirty ⊆ coverage, nothing past entry.size), (3) final read returns entry.size bytes. Defaults are 8 workers × 500 ops (~330 ms). Heavier configs via SAME_INODE_WORKERS / SAME_INODE_OPS / SAME_INODE_SEED — passes locally through 8/500, 16/2000, 24/5000 ×2 seeds, and 32/10000.
Companion to the chaos same-inode stress. That variant maximises race coverage but can only assert structural invariants because overlapping concurrent writes make byte content scheduler-dependent. This variant trades some race surface for a STRONG oracle: - File is split into N disjoint per-worker slots - Each worker writes ONLY within its own slot, maintaining a private shadow that is ground truth for that range - Reads can target ANY slot — exercises concurrent pread vs. another worker's pwrite + sparse-coverage snapshot pairing on the io_lock - After quiesce, the committed file must equal the concatenation of every worker's final shadow, byte-for-byte A silent corruption (torn read, lost write, fill_sparse_holes clobbering user bytes, dirty_generation guard missing a write) fails the final assert with the offending offset and owning worker. setattr is dropped to keep the per-slot ownership invariant clean (setattr is already exercised heavily by the chaos variant and the cross-file stress). Defaults: 8 workers × 1000 ops, 4 KiB slots (~500 ms). Heavier configs via VERIFIED_WORKERS / VERIFIED_OPS / VERIFIED_SLOT_SIZE / VERIFIED_SEED. Locally passes 8/1000, 16/5000, 32/20000 ×2 seeds, 64/30000, 64/50000.
The mock-driven stress runs hundreds of thousands of ops per second, fast enough that it's reasonable to wonder whether the workload actually fires all op types and exercises the sparse CAS path. Add per-op-type counters (writes/reads/fsyncs/reopens) plus MockXet activity (CAS streams + CAS objects created) and print them at the end of every run. Sanity asserts catch the failure modes a stress test silently degenerates into: - assert(writes > 0): rng never picked Write - assert(reads > 0): rng never picked Read - assert(fsyncs > 0): rng never picked Fsync - assert(reopens > 0): rng never picked Reopen - assert(cas_streams > 0): sparse-fill path never fired (would mean the staging cache covered every read from the start, indicating the sparse install didn't engage) If a future refactor accidentally short-circuits one of these op kinds the assertion fires immediately instead of the test passing with reduced coverage.
… mount New tests/sparse_concurrent_real.rs: - N tokio workers writing concurrently on a single FUSE-mounted inode - Each worker owns a disjoint slot so the byte content is deterministic and we can byte-compare after a fresh remount that bypasses the local staging cache - Exercises the full real-CAS pipeline: FUSE kernel routing + hf-mount-fuse async handler + range_upload composition + Hub commit + cold-cache read-back from CAS The remount-and-compare design specifically targets corruption modes that the in-process MockXet stress can't surface — anything that would only manifest with real network round-trips, real xet-core CDC, or real Hub commit semantics now fails the byte-level assert with offset + owner worker id. Defaults: 6 workers × 12 ops × 1 MiB slots ≈ 30 s end-to-end. Tunable via CONCURRENT_REAL_WORKERS / CONCURRENT_REAL_OPS_PER_WORKER. Wired into the fsx CI job alongside fsx_paranoid. Also fix a silent-pass bug shared by fsx_paranoid: mount_bucket_with_env only WARNS on a failed wait_for_mount and returns the Child regardless. If the mount fails (stale FUSE state, broken macFUSE on a dev machine, missing kernel module), every std::fs operation then hits the bare /tmp dir, the round-trip read returns the same local bytes, and the test silently passes despite no actual CAS activity. Both tests now call common::is_mounted after the mount step and panic if the mount isn't live.
Three tests and one NFS errno mapping ported from feat/append-write (the predecessor branch). These pin invariants that the rewrite preserves but that weren't previously exercised end-to-end: - sparse_post_flush_read_returns_cas_bytes_not_zeros: after apply_commit_sparse the inode keeps its sparse_write (rekeyed to the new hash) with cleared dirty_ranges. A read of an unmodified region must fill holes from the just-committed CAS file, not return zeros from the staging holes. Pre-fix on PR #41 this returned '00XX000000' instead of '01XX456789'. - sparse_fill_short_cas_stream_returns_eio_not_panic: when a CAS download yields zero bytes for a non-empty hole range, fill_sparse_holes must surface EIO rather than panicking on an OOB copy_from_slice. The code already has the data.len() < span check; this test pins it. - sparse_noop_flush_advances_mtime: apply_noop_commit calls touch_commit_clocks so mtime advances even when the upload returns the unchanged hash. Required for POSIX-y fsync semantics: a successful fsync after an open-for-write should reflect modification time, not freeze it because the bytes happened to dedup. - nfs: errno_to_nfs maps libc::EAGAIN -> NFS3ERR_JUKEBOX defensively. EAGAIN is now the sparse-write drift sentinel (returned from open_advanced_write when poll rotates xet_hash between snapshot and install). The retry loop in open() normally upgrades it to EIO so userspace never sees it, but mapping it directly avoids degrading a future leak to NFS3ERR_IO on NFS exports. Tests not backported (already covered or not applicable to the rewrite): - 25 helper-level unit tests (sparse_merge_range_*, sparse_clip_*, sparse_two_*, sparse_engulf_*, etc.) duplicate what src/virtual_fs/inode.rs::tests already covers structurally. - b1/b5/b6/b8/c1/c2/c3/c4 numbered tests target specific bugs of the old approach (setattr loses bytes, stale sparse_write rollbacks). The rewrite addresses them structurally; equivalent regression coverage exists via sparse_setattr_after_hash_drift_uses_current_revision, sparse_otrunc_then_write_then_reopen_keeps_user_data, and the same-inode concurrent stress / verified-integrity tests. - D1/E1/E3 source-shape tests grep mod.rs for specific code patterns (e.g. 'self.staging.io_lock(' in read() before the pread). Equivalent invariants in the new code shape (which uses io_lock + dirty_generation guard slightly differently) would need full rewrite to be useful, and the behavioural integrity stress already catches the bugs these were guarding.
Two new env vars on the multi-worker real-CAS stress so it can scale beyond the original 1 MiB-per-slot defaults: - CONCURRENT_REAL_SLOT_KB (default 1024): override the per-worker slot size. With 16 workers × 65536 KiB = 1 GiB file, the test exercises range_upload composition on realistic checkpoint-sized files. - CONCURRENT_REAL_FSYNC_EVERY_N (default 0 = off): force sync_all every N ops per worker. Simulates the training-loop pattern of periodic checkpoint saves and exercises multiple apply_commit_sparse cycles within a single test run instead of just the final flush. Validated on EC2 (us-east-1, m6i.2xlarge) over a 12-min looped stress: 20 iterations × 16 workers × 10000 ops × 64 MiB slots × fsync_every_20 = 3.2M writes + 160k fsyncs across 20 full seed-to-CAS-to-remount-and- verify cycles, all byte-compare PASS.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Clean-slate rewrite of the sparse-write feature that previously lived on
feat/append-write(which accumulated 25+ fix commits and was gated off-by-default because the state machine was too fragile to maintain).The architectural change: track sparse-write staging via an explicit
coveragerange vector instead of astaging_holds_full_originalflag plus implicit invariants scattered across 6+ call sites. Each byte position is in exactly one of three states recorded structurally:coverageand indirty_ranges: user wrote here, range_upload composes from stagingcoverage, not indirty_ranges: cached from CAS, matchesoriginal_hashcoverage: sparse hole, read fills lazily from CASdirty_ranges ⊆ coverageis structural, not asserted.track_writetouches both;track_read_filltouches coverage only.Why this is simpler than the previous attempt
feat/append-write)SparseWriteState(original_hash,original_size,effective_original_size,dirty_ranges,staging_holds_full_original)original_hash,original_size,coverage,dirty_ranges)apply_commit(was_sparse_upload: bool)flag-driven divergenceapply_commit(non-sparse),apply_commit_sparse(sparse)staging_holds_full_originalflag promoted/demoted ad-hoc in 5 placescoverage.covers([0, size))fill_sparse_holespersists into staging, subsequent reads hit cacherange_uploadPreadReader holds io_lock per-chunk (Codex finding: chimeric reads)range_uploadtakes pre-snapshottedBytesin memory, no lock held during uploadsparse_writes=true+advanced_writes=false→ effective false)TestOpts.sparse_writesforcesadvanced_writes--sparse-writessilently downgrades if--advanced-writesmissingCommits
13 commits, each independently green (
cargo test --lib --features fusepasses after every one):Test coverage
inode.rs::tests): 16 tests for the helper functions (merge_range,clip_ranges,subtract_ranges) and the state-machine methods (track_write,track_read_fill,clip_to_size,holes_in,is_fully_covered).virtual_fs::tests): 9 invariant tests covering the observable behavior — open skips download, read fetches from CAS, second read hits staging cache, write overlays correctly, flush composes the right CAS file, setattr-shrink works, no-op flush preserves hash.--sparse-writes, exercises CAS round-trip after every random mutation.Total: 356 tests pass, up from 329 on
main. No tests removed.Status
--sparse-writesis off by default and marked experimental. The flag implies--advanced-writes(enforced at the setup layer so users can't accidentally get the legacy write path).Closes the design problem behind
feat/append-write. Once this lands, that branch can be closed.