Skip to content

feat(vfs): sparse writes with coverage-map cache (clean rewrite)#180

Open
XciD wants to merge 65 commits into
mainfrom
refactor/sparse-write-fresh
Open

feat(vfs): sparse writes with coverage-map cache (clean rewrite)#180
XciD wants to merge 65 commits into
mainfrom
refactor/sparse-write-fresh

Conversation

@XciD

@XciD XciD commented May 28, 2026

Copy link
Copy Markdown
Member

Summary

Clean-slate rewrite of the sparse-write feature that previously lived on feat/append-write (which accumulated 25+ fix commits and was gated off-by-default because the state machine was too fragile to maintain).

The architectural change: track sparse-write staging via an explicit coverage range vector instead of a staging_holds_full_original flag plus implicit invariants scattered across 6+ call sites. Each byte position is in exactly one of three states recorded structurally:

  • In coverage and in dirty_ranges: user wrote here, range_upload composes from staging
  • In coverage, not in dirty_ranges: cached from CAS, matches original_hash
  • Outside coverage: sparse hole, read fills lazily from CAS

dirty_ranges ⊆ coverage is structural, not asserted. track_write touches both; track_read_fill touches coverage only.

Why this is simpler than the previous attempt

Previous (feat/append-write) This branch
5 fields on SparseWriteState (original_hash, original_size, effective_original_size, dirty_ranges, staging_holds_full_original) 4 fields (original_hash, original_size, coverage, dirty_ranges)
3 install sites (open, write lazy-install, setattr) with subtly different gates 2 install sites (open, setattr) with the same predicate
apply_commit(was_sparse_upload: bool) flag-driven divergence Two methods: apply_commit (non-sparse), apply_commit_sparse (sparse)
staging_holds_full_original flag promoted/demoted ad-hoc in 5 places Removed; derived as coverage.covers([0, size))
Drift retry returns EAGAIN; FUSE maps to userspace EAGAIN No drift retry needed; snapshot/install happens under same write lock
Reads re-download from CAS on every hole hit fill_sparse_holes persists into staging, subsequent reads hit cache
range_upload PreadReader holds io_lock per-chunk (Codex finding: chimeric reads) range_upload takes pre-snapshotted Bytes in memory, no lock held during upload
Test infrastructure had silent flag downgrade (sparse_writes=true + advanced_writes=false → effective false) TestOpts.sparse_writes forces advanced_writes
CLI --sparse-writes silently downgrades if --advanced-writes missing CLI implication enforced at setup layer

Commits

13 commits, each independently green (cargo test --lib --features fuse passes after every one):

2fca1d3 chore(deps): pin xet-core to 40f95307
23d3b81 feat(xet): add range_upload over pre-snapshotted dirty ranges
6964e1d feat(inode): add SparseWriteState with coverage map
118bf4f feat(vfs): add sparse_writes config + per-inode I/O lock
5cf732b feat(vfs): install sparse_write in open_advanced_write
3b80ddc feat(vfs): fill sparse holes from CAS and cache into staging
2271825 feat(vfs): track writes in sparse_write under io_lock
daceb10 feat(vfs): wire sparse_write through setattr
36f13d2 feat(flush): sparse upload via range_upload + apply_commit_sparse
6de58f8 feat(setup): add --sparse-writes flag (implies --advanced-writes)
b242229 docs(inode): note that sparse_write outlives a poll-rotation
90649da test(vfs): invariant tests for sparse-write path
40a1459 test+ci: wire fsx_paranoid CAS round-trip into the fsx job

Test coverage

  • Unit tests (inode.rs::tests): 16 tests for the helper functions (merge_range, clip_ranges, subtract_ranges) and the state-machine methods (track_write, track_read_fill, clip_to_size, holes_in, is_fully_covered).
  • Integration tests (virtual_fs::tests): 9 invariant tests covering the observable behavior — open skips download, read fetches from CAS, second read hits staging cache, write overlays correctly, flush composes the right CAS file, setattr-shrink works, no-op flush preserves hash.
  • fsx_paranoid: wired into CI with --sparse-writes, exercises CAS round-trip after every random mutation.

Total: 356 tests pass, up from 329 on main. No tests removed.

Status

--sparse-writes is off by default and marked experimental. The flag implies --advanced-writes (enforced at the setup layer so users can't accidentally get the legacy write path).

Closes the design problem behind feat/append-write. Once this lands, that branch can be closed.

XciD added 13 commits May 28, 2026 15:56
Switch from tracking the main branch to pinning a specific rev. The
pinned commit includes the range_upload entry point that the sparse
write feature relies on. Pinning makes builds reproducible and protects
against accidental xet-core regressions slipping in via main.
Sparse-write flushes need to compose a new CAS file from an existing
reconstruction plus a small set of modified byte ranges, without re-
uploading the unchanged bytes. xet-core's upload_ranges does the heavy
lifting; this commit wires it through XetOps with a clean API.

The caller passes RangeSnapshot values containing the offset and a
Bytes buffer of pre-read data. range_upload itself never touches the
staging file and never holds an I/O lock, so concurrent writes during
a flush cannot interleave with our reads and produce torn content.
That responsibility moves to the caller, which reads from staging
under the per-inode I/O lock before invoking range_upload.

A BytesReader wraps each snapshot as an AsyncRead so xet-core's
DirtyInput interface stays satisfied without round-tripping through
disk again. Truncate-past-end is handled by appending a synthetic empty
DirtyInput over the cut tail so the original bytes past the new size
are dropped from the composition.

MockXet implements the trait so tests can exercise the sparse upload
path without a live CAS connection.
Track sparse-write staging files via two range vectors instead of the
flag-soup that the previous design accumulated:

- `coverage`: ranges of staging holding real bytes (either cached from
  CAS or written by the user). Bytes outside coverage are sparse holes
  that read paths must fill from CAS.
- `dirty_ranges`: subset of coverage that the user modified since the
  last commit. Fed to range_upload at flush time to compose the new
  CAS file without re-uploading unchanged bytes.

Invariant `dirty_ranges ⊆ coverage` is structural: track_write touches
both, track_read_fill touches coverage only. No `staging_holds_full`
boolean to keep in sync — staging completeness is `is_fully_covered(size)`,
derived from coverage.

apply_commit_sparse re-keys the state to the new hash with dirty cleared
and coverage preserved: the new CAS file was just composed from the
covered staging bytes, so staging still matches the new hash at those
offsets. Reads outside coverage will fetch fresh from the new hash
on demand, lazily growing coverage again.

apply_noop_commit clears dirty + pending_deletes when the generation
matches, but leaves coverage and sparse_write intact — the on-disk
staging file is unchanged, so the existing cache is still valid for
the unchanged hash.

The InodeEntry gains a `sparse_write: Option<Arc<SparseWriteState>>`
field, set to None on both existing constructors. No call site uses
this yet — wiring follows in subsequent commits.

Unit tests cover the helper functions (merge_range, clip_ranges,
subtract_ranges) and the state-machine methods (track_write,
track_read_fill, clip_to_size, holes_in, is_fully_covered).
Plumb sparse_writes through VfsConfig and VirtualFs, and add the
per-inode sync I/O lock that the sparse read/write paths will use to
serialize the (pread + coverage snapshot) and (pwrite + track_write)
critical sections against each other and against range_upload's reads.

The derivation `sparse_writes = config.sparse_writes && (advanced_writes
|| overlay)` ensures the feature can never be enabled without the
staging substrate it requires; the CLI layer will assert the
implication explicitly so users get a hard error rather than a silent
downgrade.

TestOpts gets a `sparse_writes` field; when set, the helper forces
advanced_writes on so the underlying VfsConfig derivation does not drop
the flag. No call site uses the new field/lock yet — wiring follows.
When `--sparse-writes` is enabled and the open targets a CAS-backed
file (non-truncate, non-overlay, non-empty xet_hash), punch a sparse
hole of `size` bytes instead of downloading the original content. Reads
in [0, size) outside the dirty regions will fetch from CAS lazily and
persist into staging via fill_sparse_holes — that part lands in the
next commit.

A SparseWriteState is installed under the inode-table write lock that
guards `set_dirty`, so poll can never observe a (dirty, sparse_write =
None) intermediate. The state is keyed to the snapshot xet_hash; if
poll rotated `entry.xet_hash` between `open()` and here, the snapshot
check fails and we skip the install — the inode is left clean (no
sparse_write, no dirty) and the next write will re-prepare against the
current hash.

`staging_holds_original` distinguishes three open paths: reused staging
that mirrors the current hash, fresh download into staging, and a fresh
sparse hole. The first two use `SparseWriteState::new_full` (coverage
covers everything; fill_sparse_holes is a no-op). The third uses
`SparseWriteState::new` (empty coverage; reads fill on demand).

For the simple `--advanced-writes` path (no sparse_writes), nothing
changes: download-or-create-empty as before, no sparse_write installed.
Add fill_sparse_holes to the read path. After pread fills buf with
staging bytes (zeros for sparse holes, real bytes for covered ranges),
walk the holes recorded in the sparse_write snapshot:

  1. Download the hole's bytes from CAS.
  2. Overlay them into buf — the user-visible read result.
  3. pwrite them into staging at the hole's offset and update coverage,
     so subsequent reads of the same range hit staging instead of
     re-downloading.

The pread + sparse_write snapshot run under the per-inode io_lock, so
buf and the snapshot describe a consistent point-in-time view of
staging. The CAS download runs without any lock held (a long blocking
fetch must not block writers). Before pwriting cached bytes back into
staging, re-acquire io_lock and re-check the live coverage: a
concurrent writer may have filled the hole in the meantime, and we
must never overwrite their dirty bytes with the original CAS content.

User-visible reads always reflect the snapshot at pread time. Even if
a writer fills the hole between snapshot and overlay, the read returns
the snapshot's bytes — POSIX permits any interleaving for concurrent
read/write, and the staging cache is updated only where it doesn't
collide with the writer.

Bytes past `original_size` are skipped entirely: they don't exist in
CAS, so pread already filled buf correctly (either zeros for a sparse
hole extension or the user's own written bytes).
Wrap the write path's pwrite + inode metadata update with the per-inode
io_lock, and call track_write on the inode's sparse_write to extend
both coverage and dirty_ranges by the just-written window.

The io_lock makes the (pwrite + sparse_write update) pair atomic against
the read path's (pread + sparse_write snapshot) pair. A concurrent
reader sees either the pre-write state (zeros from a hole, plus no
dirty_range at this offset) or the post-write state (real bytes in
staging, plus matching dirty_range), never an inconsistent middle.

When sparse_write is None (advanced mode without --sparse-writes, or
clean inode), the lock is still taken but track_write is skipped — the
overhead is one stdlib mutex per write, which the existing flush-batch
GC and setattr-truncate paths already pay.
Three sparse-related cases in setattr(size):

1. Inode already has sparse_write (open installed it): the grow path
   calls track_write to mark the new tail as dirty (its zeros will be
   uploaded via range_upload), the shrink path calls clip_to_size to
   trim coverage and dirty_ranges past new_size.

2. Setattr arrives on a clean CAS-backed file with no prior open and
   no staging file. In sparse_writes mode, install sparse_write fresh
   keyed to the original hash, punch a sparse hole sized to new_size
   (no download), then track the grow/shrink. This is the only setattr
   install point — open is the other; both produce the same end state.

3. Inode has no sparse_write and sparse_writes is off, or has_remote_xet
   is false: existing behavior (download then set_len). Falls through
   unchanged.

The set_len call is skipped on the sparse-install path because the
freshly-created staging file is already exactly new_size bytes.
Split the flush_batch upload loop into two passes:

- Pass A (per-item): for each FlushItem with a sparse_snapshot, read
  the dirty bytes from staging under the per-inode io_lock, package
  them as RangeSnapshot values, and call range_upload to compose the
  new CAS file from the existing reconstruction. The io_lock makes the
  read atomic against concurrent pwrites, so xet-core hashes a
  self-consistent snapshot — not a chimeric mix of pre- and post-write
  bytes.

- Pass B (batched): non-sparse items go through the existing
  upload_files chunked path unchanged.

FlushItem gains `file_size` and `sparse_snapshot` fields captured under
the inode read lock at snapshot time. apply_commit_sparse is called on
the inode after a successful sparse upload: it re-keys sparse_write to
the new hash, clears dirty_ranges, and preserves coverage so subsequent
reads through still-open handles can keep hitting the staging cache.

The unchanged-files path (same hash, no pending_deletes) now uses
apply_noop_commit instead of apply_commit, leaving coverage intact when
nothing actually changed remotely.

The snapshot_dirty_bytes helper handles the per-range read with a
graceful short-read fallback for the setattr-shrink race window.
Expose the sparse-write feature through the CLI. The flag is off by
default and marked experimental. The implication --sparse-writes ->
--advanced-writes is enforced at the setup layer: setting just
--sparse-writes turns advanced_writes on too. Without this enforcement,
VfsConfig's derivation would silently downgrade sparse_writes to false
when advanced_writes is missing, and the user would see the legacy
write path behind their back.

A startup warn! announces that the feature is experimental so users
running it know to report issues.
update_remote_file already gated on !is_dirty (matching main); a
pool-cached writable fh by itself does not block poll because the
inode is clean between writes. The Codex finding about idle pool fhs
freezing poll applied to the feat/append-write code, not this branch.

Document that sparse_write deliberately stays keyed to its original_hash
across a poll-driven xet_hash rotation: readers and writers keep
operating in their snapshot's frame of reference, just like the
non-sparse path's flush silently absorbs poll-detected changes. The
sparse path doesn't make this worse.
Nine new integration tests pin the user-visible behaviors of
--sparse-writes:

- sparse_open_skips_download: open does not call download_to_file
- sparse_read_returns_cas_bytes: read fetches the original from CAS
- sparse_read_caches_into_staging: subsequent reads hit the staging
  cache, not CAS
- sparse_read_uncached_region_hits_cas: cache fills lazily; reads
  outside the cached window still go to CAS
- sparse_write_overlays_on_original: writes show through subsequent
  reads as overlays on the CAS-backed bytes
- sparse_flush_commits_composed_file: range_upload produces the
  expected composition (original + dirty overlay) at flush time
- sparse_write_past_eof_extends: writes past EOF extend the file
  with the gap zero-filled
- sparse_setattr_shrink_truncates_at_flush: setattr-shrink lands the
  truncated content in CAS
- sparse_noop_flush_preserves_hash_and_cache: open-read-release with
  no writes keeps the original hash and doesn't dirty anything

These intentionally exercise observable behavior (read return values,
final committed hash, CAS stream call counts) rather than implementation
details (the SparseWriteState fields, lifecycle of internal flags).
Unit-level coverage of the state-machine helpers lives in inode.rs.
fsx_paranoid is a paranoid fsx variant: after every mutation it closes
the file, waits for the async flush to commit to CAS, then re-opens
and reads back. This catches range_upload composition bugs that the
canonical fsx test misses (which reads from local staging, not from
the CAS round-trip).

The mount uses `--sparse-writes --advanced-writes` so the test
actually exercises the sparse path — without --sparse-writes,
range_upload is never invoked and the test cannot fulfill its purpose.

Wired into the same CI job as fsx so any range_upload regression
breaks CI in the same place.

Also fixes a clippy warning in flush.rs (useless .into_iter()).
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    239.7 MB/s     233.4 MB/s
  Sequential re-read                2251.9 MB/s    2422.7 MB/s
  Range read (1MB@25MB)                0.5 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1444.5 MB/s
  Close latency (CAS+Hub)            0.137 s
  Write end-to-end                   291.3 MB/s
  Dedup write                       1729.7 MB/s
  Dedup close latency                0.089 s
  Dedup end-to-end                   424.5 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    823.7 MB/s     856.5 MB/s
  Sequential re-read                2221.2 MB/s    2440.4 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1693.4 MB/s
  Close latency (CAS+Hub)            0.111 s
  Write end-to-end                   872.5 MB/s
  Dedup write                       1654.9 MB/s
  Dedup close latency                0.119 s
  Dedup end-to-end                   835.4 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1303.2 MB/s    1219.7 MB/s
  Sequential re-read                2231.9 MB/s    2408.5 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1521.7 MB/s
  Close latency (CAS+Hub)            0.125 s
  Write end-to-end                  1101.4 MB/s
  Dedup write                       1453.3 MB/s
  Dedup close latency                0.141 s
  Dedup end-to-end                  1031.4 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  526.3      456.6                      
  seq-reread-100M               2500.0       29.5                      
  rand-read-4k-100M                0.1        0.1         14         15
  seq-read-5x10M                 806.5      806.5                      
  rand-read-10x1M                  0.1        0.1         34         35
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           69483.8 us   64931.0 us
  rand-read-10x1M             29434.6 us   28726.5 us
============================================================

@github-actions

Copy link
Copy Markdown
Contributor

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

XciD added 15 commits May 28, 2026 17:15
P1 — preserve dirty sparse_write on reopen (mod.rs):
When can_reuse_staging is true and an existing sparse_write is already
keyed to the same hash, keep it. Replacing it with a fresh empty state
dropped the in-flight dirty_ranges and coverage from the first open,
which made the next flush a no-op while the user's writes were still
in staging.

P1 — track zero gap in past-EOF writes (mod.rs):
When pwrite starts past entry.size, the gap [prev_size, offset) is
zero-filled by the file extension. Mark the whole [prev_size, new_end)
range dirty (not just [offset, new_end)) so range_upload composes the
zeros into the new CAS file. Without this the real xet-core would
commit a truncated file; MockXet zero-pads to new_file_size and the
existing test masked the bug.

P2 — bail with EIO on xet_hash drift during open (mod.rs):
The drift check moves before set_dirty. If poll rotated entry.xet_hash
between the open() snapshot and here, the staging we just punched as
a hole no longer matches reality. Previously we'd skip the
sparse_write install but still set_dirty + return a writable handle,
so the next flush would see (dirty, sparse_write = None) and take the
regular full-staging upload path — committing zero-filled holes as
real content.

Tests:
- sparse_write_past_eof_extends adds a whitebox dirty_ranges assertion
  so the mock can no longer mask a missing-zero-gap regression.
- sparse_reopen_dirty_preserves_state exercises the reopen-before-flush
  path and pins the dirty_ranges invariant.
Address two test-quality gaps Codex flagged on the prior fix:

- sparse_reopen_dirty_preserves_state was non-deterministic: it
  released fh1 then reopened, with an early-return guard if the
  background flush had already cleared the dirty flag. Hold fh1 open
  across the reopen instead — a held writable handle keeps the inode
  dirty regardless of debounce timing, so the reopen path is always
  exercised. Also assert the user's bytes read back through the
  reopened handle.

- Add sparse_open_hash_drift_returns_eio: drives open_advanced_write
  directly with a stale snapshot hash to exercise the drift branch
  deterministically (no need to race the poll loop). Asserts EIO and
  that the inode is left clean with no sparse_write installed.
…ion, GC budget)

Addresses the findings from the max code review of the sparse-write path.

Correctness:
- Fix silent data loss on O_TRUNC then write then reopen-without-truncate.
  O_TRUNC clears sparse_write but keeps xet_hash; a write then lands while
  sparse_write is None (untracked); the reopen used to install an
  empty-coverage SparseWriteState over the dirty staging, so reads overlaid the
  original CAS bytes and the flush no-oped back to the original hash, losing the
  write. The reopen now only installs empty coverage for a freshly punched hole,
  full coverage when staging mirrors the original, and otherwise falls back to
  the non-sparse full-staging upload. Adds a regression test.

Performance / contention:
- fill_sparse_holes no longer holds the global inode-table write lock across the
  blocking staging writes, and coalesces the per-hole CAS fetches into a single
  ranged stream (one reconstruction round-trip instead of N for a fragmented
  read).
- read() skips the io_lock + coverage snapshot on non-sparse mounts and for
  whole-file-cache handles (new OpenFile::Local.staging_backed flag), restoring
  the previously lock-free local read path.
- Sparse opens charge actual disk usage (a freshly punched hole is ~0 blocks)
  against the GC budget instead of the full logical size, so a small edit to a
  large CAS file no longer evicts other staging files; materialized bytes are
  accounted as reads fill holes.

Lifecycle:
- A whole-file-cache read handle no longer consults the inode's lingering
  sparse_write after a sparse commit, eliminating the EBADF pwrite-per-hole and
  redundant CAS refetch on read-only reopens.

Cleanup / safety:
- Replace the hand-rolled BytesReader with std::io::Cursor (tokio AsyncRead).
- Extract the original_range composition mapping into the pure, unit-tested
  plan_original_ranges (the mock XetOps bypassed this logic entirely).
- fill_sparse_holes uses the safe FileExt::write_at instead of raw libc::pwrite.
- snapshot_dirty_bytes uses read_exact_at and fails loud on a short read (the
  staging lock makes it unreachable) instead of silently composing a
  wrong-length file.
- Remove the dead is_fully_covered helper.
- Revert the sparse-hole GC accounting change. Charging a freshly punched hole
  at its actual disk usage (~0 blocks) while try_remove / write / setattr still
  credit the logical size left the global staging budget asymmetric: eviction
  over-subtracted bytes_used, saturating it toward 0 and disarming GC. Restore
  consistent logical-size accounting (a sparse hole is charged conservatively at
  its logical size). Making the staging budget truly block-based is a separate,
  whole-model change.
- Mark the HTTP non-Xet download cache and freshly-created file handles as not
  staging-backed, so reads of them keep the lock-free fast path instead of
  taking the per-inode io_lock + coverage snapshot. Those handles never carry a
  sparse_write, so the sparse read path was pure overhead for them on sparse
  mounts (the lock regression the staging_backed flag set out to remove).
- Add multi-snapshot plan_original_ranges tests covering the by-index
  reader-pairing that range_upload relies on (data entries before the
  synthetic truncate tail).
Quality-only cleanups from the /simplify pass; no behavior change.

- Add SparseWriteState::resize_to(prev_size, new_size) and call it from both
  setattr sparse-install branches, replacing the verbatim-duplicated grow/shrink
  (track_write / clip_to_size) idiom.
- Drop the redundant `staging_holds_original` local in open_advanced_write: at
  its only read site can_reuse_staging is already true, so it equals
  staging_is_current; use that directly.
- Extract InodeEntry::touch_commit_clocks() for the mtime/ctime/last_revalidated
  tail triplicated across apply_commit / apply_commit_sparse / apply_noop_commit.
- Skip the per-inode io_lock on the write path when sparse writes are disabled,
  matching the read fast-path: the lock only serializes against sparse-read
  snapshots, so non-sparse mounts keep their lock-free write path.
- Remove the now-stale #[allow(dead_code)] on StagingCoordinator::io_lock (it
  has real callers).
Poll can rotate an inode's xet_hash on a clean inode
(InodeTable::update_remote_file) while leaving a retained sparse_write
keyed to the old hash. open_advanced_write already guards this drift,
but setattr did not: it reused the stale sparse_write, so flush composed
range_upload against the wrong base and silently overwrote the
poll-discovered remote update.

Read the current revision (cur_hash/cur_size) under the write lock and:
- rebuild the sparse_write against the current xet revision when the hash
  drifted to a new xet hash (empty coverage, reads refill from the
  current CAS object);
- on a grow, truncate staging to cur_size before extending so the
  extension is POSIX zero-fill, not stale tail bytes from the old
  (larger) revision;
- when the current revision is no longer xet-backed (xet_hash None),
  keep the existing hash (last-writer-wins, the same accepted race as the
  non-sparse path) rather than dropping the state (which would upload the
  hole-staging zeros).

Adds three regression tests covering hash-to-hash drift, grow after
drift to a smaller revision, and grow after drift to a non-xet revision.
The grow tests were verified to fail without the staging zero-fill.

Also folds in small /simplify cleanups in the same area: drop a stale
allow(dead_code) on the now-wired sparse_writes field and remove a
doubly-negated need_set_len intermediate.
…ranoid

Two findings from the sparse-write review:

- open_local_readonly opened the per-inode sparse staging file read-only,
  but fill_sparse_holes writes fetched CAS bytes back into staging via
  write_at on that same fd. On a read-only fd every cache write failed
  with EBADF, so each read of a hole re-fetched from CAS and logged a
  warning instead of honoring the staging cache. Open staging-backed read
  handles with write permission (the FUSE handle stays read-only). Complete
  cache files keep the read-only fast path.

- fsx_paranoid mounted without --direct-io, so the kernel page cache
  (FOPEN_KEEP_CACHE) could serve the read-back from the just-written pages
  and a range_upload that corrupted the committed CAS object would still
  match the reference. Add --direct-io so every verification read goes
  through the FUSE handler and, on the now-clean inode, reads the committed
  CAS revision.

Adds a regression test that a read-only open of a dirty sparse file caches
hole-fills into staging (second read hits the cache); verified to fail with
the read-only fd.
Closes four findings from the high-effort code review on the sparse-write
PR:

- flush.rs Pass A error fan-out: a sparse range_upload failure used to
  attribute "range_upload failed" to every item in to_flush, including
  non-sparse items that Pass B had not even attempted yet. Now only the
  failing inode is marked errored; siblings commit independently.

- flush.rs head-of-line blocking: Pass A early-returned on the first
  sparse failure, starving independent sparse items behind a sticky
  upstream error. The loop now continues; failed items leave their
  upload_results slot as None and the downstream commit-apply skips
  them so they retry on the next debounce.

- mod.rs setattr sparse-install with cur_hash=None: the
  `cur_hash.or(prev_hash)` fallback built a SparseWriteState keyed to
  prev_hash with original_size=cur_size, mismatching what
  plan_original_ranges expects from the real xet upload_ranges (the
  mock papers it over by using original.len()). When poll rotates the
  inode to non-xet between snapshot and lock we now return EIO so the
  caller retries against the rotated state.

- inode.rs apply_noop_commit: now promotes staging_is_current=true on
  the non-sparse path (upload_files rehashed staging back to the
  previous hash, so staging IS the committed content). Sparse path
  keeps the flag false because holes remain unmaterialized.
Regression coverage for the four fixes in f478a78:

- flush_pass_a_failure_does_not_pollute_non_sparse_sibling: one sparse
  inode + one non-sparse inode in the same batch; the sparse Pass A
  failure must not contaminate flush_errors for the non-sparse sibling,
  which still commits via upload_files.

- flush_pass_a_failure_does_not_block_independent_sparse_items: two
  sparse inodes in one batch with a one-shot upload failure; the
  sibling whose range_upload succeeds must end clean. Before the fix
  the early-return killed Pass A entirely so neither converged.

Smoke tests for sparse edges that weren't covered:

- sparse_write_at_eof_boundary_extends_cleanly: a write landing
  exactly at EOF (no gap, no overlap) must not leave a phantom seam
  between covered and dirty regions.

- sparse_setattr_shrink_inside_dirty_range_keeps_clipped_writes:
  setattr-shrink that cuts through the middle of a dirty range must
  keep the clipped prefix, not drop the range entirely.

- sparse_zero_length_write_is_a_noop: defensive — pwrite(0) must not
  create a zero-length entry in dirty_ranges nor change size.
Three P2 issues from Codex review of f478a78:

- flush.rs Pass B abort fan-out (regression in f478a78): narrowing the
  error message to the failing chunk left sparse Pass A successes and
  earlier successful Pass B chunks silently dirty — fsync could return
  success on uncommitted content. Restore the "mark every aborted item"
  behavior, but use `or_insert_with` so the more specific Pass A
  range_upload / snapshot error is preserved for items that already
  failed there.

- mod.rs setattr non-xet drift validation (regression in f478a78):
  the EIO return for `want_sparse_install && cur_hash.is_none()` ran
  AFTER `entry.size`, `mtime`, `ctime`, and `set_dirty()` had already
  been bumped. The setattr appeared to fail to the caller but left the
  inode mutated locally with no scheduled flush. Hoist the cur_hash
  validation to before any inode mutation; the staging hole punched
  earlier is harmlessly replaced on the next open.

- mod.rs sparse drift to LARGER non-xet (pre-existing): when poll
  rotates a clean sparse inode to a non-xet revision whose live size
  exceeds the CAS base size, a subsequent setattr that lands inside
  the gap [sw.original_size, cur_size) extends the staging file with
  zeros from set_len but doesn't track them. range_upload then
  composes only [0, original_size) from the old base; bytes in
  [original_size, new_size) are silently dropped from the commit.
  Track the gap up to min(cur_size, new_size) before resizing. The
  mock zero-fills past original_size so this is invisible under unit
  tests — real xet upload_ranges produces a malformed commit; the
  new sparse_setattr_after_drift_to_larger_non_xet_tracks_gap test
  asserts the dirty_ranges shape directly instead of relying on
  composition output.
Codex follow-up on 247a56e: the EIO path for `want_sparse_install &&
cur_hash.is_none()` now skips the inode mutation, but it left the
staging file created at File::create(staging_path) + set_len(new_size)
just before the write-lock acquisition. That stale placeholder makes
`local_backing_exists` true for the next setattr call. If the inode
rotates back to xet-backed before then, the next setattr's
`want_sparse_install` predicate fails (local_exists=true) and the
file falls through to the non-sparse path, full-uploading the stale
hole-of-zeros as the new content.

Drop the staging file via `StagingDir::try_remove(ino)` on the EIO
exit. The GC budget was not charged yet (resize_bytes happens further
down, past the early return), so the cleanup is purely on-disk.
Codex P3 follow-up on e90f056: StagingDir::try_remove(ino) subtracts
file_size(ino) from bytes_used via resize_bytes(size, 0). On the EIO
rollback path the staging file was never charged (the only resize_bytes
in this setattr branch sits below the early return), so try_remove
silently undercounts unrelated staging usage — every drift retry would
shift the global budget down by the placeholder's logical size.

Use a raw std::fs::remove_file on this path instead. The cleanup is
best-effort: a NotFound or transient error is benign because the next
open replaces the staging file anyway.
Codex P2 (fresh review): fill_sparse_holes releases io_lock during the
CAS stream await, then re-acquires it to persist still-hole bytes into
staging. A concurrent setattr-shrink (which holds staging.lock +
inode_table.write(), not io_lock) can clip entry.size below the
snapshot's original_size during this window. The post-await still_holes
computation used the original (pre-shrink) range, so write_at past the
new EOF would re-extend the staging file with stale CAS bytes and
track_read_fill would record coverage there.

User-visible consequence: a subsequent setattr-grow back through the
clipped region would expose the stale CAS bytes instead of POSIX zeros,
and the staging file's on-disk length silently exceeds entry.size,
breaking the invariant that staging[..entry.size] is the canonical view.

Read entry.size from the same inode_table.read() guard that resolves
sw, and clamp every hole's end to live_size before computing still
holes. Holes whose start is already past the new EOF are dropped.
Codex P2 follow-up on f54af0e: clamping still_holes inside
fill_sparse_holes only narrows the race — the lock is dropped before
the write_at loop, and a second setattr-shrink can still slip in after
the recheck and re-extend the staging file past the new EOF.

Have the size-change branch of setattr take staging.io_lock(ino) before
acquiring inode_table.write(), holding it across set_len and the
entry.size / sparse_write update. Same lock order as read/write/
fill_sparse_holes (io_lock → inode_table). No awaits are held under
io_lock — set_len and open_local_backing_file are sync syscalls.

Updated the module-level lock-ordering comment to document io_lock's
role and its position relative to staging.lock and inode_table.
Codex follow-up on 8201ac5 (P2 + P3):

- write() used to skip io_lock when `sparse_writes` was false (an
  optimization for the non-sparse advanced-writes mode where readers
  also skip io_lock). With setattr's size branch now taking io_lock
  unconditionally, a non-sparse advanced-writes pwrite races against
  setattr's set_len: the truncate can land between the pwrite and the
  subsequent entry.size update, leaving inode.size larger than the
  on-disk staging length. Take io_lock for every WriteTarget::Local
  pwrite.

- Updated the stale comment in fill_sparse_holes that claimed setattr
  doesn't take io_lock — that became false after 8201ac5.
XciD added 29 commits June 1, 2026 16:02
Covers three of the four user-visible bugs surfaced by the Codex review:

1. flush_pass_b_failure_preserves_pass_a_sparse_success: builds a batch
   with a sparse item (Pass A range_upload succeeds) and a non-sparse
   item (Pass B upload_files fails via the new fail_next_upload_files
   mock hook). Asserts the sparse Pass A success has no flush_errors
   entry while the actual Pass B failure does surface for fsync.

2. apply_commit_sparse_mismatch_does_not_stamp_clocks: snapshots
   mtime/ctime/last_revalidated, then invokes apply_commit_sparse with
   a stale dirty_generation (mismatch path). Asserts all three clock
   fields are unchanged — without the fix last_revalidated would be
   bumped to now and let subsequent lookups skip HEAD revalidation
   while the inode is still dirty with the racing writer's bytes.

3. write_zero_bytes_past_eof_is_a_noop: issues a zero-byte pwrite past
   EOF and asserts entry.size is unchanged, then writes a real byte
   and verifies the inode flushes cleanly — without the fix the prior
   zero-byte write would have advanced entry.size past the on-disk
   staging length, stranding snapshot_dirty_bytes on UnexpectedEof.

Also adds MockXet::fail_next_upload_files so Pass B failure can be
exercised independently of Pass A.

The fourth fix (per-inode lock map leak) doesn't get a dedicated test
because the leak is only observable via long-running churn that
existing GC tests cover via inode-eviction; the new forget_locks call
runs unconditionally on drop_staging, so any path that evicts an inode
exercises it for free.
…tion

The drift handling in setattr carried two code paths whose only purpose
was to react to a clean sparse inode being rotated to a non-Xet revision
by poll between snapshot and lock:

- The `want_sparse_install && cur_hash.is_none()` early-return that
  cleaned up the punched staging file and returned EIO.
- The `_ =>` arm in the sparse-write drift rebuild that kept the state
  keyed to the OLD hash and resized in place (last-writer-wins against
  a None current revision).

That scenario can't happen on hf-mount's writable surface: buckets are
uniformly Xet, and non-Xet repos are read-only so writes never reach
setattr at all. The previous `cur_hash.expect("non-xet drift rejected
above")` was already trusting this invariant after the early return.

Collapse the rebuild path from a 3-arm match to a single if/else on
hash equality, replace the trust-the-early-return expect with one that
documents the invariant directly, and drop the two regression tests
(`sparse_setattr_grow_after_drift_to_non_xet_zero_fills_tail`,
`sparse_setattr_after_drift_to_larger_non_xet_tracks_gap`) that
simulated this rotation by calling update_remote_file(ino, None, ...)
directly. The remaining drift tests still cover Xet → other-Xet
rotation, which IS reachable.
With the Xet → non-Xet rotation paths gone, the gap-track inside the
matching-hash arm of the sparse-write rebuild is also unreachable. The
only way cur_size could exceed sw.original_size on this branch is via
user writes (which already tracked them as dirty under io_lock + the
inode write lock); Xet content-addressing forbids a poll rotation that
keeps the same hash while changing size.

Collapse the branch to a single resize_to call. The dirty bytes between
sw.original_size and cur_size are guaranteed to already be in
dirty_ranges from the originating write().
…efactor

The test was added to validate plan_original_ranges on the snapshot
shape that setattr's drift gap-track branch produced. With that branch
gone (Xet → non-Xet is unreachable, and Xet → same-hash-different-size
is impossible by content-addressing), no caller produces this specific
snapshot shape anymore. The boundary case (start == original_size) is
still trivially covered by write_past_eof_is_an_insert which exercises
the same plan_original_ranges branch.
Three small simplifications surfaced by a /simplify pass:

1. `SparseWriteState::new_resized(hash, original_size, new_size)` helper.
   Setattr's sparse-install and drift-rebuild branches both built a fresh
   SparseWriteState with `new(hash, cur_size); resize_to(cur_size, new_size);
   Arc::new(sw)`. Factored into one constructor.

2. `track_write` now takes the prior EOF and handles the zero-fill gap
   internally instead of asking every caller to compute
   `effective_start = offset.min(prev_eof)`. write() drops its 4-line
   manual gap math, and the invariant "writes past EOF must dirty the
   zero gap" lives on SparseWriteState where it belongs (Codex altitude
   finding A3). track_write also reuses track_read_fill for the coverage
   merge instead of duplicating the merge_range call.

3. Collapse setattr's install/drift/in-place-resize trifurcation into a
   single `needs_rebuild` predicate: rebuild when we just punched a
   sparse-install hole OR the existing sparse_write is keyed to a stale
   hash; otherwise resize in place. Removes the duplicated drift-vs-install
   rebuild blocks and one copy of the `Xet → non-Xet unreachable` expect.

Also tightens `new_full` to a direct struct construction instead of the
spread `..Self::new(...)` pattern (which was allocating two empty Vecs
just to override one of them).

Adds a unit test for `track_write` past EOF to pin the gap-tracking
behavior at the state-machine layer instead of relying on integration
coverage in write()/setattr.
The `let snapshots_result = ...; let snapshots = match snapshots_result`
pair had no reason to be two bindings — fold the match directly into the
let. Saves one line, removes the indirection.
A multi-worker stress test that exercises the sparse-write state machine
under concurrent ops on many files. Each worker owns its own file (no
cross-worker shadow races), but flushes share a global FlushManager so
Pass A/B batching, lock-map churn, drift handling, and concurrent
staging lifecycles all get exercised together.

Workload mix per worker is biased toward the edge cases unit tests
cover individually but not in combinations:
- in-file overwrite
- write spanning EOF
- write at exactly EOF
- write past EOF (the zero-fill gap)
- setattr-grow and setattr-shrink
- reopen (sparse_write preservation + drift retry)
- fsync (force apply_commit_sparse cycle)
- mid-stream random-range read (verified against shadow)

Validation is byte-exact: each worker maintains a shadow Vec<u8> mirror,
and after release+wait_for_clean does a full-file CAS-level read that
must match shadow byte for byte. A divergence fails the test with the
offending offset and a reproducible seed.

Default: 12 files × 200 ops, ~1.5 s wall-time, runs as part of the lib
test suite. Override via STRESS_FILES / STRESS_OPS_PER_FILE / STRESS_SEED
for heavier runs (verified 64 × 2000 = 128k ops in ~12 s, no
divergence across 5 seeds).

Surfaced one known race during development: `open()` can return EIO
when xet_hash drifts between the pre-lock snapshot and the install
under-lock — the code comment claims "the caller retries" but no retry
sits anywhere above (FUSE adapter passes EIO straight through). The
test mirrors the documented contract with a bounded retry loop in
`open_with_retry`. Worth a separate fix to either handle the drift
internally (re-snapshot under the lock) or to wire a retry into the
FUSE adapter.
`open_advanced_write` used to capture `xet_hash` / `size` from the caller's
pre-lock snapshot (taken in `open()` via `get_file_entry`) and surface EIO
when those values disagreed with the inode under `staging.lock(ino)`. The
comment claimed "the caller retries" but nothing above wires that retry —
FUSE and NFS adapters pass EIO straight through to the user.

The new sparse-write stress test reproduces the race deterministically:
under heavy concurrent flushes, `apply_commit_sparse` rotates entry.xet_hash
between `open()`'s snapshot and `open_advanced_write`'s install, returning
EIO for what should be an in-flight reopen on otherwise clean state.

Two changes:

1. `open_advanced_write` now re-reads xet_hash and size from the inode
   table under staging.lock (alongside the existing is_dirty /
   staging_is_current re-read). The caller's parameters become advisory
   (prefixed `_`) and the rest of the function uses the freshly-snapshotted
   values — punching staging at the live size, installing sparse_write
   keyed to the live hash, comparing against the live hash for the
   materializes-remote check.

2. `open()` wraps the advanced-write path in a bounded internal retry
   (4 attempts with exponential backoff 5/10/20 ms) so a flush that
   completes between the inode_table.read() and the inode_table.write()
   inside open_advanced_write — the narrow residual race window — also
   resolves without surfacing EIO. After the retry budget is exhausted
   any remaining EIO propagates so genuine IO errors still reach the
   caller.

Updates the pre-existing drift test to reflect the new contract:
`open_advanced_write` is now robust to a stale snapshot and installs a
sparse_write keyed to the live hash instead of returning EIO.
…test

Two changes paired with the open_advanced_write drift fix:

1. New `StressOp::RogueRotation` injects poll-induced remote rotations
   into the stress workload. When fired, the worker quiesces (fsync +
   release + wait_for_clean), registers a freshly-generated remote
   blob under a new CAS hash, calls `update_remote_file` on the inode,
   updates its shadow to the new content, and reopens.

   This exercises the drift-rebuild path in `open_advanced_write` for
   real (sparse_write keyed to the OLD committed hash, entry.xet_hash
   now pointing at the rotated remote revision) and validates that the
   reopen + subsequent writes still produce a correct final file
   (shadow byte-exact match at end-of-worker).

   ~4% of ops are rotations; 32 files × 1000 ops triggers dozens per
   run without blowing up wall time. Verified across multiple seeds at
   48 files × 1500 ops (≈ 72 k ops) — no divergence.

2. The pre-existing `sparse_open_hash_drift_returns_eio` test
   exercised the OLD drift contract (EIO returned when a stale
   xet_hash snapshot disagreed with the live inode). With the
   production fix re-reading xet_hash/size under staging.lock, that
   contract no longer holds — a stale snapshot is simply ignored.
   Rename + rewrite the test as `sparse_open_stale_snapshot_uses_live_state`
   to assert the new behaviour: open succeeds and installs a
   sparse_write keyed to the live hash.
…l upload

Mirrors the workload from https://huggingface.co/blog/delta-weight-sync:
seed a checkpoint-sized file on a mounted HF bucket, then for N steps
modify ~1% of bytes at scattered offsets (the wire-side equivalent of
patching changed bf16 elements) and time each commit end-to-end.

Two tests:
- bench_sparse_delta_commits: --sparse-writes → range_upload composes only
  dirty bytes against the old CAS base.
- bench_full_overwrite_commits: --advanced-writes only → every step
  re-uploads the whole file (baseline for the "ship the full checkpoint"
  approach the blog post compares against).

Configurable via env:
  BENCH_FILE_SIZE_MB   default 64
  BENCH_STEPS          default 5
  BENCH_DELTA_RATE     default 0.01 (1%)
  BENCH_AVG_RUN_LEN    default 8

Run with HF_TOKEN set:
  cargo test --release --test sparse_delta_bench -- --nocapture --test-threads=1
Temporary instrumentation to diagnose the original_size mismatch reported
by xet-core when the bench runs (109MB vs the 64MB the bench actually
writes). Will be removed once the bug is identified.
The delta-weight-sync style bench surfaced a wire-side bug: after a
non-sparse upload of a 64 MiB seed file, `XetFileInfo::file_size`
returned 68943872 bytes (~65.75 MiB, +1.75 MiB inflated). The CAS
content was still correctly 64 MiB (verified via xet-core's
reconstruction info: `caller said original_size=68943872 but
reconstruction info reports 67108864`), so the bug is purely in xet's
returned metadata, not in the uploaded bytes.

Root cause sits upstream in xet-core's `file_cleaner::finish_inner`:
the returned `file_size` comes from `deduplication_metrics.total_bytes`,
which over-counts when in-flight chunks match earlier bytes via local
dedup. Until that's fixed upstream, trust the size we already know
authoritatively:

- `flush_batch`: use `item.file_size` (captured under `staging.lock`
  alongside the dirty_generation snapshot; equals the on-disk staging
  length at upload time since concurrent O_TRUNC/setattr-shrink also
  take staging.lock) when calling `apply_commit` / `apply_commit_sparse`
  and when logging the uploaded size.

- `streaming_commit`: use `channel.bytes_written` (the atomic the write
  path increments per syscall — see WriteTarget::Streaming branch) as
  the committed size for `apply_commit`.

Both sources are bit-exact with what we actually fed to xet-core, so
they sidestep the over-counted return value entirely.

Without this fix, a Pass B upload would set `entry.size` (and thus
the next `sparse_write.original_size`) to the inflated value, and
every subsequent `range_upload` would be rejected by xet-core with
the original_size/reconstruction-size mismatch — the
delta-weight-sync workflow was unusable end-to-end before this fix.

Also tightens the bench: the previous diagnostic warn line becomes
an `assert_eq!` so a future regression of the same shape fails the
bench loudly instead of slipping through.
The previous bench simulated 'patch each changed bf16 element with its own
2-byte pwrite' (~53k scattered tiny writes per step). That hit a known
quadratic cost in xet-core's upload_ranges window planner — single steps
took 60+ minutes. It also doesn't match what the blog actually does: the
delta-weight-sync TRL extension BATCHES the changed elements into a
single sparse-safetensors blob and uploads that as one file.

Rewrite the bench around the realistic batched pattern:

- bench_sparse_delta_blob (default): one contiguous blob write per step,
  sized to delta_rate × file_size. This is the path our sparse-write
  design is actually optimized for — range_upload composes 1 dirty range
  against the old CAS base.

- bench_full_overwrite (default): same blob workload, but on a non-sparse
  mount so the legacy upload_files re-uploads the full file every step.
  Baseline for the 'ship the whole checkpoint each step' approach the
  blog compares against.

- bench_sparse_delta_scattered (opt-in via BENCH_RUN_SCATTERED=1): the
  old pathological N-tiny-pwrite pattern, kept as a regression marker
  for the upload_ranges window-planner cost. Off by default — single
  step takes 10+ minutes.
Bench at delta-weight-sync style workloads (1 contiguous blob per step,
~1% of file) shows our sparse-write loses to non-sparse + xet CDC dedup
below ~1 GB:

  64 MB:   sparse 3.97s    vs non-sparse 0.36s  (~11x slower)
  256 MB:  sparse 4.57s    vs non-sparse 0.59s  (~8x slower)
  1024 MB: sparse 0.88s    vs non-sparse 1.35s  (~1.5x faster)
  1200 MB: sparse 1.45s    vs non-sparse 1.37s  (~par)

xet-core's `upload_files` already wire-trims small files efficiently
via CDC chunk dedup — only modified chunks go on the wire. The sparse
path adds a reconstruction-info round-trip + per-window compose overhead
that doesn't pay off until the file is large enough for local hashing
+ chunk-query traffic to dominate.

Add a per-mount fallback threshold: files smaller than
`sparse_min_size_bytes` (default 256 MiB, override via
`HF_MOUNT_SPARSE_MIN_BYTES`) take the standard non-sparse path. Wired
through `VfsConfig` so the CLI/setup layer owns the env-var read and
the mock tests can set 0 to keep their tiny-file sparse semantics.

No behavior change for files >= threshold; users who want sparse on
all files unconditionally can set HF_MOUNT_SPARSE_MIN_BYTES=0.
When a sparse-write inode's `coverage` spans the full file (typical after
a long-lived handle that read most of the file plus some dirty writes),
the staging file already contains the committed content byte-for-byte.
range_upload's per-window CAS re-fetches (`stream_cas_range` for old
prefix/suffix bytes around each dirty range) then download data we
already have locally — for a 12 MB edit in a 1.2 GB file with a 64 MB
segment window, that's ~50 MB of waste per commit.

Bypass range_upload for those items and route them through Pass B's
`upload_files`. Same wire bytes — xet-core's CDC dedup re-uses existing
xorbs for unchanged chunks — but no redundant CAS downloads to compose
the new file.

Gated on `coverage spans [0, file_size) && !dirty_ranges.is_empty()`:
the second clause preserves the "no-op flush keeps the original hash"
semantics. A flush with no dirty bytes but full coverage still hits
range_upload's `dirty_inputs.is_empty()` short-circuit and returns the
original hash unchanged. Forcing upload_files there would re-CDC into
a new (content-equivalent but xet-different) hash and lose the cache.

Apply path is unchanged: items with `sparse_snapshot.is_some()` still
go through `apply_commit_sparse` (preserves the coverage map for the
next session), regardless of which upload path produced the new hash.
…vings

The warm-cache benches all show sparse and full + xet CDC dedup roughly
tied at 1 GB+ scale on EC2: the staging file holds the seed content, so
sparse's range_upload composes against a coverage-full state (which the
fast path now routes through upload_files anyway) and the two paths
become functionally equivalent.

Where sparse actually wins is the cold-cache case: a fresh node opens a
remote checkpoint, makes one edit, commits. The full-overwrite path has
to download the entire checkpoint into local staging before it can open
it for write; the sparse path punches a hole and composes against the
existing CAS reconstruction, never materializing the unchanged bulk
locally.

Bench scenario:
1. Mount #1: seed the file, wait for the Hub commit, unmount + wipe cache.
2. Mount #2 (cold): same bucket, fresh cache_dir → no local state.
3. Write a single delta-sized blob, sync, verify; report the wall-time.

Expected at 1.2 GB / 1% delta: sparse cold ~1-2 s, full cold ~12 s + tail
(the 1.2 GB download dominates the full path).
Sparse items with full coverage take the upload_files fast path (uploaded
as-is, no range_upload composition). The commit-apply dispatched on
sparse_snapshot.is_some(), so these items went through apply_commit_sparse
which leaves staging_is_current=false. Result: the next open re-downloaded
the full file even though staging already matched the committed hash
byte-for-byte. Track the fast-path decision per item and dispatch through
apply_commit (drop sparse_write, flag staging_is_current) instead.

Also flip the fast-path index store from Vec<usize> with linear .contains
lookups to a Vec<bool> aligned with to_flush (O(1) per item).
…r install with open threshold

Previously open_advanced_write returned EIO on a drift (xet_hash rotation
between the open() snapshot and the install lock). The retry loop in open()
caught any EIO, meaning a real disk error also paid the 4-attempt backoff
(~75 ms) before surfacing. Switch the drift signal to EAGAIN (a dedicated
internal sentinel; userspace never sees it because the loop upgrades a
final EAGAIN to EIO). Real EIO now surfaces on the first attempt.

The setattr sparse-install branch was also missing the sparse_min_size_bytes
threshold check, so a small file getting its first edit via setattr would
take the sparse path while the same file edited via open(O_TRUNC) would not.
Mirror open_advanced_write's want_sparse predicate so both entry points
agree on which files engage sparse.

Also document why setattr's mid-mutation 'needs_rebuild requires Some(hash)'
keeps the existing expect() rather than returning EIO: entry.size and
set_dirty have already been applied above, so a mid-mutation EIO would
leave a dirty inode with stale/empty sparse_write and the next flush
would commit corrupt content. Failing loud is safer than failing silent.
…ck omission

- snapshot_dirty_bytes: translate UnexpectedEof into a greppable
  invariant-violation message naming the offending range. Default Display
  is 'failed to fill whole buffer' which leaks zero context.
- setattr sparse-install: document why no io_lock is needed (no prior
  staging file under !local_exists, so no Arc<File> from a previous open
  can race against File::create's O_TRUNC window; staging.lock(ino)
  already excludes concurrent open_advanced_write installs).
…nches exercise sparse

The 256 MiB production default for HF_MOUNT_SPARSE_MIN_BYTES (added when
small-file fallback landed) silently downgraded any test or bench whose
files sat below that size, even when --sparse-writes was passed on the
command line:

- tests/fsx_paranoid.rs: MAX_SIZE = 1 MiB. The test's stated purpose is to
  catch range_upload composition regressions, but range_upload is only
  exercised on the sparse path. Without the override, this CI test ran the
  legacy download-then-upload path and would not have flagged sparse
  composition bugs.

- tests/sparse_delta_bench.rs: default file size 64 MiB. The SPARSE_BLOB /
  SPARSE_SCATTERED / SPARSE_COLD benchmarks compared full-upload against
  full-upload rather than sparse against full unless the operator bumped
  BENCH_FILE_SIZE_MB above 256.

Both call sites now set HF_MOUNT_SPARSE_MIN_BYTES=0 before spawning the
mount child. Each tests/*.rs file compiles as its own test binary so the
env mutation is contained, and the variable only matters for processes
that also pass --sparse-writes — FULL baselines are unaffected.

Found by codex review.
…f unsafe set_var

cargo test runs tokio tests inside a binary concurrently, so calling
std::env::set_var from a test setup races with sibling tests reading the
parent env or spawning their own children. The race is benign for fsx
(single test in the binary) but real for sparse_delta_bench, which has
five parallel benchmark tests.

Add mount_bucket_with_env that takes an &[(&str, &str)] of env pairs and
forwards them via Command::env on the child only. The mutation never
touches the parent's process env so the unsafe set_var precondition
becomes moot.

Also document the post-fast-path-fix consequence on the warm SPARSE_BLOB
benchmark: the seed leaves staging fully covered, so subsequent delta
opens install full coverage and flush picks the fast-path upload_files
route — meaning warm SPARSE_BLOB now measures upload_files, not
range_upload. The cold benchmark is the one that actually exercises
range_upload composition.

Found by codex follow-up review.
The existing cross-file stress exercises FlushManager batching and inter-
inode locks but pins each worker to its own file, so the per-inode
io_lock, sparse_write state machine, and dirty_generation guard never
see real concurrent traffic. This new variant points N workers at a
SINGLE inode — exactly the FUSE configuration where the per-inode
synchronization is on the critical path.

Workload mix per worker: 40% writes (overwrite / at-EOF / past-EOF-with-
gap / random), 35% reads (heavy mix to thrash io_lock + sparse-fill),
15% setattr grow/shrink, 8% fsync, 5% reopen. RogueRotation is dropped
(needs global quiesce — incompatible with concurrent siblings).

Validation is fuzz-style rather than equivalence: overlapping concurrent
writes make byte content indeterminate, so we assert (1) no panic / no
poison / no surprise errno, (2) post-quiesce sparse_write invariants
(sorted non-overlapping coverage and dirty_ranges, dirty ⊆ coverage,
nothing past entry.size), (3) final read returns entry.size bytes.

Defaults are 8 workers × 500 ops (~330 ms). Heavier configs via
SAME_INODE_WORKERS / SAME_INODE_OPS / SAME_INODE_SEED — passes locally
through 8/500, 16/2000, 24/5000 ×2 seeds, and 32/10000.
Companion to the chaos same-inode stress. That variant maximises race
coverage but can only assert structural invariants because overlapping
concurrent writes make byte content scheduler-dependent. This variant
trades some race surface for a STRONG oracle:

- File is split into N disjoint per-worker slots
- Each worker writes ONLY within its own slot, maintaining a private
  shadow that is ground truth for that range
- Reads can target ANY slot — exercises concurrent pread vs. another
  worker's pwrite + sparse-coverage snapshot pairing on the io_lock
- After quiesce, the committed file must equal the concatenation of
  every worker's final shadow, byte-for-byte

A silent corruption (torn read, lost write, fill_sparse_holes
clobbering user bytes, dirty_generation guard missing a write) fails
the final assert with the offending offset and owning worker.

setattr is dropped to keep the per-slot ownership invariant clean
(setattr is already exercised heavily by the chaos variant and the
cross-file stress).

Defaults: 8 workers × 1000 ops, 4 KiB slots (~500 ms). Heavier configs
via VERIFIED_WORKERS / VERIFIED_OPS / VERIFIED_SLOT_SIZE / VERIFIED_SEED.
Locally passes 8/1000, 16/5000, 32/20000 ×2 seeds, 64/30000, 64/50000.
The mock-driven stress runs hundreds of thousands of ops per second, fast
enough that it's reasonable to wonder whether the workload actually fires
all op types and exercises the sparse CAS path. Add per-op-type counters
(writes/reads/fsyncs/reopens) plus MockXet activity (CAS streams + CAS
objects created) and print them at the end of every run.

Sanity asserts catch the failure modes a stress test silently degenerates
into:
- assert(writes > 0): rng never picked Write
- assert(reads > 0): rng never picked Read
- assert(fsyncs > 0): rng never picked Fsync
- assert(reopens > 0): rng never picked Reopen
- assert(cas_streams > 0): sparse-fill path never fired (would mean the
  staging cache covered every read from the start, indicating the sparse
  install didn't engage)

If a future refactor accidentally short-circuits one of these op kinds
the assertion fires immediately instead of the test passing with reduced
coverage.
… mount

New tests/sparse_concurrent_real.rs:
- N tokio workers writing concurrently on a single FUSE-mounted inode
- Each worker owns a disjoint slot so the byte content is deterministic
  and we can byte-compare after a fresh remount that bypasses the local
  staging cache
- Exercises the full real-CAS pipeline: FUSE kernel routing + hf-mount-fuse
  async handler + range_upload composition + Hub commit + cold-cache
  read-back from CAS

The remount-and-compare design specifically targets corruption modes that
the in-process MockXet stress can't surface — anything that would only
manifest with real network round-trips, real xet-core CDC, or real Hub
commit semantics now fails the byte-level assert with offset + owner
worker id.

Defaults: 6 workers × 12 ops × 1 MiB slots ≈ 30 s end-to-end. Tunable via
CONCURRENT_REAL_WORKERS / CONCURRENT_REAL_OPS_PER_WORKER. Wired into the
fsx CI job alongside fsx_paranoid.

Also fix a silent-pass bug shared by fsx_paranoid: mount_bucket_with_env
only WARNS on a failed wait_for_mount and returns the Child regardless.
If the mount fails (stale FUSE state, broken macFUSE on a dev machine,
missing kernel module), every std::fs operation then hits the bare /tmp
dir, the round-trip read returns the same local bytes, and the test
silently passes despite no actual CAS activity. Both tests now call
common::is_mounted after the mount step and panic if the mount isn't
live.
Three tests and one NFS errno mapping ported from feat/append-write (the
predecessor branch). These pin invariants that the rewrite preserves but
that weren't previously exercised end-to-end:

- sparse_post_flush_read_returns_cas_bytes_not_zeros: after apply_commit_sparse
  the inode keeps its sparse_write (rekeyed to the new hash) with cleared
  dirty_ranges. A read of an unmodified region must fill holes from the
  just-committed CAS file, not return zeros from the staging holes. Pre-fix
  on PR #41 this returned '00XX000000' instead of '01XX456789'.

- sparse_fill_short_cas_stream_returns_eio_not_panic: when a CAS download
  yields zero bytes for a non-empty hole range, fill_sparse_holes must
  surface EIO rather than panicking on an OOB copy_from_slice. The code
  already has the data.len() < span check; this test pins it.

- sparse_noop_flush_advances_mtime: apply_noop_commit calls
  touch_commit_clocks so mtime advances even when the upload returns the
  unchanged hash. Required for POSIX-y fsync semantics: a successful
  fsync after an open-for-write should reflect modification time, not
  freeze it because the bytes happened to dedup.

- nfs: errno_to_nfs maps libc::EAGAIN -> NFS3ERR_JUKEBOX defensively.
  EAGAIN is now the sparse-write drift sentinel (returned from
  open_advanced_write when poll rotates xet_hash between snapshot and
  install). The retry loop in open() normally upgrades it to EIO so
  userspace never sees it, but mapping it directly avoids degrading a
  future leak to NFS3ERR_IO on NFS exports.

Tests not backported (already covered or not applicable to the rewrite):

- 25 helper-level unit tests (sparse_merge_range_*, sparse_clip_*,
  sparse_two_*, sparse_engulf_*, etc.) duplicate what
  src/virtual_fs/inode.rs::tests already covers structurally.

- b1/b5/b6/b8/c1/c2/c3/c4 numbered tests target specific bugs of the
  old approach (setattr loses bytes, stale sparse_write rollbacks). The
  rewrite addresses them structurally; equivalent regression coverage
  exists via sparse_setattr_after_hash_drift_uses_current_revision,
  sparse_otrunc_then_write_then_reopen_keeps_user_data, and the
  same-inode concurrent stress / verified-integrity tests.

- D1/E1/E3 source-shape tests grep mod.rs for specific code patterns
  (e.g. 'self.staging.io_lock(' in read() before the pread). Equivalent
  invariants in the new code shape (which uses io_lock + dirty_generation
  guard slightly differently) would need full rewrite to be useful, and
  the behavioural integrity stress already catches the bugs these were
  guarding.
Two new env vars on the multi-worker real-CAS stress so it can scale
beyond the original 1 MiB-per-slot defaults:

- CONCURRENT_REAL_SLOT_KB (default 1024): override the per-worker slot
  size. With 16 workers × 65536 KiB = 1 GiB file, the test exercises
  range_upload composition on realistic checkpoint-sized files.

- CONCURRENT_REAL_FSYNC_EVERY_N (default 0 = off): force sync_all every
  N ops per worker. Simulates the training-loop pattern of periodic
  checkpoint saves and exercises multiple apply_commit_sparse cycles
  within a single test run instead of just the final flush.

Validated on EC2 (us-east-1, m6i.2xlarge) over a 12-min looped stress:
20 iterations × 16 workers × 10000 ops × 64 MiB slots × fsync_every_20
= 3.2M writes + 160k fsyncs across 20 full seed-to-CAS-to-remount-and-
verify cycles, all byte-compare PASS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant