feat: sparse writes with range_upload, zero-download write path by XciD · Pull Request #41 · huggingface/hf-mount

XciD · 2026-03-16T10:43:12Z

Summary

Replaces the full-download write path with sparse staging files and dirty range tracking. When opening an existing file for write, we create a sparse staging file (set_len()) instead of downloading CAS content. Only the dirty byte ranges are tracked and uploaded.

How it works

Open for write: create sparse staging file at original size, no CAS download
Write: pwrite() to staging file, SparseWriteState::track_write() records dirty ranges (O(log n) merge with binary search)
Read: pread() from staging, fill_sparse_holes() downloads CAS data on demand for non-dirty regions
Truncate: clip_to_size() trims dirty ranges on shrink, track_write() marks extension as dirty on grow
Flush: range_upload composes a new CAS file from stable prefix/suffix + re-chunked dirty regions via upload_ranges() in xet-core

For a 200MB file with a 1KB mid-file edit, this downloads 0 bytes on open and uploads only the dirty chunks (not 200MB).

Key components

SparseWriteState (inode.rs): tracks original_hash, original_size, sorted dirty_ranges vec
fill_sparse_holes (mod.rs): downloads CAS data for sparse holes in read buffer
range_upload (xet.rs): builds DirtyInput vec from dirty ranges, each with its own async reader
FlushEntry/FlushSuccess structs (flush.rs): replace raw tuples for flush pipeline
dirty_generation counter (reused): prevents a stale flush from clearing dirty state after concurrent writes/renames
NFS handle upgrade (nfs.rs): upgrades read-only pool handles to writable on first WRITE RPC

Tests

47 new unit tests covering sparse writes, flush races, POSIX edge cases
Sparse write smoke tests in integration suite (mid-file write, append past EOF, CAS round-trip, multi-write accumulation, large file range_upload)
fsx: 50k random ops (staging) + 100 paranoid ops (CAS round-trip per mutation)
xfstests generic/quick suite (167+ pass)

Dependencies

xet-core branch adrien/combined-hf-mount (PR #717) which adds upload_ranges() API for composing files from CAS prefix/suffix + dirty regions

github-actions · 2026-03-16T10:51:12Z

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

github-actions · 2026-03-16T10:51:50Z

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    186.0 MB/s     235.6 MB/s
  Sequential re-read                2222.7 MB/s    2322.0 MB/s
  Range read (1MB@25MB)                0.4 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1480.0 MB/s
  Close latency (CAS+Hub)            0.138 s
  Write end-to-end                   291.7 MB/s
  Dedup write                       1988.4 MB/s
  Dedup close latency                0.116 s
  Dedup end-to-end                   353.1 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    766.1 MB/s     922.3 MB/s
  Sequential re-read                2217.5 MB/s    2482.6 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1824.9 MB/s
  Close latency (CAS+Hub)            0.547 s
  Write end-to-end                   304.5 MB/s
  Dedup write                       1813.2 MB/s
  Dedup close latency                0.095 s
  Dedup end-to-end                   972.3 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1348.6 MB/s    1370.3 MB/s
  Sequential re-read                2188.3 MB/s    2501.7 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1594.6 MB/s
  Close latency (CAS+Hub)            0.165 s
  Write end-to-end                  1044.0 MB/s
  Dedup write                       1573.3 MB/s
  Dedup close latency                0.135 s
  Dedup end-to-end                  1103.8 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  365.0      483.1                      
  seq-reread-100M               2702.7      420.2                      
  rand-read-4k-100M                0.1        0.1         14         19
  seq-read-5x10M                 793.7      806.5                      
  rand-read-10x1M                  0.1        0.1         32         35
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           72859.0 us   51803.8 us
  rand-read-10x1M             31347.3 us   28543.2 us
============================================================

Restore main's tests/fsx.rs (canonical xfstests fsx binary, 50k random ops) and pull in the paranoid mode from PR #41's earlier fsx implementation as a separate test file. Paranoid mode does full CAS round-trip after each mutation: write/truncate, flush + sleep, re-open + read-back, verify against in-memory reference. This catches composition bugs in range_upload that the canonical fsx misses (since fsx reads from the local staging file, not from CAS). Slow (~1.5s per op for flush debounce + CAS propagation). Use FSX_PARANOID_OPS to control iteration count (default: 100).

The 7 sparse-write tests appended to run_write_tests exercise the full range_upload + fill_sparse_holes pipeline (PR #41 + xet-core PR #717), which requires the new /v2/file-chunk-hashes endpoint. Until that endpoint ships, the flush retries fail in a loop ('caller said original_size=N but reconstruction info reports M') and the 5s sleeps make the test feel stuck. Skip the new tests by default; opt in with HF_MOUNT_SPARSE_TESTS=1 once the endpoint is live.

Restore main's tests/fsx.rs (canonical xfstests fsx binary, 50k random ops) and pull in the paranoid mode from PR #41's earlier fsx implementation as a separate test file. Paranoid mode does full CAS round-trip after each mutation: write/truncate, flush + sleep, re-open + read-back, verify against in-memory reference. This catches composition bugs in range_upload that the canonical fsx misses (since fsx reads from the local staging file, not from CAS). Slow (~1.5s per op for flush debounce + CAS propagation). Use FSX_PARANOID_OPS to control iteration count (default: 100).

The 7 sparse-write tests appended to run_write_tests exercise the full range_upload + fill_sparse_holes pipeline (PR #41 + xet-core PR #717), which requires the new /v2/file-chunk-hashes endpoint. Until that endpoint ships, the flush retries fail in a loop ('caller said original_size=N but reconstruction info reports M') and the 5s sleeps make the test feel stuck. Skip the new tests by default; opt in with HF_MOUNT_SPARSE_TESTS=1 once the endpoint is live.

## Summary APIs for range-aware file writes: instead of re-uploading an entire file when only part of it changed, compose a new CAS file from stable segments + re-chunked dirty windows. Supports resize edits (insert / delete / arbitrary replace) in addition to in-place rewrites. ### API: `upload_ranges` ```rust pub async fn upload_ranges( config: Arc<TranslatorConfig>, cas_client: Arc<dyn Client>, original_hash: MerkleHash, original_size: u64, dirty_inputs: Vec<DirtyInput>, ) -> Result<XetFileInfo> ``` ```rust /// A single edit applied to the original file: replace `original_range` with /// `new_length` bytes from `reader`. Edits are expressed in original-file coordinates. pub struct DirtyInput { pub original_range: Range<u64>, pub reader: Pin<Box<dyn AsyncRead + Send>>, pub new_length: u64, } ``` The output file size is **derived** from the inputs (no `total_size` parameter): `original_size - removed + added`. ### Edit shapes (all expressible with the same struct) | Operation | `original_range` | `new_length` | |---|---|---| | In-place edit | `a..b` | `b - a` | | Resize replace | `a..b` | any | | Pure insert | `p..p` | `> 0` | | Pure delete | `a..b` | `0` | | Append | `original_size..original_size` | `> 0` | | Truncate to N | `N..original_size` | `0` | | No-op | empty `dirty_inputs` | — | Motivating example: ```text abc + upload_ranges([0..1), "foo", 3) = foobc abc + upload_ranges([0..0), "foo", 3) = fooabc abc + upload_ranges([0..1), "", 0) = bc ``` **Per-range `AsyncRead` instead of `ReadSeek` over the staging file.** The earlier prototype took `dirty_ranges: &[(u64, u64)] + dirty_source: &mut dyn ReadSeek`. That had a subtle bug: for truncation we silently extended the dirty set with a boundary chunk and read those bytes from the staging file, but if the file was never opened for write the staging file contains zeros at those positions (real bytes are in CAS) → silent corruption on the truncation boundary chunk. Pairing each edit with its own reader makes that structurally impossible: any byte not provided by the caller is fetched from CAS. <details> <summary>How it works</summary> ### High level ``` upload_ranges +----------------------+ | +----------------------+ | original file (CAS) |---+-->| composed file (CAS) | +----------------------+ +----------------------+ only the dirty windows are re-uploaded; everything else is reused as whole CAS segments. ``` ### Step 1 — coalesce + snap edits to segment boundaries Edits are user-coordinates (byte ranges). We snap each edit's `original_range` to the **enclosing CAS segments** so composition can swap whole segments instead of truncating one mid-chunk. Adjacent / overlapping snapped ranges are then coalesced. Pure inserts (`start == end`) snap to the segment that owns `start`; an insert at `original_size` snaps to the last segment. ### Step 2 — server returns windows + gap subtrees Single CAS call: `GET /v2/file-chunk-hashes/{file_id}` with the segment-aligned ranges in an `X-Range-Dirty: bytes=A-B,C-D` header. Response shape (xetcas#987): ```rust struct FileChunkHashesResponse { windows: Vec<ChunkWindow>, // one per dirty range hash_ranges: Vec<Option<MerkleHashSubtree>>, // N+1 entries: [gap0, gap1, ..., gapN] } ``` `windows[i].chunks` carries the chunk hashes the server actually owns for that window (we re-upload these bytes). `hash_ranges[i]` is the **MerkleHashSubtree** for the i-th unmodified gap, or `None` when there is no gap there. This is the key to composing the final file hash without touching unmodified bytes. ### Step 3 — for each window, stream `[CAS prefix | edits | CAS suffix]` through a fresh cleaner ``` window = [w_start ............................................. w_end] edits in this window: [edit_a] [edit_b] ^ ^ streamed input to the cleaner: CAS bytes [w_start, edit_a.start) reader bytes for edit_a (new_length bytes) CAS bytes [edit_a.end, edit_b.start) reader bytes for edit_b CAS bytes [edit_b.end, w_end) ``` Pure inserts contribute zero original bytes but still emit `new_length` reader bytes. Pure deletes contribute zero reader bytes. The cleaner produces a new `MDBFileInfo` per window and a `ChunkHashList`. ### Step 4 — compose the file hash via `MerkleHashSubtree::merge` ```text merge_seq = [gap0, w0, gap1, w1, ..., wN, gapN] // skip None gaps merged = MerkleHashSubtree::merge(merge_seq) aggregated_hash = merged.final_hash() combined_hash = aggregated_hash.hmac(zero) // matches cleaner's file_hash ``` Special-case: if `total_size == 0` (e.g. truncate to empty) the result is `MerkleHash::default()` *without* HMAC, mirroring `file_hash([])`. ### Step 5 — splice segments + register Walk the original `MDBFileInfo.segments` and replace any segment that falls inside a window with that window's freshly-uploaded segments. Verification entries follow segment-for-segment when present. `metadata_ext = None` (no SHA-256, see Limitations). Then `register_composed_file` + `finalize`. ### Multi-window example Two edits: replace `[50MB, 51MB)` and `[150MB, 151MB)` on a 200MB file: ``` +-----------+-------+------------+-------+-----------+ | GAP 0 | W0 | GAP 1 | W1 | GAP 2 | | reused |upload | reused |upload | reused | | (subtree) | ~1MB | (subtree) | ~1MB | (subtree) | +-----------+-------+------------+-------+-----------+ Wire transfer: ~2MB upload + a few hundred KB of CAS reads for window boundary chunks. Old approach: 200MB download + 200MB upload. ``` ### Empty original short-circuit When `original_size == 0` there is nothing to compose against — every edit's `original_range` must be `0..0` (validated). We just stream the new bytes through a fresh cleaner (`upload_fresh_file`). </details> ### Reviewer note: `chunk_window_builder` is a re-implementation of xetcas `xet_client/src/cas_client/chunk_window_builder.rs` is a port of the same window-building state machine that already lives in xetcas — it's only used by the local / in-memory simulation clients (`local_client`, `memory_client`) so the mock CAS server returns the same shape as the real one in tests. **No need to re-review it as part of this PR**: it mirrors logic already reviewed and merged in xetcas#987. A follow-up xetcas PR will deduplicate by removing the server-side copy and pulling this one in (or vice versa); the duplication is intentional and temporary. ### Limitations - **No SHA-256 metadata**: composed files have `metadata_ext = None` since recomputing SHA-256 would require reading the full file. Only suitable for contexts that don't require SHA-256 verification (HF buckets, xet-native repos), not for Git LFS-backed repos. - **Memory**: for very large files, the per-window in-memory state (chunk hash list + composed segments) is bounded by the dirty regions, not the whole file. The chunk-hashes response is paginated by the server-defined window granularity. ### Tests (27) Covering all edit shapes + edge cases. Notable: | Test | Purpose | |---|---| | `test_resize_edits_abc` | The 3 motivating FUSE examples | | `test_resize_large_replace_grows_file` | Replace `[a..b)` with much more data | | `test_resize_large_replace_shrinks_file` | Replace `[a..b)` with much less data | | `test_resize_mid_file_insert` | Pure insert in the middle | | `test_resize_mid_file_delete` | Pure delete in the middle | | `test_resize_multi_edit_mix` | Insert + replace + delete in one call | | `test_resize_insert_at_segment_boundary` | Snapping correctness for inserts | | `test_upload_ranges_mid_file_edit` | In-place edit | | `test_upload_ranges_truncation` | Pure truncate (sub-segment) | | `test_upload_ranges_truncation_empty_staging` | Truncate when staging is all-zero (boundary read from CAS) | | `test_upload_ranges_truncation_with_overlapping_dirty` | Truncate + dirty range overlapping the boundary | | `test_truncate_to_empty_matches_clean_empty` | Truncating to 0 hashes to `MerkleHash::default()` (matches a fresh empty cleaner) | | `test_upload_ranges_append` | Pure append | | `test_append_with_gap_before_dirty_range` | Append where reader covers a sparse gap too | | `test_append_sparse_staging_file` | Append on a sparse staging file | | `test_mid_edit_plus_append` | Mid-file edit *and* append in one call (P1 codex regression) | | `test_empty_original_append` | `original_size == 0` + append falls into the fresh-file path (P2 codex regression) | | `test_empty_original_validates_ranges` | `original_size == 0` still runs validation (reviewer regression) | | `test_upload_ranges_at_file_start` | Edit at offset 0 (no stable prefix) | | `test_upload_ranges_multiple_regions` | Two non-adjacent dirty windows with stable gap | | `test_single_input_spanning_many_chunks` | One edit covering many CDC chunks | | `test_data_integrity_scenarios` | 5 sub-scenarios covering composition correctness | | `test_noop_returns_original_hash` | Empty `dirty_inputs` → no CAS call, original hash returned | | `test_rejects_dirty_range_past_total_size` | Validation: range past `original_size` | | `test_rejects_overlapping_dirty_ranges` | Validation: overlapping edits | | `test_rejects_unsorted_dirty_ranges` | Validation: unsorted edits | | `test_upload_ranges_small_file_mid_edit` | Small files (single segment) | ### Dependencies - xetcas: `GET /v2/file-chunk-hashes/{file_id}` with `windows[] + hash_ranges[]` response shape — huggingface-internal/xetcas#987 (merged). - Consumer: huggingface/hf-mount#41.  --- > [!NOTE] > **High Risk** > High risk because it adds a new partial-upload composition path that splices CAS segments and recomputes file hashes from window subtrees, touching core data integrity and client/server chunk-boundary logic. > > **Overview** > Adds range-aware file writes via new `upload_ranges`, letting callers apply insert/delete/replace edits and upload only re-chunked dirty windows while reusing stable CAS segments. > > Introduces a new CAS API `get_file_chunk_hashes` (`GET /v2/file-chunk-hashes/{file_id}` with `X-Range-Dirty`) plus response types (`FileChunkHashesResponse`, `ChunkWindow`) and simulation support (`chunk_window_builder`) that extends dirty ranges to *stable* chunk boundaries and returns gap `MerkleHashSubtree` summaries + stable-segment verification. > > Refactors dedup/cleaning plumbing to expose per-chunk hash lists (`ChunkHashList`), adds detached cleaner/session completion and `register_composed_file` to avoid orphan shard entries, and moves/re-exports `next_stable_chunk_boundary` into `xet_core_structures` for shared stable-window computations. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 2f4cee4. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>  --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Arpit Jain <arpitjain099@gmail.com> Co-authored-by: Hoyt Koepke <hoytak@huggingface.co> Co-authored-by: tison <wander4096@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Di Xiao <seanses@users.noreply.github.com> Co-authored-by: Arpit Jain <3242828+arpitjain099@users.noreply.github.com> Co-authored-by: Assaf Vayner <assaf@huggingface.co> Co-authored-by: Rajat Arya <rajatarya@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Restore main's tests/fsx.rs (canonical xfstests fsx binary, 50k random ops) and pull in the paranoid mode from PR #41's earlier fsx implementation as a separate test file. Paranoid mode does full CAS round-trip after each mutation: write/truncate, flush + sleep, re-open + read-back, verify against in-memory reference. This catches composition bugs in range_upload that the canonical fsx misses (since fsx reads from the local staging file, not from CAS). Slow (~1.5s per op for flush debounce + CAS propagation). Use FSX_PARANOID_OPS to control iteration count (default: 100).

The 7 sparse-write tests appended to run_write_tests exercise the full range_upload + fill_sparse_holes pipeline (PR #41 + xet-core PR #717), which requires the new /v2/file-chunk-hashes endpoint. Until that endpoint ships, the flush retries fail in a loop ('caller said original_size=N but reconstruction info reports M') and the 5s sleeps make the test feel stuck. Skip the new tests by default; opt in with HF_MOUNT_SPARSE_TESTS=1 once the endpoint is live.

#178) ## Summary NFS WRITEs on macOS were silently dropping bytes when the file had been READ first. Root cause is server-side: a read-only pooled handle from a prior NFS READ returns `EBADF` on `virtual_fs.write`, which `errno_to_nfs` maps to `NFS3ERR_STALE`. macOS NFS treats STALE on WRITE as fatal and silently flushes its write buffer — `dd` reports success but the bytes never reach the server. `fsync(2)` later returns ESTALE. ## Fix `nfs.rs::write()`: 1. **Fast path EBADF**: handle was opened read-only. Remove the now-stale entry from the pool (guarded by `peek == Some(fh)` so a concurrent successful upgrader isn't clobbered), evict, fall through. 2. **Slow path**: `open(writable=true)` → `pwrite` → only THEN `insert_handle`. The freshly-opened fh stays private to this task until its write commits. No other task can release it via `insert_handle`'s `replaced` eviction while we're between insert and pwrite — because we don't insert until after the pwrite is done. Mirrors the existing EBADF retry pattern in `read()` (nfs.rs:181-200). ## Reproducer (pre-fix) ```bash hf-mount-nfs bucket X/y /mnt ls /mnt # triggers READ → pools Lazy handle dd if=/dev/urandom of=/mnt/existing-file bs=1M count=1 seek=100 conv=notrunc,fsync # dd reports success; file in bucket is unchanged. python3 -c "import os; fd=os.open('/mnt/existing-file', os.O_RDWR); os.fsync(fd)" # OSError: [Errno 70] Stale NFS file handle ``` Server log shows: ``` open: ino=2, writable=false ← from the READ read: fh=1, offset=... write: ino=2, fh=1, offset=104857600, len=1048576 ← EBADF returned, mapped to STALE (no further server activity; macOS dropped the write buffer) ``` ## End-to-end validation (post-fix, macOS NFS mount) Sequential read+write: ``` write: ino=4, fh=1 → EBADF release: fh=1 ← pool entry removed open: writable=true ← slow path write: ino=4, fh=2, offset=0 ← succeeds on fresh fh ``` 8 concurrent dd at distinct offsets on same file (was previously hangbait): ``` all 8 dd done in .009s 8 writes server-side on fh=2 (fast path reused after initial upgrade) zero release intempestif, zero NFS3ERR_STALE ``` Python fsync: ``` fsync OK ✓ # before fix: [Errno 70] Stale NFS file handle ``` ## Concurrent-write race (worth calling out) An earlier draft of this PR added a per-inode `tokio::sync::Mutex` around the slow path. Adversarial review (Codex) pointed out the correct, lighter defense: **publish to the pool only after the first write succeeds**. With insert_handle deferred until after pwrite, the fh is unreachable by other tasks during its critical window — no mutex needed, race closed by invariant. Concretely, if two NFS WRITE RPCs on the same ino both peek a Lazy handle and both hit EBADF: - Both reach the slow path; `virtual_fs.open(writable=true)` is internally serialized by VirtualFs's per-ino staging lock, so they produce distinct fh_A and fh_B. - Pre-fix (insert-then-write): writer A inserts fh_A → writer B inserts fh_B, `replaced=fh_A` → evict_handle releases fh_A. Writer A then pwrites against fh_A → EBADF → STALE. Same silent-data-loss this PR is fixing. - Post-fix (write-then-insert): writer A pwrites to fh_A in private (no one else knows about it), then inserts. Writer B does the same with fh_B. Both pwrites succeed; the loser of the insert race has its fh released by the winner's insert, but its bytes are already in the staging file. ## Relation to other PRs - **PR #41 (sparse writes)** contains the same EBADF→upgrade fix in its nfs.rs. This PR isolates the change so it can land independently. When #41 merges, this PR becomes a no-op merge. The write-before-publish reordering here will need to be ported into #41's nfs.rs (the same race exists there). - **PR #177 (stable filehandles)** was an earlier misdiagnosis attributing the symptom to macOS client-side filehandle caching across umount. Closed. ## Tests - `write_after_read_upgrades_handle_instead_of_returning_stale` — main regression test (READ then WRITE). - `second_write_reuses_writable_handle` — fast path is reused after upgrade. - `write_without_prior_read_opens_writable_directly` — slow path standalone. - Verified by reverting just the `write()` body to main's version: tests fail with `NFS3ERR_STALE`. Tests are load-bearing for the regression. - `cargo test --features fuse,nfs --lib` → 342/342 pass. - `cargo clippy --features nfs --all-targets -- -D warnings` → clean. - End-to-end validated against `XciD/hf-mount-test` bucket on macOS NFS as described above.

Sparse staging: open for write creates a sparse file (set_len) instead of downloading the original CAS content. Dirty byte ranges are tracked in SparseWriteState and only modified regions are uploaded via range_upload. Key changes: - Sparse staging file on open (no CAS download) - SparseWriteState tracks dirty ranges with O(log n) merge - fill_sparse_holes reads CAS data on demand for read-after-write - flush_generation counter prevents stale flush from clearing dirty state - Rename re-enqueues dirty files for flush at new path - setattr truncate/grow handled via clip_to_size + gap tracking - write past original_size automatically tracks gap as dirty - file.metadata() guard against concurrent truncate vs write race New xet-core API (DirtyInput with AsyncRead per range): - range_upload builds DirtyInput per dirty range from staging file - upload_ranges handles truncation boundary from CAS directly - No download needed for any write/truncate path Testing: - 245 unit tests (47 new for sparse writes, flush races, edge cases) - fsx: 50k random ops (staging) + 100 paranoid CAS round-trip ops - xfstests: generic/quick suite with FUSE patches (167 pass) - pjdfstest: 8789 POSIX syscall tests - Integration smoke tests: mid-file edit, append, truncate, multi-write, large file (512KB) CAS round-trip

NFS handle pool opens handles read-only for reads. When a WRITE RPC arrives for an existing CAS file, the handle lacks a staging file and VFS rejects with EBADF. Fix: try write with existing handle, on EBADF evict it and reopen writable (creating sparse staging).

…/rename) Completes the sparse-write integration on top of the previous infrastructure commit. Without this, `entry.sparse_write` was never set and all flushes fell through the regular `upload_files` path. Changes: - open_advanced_write: replace `download_to_file` with a sparse staging file (`File::create + set_len`). Set `entry.sparse_write` so flush uses `range_upload`. `materializes_remote` excludes sparse so `staging_is_current` is not flagged. - open_streaming_write: clear `entry.sparse_write` on the new-stream path. - ReadTarget::LocalFd: carry `ino` so the read path can look up `sparse_write`. - read: new `fill_sparse_holes` overlays original CAS bytes onto pread results for regions in `[0, original_size)` outside `dirty_ranges`. The reconstruction cache amortizes repeated CAS fetches; we don't backfill the staging file (would require a separate fetched-ranges tracker since reusing dirty_ranges would make `range_upload` re-upload unmodified data). - write: track dirty range via `Arc::make_mut(sw).track_write`. Clean→dirty transitions (NFS handle upgrade) lazily set up `SparseWriteState`. Guard `entry.size` with `file.metadata().len()` to avoid concurrent setattr races. - setattr (truncate): drop CAS download, rely on `fill_sparse_holes` for reads and `range_upload` composition for flush. On shrink, `clip_to_size` trims dirty ranges; on grow, `track_write` extends to mark the zero gap dirty. - rename_apply_local: returns `(replaced_staging_ino, dirty_inos)`. Bump dirty_generation on dirty file rename (and dirty descendants of a dir rename) so an in-flight flush can't clear dirty state with the stale snapshot. Caller re-enqueues the inodes for flush. Drop the redundant `flush_generation` field from `InodeEntry`: main's `dirty_generation` already provides the same race protection (see `InodeEntry::clear_dirty_if`). Add `sparse_write = None` to `apply_commit` so a successful flush clears the sparse state alongside dirty. Tests: 8 new sparse-write integration tests on top of the 22 SparseWriteState unit tests. The xet-core endpoint for `range_upload` (PR #717) is not yet deployed, so runtime validation is deferred. Local builds: 355 unit tests pass, clippy clean with -D warnings.

open_advanced_write installed a SparseWriteState whenever the file was clean and had a CAS hash, even when can_reuse_staging was true and the staging file already held the full content. Reads would then trigger fill_sparse_holes to re-download CAS bytes already on disk on every read. Track whether this call actually created a fresh sparse staging file and only install sparse_write in that case.

After a sparse range_upload, the on-disk staging file still holds dirty patches over a sparse hole — it does NOT match the new CAS file. apply_commit was unconditionally clearing sparse_write and setting staging_is_current = true, so: - reads through the still-open handle would skip fill_sparse_holes and return zeros for the untouched (sparse) regions - the next open-for-write reused the stale staging as a clean cache and could re-upload zeros over unchanged ranges Pass a was_sparse_upload flag through apply_commit. On sparse commits, re-key sparse_write to the new hash with empty dirty ranges and leave staging_is_current = false, so reads on the open handle compose against the new CAS file and the next open rebuilds a fresh sparse staging. Caught by codex review.

Asserts that reading through the still-open handle after a sparse range_upload flush returns composed CAS bytes (not staging zeros) and that the inode state matches the post-fix invariants (staging_is_current=false, sparse_write re-keyed to new hash). Verified failure on the pre-fix HEAD (~32a0f1f^): test panics with 'staging must not be flagged current after a sparse flush'.

Codex follow-up review caught a race: open_advanced_write snapshots (xet_hash, size) up front, creates a sparse staging file of `size` zero bytes, then before installing sparse_write also required `entry.xet_hash == xet_hash`. If poll_remote_changes updated the inode in that window the check silently failed, leaving the file marked dirty with no sparse metadata. Flush would then treat it as a regular full upload and commit the zeros from the sparse hole. Drop the equality check and install sparse_write unconditionally when we created sparse staging. The user opened against the snapshot content and that's what range_upload composes against — any newer remote revision is reconciled through Hub commit semantics, not by silently overwriting bytes the user never touched. Includes a regression test that drives the race via the per-inode staging mutex; verified to panic on the previous HEAD (afa538e^): 'sparse_write must be installed even when inode hash drifts mid-open'.

Codex follow-up review: if poll_remote_changes updates entry.xet_hash between the open snapshot and flush, and the user releases without writing, range_upload returns the snapshot hash (no-op). The previous unchanged-path called apply_commit, which rewrote entry.xet_hash back to the snapshot — silently rolling the inode view back over the newer remote revision picked up by the poller. Add apply_noop_commit: clears dirty + sparse_write + pending_deletes and stamps last_revalidated, but leaves xet_hash/size alone so a concurrent poll update survives. flush_batch routes the unchanged slot through apply_noop_commit instead of apply_commit. Also extend the unchanged detection to compare range_upload's output against sparse_write.original_hash (the snapshot), not against prev_xet_hash (which may have drifted). This is what makes the no-op path fire in the drift case at all. Mirrored the real range_upload no-op short-circuit (dirty_ranges empty + size unchanged returns original hash) in MockXet so the test harness exercises the same path. Regression test sparse_open_close_no_writes_does_not_rollback_drifted_hash drives the race via the staging mutex and asserts: 1. no Hub batch op fires for a no-write sparse open 2. the drifted xet_hash on the inode is preserved Verified the test panics on the previous HEAD (4967ad3) with 'no Hub batch op should fire when a sparse open had no writes'.

Codex follow-up: handles both hash and size drift before set_dirty. Window of vulnerability: open_advanced_write captures the snapshot in open(), takes the per-inode staging mutex, then creates a sparse staging file sized to the snapshot. During that window poll_remote_changes can mutate entry.xet_hash AND entry.size to a newer remote revision (it skips dirty inodes via update_remote_file's early-return, so the window closes once we set_dirty). With size drift the previous fixes weren't enough: even with sparse_write installed against the snapshot, on no-write release range_upload sees file_size (= drifted entry.size) != original_size, adds a synthetic truncate, returns a new hash, and the flush commits that synthesized old-content-truncated file over the newer remote revision. Detect drift at the install branch (under the inode write lock, BEFORE set_dirty) and bail with EAGAIN. The orphan sparse staging is harmlessly overwritten on the next open (staging_is_current was already cleared, and the inode is not dirty since we never reached set_dirty). Regression test sparse_open_returns_eagain_on_inode_drift_mid_open drives the staging-mutex race with hash + size drift and asserts EAGAIN + preserved drifted state. The previous no-rollback test is retargeted to the simpler no-drift no-writes path (sparse_open_no_writes_no_op_flush).

…g EAGAIN open_advanced_write returns EAGAIN when it detects xet_hash/size drift between the snapshot and the install branch. Previously the EAGAIN bubbled up to userspace, which is correct but ugly UX — userland apps generally don't retry on EAGAIN from open(2). Wrap the inner call in a bounded retry loop (3 attempts) at the outer open() entry point. On EAGAIN we re-read the file entry (now updated by poll_remote_changes) and call open_advanced_write again with the fresh snapshot. The race window is tiny (it only exists between the snapshot read and the install branch under the per-inode staging mutex), so 3 attempts is overwhelmingly enough — and the unreachable!() panic catches a regression if drift ever becomes pathological. Test sparse_open_retries_on_inode_drift_mid_open (renamed from returns_eagain) now asserts the end-to-end invariant: user sees a successful open, sparse_write reflects the DRIFTED state (not the stale first-attempt snapshot), and the inode is properly dirty.

…k, write+setattr race Three new tests closing gaps spotted in the manual review: 1. sparse_truncate_shrink_then_flush_drops_tail — drives the synthetic truncate-DirtyInput path in range_upload (xet.rs:228-238): write a dirty patch inside the eventual range, setattr-truncate, fsync, verify the composed CAS file is the truncated prefix + dirty patch and the tail past the new boundary is gone. 2. sparse_pure_truncate_shrink_then_flush_drops_tail — same path with no writes, just truncate + flush, ensuring the synthetic delete still fires when dirty_inputs is otherwise empty. 3. write_lazy_installs_sparse_write_on_clean_inode_transition — exercises the defensive clean->dirty branch in write() (mod.rs:2392-2400) by force-clearing dirty/sparse_write after open and asserting the next write installs a fresh SparseWriteState pinned to entry.xet_hash / entry.size with the write range tracked. Through the public API this branch is unreachable today (open_advanced_write set_dirty's the inode first), but the test locks the documented behavior in case a future code path lands a write on a clean inode. 4. write_setattr_concurrent_keeps_size_consistent_with_staging — stress test for the guard at mod.rs:2384-2385 (effective_end = new_end.min (actual_size)). Two tasks race for 200 iterations each: one writes to advancing offsets, the other truncates to varying sizes. After the storm the invariant entry.size <= staging length must hold, which the guard is exactly there to preserve. Tight race window between pwrite and the metadata read makes deterministic reproduction infeasible, so the test exercises the path under contention and checks the post-condition. 369/369 tests + clippy clean.

…gging Style cleanup from the review pass: - Extract find_regular_run_end(&items, start, max) to replace the doubly-shadowed chunk_end / take_while().last() chain. Easier to read and easier to reason about (the helper documents that start must be non-sparse and that sparse items are dispatched separately). - Move the error! log inside abort_batch so both call sites get the same structured log line instead of one-off log + abort. The range upload error message now includes ino/path in the abort message itself for parity with the regular-upload path.

Code review follow-up addressing 5 nits: 1. write() recorded the dirty range with the unclamped `written` even when the post-pwrite metadata read showed the staging file had been shrunk by a concurrent setattr(truncate). At flush, range_upload would seek to the range start and read end - start bytes, short- reading on the smaller staging. Now use `effective_end.saturating_sub (offset)` so the tracked range matches what's actually on disk. 2. trim_dirty_ranges now drops ranges where the clipped end equals start. track_write doesn't produce zero-length ranges so this is defense, but keeping the (s < e) invariant downstream is cheap and avoids consumers having to think about it. 3. Comment on fill_sparse_holes: the CAS download grabs the whole [offset, orig_end) even when most of it overlaps dirty bytes. Worth a follow-up only if profiling shows it. 4. Comment on apply_commit's generation-mismatch path: the CAS upload and Hub commit fire even when local metadata can't be updated, and the next flush will re-upload. CAS dedup handles duplicate content; noting the sparse path makes redundant flushes more visible. 5. Replace the unreachable!() at the end of the retry loop in open() with an explicit `Err(libc::EAGAIN)`. Same observable behavior, cleaner control flow that the compiler can prove total. Plus a regression test (write_tracked_range_clamped_to_staging_after_ concurrent_shrink) that pwrites then setattr-shrinks past the dirty range and asserts the tracked dirty_ranges stay within entry.size.

…EIO) All three reproduced first as failing tests, then fixed: 1. fill_sparse_holes panicked on a short CAS download stream — the copy_from_slice calls indexed into cas_data assuming the full [offset, orig_end) range had arrived. If the stream ended early (transient server truncation, bad reconstruction), the slice index went OOB and crashed the FUSE/NFS read hot path. Now bounds-check cas_data.len() against the expected count and surface EIO. Repro: repro_fill_sparse_holes_panics_on_short_cas_stream — drives a short stream via the MockXet.empty_range_downloads hook. 2. apply_noop_commit cleared dirty + sparse_write but skipped the mtime/ctime bump that apply_commit performs unconditionally. Build systems, rsync-style sync tools, and any caller polling mtime missed the dirty cycle on a no-op flush. Now refreshes mtime/ctime on every call (whether or not the generation matched). Repro: repro_noop_flush_does_not_bump_mtime — open + fsync + release round-trip and asserts mtime advances. 3. abort_batch flagged every inode in to_flush with the failure message, including items whose CAS upload had already succeeded in a prior chunk of the same batch (their xorbs are in CAS, only the Hub commit didn't fire). Users saw spurious EIO on files whose bytes are fine and will be retried naturally on the next flush. abort_batch now takes a slice — sparse failures mark only the failing item; regular chunk failures mark only that chunk. Repro: repro_abort_batch_marks_already_uploaded_items — mixes a regular create+write with a sparse open+write, arms range_upload to fail, and asserts the regular inode has no flush error.

Previously each dirty range opened its own TokioFile, scaling FD usage with dirty_ranges.len() (N opens of the same path per flush). Introduce PreadReader, an AsyncRead over Arc<std::fs::File> using pread(2), so independent per-range positions are tracked without contending on a shared cursor. One open per range_upload regardless of fragment count.

Two related correctness issues uncovered by code review: 1. TOCTOU on staging length in write(). The pwrite path was reading file.metadata().len() outside the inode write lock, then updating entry.size under the lock. A concurrent setattr(truncate) running between metadata() and lock acquisition could leave entry.size exceeding the actual staging length, causing range_upload to short-read at flush time. Read the length under the lock so it is serialized with the setattr-side truncate. 2. Unsafe clean->dirty fallback in write(). The fallback installed sparse_write keyed to entry.xet_hash at write time, with no drift check. If poll_remote_changes had updated entry.xet_hash between the prior staging materialization and the first write, sparse_write would be keyed to a hash whose CAS object no longer matches the on-disk staging. range_upload would then compose a corrupted file (new prefix/suffix + stale dirty patches). Move the install into open_advanced_write under the existing drift guard, covering both freshly-sparse and reused-staging paths. Skip the install when is_dirty (staging holds in-flight modifications that no longer correspond to entry.xet_hash). The lazy fallback is now dead and removed; its dedicated regression test is removed too (the install-at-open path is covered by the broader sparse-write suite). Drive-by: MockStreamingWriter::finish_boxed now registers uploaded bytes in MockXet's file map, mirroring production CAS so post-streaming reads through fill_sparse_holes succeed (caught by the open_advanced_write_reuse_dirty_staging test under the new install path).

The prior commit installed sparse_write on the can_reuse_staging path to enable range_upload at flush time. That regressed reads: a writable handle that hadn't written anything yet would still call fill_sparse_holes, which downloaded CAS bytes for the whole read region even though the staging file already held the full original content. Add a `staging_holds_full_original` flag to SparseWriteState, set via `new_with_full_staging` on the reused-staging path only. When true, fill_sparse_holes short-circuits — staging is the source of truth for non-dirty bytes, and track_write keeps that invariant on subsequent writes. Fresh sparse opens (where staging is an actual hole) leave the flag false and continue to fill from CAS.

The paranoid variant does a full CAS round-trip after each mutation (close, wait for flush, reopen, read-back), so it catches composition bugs in range_upload that the canonical fsx misses — staging-only fsx reads from the local staging file, not from CAS. ~2.5 min for the default 100 ops, gated by HF_TOKEN like the other fsx job.

Three small fixes flagged by codex on PR #41: * `apply_noop_commit` no longer clears `sparse_write`. A no-op flush means content is unchanged; any still-open handle's reads need the sparse state to fill staging holes from CAS. Clearing it would make subsequent reads return zeros from the sparse staging file. * `update_remote_file` (poll path) now also clears `sparse_write` alongside `staging_is_current`. If we keep the old sparse state past a remote update, the next open-for-write reuses it and `range_upload` composes the new content against the stale hash. Updated test expectation accordingly. * `flush_batch` now re-enqueues dirty siblings after a partial-batch abort. Pre-fix, the surviving items stayed dirty in the inode table but had no signal queued for them — the bytes sat on disk indefinitely until the next external write triggered a flush. The re-enqueue uses a self-referencing sender clone passed into `flush_loop`; to avoid that clone keeping the channel open across shutdown (which would deadlock `FlushManager::shutdown` waiting on the join handle), introduce a `FlushSignal::Shutdown` variant that tells `flush_loop` to drop its clone before draining and exiting. Drive-by: removed a duplicate blank line in `cached_xet_client.rs` tests left over from the rebase.

Previous attempt to fix the three P2 findings introduced two symmetric bugs: clearing sparse_write in update_remote_file breaks reads on a still-open handle (poll arrives between flush and release), and keeping the old dirty_ranges across apply_noop_commit causes zeros-from-holes after release + reopen (idempotent write case). Reworked lifecycle: * `update_remote_file` no longer touches `sparse_write`. Open handles keep their snapshot for `fill_sparse_holes`; the next open will refresh if needed. * `open_advanced_write` now detects stale `sparse_write` (its `original_hash` not matching the current snapshot xet_hash) and re-installs against the snapshot. Covers the "poll updated hash while sparse_write was leftover from a prior open" case without invalidating live handles. * `apply_noop_commit` keeps `sparse_write` but clears its `dirty_ranges`. The no-op upload means the staging matches `original_hash` — nothing is dirty wrt it. Leaving stale ranges would mark zero-positions as "covered by dirty" on a subsequent reopen-with-fresh-staging, suppressing the CAS fill. The per-handle sparse_write refactor we briefly considered doesn't work cleanly because staging is shared per-inode: two open handles writing to the same file would each track separate dirty_ranges over the same on-disk bytes, and flush would need to merge them. The gated-lifecycle approach above gets the same correctness without that complexity.

Three coordinated changes that, together with the previous round, cover all four scenarios codex round-2 review identified plus a closely-related fourth I caught while validating. 1. update_remote_file (inode.rs): extend the existing `is_dirty` guard to also skip when `has_open_handles(ino)` is true. Poll no longer changes xet_hash/size out from under a live handle, so any sparse_write keyed to the open snapshot stays consistent for the full open lifetime. This is the structural fix for the concurrent-opens-with-poll class of races (Codex P2-a, P2-c). Trade-off: a long-running handle won't see remote updates until released — acceptable, matches POSIX-ish "open is a snapshot". 2. apply_noop_commit (inode.rs): also clear `staging_holds_full_original` (not just dirty_ranges). After a no-op flush the on-disk staging may be GC'd or recreated as a hole later; leaving the flag true would make fill_sparse_holes short-circuit and return zeros. Pessimistic but correct. 3. open_advanced_write (mod.rs): when staging is recreated fresh (`!can_reuse_staging`) but we kept an existing sparse_write (hash matched the snapshot), force its `staging_holds_full_original` to false. Otherwise the leftover true flag from a prior `new_with_full_staging` install would mislead fill_sparse_holes against the just-created sparse hole. All 387 unit tests pass. Behaviorally: - Codex P2-a (concurrent opens) closed by #1 (poll can't fire mid-open). - Codex P2-b (no-op + GC + reopen → zeros) closed by #2 and #3. - Codex P2-c (setattr stale state) closed by #1.

…ar flush NFSv3 has no CLOSE RPC, so the server-side handle pool keeps a writable fh alive across logical opens. After a flush that clears `sparse_write` (regular-upload commit), the next logical open through the reused fh skips `open_advanced_write`, and the write path runs with `sparse_write=None` even though the inode is CAS-backed. Without recording a dirty range, a subsequent `setattr(size)` would create a fresh empty SparseWriteState and `range_upload` would compose the new file from the CAS original only, silently dropping the bytes from the second write. Fix: in `write()`, when `sparse_write=None && !is_dirty && xet_hash=Some && size > 0 && !overlay`, lazily install `SparseWriteState::new_with_full_staging(hash, size)`. The `!is_dirty` guard prevents mis-keying on an empty-CAS file where a prior write extended staging without installing sparse_write. A debug_assert checks `staging_is_current` since the only reachable producer of the gated state is `apply_commit(was_sparse_upload=false)`. Two regression tests cover the patch-through-truncate flow and the empty-CAS-file edge.

Max-recall review of the sparse-write PR surfaced 15 distinct bugs. Each finding gets a regression test (16 tests total, D1+E1 share one). All tests fail pre-fix and pass post-fix; 390 lib tests pass, 0 regressions. Data-loss / persistent failure: * C3: setattr's Clean-file branch fired after O_TRUNC+write with a SparseWriteState keyed to the pre-truncate hash but original_size set to the post-write local size. Capture `was_dirty` before set_dirty and gate the Clean-file branch on !was_dirty so the staging file (the source of truth in this case) is uploaded as-is via the regular path. * B5: same root cause as C3, reached via ftruncate(0)+flush, reused fh write, then setattr-extend. Same fix. * B1: setattr-shrink on a non-Xet file (xet_hash=None) uploaded zeros, silently replacing the bucket's original content. Download the original via hub_client.download_file_http before set_len. Silent corruption / staleness: * D1+E1: read()/write() did not take the per-inode staging lock, leaving TOCTOU windows between pread/pwrite and the sparse_write state update, and against range_upload's PreadReader. Hold self.staging.lock(ino) around pread+sparse_write snapshot (async lock_owned().await) and around pwrite+track_write (blocking_lock_owned() since write() is sync). Also covers E3 (apply_commit Arc swap during in-flight read) since flush_batch already holds the same per-inode lock. * C1: setattr mutated sparse_write without checking sw.original_hash == entry.xet_hash. Drop a stale sparse_write before the setattr branches. * C2: update_remote_file preserved sparse_write across hash rotations (intentional, for still-open handles) but the has_open_handles guard meant no handle was actually open by the time we reached the body. Clear sparse_write in update_remote_file when applied; read() also skips fill_sparse_holes defensively when the hash mismatches. * C4: poll Phase 2 pushed update.ino to inos_to_invalidate even when update_remote_file returned false (was_dirty || open write handle). The invalidator then closed the pooled NFS handle and the next cycle applied the update with sparse_write stale. Bind the bool return and only push if the update applied. * B6/E7: update_remote_file's has_open_handles guard froze inodes under any long-lived NFS-pool read handle. Track open_write_handles separately; gate update_remote_file on is_dirty || has_write_handles. Incorrect mtime / errno: * B8: setattr(size=N) where N == prev_size on a clean file bumped local mtime but never propagated to Hub. Treat same-size setattr on a clean inode as a full no-op, consistent with how chmod/utime already behave. * C5/E2: apply_(noop_)commit bumped mtime/ctime/last_revalidated even when clear_dirty_if returned false. Move the timestamp updates inside the clear_dirty_if success branch. * B4: errno_to_nfs had no EAGAIN arm, so transient drift surfaced as EIO. Add libc::EAGAIN => NFS3ERR_JUKEBOX. Defensive: * E4: lazy sparse_write install used debug_assert! which compiles out in release. Replace with a runtime check that skips the install and logs an error if staging_is_current is false. * E5: set_dirty used saturating_add, pinning dirty_generation at u64::MAX and letting a stale flush snapshot collide with a concurrent writer's post-race value. Switch to wrapping_add with skip-0. Findings refuted during verification (not in this commit): A3 (rename gate still present), B2/E8 (dirty-guard prevents the race), B3 (EAGAIN path consistent), D3 (RwLock provides sync edge for Relaxed atomics), D7 (per-mount random staging dir), D8 (lengths match by construction), A4 (entry.size always tracks CAS).

Three of the previous round's fixes were either YAGNI or added complexity out of proportion to the bug they addressed. Reverting them in favor of simpler invariants: * E4 (debug_assert in lazy install): the precondition is not violated by any current code path. The debug_assert serves its purpose as a dev- time check; the runtime guard + error log replaces a 1-line assertion with a 10-line branch for a "future regression" scenario that's better caught by tests. Restore the original debug_assert!. * E5 (wrapping_add for dirty_generation): collision requires u64::MAX set_dirty calls on a single inode, unreachable in practice. The saturating_add behavior is documented and exercised by an existing test. Restore. * B6/E7 simplification: instead of tracking open_write_handles separately (added an AtomicU32 + parallel bump/drop signatures across 7 call sites), just drop the has_open_handles guard in update_remote_file entirely. Every open-for-write path calls set_dirty before publishing the handle, so is_dirty already covers the in-flight-write case. Read- only handles don't need the inode snapshot to stay stable: Lazy reads use a prefetch buffer bound to the open-time hash, and LocalFd reads defensively skip fill_sparse_holes when sparse_write.original_hash != entry.xet_hash (C2 fix). The has_open_handles guard was over-broad — removing it covers B6/E7 without the extra plumbing. The B6 regression test (b6_open_readonly_handle_freezes_update_remote_file) still passes: the assertion is "update_remote_file proceeds with a clean read-only handle open", which is now true via the simpler is_dirty-only guard rather than via the open_write_handles split. Net diff: -181 lines (110 from removed E4+E5 tests, 71 from B6/E7 plumbing). 388 lib tests pass.

The previous D1+E1 fix used `tokio::sync::Mutex::blocking_lock_owned` in write() to serialize pwrite/pread/range_upload's reads. That panics with "Cannot block the current thread from within a runtime" whenever write() is called directly from an async task — which is exactly how the NFS adapter calls it (nfs.rs:294, 338). CI caught this: ---- nfs::tests::second_write_reuses_writable_handle stdout ---- panicked at src/virtual_fs/mod.rs:2481: Cannot block the current thread from within a runtime. ---- nfs::tests::write_after_read_upgrades_handle_instead_of_returning_stale ---- nfs::tests::write_without_prior_read_opens_writable_directly Switch to a `std::sync::Mutex<()>` per inode held only across non-await syscalls. Lives alongside the existing async staging Mutex (which still serializes the long-held open/setattr/unlink/range_upload paths). * StagingCoordinator gains `io_locks` and `io_lock(ino)`. * write() locks across pwrite + track_write + entry.size update. * read() locks across sparse_write snapshot + pread, then drops the lock before awaiting fill_sparse_holes (the snapshot Arc captured under the lock keeps the read consistent). * PreadReader takes the lock per `read_at` call, so concurrent pwrites don't interleave with xet-core's streaming reads (E1). * XetOps::range_upload signature gains `io_lock: Arc<Mutex<()>>`, threaded from flush_batch via `staging.io_lock(item.ino)`. Tests: 401 pass with `--features nfs` (matches CI parity); the 3 NFS tests that panicked previously now pass. The d1_e1 structural test was updated to look for `staging.io_lock(` instead of `staging.lock(`.

The sparse-write path exposes a class of lifecycle edge cases (poll drift vs writable handles, NFS handle pool reuse, multi-handle reads against in-flight writers) that this PR's earlier rounds keep discovering. Ship it in beta: off by default so production gets the pre-PR download-then-write behavior, opt-in via `--sparse-writes` for testers who want the perf gain on large-file/small-edit workloads. New CLI flag `--sparse-writes` (off by default). Implies `--advanced-writes`; warns and is ignored if set without it. Code-path gating: * `open_advanced_write`: when sparse_writes=false and the inode has a Xet hash + size > 0, restore the pre-feature behavior of downloading the full CAS object into staging before returning the handle. The drift retry, sparse_write install, and "reused staging full-original shim" are all skipped when sparse_writes=false. * `setattr(size=N)`: when sparse_writes=false on a clean Xet file with no existing staging, download the CAS object first (so set_len(N) doesn't truncate an empty file to N zero bytes and the regular upload doesn't replace the remote with zeros — the same B1 hazard but for Xet files, not just bucket files). * `write()` lazy sparse_write install: gated on sparse_writes=true. Without this, a non-sparse fh that reaches the post-flush state (sparse_write=None, !is_dirty, xet_hash=Some, size>0) would lazy- install a SparseWriteState and corrupt the next flush. * `setattr` Clean-file branch: gated on sparse_writes=true. Same rationale as the lazy install gate. Tests: `TestOpts::sparse_writes` defaults to true so the 16 sparse regression tests keep running without modification. Tests targeting the non-sparse fallback can opt out via `TestOpts { sparse_writes: false, .. }`. 401 lib tests pass with `--features nfs`.

* P1 — update_remote_file freezes hash rotation while a WRITABLE handle is open. Re-introduce open_write_handles (separate from open_handles). A clean writable handle still relies on entry.sparse_write matching entry.xet_hash for its next write (lazy install + range_upload composition); poll-clearing it would corrupt that write. Read-only handles are exempt — Lazy prefetch is bound to open-time hash, LocalFd reads gate fill_sparse_holes on the hash match (C2 fix). So long-lived NFS-pool read handles still don't freeze metadata refreshes (the B6/E7 outcome). * P2 — rename of a dirty file pushes old_path to pending_deletes unconditionally, not only when xet_hash.is_some(). Even a brand-new dirty file (xet_hash=None) can have an in-flight flush that snapshotted with the old path and publishes AddFile{old_path} before the rename's re-enqueue runs. Without the delete, the re-enqueued flush adds the new path and the old one leaks. Hub tolerates DeleteFile on a non-existent path as a no-op. * P3 — same-size setattr on a clean file now skips only the SIZE work (no mtime bump, no flush), not the whole function. Falls through to the metadata-only block below so a SETATTR carrying size=current_size + mode/uid/atime/mtime still applies those — NFS clients that batch all attrs into one RPC otherwise saw their other changes silently dropped. Tests: 401 pass with --features nfs.

Three tests and one NFS errno mapping ported from feat/append-write (the predecessor branch). These pin invariants that the rewrite preserves but that weren't previously exercised end-to-end: - sparse_post_flush_read_returns_cas_bytes_not_zeros: after apply_commit_sparse the inode keeps its sparse_write (rekeyed to the new hash) with cleared dirty_ranges. A read of an unmodified region must fill holes from the just-committed CAS file, not return zeros from the staging holes. Pre-fix on PR #41 this returned '00XX000000' instead of '01XX456789'. - sparse_fill_short_cas_stream_returns_eio_not_panic: when a CAS download yields zero bytes for a non-empty hole range, fill_sparse_holes must surface EIO rather than panicking on an OOB copy_from_slice. The code already has the data.len() < span check; this test pins it. - sparse_noop_flush_advances_mtime: apply_noop_commit calls touch_commit_clocks so mtime advances even when the upload returns the unchanged hash. Required for POSIX-y fsync semantics: a successful fsync after an open-for-write should reflect modification time, not freeze it because the bytes happened to dedup. - nfs: errno_to_nfs maps libc::EAGAIN -> NFS3ERR_JUKEBOX defensively. EAGAIN is now the sparse-write drift sentinel (returned from open_advanced_write when poll rotates xet_hash between snapshot and install). The retry loop in open() normally upgrades it to EIO so userspace never sees it, but mapping it directly avoids degrading a future leak to NFS3ERR_IO on NFS exports. Tests not backported (already covered or not applicable to the rewrite): - 25 helper-level unit tests (sparse_merge_range_*, sparse_clip_*, sparse_two_*, sparse_engulf_*, etc.) duplicate what src/virtual_fs/inode.rs::tests already covers structurally. - b1/b5/b6/b8/c1/c2/c3/c4 numbered tests target specific bugs of the old approach (setattr loses bytes, stale sparse_write rollbacks). The rewrite addresses them structurally; equivalent regression coverage exists via sparse_setattr_after_hash_drift_uses_current_revision, sparse_otrunc_then_write_then_reopen_keeps_user_data, and the same-inode concurrent stress / verified-integrity tests. - D1/E1/E3 source-shape tests grep mod.rs for specific code patterns (e.g. 'self.staging.io_lock(' in read() before the pread). Equivalent invariants in the new code shape (which uses io_lock + dirty_generation guard slightly differently) would need full rewrite to be useful, and the behavioural integrity stress already catches the bugs these were guarding.

XciD mentioned this pull request Mar 16, 2026

feat: range aware file write huggingface/xet-core#717

Merged

XciD force-pushed the feat/append-write branch from 67b8d38 to 6a35987 Compare March 20, 2026 15:25

XciD changed the title ~~feat: append-aware writes with CDC-correct file composition~~ feat: sparse writes with range_upload, zero-download write path Mar 20, 2026

XciD marked this pull request as ready for review March 20, 2026 15:57

XciD force-pushed the main branch from 700989c to 5244d3a Compare March 31, 2026 20:47

XciD force-pushed the feat/append-write branch from 1ff47f5 to 0c1ce14 Compare May 4, 2026 20:39

XciD closed this May 4, 2026

XciD reopened this May 4, 2026

XciD force-pushed the feat/append-write branch 2 times, most recently from 51179ce to ebbc866 Compare May 4, 2026 20:45

XciD force-pushed the feat/append-write branch from 21a0a55 to 2665d27 Compare May 6, 2026 16:48

XciD force-pushed the feat/append-write branch from 212dd47 to dc8dd46 Compare May 21, 2026 21:02

XciD force-pushed the feat/append-write branch from 378ee72 to 5de5ed8 Compare May 22, 2026 00:09

This was referenced May 22, 2026

fix(nfs): stable filehandles across server restarts #177

Closed

fix(nfs): upgrade read-only handle on WRITE instead of returning STALE #178

Merged

XciD added 5 commits May 22, 2026 20:10

ci: trigger

5f1910a

fixup! feat: sparse writes with range_upload, zero-download write path

e5b818e

XciD added 16 commits May 22, 2026 20:12

style: cargo +nightly fmt

f5f5708

XciD force-pushed the feat/append-write branch from ccbce62 to 06dbe6e Compare May 22, 2026 18:13

XciD added 11 commits May 22, 2026 21:13

ci: trigger

d6c7408

chore: remove TODO.md

d1cf853

XciD mentioned this pull request Jun 12, 2026

feat(vfs): sparse writes v3, session-scoped coverage staging #187

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: sparse writes with range_upload, zero-download write path#41

feat: sparse writes with range_upload, zero-download write path#41
XciD wants to merge 36 commits into
mainfrom
feat/append-write

XciD commented Mar 16, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Mar 16, 2026

Uh oh!

github-actions Bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

XciD commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it works

Key components

Tests

Dependencies

Uh oh!

github-actions Bot commented Mar 16, 2026

POSIX Compliance (pjdfstest)

Uh oh!

github-actions Bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

XciD commented Mar 16, 2026 •

edited

Loading

github-actions Bot commented Mar 16, 2026 •

edited

Loading