Skip to content

feat: sparse writes with range_upload, zero-download write path#41

Open
XciD wants to merge 36 commits into
mainfrom
feat/append-write
Open

feat: sparse writes with range_upload, zero-download write path#41
XciD wants to merge 36 commits into
mainfrom
feat/append-write

Conversation

@XciD

@XciD XciD commented Mar 16, 2026

Copy link
Copy Markdown
Member

Summary

Replaces the full-download write path with sparse staging files and dirty range tracking. When opening an existing file for write, we create a sparse staging file (set_len()) instead of downloading CAS content. Only the dirty byte ranges are tracked and uploaded.

How it works

  • Open for write: create sparse staging file at original size, no CAS download
  • Write: pwrite() to staging file, SparseWriteState::track_write() records dirty ranges (O(log n) merge with binary search)
  • Read: pread() from staging, fill_sparse_holes() downloads CAS data on demand for non-dirty regions
  • Truncate: clip_to_size() trims dirty ranges on shrink, track_write() marks extension as dirty on grow
  • Flush: range_upload composes a new CAS file from stable prefix/suffix + re-chunked dirty regions via upload_ranges() in xet-core

For a 200MB file with a 1KB mid-file edit, this downloads 0 bytes on open and uploads only the dirty chunks (not 200MB).

Key components

  • SparseWriteState (inode.rs): tracks original_hash, original_size, sorted dirty_ranges vec
  • fill_sparse_holes (mod.rs): downloads CAS data for sparse holes in read buffer
  • range_upload (xet.rs): builds DirtyInput vec from dirty ranges, each with its own async reader
  • FlushEntry/FlushSuccess structs (flush.rs): replace raw tuples for flush pipeline
  • dirty_generation counter (reused): prevents a stale flush from clearing dirty state after concurrent writes/renames
  • NFS handle upgrade (nfs.rs): upgrades read-only pool handles to writable on first WRITE RPC

Tests

  • 47 new unit tests covering sparse writes, flush races, POSIX edge cases
  • Sparse write smoke tests in integration suite (mid-file write, append past EOF, CAS round-trip, multi-write accumulation, large file range_upload)
  • fsx: 50k random ops (staging) + 100 paranoid ops (CAS round-trip per mutation)
  • xfstests generic/quick suite (167+ pass)

Dependencies

  • xet-core branch adrien/combined-hf-mount (PR #717) which adds upload_ranges() API for composing files from CAS prefix/suffix + dirty regions

@github-actions

Copy link
Copy Markdown
Contributor

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

@github-actions

github-actions Bot commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    186.0 MB/s     235.6 MB/s
  Sequential re-read                2222.7 MB/s    2322.0 MB/s
  Range read (1MB@25MB)                0.4 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1480.0 MB/s
  Close latency (CAS+Hub)            0.138 s
  Write end-to-end                   291.7 MB/s
  Dedup write                       1988.4 MB/s
  Dedup close latency                0.116 s
  Dedup end-to-end                   353.1 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    766.1 MB/s     922.3 MB/s
  Sequential re-read                2217.5 MB/s    2482.6 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1824.9 MB/s
  Close latency (CAS+Hub)            0.547 s
  Write end-to-end                   304.5 MB/s
  Dedup write                       1813.2 MB/s
  Dedup close latency                0.095 s
  Dedup end-to-end                   972.3 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1348.6 MB/s    1370.3 MB/s
  Sequential re-read                2188.3 MB/s    2501.7 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1594.6 MB/s
  Close latency (CAS+Hub)            0.165 s
  Write end-to-end                  1044.0 MB/s
  Dedup write                       1573.3 MB/s
  Dedup close latency                0.135 s
  Dedup end-to-end                  1103.8 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  365.0      483.1                      
  seq-reread-100M               2702.7      420.2                      
  rand-read-4k-100M                0.1        0.1         14         19
  seq-read-5x10M                 793.7      806.5                      
  rand-read-10x1M                  0.1        0.1         32         35
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           72859.0 us   51803.8 us
  rand-read-10x1M             31347.3 us   28543.2 us
============================================================

@XciD XciD force-pushed the feat/append-write branch from 67b8d38 to 6a35987 Compare March 20, 2026 15:25
@XciD XciD changed the title feat: append-aware writes with CDC-correct file composition feat: sparse writes with range_upload, zero-download write path Mar 20, 2026
@XciD XciD marked this pull request as ready for review March 20, 2026 15:57
@XciD XciD force-pushed the feat/append-write branch from 1ff47f5 to 0c1ce14 Compare May 4, 2026 20:39
@XciD XciD closed this May 4, 2026
@XciD XciD reopened this May 4, 2026
@XciD XciD force-pushed the feat/append-write branch 2 times, most recently from 51179ce to ebbc866 Compare May 4, 2026 20:45
XciD added a commit that referenced this pull request May 5, 2026
Restore main's tests/fsx.rs (canonical xfstests fsx binary, 50k random ops)
and pull in the paranoid mode from PR #41's earlier fsx implementation as a
separate test file.

Paranoid mode does full CAS round-trip after each mutation: write/truncate,
flush + sleep, re-open + read-back, verify against in-memory reference. This
catches composition bugs in range_upload that the canonical fsx misses (since
fsx reads from the local staging file, not from CAS).

Slow (~1.5s per op for flush debounce + CAS propagation). Use FSX_PARANOID_OPS
to control iteration count (default: 100).
XciD added a commit that referenced this pull request May 5, 2026
The 7 sparse-write tests appended to run_write_tests exercise the full
range_upload + fill_sparse_holes pipeline (PR #41 + xet-core PR #717),
which requires the new /v2/file-chunk-hashes endpoint. Until that endpoint
ships, the flush retries fail in a loop ('caller said original_size=N but
reconstruction info reports M') and the 5s sleeps make the test feel stuck.

Skip the new tests by default; opt in with HF_MOUNT_SPARSE_TESTS=1 once the
endpoint is live.
XciD added a commit that referenced this pull request May 6, 2026
Restore main's tests/fsx.rs (canonical xfstests fsx binary, 50k random ops)
and pull in the paranoid mode from PR #41's earlier fsx implementation as a
separate test file.

Paranoid mode does full CAS round-trip after each mutation: write/truncate,
flush + sleep, re-open + read-back, verify against in-memory reference. This
catches composition bugs in range_upload that the canonical fsx misses (since
fsx reads from the local staging file, not from CAS).

Slow (~1.5s per op for flush debounce + CAS propagation). Use FSX_PARANOID_OPS
to control iteration count (default: 100).
@XciD XciD force-pushed the feat/append-write branch from 21a0a55 to 2665d27 Compare May 6, 2026 16:48
XciD added a commit that referenced this pull request May 6, 2026
The 7 sparse-write tests appended to run_write_tests exercise the full
range_upload + fill_sparse_holes pipeline (PR #41 + xet-core PR #717),
which requires the new /v2/file-chunk-hashes endpoint. Until that endpoint
ships, the flush retries fail in a loop ('caller said original_size=N but
reconstruction info reports M') and the 5s sleeps make the test feel stuck.

Skip the new tests by default; opt in with HF_MOUNT_SPARSE_TESTS=1 once the
endpoint is live.
XciD added a commit to huggingface/xet-core that referenced this pull request May 21, 2026
## Summary

APIs for range-aware file writes: instead of re-uploading an entire file
when only part of it changed, compose a new CAS file from stable
segments + re-chunked dirty windows. Supports resize edits (insert /
delete / arbitrary replace) in addition to in-place rewrites.

### API: `upload_ranges`

```rust
pub async fn upload_ranges(
    config: Arc<TranslatorConfig>,
    cas_client: Arc<dyn Client>,
    original_hash: MerkleHash,
    original_size: u64,
    dirty_inputs: Vec<DirtyInput>,
) -> Result<XetFileInfo>
```

```rust
/// A single edit applied to the original file: replace `original_range` with
/// `new_length` bytes from `reader`. Edits are expressed in original-file coordinates.
pub struct DirtyInput {
    pub original_range: Range<u64>,
    pub reader: Pin<Box<dyn AsyncRead + Send>>,
    pub new_length: u64,
}
```

The output file size is **derived** from the inputs (no `total_size`
parameter): `original_size - removed + added`.

### Edit shapes (all expressible with the same struct)

| Operation | `original_range` | `new_length` |
|---|---|---|
| In-place edit | `a..b` | `b - a` |
| Resize replace | `a..b` | any |
| Pure insert | `p..p` | `> 0` |
| Pure delete | `a..b` | `0` |
| Append | `original_size..original_size` | `> 0` |
| Truncate to N | `N..original_size` | `0` |
| No-op | empty `dirty_inputs` | — |

Motivating example:

```text
abc + upload_ranges([0..1), "foo", 3) = foobc
abc + upload_ranges([0..0), "foo", 3) = fooabc
abc + upload_ranges([0..1), "",    0) = bc
```

**Per-range `AsyncRead` instead of `ReadSeek` over the staging file.**
The earlier prototype took `dirty_ranges: &[(u64, u64)] + dirty_source:
&mut dyn ReadSeek`. That had a subtle bug: for truncation we silently
extended the dirty set with a boundary chunk and read those bytes from
the staging file, but if the file was never opened for write the staging
file contains zeros at those positions (real bytes are in CAS) → silent
corruption on the truncation boundary chunk. Pairing each edit with its
own reader makes that structurally impossible: any byte not provided by
the caller is fetched from CAS.

<details>
<summary>How it works</summary>

### High level

```
                           upload_ranges
   +----------------------+   |   +----------------------+
   |  original file (CAS) |---+-->|  composed file (CAS) |
   +----------------------+       +----------------------+
   only the dirty windows are re-uploaded; everything else
   is reused as whole CAS segments.
```

### Step 1 — coalesce + snap edits to segment boundaries

Edits are user-coordinates (byte ranges). We snap each edit's
`original_range` to the **enclosing CAS segments** so composition can
swap whole segments instead of truncating one mid-chunk. Adjacent /
overlapping snapped ranges are then coalesced.

Pure inserts (`start == end`) snap to the segment that owns `start`; an
insert at `original_size` snaps to the last segment.

### Step 2 — server returns windows + gap subtrees

Single CAS call: `GET /v2/file-chunk-hashes/{file_id}` with the
segment-aligned ranges in an `X-Range-Dirty: bytes=A-B,C-D` header.
Response shape (xetcas#987):

```rust
struct FileChunkHashesResponse {
    windows:      Vec<ChunkWindow>,         // one per dirty range
    hash_ranges:  Vec<Option<MerkleHashSubtree>>, // N+1 entries: [gap0, gap1, ..., gapN]
}
```

`windows[i].chunks` carries the chunk hashes the server actually owns
for that window (we re-upload these bytes). `hash_ranges[i]` is the
**MerkleHashSubtree** for the i-th unmodified gap, or `None` when there
is no gap there. This is the key to composing the final file hash
without touching unmodified bytes.

### Step 3 — for each window, stream `[CAS prefix | edits | CAS suffix]`
through a fresh cleaner

```
window = [w_start ............................................. w_end]
edits in this window:        [edit_a]    [edit_b]
                                ^           ^
streamed input to the cleaner:
  CAS bytes [w_start, edit_a.start)
  reader bytes for edit_a (new_length bytes)
  CAS bytes [edit_a.end, edit_b.start)
  reader bytes for edit_b
  CAS bytes [edit_b.end, w_end)
```

Pure inserts contribute zero original bytes but still emit `new_length`
reader bytes. Pure deletes contribute zero reader bytes. The cleaner
produces a new `MDBFileInfo` per window and a `ChunkHashList`.

### Step 4 — compose the file hash via `MerkleHashSubtree::merge`

```text
merge_seq = [gap0, w0, gap1, w1, ..., wN, gapN]   // skip None gaps

merged          = MerkleHashSubtree::merge(merge_seq)
aggregated_hash = merged.final_hash()
combined_hash   = aggregated_hash.hmac(zero)      // matches cleaner's file_hash
```

Special-case: if `total_size == 0` (e.g. truncate to empty) the result
is `MerkleHash::default()` *without* HMAC, mirroring `file_hash([])`.

### Step 5 — splice segments + register

Walk the original `MDBFileInfo.segments` and replace any segment that
falls inside a window with that window's freshly-uploaded segments.
Verification entries follow segment-for-segment when present.
`metadata_ext = None` (no SHA-256, see Limitations). Then
`register_composed_file` + `finalize`.

### Multi-window example

Two edits: replace `[50MB, 51MB)` and `[150MB, 151MB)` on a 200MB file:

```
+-----------+-------+------------+-------+-----------+
|  GAP 0    |  W0   |   GAP 1    |  W1   |  GAP 2    |
|  reused   |upload |  reused    |upload |  reused   |
| (subtree) | ~1MB  | (subtree)  | ~1MB  | (subtree) |
+-----------+-------+------------+-------+-----------+

Wire transfer: ~2MB upload + a few hundred KB of CAS reads for window
boundary chunks. Old approach: 200MB download + 200MB upload.
```

### Empty original short-circuit

When `original_size == 0` there is nothing to compose against — every
edit's `original_range` must be `0..0` (validated). We just stream the
new bytes through a fresh cleaner (`upload_fresh_file`).

</details>

### Reviewer note: `chunk_window_builder` is a re-implementation of
xetcas

`xet_client/src/cas_client/chunk_window_builder.rs` is a port of the
same window-building state machine that already lives in xetcas — it's
only used by the local / in-memory simulation clients (`local_client`,
`memory_client`) so the mock CAS server returns the same shape as the
real one in tests. **No need to re-review it as part of this PR**: it
mirrors logic already reviewed and merged in xetcas#987. A follow-up
xetcas PR will deduplicate by removing the server-side copy and pulling
this one in (or vice versa); the duplication is intentional and
temporary.

### Limitations

- **No SHA-256 metadata**: composed files have `metadata_ext = None`
since recomputing SHA-256 would require reading the full file. Only
suitable for contexts that don't require SHA-256 verification (HF
buckets, xet-native repos), not for Git LFS-backed repos.
- **Memory**: for very large files, the per-window in-memory state
(chunk hash list + composed segments) is bounded by the dirty regions,
not the whole file. The chunk-hashes response is paginated by the
server-defined window granularity.

### Tests (27)

Covering all edit shapes + edge cases. Notable:

| Test | Purpose |
|---|---|
| `test_resize_edits_abc` | The 3 motivating FUSE examples |
| `test_resize_large_replace_grows_file` | Replace `[a..b)` with much
more data |
| `test_resize_large_replace_shrinks_file` | Replace `[a..b)` with much
less data |
| `test_resize_mid_file_insert` | Pure insert in the middle |
| `test_resize_mid_file_delete` | Pure delete in the middle |
| `test_resize_multi_edit_mix` | Insert + replace + delete in one call |
| `test_resize_insert_at_segment_boundary` | Snapping correctness for
inserts |
| `test_upload_ranges_mid_file_edit` | In-place edit |
| `test_upload_ranges_truncation` | Pure truncate (sub-segment) |
| `test_upload_ranges_truncation_empty_staging` | Truncate when staging
is all-zero (boundary read from CAS) |
| `test_upload_ranges_truncation_with_overlapping_dirty` | Truncate +
dirty range overlapping the boundary |
| `test_truncate_to_empty_matches_clean_empty` | Truncating to 0 hashes
to `MerkleHash::default()` (matches a fresh empty cleaner) |
| `test_upload_ranges_append` | Pure append |
| `test_append_with_gap_before_dirty_range` | Append where reader covers
a sparse gap too |
| `test_append_sparse_staging_file` | Append on a sparse staging file |
| `test_mid_edit_plus_append` | Mid-file edit *and* append in one call
(P1 codex regression) |
| `test_empty_original_append` | `original_size == 0` + append falls
into the fresh-file path (P2 codex regression) |
| `test_empty_original_validates_ranges` | `original_size == 0` still
runs validation (reviewer regression) |
| `test_upload_ranges_at_file_start` | Edit at offset 0 (no stable
prefix) |
| `test_upload_ranges_multiple_regions` | Two non-adjacent dirty windows
with stable gap |
| `test_single_input_spanning_many_chunks` | One edit covering many CDC
chunks |
| `test_data_integrity_scenarios` | 5 sub-scenarios covering composition
correctness |
| `test_noop_returns_original_hash` | Empty `dirty_inputs` → no CAS
call, original hash returned |
| `test_rejects_dirty_range_past_total_size` | Validation: range past
`original_size` |
| `test_rejects_overlapping_dirty_ranges` | Validation: overlapping
edits |
| `test_rejects_unsorted_dirty_ranges` | Validation: unsorted edits |
| `test_upload_ranges_small_file_mid_edit` | Small files (single
segment) |

### Dependencies

- xetcas: `GET /v2/file-chunk-hashes/{file_id}` with `windows[] +
hash_ranges[]` response shape — huggingface-internal/xetcas#987
(merged).
- Consumer: huggingface/hf-mount#41.


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> High risk because it adds a new partial-upload composition path that
splices CAS segments and recomputes file hashes from window subtrees,
touching core data integrity and client/server chunk-boundary logic.
> 
> **Overview**
> Adds range-aware file writes via new `upload_ranges`, letting callers
apply insert/delete/replace edits and upload only re-chunked dirty
windows while reusing stable CAS segments.
> 
> Introduces a new CAS API `get_file_chunk_hashes` (`GET
/v2/file-chunk-hashes/{file_id}` with `X-Range-Dirty`) plus response
types (`FileChunkHashesResponse`, `ChunkWindow`) and simulation support
(`chunk_window_builder`) that extends dirty ranges to *stable* chunk
boundaries and returns gap `MerkleHashSubtree` summaries +
stable-segment verification.
> 
> Refactors dedup/cleaning plumbing to expose per-chunk hash lists
(`ChunkHashList`), adds detached cleaner/session completion and
`register_composed_file` to avoid orphan shard entries, and
moves/re-exports `next_stable_chunk_boundary` into `xet_core_structures`
for shared stable-window computations.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
2f4cee4. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Arpit Jain <arpitjain099@gmail.com>
Co-authored-by: Hoyt Koepke <hoytak@huggingface.co>
Co-authored-by: tison <wander4096@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Di Xiao <seanses@users.noreply.github.com>
Co-authored-by: Arpit Jain <3242828+arpitjain099@users.noreply.github.com>
Co-authored-by: Assaf Vayner <assaf@huggingface.co>
Co-authored-by: Rajat Arya <rajatarya@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
XciD added a commit that referenced this pull request May 21, 2026
Restore main's tests/fsx.rs (canonical xfstests fsx binary, 50k random ops)
and pull in the paranoid mode from PR #41's earlier fsx implementation as a
separate test file.

Paranoid mode does full CAS round-trip after each mutation: write/truncate,
flush + sleep, re-open + read-back, verify against in-memory reference. This
catches composition bugs in range_upload that the canonical fsx misses (since
fsx reads from the local staging file, not from CAS).

Slow (~1.5s per op for flush debounce + CAS propagation). Use FSX_PARANOID_OPS
to control iteration count (default: 100).
@XciD XciD force-pushed the feat/append-write branch from 212dd47 to dc8dd46 Compare May 21, 2026 21:02
XciD added a commit that referenced this pull request May 21, 2026
The 7 sparse-write tests appended to run_write_tests exercise the full
range_upload + fill_sparse_holes pipeline (PR #41 + xet-core PR #717),
which requires the new /v2/file-chunk-hashes endpoint. Until that endpoint
ships, the flush retries fail in a loop ('caller said original_size=N but
reconstruction info reports M') and the 5s sleeps make the test feel stuck.

Skip the new tests by default; opt in with HF_MOUNT_SPARSE_TESTS=1 once the
endpoint is live.
@XciD XciD force-pushed the feat/append-write branch from 378ee72 to 5de5ed8 Compare May 22, 2026 00:09
XciD added a commit that referenced this pull request May 22, 2026
#178)

## Summary

NFS WRITEs on macOS were silently dropping bytes when the file had been
READ first. Root cause is server-side: a read-only pooled handle from a
prior NFS READ returns `EBADF` on `virtual_fs.write`, which
`errno_to_nfs` maps to `NFS3ERR_STALE`. macOS NFS treats STALE on WRITE
as fatal and silently flushes its write buffer — `dd` reports success
but the bytes never reach the server. `fsync(2)` later returns ESTALE.

## Fix

`nfs.rs::write()`:

1. **Fast path EBADF**: handle was opened read-only. Remove the
now-stale entry from the pool (guarded by `peek == Some(fh)` so a
concurrent successful upgrader isn't clobbered), evict, fall through.
2. **Slow path**: `open(writable=true)` → `pwrite` → only THEN
`insert_handle`. The freshly-opened fh stays private to this task until
its write commits. No other task can release it via `insert_handle`'s
`replaced` eviction while we're between insert and pwrite — because we
don't insert until after the pwrite is done.

Mirrors the existing EBADF retry pattern in `read()` (nfs.rs:181-200).

## Reproducer (pre-fix)

```bash
hf-mount-nfs bucket X/y /mnt
ls /mnt                                              # triggers READ → pools Lazy handle
dd if=/dev/urandom of=/mnt/existing-file bs=1M count=1 seek=100 conv=notrunc,fsync
# dd reports success; file in bucket is unchanged.
python3 -c "import os; fd=os.open('/mnt/existing-file', os.O_RDWR); os.fsync(fd)"
# OSError: [Errno 70] Stale NFS file handle
```

Server log shows:
```
open: ino=2, writable=false   ← from the READ
read: fh=1, offset=...
write: ino=2, fh=1, offset=104857600, len=1048576   ← EBADF returned, mapped to STALE
(no further server activity; macOS dropped the write buffer)
```

## End-to-end validation (post-fix, macOS NFS mount)

Sequential read+write:
```
write: ino=4, fh=1 → EBADF
release: fh=1                ← pool entry removed
open: writable=true           ← slow path
write: ino=4, fh=2, offset=0  ← succeeds on fresh fh
```

8 concurrent dd at distinct offsets on same file (was previously
hangbait):
```
all 8 dd done in .009s
8 writes server-side on fh=2 (fast path reused after initial upgrade)
zero release intempestif, zero NFS3ERR_STALE
```

Python fsync:
```
fsync OK ✓     # before fix: [Errno 70] Stale NFS file handle
```

## Concurrent-write race (worth calling out)

An earlier draft of this PR added a per-inode `tokio::sync::Mutex`
around the slow path. Adversarial review (Codex) pointed out the
correct, lighter defense: **publish to the pool only after the first
write succeeds**. With insert_handle deferred until after pwrite, the fh
is unreachable by other tasks during its critical window — no mutex
needed, race closed by invariant.

Concretely, if two NFS WRITE RPCs on the same ino both peek a Lazy
handle and both hit EBADF:
- Both reach the slow path; `virtual_fs.open(writable=true)` is
internally serialized by VirtualFs's per-ino staging lock, so they
produce distinct fh_A and fh_B.
- Pre-fix (insert-then-write): writer A inserts fh_A → writer B inserts
fh_B, `replaced=fh_A` → evict_handle releases fh_A. Writer A then
pwrites against fh_A → EBADF → STALE. Same silent-data-loss this PR is
fixing.
- Post-fix (write-then-insert): writer A pwrites to fh_A in private (no
one else knows about it), then inserts. Writer B does the same with
fh_B. Both pwrites succeed; the loser of the insert race has its fh
released by the winner's insert, but its bytes are already in the
staging file.

## Relation to other PRs

- **PR #41 (sparse writes)** contains the same EBADF→upgrade fix in its
nfs.rs. This PR isolates the change so it can land independently. When
#41 merges, this PR becomes a no-op merge. The write-before-publish
reordering here will need to be ported into #41's nfs.rs (the same race
exists there).
- **PR #177 (stable filehandles)** was an earlier misdiagnosis
attributing the symptom to macOS client-side filehandle caching across
umount. Closed.

## Tests

- `write_after_read_upgrades_handle_instead_of_returning_stale` — main
regression test (READ then WRITE).
- `second_write_reuses_writable_handle` — fast path is reused after
upgrade.
- `write_without_prior_read_opens_writable_directly` — slow path
standalone.
- Verified by reverting just the `write()` body to main's version: tests
fail with `NFS3ERR_STALE`. Tests are load-bearing for the regression.
- `cargo test --features fuse,nfs --lib` → 342/342 pass.
- `cargo clippy --features nfs --all-targets -- -D warnings` → clean.
- End-to-end validated against `XciD/hf-mount-test` bucket on macOS NFS
as described above.
XciD added 5 commits May 22, 2026 20:10
Sparse staging: open for write creates a sparse file (set_len) instead of
downloading the original CAS content. Dirty byte ranges are tracked in
SparseWriteState and only modified regions are uploaded via range_upload.

Key changes:
- Sparse staging file on open (no CAS download)
- SparseWriteState tracks dirty ranges with O(log n) merge
- fill_sparse_holes reads CAS data on demand for read-after-write
- flush_generation counter prevents stale flush from clearing dirty state
- Rename re-enqueues dirty files for flush at new path
- setattr truncate/grow handled via clip_to_size + gap tracking
- write past original_size automatically tracks gap as dirty
- file.metadata() guard against concurrent truncate vs write race

New xet-core API (DirtyInput with AsyncRead per range):
- range_upload builds DirtyInput per dirty range from staging file
- upload_ranges handles truncation boundary from CAS directly
- No download needed for any write/truncate path

Testing:
- 245 unit tests (47 new for sparse writes, flush races, edge cases)
- fsx: 50k random ops (staging) + 100 paranoid CAS round-trip ops
- xfstests: generic/quick suite with FUSE patches (167 pass)
- pjdfstest: 8789 POSIX syscall tests
- Integration smoke tests: mid-file edit, append, truncate, multi-write,
  large file (512KB) CAS round-trip
NFS handle pool opens handles read-only for reads. When a WRITE RPC
arrives for an existing CAS file, the handle lacks a staging file and
VFS rejects with EBADF. Fix: try write with existing handle, on EBADF
evict it and reopen writable (creating sparse staging).
…/rename)

Completes the sparse-write integration on top of the previous infrastructure
commit. Without this, `entry.sparse_write` was never set and all flushes fell
through the regular `upload_files` path.

Changes:
- open_advanced_write: replace `download_to_file` with a sparse staging file
  (`File::create + set_len`). Set `entry.sparse_write` so flush uses
  `range_upload`. `materializes_remote` excludes sparse so `staging_is_current`
  is not flagged.
- open_streaming_write: clear `entry.sparse_write` on the new-stream path.
- ReadTarget::LocalFd: carry `ino` so the read path can look up `sparse_write`.
- read: new `fill_sparse_holes` overlays original CAS bytes onto pread results
  for regions in `[0, original_size)` outside `dirty_ranges`. The reconstruction
  cache amortizes repeated CAS fetches; we don't backfill the staging file
  (would require a separate fetched-ranges tracker since reusing dirty_ranges
  would make `range_upload` re-upload unmodified data).
- write: track dirty range via `Arc::make_mut(sw).track_write`. Clean→dirty
  transitions (NFS handle upgrade) lazily set up `SparseWriteState`. Guard
  `entry.size` with `file.metadata().len()` to avoid concurrent setattr races.
- setattr (truncate): drop CAS download, rely on `fill_sparse_holes` for reads
  and `range_upload` composition for flush. On shrink, `clip_to_size` trims
  dirty ranges; on grow, `track_write` extends to mark the zero gap dirty.
- rename_apply_local: returns `(replaced_staging_ino, dirty_inos)`. Bump
  dirty_generation on dirty file rename (and dirty descendants of a dir
  rename) so an in-flight flush can't clear dirty state with the stale
  snapshot. Caller re-enqueues the inodes for flush.

Drop the redundant `flush_generation` field from `InodeEntry`: main's
`dirty_generation` already provides the same race protection (see
`InodeEntry::clear_dirty_if`). Add `sparse_write = None` to `apply_commit`
so a successful flush clears the sparse state alongside dirty.

Tests: 8 new sparse-write integration tests on top of the 22 SparseWriteState
unit tests.

The xet-core endpoint for `range_upload` (PR #717) is not yet deployed, so
runtime validation is deferred. Local builds: 355 unit tests pass, clippy
clean with -D warnings.
XciD added 16 commits May 22, 2026 20:12
open_advanced_write installed a SparseWriteState whenever the file
was clean and had a CAS hash, even when can_reuse_staging was true
and the staging file already held the full content. Reads would
then trigger fill_sparse_holes to re-download CAS bytes already on
disk on every read.

Track whether this call actually created a fresh sparse staging
file and only install sparse_write in that case.
After a sparse range_upload, the on-disk staging file still holds
dirty patches over a sparse hole — it does NOT match the new CAS
file. apply_commit was unconditionally clearing sparse_write and
setting staging_is_current = true, so:

- reads through the still-open handle would skip fill_sparse_holes
  and return zeros for the untouched (sparse) regions
- the next open-for-write reused the stale staging as a clean cache
  and could re-upload zeros over unchanged ranges

Pass a was_sparse_upload flag through apply_commit. On sparse
commits, re-key sparse_write to the new hash with empty dirty
ranges and leave staging_is_current = false, so reads on the open
handle compose against the new CAS file and the next open rebuilds
a fresh sparse staging.

Caught by codex review.
Asserts that reading through the still-open handle after a sparse
range_upload flush returns composed CAS bytes (not staging zeros)
and that the inode state matches the post-fix invariants
(staging_is_current=false, sparse_write re-keyed to new hash).

Verified failure on the pre-fix HEAD (~32a0f1f^): test panics with
'staging must not be flagged current after a sparse flush'.
Codex follow-up review caught a race: open_advanced_write snapshots
(xet_hash, size) up front, creates a sparse staging file of `size`
zero bytes, then before installing sparse_write also required
`entry.xet_hash == xet_hash`. If poll_remote_changes updated the
inode in that window the check silently failed, leaving the file
marked dirty with no sparse metadata. Flush would then treat it as a
regular full upload and commit the zeros from the sparse hole.

Drop the equality check and install sparse_write unconditionally
when we created sparse staging. The user opened against the snapshot
content and that's what range_upload composes against — any newer
remote revision is reconciled through Hub commit semantics, not by
silently overwriting bytes the user never touched.

Includes a regression test that drives the race via the per-inode
staging mutex; verified to panic on the previous HEAD (afa538e^):
'sparse_write must be installed even when inode hash drifts mid-open'.
Codex follow-up review: if poll_remote_changes updates entry.xet_hash
between the open snapshot and flush, and the user releases without
writing, range_upload returns the snapshot hash (no-op). The
previous unchanged-path called apply_commit, which rewrote
entry.xet_hash back to the snapshot — silently rolling the inode
view back over the newer remote revision picked up by the poller.

Add apply_noop_commit: clears dirty + sparse_write + pending_deletes
and stamps last_revalidated, but leaves xet_hash/size alone so a
concurrent poll update survives. flush_batch routes the unchanged
slot through apply_noop_commit instead of apply_commit.

Also extend the unchanged detection to compare range_upload's output
against sparse_write.original_hash (the snapshot), not against
prev_xet_hash (which may have drifted). This is what makes the
no-op path fire in the drift case at all.

Mirrored the real range_upload no-op short-circuit (dirty_ranges
empty + size unchanged returns original hash) in MockXet so the
test harness exercises the same path.

Regression test sparse_open_close_no_writes_does_not_rollback_drifted_hash
drives the race via the staging mutex and asserts:
 1. no Hub batch op fires for a no-write sparse open
 2. the drifted xet_hash on the inode is preserved

Verified the test panics on the previous HEAD (4967ad3) with
'no Hub batch op should fire when a sparse open had no writes'.
Codex follow-up: handles both hash and size drift before set_dirty.

Window of vulnerability: open_advanced_write captures the snapshot in
open(), takes the per-inode staging mutex, then creates a sparse
staging file sized to the snapshot. During that window
poll_remote_changes can mutate entry.xet_hash AND entry.size to a
newer remote revision (it skips dirty inodes via update_remote_file's
early-return, so the window closes once we set_dirty).

With size drift the previous fixes weren't enough: even with
sparse_write installed against the snapshot, on no-write release
range_upload sees file_size (= drifted entry.size) != original_size,
adds a synthetic truncate, returns a new hash, and the flush
commits that synthesized old-content-truncated file over the newer
remote revision.

Detect drift at the install branch (under the inode write lock, BEFORE
set_dirty) and bail with EAGAIN. The orphan sparse staging is
harmlessly overwritten on the next open (staging_is_current was
already cleared, and the inode is not dirty since we never reached
set_dirty).

Regression test sparse_open_returns_eagain_on_inode_drift_mid_open
drives the staging-mutex race with hash + size drift and asserts
EAGAIN + preserved drifted state. The previous no-rollback test is
retargeted to the simpler no-drift no-writes path
(sparse_open_no_writes_no_op_flush).
…g EAGAIN

open_advanced_write returns EAGAIN when it detects xet_hash/size drift
between the snapshot and the install branch. Previously the EAGAIN
bubbled up to userspace, which is correct but ugly UX — userland apps
generally don't retry on EAGAIN from open(2).

Wrap the inner call in a bounded retry loop (3 attempts) at the outer
open() entry point. On EAGAIN we re-read the file entry (now updated
by poll_remote_changes) and call open_advanced_write again with the
fresh snapshot. The race window is tiny (it only exists between the
snapshot read and the install branch under the per-inode staging
mutex), so 3 attempts is overwhelmingly enough — and the
unreachable!() panic catches a regression if drift ever becomes
pathological.

Test sparse_open_retries_on_inode_drift_mid_open (renamed from
returns_eagain) now asserts the end-to-end invariant: user sees a
successful open, sparse_write reflects the DRIFTED state (not the
stale first-attempt snapshot), and the inode is properly dirty.
…k, write+setattr race

Three new tests closing gaps spotted in the manual review:

1. sparse_truncate_shrink_then_flush_drops_tail — drives the synthetic
   truncate-DirtyInput path in range_upload (xet.rs:228-238): write a
   dirty patch inside the eventual range, setattr-truncate, fsync,
   verify the composed CAS file is the truncated prefix + dirty patch
   and the tail past the new boundary is gone.

2. sparse_pure_truncate_shrink_then_flush_drops_tail — same path with
   no writes, just truncate + flush, ensuring the synthetic delete
   still fires when dirty_inputs is otherwise empty.

3. write_lazy_installs_sparse_write_on_clean_inode_transition — exercises
   the defensive clean->dirty branch in write() (mod.rs:2392-2400) by
   force-clearing dirty/sparse_write after open and asserting the next
   write installs a fresh SparseWriteState pinned to entry.xet_hash /
   entry.size with the write range tracked. Through the public API this
   branch is unreachable today (open_advanced_write set_dirty's the
   inode first), but the test locks the documented behavior in case a
   future code path lands a write on a clean inode.

4. write_setattr_concurrent_keeps_size_consistent_with_staging — stress
   test for the guard at mod.rs:2384-2385 (effective_end = new_end.min
   (actual_size)). Two tasks race for 200 iterations each: one writes
   to advancing offsets, the other truncates to varying sizes. After
   the storm the invariant entry.size <= staging length must hold,
   which the guard is exactly there to preserve. Tight race window
   between pwrite and the metadata read makes deterministic
   reproduction infeasible, so the test exercises the path under
   contention and checks the post-condition.

369/369 tests + clippy clean.
…gging

Style cleanup from the review pass:

- Extract find_regular_run_end(&items, start, max) to replace the
  doubly-shadowed chunk_end / take_while().last() chain. Easier to
  read and easier to reason about (the helper documents that start
  must be non-sparse and that sparse items are dispatched separately).

- Move the error! log inside abort_batch so both call sites get the
  same structured log line instead of one-off log + abort. The range
  upload error message now includes ino/path in the abort message
  itself for parity with the regular-upload path.
Code review follow-up addressing 5 nits:

1. write() recorded the dirty range with the unclamped `written` even
   when the post-pwrite metadata read showed the staging file had been
   shrunk by a concurrent setattr(truncate). At flush, range_upload
   would seek to the range start and read end - start bytes, short-
   reading on the smaller staging. Now use `effective_end.saturating_sub
   (offset)` so the tracked range matches what's actually on disk.

2. trim_dirty_ranges now drops ranges where the clipped end equals
   start. track_write doesn't produce zero-length ranges so this is
   defense, but keeping the (s < e) invariant downstream is cheap and
   avoids consumers having to think about it.

3. Comment on fill_sparse_holes: the CAS download grabs the whole
   [offset, orig_end) even when most of it overlaps dirty bytes. Worth
   a follow-up only if profiling shows it.

4. Comment on apply_commit's generation-mismatch path: the CAS upload
   and Hub commit fire even when local metadata can't be updated, and
   the next flush will re-upload. CAS dedup handles duplicate content;
   noting the sparse path makes redundant flushes more visible.

5. Replace the unreachable!() at the end of the retry loop in open()
   with an explicit `Err(libc::EAGAIN)`. Same observable behavior,
   cleaner control flow that the compiler can prove total.

Plus a regression test (write_tracked_range_clamped_to_staging_after_
concurrent_shrink) that pwrites then setattr-shrinks past the dirty
range and asserts the tracked dirty_ranges stay within entry.size.
…EIO)

All three reproduced first as failing tests, then fixed:

1. fill_sparse_holes panicked on a short CAS download stream — the
   copy_from_slice calls indexed into cas_data assuming the full
   [offset, orig_end) range had arrived. If the stream ended early
   (transient server truncation, bad reconstruction), the slice index
   went OOB and crashed the FUSE/NFS read hot path. Now bounds-check
   cas_data.len() against the expected count and surface EIO. Repro:
   repro_fill_sparse_holes_panics_on_short_cas_stream — drives a
   short stream via the MockXet.empty_range_downloads hook.

2. apply_noop_commit cleared dirty + sparse_write but skipped the
   mtime/ctime bump that apply_commit performs unconditionally. Build
   systems, rsync-style sync tools, and any caller polling mtime
   missed the dirty cycle on a no-op flush. Now refreshes mtime/ctime
   on every call (whether or not the generation matched). Repro:
   repro_noop_flush_does_not_bump_mtime — open + fsync + release
   round-trip and asserts mtime advances.

3. abort_batch flagged every inode in to_flush with the failure
   message, including items whose CAS upload had already succeeded in
   a prior chunk of the same batch (their xorbs are in CAS, only the
   Hub commit didn't fire). Users saw spurious EIO on files whose
   bytes are fine and will be retried naturally on the next flush.
   abort_batch now takes a slice — sparse failures mark only the
   failing item; regular chunk failures mark only that chunk. Repro:
   repro_abort_batch_marks_already_uploaded_items — mixes a regular
   create+write with a sparse open+write, arms range_upload to fail,
   and asserts the regular inode has no flush error.
Previously each dirty range opened its own TokioFile, scaling FD usage with
dirty_ranges.len() (N opens of the same path per flush). Introduce PreadReader,
an AsyncRead over Arc<std::fs::File> using pread(2), so independent per-range
positions are tracked without contending on a shared cursor. One open per
range_upload regardless of fragment count.
Two related correctness issues uncovered by code review:

1. TOCTOU on staging length in write(). The pwrite path was reading
   file.metadata().len() outside the inode write lock, then updating
   entry.size under the lock. A concurrent setattr(truncate) running
   between metadata() and lock acquisition could leave entry.size
   exceeding the actual staging length, causing range_upload to
   short-read at flush time. Read the length under the lock so it is
   serialized with the setattr-side truncate.

2. Unsafe clean->dirty fallback in write(). The fallback installed
   sparse_write keyed to entry.xet_hash at write time, with no drift
   check. If poll_remote_changes had updated entry.xet_hash between the
   prior staging materialization and the first write, sparse_write
   would be keyed to a hash whose CAS object no longer matches the
   on-disk staging. range_upload would then compose a corrupted file
   (new prefix/suffix + stale dirty patches).

   Move the install into open_advanced_write under the existing drift
   guard, covering both freshly-sparse and reused-staging paths. Skip
   the install when is_dirty (staging holds in-flight modifications
   that no longer correspond to entry.xet_hash). The lazy fallback is
   now dead and removed; its dedicated regression test is removed too
   (the install-at-open path is covered by the broader sparse-write
   suite).

Drive-by: MockStreamingWriter::finish_boxed now registers uploaded
bytes in MockXet's file map, mirroring production CAS so post-streaming
reads through fill_sparse_holes succeed (caught by the
open_advanced_write_reuse_dirty_staging test under the new install
path).
The prior commit installed sparse_write on the can_reuse_staging path
to enable range_upload at flush time. That regressed reads: a writable
handle that hadn't written anything yet would still call
fill_sparse_holes, which downloaded CAS bytes for the whole read region
even though the staging file already held the full original content.

Add a `staging_holds_full_original` flag to SparseWriteState, set via
`new_with_full_staging` on the reused-staging path only. When true,
fill_sparse_holes short-circuits — staging is the source of truth for
non-dirty bytes, and track_write keeps that invariant on subsequent
writes. Fresh sparse opens (where staging is an actual hole) leave the
flag false and continue to fill from CAS.
The paranoid variant does a full CAS round-trip after each mutation
(close, wait for flush, reopen, read-back), so it catches composition
bugs in range_upload that the canonical fsx misses — staging-only fsx
reads from the local staging file, not from CAS. ~2.5 min for the
default 100 ops, gated by HF_TOKEN like the other fsx job.
@XciD XciD force-pushed the feat/append-write branch from ccbce62 to 06dbe6e Compare May 22, 2026 18:13
XciD added 11 commits May 22, 2026 21:13
Three small fixes flagged by codex on PR #41:

* `apply_noop_commit` no longer clears `sparse_write`. A no-op flush
  means content is unchanged; any still-open handle's reads need the
  sparse state to fill staging holes from CAS. Clearing it would make
  subsequent reads return zeros from the sparse staging file.

* `update_remote_file` (poll path) now also clears `sparse_write`
  alongside `staging_is_current`. If we keep the old sparse state past
  a remote update, the next open-for-write reuses it and `range_upload`
  composes the new content against the stale hash. Updated test
  expectation accordingly.

* `flush_batch` now re-enqueues dirty siblings after a partial-batch
  abort. Pre-fix, the surviving items stayed dirty in the inode table
  but had no signal queued for them — the bytes sat on disk
  indefinitely until the next external write triggered a flush. The
  re-enqueue uses a self-referencing sender clone passed into
  `flush_loop`; to avoid that clone keeping the channel open across
  shutdown (which would deadlock `FlushManager::shutdown` waiting on
  the join handle), introduce a `FlushSignal::Shutdown` variant that
  tells `flush_loop` to drop its clone before draining and exiting.

Drive-by: removed a duplicate blank line in `cached_xet_client.rs`
tests left over from the rebase.
Previous attempt to fix the three P2 findings introduced two symmetric
bugs: clearing sparse_write in update_remote_file breaks reads on a
still-open handle (poll arrives between flush and release), and
keeping the old dirty_ranges across apply_noop_commit causes
zeros-from-holes after release + reopen (idempotent write case).

Reworked lifecycle:

* `update_remote_file` no longer touches `sparse_write`. Open handles
  keep their snapshot for `fill_sparse_holes`; the next open will
  refresh if needed.

* `open_advanced_write` now detects stale `sparse_write` (its
  `original_hash` not matching the current snapshot xet_hash) and
  re-installs against the snapshot. Covers the "poll updated hash
  while sparse_write was leftover from a prior open" case without
  invalidating live handles.

* `apply_noop_commit` keeps `sparse_write` but clears its
  `dirty_ranges`. The no-op upload means the staging matches
  `original_hash` — nothing is dirty wrt it. Leaving stale ranges
  would mark zero-positions as "covered by dirty" on a subsequent
  reopen-with-fresh-staging, suppressing the CAS fill.

The per-handle sparse_write refactor we briefly considered doesn't
work cleanly because staging is shared per-inode: two open handles
writing to the same file would each track separate dirty_ranges over
the same on-disk bytes, and flush would need to merge them. The
gated-lifecycle approach above gets the same correctness without that
complexity.
Three coordinated changes that, together with the previous round,
cover all four scenarios codex round-2 review identified plus a
closely-related fourth I caught while validating.

1. update_remote_file (inode.rs): extend the existing `is_dirty`
   guard to also skip when `has_open_handles(ino)` is true. Poll
   no longer changes xet_hash/size out from under a live handle,
   so any sparse_write keyed to the open snapshot stays consistent
   for the full open lifetime. This is the structural fix for the
   concurrent-opens-with-poll class of races (Codex P2-a, P2-c).
   Trade-off: a long-running handle won't see remote updates until
   released — acceptable, matches POSIX-ish "open is a snapshot".

2. apply_noop_commit (inode.rs): also clear
   `staging_holds_full_original` (not just dirty_ranges). After a
   no-op flush the on-disk staging may be GC'd or recreated as a
   hole later; leaving the flag true would make fill_sparse_holes
   short-circuit and return zeros. Pessimistic but correct.

3. open_advanced_write (mod.rs): when staging is recreated fresh
   (`!can_reuse_staging`) but we kept an existing sparse_write
   (hash matched the snapshot), force its
   `staging_holds_full_original` to false. Otherwise the leftover
   true flag from a prior `new_with_full_staging` install would
   mislead fill_sparse_holes against the just-created sparse hole.

All 387 unit tests pass. Behaviorally:
- Codex P2-a (concurrent opens) closed by #1 (poll can't fire mid-open).
- Codex P2-b (no-op + GC + reopen → zeros) closed by #2 and #3.
- Codex P2-c (setattr stale state) closed by #1.
…ar flush

NFSv3 has no CLOSE RPC, so the server-side handle pool keeps a writable fh
alive across logical opens. After a flush that clears `sparse_write`
(regular-upload commit), the next logical open through the reused fh skips
`open_advanced_write`, and the write path runs with `sparse_write=None`
even though the inode is CAS-backed. Without recording a dirty range, a
subsequent `setattr(size)` would create a fresh empty SparseWriteState and
`range_upload` would compose the new file from the CAS original only,
silently dropping the bytes from the second write.

Fix: in `write()`, when `sparse_write=None && !is_dirty && xet_hash=Some
&& size > 0 && !overlay`, lazily install
`SparseWriteState::new_with_full_staging(hash, size)`. The `!is_dirty`
guard prevents mis-keying on an empty-CAS file where a prior write
extended staging without installing sparse_write. A debug_assert checks
`staging_is_current` since the only reachable producer of the gated state
is `apply_commit(was_sparse_upload=false)`.

Two regression tests cover the patch-through-truncate flow and the
empty-CAS-file edge.
Max-recall review of the sparse-write PR surfaced 15 distinct bugs. Each
finding gets a regression test (16 tests total, D1+E1 share one). All
tests fail pre-fix and pass post-fix; 390 lib tests pass, 0 regressions.

Data-loss / persistent failure:
* C3: setattr's Clean-file branch fired after O_TRUNC+write with a
  SparseWriteState keyed to the pre-truncate hash but original_size set
  to the post-write local size. Capture `was_dirty` before set_dirty and
  gate the Clean-file branch on !was_dirty so the staging file (the
  source of truth in this case) is uploaded as-is via the regular path.
* B5: same root cause as C3, reached via ftruncate(0)+flush, reused fh
  write, then setattr-extend. Same fix.
* B1: setattr-shrink on a non-Xet file (xet_hash=None) uploaded zeros,
  silently replacing the bucket's original content. Download the
  original via hub_client.download_file_http before set_len.

Silent corruption / staleness:
* D1+E1: read()/write() did not take the per-inode staging lock, leaving
  TOCTOU windows between pread/pwrite and the sparse_write state update,
  and against range_upload's PreadReader. Hold self.staging.lock(ino)
  around pread+sparse_write snapshot (async lock_owned().await) and
  around pwrite+track_write (blocking_lock_owned() since write() is
  sync). Also covers E3 (apply_commit Arc swap during in-flight read)
  since flush_batch already holds the same per-inode lock.
* C1: setattr mutated sparse_write without checking sw.original_hash ==
  entry.xet_hash. Drop a stale sparse_write before the setattr branches.
* C2: update_remote_file preserved sparse_write across hash rotations
  (intentional, for still-open handles) but the has_open_handles guard
  meant no handle was actually open by the time we reached the body.
  Clear sparse_write in update_remote_file when applied; read() also
  skips fill_sparse_holes defensively when the hash mismatches.
* C4: poll Phase 2 pushed update.ino to inos_to_invalidate even when
  update_remote_file returned false (was_dirty || open write handle).
  The invalidator then closed the pooled NFS handle and the next cycle
  applied the update with sparse_write stale. Bind the bool return and
  only push if the update applied.
* B6/E7: update_remote_file's has_open_handles guard froze inodes under
  any long-lived NFS-pool read handle. Track open_write_handles
  separately; gate update_remote_file on is_dirty || has_write_handles.

Incorrect mtime / errno:
* B8: setattr(size=N) where N == prev_size on a clean file bumped local
  mtime but never propagated to Hub. Treat same-size setattr on a clean
  inode as a full no-op, consistent with how chmod/utime already behave.
* C5/E2: apply_(noop_)commit bumped mtime/ctime/last_revalidated even
  when clear_dirty_if returned false. Move the timestamp updates inside
  the clear_dirty_if success branch.
* B4: errno_to_nfs had no EAGAIN arm, so transient drift surfaced as
  EIO. Add libc::EAGAIN => NFS3ERR_JUKEBOX.

Defensive:
* E4: lazy sparse_write install used debug_assert! which compiles out
  in release. Replace with a runtime check that skips the install and
  logs an error if staging_is_current is false.
* E5: set_dirty used saturating_add, pinning dirty_generation at
  u64::MAX and letting a stale flush snapshot collide with a concurrent
  writer's post-race value. Switch to wrapping_add with skip-0.

Findings refuted during verification (not in this commit): A3
(rename gate still present), B2/E8 (dirty-guard prevents the race),
B3 (EAGAIN path consistent), D3 (RwLock provides sync edge for
Relaxed atomics), D7 (per-mount random staging dir), D8 (lengths
match by construction), A4 (entry.size always tracks CAS).
Three of the previous round's fixes were either YAGNI or added complexity
out of proportion to the bug they addressed. Reverting them in favor of
simpler invariants:

* E4 (debug_assert in lazy install): the precondition is not violated by
  any current code path. The debug_assert serves its purpose as a dev-
  time check; the runtime guard + error log replaces a 1-line assertion
  with a 10-line branch for a "future regression" scenario that's better
  caught by tests. Restore the original debug_assert!.
* E5 (wrapping_add for dirty_generation): collision requires u64::MAX
  set_dirty calls on a single inode, unreachable in practice. The
  saturating_add behavior is documented and exercised by an existing
  test. Restore.
* B6/E7 simplification: instead of tracking open_write_handles separately
  (added an AtomicU32 + parallel bump/drop signatures across 7 call
  sites), just drop the has_open_handles guard in update_remote_file
  entirely. Every open-for-write path calls set_dirty before publishing
  the handle, so is_dirty already covers the in-flight-write case. Read-
  only handles don't need the inode snapshot to stay stable: Lazy reads
  use a prefetch buffer bound to the open-time hash, and LocalFd reads
  defensively skip fill_sparse_holes when sparse_write.original_hash !=
  entry.xet_hash (C2 fix). The has_open_handles guard was over-broad —
  removing it covers B6/E7 without the extra plumbing.

The B6 regression test (b6_open_readonly_handle_freezes_update_remote_file)
still passes: the assertion is "update_remote_file proceeds with a clean
read-only handle open", which is now true via the simpler is_dirty-only
guard rather than via the open_write_handles split.

Net diff: -181 lines (110 from removed E4+E5 tests, 71 from B6/E7
plumbing). 388 lib tests pass.
The previous D1+E1 fix used `tokio::sync::Mutex::blocking_lock_owned` in
write() to serialize pwrite/pread/range_upload's reads. That panics with
"Cannot block the current thread from within a runtime" whenever write()
is called directly from an async task — which is exactly how the NFS
adapter calls it (nfs.rs:294, 338). CI caught this:

  ---- nfs::tests::second_write_reuses_writable_handle stdout ----
  panicked at src/virtual_fs/mod.rs:2481: Cannot block the current thread
  from within a runtime.
  ---- nfs::tests::write_after_read_upgrades_handle_instead_of_returning_stale
  ---- nfs::tests::write_without_prior_read_opens_writable_directly

Switch to a `std::sync::Mutex<()>` per inode held only across non-await
syscalls. Lives alongside the existing async staging Mutex (which still
serializes the long-held open/setattr/unlink/range_upload paths).

* StagingCoordinator gains `io_locks` and `io_lock(ino)`.
* write() locks across pwrite + track_write + entry.size update.
* read() locks across sparse_write snapshot + pread, then drops the
  lock before awaiting fill_sparse_holes (the snapshot Arc captured
  under the lock keeps the read consistent).
* PreadReader takes the lock per `read_at` call, so concurrent pwrites
  don't interleave with xet-core's streaming reads (E1).
* XetOps::range_upload signature gains `io_lock: Arc<Mutex<()>>`,
  threaded from flush_batch via `staging.io_lock(item.ino)`.

Tests: 401 pass with `--features nfs` (matches CI parity); the 3 NFS
tests that panicked previously now pass. The d1_e1 structural test was
updated to look for `staging.io_lock(` instead of `staging.lock(`.
The sparse-write path exposes a class of lifecycle edge cases (poll
drift vs writable handles, NFS handle pool reuse, multi-handle reads
against in-flight writers) that this PR's earlier rounds keep
discovering. Ship it in beta: off by default so production gets the
pre-PR download-then-write behavior, opt-in via `--sparse-writes` for
testers who want the perf gain on large-file/small-edit workloads.

New CLI flag `--sparse-writes` (off by default). Implies
`--advanced-writes`; warns and is ignored if set without it.

Code-path gating:

* `open_advanced_write`: when sparse_writes=false and the inode has a
  Xet hash + size > 0, restore the pre-feature behavior of downloading
  the full CAS object into staging before returning the handle. The
  drift retry, sparse_write install, and "reused staging full-original
  shim" are all skipped when sparse_writes=false.
* `setattr(size=N)`: when sparse_writes=false on a clean Xet file with
  no existing staging, download the CAS object first (so set_len(N)
  doesn't truncate an empty file to N zero bytes and the regular
  upload doesn't replace the remote with zeros — the same B1 hazard
  but for Xet files, not just bucket files).
* `write()` lazy sparse_write install: gated on sparse_writes=true.
  Without this, a non-sparse fh that reaches the post-flush state
  (sparse_write=None, !is_dirty, xet_hash=Some, size>0) would lazy-
  install a SparseWriteState and corrupt the next flush.
* `setattr` Clean-file branch: gated on sparse_writes=true. Same
  rationale as the lazy install gate.

Tests: `TestOpts::sparse_writes` defaults to true so the 16 sparse
regression tests keep running without modification. Tests targeting
the non-sparse fallback can opt out via `TestOpts { sparse_writes:
false, .. }`. 401 lib tests pass with `--features nfs`.
* P1 — update_remote_file freezes hash rotation while a WRITABLE handle
  is open. Re-introduce open_write_handles (separate from open_handles).
  A clean writable handle still relies on entry.sparse_write matching
  entry.xet_hash for its next write (lazy install + range_upload
  composition); poll-clearing it would corrupt that write. Read-only
  handles are exempt — Lazy prefetch is bound to open-time hash,
  LocalFd reads gate fill_sparse_holes on the hash match (C2 fix). So
  long-lived NFS-pool read handles still don't freeze metadata
  refreshes (the B6/E7 outcome).
* P2 — rename of a dirty file pushes old_path to pending_deletes
  unconditionally, not only when xet_hash.is_some(). Even a brand-new
  dirty file (xet_hash=None) can have an in-flight flush that
  snapshotted with the old path and publishes AddFile{old_path} before
  the rename's re-enqueue runs. Without the delete, the re-enqueued
  flush adds the new path and the old one leaks. Hub tolerates
  DeleteFile on a non-existent path as a no-op.
* P3 — same-size setattr on a clean file now skips only the SIZE work
  (no mtime bump, no flush), not the whole function. Falls through to
  the metadata-only block below so a SETATTR carrying size=current_size
  + mode/uid/atime/mtime still applies those — NFS clients that batch
  all attrs into one RPC otherwise saw their other changes silently
  dropped.

Tests: 401 pass with --features nfs.
XciD added a commit that referenced this pull request Jun 2, 2026
Three tests and one NFS errno mapping ported from feat/append-write (the
predecessor branch). These pin invariants that the rewrite preserves but
that weren't previously exercised end-to-end:

- sparse_post_flush_read_returns_cas_bytes_not_zeros: after apply_commit_sparse
  the inode keeps its sparse_write (rekeyed to the new hash) with cleared
  dirty_ranges. A read of an unmodified region must fill holes from the
  just-committed CAS file, not return zeros from the staging holes. Pre-fix
  on PR #41 this returned '00XX000000' instead of '01XX456789'.

- sparse_fill_short_cas_stream_returns_eio_not_panic: when a CAS download
  yields zero bytes for a non-empty hole range, fill_sparse_holes must
  surface EIO rather than panicking on an OOB copy_from_slice. The code
  already has the data.len() < span check; this test pins it.

- sparse_noop_flush_advances_mtime: apply_noop_commit calls
  touch_commit_clocks so mtime advances even when the upload returns the
  unchanged hash. Required for POSIX-y fsync semantics: a successful
  fsync after an open-for-write should reflect modification time, not
  freeze it because the bytes happened to dedup.

- nfs: errno_to_nfs maps libc::EAGAIN -> NFS3ERR_JUKEBOX defensively.
  EAGAIN is now the sparse-write drift sentinel (returned from
  open_advanced_write when poll rotates xet_hash between snapshot and
  install). The retry loop in open() normally upgrades it to EIO so
  userspace never sees it, but mapping it directly avoids degrading a
  future leak to NFS3ERR_IO on NFS exports.

Tests not backported (already covered or not applicable to the rewrite):

- 25 helper-level unit tests (sparse_merge_range_*, sparse_clip_*,
  sparse_two_*, sparse_engulf_*, etc.) duplicate what
  src/virtual_fs/inode.rs::tests already covers structurally.

- b1/b5/b6/b8/c1/c2/c3/c4 numbered tests target specific bugs of the
  old approach (setattr loses bytes, stale sparse_write rollbacks). The
  rewrite addresses them structurally; equivalent regression coverage
  exists via sparse_setattr_after_hash_drift_uses_current_revision,
  sparse_otrunc_then_write_then_reopen_keeps_user_data, and the
  same-inode concurrent stress / verified-integrity tests.

- D1/E1/E3 source-shape tests grep mod.rs for specific code patterns
  (e.g. 'self.staging.io_lock(' in read() before the pread). Equivalent
  invariants in the new code shape (which uses io_lock + dirty_generation
  guard slightly differently) would need full rewrite to be useful, and
  the behavioural integrity stress already catches the bugs these were
  guarding.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant