Status: v0.3 phase 1. Design draft.
Tracking: ROADMAP.md → "Cut pause-window without forking Firecracker".
Depends on: Firecracker v1.10.1 (already in use); track_dirty_pages: true on /machine-config (already set).
Reduce BRANCH pause-window from "write full memory.bin" (4 s for 513 MiB on SATA SSD, 1.04 s for 4 GiB on tmpfs) to "write only the pages dirtied since the previous snapshot". Expected 5–10× win for typical agent workloads where source has touched <10 % of its RAM between BRANCHes.
Firecracker's PUT /snapshot/create already accepts
snapshot_type: "Diff". With track_dirty_pages enabled on
/machine-config (forkd does this by default since v0.1.x), the
guest kernel's per-page dirty bits flow into a sparse file:
Fullsnapshot: writes every page (mem_size_mib× 1 MiB).Diffsnapshot: writes a sparse file the same logical size as the full memory image, but withlseek(SEEK_HOLE)gaps over clean pages. Only dirtied pages get bytes; the rest is holes.
After a Diff snapshot, the dirty bitmap is cleared. The next
Diff snapshot starts from a fresh bitmap and only writes what's
been dirtied since the last snapshot.
Children need a memory.bin that reflects the source's current state at BRANCH-pause time, not the source's BOOT state. A diff file alone isn't enough — children would see boot state for any page not in the diff.
Two sub-options:
When the daemon creates a source sandbox, it cps the source tag's
memory.bin into a per-sandbox shadow file. The source firecracker
still mmaps the original tag's memory.bin (read-only path), but the
daemon now has a writable copy it can merge diffs into.
At each BRANCH:
- Pause source (fast).
snapshot/createwithsnapshot_type: "Diff"→ writes a sparsediff-<timestamp>.bincontaining only the pages dirtied since the last snapshot.- Resume source.
- Background thread merges
diff-<timestamp>.binonto the shadow file: walks the sparse file, copies non-hole pages into the shadow at the same offsets. Cost O(dirty page bytes). - Children spawn and
mmap(shadow_file, MAP_PRIVATE)for guest RAM.
Pause-window cost is O(dirty pages), not O(memory size). For a typical agent that's touched 50 MB of working set, pause drops from 4 s (full write of 4 GiB on SATA) to ~60 ms (diff write of 50 MB).
Sandbox-creation cost gains one full memory.bin copy upfront. On SATA this is ~4 s for 513 MiB or ~30 s for 4 GiB — amortized across however many BRANCHes the source produces. Break-even is 2 BRANCHes even for the worst-case (full-memory) source diff; typical agent workloads break even on the first BRANCH because the diff is much smaller than the full memory.
Could merge diff onto shadow synchronously inside BRANCH, before returning. Doubles the pause-window over option A. Rejected.
Have children mmap the source tag's memory.bin as base and the
diff file separately, with a custom page-fault handler that picks
base or diff per address. This is essentially what we deferred in
docs/design/userfaultfd.md — requires a uffd
handler and either a Firecracker patch or a clever userspace dance.
Phase 1 stays on the simple MAP_PRIVATE path; this is a candidate
for phase 2 if shadow-file storage becomes a bottleneck.
Firecracker's dirty bitmap starts populated at restore time — every page that's been touched since the snapshot was loaded counts as "dirty". For a freshly-restored source that hasn't executed anything, the bitmap is empty (no pages dirtied yet). For a source that's been running, the bitmap reflects everything touched since boot OR since the last snapshot.
So the FIRST BRANCH on a long-running source can have a near-full dirty set, making Diff degrade to Full performance. This is acceptable: the worst case is the existing v0.2 cost; subsequent BRANCHes will be fast because the bitmap was cleared.
The "freshly restored source, immediately BRANCH" case is best: dirty bitmap empty, Diff writes ~0 bytes, pause-window is just the API round-trip plus vCPU state save (sub-100 ms).
impl Vm {
/// Existing.
pub fn snapshot_to(&self, vmstate: PathBuf, memory: PathBuf, ...) -> Result<Snapshot>;
/// NEW: write a Diff snapshot. memory_diff is a sparse file the
/// same logical size as the source's mem_size_mib; only the
/// dirtied pages are written. Caller is responsible for merging
/// the diff onto a base before any restore.
pub fn snapshot_diff_to(&self, vmstate: PathBuf, memory_diff: PathBuf, ...) -> Result<DiffSnapshot>;
}
pub struct DiffSnapshot {
pub vmstate: PathBuf,
pub memory_diff: PathBuf,
/// `memory_diff` is logically this size; physically it's
/// (count of dirty pages × page_size).
pub logical_size_bytes: u64,
pub volumes: Vec<VolumeSpec>,
}
/// Merge a diff sparse file onto a base memory.bin in place.
/// Copies non-hole pages from `diff` into `base` at the same offsets.
/// Returns the number of bytes copied (= dirty page bytes).
pub fn apply_diff(diff: &Path, base: &Path) -> Result<u64>;AppState keeps a per-sandbox shadow file path. branch_sandbox
handler:
- If sandbox has no shadow yet (first BRANCH after sandbox creation), the shadow file is the source tag's memory.bin and we degrade to Full snapshot.
- Otherwise:
- Call
Vm::snapshot_diff_to(diff_file). - Spawn background task that applies the diff onto the shadow.
- Update the source's tag's
memory.binreference to point at the shadow file for the children's restore. - Children's
mem_file_pathis the shadow file, MAP_PRIVATE.
- Call
For v0.3 phase 1a (MVP) we DON'T thread the shadow path through the API — we measure the diff snapshot mechanism in isolation against the existing Full path. Phase 1b wires the shadow file.
Reuse bench/pause-window/sweep-prewarm.sh shape. New experiment:
- For each memory size: spawn source, BRANCH 5 times in a row.
- prewarm=false, diff=false → today's behavior (baseline).
- prewarm=false, diff=true → diff snapshots.
- Measure
pause_msper BRANCH. Expect BRANCH 1 ≈ baseline (cold dirty set), BRANCH 2–5 ≪ baseline (small diffs).
Publish in bench/pause-window/RESULTS-v0.3.md.
- Diff-of-diff (BRANCH N+1's diff is relative to BRANCH N's shadow, not BRANCH N's diff). Phase 1's shadow-file merge handles this transparently.
- Cross-host diff transport (interesting for the Hub but separate).
- Compression of diff files before merge (zstd; useful for storage, not for pause-window).
| Phase | Scope | Status |
|---|---|---|
| 1a | Vm::snapshot_diff_to + apply_diff + unit tests + measurement on isolated source. |
Landed. |
| 1b | branch_sandbox with diff: true mode (parallel cp + diff during pause + apply on resume). Initially restricted to first BRANCH per sandbox. |
Landed. |
| 1c | Bench sweep-diff-real.sh + sweep-agent.sh + RESULTS-v0.3.md threshold curve. |
Landed. |
| 1d | Multi-BRANCH diff via the previous-BRANCH-output chain. No separate shadow file: each BRANCH's memory.bin is, by construction, source's state at that BRANCH's pause time — exactly the base the next diff needs. Tracked as SandboxInfo.last_branch_memory_path. Falls back to source-tag (with a warning) if the user deletes an intermediate BRANCH. |
Landed. |
Firecracker's dirty bitmap is cleared on EVERY snapshot/create,
Full or Diff. So once any BRANCH has been taken from a sandbox, the
NEXT diff would only see pages dirtied between BRANCH N and
BRANCH N+1 — missing everything dirtied before BRANCH N. If we
applied that diff onto source_tag/memory.bin (the boot state),
the resulting snapshot would be missing N batches of dirty pages.
Key insight: each successful BRANCH's snap_dir/memory.bin is,
by construction, the source's complete memory state at that BRANCH's
pause time. That's already the base the next diff needs. The
"shadow file" we'd otherwise have to maintain is just the previous
BRANCH's output.
Phase 1d implementation:
SandboxInfo.last_branch_memory_path: Option<PathBuf>tracks whichever BRANCH most recently completed (Full or Diff — both clear the bitmap, so either works as the next chain head).- On every successful BRANCH, the daemon calls
Registry::mark_branched(id, snap_dir/memory.bin)to update the chain head. - On every
diff: truerequest, the daemon picks the cp source as the chain head if set AND the file still exists. Otherwise (first BRANCH on a sandbox, or chain broken by user deletion) it falls back tosource_tag/memory.binwith a logged warning.
Trade-offs:
- Zero extra storage cost. Each BRANCH's output already lives
in
<snapshot_root>/<tag>/memory.bin; we just point at it. - No background tasks. No shadow-update thread, no JoinHandle bookkeeping. The chain head is just metadata.
- Chain breaks cleanly on deletion. If the user
DELETEs a BRANCH snapshot whosememory.binis the current chain head, the next diff BRANCH degrades to "fall back to source tag" with a warning. This is semantically lossy (loses pages dirtied before deletion) but doesn't crash; the operator can choose to switch to Full mode for that BRANCH if they need correctness.
Lifted in v0.3.1. The previously-shipped has_branched: bool flag
stays in SandboxInfo as a diagnostic but the daemon no longer
gates on it.