feat(sdk): parallel pipeline fan-out + typed JSON report with metrics #156

Merged
ZhiXiao-Lin merged 5 commits into main from feat/sdk-pipeline-parallel-report on Jun 29, 2026

@ZhiXiao-Lin

What

Makes the a3s-box-sdk programmable-CI pipeline usable for matrix / parallel CI and produces a machine-readable result.

  • Base::run_parallel(steps, max_concurrency) -> Report — runs steps concurrently as isolated copy-on-write MicroVM forks, bounded by max_concurrency, collect-all (every step runs; results returned in input order). Built on std::thread::scope + a work queue — still dependency-free.
  • Typed Report / StepResult with to_json() (dep-free encoder): separated stdout/stderr, duration_ms, cached, allow_failure, and metrics parsed from ::metric <key>=<value> lines a step prints to stdout (a scoring channel for matrix/selection workloads).
  • Steps now take &self (atomic fork counter). The previous &mut self made the documented "fan out with threads" impossible through the borrow checker.
  • Leak-free RAII: each fork is removed on every path (incl. panic) via a guard; the base snapshot is removed on Drop with --force.
  • Collision-safe names: box/snapshot names carry per-process + per-instance entropy, so two pipelines from the same image+setup (or two processes on one host) can no longer collide and tear down each other's live boxes.
  • Bounded output (WarmBase::max_output, default 1 MiB) so one chatty/looping step can't balloon an in-memory report during large fan-out.
  • Distinct INFRA_FAILURE exit sentinel so a scorer can tell "never ran (infra)" from "ran and failed".
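The bounded, collect-all, input-ordered behavior described for run_parallel can be sketched with only the standard library. This is a minimal illustration of the std::thread::scope + work-queue pattern, not the SDK's implementation: `run_bounded`, `StepOutcome`, and the closure-based job type are hypothetical names, and a plain closure stands in for forking a MicroVM and executing a step.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for the SDK's per-step result.
#[derive(Debug, Clone, PartialEq)]
pub struct StepOutcome {
    pub name: String,
    pub exit_code: i32,
}

/// Runs every job on at most `max_concurrency` worker threads and returns
/// the outcomes in input order, regardless of completion order.
/// Each job is a closure standing in for "fork a MicroVM and run the step".
pub fn run_bounded<F>(jobs: Vec<(String, F)>, max_concurrency: usize) -> Vec<StepOutcome>
where
    F: FnOnce() -> i32 + Send,
{
    let total = jobs.len();
    let workers = max_concurrency.max(1).min(total);

    // Work queue: each entry remembers its input index.
    let queue: Mutex<VecDeque<(usize, String, F)>> = Mutex::new(
        jobs.into_iter()
            .enumerate()
            .map(|(index, (name, job))| (index, name, job))
            .collect(),
    );
    // One result slot per input position.
    let slots: Mutex<Vec<Option<StepOutcome>>> = Mutex::new(vec![None; total]);

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Hold the queue lock only long enough to pop one job.
                let next = queue.lock().unwrap().pop_front();
                let Some((index, name, job)) = next else { break };
                let exit_code = job();
                let mut guard = slots.lock().unwrap();
                guard[index] = Some(StepOutcome { name, exit_code });
            });
        }
    });

    slots
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every queued job produced an outcome"))
        .collect()
}
```

Because the workers borrow the queue and the result slots from the enclosing stack frame, no `Arc` or `'static` bound is needed, which is what keeps the pattern dependency-free. The sketch does not handle a panicking job; the PR's per-fork RAII guard covers that case in the real pipeline.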

Why

The pipeline could previously run steps only sequentially (step took &mut self), so a3s-box's cheap (~110 ms) CoW fork couldn't be used for matrix/parallel CI. The only result was a per-call exit_code plus concatenated logs, which is not a comparable, machine-readable signal. This PR is the building block for fan-out CI and selection/scoring workloads.

Breaking

  • StepResult.logs (field) is removed, replaced by stdout / stderr. Use StepResult::combined() for the old concatenated view. (Documented under ### Changed.)

Validation

  • cargo test -p a3s-box-sdk: 15 unit tests + doctest; cargo clippy clean.
  • End-to-end on a real /dev/kvm host (a3s-box 2.5.1):
read   -> code=0 out="DEPS-INSTALLED"        # real boot + snapshot + CoW fork + exec
read#2 -> cached=true                         # content-addressed cache hit
report -> {"passed":false,"total_ms":2018,"steps":[
  {"name":"ok",  "exit_code":0,"duration_ms":62,"metrics":{},"stdout":"fine\n"},
  {"name":"perf","exit_code":0,"duration_ms":67,"metrics":{"duration_ms":12.5},...},
  {"name":"fail","exit_code":7,"duration_ms":94,...}]}
passed=false failures=["fail"]
leak check: 0 ci-base boxes, 0 snapshots

This run covers a real microVM boot, CoW fork-per-step, parallel fan-out, ::metric extraction from a live guest, the typed JSON report, and zero leaked boxes/snapshots.
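For illustration, extracting `::metric <key>=<value>` lines from a step's stdout could look roughly like the following. This is a sketch under assumptions, not the SDK's parser: the function name `parse_metrics` is hypothetical, values are assumed to be finite floating-point numbers (the report above shows `"duration_ms":12.5`), a later duplicate key is assumed to overwrite an earlier one, and malformed lines are assumed to be ignored.

```rust
use std::collections::BTreeMap;

/// Collects `::metric <key>=<value>` lines from a step's stdout.
/// Assumptions (not confirmed by the PR): values parse as finite f64,
/// later duplicates win, and malformed lines are skipped silently.
pub fn parse_metrics(stdout: &str) -> BTreeMap<String, f64> {
    let mut metrics = BTreeMap::new();
    for line in stdout.lines() {
        // Only lines that start with the marker are considered.
        let Some(rest) = line.trim().strip_prefix("::metric ") else {
            continue;
        };
        // Split on the first '=' into key and value.
        let Some((key, value)) = rest.split_once('=') else {
            continue;
        };
        let key = key.trim();
        if key.is_empty() {
            continue;
        }
        match value.trim().parse::<f64>() {
            // NaN and infinities have no JSON number form, so drop them.
            Ok(number) if number.is_finite() => {
                metrics.insert(key.to_string(), number);
            }
            _ => {}
        }
    }
    metrics
}
```

A `BTreeMap` keeps keys sorted, which would make the encoded report deterministic across runs; whether the SDK orders metrics this way is not stated in the PR.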

Dependency-free throughout; no new crates.

Roy Lin added 5 commits June 28, 2026 15:27
Base::run_parallel runs steps as concurrent copy-on-write MicroVM forks
(bounded, collect-all, input-ordered) and returns a dependency-free JSON
Report. StepResult gains separated stdout/stderr, duration_ms, and metrics
parsed from `::metric <key>=<value>` guest-stdout lines (a scoring channel
for matrix/selection workloads).

Steps now take &self (atomic fork counter), so fan-out no longer needs
hand-rolled threads. RAII removes each fork on every path and the base
snapshot on Drop (--force). Box/snapshot names carry per-process+instance
entropy to prevent cross-pipeline name collisions; output is capped to bound
report memory under large fan-out; infra failures use a distinct sentinel.

15 unit tests + doctest, clippy clean; validated end-to-end on a real
/dev/kvm host (real boot, CoW fork-per-step, metrics, JSON report, 0 leaks).
… + infra retry

Backs the parallel pipeline with real-microVM coverage and hardens it for
sustained, highly-concurrent churn:

- sweep_orphans(): reclaim ci-base-* boxes/snapshots left by a SIGKILL/OOM'd
  pipeline process (its RAII never ran), matched by the dead owner pid embedded
  in the resource name; never touches a live peer's resources.
- WarmBase::infra_retries (default 2): retry a fork that hits a TRANSIENT infra
  failure (restore/start/boot); the step's command never ran, so re-forking is
  idempotent. Keeps sustained high-concurrency churn green.
- tests/integration_kvm.rs (5 #[ignore] tests): warm+fork+exec, cache hit,
  parallel order/metrics, fork isolation, leak-freeness, sweep crash-recovery.
- tests/soak_kvm.rs: sustained fork-eval churn stays leak-free and RSS-stable;
  leak gates are process-scoped (robust to a concurrent pipeline on the host).
- ci.yml: run both under the integration-kvm (real /dev/kvm) gate.

Validated on a real KVM host (a3s-box 2.5.1): integration 5/5; soak 1500
fork-evals across 75 generations leak-free, RSS +512 KiB; 0 orphans after.
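The retry rule described above (retry only transient infra failures, since the step's command never ran) can be sketched as a small helper. The names `ForkError` and `retry_transient` are hypothetical and the signature of the SDK's own retry helper is not shown in this PR; the sketch only illustrates the "retry infra, never retry a step result" split.

```rust
/// Hypothetical error split: `Infra` means the fork failed before the
/// step's command ran; `Step` means the command ran and exited non-zero.
#[derive(Debug, PartialEq)]
pub enum ForkError {
    Infra(String),
    Step(i32),
}

/// Calls `attempt` once, then up to `retries` more times, but only while
/// it keeps failing with `ForkError::Infra`. Any other result is returned
/// immediately, so a step that actually ran is never re-executed.
pub fn retry_transient<T>(
    retries: u32,
    mut attempt: impl FnMut() -> Result<T, ForkError>,
) -> Result<T, ForkError> {
    let mut left = retries;
    loop {
        match attempt() {
            Err(ForkError::Infra(_)) if left > 0 => left -= 1,
            other => return other,
        }
    }
}
```

With `retries = 2` (the stated default for WarmBase::infra_retries), a fork gets at most three attempts in this sketch before the infra error is surfaced.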
- a3s-box-ci: a dep-free bin (in a3s-box-sdk) bridging any agent/tool to the
  pipeline. `run [SPEC|-]` parses a line-based spec -> run_parallel -> JSON
  Report (exit 0 iff passed); `sweep` reclaims crashed-pipeline orphans. This
  is what lets a3s-code / Claude Code / Codex drive the pipeline from a script.
- warm_base now retries a transient infrastructure failure too (DRY'd with the
  per-step fork via a shared retry_infra), so concurrent same-image warms stay
  robust under load.

Validated end-to-end on real KVM (runner: real pipeline + sweep; a3s-code drives
it via session.program through the QuickJS runtime). 19 unit/bin tests + clippy.
…e corruption

RootfsCache::prune (called after a cache-miss put) evicted LRU entries with no
in-use guard, so it could remove_dir_all a cache entry that a CONCURRENT box was
using as its overlayfs lowerdir — that box's mount(2) then failed with ENOENT
('No such file or directory (os error 2)'), persisting through retries since the
backing was gone. Two pipelines from the same image collapse onto one cache key,
so this hit any concurrent same-image workload.

Fix: the same in-use guard SnapshotStore::prune already applies to live CoW
lowers. Each overlay box records the cache key it holds in <box_dir>/.rootfs-
cache-key (removed with the box dir); prune skips any still-referenced key
(prune_protecting, with prune as the empty-set wrapper).

Found via a concurrent-pipeline chaos test driven through a3s-code; verified on
a real /dev/kvm host (concurrency scenario: ~50% failure -> reliably green).
41 rootfs-cache unit tests + clippy clean.
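The in-use guard can be modeled in memory as follows. This is a simplified sketch, not the RootfsCache code: the `Entry` type, the count-based limit, and the `last_used` tick are assumptions, and the real prune removes directories on disk and derives the protected set from each box's .rootfs-cache-key file.

```rust
use std::collections::HashSet;

/// Hypothetical cache entry: a key plus a last-used tick (smaller = older).
#[derive(Debug, Clone)]
pub struct Entry {
    pub key: String,
    pub last_used: u64,
}

/// Evicts least-recently-used entries until at most `max_entries` remain,
/// but never evicts a key listed in `in_use`. Returns the evicted keys.
/// If every surplus entry is protected, the cache stays over its limit.
pub fn prune_protecting(
    entries: &mut Vec<Entry>,
    max_entries: usize,
    in_use: &HashSet<String>,
) -> Vec<String> {
    entries.sort_by_key(|entry| entry.last_used); // oldest first
    let mut remaining = entries.len();
    let mut evicted = Vec::new();
    let mut kept = Vec::new();
    for entry in entries.drain(..) {
        if remaining > max_entries && !in_use.contains(&entry.key) {
            evicted.push(entry.key);
            remaining -= 1;
        } else {
            kept.push(entry);
        }
    }
    *entries = kept;
    evicted
}

/// The unguarded form is the same call with an empty protected set.
pub fn prune(entries: &mut Vec<Entry>, max_entries: usize) -> Vec<String> {
    prune_protecting(entries, max_entries, &HashSet::new())
}
```

Expressing `prune` as the empty-set wrapper mirrors the structure the commit message describes and keeps a single eviction path to test.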
@ZhiXiao-Lin ZhiXiao-Lin merged commit e00ce35 into main Jun 29, 2026
8 checks passed