feat(xet-console): compile-time-gated observability portal for live transfers#874

Draft
assafvayner wants to merge 81 commits into main from assaf/xet-console

@assafvayner assafvayner commented Jun 15, 2026

Contributor

tl;dr: a progress TUI for a xet session's upload/download components. Goal: increase observability of hf_xet/xet session components.

xet-console — a compile-time-gated observability portal for live transfers

Debugging xet-core internals during a live transfer means logs and print statements today. There is no way to look inside a running XetSession — which files are chunking, which terms are enqueued vs. in flight, how dedup is performing mid-commit, what the adaptive concurrency controllers believe about the network — while it's happening.

This PR adds an opt-in dev tool that fixes that. When hf-xet/xet_pkg is built with the console feature, each process exposes a read-only HTTP+JSON view of every XetSession: upload commits (files, chunking progress, dedup, xorbs, shards), download groups (files, reconstruction terms, prefetch state), and adaptive concurrency monitors. It's consumed by a human through a terminal UI and by AI agents through the same API plus a skill document.

Design doc: docs/design/2026-06-11-xet-console-design.md

What's in here

  • xet_runtime::console — the core. model (serde wire types = registry snapshot output = server response bodies = TUI deserialization target), a process-global weak-ref ConsoleRegistry with bounded retention of completed/ended entities, an axum server (optional dep) on a dedicated thread + current-thread runtime, started lazily by the first console-enabled session, bound strictly to loopback 127.0.0.1:6660 (XET_CONSOLE_PORT overrides; 0 = ephemeral for tests).
  • Instrumentation threaded through xet_client / xet_data / xet_pkg via the existing XetContext, so no new dependency edges. Per-term, per-xorb, per-shard, and per-file-pipeline state hooks; session/commit/group + concurrency-monitor registration. With the feature off, a zero-sized ConsoleHandle<T> makes every call site a no-op that compiles to nothing.
  • xet_console — new workspace bin crate (xet-console): a ratatui/crossterm TUI that polls the API. Overview page, upload-commit detail, download-group detail, concurrency-monitor page (gauges, model states, limit-history sparklines). Rendered via TestBackend snapshot tests from fixture JSON.
  • Agent skill at docs/skills/xet-console/SKILL.md (curl+jq recipes + symptom→diagnosis guidance), with a README pointer.
  • Feature chain xet_pkg/console → xet-data/console → xet-client/console → xet-runtime/console mirroring the existing python/tokio-console chains; wheel passes the feature through; xet_console excluded from default workspace members; CI builds/lints/tests the feature so it can't rot.
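The "zero-sized ConsoleHandle<T>" idea from the list above can be illustrated roughly as follows. This is only a sketch under assumptions: the field layout, the `new`/`update` method names, and the closure-based constructor are hypothetical and not taken from the PR; only the type name and the `console` feature name come from the description.

```rust
// Sketch only: method names and internals are assumptions, not the PR's API.
#[cfg(feature = "console")]
use std::sync::{Arc, Mutex};

#[cfg(not(feature = "console"))]
use std::marker::PhantomData;

/// Feature on: the handle owns shared state that a registry could snapshot.
#[cfg(feature = "console")]
pub struct ConsoleHandle<T> {
    state: Arc<Mutex<T>>,
}

#[cfg(feature = "console")]
impl<T> ConsoleHandle<T> {
    pub fn new(init: impl FnOnce() -> T) -> Self {
        Self { state: Arc::new(Mutex::new(init())) }
    }

    /// A poisoned lock is ignored so a console failure never reaches the caller.
    pub fn update(&self, f: impl FnOnce(&mut T)) {
        if let Ok(mut guard) = self.state.lock() {
            f(&mut *guard);
        }
    }
}

/// Feature off: same API, but the type carries no data at all.
#[cfg(not(feature = "console"))]
pub struct ConsoleHandle<T> {
    _marker: PhantomData<fn() -> T>,
}

#[cfg(not(feature = "console"))]
impl<T> ConsoleHandle<T> {
    /// The initializer closure is never called, so no state is built.
    #[inline(always)]
    pub fn new(_init: impl FnOnce() -> T) -> Self {
        Self { _marker: PhantomData }
    }

    /// The update closure is dropped without running.
    #[inline(always)]
    pub fn update(&self, _f: impl FnOnce(&mut T)) {}
}
```

Call sites look identical in both configurations, e.g. `handle.update(|s| s.bytes_done += n)`; taking closures rather than values means that with the feature off the instrumentation arguments are never evaluated.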

Guarantees

  • The console never breaks the transfer. Every console-side failure (bind failure, poisoned lock, dangling weak ref mid-snapshot) degrades to a tracing::warn and a partial/absent view — no console error ever surfaces on the host data path.
  • Read-only, loopback-only, dev-only. No write/control ops, no remote access, nothing written to disk, structurally excluded on wasm.
  • Bounded memory. Live maps hold in-flight items only; completions fold into counters + bounded recent rings. Console state never grows with file count or file size.
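As a rough illustration of the first guarantee, a weak-ref registry can build its snapshot so that dangling references and poisoned locks yield a partial view instead of an error. All names here (`Registry`, `SessionState`, `SessionSnapshot`) are hypothetical stand-ins, and the sketch drops dead entries immediately rather than modelling the bounded retention of ended entities the PR describes.

```rust
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical wire type returned to API clients.
#[derive(Debug, Clone, PartialEq)]
pub struct SessionSnapshot {
    pub id: u64,
    pub bytes_done: u64,
}

/// Hypothetical live state owned by the transfer, not by the console.
pub struct SessionState {
    pub id: u64,
    pub bytes_done: Mutex<u64>,
}

#[derive(Default)]
pub struct Registry {
    sessions: Mutex<Vec<Weak<SessionState>>>,
}

impl Registry {
    /// Registration holds only a weak reference, so the registry never
    /// keeps a session alive.
    pub fn register(&self, session: &Arc<SessionState>) {
        if let Ok(mut sessions) = self.sessions.lock() {
            sessions.push(Arc::downgrade(session));
        }
    }

    /// Never panics and never returns an error: anything that cannot be
    /// read is simply left out of the view.
    pub fn snapshot(&self) -> Vec<SessionSnapshot> {
        let Ok(mut sessions) = self.sessions.lock() else {
            return Vec::new(); // poisoned registry lock: absent view
        };
        sessions.retain(|weak| weak.strong_count() > 0);
        sessions
            .iter()
            .filter_map(|weak| {
                let session = weak.upgrade()?; // dropped mid-snapshot: skip
                let bytes_done = *session.bytes_done.lock().ok()?;
                Some(SessionSnapshot { id: session.id, bytes_done })
            })
            .collect()
    }
}
```

A real implementation would presumably also emit a `tracing::warn` on the skipped paths; that is omitted here to keep the sketch dependency-free.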

Testing

  • Unit: registry lifecycle / weak-ref cleanup / bounded retention, ring buffers, model serde round-trips.
  • Integration (behind console): scripted upload/download against the local/mock CAS test infra while polling the real axum server; asserts state progressions appear and finished commits are retained. Tests bind port 0 to avoid CI collisions.
  • TUI: ratatui snapshot tests from fixtures; no live server.

Notes

  • Draft: opening for early review of the API/registry shape and the instrumentation footprint. v1 non-goals (SSE, web UI, MCP, persistence, production builds) are documented in the design doc's "Future work".

[Screenshot, 2026-06-15 10:44:14 AM]

Gate imports and items that depend on non-WASM modules (xet_data::processing,
xet_data::file_reconstruction, xet_runtime::core::xet_cache_root) behind
#[cfg(not(target_family = "wasm"))]:

- error.rs: gate FileReconstructionError import, from_file_reconstruction_error_ref,
  DataError::FileReconstructionError match arm, and From<FileReconstructionError> impl
- lib.rs: gate init_logging (uses xet_cache_root which is non-WASM)
- legacy/mod.rs: gate data_client, progress_tracking mods and all re-exports from
  xet_data::processing
- xet_session/mod.rs: gate common, download_stream_group, download_stream_handle mods
  and their re-exports, and xet_data::processing re-exports
- xet_session/session.rs: gate download_stream_group imports, active_download_stream_groups
  field, new_download_stream_group, register_download_stream_group, and abort's stream
  group cleanup
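The gating pattern this commit describes (modules, re-exports, struct fields, and cleanup code all behind the same cfg) looks roughly like the sketch below. The module, field, and function names are simplified placeholders, not the actual xet_pkg items.

```rust
// Placeholder names throughout; only the cfg predicate comes from the commit.

// A module that depends on non-WASM functionality is compiled out on wasm.
#[cfg(not(target_family = "wasm"))]
mod download_stream_group {
    pub fn describe() -> &'static str {
        "stream groups available"
    }
}

// The re-export must carry the same gate as the module it points into.
#[cfg(not(target_family = "wasm"))]
pub use download_stream_group::describe;

pub struct Session {
    pub name: String,
    // Fields can be gated too; every use of the field must then be gated.
    #[cfg(not(target_family = "wasm"))]
    pub active_stream_groups: Vec<u64>,
}

impl Session {
    pub fn new(name: &str) -> Self {
        Session {
            name: name.to_string(),
            #[cfg(not(target_family = "wasm"))]
            active_stream_groups: Vec::new(),
        }
    }

    /// Returns how many stream groups were cleaned up (always 0 on wasm).
    pub fn abort(&mut self) -> usize {
        #[cfg(not(target_family = "wasm"))]
        let cleared = {
            let n = self.active_stream_groups.len();
            self.active_stream_groups.clear();
            n
        };
        #[cfg(target_family = "wasm")]
        let cleared = 0;
        cleared
    }
}
```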

…imit sparkline

Replace placeholder with full implementation: monitor cards showing title line,
permit Gauge, success model with AdjustmentRecommendation, latency line, and
limit-history Sparkline. Also drops module-level #![allow(dead_code)] from
widgets.rs — all helpers are now consumed across the UI pages.

Clamp app.main_row against the full table (in-flight + completed) before
resolving the side-pane file. When the clamped index lands in the
completed range, show "file complete — no live blocks" / "file complete —
no prefetch state" instead of silently displaying the last in-flight
file. The "no in-flight file selected" message is reserved for a
genuinely empty table.
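The selection logic described in this commit could be sketched as below. The enum, the function name, and the assumption that in-flight rows precede completed rows in the table are illustrative choices, not the TUI's actual code.

```rust
/// What the side pane should show for the selected row (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum SidePane<'a> {
    /// Table is genuinely empty: "no in-flight file selected".
    Empty,
    /// Selected row is a completed file: "file complete" message.
    Complete(&'a str),
    /// Selected row is in flight: show its live state.
    InFlight(&'a str),
}

/// Clamp `main_row` against the full table (assumed to list in-flight rows
/// first, then completed rows) and resolve which file it points at.
pub fn resolve_side_pane<'a>(
    main_row: usize,
    in_flight: &'a [String],
    completed: &'a [String],
) -> SidePane<'a> {
    let total = in_flight.len() + completed.len();
    if total == 0 {
        return SidePane::Empty;
    }
    let row = main_row.min(total - 1);
    if row < in_flight.len() {
        SidePane::InFlight(&in_flight[row])
    } else {
        SidePane::Complete(&completed[row - in_flight.len()])
    }
}
```

Clamping against the combined length is what prevents a stale cursor from silently falling back to the last in-flight file.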

- CI: add `cargo clippy -p xet-console` and `cargo test -p xet-console` lines
  to the console-feature lint and test steps (xet-console has no console/simulation
  features of its own so it gets separate lines).
- xet_console/Cargo.toml: drop redundant version from xet-runtime path dep.
- main.rs: bail with a friendly error before entering raw mode when stdout is
  not a tty (uses std::io::IsTerminal); nightly fmt reformatted mod declarations.
- docs/skills/xet-console/SKILL.md: add Human TUI section after Connect.
- README.md: append TUI pointer sentence to the xet-console debugging paragraph.
- docs/design/2026-06-12-xet-console-tui-plan.md: include progress-tracking
  checkbox edits accumulated during implementation.
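The tty check mentioned for main.rs can be sketched with the standard library's `std::io::IsTerminal` trait. The function names and the error text are assumptions; the split into a pure check plus a thin wrapper is just to keep the logic testable.

```rust
use std::io::{self, IsTerminal};

/// Pure check, separated out so it can be tested without a real terminal.
/// The error message wording is hypothetical.
pub fn ensure_tty(stdout_is_tty: bool) -> Result<(), String> {
    if stdout_is_tty {
        Ok(())
    } else {
        Err("xet-console needs an interactive terminal; stdout is not a tty".to_string())
    }
}

/// Intended to run before raw mode is enabled, so a piped or redirected
/// invocation fails with a readable message instead of a garbled screen.
pub fn preflight() -> Result<(), String> {
    ensure_tty(io::stdout().is_terminal())
}
```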

SPACE on a commit/group row toggles an indented list of that entry's
files; ENTER on a file row drills into the parent detail page with that
file pre-selected.

…n, triangles for expand

Transfer direction now uses up/down arrows (↑ upload, ↓ download) and the
expand state uses right/down triangles (▸ collapsed, ▾ expanded) on overview
commit/group rows, shown only when the row has files. The selection cursor
becomes a chevron (❯) so the only triangle on a row is the genuine expand
caret. Adds → expand / ← collapse keys (idempotent) alongside the existing
space toggle, surfaced in the key bar and help.
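A minimal sketch of the key semantics described above: → and ← are idempotent, while space toggles. The `Key` enum, the `HashSet` of expanded row ids, and ignoring keys on rows without files are assumptions made for illustration; the real TUI presumably maps crossterm key events.

```rust
use std::collections::HashSet;

/// Hypothetical reduced key set for the overview page.
#[derive(Clone, Copy)]
pub enum Key {
    Space,
    Right,
    Left,
}

/// Update the set of expanded rows for one key press.
pub fn apply_key(expanded: &mut HashSet<u64>, row_id: u64, has_files: bool, key: Key) {
    if !has_files {
        return; // assumption: rows without files cannot be expanded
    }
    match key {
        // Expanding an already expanded row changes nothing.
        Key::Right => {
            expanded.insert(row_id);
        }
        // Collapsing an already collapsed row changes nothing.
        Key::Left => {
            expanded.remove(&row_id);
        }
        // Space flips the current state.
        Key::Space => {
            if !expanded.remove(&row_id) {
                expanded.insert(row_id);
            }
        }
    }
}

/// The expand caret, drawn only for rows that have files.
pub fn caret(expanded: &HashSet<u64>, row_id: u64, has_files: bool) -> &'static str {
    if !has_files {
        " "
    } else if expanded.contains(&row_id) {
        "▾"
    } else {
        "▸"
    }
}
```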

Also formats the crate against the current nightly rustfmt (it had previously
been committed without being fmt-clean).

cargo-machete flagged serde_json: the crate uses serde (via reqwest's .json()
deserialization) but never serde_json directly. Removing it fixes the
detect-unused-dependencies CI check.

…, label+scale the limit sparkline

The full-width permit-utilization gauge duplicated the 'permits N/M (X free)'
text line and visually dominated each card; remove it. Label the limit-history
sparkline as 'allowed permits' and print its scale (now / observed min-max /
hard cap / adjustment count) since the sparkline auto-scales with no axis.

The implausible rtt/bw values shown in the latency line are an upstream
adaptive-concurrency predictor issue (see #875), not a console bug; the console
already renders None as '–', so the line will read correctly once that fix lands.

…s written

The readahead pane plotted the term-block manager's active_byte_position as
the "writer" and keyed its "complete" verdict off it. That cursor tracks how
far the reconstruction plan has been consumed, which races ahead of bytes
actually written, so for whole-file downloads the pane flipped to
"writer @ 100% / complete" while the files table still read e.g. 53%.

Use fl.bytes_completed (the same value the files table shows) for the writer
line and the done check. "buffered" is now the genuine prefetch-ahead-of-writer
gap, and the pane can no longer contradict the files table.
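The corrected readahead computation can be sketched as follows. Only `bytes_completed` is named in the commit; `prefetched_through`, `file_size`, and the `ReadaheadView` struct are hypothetical names for this illustration.

```rust
/// Hypothetical view model for the readahead pane.
#[derive(Debug, PartialEq)]
pub struct ReadaheadView {
    pub writer_pct: u8,
    pub buffered_bytes: u64,
    pub complete: bool,
}

/// Derive the pane from bytes actually written, not from how far the
/// reconstruction plan has been consumed.
pub fn readahead_view(bytes_completed: u64, prefetched_through: u64, file_size: u64) -> ReadaheadView {
    let writer_pct = if file_size == 0 {
        100
    } else {
        // u128 avoids overflow when multiplying large byte counts by 100.
        ((bytes_completed.min(file_size) as u128 * 100) / file_size as u128) as u8
    };
    ReadaheadView {
        writer_pct,
        // "buffered" is the prefetch lead over the writer; saturate in case
        // the two counters are sampled at slightly different moments.
        buffered_bytes: prefetched_through.saturating_sub(bytes_completed),
        complete: bytes_completed >= file_size,
    }
}
```

With the plan fully consumed but only 53 of 100 bytes written, this reports a writer at 53% with 47 bytes buffered and `complete == false`, matching what the files table would show.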

The term-blocks pane dropped each block on consume, leaving only a count, so
a finished file showed just 'consumed N blocks total'. Retain a bounded recent
ring of consumed blocks (lightweight — term count + fetch duration, heavy terms
list dropped) plus cumulative consumed terms/bytes that keep counting blocks
evicted past the ring bound. The completion snapshot freezes this, and the
download UI now resolves completed-file selection so the history stays visible
after the file finishes.

Server-side retention lives in the console state/registry (xet_runtime); the
ring reuses the existing RECENT_RING_CAPACITY pattern, so console memory stays
independent of file size.
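The retention scheme described here (a bounded ring of recent blocks plus cumulative counters that keep counting after eviction) might look like the sketch below. The struct and field names are hypothetical, and the capacity is passed in rather than using the crate's RECENT_RING_CAPACITY constant.

```rust
use std::collections::VecDeque;

/// Lightweight record of a consumed block; the heavy terms list is not kept.
#[derive(Debug, Clone, PartialEq)]
pub struct ConsumedBlock {
    pub term_count: u32,
    pub bytes: u64,
    pub fetch_ms: u64,
}

/// Bounded history: memory is O(capacity) regardless of file size.
pub struct ConsumedHistory {
    capacity: usize,
    recent: VecDeque<ConsumedBlock>,
    pub total_blocks: u64,
    pub total_terms: u64,
    pub total_bytes: u64,
}

impl ConsumedHistory {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            recent: VecDeque::with_capacity(capacity),
            total_blocks: 0,
            total_terms: 0,
            total_bytes: 0,
        }
    }

    /// Counters always advance; the ring evicts its oldest entry when full.
    pub fn record(&mut self, block: ConsumedBlock) {
        self.total_blocks += 1;
        self.total_terms += u64::from(block.term_count);
        self.total_bytes += block.bytes;
        if self.capacity == 0 {
            return;
        }
        if self.recent.len() == self.capacity {
            self.recent.pop_front();
        }
        self.recent.push_back(block);
    }

    /// Oldest-to-newest view of the retained blocks.
    pub fn recent_blocks(&self) -> impl Iterator<Item = &ConsumedBlock> + '_ {
        self.recent.iter()
    }
}
```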

Resolves the controller.rs conflict from #871 (shared adaptive-concurrency
controllers per (session, endpoint)). #871 dropped the stored XetContext in
favor of Arc<XetConfig> to avoid a runtime-keepalive cycle; the console
instrumentation only needs ctx at construction (ctx.common.console_scope())
and holds a leaf monitor Arc, not ctx, so it composes cleanly. Updated both
constructors and new_testing_with_monitor to the config-only form.

Verified: console feature builds; xet-client (incl. the no-cycle test),
console integration tests across runtime/data/hf-xet, and the xet-console TUI
tests all pass; fmt + console clippy clean.

Resolves conflicts from main #841 (wasm32 support), #876 (pyo3 bump), and
internal-tools/tracing-appender additions:

- task_runtime.rs, session.rs build(): take main's wasm structure
  (MaybeSend trait, unified build); console block preserved.
- file_download_session.rs: adopt main's wasm-gated impl-block layout and
  re-apply the xet-console instrumentation onto the relocated download methods.
- CI: keep the console test step alongside main's internal-tools test feature;
  use main's xet_pkg/build_wasm.sh wasm gate.
- Cargo manifests: union console features with main's analysis/internal-tools.
- Cargo.lock regenerated against the merged manifests.

Claude-Session: https://claude.ai/code/session_01CFjcwD6aVnjnmhSD37feD8