perf: stream per-file aggregation to eliminate graph recompute RSS sawtooth#3
Conversation
…RSS sawtooth Replace full-history Vec<UnifiedMessage> materialization (~1M msgs, ~1GB transient per graph recompute) with per-file streaming fold: - StreamingAggregator: dedup-aware daily fold (cross-file seen-set + trae keep-latest buffer), feed/finalize API - SessionizeAccumulator: streaming session intervals (timestamps + token sums only), replaces full-slice sessionize() on the hot path - scan_messages_streaming: cache-aware per-file reference iteration, driver-level dedup gate, dual-sink single pass; preserves load_or_parse_source caching and Gemini invalidate_cache semantics - All report consumers (graph/model/monthly/hourly/time_metrics) switched off the materializing path; legacy path retained only for FFI parse_local_unified_messages_resolved (TODO-tagged) Measured on the 16k-file setup over a 50-min window: RSS sawtooth 1.25-1.72GB -> steady 0.6-1.2GB envelope (settles ~645MB), sampling stddev 162.5MB -> 23.2MB, no >180% CPU bursts. FFI surface unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-up to PR Nanako0129#3. scan_messages_streaming shared one seen_keys set across copilot/codebuff/kimi/gemini and all simple file lanes, where the old report path did not dedup these lanes at all. Two problems: - Cross-client false collisions: copilot keys are namespaced (trace:span), but codebuff (upstream message id) and kimi (message_id) use raw upstream ids with no client prefix, so an identical id across two clients would silently drop one real message. - Inconsistency: claude/codex/hermes/opencode already use per-client sets; the simple lanes did not follow that design. Each lane now owns its dedup set. This preserves the (correct, and likely intended) intra-client dedup — copilot spans that appear across multiple telemetry sources still collapse — while removing the cross-client coupling. New regression test builds a kimi and a codebuff message sharing dedup_key "COLLIDE" and asserts both survive; it fails against the shared-set version. Also: the new test fixtures (streaming_msg, parity_msg, snapshot_msg) tripped clippy::too_many_arguments, which is deny-level via #![deny(clippy::all)], breaking `cargo clippy --all-targets`. Added the #[allow] the existing fixtures already carry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks @aiexkwan. Validated on a rebase onto I found one correctness issue and pushed a fix straight to this branch (
The fix gives each lane its own dedup set, matching the claude/codex/hermes/opencode lanes that already work this way. That keeps the copilot dedup and removes the cross-client coupling. Added a regression test (a kimi and a codebuff message sharing dedup_key One heads-up worth recording: this changes historical graph/model/hourly counts for anyone who had duplicate copilot spans (now deduped). The agents report still runs the old materialized path, so it won't match for those clients until it migrates too — I'll open a separate issue to track that. This aggregation path and the |
…able PR #3 added a large divergence from upstream tokscale (streaming per-file aggregation across aggregator.rs/lib.rs/sessionize.rs) that the merge missed adding to vendor/README.md. A future re-vendor from junhoyeo/tokscale must re-apply it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
Follow-up to #1 / #2. With the v1.0.2 caching fixes in place, the remaining resource issue is the graph recompute path: every refresh materializes the full session history as a
Vec<UnifiedMessage>(~1M messages on this setup) before aggregating, producing a transient ~1GB allocation. RSS sawtooths between ~1.25GB and ~1.72GB all day (sampling stddev 162.5MB over 5 min).Change
Replace materialize-then-aggregate with a per-file streaming fold — one scan pass, no full-history Vec:
StreamingAggregator(aggregator.rs): folds&UnifiedMessageinto daily buckets one at a time. Dedup semantics preserved exactly: cross-file first-seen-wins seen-sets (claude/opencode/codex/hermes), trae keep-latest-per-session buffer (timestamp, dedup_key tiebreak) flushed after all lanes.SessionizeAccumulator(sessionize.rs): streaming replacement forsessionize()on the hot path — per-(client, session_id) timestamp vectors + token sums instead of holding every message. Output is parity-tested againstsessionize().scan_messages_streaming(lib.rs): cache-aware driver that iterates each file's cached messages by reference (nocached.messages.clone()), applies pricing on the fly, runs the dedup gate once at the driver level, and feeds both accumulators in a single pass.load_or_parse_sourcecaching, cache writeback, and Geminiinvalidate_cachesemantics are preserved (verified lane-by-lane against the old path).parse_all_messages_with_pricing_with_env_strategyremains only for the FFIparse_local_unified_messages_resolved(intentional Vec, TODO-tagged).git diff main -- crates/tb_core_ffi/src/has zeroextern "C"line changes.Measurements (16k-file setup, active Claude session writing JSONL throughout)
5s sampling, 5-minute windows:
Extended 50-minute monitor (5 rounds, 10-min spacing, final build): RSS envelope 551MB–1213MB, settling at ~645MB by rounds 4–5; no
>180%CPU bursts; no sawtooth. The residual floor is the resident message cache (STORE_MEMO), which is prior design and out of scope here.Also re-ran steady-state on v1.0.2 main as requested in #2: CPU mean 17.1%, 82% of samples ≤22% — the round-1 numbers hold.
Tests
cargo testworkspace: 872 passed, 0 failed (24 new: dedup scenario tests with hardcoded expectations, A/B parity vsaggregate_by_date, snapshot-isolation mid-iteration mutation test,SessionizeAccumulatorparity suite)make bundle+nmsymbol check on the shipped binary🤖 Generated with Claude Code