Skip to content

perf: stream per-file aggregation to eliminate graph recompute RSS sawtooth#3

Merged
Nanako0129 merged 2 commits into
Nanako0129:mainfrom
aiexkwan:perf/streaming-aggregation
Jun 13, 2026
Merged

perf: stream per-file aggregation to eliminate graph recompute RSS sawtooth#3
Nanako0129 merged 2 commits into
Nanako0129:mainfrom
aiexkwan:perf/streaming-aggregation

Conversation

@aiexkwan

Copy link
Copy Markdown
Contributor

Problem

Follow-up to #1 / #2. With the v1.0.2 caching fixes in place, the remaining resource issue is the graph recompute path: every refresh materializes the full session history as a Vec<UnifiedMessage> (~1M messages on this setup) before aggregating, producing a transient ~1GB allocation. RSS sawtooths between ~1.25GB and ~1.72GB all day (sampling stddev 162.5MB over 5 min).

Change

Replace materialize-then-aggregate with a per-file streaming fold — one scan pass, no full-history Vec:

  • StreamingAggregator (aggregator.rs): folds &UnifiedMessage into daily buckets one at a time. Dedup semantics preserved exactly: cross-file first-seen-wins seen-sets (claude/opencode/codex/hermes), trae keep-latest-per-session buffer (timestamp, dedup_key tiebreak) flushed after all lanes.
  • SessionizeAccumulator (sessionize.rs): streaming replacement for sessionize() on the hot path — per-(client, session_id) timestamp vectors + token sums instead of holding every message. Output is parity-tested against sessionize().
  • scan_messages_streaming (lib.rs): cache-aware driver that iterates each file's cached messages by reference (no cached.messages.clone()), applies pricing on the fly, runs the dedup gate once at the driver level, and feeds both accumulators in a single pass. load_or_parse_source caching, cache writeback, and Gemini invalidate_cache semantics are preserved (verified lane-by-lane against the old path).
  • All report consumers (graph / model / monthly / hourly / time_metrics) switched off the materializing path. The legacy parse_all_messages_with_pricing_with_env_strategy remains only for the FFI parse_local_unified_messages_resolved (intentional Vec, TODO-tagged).
  • FFI surface unchanged: git diff main -- crates/tb_core_ffi/src/ has zero extern "C" line changes.

Measurements (16k-file setup, active Claude session writing JSONL throughout)

5s sampling, 5-minute windows:

v1.0.2 (baseline) this PR
RSS range 1.25GB ↔ 1.72GB sawtooth 576–639MB steady
RSS sampling stddev 162.5MB 23.2MB
CPU mean / max 17.1% / 181.4% 13.2% / 130.5%

Extended 50-minute monitor (5 rounds, 10-min spacing, final build): RSS envelope 551MB–1213MB, settling at ~645MB by rounds 4–5; no >180% CPU bursts; no sawtooth. The residual floor is the resident message cache (STORE_MEMO), which is prior design and out of scope here.

Also re-ran steady-state on v1.0.2 main as requested in #2: CPU mean 17.1%, 82% of samples ≤22% — the round-1 numbers hold.

Tests

  • cargo test workspace: 872 passed, 0 failed (24 new: dedup scenario tests with hardcoded expectations, A/B parity vs aggregate_by_date, snapshot-isolation mid-iteration mutation test, SessionizeAccumulator parity suite)
  • clippy clean across the workspace; make bundle + nm symbol check on the shipped binary

🤖 Generated with Claude Code

…RSS sawtooth

Replace full-history Vec<UnifiedMessage> materialization (~1M msgs, ~1GB
transient per graph recompute) with per-file streaming fold:

- StreamingAggregator: dedup-aware daily fold (cross-file seen-set +
  trae keep-latest buffer), feed/finalize API
- SessionizeAccumulator: streaming session intervals (timestamps +
  token sums only), replaces full-slice sessionize() on the hot path
- scan_messages_streaming: cache-aware per-file reference iteration,
  driver-level dedup gate, dual-sink single pass; preserves
  load_or_parse_source caching and Gemini invalidate_cache semantics
- All report consumers (graph/model/monthly/hourly/time_metrics)
  switched off the materializing path; legacy path retained only for
  FFI parse_local_unified_messages_resolved (TODO-tagged)

Measured on the 16k-file setup over a 50-min window: RSS sawtooth
1.25-1.72GB -> steady 0.6-1.2GB envelope (settles ~645MB), sampling
stddev 162.5MB -> 23.2MB, no >180% CPU bursts. FFI surface unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-up to PR Nanako0129#3. scan_messages_streaming shared one seen_keys set
across copilot/codebuff/kimi/gemini and all simple file lanes, where the
old report path did not dedup these lanes at all. Two problems:

- Cross-client false collisions: copilot keys are namespaced
  (trace:span), but codebuff (upstream message id) and kimi (message_id)
  use raw upstream ids with no client prefix, so an identical id across
  two clients would silently drop one real message.
- Inconsistency: claude/codex/hermes/opencode already use per-client
  sets; the simple lanes did not follow that design.

Each lane now owns its dedup set. This preserves the (correct, and
likely intended) intra-client dedup — copilot spans that appear across
multiple telemetry sources still collapse — while removing the
cross-client coupling. New regression test builds a kimi and a codebuff
message sharing dedup_key "COLLIDE" and asserts both survive; it fails
against the shared-set version.

Also: the new test fixtures (streaming_msg, parity_msg, snapshot_msg)
tripped clippy::too_many_arguments, which is deny-level via
#![deny(clippy::all)], breaking `cargo clippy --all-targets`. Added the
#[allow] the existing fixtures already carry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Nanako0129

Copy link
Copy Markdown
Owner

Thanks @aiexkwan. Validated on a rebase onto main: full cargo test workspace green, clippy --all-targets clean, plus our --selftest/--smoke. The RSS sawtooth is gone here too — nice result.

I found one correctness issue and pushed a fix straight to this branch (0a3cc6a) so it stays part of this PR. scan_messages_streaming shares a single seen_keys set across the copilot/codebuff/kimi/gemini and simple file lanes, but the old report path didn't dedup those lanes at all — so "dedup semantics preserved exactly" isn't quite right for them. It splits into two cases:

  • copilot: applying its trace:span dedup_key is actually a fix — the old path double-counted spans that show up in more than one telemetry source. Good change.
  • codebuff / kimi: these use raw upstream message ids as the dedup_key, with no client namespace. With one shared set, an identical id across two clients silently drops a real message.

The fix gives each lane its own dedup set, matching the claude/codex/hermes/opencode lanes that already work this way. That keeps the copilot dedup and removes the cross-client coupling. Added a regression test (a kimi and a codebuff message sharing dedup_key "COLLIDE", both must survive — it fails against the shared set). Also added the #[allow(clippy::too_many_arguments)] the new test fixtures needed; without it clippy --all-targets fails under the crate's #![deny(clippy::all)].

One heads-up worth recording: this changes historical graph/model/hourly counts for anyone who had duplicate copilot spans (now deduped). The agents report still runs the old materialized path, so it won't match for those clients until it migrates too — I'll open a separate issue to track that.

This aggregation path and the message_cache.rs work from #2 are both good candidates to forward upstream to junhoyeo/tokscale. Merging via rebase-and-merge to keep your authorship.

@Nanako0129 Nanako0129 merged commit 0752e35 into Nanako0129:main Jun 13, 2026
1 check passed
Nanako0129 added a commit that referenced this pull request Jun 13, 2026
…able

PR #3 added a large divergence from upstream tokscale (streaming
per-file aggregation across aggregator.rs/lib.rs/sessionize.rs) that
the merge missed adding to vendor/README.md. A future re-vendor from
junhoyeo/tokscale must re-apply it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants