
fix(checkpoint): recalibrate global counters against ground truth on startup#253

Open
L2ncE wants to merge 1 commit into
TencentCloud:mainfrom
L2ncE:fix/checkpoint-recalibrate

Conversation

@L2ncE L2ncE commented Jun 25, 2026

Description

Fixes checkpoint counter drift (issue #157). The four additive-only global counters in CheckpointManager (total_memories_extracted, l0_conversations_count, total_processed, memories_since_last_persona) only ever increment. When memory-cleaner deletes expired data or JSONL files are pruned manually, these counters stay permanently inflated above the actual data and never self-correct. The inflated memories_since_last_persona can also spuriously trigger persona generation.

Adds CheckpointManager.recalibrate(), invoked once on gateway startup (per the issue's suggested fix), which re-syncs the four counters against authoritative sources:

| Counter | Source of truth |
| --- | --- |
| total_memories_extracted | records/*.jsonl total line count (append-only; includes stale dedup rows, matching the counter's semantics) |
| l0_conversations_count | distinct recordedAt values in conversations/*.jsonl (= capture count) |
| total_processed | store.countL0() (falls back to JSONL line count when the store is degraded) |
| memories_since_last_persona | records/*.jsonl lines with updatedAt > last_persona_time |

Why JSONL for counters 1/2/4: these counters accumulate write events (each store/update/merge appends a JSONL line), so the JSONL line count comes from the same source as the counter; the store deduplicates and would under-count. Counter 2 uses distinct recordedAt because the TCVDB adapter has no DISTINCT/group-by capability, so it must read the JSONL files. Counter 3 is a message count that is the same in the store and in JSONL, so the existing countL0() is reused.
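
For readers who want a concrete picture of the JSONL-based counts, here is a minimal sketch. It is not the PR's code: the helper names, the choice to operate on already-loaded file text, the numeric `updatedAt` timestamps, and the decision to count malformed lines toward the raw line total while skipping them in the parsed counts are all assumptions made for illustration.

```typescript
// Hypothetical helpers; names and record shapes are assumptions, not the PR's code.

/** Split JSONL text into non-empty, trimmed lines. */
function jsonlLines(text: string): string[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

/** Parse one JSONL line; returns undefined for malformed or non-object rows. */
function parseRow(line: string): Record<string, unknown> | undefined {
  try {
    const value: unknown = JSON.parse(line);
    return typeof value === "object" && value !== null && !Array.isArray(value)
      ? (value as Record<string, unknown>)
      : undefined;
  } catch {
    return undefined;
  }
}

/** Counter 1 (assumed semantics): every appended line counts, stale rows included. */
function countAppendedLines(text: string): number {
  return jsonlLines(text).length;
}

/** Counter 2: one capture per distinct recordedAt value. */
function countDistinctRecordedAt(text: string): number {
  const seen = new Set<string>();
  for (const line of jsonlLines(text)) {
    const recordedAt = parseRow(line)?.recordedAt;
    if (typeof recordedAt === "string" || typeof recordedAt === "number") {
      seen.add(String(recordedAt));
    }
  }
  return seen.size;
}

/** Counter 4: rows updated strictly after the last persona generation. */
function countUpdatedSince(text: string, lastPersonaTime: number): number {
  let count = 0;
  for (const line of jsonlLines(text)) {
    const updatedAt = parseRow(line)?.updatedAt;
    if (typeof updatedAt === "number" && updatedAt > lastPersonaTime) {
      count += 1;
    }
  }
  return count;
}
```

Working on file text rather than file paths keeps the sketch free of filesystem details; a real implementation would also need to enumerate and filter the shard files.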

Trigger timing: recalibrate runs inside the coreReady promise chain and is awaited, so the first agent_end (which does await coreReady) cannot overtake it. This keeps recalibration free of concurrency with the L2 repair path in pipeline-factory (which would otherwise Math.max the counters back to stale values).
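
The ordering guarantee described above can be illustrated with a small sketch. Only `coreReady`, `recalibrate`, and `agent_end` come from the description; `initCore`, `onAgentEnd`, the `events` log, and the timer delay are hypothetical stand-ins, so treat this as a model of the idea rather than the gateway's actual startup code.

```typescript
// Hypothetical stand-ins; only the coreReady / recalibrate / agent_end names
// come from the PR description.
const events: string[] = [];

async function initCore(): Promise<void> {
  events.push("core initialised");
}

async function recalibrate(): Promise<void> {
  // Wait briefly to mimic asynchronous file and store reads.
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  events.push("recalibrated");
}

// Recalibration is part of the chain, so coreReady settles only after it finishes.
const coreReady: Promise<void> = initCore().then(() => recalibrate());

async function onAgentEnd(): Promise<void> {
  await coreReady; // cannot proceed until recalibration has completed
  events.push("agent_end handled");
}
```

Because the handler awaits the same promise that recalibration is chained into, calling `onAgentEnd()` immediately after startup still records "recalibrated" before "agent_end handled".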

Incremental extraction gates on per-session cursors (last_l1_cursor / last_extraction_updated_time), not on these global counters, so counter drift cannot cause records to be skipped — issue acceptance #3 is satisfied by the existing design.

Related Issue

Fix #157

Change Type

  • Bug fix
  • New feature
  • Documentation update
  • Code optimization

Self-test Checklist

  • Verified locally
  • No existing features affected

Additional Notes

  • Issue repro coverage: the manual-JSONL-pruning repro from the issue is covered by a unit test (issue repro: manual JSONL pruning): seed 5 L1 lines → counter = 5 → prune to 2 lines → counter still 5 (drift) → recalibrate → counter = 2. A cleaner-style "delete expired shard file" case is also covered.
  • Scope note: the repro in issue #157 also mentions deleting pipeline_states entries. pipeline_states holds per-session incremental cursors (positional semantics), not global counters; deleting them causes re-processing, not counter drift, and is a separate concern outside this PR. Only the JSONL pruning affects the four counters and is fixed here.
  • Known trade-off: recalibrate runs only at startup (as the issue requests). Between a memory-cleaner run and the next restart the counters remain inflated; this is acceptable since drift is gradual and the next startup corrects it.
  • Test suite: 84 tests pass (npx vitest run), including new unit tests for the count helpers (shard filtering, malformed-line handling, B2 bad-row invariant) and integration tests reproducing the drift and correction.
  • CHANGELOG: updated under [Unreleased]🐛 Bug 修复.
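
The pruning repro from the notes above (seed 5 lines, prune to 2, recalibrate) can be modelled in memory as follows. This is only a sketch under stated assumptions: `DriftDemo` and its methods are hypothetical, an array stands in for a records/*.jsonl shard, and the PR's actual tests run against real files under vitest.

```typescript
// Hypothetical in-memory model: `lines` stands in for a records/*.jsonl shard.
class DriftDemo {
  private lines: string[] = [];
  totalMemoriesExtracted = 0;

  /** Appending a record increments the additive-only counter. */
  append(row: object): void {
    this.lines.push(JSON.stringify(row));
    this.totalMemoriesExtracted += 1;
  }

  /** Pruning removes lines but, as in the reported bug, leaves the counter alone. */
  pruneTo(keep: number): void {
    this.lines = this.lines.slice(0, keep);
  }

  /** Recalibration overwrites the counter with the actual line count. */
  recalibrate(): void {
    this.totalMemoriesExtracted = this.lines.length;
  }
}

const demo = new DriftDemo();
for (let i = 0; i < 5; i += 1) demo.append({ id: i });
const afterSeed = demo.totalMemoriesExtracted; // 5
demo.pruneTo(2);
const afterPrune = demo.totalMemoriesExtracted; // still 5: the drift
demo.recalibrate();
const afterRecalibrate = demo.totalMemoriesExtracted; // 2
```

Note that recalibration assigns the counter rather than taking a maximum with the old value, which is what lets the count move downward.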

…startup

Signed-off-by: L2ncE <llance_24@foxmail.com>
@Maxwell-Code07

Collaborator

@L2ncE Welcome as a first-time contributor! Checkpoint counter drift (#157) is a long-standing issue — the recalibrate-on-startup approach is clean and well-implemented. Thanks for the contribution! 👍


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[good first issue]🎯 fix(data): checkpoint counters never decrease — drift from actual data after cleanup

2 participants