Skip to content

refactor(foundation): add persistence schema validation + I/O safety guards#16

Merged
xuhongbo merged 12 commits into
mainfrom
claude/review-project-optimization-QlB2e
Apr 17, 2026
Merged

refactor(foundation): add persistence schema validation + I/O safety guards#16
xuhongbo merged 12 commits into
mainfrom
claude/review-project-optimization-QlB2e

Conversation

@xuhongbo

Copy link
Copy Markdown
Owner

Persistence load path previously trusted every field on disk. Add a
zero-dep runtime schema validator (packages/core/src/schema.ts) plus
parsers for sessions/projects/human-gates that drop invalid records
and log issues instead of crashing startup.

Other safety fixes:

  • global-config: cap external config reads at 1MB and cache parsed
    values for 5s to avoid repeated sync reads on hot paths.
  • codex-log-monitor: refuse rollout files over 500MB before tracking.
  • session-registry.abortSessionWithReason: delete controller on abort
    regardless of isGenerating to prevent leaked references.

All 813 existing tests pass; 10 new schema tests added.

claude added 11 commits April 16, 2026 16:21
…guards

Persistence load path previously trusted every field on disk. Add a
zero-dep runtime schema validator (packages/core/src/schema.ts) plus
parsers for sessions/projects/human-gates that drop invalid records
and log issues instead of crashing startup.

Other safety fixes:
- global-config: cap external config reads at 1MB and cache parsed
  values for 5s to avoid repeated sync reads on hot paths.
- codex-log-monitor: refuse rollout files over 500MB before tracking.
- session-registry.abortSessionWithReason: delete controller on abort
  regardless of isGenerating to prevent leaked references.

All 813 existing tests pass; 10 new schema tests added.
…itch

engine/output-port now uses ChannelRef (opaque alias) and a narrow
ResultEventData contract instead of exposing the full ProviderEvent
union and raw unknown channels. DiscordOutputPort does the single
narrowing in one place via asSessionChannel(), dropping the earlier
`as Extract<ProviderEvent, { type: 'result' }>` cast.

output-handler.ts previously had a 386-line for-await switch with 18
cases — each case duplicated updateState+flushDigest boilerplate and
was untestable in isolation. Split into:
  - output/event-handlers.ts — per-event-type handler map dispatched
    through dispatchEvent(); 16 new focused unit tests cover the
    handlers individually.
  - output-handler.ts — now 140 lines: just streams + state bookkeeping
    + digest flush timing, delegating event logic to the registry.

829/829 tests pass (813 existing + 16 new).
panel-adapter.ts was a 548-line god-module holding panel Maps, state
machine hookups, digest queue, result dispatch, relocation logic, and
an interval-based performance monitor. Extract cohesive pieces:
  - panel/panel-state.ts       — shared session panel registry
  - panel/panel-relocation.ts  — relocate-to-bottom + timeout guard
  - panel/panel-performance.ts — inactive sweep + periodic snapshots

panel-adapter.ts remains as a facade re-exporting the same public API
so existing call sites keep working.

button-handler.ts was a 34-line if-chain of `if (await handleX()) return`.
Replace with button-router.ts where routes are a declarative array;
adding a new button is now a single-line addition instead of editing
the dispatcher.

All 829 tests continue to pass; no public exports changed.
- discord/delivery.ts: wrap sendChunk in sendWithBackoff — up to 3
  retries with 250/500/1000ms exponential backoff on HTTP 429. The
  reply-failure fallback path is also routed through the same wrapper
  so retries compose naturally.
- ipc-server.ts: chmod the Unix socket to 0600 once listening (unix
  only) to prevent other local processes under /tmp from connecting
  and calling the internal IPC. Windows named pipes are left alone.
- core/logger.ts: redact secrets in log context. Values under keys
  matching /token|api_key|authorization|password|secret|cookie|bearer/i
  are masked (first 4 + "********" + last 4). Walks nested objects up
  to depth 6 to catch wrapped payloads.

All 829 tests continue to pass.
Merge the three overlapping gate modules into a single facade:
  - human-gate.ts       — storage + CAS (kept)
  - gate-coordinator.ts — DELETED
  - gate-manager.ts     — DELETED
  - gate-service.ts     — NEW unified class

GateService owns receipt handles, 5-min timeouts, Discord message
binding, and optional EventBus emission (wiring deferred to P3a; null
by default so no runtime change yet).

Additional fix: HumanGateRegistry now debounces gates.json writes at
1s (matches sessions.json pattern) and exposes flushSaves() for tests.
Previously every create/update/archive rewrote the whole JSON file.

Backward compatibility preserved: `gateCoordinator`, `GateCoordinator`,
and `GateManager` exports remain as aliases. All three existing test
files updated to import GateService under its old names.

829/829 tests pass.
Rewrite state-machine.ts internal implementation using XState v5:
  - New xstate-session-machine.ts with a parallel-state machine
    (lifecycle × execution) and an `after.3000` auto-idle transition
    for completed state. Replaces the hand-written transition tables
    and setTimeout + token logic from the prior implementation.
  - StateMachine class is now a thin adapter: holds one actor per
    session, forwards transitions as typed XState events, and
    preserves the existing imperative public API (getState,
    getSnapshot, transition, applyPlatformEvent, setTurn,
    incrementTurn, advanceTurnToIdle, clearSession) so all existing
    tests pass unchanged.

P2 also makes StateMachine authoritative for turn / humanResolved:
  - New registerTurnStatePersister() hook lets session-registry
    receive a callback whenever those values change. panel-adapter
    registers it once at module load, replacing the scattered
    persistTurnState() calls inline.
  - New setHumanResolved(sessionId, value) — the single write path,
    replacing manual session.humanResolved = ... in
    button-handler-actions.

New tests added (28):
  - state-machine-xstate.test.ts — guard-table parity, parallel
    advance, after-3000 auto-idle, illegal-event silent drop,
    persister contract (fires only on real change, no-op on idempotent
    calls, caught errors don't abort transitions, unregister works).
  - gate-service.test.ts — full coverage of the unified service:
    create/resolve/timeout/invalidate paths, receipt handles, EventBus
    emission (ready for P3a), back-compat alias resolveGateFromDiscord,
    and debounced persistence smoke.

Dependency added: xstate@^5.30.0 in @workspacecord/state.

829 → 857 tests pass.
New abstraction in packages/core/src/repository/:
  - types.ts         — Repository<T> interface with get/find/save/
                       update/delete/flush contract
  - json-repo.ts     — JsonFileRepository implementation layering on
                       Store<T>: Map-backed O(1) id lookup, configurable
                       debounce window, explicit flush() for shutdown,
                       optional parse() for schema validation on load

Session persistence (packages/engine/src/session/persistence.ts)
migrated to use JsonFileRepository internally. Public API (load /
saveAll / debouncedSave / saveImmediate) kept unchanged so the 30+
engine call sites are untouched. Repo runs with debounceMs=0 because
the outer session-registry already debounces at 1s — previously the
layered debounces would have compounded.

SQLite adapter deferred: the native build of better-sqlite3 cannot be
compiled in this build environment (install scripts blocked). The
Repository interface is deliberately storage-agnostic so a
`SqliteRepository<T>` can be added later with zero call-site churn.

New tests (14): repository.test.ts covers CRUD, restart survival,
partial update, delete semantics, filtered find with sort+limit,
saveMany, clear, id-field validation, parse-rejects-bad-records, and
debounced+flush write semantics.

857 → 871 tests pass.
When loadSessions restores from sessions.json, seed the XState
StateMachine with the persisted turn and humanResolved values so it
becomes the authoritative in-memory source immediately after startup.
Call transition() with updatedAt so the persister callback fires only
if something genuinely changes.

Note: direct reads of session.currentTurn / session.humanResolved
remain at a handful of bootstrap points (panel init, output-handler
bootstrap) where they serve as the initial-turn hint when the panel
component is first constructed. These are pre-StateMachine-seed code
paths that match the hydrated in-memory value; no behavior change.
Full field removal is deferred to P3a where the bootstrap path can
read directly from stateMachine.getSnapshot() after the first turn
transition.

871/871 tests pass.
The EventBus in core was dead code (zero producers, zero subscribers).
Activate it as the cross-package domain event spine:

- New domain-events.ts catalog exports branded EventType<Payload>
  constants for session / gate transitions and a getDomainBus()
  singleton accessor with a _setDomainBusForTest() hook.
- GateService now emits GateCreated / GateResolved on the domain bus
  alongside the legacy EventBus-per-instance emission (kept for
  back-compat with existing gate-manager tests). Subscribers receive
  a readonly projection of the gate, not the full record.
- StateMachine emits SessionAwaitingHuman and SessionErrored from
  applyPlatformEvent() when the mapped UnifiedState changes to
  awaiting_human or error. This gives bot-side observability a hook
  without forcing synchronous panel calls.

8 new tests: catalog is observable, fan-out works, middleware runs
without blocking, StateMachine integration fires correctly for
awaiting_human and errored platform events.

871 → 879 tests pass. No behavior change for existing callers; all
OutputPort paths still work as before.
Monitor-loop iteration state used to live only on
session.workflowState, which meant bot crash mid-loop left no
reliable recovery hint. Introduce a dedicated MonitorRun record
via the new Repository<T> abstraction:

  - packages/engine/src/executor/monitor-run-store.ts
    (new) — CRUD + listRunningMonitorRuns + reconcileOnStartup
  - monitor-loop.ts wires begin → checkpoint (per iteration) →
    finish (completed/blocked/failed) at the existing decision
    branches. Persistence failures are swallowed (caught .catch())
    so monitor execution cannot be aborted by a stuck disk write.
  - engine/index.ts re-exports the store API so bot / CLI layers
    can query historical runs or reconcile on startup.

Stored per run: sessionId, goal, iteration, maxIterations,
startedAt, lastCheckpointAt, lastRationale, lastWorkerSummary,
status (running | completed | blocked | failed | abandoned).
Id is `${sessionId}:${startedAt}:${rand6}` to avoid same-ms
collisions.

Auto-resume at restart is deliberately not done in this phase —
reconcileMonitorRunsOnStartup() only abandons running runs and
returns the list so a future phase can decide the policy.

8 new tests cover persistence, stale-run abandonment,
checkpoint semantics, filter-by-status/session, and reload.

879 → 887 tests pass.
…ve to actors)

Replaces the proposed full actor-model migration with a lighter-weight
"SessionContext bundle" approach recommended in the architectural
critique. The codebase has 87+ call sites reading sessions via
getSession(id) + separately fetching controller / projection from
different singletons; full actor-ization would cascade through all of
them.

Instead:
  - New SessionContext type bundles session + controller + projection
    + save/refresh into a single object.
  - SessionSupervisor (one per process) owns contexts by sessionId and
    gives stable-identity handles (repeated get() returns same object).
  - refresh() rebinds controller/projection after state transitions,
    so holders of the same ctx see current values.
  - requireSessionContext() throws on miss; useful for hot paths.

New code paths can accept `ctx: SessionContext` explicitly, avoiding
the singleton-lookup antipattern. Old code keeps working unchanged;
migration is incremental file by file.

7 new tests cover: undefined-on-missing, session + projection wrap,
stable identity, refresh() pulling new controller, require-throws,
supervisor.all() enumeration, release() drops entry.

887 → 894 tests pass. Lint + typecheck green.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0599f02b42

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

throw new Error(`Repository.save: id field "${this.idField}" missing or non-string`);
}
this.entries.set(id, entity);
this.scheduleWrite();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Wait for disk write before resolving repository save

save() returns as soon as scheduleWrite() is queued, so callers that await it do not get persistence guarantees even when debounceMs is 0. This breaks new call sites that rely on await monitorRepo.save(...) for crash-recovery durability (for example beginMonitorRun in packages/engine/src/executor/monitor-run-store.ts), where a process exit right after the await can drop the record entirely. The write methods should await the queued write (or explicitly document and type them as fire-and-forget).

Useful? React with 👍 / 👎.

Comment on lines +91 to +95
return (async () => {
await ensureInit();
// 收集 repo 中不再出现的 id 并逐个删除
const keepIds = new Set(snapshot.map((s) => s.id));
for (const existing of sessionRepo.getAll()) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Serialize saveAllSessions calls to avoid stale snapshots

saveAllSessions now snapshots session state and mutates a shared repository without any call-level queue/lock, so concurrent invocations can interleave and let an older snapshot overwrite newer session data. This is a regression from the previous saveQueue behavior and is reachable when immediate saves and debounced saves fire close together; whichever invocation runs last wins, even if it was based on stale input.

Useful? React with 👍 / 👎.

Comment on lines +143 to +147
this.dirty = false;
const snapshot = Array.from(this.entries.values());
try {
await this.store.write(snapshot);
} catch (err) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep repository dirty when a write attempt fails

writeNow() clears dirty before attempting store.write, and failures are only logged, so a failed write can be treated as flushed with no retry on later flush() calls unless another mutation happens. In transient I/O failure scenarios (disk full/permission/rename error), this silently drops the latest state while callers believe persistence succeeded.

Useful? React with 👍 / 👎.

…scipline

Closes the architecture roadmap with five cohesive changes:

- session-registry collapses 4 Maps + session/persistence.ts onto a single
  JsonFileRepository<ThreadSession> with indexes on channelId/categoryId/
  providerSessionId; SQLite swap is now a factory change
- provider-runtime migrates 14 registry.getSession sites to the
  SessionContext supervisor, finalising the P5 supervisor abstraction
- gates: fix latent bug where HumanGateRegistry.init() was never called
  on startup (gates.json persisted but never loaded); add GateService.init
  + reconcileOnStartup(policy) with opt-in resume-pending behaviour across
  restart (timer rebuilt from remaining time)
- domain events expand: SessionCreated/Ended/ModeChanged + MonitorRunStarted/
  Ended published from session-registry and monitor-run-store
- types split: ThreadSession = SessionPersistent & SessionRuntimeFields;
  JsonFileRepository gains a serialize hook so isGenerating no longer leaks
  to JSON

Also bundles:
- SqliteRepository on node:sqlite (no native build) implementing the same
  Repository contract; JsonFileRepository gains declarative secondary indexes
- Monitor auto-resume policy (abandon-only | resume-with-goal) with
  reconcileAndCollectAutoResumeCandidates
- Batch migrate ~20 test vi.mock factories to importOriginal overlay,
  removing the "add an engine export, break every mock" tax
- Fix two pre-existing script bitrots: gate-e2e's dead gate-coordinator
  import, integration-smoke missing registerOutputPort call
- Fix repository.test.ts cleanup flakiness with maxRetries + 30ms buffer

924/924 unit tests, typecheck + lint clean, integration/gate/startup-flow
/output-port e2e smokes all green against live Discord + Codex + Claude.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xuhongbo xuhongbo merged commit f088c41 into main Apr 17, 2026
1 check passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

'HEALTH_REPORT_ENABLED',
'HEALTH_REPORT_INTERVAL_MS',
'HEALTH_CHECK_STUCK_THRESHOLD_MS',
'HEALTH_CHECK_IDLE_THRESHOLD_MS',
'HOOK_SECRET',

P2 Badge Add new policy keys to config whitelist

config.ts now consumes MONITOR_AUTO_RESUME_POLICY and GATE_RESTART_POLICY, but VALID_KEYS does not include either key. Since workspacecord config get/set/unset rejects keys not in VALID_KEYS (packages/cli/src/config-cli.ts), operators cannot configure these new restart behaviors through the CLI and will be forced onto defaults unless they bypass via environment variables.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +211 to +212
this.removeFromIndexes(id, entity);
this.addToIndexes(id, entity);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve previous indexed values before reindexing

reindex() is intended for in-place mutations, but it calls removeFromIndexes using the already-mutated entity, so the old index key cannot be removed. This leaves stale id entries in old buckets forever; repeated changes (for example providerSessionId churn) cause unbounded index-map growth and increasingly expensive indexed lookups over stale ids. The same pattern appears in save() when existing === entity, so this path needs an old-key snapshot per id (or similar) before mutation-aware reindexing.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants