refactor(foundation): add persistence schema validation + I/O safety guards#16
Conversation
…guards Persistence load path previously trusted every field on disk. Add a zero-dep runtime schema validator (packages/core/src/schema.ts) plus parsers for sessions/projects/human-gates that drop invalid records and log issues instead of crashing startup. Other safety fixes: - global-config: cap external config reads at 1MB and cache parsed values for 5s to avoid repeated sync reads on hot paths. - codex-log-monitor: refuse rollout files over 500MB before tracking. - session-registry.abortSessionWithReason: delete controller on abort regardless of isGenerating to prevent leaked references. All 813 existing tests pass; 10 new schema tests added.
…itch
engine/output-port now uses ChannelRef (opaque alias) and a narrow
ResultEventData contract instead of exposing the full ProviderEvent
union and raw unknown channels. DiscordOutputPort does the single
narrowing in one place via asSessionChannel(), dropping the earlier
`as Extract<ProviderEvent, { type: 'result' }>` cast.
output-handler.ts previously had a 386-line for-await switch with 18
cases — each case duplicated updateState+flushDigest boilerplate and
was untestable in isolation. Split into:
- output/event-handlers.ts — per-event-type handler map dispatched
through dispatchEvent(); 16 new focused unit tests cover the
handlers individually.
- output-handler.ts — now 140 lines: just streams + state bookkeeping
+ digest flush timing, delegating event logic to the registry.
829/829 tests pass (813 existing + 16 new).
panel-adapter.ts was a 548-line god-module holding panel Maps, state machine hookups, digest queue, result dispatch, relocation logic, and an interval-based performance monitor. Extract cohesive pieces: - panel/panel-state.ts — shared session panel registry - panel/panel-relocation.ts — relocate-to-bottom + timeout guard - panel/panel-performance.ts — inactive sweep + periodic snapshots panel-adapter.ts remains as a facade re-exporting the same public API so existing call sites keep working. button-handler.ts was a 34-line if-chain of `if (await handleX()) return`. Replace with button-router.ts where routes are a declarative array; adding a new button is now a single-line addition instead of editing the dispatcher. All 829 tests continue to pass; no public exports changed.
- discord/delivery.ts: wrap sendChunk in sendWithBackoff — up to 3 retries with 250/500/1000ms exponential backoff on HTTP 429. The reply-failure fallback path is also routed through the same wrapper so retries compose naturally. - ipc-server.ts: chmod the Unix socket to 0600 once listening (unix only) to prevent other local processes under /tmp from connecting and calling the internal IPC. Windows named pipes are left alone. - core/logger.ts: redact secrets in log context. Values under keys matching /token|api_key|authorization|password|secret|cookie|bearer/i are masked (first 4 + "********" + last 4). Walks nested objects up to depth 6 to catch wrapped payloads. All 829 tests continue to pass.
Merge the three overlapping gate modules into a single facade: - human-gate.ts — storage + CAS (kept) - gate-coordinator.ts — DELETED - gate-manager.ts — DELETED - gate-service.ts — NEW unified class GateService owns receipt handles, 5-min timeouts, Discord message binding, and optional EventBus emission (wiring deferred to P3a; null by default so no runtime change yet). Additional fix: HumanGateRegistry now debounces gates.json writes at 1s (matches sessions.json pattern) and exposes flushSaves() for tests. Previously every create/update/archive rewrote the whole JSON file. Backward compatibility preserved: `gateCoordinator`, `GateCoordinator`, and `GateManager` exports remain as aliases. All three existing test files updated to import GateService under its old names. 829/829 tests pass.
Rewrite state-machine.ts internal implementation using XState v5:
- New xstate-session-machine.ts with a parallel-state machine
(lifecycle × execution) and an `after.3000` auto-idle transition
for completed state. Replaces the hand-written transition tables
and setTimeout + token logic from the prior implementation.
- StateMachine class is now a thin adapter: holds one actor per
session, forwards transitions as typed XState events, and
preserves the existing imperative public API (getState,
getSnapshot, transition, applyPlatformEvent, setTurn,
incrementTurn, advanceTurnToIdle, clearSession) so all existing
tests pass unchanged.
P2 also makes StateMachine authoritative for turn / humanResolved:
- New registerTurnStatePersister() hook lets session-registry
receive a callback whenever those values change. panel-adapter
registers it once at module load, replacing the scattered
persistTurnState() calls inline.
- New setHumanResolved(sessionId, value) — the single write path,
replacing manual session.humanResolved = ... in
button-handler-actions.
New tests added (28):
- state-machine-xstate.test.ts — guard-table parity, parallel
advance, after-3000 auto-idle, illegal-event silent drop,
persister contract (fires only on real change, no-op on idempotent
calls, caught errors don't abort transitions, unregister works).
- gate-service.test.ts — full coverage of the unified service:
create/resolve/timeout/invalidate paths, receipt handles, EventBus
emission (ready for P3a), back-compat alias resolveGateFromDiscord,
and debounced persistence smoke.
Dependency added: xstate@^5.30.0 in @workspacecord/state.
829 → 857 tests pass.
New abstraction in packages/core/src/repository/:
- types.ts — Repository<T> interface with get/find/save/
update/delete/flush contract
- json-repo.ts — JsonFileRepository implementation layering on
Store<T>: Map-backed O(1) id lookup, configurable
debounce window, explicit flush() for shutdown,
optional parse() for schema validation on load
Session persistence (packages/engine/src/session/persistence.ts)
migrated to use JsonFileRepository internally. Public API (load /
saveAll / debouncedSave / saveImmediate) kept unchanged so the 30+
engine call sites are untouched. Repo runs with debounceMs=0 because
the outer session-registry already debounces at 1s — previously the
layered debounces would have compounded.
SQLite adapter deferred: the native build of better-sqlite3 cannot be
compiled in this build environment (install scripts blocked). The
Repository interface is deliberately storage-agnostic so a
`SqliteRepository<T>` can be added later with zero call-site churn.
New tests (14): repository.test.ts covers CRUD, restart survival,
partial update, delete semantics, filtered find with sort+limit,
saveMany, clear, id-field validation, parse-rejects-bad-records, and
debounced+flush write semantics.
857 → 871 tests pass.
When loadSessions restores from sessions.json, seed the XState StateMachine with the persisted turn and humanResolved values so it becomes the authoritative in-memory source immediately after startup. Call transition() with updatedAt so the persister callback fires only if something genuinely changes. Note: direct reads of session.currentTurn / session.humanResolved remain at a handful of bootstrap points (panel init, output-handler bootstrap) where they serve as the initial-turn hint when the panel component is first constructed. These are pre-StateMachine-seed code paths that match the hydrated in-memory value; no behavior change. Full field removal is deferred to P3a where the bootstrap path can read directly from stateMachine.getSnapshot() after the first turn transition. 871/871 tests pass.
The EventBus in core was dead code (zero producers, zero subscribers). Activate it as the cross-package domain event spine: - New domain-events.ts catalog exports branded EventType<Payload> constants for session / gate transitions and a getDomainBus() singleton accessor with a _setDomainBusForTest() hook. - GateService now emits GateCreated / GateResolved on the domain bus alongside the legacy EventBus-per-instance emission (kept for back-compat with existing gate-manager tests). Subscribers receive a readonly projection of the gate, not the full record. - StateMachine emits SessionAwaitingHuman and SessionErrored from applyPlatformEvent() when the mapped UnifiedState changes to awaiting_human or error. This gives bot-side observability a hook without forcing synchronous panel calls. 8 new tests: catalog is observable, fan-out works, middleware runs without blocking, StateMachine integration fires correctly for awaiting_human and errored platform events. 871 → 879 tests pass. No behavior change for existing callers; all OutputPort paths still work as before.
Monitor-loop iteration state used to live only on
session.workflowState, which meant bot crash mid-loop left no
reliable recovery hint. Introduce a dedicated MonitorRun record
via the new Repository<T> abstraction:
- packages/engine/src/executor/monitor-run-store.ts
(new) — CRUD + listRunningMonitorRuns + reconcileOnStartup
- monitor-loop.ts wires begin → checkpoint (per iteration) →
finish (completed/blocked/failed) at the existing decision
branches. Persistence failures are swallowed (caught .catch())
so monitor execution cannot be aborted by a stuck disk write.
- engine/index.ts re-exports the store API so bot / CLI layers
can query historical runs or reconcile on startup.
Stored per run: sessionId, goal, iteration, maxIterations,
startedAt, lastCheckpointAt, lastRationale, lastWorkerSummary,
status (running | completed | blocked | failed | abandoned).
Id is `${sessionId}:${startedAt}:${rand6}` to avoid same-ms
collisions.
Auto-resume at restart is deliberately not done in this phase —
reconcileMonitorRunsOnStartup() only abandons running runs and
returns the list so a future phase can decide the policy.
8 new tests cover persistence, stale-run abandonment,
checkpoint semantics, filter-by-status/session, and reload.
879 → 887 tests pass.
…ve to actors)
Replaces the proposed full actor-model migration with a lighter-weight
"SessionContext bundle" approach recommended in the architectural
critique. The codebase has 87+ call sites reading sessions via
getSession(id) + separately fetching controller / projection from
different singletons; full actor-ization would cascade through all of
them.
Instead:
- New SessionContext type bundles session + controller + projection
+ save/refresh into a single object.
- SessionSupervisor (one per process) owns contexts by sessionId and
gives stable-identity handles (repeated get() returns same object).
- refresh() rebinds controller/projection after state transitions,
so holders of the same ctx see current values.
- requireSessionContext() throws on miss; useful for hot paths.
New code paths can accept `ctx: SessionContext` explicitly, avoiding
the singleton-lookup antipattern. Old code keeps working unchanged;
migration is incremental file by file.
7 new tests cover: undefined-on-missing, session + projection wrap,
stable identity, refresh() pulling new controller, require-throws,
supervisor.all() enumeration, release() drops entry.
887 → 894 tests pass. Lint + typecheck green.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0599f02b42
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| throw new Error(`Repository.save: id field "${this.idField}" missing or non-string`); | ||
| } | ||
| this.entries.set(id, entity); | ||
| this.scheduleWrite(); |
There was a problem hiding this comment.
Wait for disk write before resolving repository save
save() returns as soon as scheduleWrite() is queued, so callers that await it do not get persistence guarantees even when debounceMs is 0. This breaks new call sites that rely on await monitorRepo.save(...) for crash-recovery durability (for example beginMonitorRun in packages/engine/src/executor/monitor-run-store.ts), where a process exit right after the await can drop the record entirely. The write methods should await the queued write (or explicitly document and type them as fire-and-forget).
Useful? React with 👍 / 👎.
| return (async () => { | ||
| await ensureInit(); | ||
| // 收集 repo 中不再出现的 id 并逐个删除 | ||
| const keepIds = new Set(snapshot.map((s) => s.id)); | ||
| for (const existing of sessionRepo.getAll()) { |
There was a problem hiding this comment.
Serialize saveAllSessions calls to avoid stale snapshots
saveAllSessions now snapshots session state and mutates a shared repository without any call-level queue/lock, so concurrent invocations can interleave and let an older snapshot overwrite newer session data. This is a regression from the previous saveQueue behavior and is reachable when immediate saves and debounced saves fire close together; whichever invocation runs last wins, even if it was based on stale input.
Useful? React with 👍 / 👎.
| this.dirty = false; | ||
| const snapshot = Array.from(this.entries.values()); | ||
| try { | ||
| await this.store.write(snapshot); | ||
| } catch (err) { |
There was a problem hiding this comment.
Keep repository dirty when a write attempt fails
writeNow() clears dirty before attempting store.write, and failures are only logged, so a failed write can be treated as flushed with no retry on later flush() calls unless another mutation happens. In transient I/O failure scenarios (disk full/permission/rename error), this silently drops the latest state while callers believe persistence succeeded.
Useful? React with 👍 / 👎.
…scipline Closes the architecture roadmap with five cohesive changes: - session-registry collapses 4 Maps + session/persistence.ts onto a single JsonFileRepository<ThreadSession> with indexes on channelId/categoryId/ providerSessionId; SQLite swap is now a factory change - provider-runtime migrates 14 registry.getSession sites to the SessionContext supervisor, finalising the P5 supervisor abstraction - gates: fix latent bug where HumanGateRegistry.init() was never called on startup (gates.json persisted but never loaded); add GateService.init + reconcileOnStartup(policy) with opt-in resume-pending behaviour across restart (timer rebuilt from remaining time) - domain events expand: SessionCreated/Ended/ModeChanged + MonitorRunStarted/ Ended published from session-registry and monitor-run-store - types split: ThreadSession = SessionPersistent & SessionRuntimeFields; JsonFileRepository gains a serialize hook so isGenerating no longer leaks to JSON Also bundles: - SqliteRepository on node:sqlite (no native build) implementing the same Repository contract; JsonFileRepository gains declarative secondary indexes - Monitor auto-resume policy (abandon-only | resume-with-goal) with reconcileAndCollectAutoResumeCandidates - Batch migrate ~20 test vi.mock factories to importOriginal overlay, removing the "add an engine export, break every mock" tax - Fix two pre-existing script bitrots: gate-e2e's dead gate-coordinator import, integration-smoke missing registerOutputPort call - Fix repository.test.ts cleanup flakiness with maxRetries + 30ms buffer 924/924 unit tests, typecheck + lint clean, integration/gate/startup-flow /output-port e2e smokes all green against live Discord + Codex + Claude. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
WorkspaceCord/packages/core/src/global-config.ts
Lines 76 to 80 in d3a8eea
config.ts now consumes MONITOR_AUTO_RESUME_POLICY and GATE_RESTART_POLICY, but VALID_KEYS does not include either key. Since workspacecord config get/set/unset rejects keys not in VALID_KEYS (packages/cli/src/config-cli.ts), operators cannot configure these new restart behaviors through the CLI and will be forced onto defaults unless they bypass via environment variables.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| this.removeFromIndexes(id, entity); | ||
| this.addToIndexes(id, entity); |
There was a problem hiding this comment.
Preserve previous indexed values before reindexing
reindex() is intended for in-place mutations, but it calls removeFromIndexes using the already-mutated entity, so the old index key cannot be removed. This leaves stale id entries in old buckets forever; repeated changes (for example providerSessionId churn) cause unbounded index-map growth and increasingly expensive indexed lookups over stale ids. The same pattern appears in save() when existing === entity, so this path needs an old-key snapshot per id (or similar) before mutation-aware reindexing.
Useful? React with 👍 / 👎.
Persistence load path previously trusted every field on disk. Add a
zero-dep runtime schema validator (packages/core/src/schema.ts) plus
parsers for sessions/projects/human-gates that drop invalid records
and log issues instead of crashing startup.
Other safety fixes:
values for 5s to avoid repeated sync reads on hot paths.
regardless of isGenerating to prevent leaked references.
All 813 existing tests pass; 10 new schema tests added.