fix(queue): improve worker resilience#69
Conversation
WalkthroughThe PR adds injectable file-store abstractions for documents and fragments, updates SQLite opening and serialized transaction handling, and refactors sdpub archive access to use overlay-backed workspace file stores. It also adds snapshot-based chapter summary generation, routes the CLI to pass snapshot paths, updates chapter lookup logic, and changes build-queue persistence to use a new database path with legacy migration. Tests were updated for the archive, snapshot, and queue flows. Possibly related PRs
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches✨ Simplify code
Comment |
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/document/document.ts (1)
574-584: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winMake rollback use the injected file store too.
New files are now created via
#fileStore.writeFile(), but rollback still deletes registered paths with localunlink(). Archive-backed paths like<archive>.sdpub/book-meta.jsonare not real filesystem files, so failed sessions can leave overlays behind and mask the original error.Suggested rollback change
- await unlink(path); + await this.#fileStore.deleteFile(path);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/document/document.ts` around lines 574 - 584, Rollback cleanup in Document still uses local unlink logic instead of the injected `#fileStore`, so archive-backed paths can remain behind after failures. Update the rollback path in document.ts to delete registered created files through `#fileStore.delete/remove` (the same abstraction used by `#fileStore.writeFile`), and keep the registered-path cleanup in sync with the existing registerCreatedFile flow. Use the existing Document class methods and `#contextScope` store to locate the rollback branch that currently unlinks paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cli/queue.ts`:
- Line 179: The graph step is being started unconditionally in queue processing,
which leaves already-graphed or summary-only jobs with a
started-but-never-completed step. Update the flow in the queue handling logic
around reporter.stepStarted("graph") so it only runs after you’ve determined
graph work is actually needed, and keep the step initiation inside the sourced
branch or equivalent graph-work path that will also complete it.
In `@src/document/database.ts`:
- Around line 89-121: In `Database`'s transaction handling, make cleanup
single-path so `#transactionDepth` is decremented exactly once per transaction
and not again if `COMMIT` fails after a successful operation. Move the decrement
into a shared cleanup path that runs for both success and error cases, and keep
the root transaction cleanup around
`#executeSql("COMMIT")`/`#executeSql("ROLLBACK")` while ensuring
`#activeTransactionScope` is always cleared. Also, if `BEGIN IMMEDIATE` fails
after `#activeTransactionScope` is assigned in the root transaction branch,
clear that scope before rethrowing so later calls can start new root
transactions.
In `@src/document/document.ts`:
- Around line 394-397: The release flow in Document.release currently skips
closing the file store if flush() or `#database.close`() throws, which can leak
sdpub-backed resources. Update release() to always attempt `#fileStore.close`() by
moving it into a finally path around the flush()/#database.close() work, and
make sure errors from the primary cleanup are still surfaced appropriately while
the file store is closed regardless. Use the release, flush, `#database.close`,
and `#fileStore.close` symbols to locate the cleanup block.
- Around line 196-218: When opening a document in document.ts, the fileStore
lease acquired by resolveDatabasePath() must be cleaned up if Database.open()
fails, since DirectoryDocument.release() will never run. Add a cleanup guard
around the open path in the document opening flow (near resolveDatabasePath,
fragments.ensureCreated, and Database.open) so any exception triggers
fileStore.close()/release logic before rethrowing, preserving archive-backed
store flushing behavior.
In `@src/facade/archive.ts`:
- Around line 209-235: The archive writer in archive.ts is adding overlay entry
paths without the same allow-list validation used for source entries. Update the
archive-building flow around validateArchiveManifest, normalizeArchivePath, and
outputZipFile.addFile/addBuffer to reject or skip any overlay whose entryPath
does not pass isSdpubArchivePath() before it is written. Keep the manifest entry
allowed as-is, but ensure only valid SDPUB archive paths from overlayByPath are
emitted so a persisted overlay cannot create unexpected ZIP entries or overwrite
unrelated paths.
In `@src/facade/build-queue.ts`:
- Around line 1090-1100: The legacy queue migration in
openBuildQueueDatabase/build-queue migration logic is swallowing all failures,
which can hide real SQLite, permission, and copy errors. Update the try/catch
around legacyDatabaseHasTable and copyFile so it only ignores expected “missing
legacy DB/table” and COPYFILE_EXCL race conditions, and rethrow any other error.
Use a helper like isNodeError to inspect the error code and keep the existing
migration flow in build-queue.ts from silently falling back to a fresh database.
In `@src/facade/chapter-build.ts`:
- Around line 63-72: The summary input validation currently uses z.custom<T>()
placeholders in summaryInputSnapshotSchema, which allows arbitrary data to pass
through. Replace each placeholder with the real Zod schema for the corresponding
record type (chunks, fragmentGroups, fragments, knowledgeEdges, serial,
snakeChunks, snakeEdges, snakes) so summary-input.json is fully validated at
parse time instead of deferring bad data to later snapshot logic.
In `@src/facade/sdpub-coordinator.ts`:
- Around line 810-837: The stale-state cleanup in cleanupStaleState is deleting
active entry_locks, entry_sqlite_leases, and archive_commit_locks based only on
heartbeat_at age, but heartbeat_at is never refreshed. Update the lock/lease
lifecycle so held records are either periodically renewed while active or only
removed after verifying the owner process is dead, and make the cleanup logic in
cleanupStaleState use that stronger check before deleting rows.
- Around line 416-437: The flush logic in sdpub-coordinator should only operate
on overlays that are actually covered by the acquired locks, because
`currentOverlays` is currently re-listed after `acquireArchiveCommitLock` and
can include newly added unlocked entries. Update the flush path around
`listOverlays`, `acquireArchiveCommitLock`, and `writeSdpubArchiveWithOverlays`
so it reuses or filters to the originally locked overlay set before
rewriting/deleting anything, ensuring only locked workspaces are flushed and
removed.
In `@test/facade/build-queue.test.ts`:
- Around line 433-443: The timeout is being applied only around the awaited
result of waitForRunningJob() in the build queue test, so when it rejects the
underlying waitForRunningJobCount polling keeps running and scheduling timers in
the background. Move the timeout handling into the polling helper itself (the
waitForRunningJobCount / waitForRunningJob logic used by the queue-slot
assertions) so each poll cycle is bounded and any rejection stops further
scheduling, and apply the same change to the related queue-list assertion block
that uses listBuildJobs({ all: true }).
In `@test/facade/spine-digest-file.test.ts`:
- Around line 404-408: The assertion helper database setup in
readCoordinatorOverlays() is creating the entry_overlays table, which can hide
missing-initialization regressions. Update the Database.open usage in the test
helper to open the coordinator SQLite file read-only and remove the CREATE TABLE
statement so the assertion only reads existing state; keep the fix localized to
readCoordinatorOverlays and its Database.open call.
---
Outside diff comments:
In `@src/document/document.ts`:
- Around line 574-584: Rollback cleanup in Document still uses local unlink
logic instead of the injected `#fileStore`, so archive-backed paths can remain
behind after failures. Update the rollback path in document.ts to delete
registered created files through `#fileStore.delete/remove` (the same abstraction
used by `#fileStore.writeFile`), and keep the registered-path cleanup in sync with
the existing registerCreatedFile flow. Use the existing Document class methods
and `#contextScope` store to locate the rollback branch that currently unlinks
paths.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c5313664-3129-46d3-9aa3-47137b3a8123
📒 Files selected for processing (16)
src/cli/queue.tssrc/document/database.tssrc/document/document.tssrc/document/fragments.tssrc/facade/archive.tssrc/facade/build-queue.tssrc/facade/chapter-build.tssrc/facade/chapter.tssrc/facade/sdpub-coordinator.tssrc/facade/spine-digest-file.tstest/cli/queue.test.tstest/facade/app.test.tstest/facade/build-queue.test.tstest/facade/chapter-graph.test.tstest/facade/chapter.test.tstest/facade/spine-digest-file.test.ts
Summary
Validation