Skip to content

fix(queue): improve worker resilience#69

Merged
Moskize91 merged 11 commits into
mainfrom
fix/queue-worker-resilience
Jun 26, 2026
Merged

fix(queue): improve worker resilience#69
Moskize91 merged 11 commits into
mainfrom
fix/queue-worker-resilience

Conversation

@Moskize91

Copy link
Copy Markdown
Contributor

Summary

  • make queue jobs fail and release slots when worker tasks throw
  • split queue and sdpub coordinator state to avoid cross-database lock contention
  • coordinate sdpub entry overlays and flush only the current archive
  • recover stale queue jobs atomically and expose preparation step state

Validation

  • pnpm verify
  • pnpm test:run
  • large local sdpub queue run: 114 succeeded, 0 failed, coordinator locks/overlays cleared

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

The PR adds injectable file-store abstractions for documents and fragments, updates SQLite opening and serialized transaction handling, and refactors sdpub archive access to use overlay-backed workspace file stores. It also adds snapshot-based chapter summary generation, routes the CLI to pass snapshot paths, updates chapter lookup logic, and changes build-queue persistence to use a new database path with legacy migration. Tests were updated for the archive, snapshot, and queue flows.

Possibly related PRs

  • oomol-lab/spinedigest#65: Shares the sdpub coordinator and queue execution area, including the src/facade/sdpub-coordinator.ts and src/facade/spine-digest-file.ts paths that this PR extends.
  • oomol-lab/spinedigest#67: Also updates src/cli/queue.ts around snapshot-based chapter summary generation and the buildChapterSummaryArtifactFromSnapshot(...) call path.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description matches the changeset, covering queue resilience, coordinator splitting, overlay handling, and stale job recovery.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title matches the required format and accurately summarizes the queue worker resilience changes.
✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch fix/queue-worker-resilience

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/document/document.ts (1)

574-584: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Make rollback use the injected file store too.

New files are now created via #fileStore.writeFile(), but rollback still deletes registered paths with local unlink(). Archive-backed paths like <archive>.sdpub/book-meta.json are not real filesystem files, so failed sessions can leave overlays behind and mask the original error.

Suggested rollback change
-        await unlink(path);
+        await this.#fileStore.deleteFile(path);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/document/document.ts` around lines 574 - 584, Rollback cleanup in
Document still uses local unlink logic instead of the injected `#fileStore`, so
archive-backed paths can remain behind after failures. Update the rollback path
in document.ts to delete registered created files through
`#fileStore.delete/remove` (the same abstraction used by `#fileStore.writeFile`),
and keep the registered-path cleanup in sync with the existing
registerCreatedFile flow. Use the existing Document class methods and
`#contextScope` store to locate the rollback branch that currently unlinks paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cli/queue.ts`:
- Line 179: The graph step is being started unconditionally in queue processing,
which leaves already-graphed or summary-only jobs with a
started-but-never-completed step. Update the flow in the queue handling logic
around reporter.stepStarted("graph") so it only runs after you’ve determined
graph work is actually needed, and keep the step initiation inside the sourced
branch or equivalent graph-work path that will also complete it.

In `@src/document/database.ts`:
- Around line 89-121: In `Database`'s transaction handling, make cleanup
single-path so `#transactionDepth` is decremented exactly once per transaction
and not again if `COMMIT` fails after a successful operation. Move the decrement
into a shared cleanup path that runs for both success and error cases, and keep
the root transaction cleanup around
`#executeSql("COMMIT")`/`#executeSql("ROLLBACK")` while ensuring
`#activeTransactionScope` is always cleared. Also, if `BEGIN IMMEDIATE` fails
after `#activeTransactionScope` is assigned in the root transaction branch,
clear that scope before rethrowing so later calls can start new root
transactions.

In `@src/document/document.ts`:
- Around line 394-397: The release flow in Document.release currently skips
closing the file store if flush() or `#database.close`() throws, which can leak
sdpub-backed resources. Update release() to always attempt `#fileStore.close`() by
moving it into a finally path around the flush()/#database.close() work, and
make sure errors from the primary cleanup are still surfaced appropriately while
the file store is closed regardless. Use the release, flush, `#database.close`,
and `#fileStore.close` symbols to locate the cleanup block.
- Around line 196-218: When opening a document in document.ts, the fileStore
lease acquired by resolveDatabasePath() must be cleaned up if Database.open()
fails, since DirectoryDocument.release() will never run. Add a cleanup guard
around the open path in the document opening flow (near resolveDatabasePath,
fragments.ensureCreated, and Database.open) so any exception triggers
fileStore.close()/release logic before rethrowing, preserving archive-backed
store flushing behavior.

In `@src/facade/archive.ts`:
- Around line 209-235: The archive writer in archive.ts is adding overlay entry
paths without the same allow-list validation used for source entries. Update the
archive-building flow around validateArchiveManifest, normalizeArchivePath, and
outputZipFile.addFile/addBuffer to reject or skip any overlay whose entryPath
does not pass isSdpubArchivePath() before it is written. Keep the manifest entry
allowed as-is, but ensure only valid SDPUB archive paths from overlayByPath are
emitted so a persisted overlay cannot create unexpected ZIP entries or overwrite
unrelated paths.

In `@src/facade/build-queue.ts`:
- Around line 1090-1100: The legacy queue migration in
openBuildQueueDatabase/build-queue migration logic is swallowing all failures,
which can hide real SQLite, permission, and copy errors. Update the try/catch
around legacyDatabaseHasTable and copyFile so it only ignores expected “missing
legacy DB/table” and COPYFILE_EXCL race conditions, and rethrow any other error.
Use a helper like isNodeError to inspect the error code and keep the existing
migration flow in build-queue.ts from silently falling back to a fresh database.

In `@src/facade/chapter-build.ts`:
- Around line 63-72: The summary input validation currently uses z.custom<T>()
placeholders in summaryInputSnapshotSchema, which allows arbitrary data to pass
through. Replace each placeholder with the real Zod schema for the corresponding
record type (chunks, fragmentGroups, fragments, knowledgeEdges, serial,
snakeChunks, snakeEdges, snakes) so summary-input.json is fully validated at
parse time instead of deferring bad data to later snapshot logic.

In `@src/facade/sdpub-coordinator.ts`:
- Around line 810-837: The stale-state cleanup in cleanupStaleState is deleting
active entry_locks, entry_sqlite_leases, and archive_commit_locks based only on
heartbeat_at age, but heartbeat_at is never refreshed. Update the lock/lease
lifecycle so held records are either periodically renewed while active or only
removed after verifying the owner process is dead, and make the cleanup logic in
cleanupStaleState use that stronger check before deleting rows.
- Around line 416-437: The flush logic in sdpub-coordinator should only operate
on overlays that are actually covered by the acquired locks, because
`currentOverlays` is currently re-listed after `acquireArchiveCommitLock` and
can include newly added unlocked entries. Update the flush path around
`listOverlays`, `acquireArchiveCommitLock`, and `writeSdpubArchiveWithOverlays`
so it reuses or filters to the originally locked overlay set before
rewriting/deleting anything, ensuring only locked workspaces are flushed and
removed.

In `@test/facade/build-queue.test.ts`:
- Around line 433-443: The timeout is being applied only around the awaited
result of waitForRunningJob() in the build queue test, so when it rejects the
underlying waitForRunningJobCount polling keeps running and scheduling timers in
the background. Move the timeout handling into the polling helper itself (the
waitForRunningJobCount / waitForRunningJob logic used by the queue-slot
assertions) so each poll cycle is bounded and any rejection stops further
scheduling, and apply the same change to the related queue-list assertion block
that uses listBuildJobs({ all: true }).

In `@test/facade/spine-digest-file.test.ts`:
- Around line 404-408: The assertion helper database setup in
readCoordinatorOverlays() is creating the entry_overlays table, which can hide
missing-initialization regressions. Update the Database.open usage in the test
helper to open the coordinator SQLite file read-only and remove the CREATE TABLE
statement so the assertion only reads existing state; keep the fix localized to
readCoordinatorOverlays and its Database.open call.

---

Outside diff comments:
In `@src/document/document.ts`:
- Around line 574-584: Rollback cleanup in Document still uses local unlink
logic instead of the injected `#fileStore`, so archive-backed paths can remain
behind after failures. Update the rollback path in document.ts to delete
registered created files through `#fileStore.delete/remove` (the same abstraction
used by `#fileStore.writeFile`), and keep the registered-path cleanup in sync with
the existing registerCreatedFile flow. Use the existing Document class methods
and `#contextScope` store to locate the rollback branch that currently unlinks
paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c5313664-3129-46d3-9aa3-47137b3a8123

📥 Commits

Reviewing files that changed from the base of the PR and between 55ce42a and dae63fb.

📒 Files selected for processing (16)
  • src/cli/queue.ts
  • src/document/database.ts
  • src/document/document.ts
  • src/document/fragments.ts
  • src/facade/archive.ts
  • src/facade/build-queue.ts
  • src/facade/chapter-build.ts
  • src/facade/chapter.ts
  • src/facade/sdpub-coordinator.ts
  • src/facade/spine-digest-file.ts
  • test/cli/queue.test.ts
  • test/facade/app.test.ts
  • test/facade/build-queue.test.ts
  • test/facade/chapter-graph.test.ts
  • test/facade/chapter.test.ts
  • test/facade/spine-digest-file.test.ts

Comment thread src/cli/queue.ts Outdated
Comment thread src/document/database.ts Outdated
Comment thread src/document/document.ts Outdated
Comment thread src/document/document.ts Outdated
Comment thread src/facade/archive.ts
Comment thread src/facade/chapter-build.ts
Comment thread src/facade/sdpub-coordinator.ts Outdated
Comment thread src/facade/sdpub-coordinator.ts
Comment thread test/facade/build-queue.test.ts Outdated
Comment thread test/facade/spine-digest-file.test.ts
@Moskize91 Moskize91 changed the title Fix queue worker resilience fix(queue): improve worker resilience Jun 26, 2026
@Moskize91 Moskize91 merged commit 0828c0e into main Jun 26, 2026
3 checks passed
@Moskize91 Moskize91 deleted the fix/queue-worker-resilience branch June 26, 2026 01:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant