Fix build queue slot refilling#66
Conversation
Summary by CodeRabbit
WalkthroughThe 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/facade/build-queue.ts`:
- Around line 540-542: The Promise.all call in the slot execution block rejects
immediately on the first error, causing the finally block to execute and clear
heartbeats, release the lease, and close state while other slots are still
executing jobs. Replace Promise.all with Promise.allSettled when creating the
array of promises from the runSlot() calls to ensure all slots complete
execution before the finally block cleans up resources.
- Around line 559-570: The markBuildJobSucceeded call is currently inside the
same try-catch block as the job execution and start event recording, which means
if success-finalization fails, it gets caught and incorrectly recorded as a job
failure. Move the markBuildJobSucceeded call outside the try-catch block so the
catch block only handles errors from appendBuildJobEvent and executeJob. This
ensures success-finalization errors surface separately instead of being
converted into job failures.
- Around line 781-791: The UPDATE statement in the state.run() call attempts to
claim a job by updating its state from 'queued' to 'running', but the subsequent
call to requireBuildJobById(state, job.jobId) returns the job regardless of
whether the UPDATE actually succeeded (affected zero or one row). Capture the
result from state.run() to verify that at least one row was updated, confirming
the job was actually claimed before it's returned. If no rows were updated, the
job is no longer queued (possibly claimed by another process) and should not be
returned for execution.
In `@test/facade/build-queue.test.ts`:
- Around line 262-275: The cleanup calls to releaseFirst() and releaseSecond()
will not execute if the withTimeout assertion for secondStartedSignal fails,
leaving jobs blocked and the worker running. Wrap the code block containing the
addBuildJob call and the withTimeout assertion in a try/finally block to ensure
releaseFirst() and releaseSecond() are always called in the finally block.
Additionally, add a timeout message to the await firstStartedSignal call for
bounded failure diagnostics similar to the second signal timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0a0b0c7f-36e5-4039-8b2e-b2a810481b0a
📒 Files selected for processing (2)
src/facade/build-queue.tstest/facade/build-queue.test.ts
| await Promise.all( | ||
| Array.from({ length: concurrency }, async () => await runSlot()), | ||
| ); |
There was a problem hiding this comment.
Wait for all slots before releasing the worker lease.
Promise.all rejects on the first slot error, so the finally block can clear heartbeats, release the lease, and close state while other slots are still executing jobs. That can make an in-flight job look stale or leave sibling slots using a closed database.
Proposed fix
- await Promise.all(
- Array.from({ length: concurrency }, async () => await runSlot()),
- );
+ const slotErrors: unknown[] = [];
+
+ await Promise.all(
+ Array.from({ length: concurrency }, async () => {
+ try {
+ await runSlot();
+ } catch (error) {
+ stopping = true;
+ slotErrors.push(error);
+ }
+ }),
+ );
+
+ if (slotErrors.length > 0) {
+ throw slotErrors[0];
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/facade/build-queue.ts` around lines 540 - 542, The Promise.all call in
the slot execution block rejects immediately on the first error, causing the
finally block to execute and clear heartbeats, release the lease, and close
state while other slots are still executing jobs. Replace Promise.all with
Promise.allSettled when creating the array of promises from the runSlot() calls
to ensure all slots complete execution before the finally block cleans up
resources.
| try { | ||
| await appendBuildJobEvent(job, { | ||
| at: Date.now(), | ||
| jobId: job.jobId, | ||
| seq: 0, | ||
| state: "running", | ||
| type: "started", | ||
| }); | ||
| await options.executeJob(job, reporter); | ||
| await markBuildJobSucceeded(job.jobId, ownerId); | ||
| } catch (error) { | ||
| await markBuildJobFailed(job.jobId, ownerId, error); |
There was a problem hiding this comment.
Don’t convert success-finalization failures into job failures.
Line 568 is inside the same try as executeJob, so a failure while recording success is caught and then recorded as a job failure. Keep the catch scoped to the actual job execution/start path and let success-finalization errors surface separately.
Proposed fix
try {
await appendBuildJobEvent(job, {
at: Date.now(),
jobId: job.jobId,
seq: 0,
state: "running",
type: "started",
});
await options.executeJob(job, reporter);
- await markBuildJobSucceeded(job.jobId, ownerId);
} catch (error) {
await markBuildJobFailed(job.jobId, ownerId, error);
+ return;
}
+
+ await markBuildJobSucceeded(job.jobId, ownerId);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try { | |
| await appendBuildJobEvent(job, { | |
| at: Date.now(), | |
| jobId: job.jobId, | |
| seq: 0, | |
| state: "running", | |
| type: "started", | |
| }); | |
| await options.executeJob(job, reporter); | |
| await markBuildJobSucceeded(job.jobId, ownerId); | |
| } catch (error) { | |
| await markBuildJobFailed(job.jobId, ownerId, error); | |
| try { | |
| await appendBuildJobEvent(job, { | |
| at: Date.now(), | |
| jobId: job.jobId, | |
| seq: 0, | |
| state: "running", | |
| type: "started", | |
| }); | |
| await options.executeJob(job, reporter); | |
| } catch (error) { | |
| await markBuildJobFailed(job.jobId, ownerId, error); | |
| return; | |
| } | |
| await markBuildJobSucceeded(job.jobId, ownerId); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/facade/build-queue.ts` around lines 559 - 570, The markBuildJobSucceeded
call is currently inside the same try-catch block as the job execution and start
event recording, which means if success-finalization fails, it gets caught and
incorrectly recorded as a job failure. Move the markBuildJobSucceeded call
outside the try-catch block so the catch block only handles errors from
appendBuildJobEvent and executeJob. This ensures success-finalization errors
surface separately instead of being converted into job failures.
| const now = Date.now(); | ||
| await state.run( | ||
| ` | ||
| UPDATE build_jobs | ||
| SET state = 'running', owner_id = ?, owner_pid = ?, updated_at = ? | ||
| WHERE job_id = ? AND state = 'queued' | ||
| `, | ||
| [ownerId, process.pid, now, job.jobId], | ||
| ); | ||
| } | ||
|
|
||
| return await Promise.all( | ||
| jobs.map(async (job) => await requireBuildJobById(state, job.jobId)), | ||
| [ownerId, process.pid, now, job.jobId], | ||
| ); | ||
|
|
||
| return await requireBuildJobById(state, job.jobId); |
There was a problem hiding this comment.
Verify the conditional claim update before returning the job.
Line 786 allows the UPDATE to affect zero rows if the selected job is no longer queued, but Line 791 still returns that job. Re-read and verify ownerId/state before handing it to executeJob.
Proposed fix
await state.run(
`
UPDATE build_jobs
SET state = 'running', owner_id = ?, owner_pid = ?, updated_at = ?
WHERE job_id = ? AND state = 'queued'
`,
[ownerId, process.pid, now, job.jobId],
);
- return await requireBuildJobById(state, job.jobId);
+ const claimedJob = await requireBuildJobById(state, job.jobId);
+
+ if (claimedJob.ownerId !== ownerId || claimedJob.state !== "running") {
+ return undefined;
+ }
+
+ return claimedJob;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/facade/build-queue.ts` around lines 781 - 791, The UPDATE statement in
the state.run() call attempts to claim a job by updating its state from 'queued'
to 'running', but the subsequent call to requireBuildJobById(state, job.jobId)
returns the job regardless of whether the UPDATE actually succeeded (affected
zero or one row). Capture the result from state.run() to verify that at least
one row was updated, confirming the job was actually claimed before it's
returned. If no rows were updated, the job is no longer queued (possibly claimed
by another process) and should not be returned for execution.
| await firstStartedSignal; | ||
| await addBuildJob({ | ||
| archivePath: `${path}/book.sdpub`, | ||
| chapterId: 2, | ||
| target: "graph", | ||
| }); | ||
| await withTimeout( | ||
| secondStartedSignal, | ||
| "Timed out waiting for idle queue slot to claim the second job.", | ||
| ); | ||
|
|
||
| releaseFirst(); | ||
| releaseSecond(); | ||
| await worker; |
There was a problem hiding this comment.
Release the blocked jobs even when the timeout assertion fails.
If withTimeout(secondStartedSignal, ...) rejects, Lines 273-275 are skipped, leaving the first job blocked and the worker heartbeat running. Wrap the assertion path in try/finally; this also gives the first-start wait a bounded failure message.
Proposed fix
- await firstStartedSignal;
- await addBuildJob({
- archivePath: `${path}/book.sdpub`,
- chapterId: 2,
- target: "graph",
- });
- await withTimeout(
- secondStartedSignal,
- "Timed out waiting for idle queue slot to claim the second job.",
- );
-
- releaseFirst();
- releaseSecond();
- await worker;
+ try {
+ await withTimeout(
+ firstStartedSignal,
+ "Timed out waiting for the first job to start.",
+ );
+ await addBuildJob({
+ archivePath: `${path}/book.sdpub`,
+ chapterId: 2,
+ target: "graph",
+ });
+ await withTimeout(
+ secondStartedSignal,
+ "Timed out waiting for idle queue slot to claim the second job.",
+ );
+ } finally {
+ releaseFirst();
+ releaseSecond();
+ await withTimeout(
+ worker,
+ "Timed out waiting for the build worker to stop.",
+ );
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/facade/build-queue.test.ts` around lines 262 - 275, The cleanup calls to
releaseFirst() and releaseSecond() will not execute if the withTimeout assertion
for secondStartedSignal fails, leaving jobs blocked and the worker running. Wrap
the code block containing the addBuildJob call and the withTimeout assertion in
a try/finally block to ensure releaseFirst() and releaseSecond() are always
called in the finally block. Additionally, add a timeout message to the await
firstStartedSignal call for bounded failure diagnostics similar to the second
signal timeout.
Summary
Tests