Skip to content

Fix task lease reconnect grace period#12541

Open
bduffany wants to merge 3 commits into
masterfrom
lease-reconnect-fix-2
Open

Fix task lease reconnect grace period#12541
bduffany wants to merge 3 commits into
masterfrom
lease-reconnect-fix-2

Conversation

@bduffany

@bduffany bduffany commented Jun 24, 2026

Copy link
Copy Markdown
Member

This PR fixes some issues with the "reconnect grace period" logic. Context: https://buildbuddy-corp.slack.com/archives/C07SND71EDC/p1782246097521369?thread_ts=1782151034.068199&cid=C07SND71EDC

The reconnect grace period was intended to allow executors to quickly re-establish a lease that was temporarily broken due to a scheduler shutdown, keeping the lease alive so the task does not have to be retried. However, this logic was not working due to some issues:

  • unclaimTask always passed the current time as the reconnect period end, instead of actually applying the grace period duration by adding it to the current time.
  • The claimTask redis script was checking for checkTaskReconnectToken == "true", but the value of that variable was effectively always either "0" or "1" due to how go-redis serializes Go bool values when executing scripts. So the reconnect token check was never running.
    • Even if that check had run, the script would have compared the stored reconnectPeriodEnd (a string returned by hget) directly against the current time (a number), which raises an error in Lua.
    • The missing-field case was also mishandled, because hget returns false (not nil) when the field is absent. Fixed by converting the field with tonumber up front and treating a nil result as "not reserved".

Fixing the above issues was enough to get the reconnect behavior mostly working as expected, except for a few other issues which are also addressed in this PR:

  • Executors treated the scheduler's supports_reconnect response bit as the condition for sending reconnect tokens. Renewal responses do not repeat reconnect fields, so after the first lease renewal the executor would stop sending its already-issued reconnect token and get rejected while the task was still in its reconnect grace period. The client now stores the issued reconnect token directly and sends it on subsequent lease requests.
  • The deferred re-enqueue in LeaseTask applied the reconnect grace to every stream close, including ordinary executor disconnects. The grace period was originally intended only to run on scheduler shutdown, so that the executor can reconnect to another app replica. An executor that simply dies should have its task re-enqueued promptly (other executors should not have to wait for the grace period to elapse). Fixed by clearing the reconnect token unless the scheduler is shutting down.
    • Note: there are other cases where a grace period makes sense too, for example a transient network error. However I think it is worth being careful about checking for the right error codes, so that we properly distinguish a normal executor shutdown from other errors. I left a TODO for now to handle these in a later PR.
  • The re-enqueue delay only applies to the reservation pushed by reEnqueueTask, not to tasks pulled from the unclaimed list when an executor asks for work. The other executor would receive the reservation and then (with a high chance) fail to claim the task because it is still in its reconnect grace period. This is not really a correctness issue, but it is wasted work. sampleUnclaimedTasks now reads reconnectPeriodEnd and skips tasks that are still reserved, so the doomed reservation and lease attempt are avoided.
  • We never cleared the reconnect grace period after the task was successfully claimed. There is an edge case here in which an executor reconnects a lease then immediately shuts down - when the executor shuts down, it will re-enqueue the task on another executor. If that executor then tries to lease the task, it will fail because the task is still in its reconnect grace period. In the worst case (no other executors have this task in their queue), the task just sits in the unclaimedTasks list and we're relying only on work stealing / executor registration to assign work. This means that if executors are busy and no new executor registrations are happening, the task can potentially sit in the unclaimedTasks list, unfairly extending its wait time.

This PR also adds a regression test - the lease reconnect behavior was previously only tested through TestAppShutdownDuringExecution_LeaseTaskRetried (in remote_execution_test.go) which only tested using a single executor. The bug is only observable when there are multiple running executors, which the new test exercises.

@bduffany bduffany force-pushed the lease-reconnect-fix-2 branch 4 times, most recently from 32fad1e to 6b4e049 Compare June 24, 2026 14:57
@bduffany bduffany changed the title Fix reconnect grace period Fix task lease reconnect grace period Jun 24, 2026
@bduffany bduffany marked this pull request as ready for review June 24, 2026 14:57
Copilot AI review requested due to automatic review settings June 24, 2026 14:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes the scheduler “lease reconnect grace period” mechanism so tasks can’t be stolen during the reconnect window, and ensures the grace period is applied only for intended shutdown scenarios. This strengthens correctness of task lease recovery during scheduler restarts and reduces wasted scheduling work.

Changes:

  • Fix Redis Lua claim script reconnect checks (boolean argument encoding + numeric comparison of reconnectPeriodEnd).
  • Extend reconnect reservation window when unclaiming with a reconnect token; avoid applying reconnect grace on non-shutdown stream closes.
  • Skip reconnect-reserved tasks during sampleUnclaimedTasks, and add a regression test covering the steal-prevention behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
enterprise/server/scheduling/scheduler_server/scheduler_server.go Fix reconnect grace period logic in Redis scripts + Go paths; skip reconnect-reserved tasks during unclaimed sampling.
enterprise/server/scheduling/scheduler_server/scheduler_server_test.go Update fake executor lease helpers and add a reconnect-grace regression test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread enterprise/server/scheduling/scheduler_server/scheduler_server.go
buildbuddy-io[bot]

This comment was marked as outdated.

@bduffany bduffany force-pushed the lease-reconnect-fix-2 branch from 6b4e049 to ec25a22 Compare June 24, 2026 18:51
@bduffany bduffany force-pushed the lease-reconnect-fix-2 branch from ec25a22 to 0e3dcf1 Compare June 24, 2026 19:00
@bduffany bduffany force-pushed the lease-reconnect-fix-2 branch from 75d6ac5 to 82c22c6 Compare June 24, 2026 21:20
@bduffany

This comment was marked as outdated.

@bduffany bduffany force-pushed the lease-reconnect-fix-2 branch 2 times, most recently from dd98006 to eb5dcf3 Compare June 24, 2026 23:26
@bduffany bduffany force-pushed the lease-reconnect-fix-2 branch from eb5dcf3 to c64b027 Compare June 24, 2026 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants