Fix task lease reconnect grace period #12541
Open
bduffany wants to merge 3 commits into
Pull request overview
Fixes the scheduler “lease reconnect grace period” mechanism so tasks can’t be stolen during the reconnect window, and ensures the grace period is applied only for intended shutdown scenarios. This strengthens correctness of task lease recovery during scheduler restarts and reduces wasted scheduling work.
Changes:
- Fix Redis Lua claim script reconnect checks (boolean argument encoding + numeric comparison of `reconnectPeriodEnd`).
- Extend reconnect reservation window when unclaiming with a reconnect token; avoid applying reconnect grace on non-shutdown stream closes.
- Skip reconnect-reserved tasks during `sampleUnclaimedTasks`, and add a regression test covering the steal-prevention behavior.
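The sampling change in the last item can be pictured with a small Go sketch. It is an illustration only, not the code in `scheduler_server.go`: the `unclaimedTask` type, the `sampleClaimable` function, and the use of Unix milliseconds with `0` meaning "no reservation" are all assumptions made for the example.

```go
package main

import "fmt"

// unclaimedTask is a hypothetical, simplified view of a queued task.
type unclaimedTask struct {
	ID string
	// ReconnectPeriodEnd is assumed to be Unix milliseconds;
	// 0 is assumed to mean that no reservation exists.
	ReconnectPeriodEnd int64
}

// sampleClaimable returns up to n task IDs, skipping any task whose
// reconnect reservation has not yet expired at nowMs.
func sampleClaimable(tasks []unclaimedTask, nowMs int64, n int) []string {
	if n <= 0 {
		return nil
	}
	ids := make([]string, 0, n)
	for _, t := range tasks {
		if len(ids) == n {
			break
		}
		if t.ReconnectPeriodEnd > nowMs {
			// Still reserved for the original executor: handing this task
			// out would most likely end in a failed claim.
			continue
		}
		ids = append(ids, t.ID)
	}
	return ids
}

func main() {
	tasks := []unclaimedTask{
		{ID: "a", ReconnectPeriodEnd: 5000}, // reserved until t=5000
		{ID: "b"},                           // never reserved
		{ID: "c", ReconnectPeriodEnd: 500},  // reservation already expired
	}
	fmt.Println(sampleClaimable(tasks, 1000, 2))
}
```

Filtering at sampling time does not replace the check in the claim script; it only avoids sending reservations that are expected to fail.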
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| enterprise/server/scheduling/scheduler_server/scheduler_server.go | Fix reconnect grace period logic in Redis scripts + Go paths; skip reconnect-reserved tasks during unclaimed sampling. |
| enterprise/server/scheduling/scheduler_server/scheduler_server_test.go | Update fake executor lease helpers and add a reconnect-grace regression test. |
vadimberezniker
approved these changes
Jun 25, 2026
vanja-p
approved these changes
Jun 27, 2026
This PR fixes some issues with the "reconnect grace period" logic. Context: https://buildbuddy-corp.slack.com/archives/C07SND71EDC/p1782246097521369?thread_ts=1782151034.068199&cid=C07SND71EDC
The reconnect grace period was intended to allow executors to quickly re-establish a lease that was temporarily broken due to a scheduler shutdown, keeping the lease alive so the task does not have to be retried. However, this logic was not working due to some issues:
- `unclaimTask` always passed the current time as the reconnect period end, instead of actually applying the grace period duration by adding it to the current time.
- The `claimTask` Redis script was checking for `checkTaskReconnectToken == "true"`, but the value of that variable was effectively always either `"0"` or `"1"` due to how go-redis serializes Go `bool` values when executing scripts. So the reconnect token check was never running.
- The script compared `reconnectPeriodEnd` (a string returned by `hget`) directly against the current time (a number), which raises an error in Lua. Also, `hget` returns `false` (not `nil`) when the field is absent. Fixed by converting the field with `tonumber` up front and treating a `nil` result as "not reserved".

Fixing the above issues was enough to get the reconnect behavior mostly working as expected, except for a few other issues which are also addressed in this PR:

- The executor client used the `supports_reconnect` response bit as the condition for sending reconnect tokens. Renewal responses do not repeat reconnect fields, so after the first lease renewal the executor would stop sending its already-issued reconnect token and get rejected while the task was still in its reconnect grace period. The client now stores the issued reconnect token directly and sends it on subsequent lease requests.
- `LeaseTask` applied the reconnect grace to every stream close, including ordinary executor disconnects. The grace period was originally intended only to run on scheduler shutdown, so that the executor can reconnect to another app replica. An executor that simply dies should have its task re-enqueued promptly (other executors should not have to wait for the grace period to elapse). Fixed by clearing the reconnect token unless the scheduler is shutting down.
- The reconnect grace period check was only applied in `reEnqueueTask`, not to tasks pulled from the unclaimed list when an executor asks for work. The other executor would receive the reservation and then (with a high chance) fail to claim the task because it is still in its reconnect grace period. This is not really a correctness issue, but it is wasted work. `sampleUnclaimedTasks` now reads `reconnectPeriodEnd` and skips tasks that are still reserved, so the doomed reservation and lease attempt are avoided.

This PR also adds a regression test. The lease reconnect behavior was previously only tested through `TestAppShutdownDuringExecution_LeaseTaskRetried` (in `remote_execution_test.go`), which only used a single executor. The bug is only observable when there are multiple running executors, which the new test exercises.
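The unclaim and claim fixes described above can be sketched in plain Go, standing in for the Lua script. This is a minimal illustration under stated assumptions, not the code in this PR: the task hash is modeled as a map, times are Unix milliseconds, and `taskFields`, `encodeBoolArg`, `reserve`, `canClaim`, and the `reconnectToken` field name are hypothetical. The rule that a reserved task can only be claimed with a matching token is also an assumption about the intended behavior.

```go
package main

import (
	"fmt"
	"strconv"
)

// taskFields is a hypothetical stand-in for the Redis hash holding a task.
type taskFields map[string]string

// encodeBoolArg mirrors the encoding described above: a Go bool reaches the
// script as "1" or "0", so a comparison against "true" never matches.
func encodeBoolArg(b bool) string {
	if b {
		return "1"
	}
	return "0"
}

// reserve records a reconnect reservation that ends at now + grace,
// rather than at now (which would expire the reservation immediately).
func reserve(f taskFields, token string, nowMs, graceMs int64) {
	f["reconnectToken"] = token
	f["reconnectPeriodEnd"] = strconv.FormatInt(nowMs+graceMs, 10)
}

// reservedUntil converts the stored string to a number before any
// comparison. A missing or unparsable field means "not reserved".
func reservedUntil(f taskFields) (int64, bool) {
	raw, ok := f["reconnectPeriodEnd"]
	if !ok {
		return 0, false
	}
	end, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, false
	}
	return end, true
}

// canClaim reports whether a claim attempt at nowMs should succeed.
// Assumption: while reserved, only a caller that asks for the token
// check ("1") and presents the matching token may claim the task.
func canClaim(f taskFields, checkTokenArg, token string, nowMs int64) bool {
	end, reserved := reservedUntil(f)
	if !reserved || nowMs >= end {
		return true
	}
	return checkTokenArg == "1" && token != "" && token == f["reconnectToken"]
}

func main() {
	f := taskFields{}
	reserve(f, "tok", 1000, 5000) // reserved until t=6000

	fmt.Println(canClaim(f, encodeBoolArg(true), "other", 2000)) // wrong token
	fmt.Println(canClaim(f, encodeBoolArg(true), "tok", 2000))   // original executor
	fmt.Println(canClaim(f, encodeBoolArg(false), "", 6000))     // grace elapsed
}
```

Parsing the field once up front keeps the "absent", "malformed", and "expired" cases on the same not-reserved path, which is the role `tonumber` plays in the script.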