feat(worker): defer SERVING until child processes are warm by enochtangg · Pull Request #731 · getsentry/taskbroker

enochtangg · 2026-06-25T18:53:22Z

In push mode, a new taskworker pod advertises gRPC health SERVING immediately after server.start(), with zero warm children. Each child must run import_app() + app.load_modules() (~30-40s) before it consumes from the queue. Because the k8s readiness probe and the GCP NEG both trust this gRPC health signal, the pod enters Envoy's routable set ~30s before it can actually serve. As a result, the child queue fills, the worker returns busy, and the broker's bounded push pool jams on the slow pushes.

This PR is responsible for holding the gRPC health service at NOT_SERVING until children are warm. Changes:

the child_process takes an optional ready_counter (multiprocessing.Value) and increments it once, right before the consume loop. The increment happens after load_modules() and after the SIGTERM/SIGINT handlers are installed, so any child counted as warm can also drain on shutdown.
Setsready_count in parent thread and passes it to each child.
On startup, updated the push worker to _await_children_warm() between server.start() and the SERVING flip. Polling ready_count >= min_ready every 250ms tick. This is also bounded by warmup_timeout.

Testing:

Tested changes locally
Added unit tests

linear-code · 2026-06-25T18:53:28Z

STREAM-1244

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 8c73b6f. Configure here.}

george-sentry · 2026-06-25T19:45:42Z

        push_task_timeout: float = 5,
        update_in_batches: bool = False,
        skip_awaiting_futures: bool = True,
+        min_ready: int | None = None,


On one hand, I can see why having this lever could be useful. On the other hand, it adds an extra lever that needs to be configured properly. Why not simply wait for all the children to become ready?

You're right, we should reduce the number of levers if possible. The lever allows us to tolerate some level of stuck child threads and enables other threads to start working. All that is speculative for v1 though. I'll remove this as we can always add back if we find it useful in practice in prod.

george-sentry · 2026-06-25T19:51:32Z

 The number of gRPC requests before touching the health check file
 """

+DEFAULT_WORKER_WARMUP_TIMEOUT_SEC = 90.0


Note that we need to account for livenessProbe and readinessProbe values here. For example, if the livenessProbe is less than this timeout, then the pod may be restarted before the workers have warmed up since we don't set the status to SERVING until they are all ready.

Is this something we should worry about? If so, how should we deal with it?

Good call, k8s offers startupProbe so that liveness can't kill the pod mid-warmup. We would just need to make sure it comfortably exceeds warmup_timeout.

We can also use a different mechanism for the liveness probe (so that it succeeds right away) while keeping the existing health endpoint for the readiness probe (so that it's not considered ready until SERVING).

Defer SERVING until child processes are warm

8c73b6f

enochtangg requested a review from a team as a code owner June 25, 2026 18:53

cursor Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread clients/python/src/taskbroker_client/worker/worker.py

Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated

dont advertise SERVING after shutdown event

a27b15e

george-sentry reviewed Jun 25, 2026

View reviewed changes

remove min_ready

6841f94

george-sentry approved these changes Jun 25, 2026

View reviewed changes

george-sentry merged commit 2a03bb4 into main Jun 25, 2026
27 checks passed

george-sentry deleted the enochtang/stream-1244-defer-taskworker-serving-until-child-processes-are-warm-push branch June 25, 2026 21:46

sentry-release-bot Bot mentioned this pull request Jun 25, 2026

publish: getsentry/taskbroker/clients@0.20.6 getsentry/publish#8708

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat(worker): defer SERVING until child processes are warm#731

feat(worker): defer SERVING until child processes are warm#731
george-sentry merged 3 commits into
mainfrom
enochtang/stream-1244-defer-taskworker-serving-until-child-processes-are-warm-push

enochtangg commented Jun 25, 2026 •

edited

Loading

Uh oh!

linear-code Bot commented Jun 25, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

george-sentry Jun 25, 2026

Uh oh!

enochtangg Jun 25, 2026

Uh oh!

george-sentry Jun 25, 2026

Uh oh!

enochtangg Jun 25, 2026

Uh oh!

george-sentry Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

enochtangg commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

linear-code Bot commented Jun 25, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

george-sentry Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

enochtangg Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

george-sentry Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

enochtangg Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

george-sentry Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

enochtangg commented Jun 25, 2026 •

edited

Loading