Skip to content

feat(worker): defer SERVING until child processes are warm#731

Merged
george-sentry merged 3 commits into
mainfrom
enochtang/stream-1244-defer-taskworker-serving-until-child-processes-are-warm-push
Jun 25, 2026
Merged

feat(worker): defer SERVING until child processes are warm#731
george-sentry merged 3 commits into
mainfrom
enochtang/stream-1244-defer-taskworker-serving-until-child-processes-are-warm-push

Conversation

@enochtangg

@enochtangg enochtangg commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Refs: STREAM-1244

In push mode, a new taskworker pod advertises gRPC health SERVING immediately after server.start(), with zero warm children. Each child must run import_app() + app.load_modules() (~30-40s) before it consumes from the queue. Because the k8s readiness probe and the GCP NEG both trust this gRPC health signal, the pod enters Envoy's routable set ~30s before it can actually serve. As a result, the child queue fills, the worker returns busy, and the broker's bounded push pool jams on the slow pushes.

This PR is responsible for holding the gRPC health service at NOT_SERVING until children are warm. Changes:

  • the child_process takes an optional ready_counter (multiprocessing.Value) and increments it once, right before the consume loop. The increment happens after load_modules() and after the SIGTERM/SIGINT handlers are installed, so any child counted as warm can also drain on shutdown.
  • Setsready_count in parent thread and passes it to each child.
  • On startup, updated the push worker to _await_children_warm() between server.start() and the SERVING flip. Polling ready_count >= min_ready every 250ms tick. This is also bounded by warmup_timeout.

Testing:

  • Tested changes locally
  • Added unit tests

@enochtangg enochtangg requested a review from a team as a code owner June 25, 2026 18:53
@linear-code

linear-code Bot commented Jun 25, 2026

Copy link
Copy Markdown

STREAM-1244

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8c73b6f. Configure here.

Comment thread clients/python/src/taskbroker_client/worker/worker.py
Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated
push_task_timeout: float = 5,
update_in_batches: bool = False,
skip_awaiting_futures: bool = True,
min_ready: int | None = None,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On one hand, I can see why having this lever could be useful. On the other hand, it adds an extra lever that needs to be configured properly. Why not simply wait for all the children to become ready?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, we should reduce the number of levers if possible. The lever allows us to tolerate some level of stuck child threads and enables other threads to start working. All that is speculative for v1 though. I'll remove this as we can always add back if we find it useful in practice in prod.

The number of gRPC requests before touching the health check file
"""

DEFAULT_WORKER_WARMUP_TIMEOUT_SEC = 90.0

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we need to account for livenessProbe and readinessProbe values here. For example, if the livenessProbe is less than this timeout, then the pod may be restarted before the workers have warmed up since we don't set the status to SERVING until they are all ready.

Is this something we should worry about? If so, how should we deal with it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, k8s offers startupProbe so that liveness can't kill the pod mid-warmup. We would just need to make sure it comfortably exceeds warmup_timeout.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also use a different mechanism for the liveness probe (so that it succeeds right away) while keeping the existing health endpoint for the readiness probe (so that it's not considered ready until SERVING).

@george-sentry george-sentry merged commit 2a03bb4 into main Jun 25, 2026
27 checks passed
@george-sentry george-sentry deleted the enochtang/stream-1244-defer-taskworker-serving-until-child-processes-are-warm-push branch June 25, 2026 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants