feat(worker): defer SERVING until child processes are warm #731
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 8c73b6f.
    push_task_timeout: float = 5,
    update_in_batches: bool = False,
    skip_awaiting_futures: bool = True,
    min_ready: int | None = None,
On one hand, I can see why having this lever could be useful. On the other hand, it adds an extra lever that needs to be configured properly. Why not simply wait for all the children to become ready?
You're right, we should reduce the number of levers where possible. The lever lets us tolerate a few stuck children so that the rest can start working. That is speculative for v1, though, so I'll remove it. We can always add it back if it proves useful in prod.
    The number of gRPC requests before touching the health check file
    """

    DEFAULT_WORKER_WARMUP_TIMEOUT_SEC = 90.0
Note that we need to account for the livenessProbe and readinessProbe values here. For example, if the livenessProbe's failure window is shorter than this timeout, the pod may be restarted before the workers have warmed up, since we don't set the status to SERVING until they are all ready.
Is this something we should worry about? If so, how should we deal with it?
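Good call. k8s offers startupProbe, which holds off the liveness probe until it succeeds, so liveness can't kill the pod mid-warmup. We would just need to make sure the startupProbe's total budget (failureThreshold × periodSeconds) comfortably exceeds warmup_timeout.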
We can also use a different mechanism for the liveness probe (so that it succeeds right away) while keeping the existing health endpoint for the readiness probe (so that it's not considered ready until SERVING).
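The startupProbe suggestion in this thread comes down to a small budget check. A minimal sketch, assuming the usual Kubernetes arithmetic of initialDelaySeconds plus failureThreshold × periodSeconds; the helper names and the 1.5× headroom factor are hypothetical and not part of this PR:

```python
# Mirrors the constant shown in the diff above.
DEFAULT_WORKER_WARMUP_TIMEOUT_SEC = 90.0


def startup_probe_budget_sec(
    period_seconds: float,
    failure_threshold: int,
    initial_delay_seconds: float = 0.0,
) -> float:
    """Longest time a startupProbe lets the container take to start."""
    return initial_delay_seconds + period_seconds * failure_threshold


def covers_warmup(
    budget_sec: float,
    warmup_timeout_sec: float = DEFAULT_WORKER_WARMUP_TIMEOUT_SEC,
    headroom: float = 1.5,  # hypothetical safety margin, tune as needed
) -> bool:
    """True if the probe budget comfortably exceeds the warmup timeout."""
    return budget_sec >= warmup_timeout_sec * headroom


# periodSeconds=5, failureThreshold=30 gives a 150 s budget.
print(covers_warmup(startup_probe_budget_sec(5, 30)))
# periodSeconds=10, failureThreshold=6 gives only 60 s.
print(covers_warmup(startup_probe_budget_sec(10, 6)))
```

A check like this could run in CI against the deployment manifest so the probe settings and the warmup timeout do not drift apart.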

Refs: STREAM-1244
In push mode, a new taskworker pod advertises gRPC health `SERVING` immediately after `server.start()`, with zero warm children. Each child must run `import_app()` + `app.load_modules()` (~30-40s) before it consumes from the queue. Because the k8s readiness probe and the GCP NEG both trust this gRPC health signal, the pod enters Envoy's routable set ~30s before it can actually serve. As a result, the child queue fills, the worker returns busy, and the broker's bounded push pool jams on the slow pushes.

This PR holds the gRPC health service at `NOT_SERVING` until children are warm. Changes:

- Each child takes a `ready_counter` (`multiprocessing.Value`) and increments it once, right before the consume loop. The increment happens after `load_modules()` and after the SIGTERM/SIGINT handlers are installed, so any child counted as warm can also drain on shutdown.
- Creates `ready_count` in the parent thread and passes it to each child.
- Calls `_await_children_warm()` between `server.start()` and the `SERVING` flip. It polls `ready_count >= min_ready` on a 250ms tick, bounded by `warmup_timeout`.

Testing:
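For reference, the warmup gate described in the Changes list can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `child_mark_ready` and `await_children_warm` are hypothetical stand-ins, and what the parent does on timeout is left to the caller because the description does not specify it.

```python
import multiprocessing
import time


def child_mark_ready(ready_counter) -> None:
    """Called once by a child, after module loading and signal handler
    setup, right before it enters its consume loop."""
    with ready_counter.get_lock():
        ready_counter.value += 1


def await_children_warm(
    ready_counter,
    min_ready: int,
    warmup_timeout: float,
    poll_interval: float = 0.25,  # 250 ms tick, as in the description
) -> bool:
    """Block until at least min_ready children have reported warm.

    Returns True once the threshold is met, or False if warmup_timeout
    elapses first. The caller decides what to do in the False case.
    """
    deadline = time.monotonic() + warmup_timeout
    while True:
        if ready_counter.value >= min_ready:
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))


# Shared integer counter, created in the parent and handed to each child.
ready_count = multiprocessing.Value("i", 0)
```

In the parent, the call would sit between starting the gRPC server and setting the health status to `SERVING`, so the pod stays out of the routable set while the children load.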