fix(scheduler): relax AdaptivePolicy cohort watermark instead of ratcheting monotonically#437
Conversation
…heting monotonically `fired_high_water` only ratcheted up and never decayed, so once a transient burst lifted it above the steady-state cohort — either concurrency peaking before in-flight requests finish, or pass-level speculation inflating a batch — every later fire whose batch landed below that peak failed the rule-3 firing test and was parked on the `last_latency` watchdog before going out. This added a ~one-fire-time stall to a large fraction of fires for the rest of the run. Relax the watermark by one toward the live cohort on a below-watermark fire so it tracks the current steady-state cohort instead of an all-time peak. Ratchet-up, cold-start, structural-cap, and watchdog behavior are unchanged, and a uniform cohort holds the watermark exactly at the cohort size (so steady-state batching and coalescing are untouched). Repro (tau2-bench airline, conc 6, 16384 max, 25-min budget): the monotonic watermark stalls 24-37% of fires (37.5% even with speculation OFF, from concurrency decline); relaxing it drops that to 0.2-0.6% and halves inter-fire p95 (104-127ms -> 55-69ms). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011YJWTtt36uwcfLcx7Ckubq
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
Walkthrough
ChangesAdaptivePolicy relaxable high-water watermark
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Great work, @evanc7007 ! |
Summary
AdaptivePolicy::fired_high_wateronly ever ratchets up and never decays. Once a transientburst lifts it above the steady-state cohort size, every later fire whose batch lands below
that peak fails the rule-3 firing test (
B >= fired_high_water) and is parked on the rule-4last_latencywatchdog before it goes out — adding a full ~one-fire-time stall to a largefraction of fires for the rest of the engine's lifetime.
This PR makes
fired_high_watertrack the recent cohort instead of an all-time peak: itstill ratchets up immediately on a meet-or-exceed fire, but relaxes by one (floored at the
observed size) on a below-watermark fire, so it follows the live cohort back down. Fixes #436
Why the current behavior is a bug
The watermark exists to avoid firing a half-empty batch when more peers are about to re-submit.
But because it never decays, any event that lifts it above the current steady state poisons
every subsequent sub-peak fire:
cohort falls from its peak (say 6) through 5, 4, 3… Each sub-peak cohort now fails rule 3 and
waits out the watchdog, even though no more peers are coming.
lifting the watermark; the normal cohort of ~4–5 is then permanently "below peak" and stalls.
So this is a general scheduler-policy defect, not a speculation-specific one — speculation is
just one trigger. (It surfaced first under speculation because that re-triggers the condition
continuously; see the speculation throughput discussion in #434's companion thread.)
Reproduction & evidence
Workload:
tau2-benchairline,--num-tasks 12 --max-concurrency 6 --seed 10,max_tokens=16384, served via an OpenAI-compatible inferlet; 25-min budget per arm; enginebuilt
PIE_PORTABLE_CUDA=1 PIE_PORTABLE_CUDA_ARCH=89, NVIDIA L40S, Qwen3-30B-A3B-GGUF (Q4_K_M).Per-fire scheduler trace parsed into inter-fire gap by batch size; "stall %" = fires whose
preceding gap exceeds 2× the dominant (at-watermark) cohort's median.
Note the spec-off control already shows 37.5 % stalled fires purely from concurrency
decline — the bug reproduces with speculation entirely disabled. With the fix, the sub-peak
cohorts that previously sat at 100–125 ms (batch sizes 4–5 while the watermark was pinned at 6)
drop to ~40–65 ms, and the relaxing-watermark spec-on arms end up cleaner than the monotonic
spec-off control. Result reproduced in both verbose and quiet logging configurations.
The change
runtime/src/inference/adaptive_policy.rsonly:Decay-by-one (rather than e.g. snapping straight to the live size, or a sliding-window max) was
chosen against the recorded fire-size sequences: a window-max doesn't relax when large batches
recur, while snapping to the size of a single fire is too eager; stepping down by one removed
~99.7 % of the watchdog stalls in replay and reacts within 1–2 fires, while ignoring a lone
anomalously small fire. (On the observed traces the cohort never fell by more than one between
consecutive fires, so a snap-to-size variant would have been indistinguishable here.)
Behavior preserved: cold-start (
fired_high_water == 0), the structural cap, and thelast_latencywatchdog are unchanged. A uniform cohort keeps the watermark exactly at thecohort size (verified by test), so steady-state batching and the existing coalescing benefit
are untouched — only the never-decaying tail behavior changes.
Tests
cargo test -p pie -- adaptive relax_high_water: ratchet-up-on-exceed, relax-after-burst(7 → settles at the live cohort of 5), steady-cohort stability (uniform load stays at the
cohort size, never inflates or under-fires), and the
relax_high_waterfloor cases.Summary by CodeRabbit