fix(scheduler): relax AdaptivePolicy cohort watermark instead of ratcheting monotonically #437

Merged
ingim merged 1 commit into pie-project:main from evanc7007:fix/adaptive-highwater-decay
Jun 19, 2026

@evanc7007 evanc7007 commented Jun 19, 2026

Contributor

Summary

AdaptivePolicy::fired_high_water only ever ratchets up and never decays. Once a transient
burst lifts it above the steady-state cohort size, every later fire whose batch lands below
that peak fails the rule-3 firing test (B >= fired_high_water) and is parked on the rule-4
last_latency watchdog before it goes out — adding a full ~one-fire-time stall to a large
fraction of fires for the rest of the engine's lifetime.

This PR makes fired_high_water track the recent cohort instead of an all-time peak: it
still ratchets up immediately on a meet-or-exceed fire, but relaxes by one (floored at the
observed size) on a below-watermark fire, so it follows the live cohort back down. Fixes #436

Why the current behavior is a bug

The watermark exists to avoid firing a half-empty batch when more peers are about to re-submit.
But because it never decays, any event that lifts it above the current steady state poisons
every subsequent sub-peak fire:

  • Concurrency decline (no speculation needed): as in-flight requests finish, the active
    cohort falls from its peak (say 6) through 5, 4, 3… Each sub-peak cohort now fails rule 3 and
    waits out the watchdog, even though no more peers are coming.
  • Pass-level speculation: speculative pre-fires transiently inflate a batch (e.g. 4 → 6–7),
    lifting the watermark; the normal cohort of ~4–5 is then permanently "below peak" and stalls.

So this is a general scheduler-policy defect, not a speculation-specific one — speculation is
just one trigger. (It surfaced first under speculation because that re-triggers the condition
continuously; see the speculation throughput discussion in #434's companion thread.)
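The concurrency-decline case can be replayed with a few lines of Rust. This is a minimal sketch, not the engine's code: `monotonic_high_water` and `stalled_fires_monotonic` are hypothetical names, the rule-3 test is reduced to the `B >= fired_high_water` comparison quoted in the summary, and the watchdog is modelled only as "this fire would have waited".

```rust
/// Pre-fix behavior as described above: the watermark only ever ratchets up.
/// (Hypothetical helper name; the real update lives in `on_fired`.)
fn monotonic_high_water(current: usize, fired_size: usize) -> usize {
    current.max(fired_size)
}

/// Count the fires whose batch size fails `B >= fired_high_water` and would
/// therefore be parked on the watchdog before going out.
fn stalled_fires_monotonic(fired_sizes: &[usize]) -> usize {
    let mut high_water = 0usize; // cold start
    let mut stalled = 0usize;
    for &batch in fired_sizes {
        if batch < high_water {
            stalled += 1; // below the all-time peak: waits for the watchdog
        }
        high_water = monotonic_high_water(high_water, batch);
    }
    stalled
}
```

For a cohort that declines as `[6, 6, 5, 5, 4, 4, 3, 3]`, every fire after the first two falls below the pinned watermark of 6, so six of the eight fires count as stalled in this model.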

Reproduction & evidence

Workload: tau2-bench airline, --num-tasks 12 --max-concurrency 6 --seed 10,
max_tokens=16384, served via an OpenAI-compatible inferlet; 25-min budget per arm; engine
built PIE_PORTABLE_CUDA=1 PIE_PORTABLE_CUDA_ARCH=89, NVIDIA L40S, Qwen3-30B-A3B-GGUF (Q4_K_M).
Per-fire scheduler trace parsed into inter-fire gap by batch size; "stall %" = fires whose
preceding gap exceeds 2× the dominant (at-watermark) cohort's median.

| arm | watermark | stall % | inter-fire p95 |
|---|---|---|---|
| spec-off (no speculation) | monotonic (today) | 37.5% | 104 ms |
| spec-on | monotonic (today) | 23.9–25.2% | 117–127 ms |
| spec-on | relaxing (this PR) | 0.2–0.6% | 55–69 ms |

Note that the spec-off control already shows 37.5% stalled fires purely from concurrency
decline: the bug reproduces with speculation entirely disabled. With the fix, the sub-peak
cohorts that previously sat at 100–125 ms (batch sizes 4–5 while the watermark was pinned at 6)
drop to ~40–65 ms, and the relaxing-watermark spec-on arms end up cleaner than the monotonic
spec-off control. The result was reproduced in both verbose and quiet logging configurations.
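The "stall %" metric defined above can be sketched as follows. This is an illustration under stated assumptions rather than the actual trace parser: the `(batch_size, preceding_gap_ms)` record layout and the function names `median` and `stall_percent` are hypothetical, and the dominant cohort size is passed in by the caller instead of being detected from the trace.

```rust
/// Median of a slice of gaps; `None` when the slice is empty.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Percentage of fires whose preceding gap exceeds 2x the median gap of the
/// dominant (at-watermark) cohort. Each record is `(batch_size, gap_ms)`.
/// Returns `None` when no fire has the dominant batch size.
fn stall_percent(fires: &[(usize, f64)], dominant_size: usize) -> Option<f64> {
    let dominant_gaps: Vec<f64> = fires
        .iter()
        .filter(|(size, _)| *size == dominant_size)
        .map(|(_, gap)| *gap)
        .collect();
    let threshold = 2.0 * median(&dominant_gaps)?;
    let stalled = fires.iter().filter(|(_, gap)| *gap > threshold).count();
    Some(100.0 * stalled as f64 / fires.len() as f64)
}
```

With made-up records where size-6 fires sit near 50 ms and two sub-peak fires sit at 110 ms and 120 ms, the threshold lands at about 101 ms and only the two sub-peak fires are counted as stalls.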

The change

runtime/src/inference/adaptive_policy.rs only:

fn relax_high_water(current: usize, fired_size: usize) -> usize {
    if fired_size >= current { fired_size } else { current.saturating_sub(1) }
}
// on_fired: self.fired_high_water = relax_high_water(self.fired_high_water, fired_size);
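
To show how the helper and the `on_fired` call site fit together, here is a self-contained sketch. `PolicySketch` and `meets_watermark` are hypothetical stand-ins: the real `AdaptivePolicy` has more state (the structural cap, `last_latency`, the other firing rules) that is deliberately left out, and rule 3 is reduced to the `B >= fired_high_water` comparison from the summary.

```rust
/// Helper as shown in the PR description.
fn relax_high_water(current: usize, fired_size: usize) -> usize {
    if fired_size >= current { fired_size } else { current.saturating_sub(1) }
}

/// Hypothetical stand-in for `AdaptivePolicy`; only the watermark is modelled.
#[derive(Debug, Default)]
struct PolicySketch {
    fired_high_water: usize,
}

impl PolicySketch {
    /// Rule-3 test as described in the summary: `B >= fired_high_water`.
    fn meets_watermark(&self, batch_size: usize) -> bool {
        batch_size >= self.fired_high_water
    }

    /// Update the watermark after a batch of `fired_size` has gone out.
    fn on_fired(&mut self, fired_size: usize) {
        self.fired_high_water = relax_high_water(self.fired_high_water, fired_size);
    }
}
```

Starting from the cold-start value of 0, a fire of 6 lifts the watermark to 6; a following fire of 5 fails the rule-3 test once, relaxes the watermark to 5, and the next batch of 5 meets it again.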

Decay-by-one (rather than e.g. snapping straight to the live size, or a sliding-window max) was
chosen against the recorded fire-size sequences: a window-max doesn't relax when large batches
recur, while snapping to the size of a single fire is too eager; stepping down by one removed
~99.7 % of the watchdog stalls in replay and reacts within 1–2 fires, while ignoring a lone
anomalously small fire. (On the observed traces the cohort never fell by more than one between
consecutive fires, so a snap-to-size variant would have been indistinguishable here.)
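
The three candidates can be compared on a synthetic fire-size sequence. This is a sketch for intuition only: `snap_to_size`, `trace`, and `window_max_trace` are hypothetical names, the window length of 3 used in the note below is an arbitrary assumption, and the sequence is invented rather than taken from the recorded traces.

```rust
use std::collections::VecDeque;

/// Decay-by-one helper as shown in the PR description.
fn relax_high_water(current: usize, fired_size: usize) -> usize {
    if fired_size >= current { fired_size } else { current.saturating_sub(1) }
}

/// Alternative: snap the watermark straight to the size of the latest fire.
fn snap_to_size(_current: usize, fired_size: usize) -> usize {
    fired_size
}

/// Watermark after each fire for a stateless update rule, from cold start (0).
fn trace(fired_sizes: &[usize], update: fn(usize, usize) -> usize) -> Vec<usize> {
    let mut high_water = 0;
    fired_sizes
        .iter()
        .map(|&size| {
            high_water = update(high_water, size);
            high_water
        })
        .collect()
}

/// Alternative: watermark = max over the last `window` fires (window >= 1).
fn window_max_trace(fired_sizes: &[usize], window: usize) -> Vec<usize> {
    let mut recent: VecDeque<usize> = VecDeque::with_capacity(window);
    let mut out = Vec::with_capacity(fired_sizes.len());
    for &size in fired_sizes {
        if recent.len() == window {
            recent.pop_front(); // drop the oldest fire
        }
        recent.push_back(size);
        out.push(recent.iter().copied().max().unwrap_or(0));
    }
    out
}
```

On `[5, 5, 7, 5, 5, 5, 2, 5]` (a burst to 7, then a lone small fire of 2), decay-by-one yields `[5, 5, 7, 6, 5, 5, 4, 5]`, snap-to-size yields `[5, 5, 7, 5, 5, 5, 2, 5]`, and a window-max of length 3 yields `[5, 5, 7, 7, 7, 5, 5, 5]`. Decay-by-one is back at the cohort size two fires after the burst and dips only to 4 on the lone small fire, whereas snapping drops to 2 and the window holds the peak for its full length.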

Behavior preserved: cold-start (fired_high_water == 0), the structural cap, and the
last_latency watchdog are unchanged. A uniform cohort keeps the watermark exactly at the
cohort size (verified by test), so steady-state batching and the existing coalescing benefit
are untouched — only the never-decaying tail behavior changes.

Tests

cargo test -p pie -- adaptive relax_high_water covers ratchet-up-on-exceed, relax-after-burst
(7 → settles at the live cohort of 5), steady-cohort stability (uniform load stays at the
cohort size, never inflates or under-fires), and the relax_high_water floor cases.
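
The scenarios listed above might look roughly like the following. These are not the tests from the PR: the test names, the `settle` helper, and the concrete sequences are assumptions chosen to mirror the four described cases.

```rust
/// Helper as shown in the PR description.
fn relax_high_water(current: usize, fired_size: usize) -> usize {
    if fired_size >= current { fired_size } else { current.saturating_sub(1) }
}

/// Fold a sequence of fired batch sizes through the helper and return the
/// final watermark. Hypothetical convenience for the sketches below.
fn settle(start: usize, fired_sizes: &[usize]) -> usize {
    fired_sizes.iter().fold(start, |hw, &size| relax_high_water(hw, size))
}

#[test]
fn ratchets_up_on_exceed() {
    assert_eq!(settle(4, &[6]), 6);
}

#[test]
fn relaxes_after_burst() {
    // Burst to 7, then the live cohort of 5: 7 -> 6 -> 5, then holds at 5.
    assert_eq!(settle(0, &[5, 7, 5, 5, 5]), 5);
}

#[test]
fn steady_cohort_is_stable() {
    assert_eq!(settle(0, &[5; 20]), 5);
}

#[test]
fn floor_cases() {
    assert_eq!(relax_high_water(0, 0), 0); // cold start stays at 0
    assert_eq!(relax_high_water(1, 0), 0); // saturating step down
    assert_eq!(relax_high_water(7, 2), 6); // one step, not a snap to 2
}
```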

Summary by CodeRabbit

  • Improvements
    • Adaptive policy watermark behavior enhanced to dynamically respond to workload changes with both upward and downward adjustments, providing improved flexibility in system adaptation.
  • Documentation
    • Updated documentation reflecting the new adaptive policy watermark adjustment patterns.
  • Tests
    • Expanded unit tests to validate watermark behavior under various workload scenarios and adjustment patterns.

…heting monotonically

`fired_high_water` only ratcheted up and never decayed, so once a transient
burst lifted it above the steady-state cohort — either concurrency peaking
before in-flight requests finish, or pass-level speculation inflating a batch —
every later fire whose batch landed below that peak failed the rule-3 firing
test and was parked on the `last_latency` watchdog before going out. This added
a ~one-fire-time stall to a large fraction of fires for the rest of the run.

Relax the watermark by one toward the live cohort on a below-watermark fire so
it tracks the current steady-state cohort instead of an all-time peak.
Ratchet-up, cold-start, structural-cap, and watchdog behavior are unchanged,
and a uniform cohort holds the watermark exactly at the cohort size (so
steady-state batching and coalescing are untouched).

Repro (tau2-bench airline, conc 6, 16384 max, 25-min budget): the monotonic
watermark stalls 24-37% of fires (37.5% even with speculation OFF, from
concurrency decline); relaxing it drops that to 0.2-0.6% and halves inter-fire
p95 (104-127ms -> 55-69ms).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011YJWTtt36uwcfLcx7Ckubq

coderabbitai Bot commented Jun 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 464de69e-d8be-4547-b331-f3ac7c24fb4c

📥 Commits

Reviewing files that changed from the base of the PR and between 905498f and 12322c3.

📒 Files selected for processing (1)
  • runtime/src/inference/adaptive_policy.rs

Walkthrough

AdaptivePolicy's fired_high_water watermark is changed from strictly monotonic to a relaxable recent-cohort high-water. A new private relax_high_water helper ratchets up when fired_size >= current and steps the watermark down by one (saturating at 0) otherwise. on_fired now calls this helper, and the firing-rule comment, field documentation, and unit tests are updated accordingly.

Changes

AdaptivePolicy relaxable high-water watermark

| Layer / File(s) | Summary |
|---|---|
| relax_high_water helper, docs, and on_fired call site (runtime/src/inference/adaptive_policy.rs) | Adds the private relax_high_water function implementing ratchet-up / step-down-by-one logic; expands fired_high_water field documentation to describe the non-monotonic recent high-water semantics; updates the firing-rule 3 inline comment; changes on_fired to call relax_high_water instead of the old upward-only conditional assignment. |
| Updated unit tests (runtime/src/inference/adaptive_policy.rs) | Removes the old monotonic-only watermark assertion; adds test cases for immediate ratcheting up, post-burst relaxation under sustained smaller cohorts, stability at a constant cohort size, and direct relax_high_water step-down behavior. |

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(scheduler): relax AdaptivePolicy cohort watermark instead of ratcheting monotonically' directly and specifically describes the main code change: replacing monotonic (upward-only) watermark behavior with a relaxation mechanism that can step down. |
| Linked Issues check | ✅ Passed | The PR addresses the monotonic watermark defect identified in issue #436 as a root cause of scheduler stalls. The changes implement the core fix to the AdaptivePolicy logic, though the full issue resolution (gating speculation on spare capacity) requires additional changes outside this PR's scope. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on the AdaptivePolicy watermark mechanism in adaptive_policy.rs: the relax_high_water helper, on_fired logic update, and corresponding test updates. No unrelated or extraneous changes are present. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

ingim commented Jun 19, 2026

Contributor

Great work, @evanc7007 !

@ingim ingim merged commit c458a6b into pie-project:main Jun 19, 2026
7 checks passed

Development

Successfully merging this pull request may close these issues.

Pass-level speculation under batch saturation regresses long-context concurrent (MoE) decode 2–4×

2 participants