Speculative decoding #3
Merged
…a-2, gemma-3-4b, llama

Phase 1 variance now fully complete (201/200 files); gemma-3-4b filled in
50/50 and Llama sglang top-up added.

Phase 3 decode sweep topped up to n=3 iterations across all 4 non-Gemma-4
models (96/96 files, up from 72). Regenerated
reports/decode_length_sweep_summary.md from actual data and added
reports/decode_length_analysis.md produced by
analysis/decode_length_analysis.py.

Also extends scripts/run_new_benchmarks.sh:
- --phase4 for Gemma 4 (E2B/E4B) baseline + ngram spec-dec
- --all to run phases 1-4 end-to-end
- SKIP_GEMMA4 / SKIP_GEMMA4_SGLANG env vars to scope disk usage
- run_vllm_gemma4 / run_sglang_gemma4 launchers (apt-get install git, pip
  install transformers-from-git, python3 entrypoint)
- Gemma 4 entries reordered first in Phase 3 so any new-launcher issues
  surface early

Adds the first Gemma 4 E2B-it vLLM baseline + ngram runs and
docs/RUN_STATUS.md (per-model per-phase done/remaining tracker).
finishes the last outstanding benchmark work: gemma-4 E2B and E4B on
SGLang, across the decode-length sweep (phase 3) and the baseline +
ngram speculative-decoding cells (phase 4). all 381 extended-phase
cells are now complete, with zero cell-level failures. request-level
error rates on SGLang sit between 0 and 3%, consistent with the phase-3
behaviour of the 4 base models.
the unblock: sglang:latest (apr-09) falls back to the generic
TransformersMultiModalForCausalLM wrapper and dies on gemma 4's
QK-norm params (k_norm / q_norm). pinned GEMMA4_SGLANG_IMAGE to
lmsysorg/sglang:dev-cu13 (apr-16 snapshot off main) — that build
ships the native gemma-4 model class and loads weights cleanly.
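a minimal sketch of how the pin could be wired, assuming the script picks the image per model. only GEMMA4_SGLANG_IMAGE and the dev-cu13 tag come from the notes above; the pick_sglang_image helper, the lmsysorg/ prefix on the latest tag, and the "gemma-4" substring match are hypothetical.

```shell
#!/usr/bin/env bash
# Default to the pinned dev build, but let the caller override it via the env.
GEMMA4_SGLANG_IMAGE="${GEMMA4_SGLANG_IMAGE:-lmsysorg/sglang:dev-cu13}"

# Hypothetical helper: choose the SGLang image for a given model id.
# Assumption: gemma 4 model ids contain the substring "gemma-4".
pick_sglang_image() {
  case "$1" in
    *gemma-4*) echo "$GEMMA4_SGLANG_IMAGE" ;;      # native gemma-4 model class
    *)         echo "lmsysorg/sglang:latest" ;;    # assumed default for other models
  esac
}
```

keeping the pin in one variable means a later image bump is a single-line change, or an env override with no edit at all.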
script changes:
- moved run_vllm_gemma4 / run_sglang_gemma4 above phase 3 so the
  --phase3 invocation can reach them (they were defined later in the
  phase 4 block, which was a latent bug when --phase3 was run alone)
- added --enforce-eager on the phase 3 gemma 4 vLLM call for parity
  with phase 4 (CUDA-graph safety)
- commented out the vllm pull_image_if_missing calls in phases 3/4;
  we only pull sglang now, since all vLLM work is already complete
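the ordering fix follows from how bash reads a script: a function only exists once the interpreter has executed its definition, so a call that runs earlier fails with "command not found". a minimal sketch of the corrected layout, with placeholder launcher bodies (the real ones are not shown in this PR):

```shell
#!/usr/bin/env bash
# Launchers are defined up front so every phase block below can call them.
# Placeholder bodies: these only echo what the real launchers would receive.
run_vllm_gemma4()   { echo "vllm gemma4 launcher: $*"; }
run_sglang_gemma4() { echo "sglang gemma4 launcher: $*"; }

# Phase toggles as named at this commit; the defaults here are illustrative.
RUN_PHASE3="${RUN_PHASE3:-1}"
RUN_PHASE4="${RUN_PHASE4:-0}"

if [ "$RUN_PHASE3" = "1" ]; then
  # Works even when phase 4 is skipped, because the definition was already read.
  run_vllm_gemma4 --enforce-eager
fi

if [ "$RUN_PHASE4" = "1" ]; then
  run_sglang_gemma4
fi
```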
docs:
- README: summary updated to reflect 16 models (incl. gemma 4), phase
  3/4 rows marked complete, added a "notes on getting gemma 4 working"
  subsection that documents the QK-norm fallback error and the
  dev-cu13 fix
- docs/RUN_STATUS.md: rewritten as a completion snapshot: all phases
  at 100%, 381/381 runs, "remaining: 0". retained the eagle3 entry
  under blocked/out-of-scope since that's a separate retired-image
  issue
regenerated reports/{decode_length_analysis,tpot_analysis,variance_analysis}.md
from the full result set (now includes gemma 4). figures in
reports/figures/ regenerated from analysis.generate_final_benchmark_report
(byte-identical, so no diff).
…points
the scripts/ dir had accumulated eight overlapping shell scripts — one per
phase plus two pending-audit wrappers. they were all superseded by either
run_all_benchmarks.sh (the 14-model baseline suite) or
run_new_benchmarks.sh (the extended variance / concurrency-64 / decode-sweep
/ gemma-4 phases). trimmed down to those two.
removed:
  scripts/run_concurrency_64.sh    -> run_new_benchmarks.sh --phase2
  scripts/run_decode_sweep.sh      -> run_new_benchmarks.sh --phase3
  scripts/run_variance_subset.sh   -> run_new_benchmarks.sh --phase1
  scripts/run_gemma4_benchmarks.sh -> run_new_benchmarks.sh --phase4
  scripts/run_pending.sh           -> subsumed by idempotent resume
  scripts/run_phase_a_pending.sh   -> ditto
  scripts/pending.sh               -> ditto
  docs/NEXT_STEPS.md               -> all items complete; RUN_STATUS.md
                                      is now the source of truth
also removed empty scripts/{logs,reports,results_*}/ subdirs that an old
script created in the wrong working directory.
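the "idempotent resume" that replaces the pending wrappers can be sketched as a skip-if-done guard around each benchmark cell. this is an illustration of the pattern, not the script's actual code: run_cell_once and the one-output-file-per-cell layout are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command only if its result file is missing or empty.
# Usage: run_cell_once <output-file> <command> [args...]
run_cell_once() {
  local out="$1"
  shift
  if [ -s "$out" ]; then
    echo "skip: $out"          # cell already finished in an earlier run
    return 0
  fi
  mkdir -p "$(dirname "$out")"
  # Write to a temp file first so an interrupted run never looks complete.
  if "$@" > "$out.tmp"; then
    mv "$out.tmp" "$out"
  else
    rm -f "$out.tmp"
    return 1
  fi
}
```

with a guard like this, re-running the whole suite after a crash redoes only the missing cells, which is why separate "pending" scripts become unnecessary.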
updated references:
- scripts/EXECUTION_GUIDE.md rewritten to describe both entry points
  with env-var knobs and a cleaner troubleshooting table
- README "Reproducing These Results" split into option A (baseline) and
  option B (extended phases); stale "near-term in-progress" checklist
  removed now that all items are done
- analysis/{variance,decode_length}_analysis.py error messages now
  point at run_new_benchmarks.sh with the right phase flag
- docs/RUN_STATUS.md re-audit command replaced with an inline find
  one-liner (pending.sh is gone)
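the exact re-audit one-liner is not quoted in this PR. a sketch of what such a find-based count might look like, assuming result files are JSON and live under one directory per phase (both are assumptions, as is the count_result_files name):

```shell
#!/usr/bin/env bash
# Hypothetical helper: count result files under a directory tree.
# Assumption: each completed run leaves exactly one *.json file.
count_result_files() {
  find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}
```

comparing that count with the expected total for a phase (for example the 96 decode-sweep files mentioned earlier) shows at a glance whether anything is still missing.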
the CLI flags and doc headings used generic phase numbers (--phase1
through --phase4) that required a legend to interpret. switched to the
function each block actually performs.
CLI flags in scripts/run_new_benchmarks.sh:
  --phase1      -> --variance
  --phase2      -> --concurrency
  --phase3      -> --decode-sweep
  --phase4      -> --gemma4
  --phase3-redo -> --decode-sweep-redo
  --all            unchanged
internal bash vars renamed in lockstep:
  RUN_PHASE1..4   -> RUN_VARIANCE, RUN_CONCURRENCY, RUN_DECODE_SWEEP,
                     RUN_GEMMA4
  RUN_PHASE3_REDO -> RUN_DECODE_SWEEP_REDO
log() banners ("PHASE 1 — VARIANCE SUBSET" etc.) dropped the "PHASE N"
prefix. section comments and "next steps" echo at the end of the run
updated accordingly.
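a sketch of how the renamed flags could map onto the renamed variables. the flag and variable names are the ones listed above; the parse_args function, the error handling, and the choice that --all leaves the redo flag off are assumptions.

```shell
#!/usr/bin/env bash
# All blocks are off unless a flag turns them on.
RUN_VARIANCE=0
RUN_CONCURRENCY=0
RUN_DECODE_SWEEP=0
RUN_GEMMA4=0
RUN_DECODE_SWEEP_REDO=0

# Hypothetical parser: one case arm per function-named flag.
parse_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --variance)          RUN_VARIANCE=1 ;;
      --concurrency)       RUN_CONCURRENCY=1 ;;
      --decode-sweep)      RUN_DECODE_SWEEP=1 ;;
      --gemma4)            RUN_GEMMA4=1 ;;
      --decode-sweep-redo) RUN_DECODE_SWEEP_REDO=1 ;;
      --all)
        # Assumption: --all enables the four main blocks but not the redo.
        RUN_VARIANCE=1; RUN_CONCURRENCY=1; RUN_DECODE_SWEEP=1; RUN_GEMMA4=1 ;;
      *)
        echo "unknown flag: $1" >&2
        return 2 ;;
    esac
    shift
  done
}
```

in this sketch the old --phaseN spellings fall through to the unknown-flag arm, so a stale invocation fails loudly instead of silently running nothing.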
docs:
- README: extended-phase table now calls the rows "Variance subset",
  "Concurrency-64 ramp", "Decode-length sweep", "Gemma 4 baseline +
  ngram", each with the new flag in the command column. the old
  project-structure block that pointed at the deleted per-phase
  scripts has been trimmed to just the two remaining entry points
  plus a pointer to EXECUTION_GUIDE.md.
- docs/RUN_STATUS.md: section headings renamed ("## Variance subset",
  "## Concurrency-64 ramp", "## Decode-length sweep", "## Gemma 4
  baseline + ngram spec-dec") and the grand-totals table now uses
  function names instead of phase numbers.
- scripts/EXECUTION_GUIDE.md: per-phase command block and env-var
  knobs table no longer reference phase numbers.
- analysis/{variance,decode_length}_analysis.py: stale error messages
  (e.g. "Run scripts/run_decode_sweep.sh first.") updated to point at
  the new --decode-sweep / --variance flags.
varad-more added a commit that referenced this pull request on Apr 22, 2026:
Added decode sweep benchmark runs and gemma4 model