Speculative decoding #3

Merged
varad-more merged 16 commits into main from speculative-decoding
Apr 20, 2026

@varad-more
Owner

No description provided.

varad-more and others added 16 commits April 7, 2026 18:39
…a-2, gemma-3-4b, llama

Phase 1 variance now fully complete (201/200 files); gemma-3-4b filled in
50/50 and Llama sglang top-up added. Phase 3 decode sweep topped up to
n=3 iterations across all 4 non-Gemma-4 models (96/96 files, up from 72).

Regenerated reports/decode_length_sweep_summary.md from actual data and
added reports/decode_length_analysis.md produced by
analysis/decode_length_analysis.py.

Also extends scripts/run_new_benchmarks.sh:
- --phase4 for Gemma 4 (E2B/E4B) baseline + ngram spec-dec
- --all to run phases 1-4 end-to-end
- SKIP_GEMMA4 / SKIP_GEMMA4_SGLANG env vars to scope disk usage
- run_vllm_gemma4 / run_sglang_gemma4 launchers (apt-get install git,
  pip install transformers-from-git, python3 entrypoint)
- Gemma 4 entries reordered first in Phase 3 so any new-launcher issues
  surface early
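
A minimal sketch of how the two skip variables could gate the Gemma 4 cells, assuming each is treated as a 0/1 switch. Only SKIP_GEMMA4 and SKIP_GEMMA4_SGLANG come from the message above; the function name and the model identifiers are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: unset means "do not skip".
SKIP_GEMMA4="${SKIP_GEMMA4:-0}"
SKIP_GEMMA4_SGLANG="${SKIP_GEMMA4_SGLANG:-0}"

# Print the (engine, model) cells that would run under the current flags.
# plan_gemma4_cells and the model names are illustrative, not from the script.
plan_gemma4_cells() {
  local model
  # SKIP_GEMMA4=1 drops every Gemma 4 cell.
  [ "$SKIP_GEMMA4" = "1" ] && return 0
  for model in gemma-4-E2B-it gemma-4-E4B-it; do
    echo "vllm $model"
    # SKIP_GEMMA4_SGLANG=1 drops only the SGLang side.
    [ "$SKIP_GEMMA4_SGLANG" = "1" ] || echo "sglang $model"
  done
}

plan_gemma4_cells
```

Planning the cells separately from launching them keeps the skip logic testable without pulling any images.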

Adds the first Gemma 4 E2B-it vLLM baseline + ngram runs and docs/RUN_STATUS.md
(per-model per-phase done/remaining tracker).

finishes the last outstanding benchmark work: gemma-4 E2B and E4B on
SGLang, across the decode-length sweep (phase 3) and the baseline +
ngram speculative-decoding cells (phase 4). all 381 extended-phase
cells now complete, zero cell-level failures. request-level error
rates on SGLang sit between 0–3%, consistent with the 4-model base
phase-3 behaviour.

the unblock: sglang:latest (apr-09) falls back to the generic
TransformersMultiModalForCausalLM wrapper and dies on gemma 4's
QK-norm params (k_norm / q_norm). pinned GEMMA4_SGLANG_IMAGE to
lmsysorg/sglang:dev-cu13 (apr-16 snapshot off main) — that build
ships the native gemma-4 model class and loads weights cleanly.
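
A sketch of how such a pin might be expressed, assuming the variable is read with a default so it can still be overridden from the environment. The variable name and image tag come from the message above; the helper is hypothetical and only prints the command instead of calling docker.

```shell
#!/usr/bin/env bash
# Default to the pinned dev image; an exported value still takes precedence.
GEMMA4_SGLANG_IMAGE="${GEMMA4_SGLANG_IMAGE:-lmsysorg/sglang:dev-cu13}"

# Hypothetical helper: show the pull command that would be issued.
sglang_pull_cmd() {
  echo "docker pull ${GEMMA4_SGLANG_IMAGE}"
}

sglang_pull_cmd
```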

script changes:
  - moved run_vllm_gemma4 / run_sglang_gemma4 above phase 3 so the
    --phase3 invocation can reach them (they were defined later in the
    phase 4 block, which was a latent bug when --phase3 was run alone)
  - added --enforce-eager on the phase 3 gemma 4 vLLM call for parity
    with phase 4 (CUDA-graph safety)
  - commented out the vllm pull_image_if_missing calls in phases 3/4 —
    we only pull sglang now, since all vLLM work is already complete
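
The ordering bug in the first bullet can be reproduced in a few lines. This is a sketch of one plausible reading: bash looks a function up when the call executes, so a launcher defined further down the script is not yet visible to an earlier block. run_sglang_gemma4 is the name from the message; its body here is a stand-in.

```shell
#!/usr/bin/env bash
# Calling before the definition has been read fails with status 127
# ("command not found"). The subshell keeps the failure contained.
early_status=0
( run_sglang_gemma4 ) 2>/dev/null || early_status=$?

# Stand-in body; the real launcher starts a container.
run_sglang_gemma4() {
  echo "launching sglang for gemma 4"
}

# The same call succeeds once the definition sits above the call site.
late_output="$(run_sglang_gemma4)"

echo "before definition: exit $early_status"
echo "after definition: $late_output"
```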

docs:
  - README: summary updated to reflect 16 models (incl. gemma 4), phase
    3/4 rows marked complete, added a "notes on getting gemma 4 working"
    subsection that documents the QK-norm fallback error and the
    dev-cu13 fix
  - docs/RUN_STATUS.md: rewritten as a completion snapshot — all phases
    at 100%, 381/381 runs, "remaining: 0". retained the eagle3 entry
    under blocked/out-of-scope since that's a separate retired-image
    issue

regenerated reports/{decode_length_analysis,tpot_analysis,variance_analysis}.md
from the full result set (now includes gemma 4). figures in
reports/figures/ regenerated from analysis.generate_final_benchmark_report
(byte-identical, so no diff).

…points

the scripts/ dir had accumulated eight overlapping shell scripts — one per
phase plus two pending-audit wrappers. they were all superseded by either
run_all_benchmarks.sh (the 14-model baseline suite) or
run_new_benchmarks.sh (the extended variance / concurrency-64 / decode-sweep
/ gemma-4 phases). trimmed down to those two.

removed:
  scripts/run_concurrency_64.sh       -> run_new_benchmarks.sh --phase2
  scripts/run_decode_sweep.sh         -> run_new_benchmarks.sh --phase3
  scripts/run_variance_subset.sh      -> run_new_benchmarks.sh --phase1
  scripts/run_gemma4_benchmarks.sh    -> run_new_benchmarks.sh --phase4
  scripts/run_pending.sh              -> subsumed by idempotent resume
  scripts/run_phase_a_pending.sh      -> ditto
  scripts/pending.sh                  -> ditto
  docs/NEXT_STEPS.md                  -> all items complete; RUN_STATUS.md
                                         is now the source of truth
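
For context on "idempotent resume", a minimal sketch of the general pattern, assuming a cell counts as done when its result file exists and is non-empty. The paths, cell names, and run_cell helper are hypothetical and not taken from run_new_benchmarks.sh.

```shell
#!/usr/bin/env bash
# Throwaway results directory so the sketch runs anywhere.
RESULTS_DIR="$(mktemp -d)"
ran=0
skipped=0

# Run one benchmark cell unless its result file is already present.
run_cell() {
  local name="$1"
  local out="$RESULTS_DIR/$name.json"
  if [ -s "$out" ]; then
    skipped=$((skipped + 1))
    return 0
  fi
  # Write to a temp file and rename, so an interrupted run never leaves
  # a partial file that a later pass would mistake for a finished cell.
  echo '{"status": "ok"}' > "$out.tmp" && mv "$out.tmp" "$out"
  ran=$((ran + 1))
}

# Two passes over the same cells: the second pass should do no work.
for pass in 1 2; do
  for cell in cell_a cell_b cell_c; do
    run_cell "$cell"
  done
done

echo "ran=$ran skipped=$skipped"
```

With this shape a separate "pending" wrapper is unnecessary: re-running the main script only fills in what is missing.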

also removed empty scripts/{logs,reports,results_*}/ subdirs that an old
script created in the wrong working directory.

updated references:
  - scripts/EXECUTION_GUIDE.md rewritten to describe both entry points
    with env-var knobs and a cleaner troubleshooting table
  - README "Reproducing These Results" split into option A (baseline) and
    option B (extended phases); stale "near-term in-progress" checklist
    removed now that all items are done
  - analysis/{variance,decode_length}_analysis.py error messages now
    point at run_new_benchmarks.sh with the right phase flag
  - docs/RUN_STATUS.md re-audit command replaced with an inline find
    one-liner (pending.sh is gone)
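
A sketch of what a find-based re-audit could look like, assuming one JSON file per completed run. The directory layout and file pattern are assumptions, not the actual command in docs/RUN_STATUS.md.

```shell
#!/usr/bin/env bash
# Count result files under a directory tree.
# Assumption: each completed run leaves exactly one *.json file.
count_results() {
  find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}

# Demo against a throwaway tree so the sketch runs anywhere.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/vllm" "$demo_root/sglang"
: > "$demo_root/vllm/run1.json"
: > "$demo_root/sglang/run1.json"
: > "$demo_root/sglang/notes.txt"

echo "completed runs: $(count_results "$demo_root")"
```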

the CLI flags and doc headings used generic phase numbers (--phase1
through --phase4) that required a legend to interpret. renamed each one
after the function its block actually performs.

CLI flags in scripts/run_new_benchmarks.sh:
  --phase1       -> --variance
  --phase2       -> --concurrency
  --phase3       -> --decode-sweep
  --phase4       -> --gemma4
  --phase3-redo  -> --decode-sweep-redo
  --all          unchanged

internal bash vars renamed in lockstep:
  RUN_PHASE1..4  -> RUN_VARIANCE, RUN_CONCURRENCY, RUN_DECODE_SWEEP,
                    RUN_GEMMA4
  RUN_PHASE3_REDO -> RUN_DECODE_SWEEP_REDO
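
A sketch of how the renamed flags could map onto the renamed variables. The flag and variable names come from the lists above; the parse_flags function, the 0/1 convention, and the handling of unknown flags are assumptions.

```shell
#!/usr/bin/env bash
# Map each CLI flag to its RUN_* switch (1 = run that block).
parse_flags() {
  RUN_VARIANCE=0
  RUN_CONCURRENCY=0
  RUN_DECODE_SWEEP=0
  RUN_GEMMA4=0
  RUN_DECODE_SWEEP_REDO=0
  local arg
  for arg in "$@"; do
    case "$arg" in
      --variance)          RUN_VARIANCE=1 ;;
      --concurrency)       RUN_CONCURRENCY=1 ;;
      --decode-sweep)      RUN_DECODE_SWEEP=1 ;;
      --gemma4)            RUN_GEMMA4=1 ;;
      --decode-sweep-redo) RUN_DECODE_SWEEP_REDO=1 ;;
      --all)
        # Assumption: --all covers the four main blocks, not the redo.
        RUN_VARIANCE=1; RUN_CONCURRENCY=1; RUN_DECODE_SWEEP=1; RUN_GEMMA4=1 ;;
      *)
        echo "unknown flag: $arg" >&2
        return 2 ;;
    esac
  done
}

parse_flags --decode-sweep --gemma4
echo "variance=$RUN_VARIANCE decode_sweep=$RUN_DECODE_SWEEP gemma4=$RUN_GEMMA4"
```

In this sketch the old --phaseN spellings are rejected instead of being silently ignored, which surfaces stale invocations early.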

log() banners ("PHASE 1 — VARIANCE SUBSET" etc.) dropped the "PHASE N"
prefix. section comments and "next steps" echo at the end of the run
updated accordingly.

docs:
  - README: extended-phase table now calls the rows "Variance subset",
    "Concurrency-64 ramp", "Decode-length sweep", "Gemma 4 baseline +
    ngram" — all with the new flag in the command column. the old
    project-structure block that pointed at the deleted per-phase
    scripts has been trimmed to just the two remaining entry points
    plus a pointer to EXECUTION_GUIDE.md.
  - docs/RUN_STATUS.md: section headings renamed ("## Variance subset",
    "## Concurrency-64 ramp", "## Decode-length sweep", "## Gemma 4
    baseline + ngram spec-dec") and the grand-totals table now uses
    function names instead of phase numbers.
  - scripts/EXECUTION_GUIDE.md: per-phase command block and env-var
    knobs table no longer reference phase numbers.
  - analysis/{variance,decode_length}_analysis.py: error messages
    ("Run scripts/run_decode_sweep.sh first.") updated to point at the
    new --decode-sweep / --variance flags.
@varad-more varad-more merged commit ecf3809 into main Apr 20, 2026
1 check failed
varad-more added a commit that referenced this pull request Apr 22, 2026
Added decode sweep benchmark runs and gemma4 model