Speculative decoding by varad-more · Pull Request #3 · varad-more/inference-engine-benchmark-system

varad-more · 2026-04-20T05:01:55Z

No description provided.

…weep top-up

…a-2, gemma-3-4b, llama

Phase 1 variance now fully complete (201/200 files); gemma-3-4b filled in 50/50 and Llama sglang top-up added. Phase 3 decode sweep topped up to n=3 iterations across all 4 non-Gemma-4 models (96/96 files, up from 72). Regenerated reports/decode_length_sweep_summary.md from actual data and added reports/decode_length_analysis.md produced by analysis/decode_length_analysis.py.

Also extends scripts/run_new_benchmarks.sh:
- --phase4 for Gemma 4 (E2B/E4B) baseline + ngram spec-dec
- --all to run phases 1-4 end-to-end
- SKIP_GEMMA4 / SKIP_GEMMA4_SGLANG env vars to scope disk usage
- run_vllm_gemma4 / run_sglang_gemma4 launchers (apt-get install git, pip install transformers-from-git, python3 entrypoint)
- Gemma 4 entries reordered first in Phase 3 so any new-launcher issues surface early

Adds the first Gemma 4 E2B-it vLLM baseline + ngram runs and docs/RUN_STATUS.md (per-model per-phase done/remaining tracker).
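For readers unfamiliar with the skip knobs, here is a minimal sketch of how SKIP_GEMMA4 / SKIP_GEMMA4_SGLANG could gate the Gemma 4 cells. The helper name should_run_gemma4 and the convention that a value of 1 means "skip" are assumptions for illustration; the actual script may test these variables differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not taken from run_new_benchmarks.sh.
# Assumption: SKIP_GEMMA4=1 skips every Gemma 4 cell, while
# SKIP_GEMMA4_SGLANG=1 skips only the SGLang half.
should_run_gemma4() {
  local engine="$1"   # "vllm" or "sglang"
  if [ "${SKIP_GEMMA4:-0}" = "1" ]; then
    return 1
  fi
  if [ "$engine" = "sglang" ] && [ "${SKIP_GEMMA4_SGLANG:-0}" = "1" ]; then
    return 1
  fi
  return 0
}

for engine in vllm sglang; do
  if should_run_gemma4 "$engine"; then
    echo "run gemma4 on $engine"
  else
    echo "skip gemma4 on $engine"
  fi
done
```

Defaulting both variables to 0 with ${VAR:-0} keeps the full run as the default behaviour, so the knobs only ever narrow the work (and the disk footprint of pulled images and weights).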

finishes the last outstanding benchmark work: gemma-4 E2B and E4B on SGLang, across the decode-length sweep (phase 3) and the baseline + ngram speculative-decoding cells (phase 4). all 381 extended-phase cells now complete, zero cell-level failures. request-level error rates on SGLang sit between 0% and 3%, consistent with the 4-model base phase-3 behaviour.

the unblock: sglang:latest (apr-09) falls back to the generic TransformersMultiModalForCausalLM wrapper and dies on gemma 4's QK-norm params (k_norm / q_norm). pinned GEMMA4_SGLANG_IMAGE to lmsysorg/sglang:dev-cu13 (apr-16 snapshot off main); that build ships the native gemma-4 model class and loads weights cleanly.

script changes:
- moved run_vllm_gemma4 / run_sglang_gemma4 above phase 3 so the --phase3 invocation can reach them (they were defined later in the phase 4 block, which was a latent bug when --phase3 was run alone)
- added --enforce-eager on the phase 3 gemma 4 vLLM call for parity with phase 4 (CUDA-graph safety)
- commented out the vllm pull_image_if_missing calls in phases 3/4; we only pull sglang now, since all vLLM work is already complete

docs:
- README: summary updated to reflect 16 models (incl. gemma 4), phase 3/4 rows marked complete, added a "notes on getting gemma 4 working" subsection that documents the QK-norm fallback error and the dev-cu13 fix
- docs/RUN_STATUS.md: rewritten as a completion snapshot: all phases at 100%, 381/381 runs, "remaining: 0". retained the eagle3 entry under blocked/out-of-scope since that's a separate retired-image issue

regenerated reports/{decode_length_analysis,tpot_analysis,variance_analysis}.md from the full result set (now includes gemma 4). figures in reports/figures/ regenerated from analysis.generate_final_benchmark_report (byte-identical, so no diff).
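The launcher-ordering fix comes down to a bash rule: a function exists only once its definition has been executed, so a call that runs earlier in the file than the definition fails with "command not found". The sketch below shows the corrected layout under simplifying assumptions: the launcher bodies are placeholders that only echo, and the RUN_PHASE3 / RUN_PHASE4 defaults are chosen for the demo. The image default and the --enforce-eager flag are the ones named in the commit message.

```bash
#!/usr/bin/env bash
# Image pin from the commit message; can still be overridden from the environment.
GEMMA4_SGLANG_IMAGE="${GEMMA4_SGLANG_IMAGE:-lmsysorg/sglang:dev-cu13}"

# Launchers defined before any phase block. The bodies are placeholders;
# the real launchers start containers.
run_vllm_gemma4()   { echo "vllm gemma4: $*"; }
run_sglang_gemma4() { echo "sglang gemma4 ($GEMMA4_SGLANG_IMAGE): $*"; }

RUN_PHASE3="${RUN_PHASE3:-1}"   # demo defaults, assumed for this sketch
RUN_PHASE4="${RUN_PHASE4:-0}"

if [ "$RUN_PHASE3" = "1" ]; then
  # If the two definitions lived further down, inside the phase 4 block,
  # these calls would fail here: bash has not executed them yet.
  run_vllm_gemma4 --enforce-eager
  run_sglang_gemma4
fi

if [ "$RUN_PHASE4" = "1" ]; then
  run_vllm_gemma4 --enforce-eager
  run_sglang_gemma4
fi
```

Keeping every shared helper at the top of the script, ahead of all phase blocks, makes each phase flag safe to run on its own.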

…points

the scripts/ dir had accumulated eight overlapping shell scripts: one per phase plus two pending-audit wrappers. they were all superseded by either run_all_benchmarks.sh (the 14-model baseline suite) or run_new_benchmarks.sh (the extended variance / concurrency-64 / decode-sweep / gemma-4 phases). trimmed down to those two.

removed:
  scripts/run_concurrency_64.sh    -> run_new_benchmarks.sh --phase2
  scripts/run_decode_sweep.sh      -> run_new_benchmarks.sh --phase3
  scripts/run_variance_subset.sh   -> run_new_benchmarks.sh --phase1
  scripts/run_gemma4_benchmarks.sh -> run_new_benchmarks.sh --phase4
  scripts/run_pending.sh           -> subsumed by idempotent resume
  scripts/run_phase_a_pending.sh   -> ditto
  scripts/pending.sh               -> ditto
  docs/NEXT_STEPS.md               -> all items complete; RUN_STATUS.md is now the source of truth

also removed empty scripts/{logs,reports,results_*}/ subdirs that an old script created in the wrong working directory.

updated references:
- scripts/EXECUTION_GUIDE.md rewritten to describe both entry points with env-var knobs and a cleaner troubleshooting table
- README "Reproducing These Results" split into option A (baseline) and option B (extended phases); stale "near-term in-progress" checklist removed now that all items are done
- analysis/{variance,decode_length}_analysis.py error messages now point at run_new_benchmarks.sh with the right phase flag
- docs/RUN_STATUS.md re-audit command replaced with an inline find one-liner (pending.sh is gone)
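To illustrate why idempotent resume makes the pending-audit wrappers redundant, here is a minimal sketch of the pattern together with a find-based audit. Everything in it is hypothetical: the one-JSON-file-per-cell layout, the RESULTS_DIR variable, and the function names run_cell_once and count_done are assumptions, not the repository's actual structure or its actual find one-liner.

```bash
#!/usr/bin/env bash
# Assumed layout: one JSON result file per completed cell under $RESULTS_DIR.
RESULTS_DIR="${RESULTS_DIR:-results}"

# Run a cell only if its result file is missing or empty.
# Usage: run_cell_once <cell-name> <command> [args...]
run_cell_once() {
  local name="$1"; shift
  local out="$RESULTS_DIR/$name.json"
  if [ -s "$out" ]; then
    echo "skip $name (already done)"
    return 0
  fi
  mkdir -p "$RESULTS_DIR"
  # Write to a temp file first so an interrupted run never leaves a
  # partial result that a later resume would mistake for a finished cell.
  if "$@" > "$out.tmp"; then
    mv "$out.tmp" "$out"
    echo "done $name"
  else
    rm -f "$out.tmp"
    echo "fail $name"
    return 1
  fi
}

# Audit: count completed cells with a single find invocation.
count_done() {
  find "$RESULTS_DIR" -type f -name '*.json' 2>/dev/null | wc -l | tr -d ' '
}
```

With this shape, re-running the same command after a crash only executes the cells that are still missing, so a separate "pending" script has nothing left to compute; count_done compared against the expected total gives the done/remaining figure.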

the CLI flags and doc headings used generic phase numbers (--phase1 through --phase4) that required a legend to interpret. switched to the function each block actually performs.

CLI flags in scripts/run_new_benchmarks.sh:
  --phase1      -> --variance
  --phase2      -> --concurrency
  --phase3      -> --decode-sweep
  --phase4      -> --gemma4
  --phase3-redo -> --decode-sweep-redo
  --all unchanged

internal bash vars renamed in lockstep:
  RUN_PHASE1..4   -> RUN_VARIANCE, RUN_CONCURRENCY, RUN_DECODE_SWEEP, RUN_GEMMA4
  RUN_PHASE3_REDO -> RUN_DECODE_SWEEP_REDO

log() banners ("PHASE 1 — VARIANCE SUBSET" etc.) dropped the "PHASE N" prefix. section comments and "next steps" echo at the end of the run updated accordingly.

docs:
- README: extended-phase table now calls the rows "Variance subset", "Concurrency-64 ramp", "Decode-length sweep", "Gemma 4 baseline + ngram", all with the new flag in the command column. the old project-structure block that pointed at the deleted per-phase scripts has been trimmed to just the two remaining entry points plus a pointer to EXECUTION_GUIDE.md.
- docs/RUN_STATUS.md: section headings renamed ("## Variance subset", "## Concurrency-64 ramp", "## Decode-length sweep", "## Gemma 4 baseline + ngram spec-dec") and the grand-totals table now uses function names instead of phase numbers.
- scripts/EXECUTION_GUIDE.md: per-phase command block and env-var knobs table no longer reference phase numbers.
- analysis/{variance,decode_length}_analysis.py: error messages ("Run scripts/run_decode_sweep.sh first.") updated to point at the new --decode-sweep / --variance flags.
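A minimal sketch of what the renamed flag handling might look like, using the flag and variable names listed above. The parse_flags wrapper, the case-statement style, and the choice to reject unknown flags (including the old --phaseN spellings) are assumptions for illustration; the script may well parse its arguments differently. --all is expanded to the four main blocks, per the earlier description of it running phases 1-4 end-to-end.

```bash
#!/usr/bin/env bash
# Variable names from the commit message; the parsing code itself is assumed.
RUN_VARIANCE=0
RUN_CONCURRENCY=0
RUN_DECODE_SWEEP=0
RUN_GEMMA4=0
RUN_DECODE_SWEEP_REDO=0

parse_flags() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --variance)          RUN_VARIANCE=1 ;;
      --concurrency)       RUN_CONCURRENCY=1 ;;
      --decode-sweep)      RUN_DECODE_SWEEP=1 ;;
      --gemma4)            RUN_GEMMA4=1 ;;
      --decode-sweep-redo) RUN_DECODE_SWEEP_REDO=1 ;;
      --all)
        RUN_VARIANCE=1; RUN_CONCURRENCY=1
        RUN_DECODE_SWEEP=1; RUN_GEMMA4=1 ;;
      *)
        # Assumption: old --phaseN spellings are not kept as aliases.
        echo "unknown flag: $arg" >&2
        return 2 ;;
    esac
  done
}

parse_flags --decode-sweep --gemma4
echo "variance=$RUN_VARIANCE concurrency=$RUN_CONCURRENCY" \
     "decode_sweep=$RUN_DECODE_SWEEP gemma4=$RUN_GEMMA4"
```

Naming each flag and variable after the block it triggers means the case statement documents itself, which is the point of the rename.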

Added decode sweep benchmark runs and gemma4 model

varad-more and others added 16 commits April 7, 2026 18:39

added TPOT analysis and added decode length sweep

888db4e

Merge remote-tracking branch 'origin/main' into speculative-decoding

530ad73

added script for running

836e70e

Update run_new_benchmarks.sh

44be2b6

added new runs

1cb1125

Added runs for variance testing

341fe6e

Decode Length Sweep for gemma2 and phi4 mini

49bd4df

added decode length sweep runs for gemma and phi4

c3a59e5

completed decode length sweep and partial concurrency runs

e55ce6a

completed phase 2 concurrency-64 runs for all 8 cells

9dcdce4

add idempotent resume logic for phase 1 variance and phase 3 decode-s…

0cdfcd1

…weep top-up

Added benchmark runs for gemma4 e4b

9b336c5

varad-more merged commit ecf3809 into main Apr 20, 2026
1 check failed

varad-more added a commit that referenced this pull request Apr 22, 2026

Merge pull request #3 from varad-more/speculative-decoding

445992d

Added decode sweep benchmark runs and gemma4 model

varad-more merged 16 commits into main from speculative-decoding
