Add 16GB-tier MoE models to gen-3 eval set + regenerate dashboard #107

Merged: antoinezambelli merged 3 commits into main from az/evals on Jun 18, 2026

@antoinezambelli

Owner

What

Adds the 16GB-tier MoE models to the v0.7.5 / gen-3 eval set and regenerates the shipped dashboard so they appear in the published results.

This is not a release — gen is a comparability epoch, not a version. No version bump.

Changes

18b09e6 — eval set additions (LFM2.5-8B-A1B, Mellum2-12B-A2.5B Thinking + Instruct)

  • sampling_defaults: official card params for the three models
  • batch_eval: GGUF entries, new-models config subset, reasoning-format auto flags for the two CoT models
  • eval_results_v0.7.5.jsonl: +15,600 rows stamped gen=3 (LFS)
  • EVAL_GUIDE: eval generations + post-release addendum

3d54141 — dashboard regeneration + docs fix

  • Rebuilt docs/results/dashboard.html and markdown views from the full versioned dataset set (v0.6.0 + v0.7.0 + v0.7.4 + v0.7.5), so the new gen-3 models slot in alongside the carried-forward older generations (gen 1 = v0.6.0, gen 2 = v0.7.0 + v0.7.4).
  • Fixed EVAL_GUIDE regen instructions: the shipped dashboard is a multi-file render (dedup_latest_gen keeps newest gen per config), not single-file. The old single-file example silently dropped every model absent from that one file — including the entire 27–35B tier and the Anthropic baselines.
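The difference between the two render modes can be illustrated with a minimal sketch. It assumes each JSONL row carries an integer `gen` field (as the stamped rows do) and a config identifier under a hypothetical `config` key. `dedup_latest_gen_sketch` and `load_rows` are illustrative stand-ins, not the repository's actual `dedup_latest_gen`, and the toy rows are made up.

```python
import json
from pathlib import Path


def load_rows(paths):
    """Read every non-empty line of each JSONL file into a dict."""
    rows = []
    for path in paths:
        with Path(path).open(encoding="utf-8") as fh:
            rows.extend(json.loads(line) for line in fh if line.strip())
    return rows


def dedup_latest_gen_sketch(rows, config_key="config", gen_key="gen"):
    """Keep only the rows from the newest gen seen for each config."""
    newest = {}
    for row in rows:
        cfg = row[config_key]
        newest[cfg] = max(newest.get(cfg, row[gen_key]), row[gen_key])
    return [row for row in rows if row[gen_key] == newest[row[config_key]]]


# Toy rows standing in for two result files (values are made up).
older_file = [{"config": "model-a", "gen": 2}, {"config": "model-b", "gen": 2}]
newer_file = [{"config": "model-a", "gen": 3}, {"config": "model-c", "gen": 3}]

# Multi-file render: model-b survives because no newer gen replaces it.
multi = dedup_latest_gen_sketch(older_file + newer_file)
# Single-file render: model-b never enters the input, so it vanishes silently.
single = dedup_latest_gen_sketch(newer_file)

print(sorted({r["config"] for r in multi}))
print(sorted({r["config"] for r in single}))
```

In a real regeneration the input would presumably come from something like `load_rows` applied to every `eval_results_v*.jsonl` file rather than to one of them, which is the point of the EVAL_GUIDE correction.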

Headline

Mellum2-12B-A2.5B Instruct (native, reforged) leads the additions at 81.0%.

Notes

  • Conflict-free with main (verified via merge-tree): touches a disjoint file set from the #76 MockClient consolidation already on main.
  • Caveat: the three new models were evaluated with replay set to none only, not across the full none/keep-last/full grid that the other gen-3 8–14B models carry.
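A coverage gap like the one in the caveat could be surfaced with a small check along these lines. This is a sketch under assumptions: the `model` and `replay` field names are hypothetical, as are the toy rows; only the three setting names (none, keep-last, full) come from the note above.

```python
from collections import defaultdict

EXPECTED_REPLAY = ("none", "keep-last", "full")


def missing_replay_modes(rows, model_key="model", replay_key="replay"):
    """Map each model to the expected replay settings it has no rows for."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[model_key]].add(row[replay_key])
    return {
        model: [mode for mode in EXPECTED_REPLAY if mode not in modes]
        for model, modes in seen.items()
        if not set(EXPECTED_REPLAY) <= modes
    }


# Toy rows: one model with partial coverage, one with the full grid.
rows = [
    {"model": "new-model", "replay": "none"},
    {"model": "older-model", "replay": "none"},
    {"model": "older-model", "replay": "keep-last"},
    {"model": "older-model", "replay": "full"},
]
gaps = missing_replay_modes(rows)
print(gaps)  # {'new-model': ['keep-last', 'full']}
```

Models with full coverage are omitted from the result, so an empty dict would mean the grid is complete.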

🤖 Generated with Claude Code

antoinezambelli and others added 3 commits June 17, 2026 11:19
Wire LFM2.5-8B-A1B and Mellum2-12B-A2.5B (Thinking + Instruct) into the
batch eval harness and sampling registry, and fold their n=50 results
into the v0.7.5 / gen 3 dataset. No version bump — gen is a comparability
epoch, not a release version, and these are net-new configs.

- sampling_defaults: official card params for the three models
- batch_eval: GGUF entries, "new-models" config subset, reasoning-format
  auto flags for the two CoT models (LFM2.5, Mellum2 Thinking)
- eval_results_v0.7.5.jsonl: +15,600 rows stamped gen=3
  (Mellum2-12B Instruct native/reforged leads at 81.0%)
- EVAL_GUIDE: document eval generations + the post-release addendum

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rebuild the shipped dashboard and markdown views from the full versioned
dataset set (v0.6.0 + v0.7.0 + v0.7.4 + v0.7.5) so the new gen-3 16GB-tier
models (LFM2.5-8B-A1B, Mellum2-12B Thinking + Instruct) appear alongside the
carried-forward older generations. Mellum2-12B Instruct (native, reforged)
leads the additions at 81.0%.

Also correct EVAL_GUIDE: the shipped dashboard is a multi-file render across
all eval_results_v*.jsonl (dedup_latest_gen keeps newest gen per config), not
a single-file render. The prior single-file example silently dropped every
model absent from that one file, including the carried-forward generations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@antoinezambelli antoinezambelli merged commit 7aef050 into main Jun 18, 2026
2 checks passed
@antoinezambelli antoinezambelli deleted the az/evals branch June 18, 2026 07:28
@antoinezambelli antoinezambelli mentioned this pull request Jun 20, 2026
antoinezambelli added a commit that referenced this pull request Jun 20, 2026
Bump version to 0.7.6 and add the CHANGELOG entry for the batch since
0.7.5: two Ollama backend fixes (#111, #115), inline <think> reasoning
capture on vLLM + Ollama (#110), the shared think-tag parsing refactor
(#112), the consolidated test-double fixture (#76), and the 16GB-tier
eval models (#107).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>