Add 16GB-tier MoE models to gen-3 eval set + regenerate dashboard by antoinezambelli · Pull Request #107 · antoinezambelli/forge

antoinezambelli · 2026-06-18T07:26:17Z

What

Adds the 16GB-tier MoE models to the v0.7.5 / gen-3 eval set and regenerates the shipped dashboard so they appear in the published results.

This is not a release — gen is a comparability epoch, not a version. No version bump.

Changes

18b09e6 — eval set additions (LFM2.5-8B-A1B, Mellum2-12B-A2.5B Thinking + Instruct)

sampling_defaults: official card params for the three models
batch_eval: GGUF entries, new-models config subset, reasoning-format auto flags for the two CoT models
eval_results_v0.7.5.jsonl: +15,600 rows stamped gen=3 (LFS)
EVAL_GUIDE: eval generations + post-release addendum
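As a quick sanity check on the "+15,600 rows stamped gen=3" figure, the rows in a results file can be tallied by generation. The sketch below is illustrative only: it assumes each JSONL line is an object carrying a `gen` field (the PR mentions the stamp, but the exact schema is not shown here), and the sample rows and model names are made up.

```python
import json
from collections import Counter
from typing import Iterable


def count_rows_by_gen(lines: Iterable[str]) -> Counter:
    """Tally JSONL rows by their `gen` stamp; blank lines are skipped."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # `gen` is assumed to be a top-level key; rows without it land under None.
        counts[row.get("gen")] += 1
    return counts


# Hypothetical sample rows, not real eval data.
sample = [
    '{"model": "demo-a", "gen": 3}',
    '{"model": "demo-b", "gen": 3}',
    "",
    '{"model": "demo-c", "gen": 2}',
]
counts = count_rows_by_gen(sample)
print(dict(counts))
```

To run it against the real file, pass an open file handle instead of `sample`, for example `count_rows_by_gen(open("eval_results_v0.7.5.jsonl"))`, adjusting the path to wherever the dataset lives in the checkout.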

3d54141 — dashboard regeneration + docs fix

Rebuilt docs/results/dashboard.html and markdown views from the full versioned dataset set (v0.6.0 + v0.7.0 + v0.7.4 + v0.7.5), so the new gen-3 models slot in alongside the carried-forward older generations (gen 1 = v0.6.0, gen 2 = v0.7.0 + v0.7.4).
Fixed EVAL_GUIDE regen instructions: the shipped dashboard is a multi-file render (dedup_latest_gen keeps newest gen per config), not single-file. The old single-file example silently dropped every model absent from that one file — including the entire 27–35B tier and the Anthropic baselines.
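The difference between the multi-file and single-file renders can be sketched as follows. This is not the repository's `dedup_latest_gen`; it is a hypothetical stand-in (`dedup_latest_gen_sketch`) that assumes rows are dicts with `config` and `gen` keys, and the model names and scores are invented for illustration.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def dedup_latest_gen_sketch(rows: Iterable[Row]) -> List[Row]:
    """Keep only the rows from the newest gen seen for each config."""
    rows = list(rows)
    newest: Dict[str, int] = {}
    for row in rows:
        cfg = row["config"]
        newest[cfg] = max(newest.get(cfg, row["gen"]), row["gen"])
    return [r for r in rows if r["gen"] == newest[r["config"]]]


# Hypothetical contents of three versioned result files.
v060 = [{"config": "older-model", "gen": 1, "score": 0.5}]
v070 = [{"config": "shared-model", "gen": 2, "score": 0.6}]
v075 = [
    {"config": "shared-model", "gen": 3, "score": 0.7},
    {"config": "new-model", "gen": 3, "score": 0.8},
]

# Multi-file render: older configs are carried forward, re-run configs keep gen 3.
multi_file = dedup_latest_gen_sketch(v060 + v070 + v075)
# Single-file render: any config absent from this one file never appears.
single_file = dedup_latest_gen_sketch(v075)

print(sorted((r["config"], r["gen"]) for r in multi_file))
print(sorted(r["config"] for r in single_file))
```

In this toy data, `older-model` exists only in the gen-1 file, so it survives the multi-file render and is missing from the single-file one, which mirrors the silent drop described above.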

Headline

Mellum2-12B-A2.5B Instruct (native, reforged) leads the additions at 81.0%.

Notes

Conflict-free with main (verified via merge-tree): touches a disjoint file set from the #76 MockClient consolidation already on main.
Caveat: replay coverage for the three new models is limited to the none mode, not the full none/keep-last/full grid that the other gen-3 8–14B models carry.
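A coverage gap like this one can be surfaced mechanically. The sketch below assumes each result row records its config and replay mode under hypothetical `config` and `replay` keys; only the three mode names (none, keep-last, full) come from this PR, and the sample rows are invented.

```python
from typing import Any, Dict, Iterable, Set

# Replay modes named in the caveat above.
EXPECTED_MODES: Set[str] = {"none", "keep-last", "full"}


def missing_replay_modes(rows: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each config to the replay modes it has no rows for."""
    seen: Dict[str, Set[str]] = {}
    for row in rows:
        seen.setdefault(row["config"], set()).add(row["replay"])
    return {
        cfg: EXPECTED_MODES - modes
        for cfg, modes in seen.items()
        if EXPECTED_MODES - modes
    }


# Hypothetical rows: one config with the full grid, one with `none` only.
rows = [
    {"config": "grid-model", "replay": "none"},
    {"config": "grid-model", "replay": "keep-last"},
    {"config": "grid-model", "replay": "full"},
    {"config": "new-model", "replay": "none"},
]
gaps = missing_replay_modes(rows)
print(gaps)
```

Configs with complete coverage are omitted from the result, so an empty dict means every config carries the full grid.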

🤖 Generated with Claude Code

Wire LFM2.5-8B-A1B and Mellum2-12B-A2.5B (Thinking + Instruct) into the batch eval harness and sampling registry, and fold their n=50 results into the v0.7.5 / gen 3 dataset. No version bump: gen is a comparability epoch, not a release version, and these are net-new configs.

- sampling_defaults: official card params for the three models
- batch_eval: GGUF entries, "new-models" config subset, reasoning-format auto flags for the two CoT models (LFM2.5, Mellum2 Thinking)
- eval_results_v0.7.5.jsonl: +15,600 rows stamped gen=3 (Mellum2-12B Instruct native/reforged leads at 81.0%)
- EVAL_GUIDE: document eval generations + the post-release addendum

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rebuild the shipped dashboard and markdown views from the full versioned dataset set (v0.6.0 + v0.7.0 + v0.7.4 + v0.7.5) so the new gen-3 16GB-tier models (LFM2.5-8B-A1B, Mellum2-12B Thinking + Instruct) appear alongside the carried-forward older generations. Mellum2-12B Instruct (native, reforged) leads the additions at 81.0%.

Also correct EVAL_GUIDE: the shipped dashboard is a multi-file render across all eval_results_v*.jsonl (dedup_latest_gen keeps newest gen per config), not a single-file render. The prior single-file example silently dropped every model absent from that one file, including the carried-forward generations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Bump version to 0.7.6 and add the CHANGELOG entry for the batch since 0.7.5: two Ollama backend fixes (#111, #115), inline <think> reasoning capture on vLLM + Ollama (#110), the shared think-tag parsing refactor (#112), the consolidated test-double fixture (#76), and the 16GB-tier eval models (#107).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

antoinezambelli and others added 3 commits June 17, 2026 11:19

Merge branch 'main' into az/evals

8845ee3

antoinezambelli merged commit 7aef050 into main Jun 18, 2026
2 checks passed

antoinezambelli deleted the az/evals branch June 18, 2026 07:28

antoinezambelli mentioned this pull request Jun 20, 2026

release: v0.7.6 #118

Merged

antoinezambelli merged 3 commits into main from az/evals
