Add 16GB-tier MoE models to gen-3 eval set + regenerate dashboard #107
Merged
Wire LFM2.5-8B-A1B and Mellum2-12B-A2.5B (Thinking + Instruct) into the batch eval harness and sampling registry, and fold their n=50 results into the v0.7.5 / gen 3 dataset. No version bump: gen is a comparability epoch, not a release version, and these are net-new configs.
- sampling_defaults: official card params for the three models
- batch_eval: GGUF entries, "new-models" config subset, reasoning-format auto flags for the two CoT models (LFM2.5, Mellum2 Thinking)
- eval_results_v0.7.5.jsonl: +15,600 rows stamped gen=3 (Mellum2-12B Instruct native/reforged leads at 81.0%)
- EVAL_GUIDE: document eval generations + the post-release addendum
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rebuild the shipped dashboard and markdown views from the full versioned dataset set (v0.6.0 + v0.7.0 + v0.7.4 + v0.7.5) so the new gen-3 16GB-tier models (LFM2.5-8B-A1B, Mellum2-12B Thinking + Instruct) appear alongside the carried-forward older generations. Mellum2-12B Instruct (native, reforged) leads the additions at 81.0%.
Also correct EVAL_GUIDE: the shipped dashboard is a multi-file render across all eval_results_v*.jsonl (dedup_latest_gen keeps newest gen per config), not a single-file render. The prior single-file example silently dropped every model absent from that one file, including the carried-forward generations.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
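The multi-file render described in this commit can be illustrated with a minimal sketch. This is not the repository's dedup_latest_gen: the function name dedup_latest_gen_sketch, the helper load_rows, and the row fields model and config are assumptions made for illustration. Only the gen field and the eval_results_v*.jsonl file pattern come from the commit message.

```python
import json
from pathlib import Path


def load_rows(paths):
    """Read every JSONL file in `paths` into one flat list of dict rows."""
    rows = []
    for path in paths:
        with Path(path).open(encoding="utf-8") as handle:
            # Skip blank lines; every other line is one JSON object.
            rows.extend(json.loads(line) for line in handle if line.strip())
    return rows


def dedup_latest_gen_sketch(rows, key_fields=("model", "config")):
    """Keep only the rows from the newest gen seen for each config.

    `key_fields` is a hypothetical definition of what identifies a config;
    the real harness may key on different columns.
    """
    newest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        newest[key] = max(newest.get(key, row["gen"]), row["gen"])
    return [
        row
        for row in rows
        if row["gen"] == newest[tuple(row[field] for field in key_fields)]
    ]
```

Under these assumptions, a model whose latest rows live only in an older file survives the render only if that file is part of the input, for example by passing sorted(Path(results_dir).glob("eval_results_v*.jsonl")) to load_rows, where results_dir is whatever directory holds the datasets. Feeding a single file in would drop such a model before deduplication ever runs, which matches the failure the commit describes.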
antoinezambelli added a commit that referenced this pull request Jun 20, 2026
Bump version to 0.7.6 and add the CHANGELOG entry for the batch since 0.7.5: two Ollama backend fixes (#111, #115), inline <think> reasoning capture on vLLM + Ollama (#110), the shared think-tag parsing refactor (#112), the consolidated test-double fixture (#76), and the 16GB-tier eval models (#107). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Adds the 16GB-tier MoE models to the v0.7.5 / gen-3 eval set and regenerates the shipped dashboard so they appear in the published results.
This is not a release: gen is a comparability epoch, not a version. No version bump.
Changes
- 18b09e6: eval set additions (LFM2.5-8B-A1B, Mellum2-12B-A2.5B Thinking + Instruct)
  - sampling_defaults: official card params for the three models
  - batch_eval: GGUF entries, new-models config subset, reasoning-format auto flags for the two CoT models
  - eval_results_v0.7.5.jsonl: +15,600 rows stamped gen=3 (LFS)
  - EVAL_GUIDE: eval generations + post-release addendum
- 3d54141: dashboard regeneration + docs fix
  - Rebuilt docs/results/dashboard.html and markdown views from the full versioned dataset set (v0.6.0 + v0.7.0 + v0.7.4 + v0.7.5), so the new gen-3 models slot in alongside the carried-forward older generations (gen 1 = v0.6.0, gen 2 = v0.7.0 + v0.7.4).
  - Corrected EVAL_GUIDE regen instructions: the shipped dashboard is a multi-file render (dedup_latest_gen keeps newest gen per config), not single-file. The old single-file example silently dropped every model absent from that one file, including the entire 27–35B tier and the Anthropic baselines.
Headline
Mellum2-12B-A2.5B Instruct (native, reforged) leads the additions at 81.0%.
Notes
- main (verified via merge-tree): touches a disjoint file set from the #76 MockClient consolidation already on main.
- none only, not the full none/keep-last/full grid the other gen-3 8–14B models carry.
🤖 Generated with Claude Code