Add reasoning-pattern benchmark scaffold #435

Draft
LIU-ZHIYUAN-source wants to merge 3 commits into pie-project:main from LIU-ZHIYUAN-source:reasoning-benchmark
@LIU-ZHIYUAN-source

Contributor

Summary

This PR adds an initial benchmark scaffold for comparing inferlet-based reasoning patterns in Pie.

It includes:

  • A reasoning benchmark inferlet with Direct, Best-of-N, Tree-of-Thought, and Graph-of-Thought modes
  • A GSM8K-style benchmark runner with deterministic answer extraction/evaluation
  • A CLI execution mode that runs through pie run with a local wasm module, manifest, and config
  • Strict final-answer parsing to avoid extracting numbers from incomplete generations
  • A no-thinking control path for Qwen3-style models, currently implemented as a benchmark-local workaround
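To illustrate what strict final-answer parsing can look like, here is a minimal sketch. It is not the code in this PR: the "Final answer:" marker, the function names, and the rule that the marker must sit on the last non-empty line are all assumptions made for the example. The idea is that a generation cut off mid-reasoning yields no answer instead of whatever number happened to appear last.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical marker format; the PR does not state the exact format it expects.
FINAL_ANSWER_RE = re.compile(
    r"final answer:\s*\$?(-?\d[\d,]*(?:\.\d+)?)\.?",
    re.IGNORECASE,
)


def extract_final_answer(text: str) -> Optional[Decimal]:
    """Return the number on the last non-empty line if that line is a
    complete 'Final answer: <number>' statement, otherwise None."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return None
    match = FINAL_ANSWER_RE.fullmatch(lines[-1])
    if match is None:
        # No marker on the last line: treat the generation as incomplete.
        return None
    return Decimal(match.group(1).replace(",", ""))


def is_correct(output: str, gold: str) -> bool:
    """Deterministic exact-match scoring against a gold numeric string."""
    predicted = extract_final_answer(output)
    return predicted is not None and predicted == Decimal(gold.replace(",", ""))
```

Under these assumptions, is_correct("2 + 16 = 18\nFinal answer: 18", "18") is true, while an output that ends with "The total is 18" is scored as incorrect because the marker is missing.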

Validation

Tested locally / on RunPod with:

  • cargo fmt --manifest-path inferlets/reasoning-benchmark/Cargo.toml
  • cargo test --manifest-path inferlets/reasoning-benchmark/Cargo.toml
  • python3 -m unittest benches.test_reasoning_bench
  • RunPod smoke tests through ./target/release/pie run

Small sanity experiments:

  • Qwen3-0.6B and Qwen3-1.7B on GSM8K-10
  • Patterns: Direct, Best-of-N, Tree-of-Thought, Graph-of-Thought
  • No runner errors observed
  • Qwen3 no-thinking mode suppressed <think> output once the benchmark-local cue workaround was applied
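A check like the last one can be automated over a batch of generations. The sketch below is a hypothetical helper, not part of this PR, and it assumes the raw model outputs are available as plain strings. It reports the fraction of outputs that still contain an opening <think> tag.

```python
from typing import Iterable

THINK_OPEN = "<think>"


def think_leak_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs that still contain an opening <think> tag.

    Hypothetical helper for sanity-checking a no-thinking control path.
    Returns 0.0 for an empty batch.
    """
    texts = list(outputs)
    if not texts:
        return 0.0
    leaked = sum(1 for text in texts if THINK_OPEN in text)
    return leaked / len(texts)
```

A rate of 0.0 over the sanity runs would match the observation above; any non-zero value points at prompts where the cue did not take effect.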

Notes

This PR focuses on benchmark infrastructure, not final benchmark results. The current GSM8K-10 runs are only sanity checks and are not statistically meaningful.
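To make the sample-size caveat concrete, the sketch below computes a 95% Wilson score interval for an accuracy measured on 10 problems. The 7-out-of-10 figure is a made-up illustration, not a result from this PR.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical score: 7 correct out of 10 problems.
low, high = wilson_interval(7, 10)
print(f"{low:.2f} to {high:.2f}")
```

For this hypothetical score the interval spans roughly 0.40 to 0.89, so two patterns that differ by a few problems on GSM8K-10 cannot be told apart.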

Follow-up

This initial scaffold keeps all reasoning patterns in one benchmark inferlet for rapid validation. A planned follow-up is to split the patterns into separate inferlets and introduce a base text-completion inferlet, so user-submitted test-time scaling inferlets can be run and scored in isolation.

@LIU-ZHIYUAN-source LIU-ZHIYUAN-source marked this pull request as draft June 19, 2026 02:25