Add reasoning-pattern benchmark scaffold by LIU-ZHIYUAN-source · Pull Request #435 · pie-project/pie

LIU-ZHIYUAN-source · 2026-06-18T10:42:52Z

Summary

This PR adds an initial benchmark scaffold for comparing inferlet-based reasoning patterns in Pie.

It includes:

A reasoning benchmark inferlet with Direct, Best-of-N, Tree-of-Thought, and Graph-of-Thought modes
A GSM8K-style benchmark runner with deterministic answer extraction/evaluation
A CLI execution mode that runs through pie run with local wasm + manifest + config
Strict final-answer parsing to avoid extracting numbers from incomplete generations
A no-thinking control path for Qwen3-style models, currently implemented as a benchmark-local workaround

Validation

Tested locally / on RunPod with:

cargo fmt --manifest-path inferlets/reasoning-benchmark/Cargo.toml
cargo test --manifest-path inferlets/reasoning-benchmark/Cargo.toml
python3 -m unittest benches.test_reasoning_bench
RunPod smoke tests through ./target/release/pie run

Small sanity experiments:

Qwen3-0.6B and Qwen3-1.7B on GSM8K-10
Patterns: Direct, Best-of-N, Tree-of-Thought, Graph-of-Thought
No runner errors observed
Qwen3 no-thinking mode suppressed <think> after the benchmark-local cue workaround

Notes

This PR focuses on benchmark infrastructure, not final benchmark results. The current GSM8K-10 runs are only sanity checks and are not statistically meaningful.

Follow-up

This initial scaffold keeps all reasoning patterns in one benchmark inferlet for rapid validation. A planned follow-up is to split the patterns into separate inferlets and introduce a base text-completion inferlet, so user-submitted test-time scaling inferlets can be run and scored in isolation.

LIU-ZHIYUAN-source added 3 commits June 12, 2026 11:11

Add MVP reasoning benchmark infrastructure

2a99ff8

Add CLI mode to reasoning benchmark runner

267c901

Add CLI reasoning benchmark mode and Qwen3 no-thinking control

2b5e001

LIU-ZHIYUAN-source marked this pull request as draft June 19, 2026 02:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add reasoning-pattern benchmark scaffold#435

Add reasoning-pattern benchmark scaffold#435
LIU-ZHIYUAN-source wants to merge 3 commits into
pie-project:mainfrom
LIU-ZHIYUAN-source:reasoning-benchmark

LIU-ZHIYUAN-source commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

LIU-ZHIYUAN-source commented Jun 18, 2026

Summary

Validation

Notes

Follow-up

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant