LLM Eval Benchmark

A reproducible CLI-first benchmark for controlled Direct-versus-RAG comparisons, stability analysis, and auditable model evaluation.

Status: Fixed-set benchmark release | Public release: 2026-06-10

Start here	Resource
Primary documentation	Release gate
Reproducibility / implementation	Protocol
Verified outcomes	Result summary

Highlights

promptfoo for prompt/system evals
ragas for RAG-specific metrics
structured reports for comparisons

Official Release Entry

For external citation, portfolio use, and final benchmark conclusions, start here:

FINAL_RESULT_SUMMARY.md
FINAL_RELEASE_GATE.md
reports/README_RELEASE_STRUCTURE.md
docs/benchmark_showcase.html for a lightweight read-only presentation layer
reports/benchmark_final_summary_20260502/benchmark_results_final_summary_cn.md
reports/benchmark_final_summary_20260502/benchmark_results_final_summary.md

Important scoping:

reports/latest/* is a local smoke / reproducibility output surface, not the promoted benchmark release.
data/fbtp_eval.jsonl is a runnable 24-row smoke slice for end-to-end checks.
data/fbtp_eval_fixed_120.jsonl is the official fixed-set dataset for promoted benchmark conclusions.
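
To confirm which dataset a run is actually using, a row count is usually enough. The sketch below is a minimal example, not part of the repository: `count_jsonl_rows` and `EXPECTED_ROWS` are hypothetical names, and the expected counts are the 24 and 120 stated above.

```python
import json

# Row counts stated in this README; the names here are hypothetical.
EXPECTED_ROWS = {
    "data/fbtp_eval.jsonl": 24,             # smoke slice
    "data/fbtp_eval_fixed_120.jsonl": 120,  # official fixed set
}


def count_jsonl_rows(lines):
    """Count non-blank lines, checking that each one parses as JSON."""
    rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        json.loads(line)  # raises ValueError on a malformed row
        rows += 1
    return rows


def matches_expected(path, lines, expected=EXPECTED_ROWS):
    """True if the row count equals the documented count for this path."""
    return count_jsonl_rows(lines) == expected.get(path)
```

Usage, assuming the file exists locally: `with open(path, encoding="utf-8") as fh: matches_expected(path, fh)`.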

Formal Benchmark Protocol (Current)

The current promoted benchmark release uses a controlled reasoning protocol for the main result family:

Main table: 8 models x 2 methods (Direct / RAG) on the fixed 120-question set
Stability release: selected-RAG clean comparison for DeepSeek-V3.2 and MiniMax-M2.7, 100 rounds each, RAG-only
Category table: same 8 controlled models, broken down by ragppi / doc-design / schema_tables
Appendix table: same 8 models under provider-native mode
topK ablation: same 8 models with candidate pool sweep 32 / 64 / 128

Protocol note:

candidate_topk_256 is deprecated and is not part of the formal benchmark protocol.
Formal reporting should use only the 32 / 64 / 128 candidate pool sizes.
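
A guard like the following could keep deprecated pool sizes out of formal reports. It is only a sketch: `validate_candidate_topk` is a hypothetical helper, not a function from this repository, and the allowed values are the ones listed in the protocol note.

```python
FORMAL_TOPK = (32, 64, 128)  # candidate pool sizes named in the protocol
DEPRECATED_TOPK = (256,)     # candidate_topk_256


def validate_candidate_topk(value):
    """Return value if it is in the formal sweep, otherwise raise ValueError."""
    if value in DEPRECATED_TOPK:
        raise ValueError(
            f"candidate top-k {value} is deprecated; use one of {FORMAL_TOPK}"
        )
    if value not in FORMAL_TOPK:
        raise ValueError(
            f"candidate top-k {value} is not in the formal sweep {FORMAL_TOPK}"
        )
    return value
```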

Controlled reasoning policy:

DeepSeek-V3.2: thinking disabled
GLM-5: thinking disabled
Kimi-K2.5: thinking disabled
DeepSeek-R1: keep native reasoning behavior
Qwen3-235B-A22B-Instruct-2507: controlled non-thinking protocol
MiniMax-M2: controlled non-thinking protocol
MiniMax-M2.7: controlled non-thinking protocol
ERNIE-4.5-Turbo-128K: controlled non-thinking protocol
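
The policy above can be written down as data and expanded into the main-table run matrix. This is an illustrative sketch: the dictionary copies the policy strings from this README, while `REASONING_POLICY`, `METHODS`, and `main_table_cells` are hypothetical names rather than repository identifiers.

```python
from itertools import product

# Model -> reasoning policy, copied from the list above.
REASONING_POLICY = {
    "DeepSeek-V3.2": "thinking disabled",
    "GLM-5": "thinking disabled",
    "Kimi-K2.5": "thinking disabled",
    "DeepSeek-R1": "keep native reasoning behavior",
    "Qwen3-235B-A22B-Instruct-2507": "controlled non-thinking protocol",
    "MiniMax-M2": "controlled non-thinking protocol",
    "MiniMax-M2.7": "controlled non-thinking protocol",
    "ERNIE-4.5-Turbo-128K": "controlled non-thinking protocol",
}
METHODS = ("Direct", "RAG")


def main_table_cells(policy=REASONING_POLICY, methods=METHODS):
    """One (model, method, policy) tuple per cell of the main table."""
    return [
        (model, method, policy[model])
        for model, method in product(policy, methods)
    ]


cells = main_table_cells()
print(len(cells))  # 8 models x 2 methods = 16
```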

Supplementary appendix table:

The appendix table uses the same 8 models in provider-native mode.
Appendix runs are not mixed into the main or stability aggregates.

Replacement note:

Baichuan-M3 was removed from the formal 8-model protocol on 2026-04-18 and replaced by MiniMax-M2.7.
The removal was caused by a provider-side incompatibility with thinking control (budget_tokens errors); it is not a conclusion about benchmark quality.

Release-note clarification:

Earlier larger stability plans and repair-heavy intermediate runs remain in the repo for auditability, but the promoted release surface is the clean summary package under reports/benchmark_final_summary_20260502.
The dashboard / control-plane UI is an engineering support surface and is not the official release gate for benchmark conclusions.

See the Chinese benchmark protocol note:

docs/benchmark_protocol_cn.md

Architecture

flowchart TD
  D[Benchmark Dataset] --> R[RAG Pipeline]
  D --> X[Direct Pipeline]
  R --> C[Compare]
  X --> C
  C --> S[summary.md / summary.json]
  C --> P[comparison.csv / comparison.md / comparison_summary.png]
  C --> A[answer jsonl artifacts]
  R --> G[RAGAS outputs when available]

Quick Start

# promptfoo (node-based)
# npx promptfoo eval -c configs/promptfoo.yaml

# ragas (python-based)
# pip install -e ../llm-rag-knowledge-base
# python -m pipelines.rag_pipeline --data-path data/fbtp_eval.jsonl

One-Click Eval

powershell -ExecutionPolicy Bypass -File scripts/run_eval.ps1

Local Comparison Mode

When remote RAGAS evaluation is unavailable, python -m pipelines.compare --eval-mode auto falls back to a fully local metric set and still produces:

reports/latest/comparison.csv
reports/latest/comparison.md
reports/latest/comparison_summary.png
reports/latest/summary.md
reports/latest/summary.json
reports/latest/rag_answers.jsonl
reports/latest/direct_answers.jsonl

If you want CI or automation to stop instead of silently downgrading, run:

python -m pipelines.compare --data-path data/fbtp_eval.jsonl --output-dir reports/latest --eval-mode auto --fail-on-fallback

That raises a hard error whenever the command would otherwise fall back from remote evaluation to local scoring.
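
In a CI job, the same command can be wrapped from Python. The sketch below reuses only the flags shown above; `build_compare_command` and `run_compare_strict` are hypothetical helpers, and it assumes the hard error surfaces as a non-zero exit code, which this README does not state explicitly.

```python
import subprocess
import sys


def build_compare_command(
    data_path="data/fbtp_eval.jsonl",
    output_dir="reports/latest",
    fail_on_fallback=True,
):
    """Assemble the argv list for the comparison run shown above."""
    cmd = [
        sys.executable, "-m", "pipelines.compare",
        "--data-path", data_path,
        "--output-dir", output_dir,
        "--eval-mode", "auto",
    ]
    if fail_on_fallback:
        cmd.append("--fail-on-fallback")
    return cmd


def run_compare_strict():
    """Run the comparison; True means remote evaluation completed.

    Assumption: a fallback with --fail-on-fallback exits non-zero.
    """
    completed = subprocess.run(build_compare_command(), check=False)
    return completed.returncode == 0
```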

Repo Layout

data/: benchmark datasets and small samples
prompts/: prompt templates
pipelines/: RAG vs direct pipelines
metrics/: evaluation metrics
reports/: generated results
finetune/: optional LoRA/PEFT module

Notes

Do not commit full datasets. Keep scripts and small samples only.
data/fbtp_eval.jsonl now ships with a 24-row runnable smoke slice so the repo can run end-to-end.
Official benchmark conclusions should cite data/fbtp_eval_fixed_120.jsonl together with the promoted outputs under reports/benchmark_final_summary_20260502.
When RAGAS is available, additional outputs are generated:
- reports/latest/ragas_scores.csv
- reports/latest/ragas_scores.md
- reports/latest/ragas_summary.md
- reports/latest/ragas_summary.json
- reports/latest/ragas_summary.png

RAG vs Direct (Comparison)

After running python -m pipelines.compare, check:

reports/latest/comparison.csv
reports/latest/comparison.md
reports/latest/comparison_summary.png
reports/latest/category_breakdown.csv
reports/latest/category_breakdown.md

Comparison Output (Sample)

method	faithfulness	answer_relevancy	context_precision	latency_ms	estimated_cost_usd
RAG	0.99	0.46	0.93	40.92	0.0044
Direct		0.60		6355.18	0.0007
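
The Direct row has no value for faithfulness or context_precision, presumably because those metrics need retrieved context. Code that reads the comparison output should therefore tolerate empty cells. The sketch below parses the sample rows as shown here (tab-separated); the delimiter of the real comparison.csv is an assumption you may need to change, and `load_comparison` is a hypothetical helper.

```python
import csv
import io

# The sample table above, reproduced verbatim with tab separators.
SAMPLE = (
    "method\tfaithfulness\tanswer_relevancy\tcontext_precision"
    "\tlatency_ms\testimated_cost_usd\n"
    "RAG\t0.99\t0.46\t0.93\t40.92\t0.0044\n"
    "Direct\t\t0.60\t\t6355.18\t0.0007\n"
)


def load_comparison(text, delimiter="\t"):
    """Map method -> {metric: float or None}; empty cells become None."""
    rows = {}
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        method = record.pop("method")
        rows[method] = {
            name: (float(value) if value else None)
            for name, value in record.items()
        }
    return rows


table = load_comparison(SAMPLE)
print(table["Direct"]["faithfulness"])  # None
```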

Category Breakdown

category_breakdown.csv now groups by method + category, so you can tell whether a high score reflects a method that is strong overall or a single category that is simply easy.
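
Grouping by method + category amounts to averaging a metric over (method, category) keys. The following is a minimal sketch of that idea, not the repository's implementation: `breakdown` is a hypothetical function, and the rows in the usage example are made-up illustrative values, not benchmark results.

```python
from collections import defaultdict


def breakdown(rows, metric="answer_relevancy"):
    """Average one metric per (method, category) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        key = (row["method"], row["category"])
        totals[key][0] += row[metric]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Illustrative rows only; the scores are invented for the example.
rows = [
    {"method": "RAG", "category": "ragppi", "answer_relevancy": 0.5},
    {"method": "RAG", "category": "ragppi", "answer_relevancy": 1.0},
    {"method": "Direct", "category": "ragppi", "answer_relevancy": 0.25},
]
print(breakdown(rows)[("RAG", "ragppi")])  # 0.75
```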

Multi-Model Repeated Sampling

pipelines.compare now supports a formal experiment matrix for repeated benchmark runs across multiple models.

Example:

python -m pipelines.compare \
  --data-path data/fbtp_eval_fixed_120.jsonl \
  --output-dir reports/formal_round1_4models \
  --eval-mode auto \
  --rounds 100 \
  --sample-size 100 \
  --with-replacement \
  --seed 42 \
  --benchmark-model DeepSeek-V3.2 \
  --benchmark-model GLM-5 \
  --benchmark-model Kimi-K2.5 \
  --benchmark-model DeepSeek-R1
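
Conceptually, --rounds, --sample-size, --with-replacement, and --seed describe seeded bootstrap-style resampling of the dataset. The sketch below shows that idea under stated assumptions: how pipelines.compare actually draws its samples (one generator for all rounds, per-round seeds, and so on) is not documented here, and `sample_rounds` is a hypothetical function.

```python
import random


def sample_rounds(n_rows, rounds=100, sample_size=100,
                  with_replacement=True, seed=42):
    """Yield one list of row indices per round, reproducible for a given seed.

    Assumption: a single seeded generator is shared across all rounds.
    """
    rng = random.Random(seed)
    for _ in range(rounds):
        if with_replacement:
            # The same row may be drawn more than once within a round.
            yield [rng.randrange(n_rows) for _ in range(sample_size)]
        else:
            # Requires sample_size <= n_rows; every index is unique.
            yield rng.sample(range(n_rows), sample_size)


# 100 rounds of 100 draws from the 120-question fixed set.
all_rounds = list(sample_rounds(120))
print(len(all_rounds), len(all_rounds[0]))  # 100 100
```

Because the generator is seeded, calling `sample_rounds(120)` twice yields identical index lists, which is what makes per-round results comparable across models.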

You can also pass labeled model specs:

python -m pipelines.compare \
  --benchmark-model baseline=DeepSeek-V3.2 \
  --benchmark-model cn_structured=GLM-5 \
  --benchmark-model long_context=Kimi-K2.5 \
  --benchmark-model reasoner=DeepSeek-R1
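
A label=model spec can be split on the first "=". The parser below is a hypothetical sketch of that convention, not the repository's code; in particular, falling back to the model name as the label for unlabeled specs is an assumption.

```python
def parse_model_spec(spec):
    """Split 'label=model' into (label, model).

    Assumption: an unlabeled spec uses the model name as its label.
    """
    label, sep, model = spec.partition("=")
    if not sep:
        return spec, spec
    if not label or not model:
        raise ValueError(f"malformed model spec: {spec!r}")
    return label, model


print(parse_model_spec("baseline=DeepSeek-V3.2"))  # ('baseline', 'DeepSeek-V3.2')
print(parse_model_spec("GLM-5"))                   # ('GLM-5', 'GLM-5')
```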

Formal multi-model outputs include:

experiment_config.json
per_round_results.csv
model_summary.csv
leaderboard.csv
models/<slug>/summary.json
legacy compatibility files such as comparison.csv
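
After a formal run, a quick check that the listed files exist can catch incomplete output directories. This is a sketch only: `expected_outputs` and `missing_outputs` are hypothetical helpers, the file names are the ones listed above, and how a model name maps to its `<slug>` is not specified here, so slugs are passed in by the caller.

```python
from pathlib import Path

# Top-level files named in the list above.
REQUIRED_FILES = (
    "experiment_config.json",
    "per_round_results.csv",
    "model_summary.csv",
    "leaderboard.csv",
)


def expected_outputs(slugs):
    """Relative paths expected for a run covering the given model slugs."""
    return list(REQUIRED_FILES) + [
        f"models/{slug}/summary.json" for slug in slugs
    ]


def missing_outputs(output_dir, slugs):
    """Return the expected relative paths that are not files under output_dir."""
    root = Path(output_dir)
    return [rel for rel in expected_outputs(slugs) if not (root / rel).is_file()]
```

Usage, with a placeholder slug: `missing_outputs("reports/formal_round1_4models", ["some-model-slug"])` returns an empty list when everything is present.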

See:

reports/benchmark_output_contract_2026-04-11.md

Evaluation Output (Sample)

RAGAS Summary (Smoke Example, 24-row Checked-In Slice)

metric	value
samples	24
eval_mode	ragas
RAG faithfulness	0.9954
RAG answer_relevancy	0.4643
RAG context_precision	0.7083
Direct answer_relevancy	0.6213

This section is only a runnable smoke example. It is not the official promoted benchmark conclusion.

File Outputs

reports/latest/ragas_scores.csv
reports/latest/ragas_scores.md
reports/latest/ragas_summary.md
reports/latest/ragas_summary.json
reports/latest/ragas_summary.png
reports/latest/direct_ragas_scores.csv
reports/latest/direct_ragas_scores.md
reports/latest/direct_ragas_summary.md
reports/latest/direct_ragas_summary.json
reports/latest/direct_ragas_summary.png
reports/latest/comparison.csv
reports/latest/comparison.md
reports/latest/comparison_summary.png
reports/latest/category_breakdown.csv
reports/latest/category_breakdown.md
reports/latest/summary.md
reports/latest/summary.json

Roadmap

Add category-balanced evaluation slices beyond protein interaction rows.
Persist automatic cleanup for stale RAGAS artifacts after local fallback runs.
Split category breakdown by method + category instead of category-only aggregation (already reflected in category_breakdown.csv; see Category Breakdown).
Add benchmark cards for structure-specific and methodology-specific domain questions.
