A reproducible CLI-first benchmark for controlled Direct-versus-RAG comparisons, stability analysis, and auditable model evaluation.
Status: Fixed-set benchmark release | Public release: 2026-06-10
| Start here | Resource |
|---|---|
| Primary documentation | Release gate |
| Reproducibility / implementation | Protocol |
| Verified outcomes | Result summary |
- promptfoo for prompt/system evals
- ragas for RAG-specific metrics
- structured reports for comparisons
For external citation, portfolio use, and final benchmark conclusions, start here:
- `FINAL_RESULT_SUMMARY.md`
- `FINAL_RELEASE_GATE.md`
- `reports/README_RELEASE_STRUCTURE.md`
- `docs/benchmark_showcase.html` for a lightweight read-only presentation layer
- `reports/benchmark_final_summary_20260502/benchmark_results_final_summary_cn.md`
- `reports/benchmark_final_summary_20260502/benchmark_results_final_summary.md`

Important scoping:
- `reports/latest/*` is a local smoke / reproducibility output surface, not the promoted benchmark release.
- `data/fbtp_eval.jsonl` is a runnable 24-row smoke slice for end-to-end checks.
- `data/fbtp_eval_fixed_120.jsonl` is the official fixed-set dataset for promoted benchmark conclusions.

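
Before citing numbers, it can help to confirm which dataset a run is actually pointed at. The sketch below assumes one JSON value per line; the expected row counts come from the scoping notes above, while `count_jsonl_rows` and `check_dataset` are hypothetical helpers, not functions shipped in this repo.

```python
import json
from pathlib import Path

# Row counts taken from the scoping notes; the helper names are illustrative only.
EXPECTED_ROWS = {
    "fbtp_eval.jsonl": 24,             # smoke slice
    "fbtp_eval_fixed_120.jsonl": 120,  # official fixed set
}


def count_jsonl_rows(path):
    """Count non-empty lines, checking that each one parses as JSON."""
    rows = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises ValueError on a malformed row
            rows += 1
    return rows


def check_dataset(path):
    """Return (actual, expected) row counts; expected is None for unknown files."""
    expected = EXPECTED_ROWS.get(Path(path).name)
    return count_jsonl_rows(path), expected
```

A mismatch between the two returned counts suggests the file is not the slice the run was meant to use.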
The current promoted benchmark release uses a controlled reasoning protocol for the main result family:
- Main table: `8 models x 2 methods (Direct / RAG)` on the fixed `120`-question set
- Stability release: selected-RAG clean comparison for `DeepSeek-V3.2` and `MiniMax-M2.7`, `100` rounds each, `RAG-only`
- Category table: same `8` controlled models, broken down by `ragppi / doc-design / schema_tables`
- Appendix table: same `8` models under provider-native mode
- `topK` ablation: same `8` models with candidate pool sweep `32 / 64 / 128`

Protocol note:
- `candidate_topk_256` is deprecated and is not part of the formal benchmark protocol
- formal reporting should only use `32 / 64 / 128`

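
The protocol note can be enforced with a small guard in reporting scripts. This is a minimal sketch: the allowed and deprecated values come from the note above, but `validate_candidate_topk` and the constant names are hypothetical and do not describe how the repo itself checks the setting.

```python
# Values from the protocol note; names are illustrative assumptions.
FORMAL_CANDIDATE_TOPK = (32, 64, 128)
DEPRECATED_CANDIDATE_TOPK = (256,)


def validate_candidate_topk(value):
    """Return the value if it is allowed in formal reporting, else raise ValueError."""
    if value in DEPRECATED_CANDIDATE_TOPK:
        raise ValueError(
            f"candidate_topk={value} is deprecated and excluded from formal reporting"
        )
    if value not in FORMAL_CANDIDATE_TOPK:
        raise ValueError(
            f"candidate_topk={value} is not in the formal sweep {FORMAL_CANDIDATE_TOPK}"
        )
    return value
```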
Controlled reasoning policy:
- `DeepSeek-V3.2`: thinking disabled
- `GLM-5`: thinking disabled
- `Kimi-K2.5`: thinking disabled
- `DeepSeek-R1`: keep native reasoning behavior
- `Qwen3-235B-A22B-Instruct-2507`: controlled non-thinking protocol
- `MiniMax-M2`: controlled non-thinking protocol
- `MiniMax-M2.7`: controlled non-thinking protocol
- `ERNIE-4.5-Turbo-128K`: controlled non-thinking protocol

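
One way to keep the policy auditable is to hold it as data and derive the main-table cells from it. The sketch below mirrors the model list above; the policy label strings, the dict layout, and `main_table_cells` are assumptions made for illustration, not the repo's configuration format.

```python
# Model names come from the policy list; the label strings are illustrative.
REASONING_POLICY = {
    "DeepSeek-V3.2": "thinking_disabled",
    "GLM-5": "thinking_disabled",
    "Kimi-K2.5": "thinking_disabled",
    "DeepSeek-R1": "native_reasoning",
    "Qwen3-235B-A22B-Instruct-2507": "controlled_non_thinking",
    "MiniMax-M2": "controlled_non_thinking",
    "MiniMax-M2.7": "controlled_non_thinking",
    "ERNIE-4.5-Turbo-128K": "controlled_non_thinking",
}
METHODS = ("Direct", "RAG")


def main_table_cells():
    """Enumerate the (model, method, policy) cells of the 8 x 2 main table."""
    return [
        (model, method, policy)
        for model, policy in REASONING_POLICY.items()
        for method in METHODS
    ]
```

Calling `main_table_cells()` yields 16 entries, one per model and method pair.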
Supplementary appendix table:
- uses the same `8` models in provider-native mode
- Appendix runs are not mixed into the main or stability aggregates

Replacement note:
- `Baichuan-M3` was removed from the formal 8-model protocol on `2026-04-18` and replaced by `MiniMax-M2.7`
- this was due to provider-side `thinking` control incompatibility (`budget_tokens` errors), not a benchmark quality conclusion

Release-note clarification:
- Earlier larger stability plans and repair-heavy intermediate runs remain in the repo for auditability, but the promoted release surface is the clean summary package under `reports/benchmark_final_summary_20260502`.
- The dashboard / control-plane UI is an engineering support surface and is not the official release gate for benchmark conclusions.

See the Chinese benchmark protocol note: `docs/benchmark_protocol_cn.md`

```mermaid
flowchart TD
  D[Benchmark Dataset] --> R[RAG Pipeline]
  D --> X[Direct Pipeline]
  R --> C[Compare]
  X --> C
  C --> S[summary.md / summary.json]
  C --> P[comparison.csv / comparison.md / comparison_summary.png]
  C --> A[answer jsonl artifacts]
  R --> G[RAGAS outputs when available]
```

```bash
# promptfoo (node-based)
# npx promptfoo eval -c configs/promptfoo.yaml

# ragas (python-based)
# pip install -e ../llm-rag-knowledge-base
# python -m pipelines.rag_pipeline --data-path data/fbtp_eval.jsonl
```

```powershell
powershell -ExecutionPolicy Bypass -File scripts/run_eval.ps1
```

When remote RAGAS evaluation is unavailable, `python -m pipelines.compare --eval-mode auto` falls back to a fully local metric set and still produces:
- `reports/latest/comparison.csv`
- `reports/latest/comparison.md`
- `reports/latest/comparison_summary.png`
- `reports/latest/summary.md`
- `reports/latest/summary.json`
- `reports/latest/rag_answers.jsonl`
- `reports/latest/direct_answers.jsonl`

If you want CI or automation to stop instead of silently downgrading, run:

```bash
python -m pipelines.compare --data-path data/fbtp_eval.jsonl --output-dir reports/latest --eval-mode auto --fail-on-fallback
```

That raises a hard error whenever the command would otherwise fall back from remote evaluation to local scoring.

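
The fallback behaviour described above can be pictured as a try-remote-then-local wrapper with an opt-in hard stop. This is only a sketch of the idea, not the code in `pipelines.compare`: `evaluate`, `FallbackError`, the two scorer callables, and the `"local"` mode label are all hypothetical names.

```python
class FallbackError(RuntimeError):
    """Raised when a fallback to local scoring is not permitted."""


def evaluate(rows, remote_scorer, local_scorer, fail_on_fallback=False):
    """Try remote scoring first; fall back to local scoring unless forbidden.

    Returns (eval_mode, scores), where eval_mode is "ragas" or "local".
    """
    try:
        return "ragas", remote_scorer(rows)
    except Exception as exc:  # any remote failure counts as "unavailable" in this sketch
        if fail_on_fallback:
            raise FallbackError(
                "remote evaluation unavailable; refusing to fall back"
            ) from exc
        return "local", local_scorer(rows)
```

Returning the mode alongside the scores keeps a silent downgrade visible in whatever summary is written afterwards.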
- `data/`: benchmark datasets and small samples
- `prompts/`: prompt templates
- `pipelines/`: RAG vs direct pipelines
- `metrics/`: evaluation metrics
- `reports/`: generated results
- `finetune/`: optional LoRA/PEFT module

- Do not commit full datasets. Keep scripts and small samples only.
- `data/fbtp_eval.jsonl` now ships with a 24-row runnable smoke slice so the repo can run end-to-end.
- Official benchmark conclusions should cite `data/fbtp_eval_fixed_120.jsonl` together with the promoted outputs under `reports/benchmark_final_summary_20260502`.
- When RAGAS is available, additional outputs are generated:
  - `reports/latest/ragas_scores.csv`
  - `reports/latest/ragas_scores.md`
  - `reports/latest/ragas_summary.md`
  - `reports/latest/ragas_summary.json`
  - `reports/latest/ragas_summary.png`

After running `python -m pipelines.compare`, check:

- `reports/latest/comparison.csv`
- `reports/latest/comparison.md`
- `reports/latest/comparison_summary.png`
- `reports/latest/category_breakdown.csv`
- `reports/latest/category_breakdown.md`

| method | faithfulness | answer_relevancy | context_precision | latency_ms | estimated_cost_usd |
|---|---|---|---|---|---|
| RAG | 0.99 | 0.46 | 0.93 | 40.92 | 0.0044 |
| Direct | | 0.60 | | 6355.18 | 0.0007 |

category_breakdown.csv now groups by method + category, so you can see whether a slice is strong because the method is good overall or because one specific category is easy.
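
The grouping idea can be sketched in a few lines of standard-library Python. The `method` and `category` keys follow the description above; the row layout and the function name `breakdown_by_method_and_category` are assumptions, and this is not the code that produces `category_breakdown.csv`.

```python
from collections import defaultdict


def breakdown_by_method_and_category(rows, metric):
    """Average `metric` per (method, category) pair, skipping rows without a value."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for row in rows:
        value = row.get(metric)
        if value in (None, ""):
            continue  # e.g. a metric that is not computed for this method
        bucket = totals[(row["method"], row["category"])]
        bucket[0] += float(value)
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Keying on the pair rather than on category alone is what lets a reader separate "this method is strong" from "this category is easy".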
pipelines.compare now supports a formal experiment matrix for repeated benchmark runs across multiple models.
Example:

```bash
python -m pipelines.compare \
  --data-path data/fbtp_eval_fixed_120.jsonl \
  --output-dir reports/formal_round1_4models \
  --eval-mode auto \
  --rounds 100 \
  --sample-size 100 \
  --with-replacement \
  --seed 42 \
  --benchmark-model DeepSeek-V3.2 \
  --benchmark-model GLM-5 \
  --benchmark-model Kimi-K2.5 \
  --benchmark-model DeepSeek-R1
```

You can also pass labeled model specs:

```bash
python -m pipelines.compare \
  --benchmark-model baseline=DeepSeek-V3.2 \
  --benchmark-model cn_structured=GLM-5 \
  --benchmark-model long_context=Kimi-K2.5 \
  --benchmark-model reasoner=DeepSeek-R1
```

Formal multi-model outputs include:

- `experiment_config.json`
- `per_round_results.csv`
- `model_summary.csv`
- `leaderboard.csv`
- `models/<slug>/summary.json`
- legacy compatibility files such as `comparison.csv`

See: `reports/benchmark_output_contract_2026-04-11.md`

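
The `--rounds`, `--sample-size`, `--with-replacement`, and `--seed` flags suggest seeded resampling of the fixed set. The sketch below shows one plausible reading of those flags using only the standard library; the exact sampling scheme in `pipelines.compare` is not documented here, so treat `draw_round_samples` and its behaviour as an assumption.

```python
import random


def draw_round_samples(row_ids, rounds, sample_size, with_replacement, seed):
    """Return one list of sampled row ids per round, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    samples = []
    for _ in range(rounds):
        if with_replacement:
            samples.append(rng.choices(row_ids, k=sample_size))
        else:
            # raises ValueError if sample_size exceeds len(row_ids)
            samples.append(rng.sample(row_ids, sample_size))
    return samples
```

Under this reading, two invocations with the same seed draw identical per-round samples, which is what makes round-level results comparable across models.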
| metric | value |
|---|---|
| samples | 24 |
| eval_mode | ragas |
| RAG faithfulness | 0.9954 |
| RAG answer_relevancy | 0.4643 |
| RAG context_precision | 0.7083 |
| Direct answer_relevancy | 0.6213 |
This section is only a runnable smoke example. It is not the official promoted benchmark conclusion.
- `reports/latest/ragas_scores.csv`
- `reports/latest/ragas_scores.md`
- `reports/latest/ragas_summary.md`
- `reports/latest/ragas_summary.json`
- `reports/latest/ragas_summary.png`
- `reports/latest/direct_ragas_scores.csv`
- `reports/latest/direct_ragas_scores.md`
- `reports/latest/direct_ragas_summary.md`
- `reports/latest/direct_ragas_summary.json`
- `reports/latest/direct_ragas_summary.png`
- `reports/latest/comparison.csv`
- `reports/latest/comparison.md`
- `reports/latest/comparison_summary.png`
- `reports/latest/category_breakdown.csv`
- `reports/latest/category_breakdown.md`
- `reports/latest/summary.md`
- `reports/latest/summary.json`

- Add category-balanced evaluation slices beyond protein interaction rows.
- Persist automatic cleanup for stale RAGAS artifacts after local fallback runs.
- Split category breakdown by `method + category` instead of category-only aggregation.
- Add benchmark cards for structure-specific and methodology-specific domain questions.
