data(estimate): add deepseek-v4-flash benchmark results #1294

Merged
k08200 merged 2 commits into main from benchmark/deepseek-v4-flash on Apr 27, 2026

k08200 commented Apr 27, 2026

Contributor

Summary

Adds deepseek-v4-flash evaluation reports for the four standard benchmark projects (todo, reddit, shopping, erp) and updates batch-summary.json so the dev surface (autobe.dev) picks up the new model.

Results

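| project  | grade | score | gate | doc | req | test | logic | api |
|----------|-------|-------|------|-----|-----|------|-------|-----|
| todo     | B     | 87    | PASS | 91  | 100 | 79   | 100   | 100 |
| reddit   | C     | 78    | PASS | 87  | 100 | 74   | 90    | 100 |
| shopping | C     | 73    | PASS | 84  | 97  | 74   | 92    | 100 |
| erp      | B     | 80    | PASS | 91  | 100 | 85   | 92    | 100 |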

Average: 79.5 (B-tier, comparable to claude-sonnet-4.6 / qwen3.5-397b-a17b at 80.0).
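
For reviewers who want to re-check the headline number, here is a minimal sketch that recomputes the average from the four per-project scores in the table. It assumes the average is a plain unweighted mean of the `score` column; the `scores` dictionary is written out by hand here and is not read from the report files.

```python
from statistics import mean

# Per-project scores copied from the Results table in this PR.
scores = {"todo": 87, "reddit": 78, "shopping": 73, "erp": 80}

# Assumption: the reported average is an unweighted mean over the four projects.
average = mean(scores.values())

print(f"{average:.1f}")
```

Under that assumption the result matches the 79.5 reported above.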

Changes

  • packages/estimate/reports/benchmark/deepseek-v4-flash/{erp,reddit,shopping,todo}/estimate-report.json — new reports
  • packages/estimate/reports/benchmark/batch-summary.json — 4 new entries inserted after deepseek-v3.2
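
The "inserted after deepseek-v3.2" placement can be sketched as follows. This is only an illustration: it assumes the summary is a flat list of entries with a `model` field, which is a guess and not the confirmed schema of `batch-summary.json`. The helper `insert_after_model` and the `other-model` placeholder are hypothetical names introduced for this example.

```python
def insert_after_model(entries, anchor_model, new_entries):
    """Return a new list with new_entries placed right after the last
    entry whose 'model' equals anchor_model (appended if none matches)."""
    last = -1
    for i, entry in enumerate(entries):
        if entry.get("model") == anchor_model:
            last = i
    if last == -1:
        # Anchor model not present: fall back to appending at the end.
        return entries + new_entries
    return entries[: last + 1] + new_entries + entries[last + 1 :]


# Hypothetical, simplified summary; the real file likely has more fields.
summary = [
    {"model": "deepseek-v3.2", "project": "todo"},
    {"model": "deepseek-v3.2", "project": "erp"},
    {"model": "other-model", "project": "todo"},
]

# Scores taken from the Results table in this PR.
new_entries = [
    {"model": "deepseek-v4-flash", "project": "todo", "score": 87},
    {"model": "deepseek-v4-flash", "project": "reddit", "score": 78},
    {"model": "deepseek-v4-flash", "project": "shopping", "score": 73},
    {"model": "deepseek-v4-flash", "project": "erp", "score": 80},
]

updated = insert_after_model(summary, "deepseek-v3.2", new_entries)
print([entry["model"] for entry in updated])
```

Returning a new list instead of mutating in place keeps the original summary available for diffing before the file is written back.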

k08200 added 2 commits April 27, 2026 12:51
Adds deepseek-v4-flash benchmark reports for the four standard projects
and updates batch-summary.json so dev picks up the new model.

| project  | grade | score |
|----------|-------|-------|
| todo     | B     | 87    |
| reddit   | C     | 78    |
| shopping | C     | 73    |
| erp      | B     | 80    |
…v4-flash

- Add deepseek-v4-flash to MODEL_TO_RESULTS_PATH so the aggregator
  resolves its autobe-examples directory and enriches pipeline detail.
- Regenerate apps/dashboard-ui/public/benchmark-summary.json and
  website/public/benchmark/benchmark-summary.json so localhost:3000/benchmark
  and the deployed dashboard surface the new model.
k08200 merged commit 20ad514 into main on Apr 27, 2026
2 checks passed
k08200 deleted the benchmark/deepseek-v4-flash branch on April 27, 2026 04:05