Add Mate benchmark suite with Claude Code and Codex adapters by wachterjohannes · Pull Request #21 · MatesOfMate/.github

wachterjohannes · 2026-05-01T20:04:54Z

Summary

- Introduces a new benchmark/ suite for measuring assistant performance on Symfony bug-fix scenarios with and without Mate
- Adds adapters for Claude Code and Codex built on the symfony/ai platform bridges, with a shared platform adapter, MCP provisioning, and JSON/Markdown reporting
- Ships ten reproducible scenarios, deterministic evaluators (functional, root-cause, mate-tool-usage, minimality, verification, efficiency), weighted scoring, and a benchmark:compare command

Highlights

- Scenario semantics: expected_tool_calls (all-required) and expected_tool_calls_any (any-of) so alternative tools satisfying the same intent score correctly
- Token accuracy: TokenUsage::totalTokens() now includes cached tokens, exposing the full context processed under Anthropic prompt caching (e.g. 313k vs the previously reported ~12 input tokens)
- Mate provisioning: dedicated MateProvisioner + interface for clean factory composition and testability
- Reports: JSON/Markdown writers plus a CLI comparator surfacing score, token, duration, and mate-call deltas
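
The all-required versus any-of semantics can be illustrated with a minimal sketch. It is written in Python for brevity, while the suite itself is PHP. The function name and the tool names `tool_a` and `tool_b` are hypothetical; only `symfony_logs` appears in this PR, and the real evaluator may combine the two fields differently.

```python
def tool_expectations_met(actual_calls, required=(), any_of=()):
    """Check observed tool calls against a scenario's expectations.

    required: every listed tool must have been called (expected_tool_calls).
    any_of:   at least one listed tool must have been called
              (expected_tool_calls_any); an empty group is always satisfied.
    """
    called = set(actual_calls)
    all_required_present = set(required) <= called
    any_of_satisfied = not any_of or bool(called & set(any_of))
    return all_required_present and any_of_satisfied


# A run that used an alternative tool still satisfies the any-of group.
print(tool_expectations_met(["tool_b"], any_of=["tool_a", "tool_b"]))
# A run that never called the required tool does not pass.
print(tool_expectations_met([], required=["symfony_logs"]))
```

Under these assumptions the first call yields `True` and the second `False`.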

Validation results

mate.custom-tool-required scenario, --repeat=1:

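| Adapter | Mate | Score | Tokens | Duration | Mate calls |
| --- | --- | --- | --- | --- | --- |
| claude | off | 2.93 | 1,128,832 | 132,830 ms | 0 |
| claude | on | 4.48 | 313,728 | 32,898 ms | 12 |
| codex | off | 2.98 | 317,743 | 36,952 ms | 0 |
| codex | on | 4.48 | 194,692 | 41,275 ms | 1 |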

With Mate enabled, both adapters reach the same score of 4.48. The remaining points are withheld by the verification evaluator (no command re-run signal in stdout) and the efficiency evaluator (token threshold).
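
To make the Mate-on versus Mate-off differences explicit, the deltas can be derived from the reported figures. This is a small Python sketch that only restates the validation results; the dictionary layout and the `delta` helper are illustrative, not the format used by benchmark:compare.

```python
# Figures copied from the validation results (single run, --repeat=1).
runs = {
    ("claude", "off"): {"score": 2.93, "tokens": 1_128_832, "duration_ms": 132_830},
    ("claude", "on"): {"score": 4.48, "tokens": 313_728, "duration_ms": 32_898},
    ("codex", "off"): {"score": 2.98, "tokens": 317_743, "duration_ms": 36_952},
    ("codex", "on"): {"score": 4.48, "tokens": 194_692, "duration_ms": 41_275},
}


def delta(adapter):
    """Return Mate-on minus Mate-off for each metric of one adapter."""
    off, on = runs[(adapter, "off")], runs[(adapter, "on")]
    return {
        "score": round(on["score"] - off["score"], 2),
        "tokens": on["tokens"] - off["tokens"],
        "duration_ms": on["duration_ms"] - off["duration_ms"],
    }


print(delta("claude"))  # {'score': 1.55, 'tokens': -815104, 'duration_ms': -99932}
print(delta("codex"))   # {'score': 1.5, 'tokens': -123051, 'duration_ms': 4323}
```

Both adapters gain about 1.5 points and use fewer tokens with Mate enabled; Claude is also faster, whereas Codex is slightly slower. With a single repetition these numbers are indicative rather than statistically robust.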

Test plan

- composer test (173 tests, 585 assertions passing)
- Run bin/console benchmark:run --scenario=mate.custom-tool-required --adapter=claude --mate=enabled end-to-end
- Run bin/console benchmark:run --scenario=mate.custom-tool-required --adapter=codex --mate=enabled end-to-end
- Inspect generated results.json and Markdown report for one scenario
- Run bin/console benchmark:compare --latest against two adjacent runs

- Set up benchmark/ package with Symfony Console application
- Implement benchmark:run command with all required options
- Define scenario YAML format with JSON schema validation
- Add Scenario value object, loader, validator and repository
- Cover loader, validator, repository and command with PHPUnit tests
- Include first scenario (bug.autowiring.private-service) and PLAN/specs

- Introduce Workspace value object and WorkspaceFactory with per-attempt paths
- Add FixtureCopier that mirrors fixtures without mutating the source
- Add CommandExecutor capturing stdout, stderr, exit code, duration and timeouts
- Add GitDiffCollector with init / seal-after-setup / collect-against-baseline flow
- Pin Codex platform reference and surface milestone 03 status in the README
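
What a command executor of this kind captures can be sketched as follows. This is an illustrative Python version built on the standard `subprocess` module, not the PHP CommandExecutor from this PR; the `CommandResult` fields and the `run_command` name are assumptions.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommandResult:
    stdout: str
    stderr: str
    exit_code: Optional[int]  # None when the command timed out
    duration_ms: float
    timed_out: bool


def _as_text(data):
    """Normalise captured output, which may be None or bytes after a timeout."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def run_command(argv, cwd=None, timeout_s=60.0):
    """Run argv, capturing stdout, stderr, exit code, duration and timeout state."""
    started = time.monotonic()
    try:
        proc = subprocess.run(
            argv, cwd=cwd, capture_output=True, text=True, timeout=timeout_s
        )
        stdout, stderr = proc.stdout, proc.stderr
        exit_code, timed_out = proc.returncode, False
    except subprocess.TimeoutExpired as exc:
        stdout, stderr = _as_text(exc.stdout), _as_text(exc.stderr)
        exit_code, timed_out = None, True
    duration_ms = (time.monotonic() - started) * 1000
    return CommandResult(stdout, stderr, exit_code, duration_ms, timed_out)


result = run_command([sys.executable, "-c", "print('ok')"])
print(result.exit_code, result.stdout.strip(), result.timed_out)
```

Recording the timeout as a separate flag, rather than as a special exit code, keeps a hung verification command distinguishable from one that simply failed.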

- Define AssistantAdapterInterface with AssistantRunInput/AssistantRunResult/TokenUsage/ToolCall
- Implement NullAdapter as a deterministic no-op assistant for runner exercise
- Add AdapterRegistry resolving --adapter values, with UnsupportedAdapterException
- Add ScenarioRunner orchestrating fixture copy, setup, seal, adapter run, diff and verify
- Wire BenchmarkRunCommand to actually execute scenarios end-to-end (with --list opt-out)

- Add MateConfiguration value object with enabled/disabled factories and per-run env hints
- Add MateConfigurationFactory writing .mate/config.json into the workspace before sealing the baseline
- Add MateMetricsCollector aggregating tool call count, names, first-call timestamp, errors and missing expected tools
- Replace AssistantRunInput::mateEnabled with the richer MateConfiguration field and surface metrics on RunOutcome
- Drive the --mate=enabled|disabled CLI toggle through ScenarioRunner and reflect it on the outcome line

- Define MetricsBag holding every required and optional metric with null defaults
- Add five collectors: duration, token usage, tool usage, diff and command results
- Add MetricsAggregator merging collector output into a single bag
- Surface MetricsBag on RunOutcome and wire the aggregator through ScenarioRunner and bin/console

- Define EvaluatorInterface with EvaluationInput and EvaluationResult (score, pass/fail, evidence)
- Add FunctionalEvaluator scoring re-runnable verification commands
- Add rule-based RootCauseEvaluator matching scenario keywords against assistant output and diff
- Add DiffMinimalityEvaluator and ForbiddenChangesEvaluator gating on file footprint
- Add VerificationEvaluator and MateToolUsageEvaluator using stdout and tool-call evidence
- Add EfficiencyEvaluator combining duration and token thresholds
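
A rule-based root-cause check of the kind described above might look like the following Python sketch. The scoring rule (fraction of keywords found, case-insensitive) is an assumption made for illustration; the PR does not state how RootCauseEvaluator weighs its matches.

```python
def root_cause_score(keywords, assistant_output, diff):
    """Return the fraction of scenario keywords found in the output or the diff.

    Matching is a case-insensitive substring search (an assumed rule).
    An empty keyword list scores 0.0 because nothing can be confirmed.
    """
    if not keywords:
        return 0.0
    haystack = f"{assistant_output}\n{diff}".lower()
    hits = [keyword for keyword in keywords if keyword.lower() in haystack]
    return len(hits) / len(keywords)


# Hypothetical keywords and assistant output for an autowiring scenario.
score = root_cause_score(
    ["private", "autowiring"],
    "The service is marked private, so it cannot be fetched directly.",
    "",
)
print(score)  # 0.5
```

Keyword matching is deterministic and cheap, but it rewards mentioning the right terms rather than proving the diagnosis, which is presumably why it is only one evaluator among several.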

- Add ScoreWeights with PLAN defaults, scenario-level overrides and percentage normalisation
- Add Score value object exposing final and raw values, per-category scores, missing evaluators and gate penalties
- Add ScoreCalculator applying default weights and a forbidden-changes gate penalty
- Add EvaluationPipeline running the seven default judges and converting evaluator errors into failing results
- Surface evaluations and Score on RunOutcome and render the final score on each command outcome line
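
Weighted scoring with normalisation, scenario overrides and a gate penalty can be sketched as below. The weights, category names, penalty value and the 0 to 1 scale are all hypothetical; the PLAN defaults and the actual score scale are not given in this PR, and treating a missing evaluator as 0.0 is likewise an assumption.

```python
def normalise(weights):
    """Scale weights so they sum to 1.0, whatever unit they were given in."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: value / total for name, value in weights.items()}


def final_score(category_scores, default_weights, overrides=None, gate_penalty=0.0):
    """Combine per-category scores (each in [0, 1]) into one weighted score.

    Scenario-level overrides replace default weights before normalisation.
    Categories without a score count as 0.0 (assumed handling).
    The gate penalty is subtracted afterwards and the result floored at 0.0.
    """
    weights = normalise({**default_weights, **(overrides or {})})
    raw = sum(weights[name] * category_scores.get(name, 0.0) for name in weights)
    return max(0.0, raw - gate_penalty)


# Hypothetical percentage weights and category scores.
defaults = {"functional": 50, "root_cause": 30, "efficiency": 20}
scores = {"functional": 1.0, "root_cause": 0.5, "efficiency": 0.0}
print(round(final_score(scores, defaults), 2))                     # 0.65
print(round(final_score(scores, defaults, gate_penalty=0.25), 2))  # 0.4
```

Normalising after applying overrides means a scenario can raise one weight without having to rebalance the others by hand.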

- Introduce ProcessAdapter base spawning a CLI binary, piping the prompt via stdin and parsing structured output
- Add ClaudeCodeAdapter wired to "claude --print --output-format=stream-json --bare" with --mcp-config when Mate is enabled
- Add CodexAdapter wired to "codex exec --json --skip-git-repo-check --sandbox=workspace-write"
- Provide ClaudeStreamJsonParser and CodexJsonParser as best-effort JSONL parsers for token usage and tool calls
- Allow binary path and extra flags overrides via BENCHMARK_CLAUDE_BIN/ARGS and BENCHMARK_CODEX_BIN/ARGS
- Register both adapters in bin/console alongside NullAdapter
- Cover parsers and adapters with offline PHP fakes so no real model calls are made in tests

- Add ReportContext bundling run metadata and outcomes for the writers
- Add ArtifactsWriter persisting diffs, command logs and raw assistant stdout/stderr per attempt
- Add JsonReportWriter producing a deterministic results.json with summary, scenarios, evaluations, metrics and Mate data
- Add MarkdownReportWriter rendering every spec section: summary, adapter, Mate, scenarios, tool/token usage, slowest, failed, most-changed files
- Add ReportPipeline composing the three writers and wire it into BenchmarkRunCommand to emit reports under reports/<run-id>/
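
One common way to make a JSON report deterministic is to serialise with sorted keys and fixed formatting, so the same data always yields byte-identical output. The Python sketch below shows that idea only; how JsonReportWriter achieves determinism is not described in this PR, and the sample keys are hypothetical.

```python
import json


def render_results(results):
    """Serialise results with sorted keys and fixed indentation.

    Identical data produces identical text regardless of the order
    in which keys were inserted, which keeps report diffs stable.
    """
    return json.dumps(results, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# Two dictionaries with the same content but different insertion order.
first = render_results({"summary": {"score": 4.48}, "adapter": "claude"})
second = render_results({"adapter": "claude", "summary": {"score": 4.48}})
print(first == second)  # True
```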

- Add three code-generation scenarios: console-command, controller-route-test, service-with-di
- Add four bug-finding scenarios: autowiring, failing-phpunit, invalid-env-config, security-access-control
- Add two runtime-debugging scenarios: twig-variable-missing, monolog-exception
- Add one Mate-specific scenario requiring symfony_logs to identify the missing service
- Each fixture is pure PHP with a single deterministic verification command
- Replace the original placeholder bug.autowiring.private-service stub
- Drop the fixtures gitignore now that real fixture content lives in-tree

- Add BenchmarkCompareCommand diffing two results.json files side-by-side: score, tokens, duration and Mate calls per scenario plus a run-level summary
- Treat --suite=all as an explicit alias for "no suite filter"
- Surface every documented invocation pattern in the README under a new CLI examples section
- Register benchmark:compare in bin/console alongside benchmark:run

- Add DEFINITION-OF-DONE.md mapping every checklist item to the relevant code paths
- Reproduce the offline acceptance test (--suite=all --adapter=null with mate enabled and disabled, then benchmark:compare) end-to-end
- Mark the benchmark scaffold feature-complete against the original spec in the README

…dges
- Add symfony/ai-platform, symfony/ai-claude-code-platform and symfony/ai-codex-platform dependencies
- Introduce Adapter/Platform/PlatformAdapter base that delegates to PlatformInterface::invoke and converts the bridge's result back to AssistantRunResult and TokenUsage
- Rewrite ClaudeCodeAdapter and CodexAdapter as thin wrappers around the ClaudeCode and Codex Factory::createPlatform helpers
- Forward workspace cwd and Mate mcp_config through invoke options; force Codex sandbox=workspace-write so patches can land in the workspace
- Drop the bespoke subprocess base, JSONL parsers and PHP fakes now that the bridges own that work
- Stub PlatformInterface in tests so no real model calls happen
- Document that tool-call counts are not surfaced via the platform's non-streaming path; token usage and final text remain intact

- Include cached tokens in TokenUsage totals so reported counts reflect actual context processed
- Add expected_tool_calls_any scenario field for any-of tool matching when alternative tools satisfy the same intent
- Extract MateProvisioner into a dedicated class with interface for cleaner factory composition
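
The effect of counting cached tokens can be shown with a minimal Python sketch. The field names and the sample numbers are illustrative assumptions and are not the breakdown of any run reported in this PR; only the idea that the total now includes cached tokens comes from the change itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0  # context served from the prompt cache

    def total_tokens(self) -> int:
        """Total context processed, including cached tokens."""
        return self.input_tokens + self.output_tokens + self.cached_tokens


# Illustrative values: almost all context arrives via the prompt cache.
usage = TokenUsage(input_tokens=12, output_tokens=1_000, cached_tokens=300_000)
print(usage.input_tokens)    # 12
print(usage.total_tokens())  # 301012
```

Reporting only the uncached input count would make a heavily cached run look nearly free, which would distort any token-based efficiency threshold.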

wachterjohannes added 14 commits April 25, 2026 20:12


wachterjohannes wants to merge 14 commits into main from benchmark
