Add Mate benchmark suite with Claude Code and Codex adapters #21

Open: wachterjohannes wants to merge 14 commits into main from benchmark
@wachterjohannes

Member

Summary

  • Introduces a new benchmark/ suite for measuring assistant performance on Symfony bug-fix scenarios with and without Mate
  • Adds adapters for Claude Code and Codex built on the symfony/ai platform bridges, with a shared platform adapter, MCP provisioning, and JSON/Markdown reporting
  • Ships ten reproducible scenarios, deterministic evaluators (functional, root-cause, mate-tool-usage, minimality, verification, efficiency), weighted scoring, and a benchmark:compare command

Highlights

  • Scenario semantics: expected_tool_calls (all-required) and expected_tool_calls_any (any-of) so alternative tools satisfying the same intent score correctly
  • Token accuracy: TokenUsage::totalTokens() now includes cached tokens, exposing the full context processed under Anthropic prompt caching (e.g. 313k vs the previously reported ~12 input tokens)
  • Mate provisioning: dedicated MateProvisioner + interface for clean factory composition and testability
  • Reports: JSON/Markdown writers plus a CLI comparator surfacing score, token, duration, and mate-call deltas
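
The all-required versus any-of semantics described above can be sketched as follows. This is a minimal illustration in Python, not the suite's PHP implementation: the `ToolExpectations` container, the `missing_expected_tools` helper, and the tool names `tool_a` and `tool_b` are hypothetical; only the two scenario keys and `symfony_logs` come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolExpectations:
    # Hypothetical container; the field names mirror the scenario keys.
    expected_tool_calls: tuple = ()       # every entry must be called
    expected_tool_calls_any: tuple = ()   # at least one entry must be called


def missing_expected_tools(expect, called):
    """Return the expectations that the observed tool calls did not meet."""
    called_set = set(called)
    # All-required: each listed tool has to appear at least once.
    missing = [t for t in expect.expected_tool_calls if t not in called_set]
    # Any-of: a single hit satisfies the group; otherwise report the group.
    any_group = expect.expected_tool_calls_any
    if any_group and called_set.isdisjoint(any_group):
        missing.append("any of: " + ", ".join(any_group))
    return missing


expect = ToolExpectations(
    expected_tool_calls=("symfony_logs",),
    expected_tool_calls_any=("tool_a", "tool_b"),  # hypothetical alternatives
)
print(missing_expected_tools(expect, ["symfony_logs", "tool_b"]))
```

Under these assumptions an assistant that reaches the same intent through either alternative tool is not penalised, while a missing all-required tool is still reported by name.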

Validation results

mate.custom-tool-required scenario, --repeat=1:

| Adapter | Mate | Score | Tokens | Duration | Mate calls |
|---|---|---|---|---|---|
| claude | off | 2.93 | 1,128,832 | 132,830 ms | 0 |
| claude | on | 4.48 | 313,728 | 32,898 ms | 12 |
| codex | off | 2.98 | 317,743 | 36,952 ms | 0 |
| codex | on | 4.48 | 194,692 | 41,275 ms | 1 |

With Mate enabled, both adapters land at the same 4.48 ceiling. The remaining points are lost to verification (no command re-run signal in stdout) and efficiency (token threshold).
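
The per-adapter deltas implied by these results can be worked out with a short script. This is only a sketch of the arithmetic in Python; the `RunRow` type and `delta` helper are hypothetical and are not the output format of benchmark:compare. The numbers are copied from the validation results above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRow:
    score: float
    tokens: int
    duration_ms: int
    mate_calls: int


def delta(off, on):
    """Difference between a Mate-enabled run and its Mate-disabled baseline."""
    return {
        "score": round(on.score - off.score, 2),
        "tokens": on.tokens - off.tokens,
        "tokens_pct": round(100 * (on.tokens - off.tokens) / off.tokens, 1),
        "duration_ms": on.duration_ms - off.duration_ms,
        "mate_calls": on.mate_calls - off.mate_calls,
    }


# (Mate off, Mate on) pairs taken from the validation results.
RESULTS = {
    "claude": (RunRow(2.93, 1_128_832, 132_830, 0), RunRow(4.48, 313_728, 32_898, 12)),
    "codex": (RunRow(2.98, 317_743, 36_952, 0), RunRow(4.48, 194_692, 41_275, 1)),
}

for adapter, (off, on) in RESULTS.items():
    print(adapter, delta(off, on))
```

For this single-repeat run, enabling Mate cuts tokens by about 72% for Claude and about 39% for Codex. Claude's duration drops by 99,932 ms, while Codex's rises by 4,323 ms.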

Test plan

  • composer test (173 tests, 585 assertions passing)
  • Run bin/console benchmark:run --scenario=mate.custom-tool-required --adapter=claude --mate=enabled end-to-end
  • Run bin/console benchmark:run --scenario=mate.custom-tool-required --adapter=codex --mate=enabled end-to-end
  • Inspect generated results.json and Markdown report for one scenario
  • bin/console benchmark:compare --latest against two adjacent runs

- Set up benchmark/ package with Symfony Console application
- Implement benchmark:run command with all required options
- Define scenario YAML format with JSON schema validation
- Add Scenario value object, loader, validator and repository
- Cover loader, validator, repository and command with PHPUnit tests
- Include first scenario (bug.autowiring.private-service) and PLAN/specs
- Introduce Workspace value object and WorkspaceFactory with per-attempt paths
- Add FixtureCopier that mirrors fixtures without mutating the source
- Add CommandExecutor capturing stdout, stderr, exit code, duration and timeouts
- Add GitDiffCollector with init / seal-after-setup / collect-against-baseline flow
- Pin Codex platform reference and surface milestone 03 status in the README
- Define AssistantAdapterInterface with AssistantRunInput/AssistantRunResult/TokenUsage/ToolCall
- Implement NullAdapter as a deterministic no-op assistant for runner exercise
- Add AdapterRegistry resolving --adapter values, with UnsupportedAdapterException
- Add ScenarioRunner orchestrating fixture copy, setup, seal, adapter run, diff and verify
- Wire BenchmarkRunCommand to actually execute scenarios end-to-end (with --list opt-out)
- Add MateConfiguration value object with enabled/disabled factories and per-run env hints
- Add MateConfigurationFactory writing .mate/config.json into the workspace before sealing the baseline
- Add MateMetricsCollector aggregating tool call count, names, first-call timestamp, errors and missing expected tools
- Replace AssistantRunInput::mateEnabled with the richer MateConfiguration field and surface metrics on RunOutcome
- Drive the --mate=enabled|disabled CLI toggle through ScenarioRunner and reflect it on the outcome line
- Define MetricsBag holding every required and optional metric with null defaults
- Add five collectors: duration, token usage, tool usage, diff and command results
- Add MetricsAggregator merging collector output into a single bag
- Surface MetricsBag on RunOutcome and wire the aggregator through ScenarioRunner and bin/console
- Define EvaluatorInterface with EvaluationInput and EvaluationResult (score, pass/fail, evidence)
- Add FunctionalEvaluator scoring re-runnable verification commands
- Add rule-based RootCauseEvaluator matching scenario keywords against assistant output and diff
- Add DiffMinimalityEvaluator and ForbiddenChangesEvaluator gating on file footprint
- Add VerificationEvaluator and MateToolUsageEvaluator using stdout and tool-call evidence
- Add EfficiencyEvaluator combining duration and token thresholds
- Add ScoreWeights with PLAN defaults, scenario-level overrides and percentage normalisation
- Add Score value object exposing final and raw values, per-category scores, missing evaluators and gate penalties
- Add ScoreCalculator applying default weights and a forbidden-changes gate penalty
- Add EvaluationPipeline running the seven default judges and converting evaluator errors into failing results
- Surface evaluations and Score on RunOutcome and render the final score on each command outcome line
- Introduce ProcessAdapter base spawning a CLI binary, piping the prompt via stdin and parsing structured output
- Add ClaudeCodeAdapter wired to "claude --print --output-format=stream-json --bare" with --mcp-config when Mate is enabled
- Add CodexAdapter wired to "codex exec --json --skip-git-repo-check --sandbox=workspace-write"
- Provide ClaudeStreamJsonParser and CodexJsonParser as best-effort JSONL parsers for token usage and tool calls
- Allow binary path and extra flags overrides via BENCHMARK_CLAUDE_BIN/ARGS and BENCHMARK_CODEX_BIN/ARGS
- Register both adapters in bin/console alongside NullAdapter
- Cover parsers and adapters with offline PHP fakes so no real model calls are made in tests
- Add ReportContext bundling run metadata and outcomes for the writers
- Add ArtifactsWriter persisting diffs, command logs and raw assistant stdout/stderr per attempt
- Add JsonReportWriter producing a deterministic results.json with summary, scenarios, evaluations, metrics and Mate data
- Add MarkdownReportWriter rendering every spec section: summary, adapter, Mate, scenarios, tool/token usage, slowest, failed, most-changed files
- Add ReportPipeline composing the three writers and wire it into BenchmarkRunCommand to emit reports under reports/<run-id>/
- Add three code-generation scenarios: console-command, controller-route-test, service-with-di
- Add four bug-finding scenarios: autowiring, failing-phpunit, invalid-env-config, security-access-control
- Add two runtime-debugging scenarios: twig-variable-missing, monolog-exception
- Add one Mate-specific scenario requiring symfony_logs to identify the missing service
- Each fixture is pure PHP with a single deterministic verification command
- Replace the original placeholder bug.autowiring.private-service stub
- Drop the fixtures gitignore now that real fixture content lives in-tree
- Add BenchmarkCompareCommand diffing two results.json files side-by-side: score, tokens, duration and Mate calls per scenario plus a run-level summary
- Treat --suite=all as an explicit alias for "no suite filter"
- Surface every documented invocation pattern in the README under a new CLI examples section
- Register benchmark:compare in bin/console alongside benchmark:run
- Add DEFINITION-OF-DONE.md mapping every checklist item to the relevant code paths
- Reproduce the offline acceptance test (--suite=all --adapter=null with mate enabled and disabled, then benchmark:compare) end-to-end
- Mark the benchmark scaffold feature-complete against the original spec in the README
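
The weighting steps listed above (default weights, scenario-level overrides, percentage normalisation, and a forbidden-changes gate penalty) could be combined roughly as shown here. This is a hedged sketch in Python rather than the actual ScoreWeights/ScoreCalculator code. The weight values, the 0 to 5 category scale, the halving penalty, and the choice to count missing evaluators as zero are all assumptions made for illustration.

```python
# Hypothetical default weights; these are NOT the PLAN defaults from the PR.
DEFAULT_WEIGHTS = {
    "functional": 40,
    "root_cause": 20,
    "mate_tool_usage": 10,
    "minimality": 10,
    "verification": 10,
    "efficiency": 10,
}


def normalise(weights):
    """Scale weights so they sum to 1.0 (percentage normalisation)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


def final_score(category_scores, overrides=None, forbidden_changes=False,
                gate_penalty=0.5):
    """Weighted score with an assumed multiplicative gate penalty."""
    # Scenario-level overrides replace individual defaults before normalising.
    weights = normalise({**DEFAULT_WEIGHTS, **(overrides or {})})
    # Categories without an evaluator result are reported and count as zero.
    missing = sorted(name for name in weights if name not in category_scores)
    raw = sum(w * category_scores.get(name, 0.0) for name, w in weights.items())
    final = raw * (1 - gate_penalty) if forbidden_changes else raw
    return {"raw": raw, "final": final, "missing": missing}


print(final_score({name: 5.0 for name in DEFAULT_WEIGHTS}))
print(final_score({"functional": 5.0}, forbidden_changes=True))
```

Keeping the raw and final values separate mirrors the Score value object described above, which exposes both alongside the missing evaluators and gate penalties.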
…dges

- Add symfony/ai-platform, symfony/ai-claude-code-platform and symfony/ai-codex-platform dependencies
- Introduce Adapter/Platform/PlatformAdapter base that delegates to PlatformInterface::invoke and converts the bridge's result back to AssistantRunResult and TokenUsage
- Rewrite ClaudeCodeAdapter and CodexAdapter as thin wrappers around the ClaudeCode and Codex Factory::createPlatform helpers
- Forward workspace cwd and Mate mcp_config through invoke options; force Codex sandbox=workspace-write so patches can land in the workspace
- Drop the bespoke subprocess base, JSONL parsers and PHP fakes now that the bridges own that work
- Stub PlatformInterface in tests so no real model calls happen
- Document that tool-call counts are not surfaced via the platform's non-streaming path; token usage and final text remain intact
- Include cached tokens in TokenUsage totals so reported counts reflect actual context processed
- Add expected_tool_calls_any scenario field for any-of tool matching when alternative tools satisfy the same intent
- Extract MateProvisioner into a dedicated class with interface for cleaner factory composition
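
The cached-token change in this last group can be illustrated with a small value object. This is a sketch in Python, not the PHP TokenUsage class. The field names, the decision to include output tokens in the total, and the sample numbers are assumptions; only the idea that the total now includes cached tokens comes from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    # Hypothetical field names for illustration.
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0

    def total_tokens(self):
        """Total context processed, counting tokens served from the cache."""
        return self.input_tokens + self.output_tokens + self.cached_tokens

    def total_without_cache(self):
        """What a total would look like if cached tokens were ignored."""
        return self.input_tokens + self.output_tokens


# Illustrative numbers only: under prompt caching most of the context can be
# reported as cached, leaving a very small uncached input count.
usage = TokenUsage(input_tokens=12, output_tokens=1_500, cached_tokens=312_000)
print(usage.total_without_cache(), usage.total_tokens())
```

With values of this shape, a total that ignores the cache understates the processed context by orders of magnitude, which is the discrepancy the highlights describe.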