Add Mate benchmark suite with Claude Code and Codex adapters #21
Open
wachterjohannes wants to merge 14 commits into
- Set up benchmark/ package with Symfony Console application
- Implement benchmark:run command with all required options
- Define scenario YAML format with JSON schema validation
- Add Scenario value object, loader, validator and repository
- Cover loader, validator, repository and command with PHPUnit tests
- Include first scenario (bug.autowiring.private-service) and PLAN/specs

- Introduce Workspace value object and WorkspaceFactory with per-attempt paths
- Add FixtureCopier that mirrors fixtures without mutating the source
- Add CommandExecutor capturing stdout, stderr, exit code, duration and timeouts
- Add GitDiffCollector with init / seal-after-setup / collect-against-baseline flow
- Pin Codex platform reference and surface milestone 03 status in the README

- Define AssistantAdapterInterface with AssistantRunInput/AssistantRunResult/TokenUsage/ToolCall
- Implement NullAdapter as a deterministic no-op assistant for runner exercise
- Add AdapterRegistry resolving --adapter values, with UnsupportedAdapterException
- Add ScenarioRunner orchestrating fixture copy, setup, seal, adapter run, diff and verify
- Wire BenchmarkRunCommand to actually execute scenarios end-to-end (with --list opt-out)

- Add MateConfiguration value object with enabled/disabled factories and per-run env hints
- Add MateConfigurationFactory writing .mate/config.json into the workspace before sealing the baseline
- Add MateMetricsCollector aggregating tool call count, names, first-call timestamp, errors and missing expected tools
- Replace AssistantRunInput::mateEnabled with the richer MateConfiguration field and surface metrics on RunOutcome
- Drive the --mate=enabled|disabled CLI toggle through ScenarioRunner and reflect it on the outcome line

- Define MetricsBag holding every required and optional metric with null defaults
- Add five collectors: duration, token usage, tool usage, diff and command results
- Add MetricsAggregator merging collector output into a single bag
- Surface MetricsBag on RunOutcome and wire the aggregator through ScenarioRunner and bin/console

- Define EvaluatorInterface with EvaluationInput and EvaluationResult (score, pass/fail, evidence)
- Add FunctionalEvaluator scoring re-runnable verification commands
- Add rule-based RootCauseEvaluator matching scenario keywords against assistant output and diff
- Add DiffMinimalityEvaluator and ForbiddenChangesEvaluator gating on file footprint
- Add VerificationEvaluator and MateToolUsageEvaluator using stdout and tool-call evidence
- Add EfficiencyEvaluator combining duration and token thresholds

- Add ScoreWeights with PLAN defaults, scenario-level overrides and percentage normalisation
- Add Score value object exposing final and raw values, per-category scores, missing evaluators and gate penalties
- Add ScoreCalculator applying default weights and a forbidden-changes gate penalty
- Add EvaluationPipeline running the seven default judges and converting evaluator errors into failing results
- Surface evaluations and Score on RunOutcome and render the final score on each command outcome line
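
For readers who want a feel for the scoring step described in the commit above, here is a minimal, language-neutral sketch in Python (the PR itself is PHP and its code is not shown here). It illustrates weight normalisation, a weighted raw score, and a gate penalty applied when the forbidden-changes check fails. The function names, the weight values in the usage note, the 0..1 score range, the treatment of missing evaluators as zero, and the 0.5 penalty factor are all assumptions made for illustration, not the PLAN defaults.

```python
def normalise(weights):
    """Turn arbitrary positive weights into shares that sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: value / total for name, value in weights.items()}


def calculate_score(category_scores, weights, forbidden_changes_passed,
                    gate_penalty=0.5):
    """Combine per-category scores (assumed to lie in 0..1) into one result.

    Assumptions in this sketch: a category without a result counts as 0 and
    is reported under "missing"; a failed forbidden-changes gate scales the
    raw score by (1 - gate_penalty).
    """
    shares = normalise(weights)
    missing = sorted(name for name in shares if name not in category_scores)
    raw = sum(shares[name] * category_scores.get(name, 0.0) for name in shares)
    final = raw if forbidden_changes_passed else raw * (1.0 - gate_penalty)
    return {"raw": raw, "final": final, "missing": missing}
```

With hypothetical weights `{"functional": 3, "efficiency": 1}` and scores `{"functional": 1.0, "efficiency": 0.0}`, the raw score is 0.75, and a failed gate with the assumed 0.5 penalty halves it to 0.375.
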
- Introduce ProcessAdapter base spawning a CLI binary, piping the prompt via stdin and parsing structured output
- Add ClaudeCodeAdapter wired to "claude --print --output-format=stream-json --bare" with --mcp-config when Mate is enabled
- Add CodexAdapter wired to "codex exec --json --skip-git-repo-check --sandbox=workspace-write"
- Provide ClaudeStreamJsonParser and CodexJsonParser as best-effort JSONL parsers for token usage and tool calls
- Allow binary path and extra flags overrides via BENCHMARK_CLAUDE_BIN/ARGS and BENCHMARK_CODEX_BIN/ARGS
- Register both adapters in bin/console alongside NullAdapter
- Cover parsers and adapters with offline PHP fakes so no real model calls are made in tests

- Add ReportContext bundling run metadata and outcomes for the writers
- Add ArtifactsWriter persisting diffs, command logs and raw assistant stdout/stderr per attempt
- Add JsonReportWriter producing a deterministic results.json with summary, scenarios, evaluations, metrics and Mate data
- Add MarkdownReportWriter rendering every spec section: summary, adapter, Mate, scenarios, tool/token usage, slowest, failed, most-changed files
- Add ReportPipeline composing the three writers and wire it into BenchmarkRunCommand to emit reports under reports/<run-id>/

- Add three code-generation scenarios: console-command, controller-route-test, service-with-di
- Add four bug-finding scenarios: autowiring, failing-phpunit, invalid-env-config, security-access-control
- Add two runtime-debugging scenarios: twig-variable-missing, monolog-exception
- Add one Mate-specific scenario requiring symfony_logs to identify the missing service
- Each fixture is pure PHP with a single deterministic verification command
- Replace the original placeholder bug.autowiring.private-service stub
- Drop the fixtures gitignore now that real fixture content lives in-tree

- Add BenchmarkCompareCommand diffing two results.json files side-by-side: score, tokens, duration and Mate calls per scenario plus a run-level summary
- Treat --suite=all as an explicit alias for "no suite filter"
- Surface every documented invocation pattern in the README under a new CLI examples section
- Register benchmark:compare in bin/console alongside benchmark:run

- Add DEFINITION-OF-DONE.md mapping every checklist item to the relevant code paths
- Reproduce the offline acceptance test (--suite=all --adapter=null with mate enabled and disabled, then benchmark:compare) end-to-end
- Mark the benchmark scaffold feature-complete against the original spec in the README

…dges
- Add symfony/ai-platform, symfony/ai-claude-code-platform and symfony/ai-codex-platform dependencies
- Introduce Adapter/Platform/PlatformAdapter base that delegates to PlatformInterface::invoke and converts the bridge's result back to AssistantRunResult and TokenUsage
- Rewrite ClaudeCodeAdapter and CodexAdapter as thin wrappers around the ClaudeCode and Codex Factory::createPlatform helpers
- Forward workspace cwd and Mate mcp_config through invoke options; force Codex sandbox=workspace-write so patches can land in the workspace
- Drop the bespoke subprocess base, JSONL parsers and PHP fakes now that the bridges own that work
- Stub PlatformInterface in tests so no real model calls happen
- Document that tool-call counts are not surfaced via the platform's non-streaming path; token usage and final text remain intact

- Include cached tokens in TokenUsage totals so reported counts reflect actual context processed
- Add expected_tool_calls_any scenario field for any-of tool matching when alternative tools satisfy the same intent
- Extract MateProvisioner into a dedicated class with interface for cleaner factory composition
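
The last commit changes two small pieces of logic that are easy to misread, so a rough sketch may help. The Python below (not the PR's PHP, which is not shown here) illustrates a token total that includes cached tokens, and a check that distinguishes all-required tools from any-of tools. The field names, the `check_expected_tools` helper and the tool name `hypothetical_log_search` are assumptions for illustration; only `symfony_logs`, `expected_tool_calls` and `expected_tool_calls_any` come from the PR text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    # Field names are assumed for this sketch.
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0

    def total_tokens(self) -> int:
        # Cached tokens are counted so the total reflects the context
        # actually processed, not only the uncached input.
        return self.input_tokens + self.output_tokens + self.cached_tokens


def check_expected_tools(called, expected_all=(), expected_any=()):
    """Return (missing_required, any_satisfied) for a run's tool calls.

    expected_all: every listed tool must have been called.
    expected_any: at least one listed tool must have been called;
                  an empty list is treated as satisfied (an assumption).
    """
    called_set = set(called)
    missing_required = [tool for tool in expected_all if tool not in called_set]
    any_satisfied = not expected_any or any(
        tool in called_set for tool in expected_any
    )
    return missing_required, any_satisfied
```

For example, a run that only called the hypothetical `hypothetical_log_search` tool would fail an all-required expectation of `symfony_logs`, but would pass an any-of expectation listing both tools.
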
Summary
- benchmark/ suite for measuring assistant performance on Symfony bug-fix scenarios with and without Mate
- symfony/ai platform bridges, with a shared platform adapter, MCP provisioning, and JSON/Markdown reporting
- benchmark:compare command

Highlights
- expected_tool_calls (all-required) and expected_tool_calls_any (any-of) so alternative tools satisfying the same intent score correctly
- TokenUsage::totalTokens() now includes cached tokens, exposing the full context processed under Anthropic prompt caching (e.g. 313k vs the previously reported ~12 input tokens)
- MateProvisioner + interface for clean factory composition and testability

Validation results
mate.custom-tool-required scenario, --repeat=1:
Both adapters land at the same 4.48 ceiling, blocked by verification (no command re-run signal in stdout) and efficiency (token threshold).

Test plan
- composer test (173 tests, 585 assertions passing)
- bin/console benchmark:run --scenario=mate.custom-tool-required --adapter=claude --mate=enabled end-to-end
- bin/console benchmark:run --scenario=mate.custom-tool-required --adapter=codex --mate=enabled end-to-end
- results.json and Markdown report for one scenario
- bin/console benchmark:compare --latest against two adjacent runs