A Framework Comparison Study for AI Coding Assistants
This benchmark explores: "Do different AI coding frameworks detect specification misalignments differently when using the same underlying language model?"
Unlike existing benchmarks comparing AI models (GPT-4 vs Claude vs Gemini), we compare frameworks (Cursor vs Claude Code) while holding the model constant (Claude 4.5 Sonnet). This isolates framework-specific capabilities like search strategies, context management, and code understanding.
240 test runs completed (120 per framework, 5 runs per condition)
- Cursor Mean F1: 0.50
- Claude Code Mean F1: 0.48
- Statistical Difference: None (p = 0.73)
However, performance varies significantly by misalignment type:
| Type | Cursor F1 | Claude Code F1 | Pattern |
|---|---|---|---|
| Type 1 (Missing) | 0.39 | 0.36 | Cursor slightly better |
| Type 2 (Incorrect) | 0.42 | 0.58 | Claude Code stronger (p=0.03) |
| Type 3 (Extraneous) | 0.77 | 0.60 | Cursor stronger (p=0.04) |
Key Insight: While neither framework has an overall advantage, they show different strengths:
- Cursor excels at finding extraneous code (Type 3) and slightly better at detecting missing features (Type 1)
- Claude Code performs better at identifying incorrect implementations (Type 2)
Next Steps: These patterns emerged from a pilot study (n=5 per condition). We plan to replicate with larger sample sizes (n≥30) to confirm these type-specific differences and establish more robust statistical evidence.
- Type 1 Detection: Cursor 37%, Claude Code 45% (both struggle)
- Type 2 Detection: Cursor 71%, Claude Code 73% (moderate success)
- Type 3 Detection: Cursor 72%, Claude Code 71% (best performance)
- False Positive Rate: Cursor 42%, Claude Code 47% (concerning)
This study used a controlled experimental design with important characteristics to note:
- Sample size: 5 runs per condition (240 total runs)
- Control variables: Same model (Claude 4.5 Sonnet), same prompts, same codebase
- Test branches: 6 branches with varying misalignment distributions
- Ground truth: 38 carefully planted misalignments validated by specification
Sample Size: The current n=5 per condition provides initial insights but is smaller than typical for robust statistical conclusions. Planned replication with n≥30 will provide stronger evidence for the observed type-specific patterns.
Statistical Testing: The Type 2 and Type 3 differences (p=0.03, p=0.04) are at the threshold of significance. With multiple comparisons, these should be interpreted as promising patterns worthy of confirmation rather than definitive findings.
Scope: This benchmark focuses on a single todo application with medium complexity. Results may vary with:
- Enterprise-scale codebases
- Different application domains
- Other programming languages
- Different model versions
Generalizability: Framework performance likely depends on the specific use case, codebase characteristics, and type of misalignment most relevant to your work.
- Framework architecture influences detection capabilities even with identical models
- Type-specific performance patterns exist and are measurable
- A methodology for systematic framework comparison
- Baseline performance data for specification alignment tasks
| Type | Description | Example |
|---|---|---|
| Type 1 | Missing Implementation | Spec requires JWT auth, code has none |
| Type 2 | Incorrect Implementation | Spec: 15-min expiry, Code: 60-min |
| Type 3 | Extraneous Code | Code has admin panel, spec doesn't mention it |
A todo application with specification misalignments across 6 test branches:
| Branch | Type 1 | Type 2 | Type 3 | Total | Purpose |
|---|---|---|---|---|---|
control_perfect |
0 | 0 | 0 | 0 | False positive baseline |
baseline_balanced |
3 | 3 | 2 | 8 | Overall capability |
type1_heavy |
6 | 1 | 1 | 8 | Type 1 focus |
type2_heavy |
1 | 6 | 1 | 8 | Type 2 focus |
subtle_only |
2 | 2 | 2 | 6 | Complex issues |
distributed |
3 | 3 | 2 | 8 | Scattered across files |
Detection patterns across 240 test runs:
-
Type-specific specialization observed
- Cursor: Strong at Type 3 (extraneous code, F1=0.77), competitive at Type 1 (missing, F1=0.39)
- Claude Code: Strong at Type 2 (incorrect implementation, F1=0.58), competitive at other types
-
Type 3 is most detectable - Both frameworks achieved ~70% accuracy finding extraneous code not mentioned in specifications
-
Type 1 is most challenging - Missing implementations harder to detect (~40% F1 for both frameworks)
-
Complexity impacts performance - Subtle/distributed issues reduce detection rates significantly (0.26-0.34 F1 on subtle_only branch vs 0.60-0.70 on baseline)
-
False positives occur - Even on perfect code (control_perfect), frameworks flagged some non-issues (42-47% FP rate overall)
Choose based on your primary use case:
- Finding unauthorized features/code bloat: Cursor shows stronger performance (Type 3)
- Validating implementation correctness: Claude Code shows stronger performance (Type 2)
- Comprehensive auditing: Both frameworks show similar overall capability; type-specific strengths matter more than overall scores
Human oversight recommended: Both frameworks show notable false positive rates, making human review essential for production use.
spec-alignment-benchmark/
├── benchmark/ # Test definitions
│ ├── prompts/ # 4 test prompts
│ │ ├── type1-missing.md
│ │ ├── type2-incorrect.md
│ │ ├── type3-extraneous.md
│ │ └── combined-all-types.md
│ ├── branches/ # 6 branches with ground truth
│ │ └── [branch]/
│ │ ├── ground-truth-type1.json
│ │ ├── ground-truth-type2.json
│ │ ├── ground-truth-type3.json
│ │ └── ground-truth-combined.json
│ └── hypotheses.md
│
├── results/
│ ├── raw/ # All 240 test runs
│ │ ├── cursor/ # 120 Cursor runs
│ │ └── claude-code/ # 120 Claude Code runs
│ ├── comprehensive_analysis_report.pdf
│ └── simple_analysis_results.json
│
├── scripts/ # Analysis tools
│ ├── generate_analysis_pdf.py
│ ├── simple_analyze.py
│ └── test_runner.py
│
└── specs/ # Documentation
├── benchmark-specification.md
└── repository-specification.md
git clone https://github.com/spartypkp/spec-alignment-benchmark
cd spec-alignment-benchmark
pip install -r requirements.txt# Generate comprehensive PDF report
python scripts/generate_analysis_pdf.py
# Run simple analysis
python scripts/simple_analyze.py
# View raw data
ls results/raw/cursor/
ls results/raw/claude-code/-
Get the test repository:
-
Use the prompts:
cat benchmark/prompts/type1-missing.md cat benchmark/prompts/type2-incorrect.md cat benchmark/prompts/type3-extraneous.md cat benchmark/prompts/combined-all-types.md
-
Check against ground truth:
cat benchmark/branches/baseline_balanced/ground-truth-combined.json
Each test run in results/raw/ contains:
{
"type1_missing": [
{"section": "4.1", "reasoning": "Password validation missing"}
],
"type2_incorrect": [
{"section": "2.4", "reasoning": "Sort order wrong", "files": ["src/lib/db.ts"]}
],
"type3_extraneous": [
{"file": "src/app/admin/page.tsx", "reasoning": "Admin panel not in spec"}
]
}Expected results in benchmark/branches/*/ground-truth-*.json:
{
"test_branch": "baseline_balanced",
"test_type": "combined_all_types",
"ground_truth": {
"type1_missing": [...],
"type2_incorrect": [...],
"type3_extraneous": [...]
}
}To improve scientific rigor, future iterations should:
Statistical improvements:
- Increase sample size to n≥30 per condition
- Apply proper multiple comparison corrections
- Use non-parametric tests for non-normal data
- Validate ground truth with multiple independent raters
Methodological enhancements:
- Blind evaluation procedures
- Documented, standardized prompting protocols
- Multiple application domains
- Different model versions
- Additional frameworks (Windsurf, GitHub Copilot, Aider)
Scope expansion:
- Enterprise-scale codebases
- Multiple programming languages
- Different complexity levels
- Real-world specification documents
This benchmark is useful for:
- Understanding framework behavior patterns in specification alignment tasks
- Comparing type-specific detection capabilities
- Informing framework selection based on your primary use case
- Establishing baseline performance expectations
- Generating hypotheses for deeper investigation
Keep in mind:
- Results are from a pilot study; larger replication planned to confirm patterns
- Performance may vary with different codebases, domains, and model versions
- Framework-specific strengths matter more than overall scores
- Human review remains essential regardless of framework choice
We welcome contributions that address study limitations:
- Larger-scale replications
- Additional frameworks
- Improved ground truth validation
- Statistical methodology refinements
MIT License - See LICENSE file
This benchmark addresses a gap in AI tool evaluation by comparing frameworks rather than models, providing empirical data on type-specific performance patterns in specification alignment tasks. While conducted as a pilot study, it establishes methodology and reveals meaningful performance differences that warrant further investigation with larger sample sizes.
Citation:
Specification Alignment Benchmark v1.0.0 (2025)
https://github.com/spartypkp/spec-alignment-benchmark
Pilot study results (n=5 per condition); larger replication planned