Beyond Pattern Matching: Evaluating LLM Security Review on Temporal, Side-Channel, and Compositional Vulnerabilities
LLM-based code review tools are deployed in production CI/CD pipelines, but the security literature has not measured how well they handle vulnerability classes that require multi-step reasoning. This repository contains the benchmark, evaluation framework, and experimental results from 57,600 evaluations of 10 models on three reasoning-required vulnerability classes:
- TOCTOU race conditions (CWE-367)
- Timing side-channels (CWE-208)
- Compositional authorization flaws (CWE-863)
Key findings:
- Model-tier bifurcation: Commercial frontier models achieve 88-100% detection on reasoning-required categories, while open-source models achieve 25-59%. The gap reaches 57 percentage points, with no analog on pattern-matchable controls (3-5 pp).
- Surface pattern activation: 81.3% of false positives on safe decoy code name the vulnerability class the decoy superficially mimics, indicating detection rates overstate reasoning competence.
- Per-hop authorization defense (D1) eliminates all measured leakage across all corpora and attack variants.
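The surface pattern activation figure above is a ratio over decoy false positives. The sketch below shows one way such a ratio could be computed. The `DecoyResult` record and its field names are hypothetical and are not taken from this repository's scoring code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DecoyResult:
    """One model verdict on a safe decoy sample (hypothetical record layout)."""
    mimicked_class: str          # CWE the decoy superficially resembles, e.g. "CWE-367"
    flagged_vulnerable: bool     # True is a false positive, because decoys are safe
    named_class: Optional[str]   # CWE named in the model's explanation, if any


def surface_pattern_activation_rate(results: List[DecoyResult]) -> float:
    """Fraction of decoy false positives that name the class the decoy mimics."""
    false_positives = [r for r in results if r.flagged_vulnerable]
    if not false_positives:
        return 0.0  # no false positives, so nothing to attribute
    matching = sum(1 for r in false_positives if r.named_class == r.mimicked_class)
    return matching / len(false_positives)


# Toy data for illustration only; these are not results from the benchmark.
toy_results = [
    DecoyResult("CWE-367", True, "CWE-367"),
    DecoyResult("CWE-208", True, "CWE-208"),
    DecoyResult("CWE-863", True, "CWE-79"),
    DecoyResult("CWE-367", True, "CWE-367"),
    DecoyResult("CWE-208", False, None),
]
toy_rate = surface_pattern_activation_rate(toy_results)
```

A high rate under this definition means false positives cluster on the class the decoy imitates, which is why the finding is read as evidence of surface matching.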
```
benchmark/      # 160-sample benchmark (130 vulnerable + 30 decoys)
experiments/    # Evaluation harness, scoring, and analysis
tools/          # Model configuration, scoring, and evaluation tools
results/        # Raw and analyzed experimental results
overleaf/       # Paper source (LaTeX)
```
- Tier 1 (Detection): Binary — does the model flag the code as vulnerable?
- Tier 2 (Classification): Does the model identify the correct vulnerability class with pre-registered minimal evidence?
- Tier 3 (Reasoning Quality): Rubric-based causal reasoning quality scored by dual-judge LLM protocol (RQS, 0-1 scale).
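The three tiers can be read as successive checks on a single model response. The following is a minimal sketch of that reading, not the repository's scoring implementation. The `ModelResponse` fields are hypothetical, and averaging the two judge scores into one RQS value is an assumption; the actual dual-judge aggregation rule is not stated here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ModelResponse:
    """Hypothetical parsed output of one model on one vulnerable sample."""
    flagged_vulnerable: bool            # Tier 1 input
    named_class: Optional[str]          # CWE the model named, e.g. "CWE-208"
    evidence_found: bool                # pre-registered minimal evidence present
    judge_scores: Tuple[float, float]   # two judge scores, each on a 0-1 scale


@dataclass
class TierScores:
    detected: bool     # Tier 1: binary detection
    classified: bool   # Tier 2: correct class with minimal evidence
    rqs: float         # Tier 3: reasoning quality score in [0, 1]


def score_response(resp: ModelResponse, true_class: str) -> TierScores:
    detected = resp.flagged_vulnerable
    # Tier 2 requires detection, the right class, and the minimal evidence.
    classified = detected and resp.named_class == true_class and resp.evidence_found
    # Assumption: RQS is the mean of the two judges, clamped to [0, 1].
    mean_score = sum(resp.judge_scores) / len(resp.judge_scores)
    rqs = min(1.0, max(0.0, mean_score))
    return TierScores(detected=detected, classified=classified, rqs=rqs)


example = score_response(
    ModelResponse(True, "CWE-208", True, (0.8, 0.6)),
    true_class="CWE-208",
)
```

In this sketch a response can pass Tier 1 and fail Tier 2, for instance when it flags the code but names the wrong class, which is the distinction the tiers are meant to expose.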
| Model | Tier | Provider |
|---|---|---|
| Claude Opus 4.6 | Commercial | Anthropic |
| GPT-5.2 | Commercial | OpenAI |
| Gemini 2.5 Pro | Commercial | Google |
| DeepSeek Chat | Commercial | DeepSeek |
| Sonar Pro | Commercial | Perplexity |
| o3-mini | Reasoning | OpenAI |
| DeepSeek-R1 | Reasoning | DeepSeek |
| Qwen 2.5 72B | Open-source | Local (Ollama) |
| Llama 3.3 70B | Open-source | Local (Ollama) |
| DeepSeek-Coder 16B | Open-source | Local (Ollama) |
```bash
# Install dependencies
pip install -r requirements.txt

# Run evaluation
python experiments/tools/run_evaluation.py

# Analyze results
python experiments/tools/analyze_results.py
```

```bibtex
@techreport{thornton2026beyondpattern,
  author      = {Thornton, Scott},
  title       = {Beyond Pattern Matching: Evaluating LLM Security Review on Temporal, Side-Channel, and Compositional Vulnerabilities},
  institution = {perfecXion.ai},
  year        = {2026}
}
```
License: MIT