Beyond Pattern Matching: Evaluating LLM Security Review on Temporal, Side-Channel, and Compositional Vulnerabilities

LLM-based code review tools are deployed in production CI/CD pipelines, but the security literature has never measured their capability on vulnerability classes that require multi-step reasoning. This repository contains the benchmark, evaluation framework, and experimental results from 57,600 evaluations of 10 models on three reasoning-required vulnerability classes:

TOCTOU race conditions (CWE-367)
Timing side-channels (CWE-208)
Compositional authorization flaws (CWE-863)
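
To make the first class concrete, the following is a minimal Python sketch of a check-then-use race (CWE-367) next to a variant that narrows the race window. It is an illustration only and is not taken from the benchmark; both function names are hypothetical.

```python
import os


def read_if_allowed_racy(path):
    """Check-then-use: the file behind `path` can change between the two calls."""
    if os.access(path, os.R_OK):       # time of check
        with open(path) as f:          # time of use (path may now point elsewhere)
            return f.read()
    return None


def read_if_allowed_safer(path):
    """Open first and let the OS enforce permissions on the descriptor obtained."""
    # O_NOFOLLOW refuses a symlink as the final path component; it is missing on
    # some platforms, so fall back to 0 there.
    flags = os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        return None
    with os.fdopen(fd) as f:
        return f.read()
```

Both functions behave identically on a quiet filesystem, which is why spotting the difference requires reasoning about what an attacker can do between the check and the open rather than matching a single suspicious call.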

Key Findings

Model-tier bifurcation: Commercial frontier models achieve 88-100% detection on reasoning-required categories, while open-source models achieve 25-59%. The gap reaches 57 percentage points and has no analog on pattern-matchable controls (3-5 pp).
Surface pattern activation: 81.3% of false positives on safe decoy code name the vulnerability class the decoy superficially mimics, indicating that detection rates overstate reasoning competence.
Per-hop authorization defense (D1) eliminates all measured leakage across all corpora and attack variants.
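
The surface pattern activation figure above is a ratio over false positives on decoys. A minimal sketch of how such a ratio could be computed is shown below; the `DecoyResult` record and its field names are assumptions made for illustration and are not the repository's actual result schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DecoyResult:
    """One model verdict on a safe decoy sample (hypothetical schema)."""
    mimicked_class: str          # class the decoy superficially resembles, e.g. "CWE-367"
    flagged: bool                # True means the model called safe code vulnerable
    named_class: Optional[str]   # class the model named, if any


def surface_activation_rate(results: List[DecoyResult]) -> float:
    """Share of false positives that name the class the decoy mimics."""
    false_positives = [r for r in results if r.flagged]
    if not false_positives:
        return 0.0
    hits = sum(1 for r in false_positives if r.named_class == r.mimicked_class)
    return hits / len(false_positives)
```

Decoys that the model correctly leaves unflagged are excluded from the denominator, so the rate describes what false positives look like, not how often they occur.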

Repository Structure

benchmark/          # 160-sample benchmark (130 vulnerable + 30 decoys)
experiments/        # Evaluation harness, scoring, and analysis
  tools/            # Model configuration, scoring, and evaluation tools
  results/          # Raw and analyzed experimental results
overleaf/           # Paper source (LaTeX)

Three-Tier Scoring Methodology

Tier 1 (Detection): Binary. Does the model flag the code as vulnerable?
Tier 2 (Classification): Does the model identify the correct vulnerability class with pre-registered minimal evidence?
Tier 3 (Reasoning Quality): Rubric-based causal reasoning quality scored by dual-judge LLM protocol (RQS, 0-1 scale).
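
The three tiers can be read as a cascade. The sketch below shows one way to express that, under assumptions this README does not confirm: that each tier is only scored when the previous one passes, and that RQS is the mean of the two judges' scores. The `Review` record and the `score_review` function are hypothetical and are not the scoring code in experiments/tools/.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Review:
    """Parsed model output for one vulnerable sample (hypothetical schema)."""
    flagged: bool                       # model says the code is vulnerable
    named_class: Optional[str]          # e.g. "CWE-208"
    evidence_cited: bool                # pre-registered minimal evidence present
    judge_scores: Tuple[float, float]   # two judge scores, each in [0, 1]


def score_review(review: Review, true_class: str) -> dict:
    # Tier 1: binary detection.
    tier1 = review.flagged
    # Tier 2: correct class plus the required evidence (assumed gated on Tier 1).
    tier2 = tier1 and review.named_class == true_class and review.evidence_cited
    # Tier 3: reasoning quality, assumed here to be the mean of the two judges.
    rqs = sum(review.judge_scores) / len(review.judge_scores) if tier2 else None
    return {"detected": tier1, "classified": tier2, "rqs": rqs}
```

For example, `score_review(Review(True, "CWE-208", True, (0.5, 1.0)), "CWE-208")` yields a detected and classified sample with an RQS of 0.75, while a review that flags the code but names the wrong class stops at Tier 1.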

Models Evaluated

Model	Tier	Provider
Claude Opus 4.6	Commercial	Anthropic
GPT-5.2	Commercial	OpenAI
Gemini 2.5 Pro	Commercial	Google
DeepSeek Chat	Commercial	DeepSeek
Sonar Pro	Commercial	Perplexity
o3-mini	Reasoning	OpenAI
DeepSeek-R1	Reasoning	DeepSeek
Qwen 2.5 72B	Open-source	Local (Ollama)
Llama 3.3 70B	Open-source	Local (Ollama)
DeepSeek-Coder 16B	Open-source	Local (Ollama)

Reproducing Results

# Install dependencies
pip install -r requirements.txt

# Run evaluation
python experiments/tools/run_evaluation.py

# Analyze results
python experiments/tools/analyze_results.py

Citation

@techreport{thornton2026beyondpattern,
  author = {Thornton, Scott},
  title = {Beyond Pattern Matching: Evaluating LLM Security Review on Temporal, Side-Channel, and Compositional Vulnerabilities},
  institution = {perfecXion.ai},
  year = {2026}
}

License

MIT
