perfecxion-ai/temporal-vulnerabilities

Beyond Pattern Matching: Evaluating LLM Security Review on Temporal, Side-Channel, and Compositional Vulnerabilities

LLM-based code review tools are deployed in production CI/CD pipelines, but the security literature has never measured their capability on vulnerability classes that require multi-step reasoning. This repository contains the benchmark, evaluation framework, and experimental results from a study of 10 models (57,600 evaluations in total) on three reasoning-required vulnerability classes:

  • TOCTOU race conditions (CWE-367)
  • Timing side-channels (CWE-208)
  • Compositional authorization flaws (CWE-863)
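For readers unfamiliar with the first two classes, the following is a minimal illustrative sketch of what each looks like in Python. It is not taken from the benchmark; the function names and the file and token scenarios are hypothetical. Each vulnerable pattern is paired with a common mitigation.

```python
import hmac
import os


def read_report_racy(path: str) -> bytes:
    # CWE-367 pattern: the permission check and the open are separate steps.
    # The file at `path` can be swapped (e.g. for a symlink) in between.
    if os.access(path, os.R_OK):
        with open(path, "rb") as fh:
            return fh.read()
    raise PermissionError(path)


def read_report(path: str) -> bytes:
    # Mitigation: skip the separate check, attempt the operation directly,
    # and let open() raise OSError if access is denied.
    with open(path, "rb") as fh:
        return fh.read()


def token_matches_leaky(supplied: bytes, expected: bytes) -> bool:
    # CWE-208 pattern: ordinary equality may return as soon as a byte
    # differs, so response time can reveal how much of the token matched.
    return supplied == expected


def token_matches(supplied: bytes, expected: bytes) -> bool:
    # Mitigation: constant-time comparison from the standard library.
    return hmac.compare_digest(supplied, expected)
```

In both cases the vulnerable and safe versions return the same results on ordinary inputs, which is why spotting the flaw requires reasoning about ordering or timing rather than matching a known-bad call.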

Key Findings

  • Model-tier bifurcation: Commercial frontier models achieve 88-100% detection on reasoning-required categories, while open-source models achieve 25-59%. The gap reaches 57 percentage points (pp) on these categories, compared with only 3-5 pp on pattern-matchable controls.
  • Surface pattern activation: 81.3% of false positives on safe decoy code name the vulnerability class the decoy superficially mimics, indicating detection rates overstate reasoning competence.
  • Per-hop authorization defense (D1) eliminates all measured leakage across all corpora and attack variants.
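The surface pattern activation figure is a ratio over false positives on decoys. The sketch below shows one way such a rate could be computed. The record type `DecoyFalsePositive` and its field names are hypothetical and are not taken from the repository's scoring code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecoyFalsePositive:
    # Hypothetical record: one case where a model flagged a safe decoy.
    mimicked_class: str  # class the decoy superficially resembles
    named_class: str     # class the model claimed to have found


def surface_activation_rate(false_positives: List[DecoyFalsePositive]) -> float:
    """Fraction of decoy false positives that name the mimicked class."""
    if not false_positives:
        return 0.0
    hits = sum(fp.named_class == fp.mimicked_class for fp in false_positives)
    return hits / len(false_positives)


sample = [
    DecoyFalsePositive("CWE-367", "CWE-367"),
    DecoyFalsePositive("CWE-208", "CWE-208"),
    DecoyFalsePositive("CWE-863", "CWE-863"),
    DecoyFalsePositive("CWE-208", "CWE-89"),
]
print(surface_activation_rate(sample))  # 0.75
```

A high rate means that when a model is wrong about a decoy, it is usually wrong in the direction the decoy's surface features suggest.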

Repository Structure

benchmark/          # 160-sample benchmark (130 vulnerable + 30 decoys)
experiments/        # Evaluation harness, scoring, and analysis
  tools/            # Model configuration, scoring, and evaluation tools
  results/          # Raw and analyzed experimental results
overleaf/           # Paper source (LaTeX)

Three-Tier Scoring Methodology

  1. Tier 1 (Detection): Binary. Does the model flag the code as vulnerable?
  2. Tier 2 (Classification): Does the model identify the correct vulnerability class and cite the pre-registered minimal evidence for it?
  3. Tier 3 (Reasoning Quality): Rubric-based quality of the causal reasoning, scored by a dual-judge LLM protocol (RQS, 0-1 scale).
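The three tiers can be sketched as a single scoring function. This is a simplified illustration under stated assumptions, not the repository's scorer. The `Review` type, the keyword-based evidence check, and averaging the two judge scores into the RQS are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass
class Review:
    # Hypothetical shape of one parsed model response.
    flagged: bool                      # did the model call the code vulnerable?
    predicted_cwe: Optional[str]       # class the model named, if any
    explanation: str                   # free-text reasoning
    judge_scores: Tuple[float, float]  # two judge scores, each in [0, 1]


def score(review: Review, true_cwe: str, required_evidence: List[str]) -> Dict[str, object]:
    # Tier 1: binary detection.
    tier1 = review.flagged
    # Tier 2: correct class AND every pre-registered evidence term is present.
    # (Assumption: evidence is checked by case-insensitive substring match.)
    text = review.explanation.lower()
    has_evidence = all(term.lower() in text for term in required_evidence)
    tier2 = tier1 and review.predicted_cwe == true_cwe and has_evidence
    # Tier 3: reasoning quality score.
    # (Assumption: the two judges' scores are averaged.)
    tier3 = mean(review.judge_scores)
    return {"detected": tier1, "classified": tier2, "rqs": tier3}


example = Review(
    flagged=True,
    predicted_cwe="CWE-367",
    explanation="The os.access check and the later open() are not atomic.",
    judge_scores=(0.5, 1.0),
)
print(score(example, "CWE-367", ["os.access", "open"]))
```

Keeping the tiers separate lets a response count as a detection without counting as a correct classification, which is the distinction the methodology relies on.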

Models Evaluated

Model                 Tier          Provider
Claude Opus 4.6       Commercial    Anthropic
GPT-5.2               Commercial    OpenAI
Gemini 2.5 Pro        Commercial    Google
DeepSeek Chat         Commercial    DeepSeek
Sonar Pro             Commercial    Perplexity
o3-mini               Reasoning     OpenAI
DeepSeek-R1           Reasoning     DeepSeek
Qwen 2.5 72B          Open-source   Local (Ollama)
Llama 3.3 70B         Open-source   Local (Ollama)
DeepSeek-Coder 16B    Open-source   Local (Ollama)

Reproducing Results

# Install dependencies
pip install -r requirements.txt

# Run evaluation
python experiments/tools/run_evaluation.py

# Analyze results
python experiments/tools/analyze_results.py

Citation

@techreport{thornton2026beyondpattern,
  author = {Thornton, Scott},
  title = {Beyond Pattern Matching: Evaluating LLM Security Review on Temporal, Side-Channel, and Compositional Vulnerabilities},
  institution = {perfecXion.ai},
  year = {2026}
}

License

MIT
