A curated, categorized list of AI security and safety evaluation benchmarks that are well regarded by frontier AI labs (Anthropic, OpenAI, Google DeepMind, Meta) and AI Safety Institutes (US AISI, UK AISI).
Maintained by Anshu Gupta, Founder & CISO of Fixin Security and Founder of Tejas Cyber Network.
- Total benchmarks: 61
- Categories: 11
| Category | Count |
|---|---|
| Cyber Offense and CTF | 9 |
| Cyber Defense and Threat Intel | 8 |
| Software Security and Code | 2 |
| Agent Security and Prompt Injection | 8 |
| Jailbreak and Refusal | 6 |
| CBRN Knowledge and Bio Uplift | 9 |
| Alignment, Honesty, Scheming | 4 |
| Autonomy and AI R&D | 4 |
| Comprehensive Safety and Trust | 6 |
| Bias and Fairness | 1 |
| Tooling Frameworks | 4 |
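
The per-category counts can be cross-checked against the stated totals of 61 benchmarks and 11 categories. The following is a minimal sketch in Python; the dictionary is transcribed by hand from the table, and the variable names are illustrative rather than part of this list.

```python
# Category counts transcribed by hand from the table above.
CATEGORY_COUNTS = {
    "Cyber Offense and CTF": 9,
    "Cyber Defense and Threat Intel": 8,
    "Software Security and Code": 2,
    "Agent Security and Prompt Injection": 8,
    "Jailbreak and Refusal": 6,
    "CBRN Knowledge and Bio Uplift": 9,
    "Alignment, Honesty, Scheming": 4,
    "Autonomy and AI R&D": 4,
    "Comprehensive Safety and Trust": 6,
    "Bias and Fairness": 1,
    "Tooling Frameworks": 4,
}

total_benchmarks = sum(CATEGORY_COUNTS.values())
total_categories = len(CATEGORY_COUNTS)

# Prints: 61 benchmarks across 11 categories
print(f"{total_benchmarks} benchmarks across {total_categories} categories")
```

Re-running a check like this after adding an entry is a quick way to keep the summary figures in sync with the table.
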
- Cyber Offense and CTF
- Cyber Defense and Threat Intel
- Software Security and Code
- Agent Security and Prompt Injection
- Jailbreak and Refusal
- CBRN Knowledge and Bio Uplift
- Alignment, Honesty, Scheming
- Autonomy and AI R&D
- Comprehensive Safety and Trust
- Bias and Fairness
- Tooling Frameworks
Public cyber capabilities benchmark of 40 CTF challenges from four CTF competitions
- Organization: Stanford / academic consortium
- Paper: https://arxiv.org/abs/2408.08926
- Resource: https://cybench.github.io/
- Cited by: Anthropic, OpenAI, Google DeepMind, UK AISI
Targeted vulnerability reproduction in real open-source projects from high-level descriptions
- Organization: UC Berkeley
- Paper: https://arxiv.org/abs/2506.02548
- Cited by: Anthropic (Opus 4.7 system card), OpenAI (GPT-5.5 system card)
Identify and exploit vulnerabilities in free and open-source web applications
- Organization: Academic
- Paper: https://arxiv.org/abs/2503.17332
- Cited by: Used in academic and industry evaluations
15 cyber offense challenges aligned to MITRE ATT&CK, with 80 elicitation configurations to find the best-performing setup
- Organization: Apollo Research / Apart Research
- Paper: https://arxiv.org/abs/2410.09114
- Resource: https://cybercapabilities.org
- Cited by: Apollo Research, UK AISI, academic literature
Difficulty scoring system for vulnerability and exploit benchmarks
- Organization: Irregular (Pattern Labs)
- Paper: https://www.irregular.com/publications/introducing-solve
- Cited by: Used by Irregular in frontier lab assessments
Scenario-based benchmarking for LLM cyber capabilities
- Organization: Irregular (Pattern Labs)
- Paper: https://www.irregular.com/publications/cyscenariobench
- Cited by: Used by Irregular in frontier lab assessments
200+ CTF challenges from NYU CSAW competitions; complements Cybench
- Organization: NYU
- Paper: https://arxiv.org/abs/2406.05590
- Resource: https://github.com/NYU-LLM-CTF/NYU_CTF_Bench
- Cited by: Academic, frontier lab cyber suites
Automated vulnerability detection on Nginx within the DARPA AIxCC framework
- Organization: Alan Turing Institute / DARPA
- Paper: https://arxiv.org/abs/2410.21939
- Cited by: Alan Turing Institute, DARPA
Open-source applications frozen at vulnerable versions, measuring miss rate on known CVEs
- Organization: XBOW
- Paper: https://xbow.com/blog/mythos-like-hacking-open-to-all
- Cited by: XBOW, cited in OpenAI Daybreak / GPT-5.5 system card discussions
End-to-end detection rule generation with AI agents
- Organization: Microsoft
- Paper: https://arxiv.org/html/2603.13517v1
- Resource: https://www.microsoft.com/en-us/security/blog/2026/03/20/cti-realm-a-new-benchmark-for-end-to-end-detection-rule-generation-with-ai-agents/
- Cited by: Microsoft Security Research
Evaluating LLM agents on cyber threat investigation
- Organization: Microsoft
- Paper: https://arxiv.org/abs/2507.14201
- Resource: https://github.com/microsoft/SecRL
- Cited by: Microsoft Security Research
Malware analysis and threat intelligence reasoning; defensive capabilities benchmark
- Organization: Meta (with CrowdStrike)
- Paper: https://ai.meta.com/research/publications/cybersoceval-benchmarking-llms-capabilities-for-malware-analysis-and-threat-intelligence-reasoning/
- Resource: https://github.com/meta-llama/PurpleLlama
- Cited by: Meta, CrowdStrike
MCQA, RCM, VSP, ATE tasks for cyber threat intelligence (knowledge, attribution, severity)
- Organization: Academic (Alam et al.)
- Paper: https://arxiv.org/abs/2406.07599
- Cited by: Cisco Foundation-Sec, academic security LLM evals
RAG-based benchmark for cybersecurity knowledge (cryptography, reverse engineering, risk)
- Organization: Technology Innovation Institute (TII) / Khalifa University
- Paper: https://arxiv.org/abs/2402.07688
- Cited by: Foundation-Sec models, academic security LLM evaluations
Multi-dimensional cybersecurity benchmark: 44,823 MCQs and 3,087 SAQs across sub-domains
- Organization: Tencent / HK PolyU
- Paper: https://arxiv.org/abs/2412.20787
- Cited by: Tencent, academic security LLM evals
MCQs across software, network, and web security topics
- Organization: Academic (Li et al.)
- Paper: https://arxiv.org/abs/2311.11680
- Cited by: Cisco Foundation-Sec, academic security LLM evals
Foundational cybersecurity concept questions
- Organization: Academic
- Paper: https://arxiv.org/abs/2312.15838
- Cited by: Academic security LLM evals
Security-oriented software engineering benchmark
- Organization: Academic
- Paper: https://arxiv.org/html/2512.03262v1
- Cited by: Academic / under publication
Umbrella suite: insecure coding (CWE), MITRE ATT&CK helpfulness, prompt injection (textual and visual), code interpreter abuse, and CyberSOCEval
- Organization: Meta
- Paper: https://arxiv.org/abs/2404.13161
- Resource: https://github.com/meta-llama/PurpleLlama
- Cited by: Meta (Llama 4 system card), applied to OpenAI, Google, Anthropic models
Curated set of high-impact attacks from a large-scale public competition
- Organization: Gray Swan AI
- Paper: https://arxiv.org/pdf/2507.20526
- Cited by: Gray Swan, frontier lab agentic evals
Evaluating sabotage and monitoring in LLM agents (29 complex environments)
- Organization: Anthropic
- Paper: https://arxiv.org/abs/2506.15740
- Resource: https://www.anthropic.com/research/shade-arena-sabotage-monitoring
- Cited by: Anthropic
Dynamic framework jointly evaluating utility and prompt injection resilience for tool-integrated agents
- Organization: ETH Zurich / Invariant Labs
- Paper: https://arxiv.org/abs/2406.13352
- Resource: https://agentdojo.spylab.ai/
- Cited by: US AISI, UK AISI, NeurIPS 2024 SafeBench prize winner
Indirect prompt injection: 1,054 test cases, 17 user tools, 62 attacker tools
- Organization: UIUC (Kang Lab)
- Paper: https://arxiv.org/abs/2403.02691
- Resource: https://github.com/uiuc-kang-lab/InjecAgent
- Cited by: Widely cited, used in frontier agent security research
Benchmark for measuring the harmfulness of LLM agents when the user is malicious (ICLR 2025)
- Organization: Gray Swan AI / UK AISI
- Paper: https://arxiv.org/abs/2410.09024
- Cited by: Gray Swan, UK AISI, ICLR 2025
Benchmark for Indirect Prompt Injection Attacks
- Organization: Microsoft
- Paper: https://arxiv.org/abs/2312.14197
- Resource: https://github.com/microsoft/BIPIA
- Cited by: Microsoft Security Research
Prompt extraction and hijacking benchmark grown from a public game
- Organization: UC Berkeley
- Paper: https://arxiv.org/abs/2311.01011
- Resource: https://tensortrust.ai/
- Cited by: Referenced in AgentDojo, OpenAI prompt injection work
Browser agent red teaming benchmark
- Organization: Gray Swan AI
- Paper: https://arxiv.org/abs/2410.13886
- Cited by: Gray Swan, frontier browser agent evals
State-of-the-art LLM jailbreak evaluation benchmark with quality-aware scoring
- Organization: UC Berkeley
- Paper: https://arxiv.org/abs/2402.10260
- Resource: https://strong-reject.readthedocs.io/en/latest/
- Cited by: OpenAI, Anthropic system cards
Standardized red-teaming evaluation framework with classifier-based harm grading
- Organization: Center for AI Safety (Mazeika et al.)
- Paper: https://arxiv.org/abs/2402.04249
- Resource: https://www.harmbench.org/
- Cited by: Anthropic, OpenAI, Google DeepMind, Meta system cards
Open robustness benchmark for jailbreaking LLMs (NeurIPS 2024)
- Organization: Academic (Chao, Debenedetti, Robey, et al.)
- Paper: https://arxiv.org/abs/2404.01318
- Resource: https://jailbreakbench.github.io/
- Cited by: Anthropic, OpenAI, academic safety research
Tests over-refusal: incorrectly refusing safe requests (counterweight to StrongREJECT)
- Organization: Academic (Rottger et al.)
- Paper: https://arxiv.org/abs/2308.01263
- Resource: https://github.com/paul-rottger/exaggerated-safety
- Cited by: OpenAI, Anthropic, Google DeepMind system cards
Fine-grained refusal evaluation across 45 unsafe topic categories
- Organization: Princeton / Virginia Tech
- Paper: https://arxiv.org/abs/2406.14598
- Resource: https://sorry-bench.github.io/
- Cited by: Academic safety research
Adversarial harmful behaviors dataset (Zou et al. GCG paper)
- Organization: CMU / Center for AI Safety
- Paper: https://arxiv.org/abs/2307.15043
- Resource: https://github.com/llm-attacks/llm-attacks
- Cited by: Widely cited across frontier labs
3,668 MCQs across biosecurity, cybersecurity, and chemical security; proxy for hazardous knowledge and unlearning benchmark
- Organization: Center for AI Safety + Scale AI consortium
- Paper: https://arxiv.org/abs/2403.03218
- Resource: https://safe.ai/blog/wmdp-benchmark
- Cited by: Anthropic, OpenAI, Google DeepMind, Amazon Nova, Meta
Biology research capability: LitQA2, ProtocolQA, SeqQA, FigQA, Cloning Scenarios
- Organization: FutureHouse
- Paper: https://arxiv.org/abs/2407.10362
- Resource: https://huggingface.co/datasets/futurehouse/lab-bench
- Cited by: Anthropic, OpenAI, Amazon Nova system cards
Multiple-response virology benchmark; top models now exceed expert virologists
- Organization: SecureBio
- Paper: https://arxiv.org/abs/2504.16137
- Cited by: Anthropic, OpenAI, frontier CBRN sections
Long-form biorisk question evaluation
- Organization: Gryphon Scientific (now Deloitte)
- Cited by: OpenAI Preparedness Framework evaluations
Bio tacit knowledge and troubleshooting questions
- Organization: Gryphon Scientific (now Deloitte)
- Cited by: OpenAI Preparedness Framework evaluations
Creative biology task evaluations
- Organization: SecureBio
- Cited by: Anthropic system cards
Short-horizon computational biology tasks
- Organization: Faculty.ai / Anthropic
- Cited by: Anthropic system cards
WMD proliferation risk benchmark with safety-usefulness tradeoff
- Organization: Scale AI
- Paper: https://arxiv.org/abs/2502.14086
- Cited by: Scale AI, frontier CBRN evaluations
Real-world risk metric layered on top of LAB-Bench, BioLP-bench, and WMDP
- Organization: Johns Hopkins School of Medicine
- Paper: https://arxiv.org/abs/2511.16823
- Cited by: Academic CBRN risk methodology
Disentangles honesty from accuracy; large-scale lying-under-pressure evaluation
- Organization: Center for AI Safety
- Paper: https://arxiv.org/abs/2503.03750
- Resource: https://www.mask-benchmark.ai/
- Cited by: Anthropic, OpenAI safety research
Six agentic evaluations where models are placed in environments that incentivize scheming
- Organization: Apollo Research
- Paper: https://arxiv.org/abs/2412.04984
- Resource: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
- Cited by: Apollo Research, used pre-deployment by Anthropic and OpenAI
11 evaluations supporting a scheming-inability safety case
- Organization: Google DeepMind / Apollo Research
- Paper: https://arxiv.org/abs/2505.01420
- Cited by: Google DeepMind Frontier Safety Framework
Tests model self-awareness as a propensity benchmark
- Organization: Laine et al. (academic)
- Paper: https://arxiv.org/abs/2407.04694
- Resource: https://situational-awareness-dataset.org/
- Cited by: Academic safety research, Apollo Research
AI R&D capabilities of language model agents vs human experts; multi-hour task time horizons
- Organization: METR (Model Evaluation and Threat Research)
- Paper: https://arxiv.org/abs/2411.15114
- Resource: https://metr.org/AI_R_D_Evaluation_Report.pdf
- Cited by: OpenAI (o3, o4-mini, GPT-4.5, GPT-5.1), Anthropic (Claude 3.7+), White House NSM on AI, EU AI Act
Time-horizon methodology that measures the length of tasks an AI can complete; tracks the exponential trend over time
- Organization: METR
- Paper: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
- Resource: https://metr.org/research/
- Cited by: Used by METR for pre-deployment evals of OpenAI and Anthropic models
Autonomous ML research task benchmark
- Organization: Stanford
- Paper: https://arxiv.org/abs/2310.03302
- Resource: https://github.com/snap-stanford/MLAgentBench
- Cited by: Anthropic, academic AI R&D evals
Human-verified subset of real GitHub issues; used as autonomy signal in RSP and Preparedness contexts
- Organization: OpenAI Preparedness / Princeton NLP
- Paper: https://arxiv.org/abs/2310.06770
- Resource: https://www.swebench.com/
- Cited by: Anthropic, OpenAI, Google DeepMind system cards
12 hazard categories (violent crime, CSAM, weapons, suicide, privacy, defamation, hate, etc.); MLCommons industry standard
- Organization: MLCommons (AIRR Working Group)
- Paper: https://arxiv.org/abs/2503.05731
- Resource: https://mlcommons.org/ailuminate/safety/
- Cited by: MLCommons consortium, Stanford AI Index Report
Comprehensive AI risk taxonomy benchmark spanning multiple safety dimensions
- Organization: Stanford CRFM
- Paper: https://arxiv.org/abs/2407.17436
- Resource: https://crfm.stanford.edu/2024/08/01/air-bench.html
- Cited by: Stanford CRFM, frontier safety research
8 trustworthiness perspectives: toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, fairness
- Organization: UIUC / Stanford / Berkeley
- Paper: https://arxiv.org/abs/2306.11698
- Resource: https://decodingtrust.github.io/
- Cited by: Widely cited across frontier labs and academic safety
30+ datasets across 6 trust dimensions (truthfulness, safety, fairness, robustness, privacy, ethics)
- Organization: Academic consortium
- Paper: https://arxiv.org/abs/2401.05561
- Resource: https://trustllmbenchmark.github.io/TrustLLM-Website/
- Cited by: Academic, frontier safety research
11,000+ MCQs across 7 safety categories
- Organization: Tsinghua University
- Paper: https://arxiv.org/abs/2309.07045
- Resource: https://github.com/thu-coai/SafetyBench
- Cited by: Academic, multilingual safety evals
Aggregator of 35+ safety benchmarks
- Organization: Walled AI Labs
- Paper: https://arxiv.org/abs/2408.03837
- Resource: https://github.com/walledai/walledeval
- Cited by: Industry safety platforms
Hand-built bias benchmark across nine demographic axes for QA
- Organization: NYU (Parrish et al.)
- Paper: https://arxiv.org/abs/2110.08193
- Resource: https://github.com/nyu-mll/BBQ
- Cited by: Anthropic (Opus 4.5/4.6/4.7 system cards), OpenAI, academic fairness research
Evaluation harness used by US and UK AI Safety Institutes; AgentDojo and many others ship as Inspect tasks
- Organization: UK AI Safety Institute
- Resource: https://inspect.aisi.org.uk/
- Cited by: US AISI, UK AISI, joint frontier model red-teaming
Adversarial Threat Landscape for AI Systems; threat-model taxonomy (not a benchmark)
- Organization: MITRE
- Resource: https://atlas.mitre.org/
- Cited by: MITRE, NIST AI RMF, industry adoption
Python Risk Identification Toolkit; open-source red-teaming framework
- Organization: Microsoft
- Resource: https://github.com/Azure/PyRIT
- Cited by: Microsoft AI Red Team
Open-source LLM vulnerability scanner
- Organization: NVIDIA
- Paper: https://arxiv.org/abs/2406.11036
- Resource: https://github.com/NVIDIA/garak
- Cited by: NVIDIA, industry red-teaming
If you want the tightest core list, these appear most consistently in 2025-2026 system cards from Anthropic, OpenAI, Google DeepMind, and Meta, plus AISI publications:
- WMDP
- HarmBench
- AgentDojo
- InjecAgent
- AgentHarm
- MASK
- 3CB
- METR RE-Bench
- CyberSecEval (Meta)
- LAB-Bench
- AILuminate
- Apollo Research scheming and situational awareness evals
Pull requests welcome. Please include the paper URL, the publishing organization, and which frontier labs or AISIs have cited the benchmark.
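
Before opening a pull request, a proposed entry can be checked for the three items requested above. The sketch below is a hypothetical helper, not tooling that ships with this list. It assumes a new entry follows the layout of existing ones: a one-line description followed by `- Label: value` lines, with `Organization`, `Paper`, and `Cited by` treated as required and `Resource` as optional.

```python
import re

# Labels assumed to be required for new contributions; "Resource" is optional.
REQUIRED_FIELDS = ("Organization", "Paper", "Cited by")

# Matches lines such as "- Paper: https://arxiv.org/abs/2408.08926".
FIELD_PATTERN = re.compile(r"^- (?P<label>[^:]+): (?P<value>.+)$")


def parse_entry(lines):
    """Split an entry into its description line and a dict of labeled fields."""
    description, fields = None, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = FIELD_PATTERN.match(line)
        if match:
            fields[match["label"].strip()] = match["value"].strip()
        elif description is None:
            description = line
    return description, fields


def missing_fields(lines):
    """Return the problems found in an entry; an empty list means it looks complete."""
    description, fields = parse_entry(lines)
    problems = [name for name in REQUIRED_FIELDS if name not in fields]
    if description is None:
        problems.append("description")
    paper = fields.get("Paper")
    if paper is not None and not paper.startswith(("http://", "https://")):
        problems.append("Paper (not a URL)")
    return problems


# Example using an entry copied from this list.
entry = [
    "Public cyber capabilities benchmark of 40 CTF challenges from four CTF competitions",
    "- Organization: Stanford / academic consortium",
    "- Paper: https://arxiv.org/abs/2408.08926",
    "- Cited by: Anthropic, OpenAI, Google DeepMind, UK AISI",
]
print(missing_fields(entry))  # → []
```

Some existing entries have no `Paper` line, so a check like this is only meant for new submissions.
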
This list is shared under CC BY 4.0. Linked papers and repositories retain their own licenses.