AI Security and Safety Evaluation Benchmarks

A curated, categorized list of AI security and safety evaluation benchmarks that are well regarded by frontier AI labs (Anthropic, OpenAI, Google DeepMind, Meta) and AI Safety Institutes (US AISI, UK AISI).

Maintained by Anshu Gupta, Founder & CISO of Fixin Security and Founder of Tejas Cyber Network.

Overview

Total benchmarks: 61
Categories: 11

Categories at a glance

Category	Count
Cyber Offense and CTF	9
Cyber Defense and Threat Intel	8
Software Security and Code	2
Agent Security and Prompt Injection	8
Jailbreak and Refusal	6
CBRN Knowledge and Bio Uplift	9
Alignment, Honesty, Scheming	4
Autonomy and AI R&D	4
Comprehensive Safety and Trust	6
Bias and Fairness	1
Tooling Frameworks	4
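
The totals in the Overview can be checked against the table above. A minimal sketch in Python, with the counts copied from the table (the variable names are illustrative, not part of this repository):

```python
# Category counts copied from the "Categories at a glance" table.
category_counts = {
    "Cyber Offense and CTF": 9,
    "Cyber Defense and Threat Intel": 8,
    "Software Security and Code": 2,
    "Agent Security and Prompt Injection": 8,
    "Jailbreak and Refusal": 6,
    "CBRN Knowledge and Bio Uplift": 9,
    "Alignment, Honesty, Scheming": 4,
    "Autonomy and AI R&D": 4,
    "Comprehensive Safety and Trust": 6,
    "Bias and Fairness": 1,
    "Tooling Frameworks": 4,
}

total_benchmarks = sum(category_counts.values())
total_categories = len(category_counts)

print(f"Total benchmarks: {total_benchmarks}")  # prints "Total benchmarks: 61"
print(f"Categories: {total_categories}")        # prints "Categories: 11"
```

Rerunning a check like this after adding an entry helps keep the Overview numbers in step with the table.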

Cyber Offense and CTF

Cybench

Public cyber capabilities benchmark of 40 CTF challenges from four CTF competitions

Organization: Stanford / academic consortium
Paper: https://arxiv.org/abs/2408.08926
Resource: https://cybench.github.io/
Cited by: Anthropic, OpenAI, Google DeepMind, UK AISI

CyberGym

Targeted vulnerability reproduction in real open-source projects from high-level descriptions

Organization: UC Berkeley
Paper: https://arxiv.org/abs/2506.02548
Cited by: Anthropic (Opus 4.7 system card), OpenAI (GPT-5.5 system card)

CVE-Bench

Identify and exploit vulnerabilities in free and open-source web applications

Organization: Academic
Paper: https://arxiv.org/abs/2503.17332
Cited by: Used in academic and industry evaluations

3CB (Catastrophic Cyber Capabilities Benchmark)

15 cyber offense challenges aligned to MITRE ATT&CK, with 80 elicitation configurations to find the best-performing setup

Organization: Apollo Research / Apart Research
Paper: https://arxiv.org/abs/2410.09114
Resource: https://cybercapabilities.org
Cited by: Apollo Research, UK AISI, academic literature

SOLVE (Scoring Obstacle Levels in Vulnerabilities & Exploits)

Difficulty scoring system for vulnerability and exploit benchmarks

Organization: Irregular (Pattern Labs)
Paper: https://www.irregular.com/publications/introducing-solve
Cited by: Used by Irregular in frontier lab assessments

CyScenarioBench

Scenario-based benchmarking for LLM cyber capabilities

Organization: Irregular (Pattern Labs)
Paper: https://www.irregular.com/publications/cyscenariobench
Cited by: Used by Irregular in frontier lab assessments

NYU CTF Bench

200+ CTF challenges from NYU CSAW competitions; complements Cybench

Organization: NYU
Paper: https://arxiv.org/abs/2406.05590
Resource: https://github.com/NYU-LLM-CTF/NYU_CTF_Bench
Cited by: Academic, frontier lab cyber suites

HonestCyberEval / AIxCC

Automated vulnerability detection on Nginx and the DARPA AIxCC framework

Organization: Alan Turing Institute / DARPA
Paper: https://arxiv.org/abs/2410.21939
Cited by: Alan Turing Institute, DARPA

XBOW internal benchmark

Open-source applications frozen at vulnerable versions, measuring miss rate on known CVEs

Organization: XBOW
Paper: https://xbow.com/blog/mythos-like-hacking-open-to-all
Cited by: XBOW, cited in OpenAI Daybreak / GPT-5.5 system card discussions

Cyber Defense and Threat Intel

CTI-REALM

End-to-end detection rule generation with AI agents

Organization: Microsoft
Paper: https://arxiv.org/html/2603.13517v1
Resource: https://www.microsoft.com/en-us/security/blog/2026/03/20/cti-realm-a-new-benchmark-for-end-to-end-detection-rule-generation-with-ai-agents/
Cited by: Microsoft Security Research

ExCyTIn-Bench

Evaluating LLM agents on cyber threat investigation

Organization: Microsoft
Paper: https://arxiv.org/abs/2507.14201
Resource: https://github.com/microsoft/SecRL
Cited by: Microsoft Security Research

CyberSOCEval

Malware analysis and threat intelligence reasoning; defensive capabilities benchmark

Organization: Meta (with CrowdStrike)
Paper: https://ai.meta.com/research/publications/cybersoceval-benchmarking-llms-capabilities-for-malware-analysis-and-threat-intelligence-reasoning/
Resource: https://github.com/meta-llama/PurpleLlama
Cited by: Meta, CrowdStrike

CTIBench

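Cyber threat intelligence tasks covering knowledge, attribution, and severity: multiple-choice question answering (MCQA), root cause mapping (RCM), vulnerability severity prediction (VSP), and attack technique extraction (ATE)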

Organization: Academic (Alam et al.)
Paper: https://arxiv.org/abs/2406.07599
Cited by: Cisco Foundation-Sec, academic security LLM evals

CyberMetric

RAG-based benchmark for cybersecurity knowledge (cryptography, reverse engineering, risk)

Organization: Technology Innovation Institute (TII) / Khalifa University
Paper: https://arxiv.org/abs/2402.07688
Cited by: Foundation-Sec models, academic security LLM evaluations

SecBench

Multi-dimensional cybersecurity benchmark: 44,823 MCQs and 3,087 SAQs across sub-domains

Organization: Tencent / HK PolyU
Paper: https://arxiv.org/abs/2412.20787
Cited by: Tencent, academic security LLM evals

SecEval

MCQs across software, network, and web security topics

Organization: Academic (Li et al.)
Paper: https://arxiv.org/abs/2311.11680
Cited by: Cisco Foundation-Sec, academic security LLM evals

SecQA

Foundational cybersecurity concept questions

Organization: Academic
Paper: https://arxiv.org/abs/2312.15838
Cited by: Academic security LLM evals

Software Security and Code

SusVibes

Security-oriented software engineering benchmark

Organization: Academic
Paper: https://arxiv.org/html/2512.03262v1
Cited by: Academic / under publication

CyberSecEval (Meta Purple Llama, v1 to v4)

Umbrella suite: insecure coding (CWE), MITRE ATT&CK helpfulness, prompt injection (textual and visual), code interpreter abuse, and CyberSOCEval

Organization: Meta
Paper: https://arxiv.org/abs/2404.13161
Resource: https://github.com/meta-llama/PurpleLlama
Cited by: Meta (Llama 4 system card), applied to OpenAI, Google, Anthropic models

Agent Security and Prompt Injection

Agent Red Teaming (ART)

Curated set of high-impact attacks from a large-scale public competition

Organization: Gray Swan AI
Paper: https://arxiv.org/pdf/2507.20526
Cited by: Gray Swan, frontier lab agentic evals

SHADE-Arena

Evaluating sabotage and monitoring in LLM agents (29 complex environments)

Organization: Anthropic
Paper: https://arxiv.org/abs/2506.15740
Resource: https://www.anthropic.com/research/shade-arena-sabotage-monitoring
Cited by: Anthropic

AgentDojo

Dynamic framework jointly evaluating utility and prompt injection resilience for tool-integrated agents

Organization: ETH Zurich / Invariant Labs
Paper: https://arxiv.org/abs/2406.13352
Resource: https://agentdojo.spylab.ai/
Cited by: US AISI, UK AISI, NeurIPS 2024 SafeBench prize winner

InjecAgent

Indirect prompt injection: 1,054 test cases, 17 user tools, 62 attacker tools

Organization: UIUC (Kang Lab)
Paper: https://arxiv.org/abs/2403.02691
Resource: https://github.com/uiuc-kang-lab/InjecAgent
Cited by: Widely cited, used in frontier agent security research
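
The three figures in the InjecAgent entry are mutually consistent if each user tool is paired with each attacker tool. That pairing is an assumption made here and checked only as arithmetic; consult the paper for how the test cases are actually constructed.

```python
# Figures copied from the InjecAgent entry above.
user_tools = 17
attacker_tools = 62

# Assumption: one test case per (user tool, attacker tool) pair.
test_cases = user_tools * attacker_tools
print(test_cases)  # prints 1054
```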

AgentHarm

Benchmark for measuring the harmfulness of LLM agents when the user is malicious (ICLR 2025)

Organization: Gray Swan AI / UK AISI
Paper: https://arxiv.org/abs/2410.09024
Cited by: Gray Swan, UK AISI, ICLR 2025

BIPIA

Benchmark for Indirect Prompt Injection Attacks

Organization: Microsoft
Paper: https://arxiv.org/abs/2312.14197
Resource: https://github.com/microsoft/BIPIA
Cited by: Microsoft Security Research

Tensor Trust

Prompt extraction and hijacking benchmark grown from a public game

Organization: UC Berkeley
Paper: https://arxiv.org/abs/2311.01011
Resource: https://tensortrust.ai/
Cited by: Referenced in AgentDojo, OpenAI prompt injection work

BrowserART

Browser agent red teaming benchmark

Organization: Gray Swan AI
Paper: https://arxiv.org/abs/2410.13886
Cited by: Gray Swan, frontier browser agent evals

Jailbreak and Refusal

StrongREJECT

State-of-the-art LLM jailbreak evaluation benchmark with quality-aware scoring

Organization: UC Berkeley
Paper: https://arxiv.org/abs/2402.10260
Resource: https://strong-reject.readthedocs.io/en/latest/
Cited by: OpenAI, Anthropic system cards

HarmBench

Standardized red-teaming evaluation framework with classifier-based harm grading

Organization: Center for AI Safety (Mazeika et al.)
Paper: https://arxiv.org/abs/2402.04249
Resource: https://www.harmbench.org/
Cited by: Anthropic, OpenAI, Google DeepMind, Meta system cards

JailbreakBench

Open robustness benchmark for jailbreaking LLMs (NeurIPS 2024)

Organization: Academic (Chao, Debenedetti, Robey, et al.)
Paper: https://arxiv.org/abs/2404.01318
Resource: https://jailbreakbench.github.io/
Cited by: Anthropic, OpenAI, academic safety research

XSTest

Tests over-refusal: incorrectly refusing safe requests (counterweight to StrongREJECT)

Organization: Academic (Röttger et al.)
Paper: https://arxiv.org/abs/2308.01263
Resource: https://github.com/paul-rottger/exaggerated-safety
Cited by: OpenAI, Anthropic, Google DeepMind system cards
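
Over-refusal and jailbreak resistance pull in opposite directions, so the two are best reported side by side. A minimal sketch of that pairing, assuming each model response has already been labeled as refused or not; the function name and every outcome below are hypothetical and invented for illustration, not results from XSTest or StrongREJECT:

```python
def refusal_rate(refused_flags):
    """Fraction of prompts the model refused (True means refused)."""
    return sum(refused_flags) / len(refused_flags)

# Hypothetical outcomes, invented for illustration only.
safe_prompt_refusals = [False, True, False, False]   # a refusal here is an over-refusal
unsafe_prompt_refusals = [True, True, True, False]   # a refusal here is the desired behavior

over_refusal = refusal_rate(safe_prompt_refusals)
harmful_refusal = refusal_rate(unsafe_prompt_refusals)

print(f"over-refusal rate on safe prompts: {over_refusal:.2f}")
print(f"refusal rate on unsafe prompts: {harmful_refusal:.2f}")
```

A model that refuses everything scores perfectly on the second number and worst on the first, which is why the list treats XSTest as a counterweight.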

SORRY-Bench

Fine-grained refusal evaluation across 45 unsafe topic categories

Organization: Princeton / Virginia Tech
Paper: https://arxiv.org/abs/2406.14598
Resource: https://sorry-bench.github.io/
Cited by: Academic safety research

AdvBench

Adversarial harmful behaviors dataset (Zou et al. GCG paper)

Organization: CMU / Center for AI Safety
Paper: https://arxiv.org/abs/2307.15043
Resource: https://github.com/llm-attacks/llm-attacks
Cited by: Widely cited across frontier labs

CBRN Knowledge and Bio Uplift

WMDP (Weapons of Mass Destruction Proxy)

3,668 MCQs across biosecurity, cybersecurity, and chemical security; proxy for hazardous knowledge and unlearning benchmark

Organization: Center for AI Safety + Scale AI consortium
Paper: https://arxiv.org/abs/2403.03218
Resource: https://safe.ai/blog/wmdp-benchmark
Cited by: Anthropic, OpenAI, Google DeepMind, Amazon Nova, Meta

LAB-Bench

Biology research capability: LitQA2, ProtocolQA, SeqQA, FigQA, Cloning Scenarios

Organization: FutureHouse
Paper: https://arxiv.org/abs/2407.10362
Resource: https://huggingface.co/datasets/futurehouse/lab-bench
Cited by: Anthropic, OpenAI, Amazon Nova system cards

Virology Capabilities Test (VCT)

Multiple-response virology benchmark; top models now exceed expert virologists

Organization: SecureBio
Paper: https://arxiv.org/abs/2504.16137
Cited by: Anthropic, OpenAI, frontier CBRN sections

Long-form Biorisk Questions (LFB)

Long-form biorisk question evaluation

Organization: Gryphon Scientific (now Deloitte)
Cited by: OpenAI Preparedness Framework evaluations

Tacit Knowledge and Troubleshooting (TTK)

Bio tacit knowledge and troubleshooting questions

Organization: Gryphon Scientific (now Deloitte)
Cited by: OpenAI Preparedness Framework evaluations

Creative Biology (CrB)

Creative biology task evaluations

Organization: SecureBio
Cited by: Anthropic system cards

Short-Horizon Bio Tasks (SHB)

Short-horizon computational biology tasks

Organization: Faculty.ai / Anthropic
Cited by: Anthropic system cards

FORTRESS

WMD proliferation risk benchmark with safety-usefulness tradeoff

Organization: Scale AI
Paper: https://arxiv.org/abs/2502.14086
Cited by: Scale AI, frontier CBRN evaluations

MOCET (Monte Carlo Expected Threat)

Real-world risk metric layered on top of LAB-Bench, BioLP-bench, and WMDP

Organization: Johns Hopkins School of Medicine
Paper: https://arxiv.org/abs/2511.16823
Cited by: Academic CBRN risk methodology

Alignment, Honesty, Scheming

MASK Benchmark

Disentangles honesty from accuracy; large-scale lying-under-pressure evaluation

Organization: Center for AI Safety
Paper: https://arxiv.org/abs/2503.03750
Resource: https://www.mask-benchmark.ai/
Cited by: Anthropic, OpenAI safety research

Apollo In-Context Scheming Evaluations

Six agentic evaluations where models are placed in environments that incentivize scheming

Organization: Apollo Research
Paper: https://arxiv.org/abs/2412.04984
Resource: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
Cited by: Apollo Research, used pre-deployment by Anthropic and OpenAI

Stealth and Situational Awareness Evaluations

11 evaluations supporting a scheming-inability safety case

Organization: Google DeepMind / Apollo Research
Paper: https://arxiv.org/abs/2505.01420
Cited by: Google DeepMind Frontier Safety Framework

Situational Awareness Dataset (SAD)

Tests model self-awareness as a propensity benchmark

Organization: Laine et al. (academic)
Paper: https://arxiv.org/abs/2407.04694
Resource: https://situational-awareness-dataset.org/
Cited by: Academic safety research, Apollo Research

Autonomy and AI R&D

METR RE-Bench

AI R&D capabilities of language model agents vs human experts; multi-hour task time horizons

Organization: METR (Model Evaluation and Threat Research)
Paper: https://arxiv.org/abs/2411.15114
Resource: https://metr.org/AI_R_D_Evaluation_Report.pdf
Cited by: OpenAI (o3, o4-mini, GPT-4.5, GPT-5.1), Anthropic (Claude 3.7+), White House NSM on AI, EU AI Act

METR HCAST / Task-Length Suite

Measures the length of tasks (in human completion time) that AI agents can complete; tracks the exponential trend in that time horizon

Organization: METR
Paper: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Resource: https://metr.org/research/
Cited by: Used by METR for pre-deployment evals of OpenAI and Anthropic models
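
Tracking an exponential trend amounts to fitting a straight line to the base-2 logarithm of the time horizon against release date; the reciprocal of the slope is the doubling time. A minimal sketch of that calculation, assuming synthetic data points. The function name and the numbers are made up for illustration and are not METR's measurements or METR's code:

```python
import math

def doubling_time_months(points):
    """Least-squares fit of log2(horizon) against time.

    points: list of (months_since_reference, horizon_minutes) pairs.
    Returns the number of months per doubling of the horizon.
    """
    xs = [t for t, _ in points]
    ys = [math.log2(h) for _, h in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # doublings per month
    return 1.0 / slope

# Synthetic data, constructed so the horizon doubles every 6 months.
synthetic = [(0, 1.0), (6, 2.0), (12, 4.0), (18, 8.0)]
print(doubling_time_months(synthetic))
```

With real data the fit will not be exact, so a confidence interval on the slope would be worth reporting alongside the point estimate.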

MLAgentBench

Autonomous ML research task benchmark

Organization: Stanford
Paper: https://arxiv.org/abs/2310.03302
Resource: https://github.com/snap-stanford/MLAgentBench
Cited by: Anthropic, academic AI R&D evals

SWE-bench Verified

Human-verified subset of real GitHub issues; used as autonomy signal in RSP and Preparedness contexts

Organization: OpenAI Preparedness / Princeton NLP
Paper: https://arxiv.org/abs/2310.06770
Resource: https://www.swebench.com/
Cited by: Anthropic, OpenAI, Google DeepMind system cards

Comprehensive Safety and Trust

AILuminate v1.0

12 hazard categories (violent crime, CSAM, weapons, suicide, privacy, defamation, hate, etc.); MLCommons industry standard

Organization: MLCommons (AIRR Working Group)
Paper: https://arxiv.org/abs/2503.05731
Resource: https://mlcommons.org/ailuminate/safety/
Cited by: MLCommons consortium, Stanford AI Index Report

AIR-Bench 2024

Comprehensive AI risk taxonomy benchmark spanning multiple safety dimensions

Organization: Stanford CRFM
Paper: https://arxiv.org/abs/2407.17436
Resource: https://crfm.stanford.edu/2024/08/01/air-bench.html
Cited by: Stanford CRFM, frontier safety research

DecodingTrust

8 trustworthiness perspectives: toxicity, bias, robustness, privacy, ethics, fairness, OOD, adversarial

Organization: UIUC / Stanford / Berkeley
Paper: https://arxiv.org/abs/2306.11698
Resource: https://decodingtrust.github.io/
Cited by: Widely cited across frontier labs and academic safety

TrustLLM

30+ datasets across 6 trust dimensions (truthfulness, safety, fairness, robustness, privacy, ethics)

Organization: Academic consortium
Paper: https://arxiv.org/abs/2401.05561
Resource: https://trustllmbenchmark.github.io/TrustLLM-Website/
Cited by: Academic, frontier safety research

SafetyBench

11,000+ MCQs across 7 safety categories

Organization: Tsinghua University
Paper: https://arxiv.org/abs/2309.07045
Resource: https://github.com/thu-coai/SafetyBench
Cited by: Academic, multilingual safety evals

WalledEval

Aggregator of 35+ safety benchmarks

Organization: Walled AI Labs
Paper: https://arxiv.org/abs/2408.03837
Resource: https://github.com/walledai/walledeval
Cited by: Industry safety platforms

Bias and Fairness

Bias Benchmark for Question Answering (BBQ)

Hand-built bias benchmark across nine demographic axes for QA

Organization: NYU (Parrish et al.)
Paper: https://arxiv.org/abs/2110.08193
Resource: https://github.com/nyu-mll/BBQ
Cited by: Anthropic (Opus 4.5/4.6/4.7 system cards), OpenAI, academic fairness research

Tooling Frameworks

UK AISI Inspect Evals

Evaluation harness used by US and UK AI Safety Institutes; AgentDojo and many others ship as Inspect tasks

Organization: UK AI Safety Institute
Resource: https://inspect.aisi.org.uk/
Cited by: US AISI, UK AISI, joint frontier model red-teaming

MITRE ATLAS

Adversarial Threat Landscape for AI Systems; threat-model taxonomy (not a benchmark)

Organization: MITRE
Resource: https://atlas.mitre.org/
Cited by: MITRE, NIST AI RMF, industry adoption

Microsoft PyRIT

Python Risk Identification Toolkit; open-source red-teaming framework

Organization: Microsoft
Resource: https://github.com/Azure/PyRIT
Cited by: Microsoft AI Red Team

Garak

Open-source LLM vulnerability scanner

Organization: NVIDIA
Paper: https://arxiv.org/abs/2406.11036
Resource: https://github.com/NVIDIA/garak
Cited by: NVIDIA, industry red-teaming

Highest-priority benchmarks for Frontier AI labs

If you want the tightest core list, these appear most consistently in 2025-2026 system cards from Anthropic, OpenAI, Google DeepMind, and Meta, plus AISI publications:

WMDP
HarmBench
AgentDojo
InjecAgent
AgentHarm
MASK
3CB
METR RE-Bench
CyberSecEval (Meta)
LAB-Bench
AILuminate
Apollo Research scheming and situational awareness evals

Contributing

Pull requests welcome. Please include the paper URL, the publishing organization, and which frontier labs or AISIs have cited the benchmark.
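
A new entry can be checked for the requested fields before a pull request is opened. A minimal sketch, assuming entries follow the "Label: value" layout used throughout this list; the helper name `missing_fields` and the choice of required labels are assumptions derived from the note above, not tooling shipped with this repository:

```python
# Labels requested in the Contributing note (an assumption about exact spelling).
REQUIRED_LABELS = ("Paper", "Organization", "Cited by")

def missing_fields(entry_text):
    """Return the required labels that are absent or empty in an entry."""
    present = set()
    for line in entry_text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            present.add(label.strip())
    return [label for label in REQUIRED_LABELS if label not in present]

# The Cybench entry from this list, used as a known-good example.
cybench_entry = """Cybench

Public cyber capabilities benchmark of 40 CTF challenges from four CTF competitions

Organization: Stanford / academic consortium
Paper: https://arxiv.org/abs/2408.08926
Resource: https://cybench.github.io/
Cited by: Anthropic, OpenAI, Google DeepMind, UK AISI
"""

print(missing_fields(cybench_entry))      # prints []
print(missing_fields("Organization: X"))  # prints ['Paper', 'Cited by']
```

Some existing entries have no public paper, so a missing "Paper" line is better treated as a prompt for review than as a hard failure.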

License

This list is shared under CC BY 4.0. Linked papers and repositories retain their own licenses.
