anshug/ai-security-evals

AI Security and Safety Evaluation Benchmarks

A curated, categorized list of AI security and safety evaluation benchmarks that are well regarded by frontier AI labs (Anthropic, OpenAI, Google DeepMind, Meta) and AI Safety Institutes (US AISI, UK AISI).

Maintained by Anshu Gupta, Founder & CISO of Fixin Security and Founder of Tejas Cyber Network.


Overview

  • Total benchmarks: 61
  • Categories: 11

Categories at a glance

| Category | Count |
| --- | --- |
| Cyber Offense and CTF | 9 |
| Cyber Defense and Threat Intel | 8 |
| Software Security and Code | 2 |
| Agent Security and Prompt Injection | 8 |
| Jailbreak and Refusal | 6 |
| CBRN Knowledge and Bio Uplift | 9 |
| Alignment, Honesty, Scheming | 4 |
| Autonomy and AI R&D | 4 |
| Comprehensive Safety and Trust | 6 |
| Bias and Fairness | 1 |
| Tooling Frameworks | 4 |
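
As a quick consistency check, the per-category counts can be summed and compared against the totals stated in the Overview. The sketch below copies the numbers from the table; the names `CATEGORY_COUNTS` and `total_entries` are illustrative and not part of this repository.

```python
# Category counts copied from the "Categories at a glance" table.
CATEGORY_COUNTS = {
    "Cyber Offense and CTF": 9,
    "Cyber Defense and Threat Intel": 8,
    "Software Security and Code": 2,
    "Agent Security and Prompt Injection": 8,
    "Jailbreak and Refusal": 6,
    "CBRN Knowledge and Bio Uplift": 9,
    "Alignment, Honesty, Scheming": 4,
    "Autonomy and AI R&D": 4,
    "Comprehensive Safety and Trust": 6,
    "Bias and Fairness": 1,
    "Tooling Frameworks": 4,
}


def total_entries(counts: dict) -> int:
    """Sum the per-category counts."""
    return sum(counts.values())


if __name__ == "__main__":
    # Compare against the Overview: 11 categories, 61 entries.
    print(len(CATEGORY_COUNTS), total_entries(CATEGORY_COUNTS))
```

Re-running a check like this after adding an entry helps keep the Overview numbers in sync with the list.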


Cyber Offense and CTF

Cybench

Public cyber capabilities benchmark of 40 CTF challenges from four CTF competitions

CyberGym

Targeted vulnerability reproduction in real open-source projects from high-level descriptions

CVE-Bench

Identify and exploit vulnerabilities in free and open-source web applications

3CB (Catastrophic Cyber Capabilities Benchmark)

15 cyber offense challenges aligned to MITRE ATT&CK, with 80 elicitation configurations to find the best-performing setup

SOLVE (Scoring Obstacle Levels in Vulnerabilities & Exploits)

Difficulty scoring system for vulnerability and exploit benchmarks

CyScenarioBench

Scenario-based benchmarking for LLM cyber capabilities

NYU CTF Bench

200+ CTF challenges from NYU CSAW competitions; complements Cybench

HonestCyberEval / AIxCC

Automated vulnerability detection on Nginx and DARPA AIxCC framework

XBOW internal benchmark

Open-source applications frozen at vulnerable versions, measuring miss rate on known CVEs


Cyber Defense and Threat Intel

CTI-REALM

End-to-end detection rule generation with AI agents

ExCyTIn-Bench

Evaluating LLM agents on cyber threat investigation

CyberSOCEval

Malware analysis and threat intelligence reasoning; defensive capabilities benchmark

CTIBench

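Cyber threat intelligence tasks: multiple-choice questions (MCQ), root cause mapping (RCM), vulnerability severity prediction (VSP), and attack technique extraction (ATE), covering knowledge, attribution, and severity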

CyberMetric

RAG-based benchmark for cybersecurity knowledge (cryptography, reverse engineering, risk)

  • Organization: Technology Innovation Institute (TII) / Khalifa University
  • Paper: https://arxiv.org/abs/2402.07688
  • Cited by: Foundation-Sec models, academic security LLM evaluations

SecBench

Multi-dimensional cybersecurity benchmark: 44,823 MCQs and 3,087 SAQs across sub-domains

SecEval

MCQs across software, network, and web security topics

SecQA

Foundational cybersecurity concept questions


Software Security and Code

SusVibes

Security-oriented software engineering benchmark

CyberSecEval (Meta Purple Llama, v1 to v4)

Umbrella suite: insecure coding (CWE), MITRE ATT&CK helpfulness, prompt injection (textual and visual), code interpreter abuse, and CyberSOCEval


Agent Security and Prompt Injection

Agent Red Teaming (ART)

Curated set of high-impact attacks from large-scale public competition

SHADE-Arena

Evaluating sabotage and monitoring in LLM agents (29 complex environments)

AgentDojo

Dynamic framework jointly evaluating utility and prompt injection resilience for tool-integrated agents
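
To make "jointly evaluating utility and prompt injection resilience" concrete, here is a minimal sketch of how the two kinds of score could be aggregated from per-episode results. The `Episode` record and the metric names are assumptions made for illustration; they are not AgentDojo's actual data model or API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Episode:
    """One agent run (hypothetical record format)."""
    user_task_solved: bool    # did the agent complete the legitimate user task?
    injection_present: bool   # did a tool output contain an injected instruction?
    attacker_goal_met: bool   # did the agent carry out the injected goal?


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(episodes: list) -> dict:
    """Report utility and security side by side."""
    benign = [e for e in episodes if not e.injection_present]
    attacked = [e for e in episodes if e.injection_present]
    return {
        "benign_utility": rate(e.user_task_solved for e in benign),
        "utility_under_attack": rate(e.user_task_solved for e in attacked),
        "attack_success_rate": rate(e.attacker_goal_met for e in attacked),
    }


if __name__ == "__main__":
    runs = [
        Episode(True, False, False),
        Episode(False, False, False),
        Episode(True, True, False),
        Episode(True, True, False),
        Episode(True, True, True),
        Episode(False, True, False),
    ]
    print(summarize(runs))
```

Reporting the numbers together matters because a defense that lowers attack success by refusing to use tools at all would also collapse utility.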

InjecAgent

Indirect prompt injection: 1,054 test cases, 17 user tools, 62 attacker tools

AgentHarm

Benchmark for measuring harmfulness of LLM agents when the user is malicious (ICLR 2025)

BIPIA

Benchmark for Indirect Prompt Injection Attacks

Tensor Trust

Prompt extraction and hijacking benchmark grown from a public game

BrowserART

Browser agent red teaming benchmark


Jailbreak and Refusal

StrongREJECT

LLM jailbreak evaluation benchmark with quality-aware scoring: the grader rates how specific and convincing a harmful response is, not only whether the model refused
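
A minimal sketch of what quality-aware scoring can look like, assuming a rubric in which a grader reports whether the model refused plus 1-5 ratings for specificity and convincingness. This reflects my reading of the StrongREJECT rubric; the function name is hypothetical and the official implementation should be consulted for the exact formula.

```python
def quality_aware_score(refused: bool, specificity: int, convincingness: int) -> float:
    """Return a score in [0, 1]; higher means a more useful harmful response.

    Assumed rubric: refusals score 0, otherwise the two 1-5 ratings are
    rescaled to [0, 1] and averaged.
    """
    for name, value in (("specificity", specificity), ("convincingness", convincingness)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    if refused:
        return 0.0
    # (s - 1) / 4 and (c - 1) / 4 averaged is ((s - 1) + (c - 1)) / 8.
    return ((specificity - 1) + (convincingness - 1)) / 8


if __name__ == "__main__":
    print(quality_aware_score(refused=False, specificity=3, convincingness=4))
```

Under a rubric like this, a jailbreak that gets the model to comply but yields vague, unusable text scores close to 0, which a binary "did it refuse" metric would miss.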

HarmBench

Standardized red-teaming evaluation framework with classifier-based harm grading

JailbreakBench

Open robustness benchmark for jailbreaking LLMs (NeurIPS 2024)

XSTest

Tests over-refusal: incorrectly refusing safe requests (counterweight to StrongREJECT)

SORRY-Bench

Fine-grained refusal evaluation across 45 unsafe topic categories

AdvBench

Adversarial harmful behaviors dataset (Zou et al. GCG paper)


CBRN Knowledge and Bio Uplift

WMDP (Weapons of Mass Destruction Proxy)

3,668 MCQs across biosecurity, cybersecurity, and chemical security; proxy for hazardous knowledge and unlearning benchmark

LAB-Bench

Biology research capability: LitQA2, ProtocolQA, SeqQA, FigQA, Cloning Scenarios

Virology Capabilities Test (VCT)

Multiple-response virology benchmark; top models now exceed expert virologists

Long-form Biorisk Questions (LFB)

Long-form biorisk question evaluation

  • Organization: Gryphon Scientific (now Deloitte)
  • Cited by: OpenAI Preparedness Framework evaluations

Tacit Knowledge and Troubleshooting (TTK)

Bio tacit knowledge and troubleshooting questions

  • Organization: Gryphon Scientific (now Deloitte)
  • Cited by: OpenAI Preparedness Framework evaluations

Creative Biology (CrB)

Creative biology task evaluations

  • Organization: SecureBio
  • Cited by: Anthropic system cards

Short-Horizon Bio Tasks (SHB)

Short-horizon computational biology tasks

  • Organization: Faculty.ai / Anthropic
  • Cited by: Anthropic system cards

FORTRESS

WMD proliferation risk benchmark with safety-usefulness tradeoff

MOCET (Monte Carlo Expected Threat)

Real-world risk metric layered on top of LAB-Bench, BioLP-bench, and WMDP


Alignment, Honesty, Scheming

MASK Benchmark

Disentangles honesty from accuracy; large-scale lying-under-pressure evaluation
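
The distinction between honesty and accuracy can be sketched as two separate comparisons: accuracy compares the model's belief with the ground truth, while honesty compares what the model says under pressure with its own belief. The code below is a simplified illustration using string labels; the function names are hypothetical, and the real benchmark's pipeline for eliciting beliefs and judging statements is more involved.

```python
from typing import Optional


def is_accurate(belief: Optional[str], ground_truth: str) -> Optional[bool]:
    """Accuracy: does the model's elicited belief match the ground truth?"""
    if belief is None:  # no consistent belief could be elicited
        return None
    return belief == ground_truth


def is_honest(belief: Optional[str], pressured_statement: Optional[str]) -> Optional[bool]:
    """Honesty: does the statement made under pressure match the model's own belief?"""
    if belief is None or pressured_statement is None:
        return None
    return belief == pressured_statement


if __name__ == "__main__":
    # Knows the truth ("A") but asserts "B" under pressure: accurate, not honest.
    print(is_accurate("A", "A"), is_honest("A", "B"))
    # Mistaken belief ("B") stated sincerely: not accurate, but honest.
    print(is_accurate("B", "A"), is_honest("B", "B"))
```

Keeping the two checks separate is what lets a model be scored as well informed yet dishonest, or mistaken yet honest.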

Apollo In-Context Scheming Evaluations

Six agentic evaluations where models are placed in environments that incentivize scheming

Stealth and Situational Awareness Evaluations

16 evaluations (5 stealth, 11 situational awareness) supporting a scheming-inability safety case

Situational Awareness Dataset (SAD)

Tests a model's knowledge of itself and its circumstances (situational awareness)


Autonomy and AI R&D

METR RE-Bench

AI R&D capabilities of language model agents vs human experts; multi-hour task time horizons

METR HCAST / Task-Length Suite

Measures the length of tasks, in human completion time, that AI agents can complete; used to track the exponential trend in task time horizon
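
An exponential trend in task length is usually summarized as a doubling time. The sketch below derives one from two time-horizon measurements; the function name and the numbers in the example are hypothetical and are not METR results.

```python
import math


def doubling_time_days(h1_minutes: float, h2_minutes: float, days_between: float) -> float:
    """Doubling time implied by two time-horizon measurements.

    Assumes pure exponential growth: h2 = h1 * 2 ** (days_between / T),
    so T = days_between / log2(h2 / h1).
    """
    if h1_minutes <= 0 or h2_minutes <= h1_minutes or days_between <= 0:
        raise ValueError("need 0 < h1 < h2 and a positive time gap")
    return days_between / math.log2(h2_minutes / h1_minutes)


if __name__ == "__main__":
    # Hypothetical: horizon grows from 15 to 60 minutes over 420 days,
    # i.e. two doublings, so the doubling time is 210 days.
    print(doubling_time_days(15, 60, 420))
```

A two-point estimate like this is noisy; fitting a line to log-horizon against release date over many models is the more robust version of the same idea.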

MLAgentBench

Autonomous ML research task benchmark

SWE-bench Verified

Human-verified subset of real GitHub issues; used as an autonomy signal in Anthropic Responsible Scaling Policy (RSP) and OpenAI Preparedness Framework contexts


Comprehensive Safety and Trust

AILuminate v1.0

12 hazard categories (violent crime, CSAM, weapons, suicide, privacy, defamation, hate, etc.); MLCommons industry standard

AIR-Bench 2024

Comprehensive AI risk taxonomy benchmark spanning multiple safety dimensions

DecodingTrust

8 trustworthiness perspectives: toxicity, stereotype bias, adversarial robustness, out-of-distribution (OOD) robustness, robustness to adversarial demonstrations, privacy, machine ethics, fairness

TrustLLM

30+ datasets across 6 trust dimensions (truthfulness, safety, fairness, robustness, privacy, ethics)

SafetyBench

11,000+ MCQs across 7 safety categories

WalledEval

Aggregator of 35+ safety benchmarks


Bias and Fairness

Bias Benchmark for Question Answering (BBQ)

Hand-built bias benchmark across nine demographic axes for QA


Tooling Frameworks

UK AISI Inspect Evals

Evaluation harness used by US and UK AI Safety Institutes; AgentDojo and many others ship as Inspect tasks

MITRE ATLAS

Adversarial Threat Landscape for AI Systems; threat-model taxonomy (not a benchmark)

Microsoft PyRIT

Python Risk Identification Tool; open-source red-teaming framework

Garak

Open-source LLM vulnerability scanner


Highest-priority benchmarks for frontier AI labs

If you want the tightest core list, these appear most consistently in 2025-2026 system cards from Anthropic, OpenAI, Google DeepMind, and Meta, plus AISI publications:

  1. WMDP
  2. HarmBench
  3. AgentDojo
  4. InjecAgent
  5. AgentHarm
  6. MASK
  7. 3CB
  8. METR RE-Bench
  9. CyberSecEval (Meta)
  10. LAB-Bench
  11. AILuminate
  12. Apollo Research scheming and situational awareness evals

Contributing

Pull requests welcome. Please include the paper URL, the publishing organization, and which frontier labs or AISIs have cited the benchmark.
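
Before opening a pull request, it can help to check that an entry carries the three requested pieces of information. The sketch below is one way to do that; `BenchmarkEntry` and `missing_fields` are hypothetical names, since this repository does not define an entry schema. The example values are taken from the CyberMetric entry above.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkEntry:
    """Hypothetical record for one list entry."""
    name: str
    paper_url: str = ""
    organization: str = ""
    cited_by: list = field(default_factory=list)


def missing_fields(entry: BenchmarkEntry) -> list:
    """Return the names of requested fields that are absent or malformed."""
    missing = []
    if not entry.paper_url.startswith(("http://", "https://")):
        missing.append("paper_url")
    if not entry.organization.strip():
        missing.append("organization")
    if not entry.cited_by:
        missing.append("cited_by")
    return missing


if __name__ == "__main__":
    entry = BenchmarkEntry(
        name="CyberMetric",
        paper_url="https://arxiv.org/abs/2402.07688",
        organization="Technology Innovation Institute (TII) / Khalifa University",
        cited_by=["Foundation-Sec models", "academic security LLM evaluations"],
    )
    print(missing_fields(entry))  # an empty list means the entry is complete
```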

License

This list is shared under CC BY 4.0. Linked papers and repositories retain their own licenses.
