linny006/agent-eval-harness

Agent Eval Harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

⭐ Star this repo to bookmark — fresh data every 15 minutes

English · 中文 · 日本語 · 한국어 · Español · Português


💡 What is this?

A standardized benchmark suite that runs coding agents against live, real-world GitHub issues with reproduction steps. Unlike static academic benchmarks, it outputs a weekly-updated public leaderboard, enabling developers to compare agents like OpenCode, Codex, and Claude Code in realistic scenarios.

This list is auto-updated every 15 minutes by a GitHub Actions cron job. Each commit reflects a real change in the upstream data source (new items added, expired items removed), so what you see is current.


📋 Current Items

⏰ Last updated: 2026-06-26 01:15 UTC

Data source: GitHub Search API

The table below is rewritten on every cron tick. Star the repo to bookmark.

# Name Stars Lang Updated Description
1 saddled-panicattack529/idea-evaluation-pipeline 0 2026-06-25 Streamline research idea evaluation for finance and economics to reach top journal quality using an iterative, AI-assist
2 Kondwani10/Origin-Continuum 0 2026-06-25 🌐 Define and explore the Origin ↔ Continuum framework, ensuring proper attribution and continuity in dependency relation
3 Sans-cell-art/-Project-Phoenix-The-E-Waste-Supercomputer- 0 2026-06-25 ♻️ Transform e-waste into a powerful, low-cost cloud operating system, unlocking computing potential and promoting resou
4 bhavya7995/AI_governance 1 PowerShell 2026-06-25 🤖 Streamline AI-assisted development with a governance kit for rules, enforcement, and decision-making, ensuring speed a
5 promptfoo/promptfoo 22603 TypeScript 2026-06-25 Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C
6 G59-Toneli/dataset-eval-skill 0 JavaScript 2026-06-25 A Claude skill for building golden sets to test AI systems — matching, RAG, LLM-as-judge — without false greens.
7 valbaudo/awf 1 Go 2026-06-25 Run agents you don't babysit, and trust the result. awf runs agentic workflows with an independent gate that checks ever
8 multivon-ai/multivon-eval 8 Python 2026-06-25 Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI
9 NoesisVision/nasde-toolkit 10 Python 2026-06-25 CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini ind
10 thewonderofyou777z-dot/tjoe-reviewkit 0 Python 2026-06-25 TjoeReviewKit: tjoe's local, offline workflow retrospective checker; does not run tasks, go online, take over tool calls, or collect production logs
11 Giskard-AI/giskard-oss 5465 Python 2026-06-25 🐢 Open-Source Evaluation & Testing library for LLM Agents
12 Arize-ai/phoenix 10280 Python 2026-06-25 AI Observability & Evaluation
13 verifywise-ai/verifywise 313 TypeScript 2026-06-25 Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI framewo
14 homemade-software-inc/completion-kit 1 Ruby 2026-06-25 Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and c
15 jeremylongshore/j-rig-skill-binary-eval 0 TypeScript 2026-06-25 Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score e
16 RaphaelFakhri/reagent 0 Python 2026-06-24 Tool-using ReAct + RAG agent (enterprise assistant) with a built-in evaluation harness scoring accuracy, tool selection,
17 tkarim45/agent-eval-harness 0 Python 2026-06-24 Agent eval harness — measure task success, tool-call accuracy, step efficiency, and cost for tool-using LLM agents (Clau
18 melody-ling-L/eval-resume 0 HTML 2026-06-24 Honesty benchmark for Chinese LLM resume rewriting: 20 anonymized resumes × 3 models × 4 dimensions · promptfoo + LLM-as-judge · includes an online report
19 TheAnacondA57/BidAgent 0 Python 2026-06-23 Agentic RAG over public telecom concession documents (DSP/RIP), designed eval-first and checked in CI.
20 IonDen/mlx-quant-fidelity 1 Python 2026-06-23 Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights
21 ahwurm/localshift 3 Python 2026-06-22 Migrate headless Claude/AI workloads to local LLMs with a derived, per-workload quality eval — cron job in, zero-margina
22 gashel01/evalmcp 0 Python 2026-06-22 Evaluation for AI agents — judge-based scoring and native RAG metrics (faithfulness, relevancy, context precision/recall
23 anejakartik/evalstack 0 TypeScript 2026-06-22 Open-source LLM evaluation framework — drop-in SDK + CI plugin. LLM-as-judge, regression detection, free + self-hostable
24 truera/trulens 3399 Python 2026-06-21 Evaluation and Tracking for LLM Experiments and AI Agents
25 jmpei/nl2sql-agents 0 Python 2026-06-21 NL→SQL multi-agent pipeline (LangGraph + Claude) with deterministic SQL-injection guardrails and golden-set eval.
26 lokesh75-kank/agenteval 0 TypeScript 2026-06-21 Reliability and audit-evidence testing for LLM agents - wrap any agent, assert behavior, measure determinism, check grou
27 TeracAI/svg-arena 0 TypeScript 2026-06-20 A forkable example of the human-in-the-loop model-improvement loop: AI generates, humans judge via the Terac MCP, you im
28 ozlar34/job-match-radar 0 Python 2026-06-20 Self-hosted n8n + Supabase pipeline that scrapes LinkedIn and a watchlist of company ATS endpoints, scores listings agai
29 kilocommits/campaign-eval-harness 0 Python 2026-06-20 An LLM-as-judge harness that scores AI-generated campaign phone scripts against a weighted quality rubric with a real Ha
30 Ayubjon/refusal-radar 0 JavaScript 2026-06-20 Zero-dependency detector and classifier for LLM refusals, deflections, and capability disclaimers — CLI + library with s
31 melody-ling-L/judgebuddy 0 HTML 2026-06-20 Single-file labeling tool for LLM-as-judge calibration. Three-pane comparison + multi-dim scoring. Zero deployment.
32 ramenprotokol/hallucination-hunter 0 Python 2026-06-20 Detect & score LLM hallucinations by groundedness — labeled data, precision/recall/F1, runs offline with no API key. Plu
33 pdxlab/trustmodel-mcp-server 0 TypeScript 2026-06-19 TrustModel MCP Server — trust evaluation, red-team, and governance for AI agents via the Model Context Protocol. npm: @t
34 gititya/Quality-Agency-support 0 Python 2026-06-17 Five local QA judges that review B2B and B2C customer-support replies, catch the risky parts, and explain what to fix.
35 tushariitr-19/assay 2 Go 2026-06-17 Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks.
36 jedobe/skill-evaluator 0 Python 2026-06-17 Score any Claude Code skill against a research-backed rubric derived from the top 9 most-starred skill repos on GitHub
37 ALEX-nlp/OpenSkillEval 12 Python 2026-06-15 OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
38 mpuodziukas-labs/eval-harness-template 0 Python 2026-06-14 Eval harness template for LLM systems: golden regression, LLM-as-judge, invariants
39 mizcausevic-dev/agent-eval-arena 0 TypeScript 2026-06-22 Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions,
40 ejentum/eval 3 Python 2026-06-11 A/B evaluate any LLM task with and without Ejentum cognitive injection. n8n workflow + TypeScript module.
41 akanjilal-work/agent-eval-harness 0 Python 2026-06-10 A lightweight harness to test agent behaviour (tool-call correctness, injection refusal, cost ceilings) before deploymen
42 karlmehta/trustmodel-mcp 0 TypeScript 2026-06-10 TrustModel MCP Server — trust evaluation, red-team & governance for AI agents via the Model Context Protocol. Public can
43 reaatech/agent-eval-harness 0 TypeScript 2026-06-22 End-to-end agent evaluation — trajectory eval, tool-use correctness, cost-per-task, latency budgets, regression suites w
44 alyssadata/continuity-keys 1 2026-06-08 Continuity Keys: tests for “same someone” returns. Behavioral identity consistency under pressure. Origin (Alyssa Solen)
45 reaatech/classifier-evals 0 TypeScript 2026-06-24 Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regressio
46 reaatech/rag-eval-pack 0 TypeScript 2026-06-22 RAG evaluation toolkit — faithfulness, answer relevance, context precision/recall, cost accounting, CI gates. Pairs with
47 Juanllenato/llm-eval-harness 0 Python 2026-06-03 A small, production-minded evaluation and observability harness for LLM/RAG features. Runs offline or live, gates CI on
48 Victor-David-Medina/llm-eval-harness 0 Python 2026-06-03 LLM evaluation harness that gates quality in CI: golden datasets, regression detection, grounding and faithfulness check
49 harnexa/nexa-gauge 38 Python 2026-06-22 A graph-eval framework for LLMs
50 thestio/thest-eval 0 Python 2026-06-02 The CI regression gate and governance-evidence layer for LLM systems — zero-dependency, vendor-neutral, offline.
51 monkeyin92/voice-agent-testops 0 TypeScript 2026-06-01 Regression testing for voice agents: scripted conversations, safety assertions, CI-ready reports.
52 fastxyz/skill-optimizer 66 TypeScript 2026-05-28 Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs
53 ajmeese7/local-llms 1 Python 2026-05-27 Use local Large Language Models for production use cases, and perform benchmarking for task-specific performance evaluat
54 rogue-socket/focusgroup 0 Python 2026-05-27 Persona-driven dynamic testing for conversational AI products. Focus groups for your agents.
55 chquandogong/mission-spec 0 TypeScript 2026-06-22 Mission Spec: a task contract layer for AI agent workflows
56 sanya2025/edututor-eval 0 Python 2026-05-21 A lightweight evaluation framework for AI tutoring responses, built for education-focused LLM systems
57 Alexanderk30/context-override-resistance 0 Python 2026-05-19 RL-style eval measuring intent/action divergence in frontier agents: model acknowledges a correction, then acts on the s
58 GiuseppeSp/n8n-customer-interview-synthesizer 0 2026-05-19 Multi-agent customer-interview synthesis pipeline in n8n with LLM-as-judge eval, Slack human-in-the-loop approval, and d
59 gmitt98/fieldtest 0 Python 2026-05-16 LLM evaluation framework — define what correct, well-formed, and safe means before you measure
60 verifywise-ai/plugin-marketplace 3 TypeScript 2026-05-15 VerifyWise AI Governance Plugin Marketplace
61 AI-QL/tuui 1149 TypeScript 2026-05-14 A desktop MCP client designed as a tool unitary utility integration, accelerating AI adoption through the Model Context
62 prompt-foundry/typescript-sdk 6 TypeScript 2026-05-13 The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
63 prompt-foundry/python-sdk 8 Python 2026-05-13 The prompt engineering, prompt management, and prompt evaluation tool for Python
64 Ruthwik-Data/mechanictrust 0 2026-05-11 AI product case study for trust, pricing transparency, and explainable diagnosis in auto repair.
65 SAY-5/eval-observability 0 Python 2026-05-10 Python LLM eval framework with full OTel tracing, structured logs, and daily Welch's-t-test regression detection persist
66 Ruthwik-Data/finrag-eval 0 Python 2026-05-10 RAG eval pipeline on Apple's FY 2024 10-K — found confident hallucinations, filed a metric-level bug in DeepEval, and bu
67 Ruthwik-Data/self-improving-prompt-agent 0 Python 2026-05-10 Prompt optimization loop that improves prompts through iterative mutation and LLM-as-judge evaluation. Score went 0.10 →
68 SAY-5/genai-eval 0 Python 2026-05-07 Multilingual GenAI evaluation service across 5 task types and 3 languages, with regression-trend dashboard
69 HumphreySun98/repoagentbench 32 Python 2026-04-30 SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: cla
70 YagneshKhamar/phasio 0 TypeScript 2026-04-29 Jest-style testing for LLM prompts. Version prompts, run evals across OpenAI and Anthropic, catch regressions in CI.
71 lehigh-university-libraries/htr 2 Go 2026-06-24 Handwritten Text Recognition llm eval tool
72 JSLEEKR/evaltrack 0 TypeScript 2026-04-24 Local-first regression and trend CLI for promptfoo eval histories — the git log + git diff for LLM eval outputs.
73 izam-mohammed/ragrank 47 Python 2026-04-21 🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it understands context, its tone, an
74 arthursoares/openclaw-llm-bench 2 Python 2026-04-11 A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-j
75 YuanyangLiNEU/mini-claude 0 TypeScript 2026-04-11 A minimal Claude Code built from scratch — agent loop, tool calling, web search, permissions, and a black-box LLM eval h
76 webrenew/models-dilemma 4 TypeScript 2026-04-08 The Prisoner's Dilemma played by LLMs
77 AdirAmsalem/openclaw-eval 0 Python 2026-03-31 Compare OpenClaw setups against the same scenario suite. Run prompts across multiple configurations, capture answers, la
78 Data-ScienceTech/forcefield 1 Python 2026-03-30 ForceField Python SDK -- AI security in 3 lines of code. Prompt injection detection, PII redaction, security evals, tool
79 klausners/prompt-optimizer 0 TypeScript 2026-03-26 Config-driven CLI that runs promptfoo evals, identifies low-scoring prompts, rewrites them via Claude API, and re-evalua
80 Aysnc-Labs/llm-eval 1 PHP 2026-03-20 A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correc
81 asarnaout/veritail 6 Python 2026-03-15 LLM-as-a-Judge evaluation platform for ecommerce search. Scores relevance, computes IR metrics, and flags quality issues
82 vola-trebla/llm-infrastructure 0 2026-03-14 Full-stack AI infrastructure - 5 projects from data ingestion to autonomous agents
83 whitecircle/circle-guard-bench 70 Python 2026-03-07 First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (g
84 tpertner/squeeze 5 Python 2026-03-01 Squeeze your model with pressure prompts to see if its behavior leaks.
85 grigio/llm-eval-simple 70 Python 2026-02-28 llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection
86 QuesmaOrg/BinaryAudit 92 Shell 2026-02-27 An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.
87 paradime-io/dbt-llm-evals 29 Python 2026-02-10 The warehouse-native LLM evaluation package for dbt™ - monitor AI quality without data egress
88 Striveworks/valor 41 Python 2026-02-09 Valor is a lightweight, numpy-based library designed for fast and seamless evaluation of machine learning models.
89 TADSTech/llm-output-grader 0 Python 2026-01-24 systematic llm grading
90 3ahmood/Agentic-Author-CrewAI 1 Jupyter Notebook 2026-01-15 On device autonomous research and content writing using open-sourced LLMs and Crew AI.
91 Supahands/llm-comparison-backend 22 Python 2026-01-13 This is an opensource project allowing you to compare two LLM's head to head with a given prompt, this section will be r
92 thedataquarry/structured-outputs 28 Python 2025-12-23 Structured output benchmarks comparing DSPy and BAML with different LLMs
93 higuseonhye/worldsim-eval 0 2025-12-20 Evaluate AI agents by simulating world-level consequences.
94 yukincom/llm-SugarScape 6 Python 2025-11-28 Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in
95 IAAR-Shanghai/GuessArena 10 Python 2025-11-15 [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Re
96 artefactop/promptdev 2 Python 2025-09-22 A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
97 multinear/multinear 45 Python 2025-09-02 Develop reliable AI apps
98 attogram/ollama-multirun 16 Shell 2025-08-30 Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance stat
99 khoj-ai/llm-coup 14 TypeScript 2025-08-18 Let LLMs play coup with each other and see who's the best at deception & strategy
100 jaaack-wang/multi-problem-eval-llm 3 Jupyter Notebook 2025-08-08 Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities

🔍 How it works

Every 15 minutes, a GitHub Action runs tracker.py. That script:

  1. Fetches the latest state from the GitHub Search API.
  2. Diffs it against data/items.json (the previous snapshot).
  3. Rewrites the table above between the <!-- TRACKER_TABLE_* --> markers.
  4. Commits with the message "feat: +N added, -M removed (timestamp)" if anything changed.
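
Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the actual contents of tracker.py. The marker names TRACKER_TABLE_START and TRACKER_TABLE_END are hypothetical (the README only shows the TRACKER_TABLE_* prefix), as are the function names and the assumption that each snapshot entry is a dict with a "name" key.

```python
import re

# Hypothetical marker names; the README only shows the TRACKER_TABLE_* prefix.
START = "<!-- TRACKER_TABLE_START -->"
END = "<!-- TRACKER_TABLE_END -->"


def diff_items(previous, current):
    """Return (added, removed) repo names between two snapshots.

    Assumes each snapshot is a list of dicts with a "name" key.
    """
    prev_names = {item["name"] for item in previous}
    curr_names = {item["name"] for item in current}
    added = sorted(curr_names - prev_names)
    removed = sorted(prev_names - curr_names)
    return added, removed


def replace_between_markers(readme, table):
    """Swap whatever sits between the two markers for the new table text."""
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    if not pattern.search(readme):
        raise ValueError("table markers not found in README")
    block = f"{START}\n{table}\n{END}"
    # A function replacement keeps backslashes in `table` from being
    # interpreted as regex group references.
    return pattern.sub(lambda _match: block, readme, count=1)


def commit_message(added, removed, timestamp):
    """Build the commit subject, or None when nothing was added or removed."""
    if not added and not removed:
        return None
    return f"feat: +{len(added)} added, -{len(removed)} removed ({timestamp})"
```

Returning None from commit_message lets the caller skip the commit entirely on a quiet run. Whether the real script also commits when only star counts or dates change is not stated here, so this sketch treats "changed" as "items added or removed".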

No external services. No paid APIs. Just a public data source and a free GitHub Action.
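
The fetch step needs nothing beyond the public repository search endpoint. The sketch below shows one way to build the request URL and map a result to a table row. It rests on several assumptions: the search query is a placeholder, the "Updated" column is taken from the `updated_at` field, and the 120-character description limit is inferred from the truncated descriptions in the table. The function names are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"
MAX_DESCRIPTION = 120  # inferred from the truncated descriptions in the table


def build_search_url(query, per_page=100):
    """Build a Search API URL sorted by most recently updated first."""
    params = {"q": query, "sort": "updated", "order": "desc", "per_page": per_page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def to_row(index, repo):
    """Map one entry of the response's `items` array to a table row dict."""
    description = (repo.get("description") or "")[:MAX_DESCRIPTION]
    return {
        "#": index,
        "name": repo["full_name"],
        "stars": repo["stargazers_count"],
        # `language` is null for repos with no detected language.
        "lang": repo.get("language") or "",
        # Assumption: the table shows the date part of `updated_at`.
        "updated": repo["updated_at"][:10],
        "description": description,
    }
```

Keeping URL construction and row mapping free of network calls makes both easy to check against a saved API response before wiring them into a scheduled job.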


🤝 Contributing

See CONTRIBUTING.md — usually you don't need to: the tracker keeps itself current. If you spot a data-source bug or want to suggest a new column for the table, open an issue.


🔗 Related live trackers

If you find this useful, you might also like these other auto-updated trackers from the same maintainer — same mechanism, different upstream:


📜 License

MIT — see LICENSE.

More from linny006

  • Awesome Agent Skills — Curated, auto-updated awesome-list of vetted AI agent skills with quality ratings for Claude, GPT, and open-source agents (⭐ 0)

  • Agent Skills Daily Tracker — Real-time tracking of every new GitHub 'skills' repo to capture the AI agent skill ecosystem trend (⭐ 0)

  • Agent Eval Harness — Live, open-source benchmark for comparing AI coding agents on real GitHub issues (⭐ 0)

  • Prompt Tools Live — Live-updating tracker of prompt engineering tools, libraries, and techniques — refreshed every 15 minutes (⭐ 0)

  • LLMOps Radar — Live index of the newest LLMOps tooling — track what's shipping in LLM observability and deployment (⭐ 0)