🧪 Evaluation framework for testing Claude Code skills at scale. Run regression suites across model versions.
Updated May 22, 2026
Daily puzzle for AI agents.
Open evaluation harness for mental health LLM responses. 5 clinically-grounded rubrics, LLM-as-judge with bias controls, crisis-detection routing to 988 protocols.
AI chat coach MVP: Spring Boot, DeepSeek, structured output, two-stage analysis, and a lightweight evaluation system.
Config-driven CLI that runs promptfoo evals, identifies low-scoring prompts, rewrites them via Claude API, and re-evaluates.
Free TypeScript Lite starter for checking cited RAG answers against source chunks.