feat: add automated skill evaluation workflow and reporting #115
Open
yongsinp wants to merge 133 commits into
Set ANTHROPIC_BASE_URL using LITELLM_PROXY_URL secret.
Switch from ANTHROPIC_API_KEY to LITELLM_API_KEY in the evaluate-skills workflow.
This reverts commit 2caff2a.
Overview
This pull request adds an automated evaluation workflow for skills, consisting of new scripts and a GitHub Actions workflow for continuous skill review and test-case-based evaluation. These changes enable automatic skill discovery, review, evaluation, and summary reporting, with support for test case generation, evaluation using multiple models, and detailed Markdown reports.
Changes
New Evaluation Workflow Automation
Added .github/workflows/evaluate-all-skills.yml, a new GitHub Actions workflow that: … tessl tool and extracts validation and LLM judge results.
Test Case Generation
Added .github/scripts/generate-testcases.py, a script that automatically generates 5 to 6 YAML evaluation test cases when a new skill gets added, including both positive and negative trigger scenarios, based on the contents of each SKILL.md.
Evaluation Summary Rendering
Added .github/scripts/render-eval-summary.py, a script that combines review and evaluation JSON results into a detailed Markdown report with per-skill breakdowns, pass rates, and model-by-model comparisons for GitHub Actions reporting.
Skill Updates
Updated skills based on feedback from the Tessl LLM judge.
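For reviewers who want a feel for the summary rendering step, here is a minimal sketch of how per-skill pass rates and a model-by-model comparison could be turned into a Markdown table. It is not the code in render-eval-summary.py: the record shape (skill, model, passed, total), the function names, and the sample skill and model names are all assumptions made for illustration.

```python
from collections import defaultdict


def pass_rate(passed: int, total: int) -> str:
    """Format a pass rate as a percentage, guarding against empty runs."""
    return "n/a" if total == 0 else f"{100 * passed / total:.0f}%"


def render_summary(results: list[dict]) -> str:
    """Render per-skill, per-model pass rates as a Markdown table.

    Each record is assumed (hypothetically) to look like:
    {"skill": str, "model": str, "passed": int, "total": int}
    """
    # One column per model, in a stable order.
    models = sorted({r["model"] for r in results})

    # skill -> model -> (passed, total)
    by_skill: dict[str, dict[str, tuple[int, int]]] = defaultdict(dict)
    for r in results:
        by_skill[r["skill"]][r["model"]] = (r["passed"], r["total"])

    lines = [
        "| Skill | " + " | ".join(models) + " |",
        "| --- | " + " | ".join("---" for _ in models) + " |",
    ]
    for skill in sorted(by_skill):
        cells = [
            pass_rate(*by_skill[skill][m]) if m in by_skill[skill] else "n/a"
            for m in models
        ]
        lines.append(f"| {skill} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data; real results would come from the JSON files.
    sample = [
        {"skill": "skill-a", "model": "model-x", "passed": 4, "total": 5},
        {"skill": "skill-a", "model": "model-y", "passed": 5, "total": 5},
        {"skill": "skill-b", "model": "model-x", "passed": 0, "total": 0},
    ]
    print(render_summary(sample))
```

A skill that was not run against a given model, or that has no test cases, shows n/a rather than 0%, so a missing run is not mistaken for a failing one.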