feat: add automated skill evaluation workflow and reporting by yongsinp · Pull Request #115 · uw-ssec/rse-plugins

yongsinp · 2026-06-08T22:58:25Z

Overview

This pull request adds an automated evaluation workflow for skills, made up of a GitHub Actions workflow and two supporting scripts. Together they cover skill discovery, review, test case generation, evaluation against multiple models, and a detailed Markdown summary report.

Changes

New Evaluation Workflow Automation

Added .github/workflows/evaluate-all-skills.yml, a new GitHub Actions workflow that:

Discovers all skills and identifies those with evaluation test cases (retrieved from private rse-plugins-testcases).
Reviews each skill using the tessl tool and extracts validation and LLM judge results.
Evaluates skills with test cases using skill-eval-action across all available models.
Generates a Markdown summary report using a new evaluation summary script.
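The discovery step listed above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: it assumes test cases sit in an `evals/` subdirectory next to each `SKILL.md` (suggested by the commit "review all skills, only evaluate skills with evals/"), and the function names `discover_skills` and `split_for_jobs` are hypothetical.

```python
from pathlib import Path


def discover_skills(root: Path) -> list[dict]:
    """Find every directory under root that contains a SKILL.md file."""
    skills = []
    # rglob walks the whole tree, so nested skills are found as well.
    for skill_file in sorted(root.rglob("SKILL.md")):
        skill_dir = skill_file.parent
        skills.append(
            {
                # Basename only, so the name has no slashes in it.
                "name": skill_dir.name,
                "path": skill_dir.relative_to(root).as_posix(),
                # Assumption: test cases live in an evals/ subdirectory.
                "has_evals": (skill_dir / "evals").is_dir(),
            }
        )
    return skills


def split_for_jobs(skills: list[dict]) -> tuple[list[str], list[str]]:
    """Review every skill, but evaluate only those that have test cases."""
    review = [s["path"] for s in skills]
    evaluate = [s["path"] for s in skills if s["has_evals"]]
    return review, evaluate
```

Keeping discovery separate from the split mirrors the later refactor into separate review and evaluate jobs: both jobs can consume the same discovery output.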

Test Case Generation

Added .github/scripts/generate-testcases.py, a script that generates five to six YAML evaluation test cases whenever a new skill is added. The cases are derived from the contents of the skill's SKILL.md and cover both positive and negative trigger scenarios.
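To make the shape of such test cases concrete, here is a small sketch that renders positive and negative trigger cases as YAML. It is not the script from this PR: the field names (`name`, `prompt`, `should_trigger`) are hypothetical, and the prompts are passed in by hand because the description does not say how the real script derives them from SKILL.md. Strings are quoted with `json.dumps`, relying on YAML 1.2 being designed as a superset of JSON, so prompts containing colons stay valid.

```python
import json


def yaml_scalar(text: str) -> str:
    """Quote a string so colons and other YAML syntax cannot break it."""
    # A JSON string literal is also a valid YAML double-quoted scalar.
    return json.dumps(text, ensure_ascii=False)


def render_testcase(name: str, prompt: str, should_trigger: bool) -> str:
    """Render one test case as YAML text (field names are hypothetical)."""
    lines = [
        f"name: {yaml_scalar(name)}",
        f"prompt: {yaml_scalar(prompt)}",
        f"should_trigger: {'true' if should_trigger else 'false'}",
    ]
    return "\n".join(lines) + "\n"


def skeleton_testcases(
    skill: str, positive: list[str], negative: list[str]
) -> list[str]:
    """Build positive cases first, then negative ones, numbered per group."""
    cases = []
    for i, prompt in enumerate(positive, start=1):
        cases.append(render_testcase(f"{skill}-positive-{i}", prompt, True))
    for i, prompt in enumerate(negative, start=1):
        cases.append(render_testcase(f"{skill}-negative-{i}", prompt, False))
    return cases
```

For example, `skeleton_testcases("demo", ["Set up ruff: strict mode"], ["Bake a cake"])` yields two YAML documents, one expected to trigger the skill and one expected not to.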

Evaluation Summary Rendering

Added .github/scripts/render-eval-summary.py, a script that combines review and evaluation JSON results into a detailed Markdown report with per-skill breakdowns, pass rates, and model-by-model comparisons for GitHub Actions reporting.
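A rough sketch of how review and evaluation results might be combined into such a report is shown below. The input shapes are assumptions made for illustration (a review score per skill, and a list of pass/fail booleans per skill and model), not the JSON schema used by render-eval-summary.py. The default threshold of 80 comes from the commit "add threshold (80) for skill review"; treating a score equal to the threshold as a pass is also an assumption.

```python
def pass_rate(results: list[bool]) -> float:
    """Percentage of passing cases, or 0.0 when there are no results."""
    return 100.0 * sum(results) / len(results) if results else 0.0


def render_summary(
    reviews: dict[str, int],
    evals: dict[str, dict[str, list[bool]]],
    threshold: int = 80,
) -> str:
    """Render a per-skill Markdown report with a model-by-model table."""
    lines = ["# Skill evaluation summary", ""]
    for skill in sorted(reviews):
        score = reviews[skill]
        # Assumption: a score equal to the threshold counts as a pass.
        status = "PASS" if score >= threshold else "FAIL"
        lines.append(f"## {skill} ({status})")
        lines.append("")
        lines.append(f"Review score: {score} (threshold {threshold})")
        models = evals.get(skill)
        if models:  # Skills without test cases get no table.
            lines += ["", "| Model | Passed | Pass rate |", "| --- | --- | --- |"]
            for model in sorted(models):
                results = models[model]
                lines.append(
                    f"| {model} | {sum(results)}/{len(results)} "
                    f"| {pass_rate(results):.0f}% |"
                )
        lines.append("")
    return "\n".join(lines)
```

In GitHub Actions, the returned string could be appended to the file named by the `GITHUB_STEP_SUMMARY` environment variable so that it appears on the run's summary page.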

Skill Updates

Updated skills based on feedback from the Tessl LLM judge.


yongsinp and others added 30 commits April 20, 2026 12:55

feat(ci): Add workflow to evaluate skills with discovery step

29e2cb1

fix(ci): Set ANTHROPIC_BASE_URL in evaluate-skills workflow

4014ff8

Set ANTHROPIC_BASE_URL using LITELLM_PROXY_URL secret.

feat(ci): Add evals for the scientific-documentation skill

d1654f5

fix(ci): fix evaluate-skills workflow to discover nested skills

0101b9d

fix(ci): use correct GitHub secret for API key

27c0ab4

Switch from ANTHROPIC_API_KEY to LITELLM_API_KEY in the evaluate-skills workflow.

chore: ignore .idea directory

5c87e82

fix(ci): loosen RTD dependency criterion to accept any valid config syntax

ede3727

fix(ci): use basename for skill name to avoid slashes in viewer filename

8b9006e

feat(ci): Add evals for the code-quality-tools skill

6704305

feat(ci): Add evals for the pixi-package-manager skill

03c8ee8

docs(skills): refine skill description for clarity

72e774a

Merge remote-tracking branch 'upstream/main'

dda4b46

feat(ci): add Tessl skill review

7594145

refactor(ci): rename evaluate-skills.yml to evaluate-all-skills.yml

a37c837

fix(ci): review all skills, only evaluate skills with evals/

7f94e39

fix(evals): rephrase ruff prompt to avoid file creation action

bd3cee1

fix(evals): loosen numpy type hint criterion

da3cc81

fix(evals): loosen pre-commit criterion to accept description or example

2084996

feat(ci): add tessl skill review output to Actions step summary

3bfe742

feat(ci): include collapsible judge evaluation in skill review summary

5b5c105

fix(ci): move status emoji into skill review heading

14975c3

fix(ci): add threshold (80) for skill review

b98dbbe

fix(ci): add separate pass/fail status to validation checks summary header

a7925dd

fix(ci): always write step summary before propagating review exit code

9e6c44f

feat(ci): fetch all available models and run evaluation for each model

facb796

fix(ci): point skill-eval-action to forked repository for testing

feadb97

fix(ci): hardcode test models instead of fetching from LiteLLM proxy

7c87e7c

refactor(ci): move skill and model name to top

0f0f33f

feat(ci): add cross-model summary

3c2eec0

refactor(ci): split review and evaluate into separate jobs and consolidate summary

9c5e95c

yongsinp added 29 commits June 2, 2026 00:26

docs(design-philosophies): update skill based on judge feedback

74896f8

docs(design-system-creation): update skill based on judge feedback

c672555

docs(design-tokens): update skill based on judge feedback

3e2b186

docs(frontend-components): update skill based on judge feedback

34f6a3d

docs(grid-layout-systems): update skill based on judge feedback

33cff51

docs(information-architecture): update skill based on judge feedback

299eb6d

docs(motion-design): update skill based on judge feedback

b083ddd

docs(responsive-design): update skill based on judge feedback

9cefe6a

docs(typography-systems): update skill based on judge feedback

b0b52ea

docs(usability-evaluation): update skill based on judge feedback

756180b

docs(user-journey-mapping): update skill based on judge feedback

f0dd0d1

docs(user-research): update skill based on judge feedback

7ad8330

docs(ux-writing): update skill based on judge feedback

e1f3611

docs(visual-design): update skill based on judge feedback

9302b68

docs(wireframing): update skill based on judge feedback

ed0d3d7

docs(access-pattern-analysis): update skill based on judge feedback

c72ad0f

docs(chunking-strategy): update skill based on judge feedback

e4649fe

docs(performance-reporting): update skill based on judge feedback

4d14dfb

docs(rechunking): update skill based on judge feedback

b3fac5e

docs(synthetic-data): update skill based on judge feedback

f08cbae

Revert "refactor: temporarily remove sleep for testing"

747d21b

This reverts commit 2caff2a.

docs(cloud-storage-backends): update skill based on judge feedback

535f155

docs(compression-codecs): update skill based on judge feedback

c915462

docs(data-migration): update skill based on judge feedback

acd42ac

docs(zarr-fundamentals): update skill based on judge feedback

a896e43

docs(zarr-xarray-integration): update skill based on judge feedback

9f073ef

docs: second pass of updating skills based on judge feedback

d24031e

fix(skills): quote frontmatter descriptions containing colons

b22e3e9

docs(user-persona-discovery): second pass of updating skills based on judge feedback

322459b

yongsinp requested a review from lsetiawan June 8, 2026 22:58

yongsinp wants to merge 133 commits into uw-ssec:main from yongsinp:main
