refactor(ai-research-workflows): skills-first redesign (10 skills, unified research, cross-agent portable) #113

Merged: lsetiawan merged 51 commits into main from refactor/ai-research-workflows-skills on Jun 17, 2026

lsetiawan (Member) commented Jun 3, 2026

⚠️ DO NOT MERGE YET — draft. main is installed live by the VISS 2026 workshop (Block 3 installs ai-research-workflows@rse-plugins from main, unpinned). Merge timing must be coordinated with the workshop schedule. Mark ready / merge only once that's settled (or after the workshop, or after Block 3 is updated to the skills-first model).

Summary

Refactors ai-research-workflows from 7 slash commands + 1 monolithic skill into a skills-first plugin, makes it portable across coding agents, unifies research into a single skill, hardens every skill into a forcing function with verified enforcement, relocates durable artifacts into a committed docs/rse/specs/ tree, and makes validating-implementations persist a durable validation report. Plugin stays at 0.2.0.

This branch carries six layers of work:

1. Skills-first redesign

  • Skills become the unit of work and auto-trigger from natural language; slash commands are thin wrappers that invoke a skill (backward compatible).
  • A shared references/interaction-modes.md defines Collaborative (asks, gates on approval) vs Direct (acts on intent); each skill selects per request. "Interactive when you want collaboration, direct when you don't."
  • Orchestrator agent updated to reference the new skills + modes; monolithic research-workflow-management skill retired (templates moved into the owning skills); root README + marketplace entry updated.

2. Cross-agent portability

Makes the skill/command content work beyond Claude Code (Codex, Gemini, etc.) at the content level:

  • Removed ${CLAUDE_PLUGIN_ROOT} — bundled templates/refs now use skill-relative paths.
  • Cross-skill references are namespaced (ai-research-workflows:<skill>); command invocation is platform-neutral (dropped "via the Skill tool").
  • Genericized Claude-Code tool-name mentions in prose; added a single-agent fallback where a skill assumes parallel sub-agents.

3. Unified researching skill

  • Merged researching-codebases + researching-prior-art into one researching skill that scopes a question and investigates the codebase and/or external prior art in one flow, producing a single combined docs/rse/specs/research-<slug>.md.
  • Process mirrors a scope → plan → investigate → present → write → review → hand-off flow; unified exploratory stance (may surface gaps and light recommendations).
  • /prior-art command retired; /research → researching.

4. Discipline hardening (writing-skills RED→GREEN)

A superpowers:writing-skills audit — fresh-eyes review of all 10 skills plus subagent pressure-tests — found the set was discipline-shaped in topic but advice-shaped in enforcement: it mirrored superpowers' structure but lacked its enforcement spine (Iron Law / Red Flags / rationalization tables). Three failures were witnessed under pressure (e.g. an agent chose to skip clean-room reproduction; planning produced placeholder, tests-last phases), then fixed and re-verified RED→GREEN on both Opus and Haiku:

  • using-research-workflows: passive router → forcing function (<EXTREMELY-IMPORTANT> 1%-rule mandate, instruction/skill priority, research-flavored Red Flags); reconciled the "don't over-run the chain" line so it can no longer be used to skip skills entirely; description now fires on every research-software turn.
  • ensuring-reproducibility: closed the "(or explicitly deferred with reason)" / "where feasible" escape hatches — clean-room reproduction is required, bounded only by technical impossibility, never time; adds a minimal-run floor, code-commit + hardware capture, and Red Flags.
  • hardening-research-code: Iron Law that a regression baseline ≠ a correctness check; must-check-derivable-reference gate before pinning; UNVERIFIED labeling; watch-it-fail-first; tolerance-selection guidance.
  • planning-implementations: restored writing-plans' dropped teeth — a No-Placeholders blocking rule (forbids "add appropriate error handling", "write tests for the above") and bite-sized test-first task granularity; template models test-first phases and adds a Reproducibility & Correctness criteria block.
  • validating-implementations: Iron Law "no verdict without fresh output you produced yourself" — explicitly covers a teammate's green report, not just checkmarks; adds reproducibility/correctness validation + Red Flags.
  • running-experiments: Iron Law "an experiment is real measured code, or it is not an experiment" — forbids estimating one side; adds Red Flags and reproducibility-wired benchmarks (repeated runs, variance, fixed seeds).
  • implementing-plans: review-before-building gate, branch hard-stop, per-phase provenance trigger; de-duped the redundant phase-loop restatement.
  • creating-handoffs: capture research state (seeds, env/lockfile, data versions/checksums, partial results, in-flight jobs) + a "report true state" gate forbidding clean-looking handoffs that hide failing tests, uncommitted work, or unreproduced results.
  • Research-lens success criteria baked into the plan / validate / handoff templates; consistency-scan + research-criteria re-trigger added to iterating-plans; research-doc provenance (commit SHA + date) added to researching.
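
The "reproducibility-wired benchmarks (repeated runs, variance, fixed seeds)" idea from the running-experiments bullet can be sketched roughly as follows. This is a minimal illustration, not the skill's actual harness: the `benchmark` helper, its parameters, and the sorting workload are all hypothetical stand-ins.

```python
import random
import statistics
import time


def benchmark(fn, *, seed=0, repeats=5, size=10_000):
    """Time fn on identically seeded input, repeated to expose variance."""
    if repeats < 2:
        raise ValueError("need at least two runs to report variance")
    timings = []
    for _ in range(repeats):
        rng = random.Random(seed)  # fixed seed: every run sees the same input
        data = [rng.random() for _ in range(size)]
        start = time.perf_counter()
        fn(data)
        timings.append(time.perf_counter() - start)
    return {
        "seed": seed,
        "repeats": repeats,
        "mean_s": statistics.mean(timings),
        "stdev_s": statistics.stdev(timings),  # run-to-run spread
    }


# Both sides are measured with the same harness; neither is estimated.
baseline = benchmark(sorted)
candidate = benchmark(lambda xs: sorted(xs, reverse=True))
```

Recording the seed and repeat count next to the timings keeps the comparison re-runnable later, which is the point of wiring reproducibility into the benchmark itself.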

Commits: 556bab6, e9bb6bf, 25fff68, e55ff96.

5. Artifact relocation to docs/rse/specs/

Moves every workflow-generated document out of the gitignored, ephemeral-looking .agents/ and into a tracked, committed docs/rse/specs/ tree — the artifacts capture decisions (research → plan → experiment → implementation → handoff), so they belong in version control alongside the code, not a scratch directory.

  • Flat swap, read-both / write-new. Skills now write only to docs/rse/specs/<type>-<slug>.md; when reading existing docs they search docs/rse/specs/ first and fall back to the legacy .agents/ location — so in-flight consumer projects (incl. the live VISS install) keep working with no forced migration. Mirrors the existing legacy prior-art-*.md handling.
  • Canonical definition lives in two anchors: the README's renamed ## docs/rse/specs/ — workflow artifacts section (framed as committed, version-controlled artifacts) and the expanded using-research-workflows statement; every other skill stays concrete but consistent.
  • 15 files, 76 references updated (README, orchestrator agent, 8 skills + 4 reference docs); commands/ and assets/ needed no change. .gitignore is left untouched — docs/rse/specs/ is tracked by not being ignored, and the legacy .agents/ ignore stays as a safety net.
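
A rough sketch of the read-both / write-new rule, for readers who want it spelled out. The helper names are hypothetical, and the assumption that legacy files sit directly under .agents/ with the same `<type>-<slug>.md` name is mine, not something this PR states.

```python
from pathlib import Path
from typing import Optional

SPECS_DIR = Path("docs/rse/specs")  # new, tracked location
LEGACY_DIR = Path(".agents")        # legacy location (layout assumed flat)


def write_path(root: Path, doc_type: str, slug: str) -> Path:
    """Writes always target docs/rse/specs/<type>-<slug>.md."""
    return root / SPECS_DIR / f"{doc_type}-{slug}.md"


def find_existing(root: Path, doc_type: str, slug: str) -> Optional[Path]:
    """Reads check the new tree first, then fall back to the legacy directory."""
    name = f"{doc_type}-{slug}.md"
    for base in (SPECS_DIR, LEGACY_DIR):
        candidate = root / base / name
        if candidate.is_file():
            return candidate
    return None
```

Because only the read path knows about .agents/, an existing project keeps resolving its old documents while every new document lands in the tracked tree.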

Commits: 1411868..42add40 (11 commits).

6. Durable validation artifact

validating-implementations was the only workflow skill that produced no durable artifact — it printed its verdict inline and the result vanished with the conversation. It now also writes docs/rse/specs/validation-<slug>.md (slug mirrors the validated plan), joining the research → plan → implement → handoff decision record.

  • Single living doc, written + shown. Keeps the inline report (immediate feedback + the fix/detail/re-run prompt) and additionally persists the same content. Re-validation overwrites validation-<slug>.md; git history preserves prior verdicts.
  • Provenance inside. Each doc records what the current verdict covers — the plan/implement docs validated against, the commit SHA, and the date — plus a ## References back-link. The validation discipline (Iron Law, re-run-everything-yourself) is unchanged.
  • New validation-*.md type wired in across the README (output column + naming table), the using-research-workflows canonical list, the creating-handoffs artifact survey, and the orchestrator's outcomes + checklist.
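
To make the "single living doc with provenance inside" behaviour concrete, here is a small sketch. The document layout, the function names, and the example verdict string are illustrative assumptions; only the file name pattern, the provenance fields (validated docs, commit SHA, date), and the ## References heading come from the description above.

```python
from datetime import date
from pathlib import Path


def render_validation_report(slug, verdict, commit_sha, validated_docs, on=None):
    """Build the text of validation-<slug>.md (hypothetical layout)."""
    on = on or date.today()
    lines = [
        f"# Validation: {slug}",
        "",
        f"Verdict: {verdict}",
        f"Commit: {commit_sha}",
        f"Date: {on.isoformat()}",
        "",
        "## References",
        "",
    ]
    lines += [f"- {doc}" for doc in validated_docs]
    return "\n".join(lines) + "\n"


def write_validation_report(root: Path, slug: str, text: str) -> Path:
    """Overwrite the single living doc; earlier verdicts stay in git history."""
    path = root / "docs" / "rse" / "specs" / f"validation-{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

The same rendered text can be shown inline and passed to `write_validation_report`, which matches the "written + shown" intent without keeping two versions of the report.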

Commits: ffc7789..7d16939 (6 commits).

Skills (10) & commands (9)

  • Workflow (7): researching, planning-implementations, iterating-plans, running-experiments, implementing-plans, validating-implementations, creating-handoffs.
  • Research-software (2): ensuring-reproducibility, hardening-research-code.
  • Meta (1): using-research-workflows — routing + the shared interaction protocol + the forcing rule.
  • Commands (9): /research, /plan, /iterate-plan, /experiment, /implement, /validate, /handoff, /reproduce, /harden.

Key decisions

  • Experiments keep temp-branch / scratch-dir isolation — no git worktrees (deliberate).
  • No hooks/MCP added (a SessionStart hook for the meta-skill is deferred — see layer 4).
  • Discipline-hardening is governed by writing-skills behavioral verification (subagent RED→GREEN), which takes precedence over the tessl rubric where they conflict: the Red Flags / rationalization reinforcement that bulletproofs a discipline skill can lower its tessl score even as it improves agent behavior — a deliberate tradeoff.
  • Workflow artifacts now live in the committed docs/rse/specs/ tree (was the gitignored .agents/); reads fall back to .agents/ so existing projects don't break, and .gitignore is left as-is (see layer 5).
  • validating-implementations writes a single living validation-<slug>.md (overwritten per run; git keeps history) rather than timestamped files — matches the implement-<slug> pattern; the inline report is kept, not replaced (see layer 6).
  • Per-platform packaging manifests (.codex-plugin, gemini-extension.json, AGENTS.md, …) are out of scope here — portability is at the content level only.
  • The earlier command-centric v0.2.0 redesign (issues #87 "feat(ai-research-workflows): core infrastructure — workflow model, profiles, retreat paths" and #92 "feat(ai-research-workflows): optimize phase, plan template, and docs") is superseded by this skills-first direction.

Test plan

  • writing-skills subagent pressure-tests (Opus + Haiku): each discipline fix witnessed RED (rule fails / is rationalized away without it) → GREEN (rule holds, agent cites it) for the meta-skill, reproducibility, hardening, planning, validating, and experiments.
  • Frontmatter intact on all 10 skills; all assets/ and references/ links resolve; no dangling section refs; plugin.json valid; counts accurate (10 skills / 9 commands).
  • No ${CLAUDE_PLUGIN_ROOT} / old-skill-name references remain; relative paths resolve.
  • .agents/ → docs/rse/specs/ relocation (layer 5): grep sweep confirms every remaining .agents/ mention is an explicit legacy fallback (zero write-target leaks), all 15 files carry the new path, full-diff review shows no unintended changes, and tessl structural lint passes (0 errors/0 warnings) on all touched skills.
  • Durable validation artifact (layer 6): grep sweep confirms all six touchpoints reference validation-<slug>.md / validation-*.md, the old "Inline validation report" / "Output the report inline" wording is gone, and tessl structural lint passes (0 errors/0 warnings) on the three touched skills (judge scores unchanged from pre-change — score-neutral).
  • [~] tessl skill review ≥ 80 held for the skills-first redesign; the layer-4 hardening prioritizes writing-skills verification, so some skills may now score below 80 on tessl's rubric due to intentional enforcement reinforcement. The layer-5 path swap is score-neutral (verified: identical frontmatter → identical Description sub-score; any Content movement is judge input-sensitivity, not a quality change). Re-run tessl before merge if a numeric gate is required.
  • Manual (needs interactive session): /research triggers researching; /help no longer lists /prior-art; a vague request enters Collaborative scoping and a specific one runs Direct; codebase-only / external-only / both questions each produce the right doc sections; end-to-end research → plan → implement → validate produces the expected docs/rse/specs/ docs.
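
The grep sweeps in the list above could be reproduced with something like the sketch below. It is an assumed equivalent rather than the command that was actually run; the `sweep` helper and the choice to scan only `*.md` files are illustrative.

```python
import re
from pathlib import Path

# Literal ${CLAUDE_PLUGIN_ROOT}; swap in another pattern for other sweeps.
FORBIDDEN = re.compile(r"\$\{CLAUDE_PLUGIN_ROOT\}")


def sweep(root: Path, pattern=FORBIDDEN):
    """Return (path, line_number, line) for every Markdown line matching pattern."""
    hits = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path, number, line.strip()))
    return hits
```

An empty result corresponds to the "no references remain" claim. For the .agents/ sweep, each hit would still need a manual read to confirm it is a legacy fallback and not a write target.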

Design specs + implementation plans live in docs/specs/2026-06-02-ai-research-workflows-skills-refactor-*.md, docs/specs/2026-06-03-researching-skill-unification-*.md, docs/specs/2026-06-04-rse-specs-artifact-location-*.md, and docs/specs/2026-06-04-validation-durable-artifact-*.md (kept uncommitted per project convention).

🤖 Generated with Claude Code

lsetiawan added 19 commits June 2, 2026 15:59
github-actions (Bot) commented Jun 3, 2026

🔍 Tessl Skill Review

plugins/ai-research-workflows/skills/creating-handoffs/SKILL.md

Review Details

Dimension Score Detail
conciseness ██░ 2/3 The skill is mostly efficient and well-structured, but includes some redundancy — the 'Common Mistakes' section largely restates guidance already given in the process steps and writing guidelines (e.g., broken state, research state, critical files). The 'Report true state' subsection could be folded into step 2 more tightly. However, it avoids explaining concepts Claude already knows.
actionability ███ 3/3 The skill provides highly concrete, actionable guidance: specific git commands to run, exact directory paths to search, a precise filename format with example, a template file to read, and a specific output format to present. The instructions are copy-paste ready and leave little ambiguity about what to do.
workflow clarity ███ 3/3 The four-step process is clearly sequenced (Gather → Determine → Generate → Present), with explicit sub-steps within each phase. The 'Report true state' section serves as a validation checkpoint before document generation, ensuring broken/unverified state is surfaced. The workflow includes a clear feedback mechanism via the 'Known-broken / unverified' callout requirement.
progressive disclosure ██░ 2/3 The skill references external files (handoff template at assets/handoff-template.md, other skills via cross-references) which is good progressive disclosure, but no bundle files were provided to verify these exist. The content itself is somewhat long — the 'Common Mistakes' and 'Writing guidelines' sections could potentially be in a separate reference file. The cross-references section is well-organized with clear one-level-deep pointers.

Overall: This is a well-crafted handoff skill with strong actionability and clear workflow sequencing. Its main weakness is moderate redundancy between the process steps, writing guidelines, and common mistakes sections, which repeat similar points (broken state, research state, critical files). The progressive disclosure is reasonable but hard to fully evaluate without bundle files, and some content consolidation could improve token efficiency.

Suggestions:

  • Consolidate the 'Common Mistakes' section by removing items already covered in the process steps and writing guidelines — keep only truly non-obvious pitfalls to reduce redundancy and improve conciseness.
  • Merge the 'Report true state' subsection into Step 2 ('Determine what's relevant') since verification status is already listed there, to avoid restating the same guidance.

plugins/ai-research-workflows/skills/ensuring-reproducibility/SKILL.md

Review Details

Dimension Score Detail
conciseness ██░ 2/3 The skill is mostly efficient and covers genuinely non-obvious provenance practices, but it's somewhat verbose — the 'Red flags' table, the 'Iron Law' repetition across multiple sections, and the 'Common Mistakes' section partially restate what was already said in the workflow and verification sections. The interaction mode preamble and 'Purpose' section add little value.
actionability ██░ 2/3 The skill provides concrete guidance on what to capture (commit hashes, seeds, lockfiles, commands) and includes example bash commands and a tolerance format. However, it lacks a complete, end-to-end worked example showing a full provenance record — the guidance is specific but scattered across sections rather than demonstrating a complete artifact. The commands shown are illustrative fragments rather than a full reproducible workflow.
workflow clarity ███ 3/3 The workflow is clearly sequenced: capture provenance → write the record to a specific location → reproduce in a clean environment → compare outputs → document the result. The verification steps are explicit (fresh env, pinned lockfile only, exact commands, compare outputs), with a clear feedback loop for nondeterministic cases and explicit tolerance documentation. The 'Red flags' table and Iron Law reinforce that verification cannot be skipped.
progressive disclosure ███ 3/3 The skill appropriately defers environment pinning mechanics to specific package manager skills (pixi, uv), keeps the main content focused on the reproducibility strategy, and clearly signals cross-references to related experiment and implementation skills. File placement conventions are specified. No bundle files are needed for this instruction-oriented skill.

Overall: This is a well-structured reproducibility skill with strong workflow clarity and good progressive disclosure through cross-references to related skills. Its main weaknesses are moderate verbosity (the Iron Law is restated in multiple forms across sections, and the Red Flags table largely repeats earlier content) and the lack of a complete worked example showing a finished provenance record. The actionability would benefit from a concrete, end-to-end example artifact rather than scattered fragments.

Suggestions:

  • Add a complete example of a finished ## Reproducibility section as it would appear in a spec file, showing all provenance fields filled in with realistic values — this would significantly boost actionability.
  • Consolidate the repeated emphasis on clean-room reproduction: the Iron Law section, the Verify section, the Red Flags table, and the Common Mistakes bullet all say the same thing — pick one authoritative location and trim the rest to brief back-references.
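
One possible shape for the provenance fields this review mentions (commit hashes, seeds, lockfiles, commands, hardware) is sketched here. The field names and helpers are hypothetical and are not taken from the skill; the example values in any call are placeholders.

```python
import platform
import subprocess
import sys


def current_commit():
    """Return the HEAD commit hash, or None when not inside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def capture_provenance(command, seed, lockfile, commit=None):
    """Collect what is needed to re-run a result from a clean environment."""
    return {
        "commit": commit or current_commit(),
        "command": command,    # exact command line that produced the result
        "seed": seed,
        "lockfile": lockfile,  # pinned environment file name
        "python": sys.version.split()[0],
        "system": platform.system(),
        "machine": platform.machine(),
    }
```

A record like this only documents the run. The clean-room reproduction the skill requires still has to be executed separately against it.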

plugins/ai-research-workflows/skills/hardening-research-code/SKILL.md

Review Details

Dimension Score Detail
conciseness ██░ 2/3 The content is generally well-written but includes some redundancy — the Iron Law is restated almost verbatim in the regression tests section, and the 'Purpose' section explains what hardening is, which Claude already knows. The red flags table and common mistakes section overlap significantly. Could be tightened by ~20-30%.
actionability ██░ 2/3 The skill provides strong strategic guidance (what to validate and why) and explicitly defers implementation mechanics to another skill. However, the only code shown is strategy-level pseudocode for tolerance comparison. While the deferral is intentional and justified, the skill itself lacks executable examples — e.g., a concrete golden test skeleton or a real invariant check. The guidance is specific but not copy-paste ready.
workflow clarity ███ 3/3 The 4-step workflow is clearly sequenced with an explicit validation checkpoint (step 3: confirm each test can fail by making it go red then green). The quality checklist provides a comprehensive verification gate before marking complete. The red flags table adds error-recovery guidance. The feedback loop of 'run, confirm failure, restore' is well-articulated.
progressive disclosure ███ 3/3 The skill clearly positions itself as the strategy layer and explicitly defers implementation details (pytest mechanics, CI config, numpy/torch assertions) to the python-testing skill. Cross-references to validating-implementations and ensuring-reproducibility skills are well-signaled and one level deep. Content is well-organized with clear section headers. No bundle files are needed for this strategy-focused skill.

Overall: This is a well-structured strategy-level skill that clearly defines what to validate and why when hardening research code. Its strongest aspects are the workflow clarity (with explicit fail-then-pass validation) and progressive disclosure (clean deferral to implementation skills). Its main weaknesses are moderate redundancy between sections (Iron Law restated in regression tests; red flags overlapping with common mistakes) and the lack of concrete executable examples, relying instead on pseudocode and deferral.

Suggestions:

  • Consolidate the Iron Law restatement in the regression tests section into a brief back-reference rather than repeating the full policy, and merge the overlapping content between 'Red flags' and 'Common Mistakes' to reduce redundancy.
  • Add at least one concrete, executable example — e.g., a real invariant check or golden test skeleton in Python — even if full pytest mechanics are deferred, to give Claude a copy-paste starting point for the most common case.
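
As one candidate for the executable example suggested above, the sketch below checks a numerical routine against an analytically derivable reference, as opposed to a pinned baseline, and then confirms the check can fail. The `trapezoid` function is a hypothetical stand-in for research code, and the tolerance is an arbitrary illustrative choice.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule for the integral of f over [a, b], n intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def test_matches_analytic_reference():
    # Correctness check: the integral of sin over [0, pi] is exactly 2,
    # so the expected value is derived, not copied from a previous run.
    result = trapezoid(math.sin, 0.0, math.pi, 1000)
    assert math.isclose(result, 2.0, rel_tol=1e-5)


def test_check_can_fail():
    # Watch-it-fail-first: a deliberately coarse grid must miss the tolerance,
    # otherwise the check above is too loose to catch anything.
    coarse = trapezoid(math.sin, 0.0, math.pi, 4)
    assert not math.isclose(coarse, 2.0, rel_tol=1e-5)
```

A pinned baseline would only show that the output has not changed. The analytic reference is what says the output is right, and that is the distinction the Iron Law draws.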

plugins/ai-research-workflows/skills/implementing-plans/SKILL.md

Review Details

Dimension Score Detail
conciseness ██░ 2/3 The skill is generally well-structured but includes some redundancy — the 'Common Mistakes' section largely restates rules already covered in the main workflow, and the quality checklists repeat verification steps described earlier. Some sections like 'Interaction mode' and cross-references add modest overhead. However, it avoids explaining concepts Claude already knows and stays focused on the task.
actionability ███ 3/3 The skill provides highly concrete, actionable guidance: specific shell commands (e.g., ls -lt docs/rse/specs/plan-*.md), exact file naming conventions (plan-jwt-auth.md → implement-jwt-auth.md), explicit checkpoint formatting with the verification output template, and clear rules for checkbox management. The instructions are specific enough to be directly followed without ambiguity.
workflow clarity ███ 3/3 The multi-step workflow is exceptionally clear with explicit sequencing (numbered checklist), mandatory review gates at steps 2, 5, and 7, validation checkpoints after each phase (automated then manual), a feedback loop for failures, mismatch handling with stop-and-report protocol, and a clear exception for consecutive phases. Destructive operations are guarded by the branch confirmation hard stop.
progressive disclosure ██░ 2/3 The skill references external files like references/templates.md, assets/implement-template.md, and several cross-referenced skills, which is good progressive disclosure design. However, no bundle files are provided, so we cannot verify these references resolve correctly. The main content is somewhat long (~180 lines) with the Common Mistakes and Quality Checklist sections that could potentially be split out, but the inline content is reasonably organized with clear headers.

Overall: This is a strong, well-structured implementation workflow skill with excellent workflow clarity and actionability. The step-by-step process with explicit review gates, mismatch handling, and verification checkpoints is thorough and well-designed. Minor weaknesses include some redundancy between the main workflow, common mistakes, and quality checklists, and the inability to verify referenced bundle files that the skill depends on.

Suggestions:

  • Consider consolidating the 'Common Mistakes' section into the relevant workflow steps as inline warnings/notes to reduce redundancy and overall length.
  • The quality checklists at the end largely restate the workflow — consider whether they add enough value to justify the token cost, or if they could be moved to a referenced file.
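
For reference, the plan lookup and naming convention mentioned in the actionability row (ls -lt on plan-*.md, and plan-jwt-auth.md mapping to implement-jwt-auth.md) might be expressed as below. This is a sketch under the assumption that "most recent" means newest modification time; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_plan(specs_dir: Path) -> Optional[Path]:
    """Return the most recently modified plan-*.md, like `ls -lt` would list first."""
    plans = list(specs_dir.glob("plan-*.md"))
    if not plans:
        return None
    return max(plans, key=lambda p: p.stat().st_mtime)


def implement_doc_for(plan: Path) -> Path:
    """Map plan-<slug>.md to implement-<slug>.md in the same directory."""
    prefix = "plan-"
    if not plan.stem.startswith(prefix):
        raise ValueError(f"not a plan document: {plan.name}")
    slug = plan.stem[len(prefix):]
    return plan.with_name(f"implement-{slug}.md")
```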

plugins/ai-research-workflows/skills/iterating-plans/SKILL.md

Review Details

Dimension Score Detail
conciseness ██░ 2/3 The skill is mostly efficient and well-structured, but includes some content Claude already knows (e.g., explaining what 'good edits' vs 'bad edits' are, the 'Common Mistakes' section largely restates guidance already given in the process steps). The quality checklist also partially duplicates the workflow. However, most content earns its place given the complexity of the task.
actionability ███ 3/3 The skill provides concrete, executable guidance throughout: specific shell commands (ls -lt for finding plans), exact file path patterns, structured confirmation templates, specific edit patterns with markdown examples, and a detailed consistency scan checklist. The guidance is specific enough to be directly followed.
workflow clarity ███ 3/3 The 5-step process is clearly sequenced with explicit validation checkpoints: research must complete before proceeding (Step 2), user confirmation is required before editing (Step 3), a consistency scan must pass after editing (Step 4), and changes are presented for review (Step 5). The feedback loop for further changes is also explicit ('re-read the plan first and apply the same process').
progressive disclosure ███ 3/3 The skill appropriately keeps the main workflow inline while deferring detailed iteration patterns to 'references/iteration-patterns.md' and cross-referencing related skills (implementing-plans, ensuring-reproducibility, hardening-research-code, using-research-workflows). References are one level deep and clearly signaled. The content is well-organized with clear section headers.

Overall: This is a well-crafted skill with a clear, validated multi-step workflow for iterating on implementation plans. Its strengths are strong actionability with concrete templates and commands, excellent workflow clarity with explicit checkpoints and confirmation gates, and good progressive disclosure to related skills and reference files. The main weakness is moderate redundancy between the process steps, common mistakes section, and quality checklist, which could be tightened to save tokens.


plugins/ai-research-workflows/skills/planning-implementations/SKILL.md

Review Details

Dimension Score Detail
conciseness ██░ 2/3 The skill is fairly well-structured but includes some redundancy — the 'Common Mistakes' section largely restates the blocking rules already defined in the process steps. The 'Quality checklist' also overlaps significantly with inline requirements. Some tightening is possible, though the content is mostly purposeful.
actionability ███ 3/3 The skill provides highly specific, concrete guidance: exact file paths and naming conventions (docs/rse/specs/plan-<slug>.md), explicit blocking rules with examples of what NOT to write, a required template reference (assets/plan-template.md), and detailed task sequencing (write failing test → run → implement → run → commit). The actionability is excellent for an instruction-only planning skill.
workflow clarity ███ 3/3 The 5-step process is clearly sequenced with explicit checkpoints: research must complete before synthesis (Step 2), approach approval before detailed writing (Step 3), blocking rules that halt progress until resolved (no placeholders, no open questions), and a final review/iterate step. The quality checklist serves as a validation checkpoint before completion.
progressive disclosure ██░ 2/3 The skill references external files (assets/plan-template.md, other skills like ai-research-workflows:researching, ai-research-workflows:iterating-plans, ai-research-workflows:implementing-plans) which is good progressive disclosure. However, no bundle files are provided, so we can't verify these references resolve. The skill itself is somewhat long (~150 lines) and the Common Mistakes section could potentially be a separate reference, but the inline content is reasonably organized with clear headers.

Overall: This is a strong planning skill with excellent actionability and workflow clarity. The 5-step process is well-sequenced with explicit blocking rules that prevent common failure modes (placeholders, open questions, tests-last). The main weakness is moderate redundancy between the blocking rules, common mistakes, and quality checklist sections, which inflates token usage without adding proportional value.

Suggestions:

  • Consolidate the 'Common Mistakes' section into the blocking rules and process steps where they already appear, or reduce it to a brief list of anti-pattern names without re-explaining each one.
  • Consider moving the quality checklist to a separate referenced file (e.g., assets/plan-checklist.md) to reduce the main skill's length while keeping it accessible.
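
The No-Placeholders blocking rule is enforced by the agent reading the plan, but a mechanical pre-check can illustrate what it forbids. The two phrases below are the ones quoted in the PR description; the `find_placeholders` helper and the idea of scanning line by line are assumptions made for this sketch.

```python
# Phrases quoted in the PR description as forbidden placeholder wording.
PLACEHOLDER_PHRASES = (
    "add appropriate error handling",
    "write tests for the above",
)


def find_placeholders(plan_text, phrases=PLACEHOLDER_PHRASES):
    """Return (line_number, phrase) pairs for placeholder wording in a plan."""
    hits = []
    for number, line in enumerate(plan_text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in phrases:
            if phrase in lowered:
                hits.append((number, phrase))
    return hits
```

A phrase list like this catches only the literal wording. Spotting a task that is vague in a new way still depends on the skill's review step.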

plugins/ai-research-workflows/skills/researching/SKILL.md

Review Details

Dimension Score Detail
conciseness ██░ 2/3 The skill is reasonably efficient but includes some redundancy — the checklist and the process section largely duplicate each other. Some phrasing could be tightened (e.g., the interaction mode line adds little without the referenced protocol). However, it avoids explaining concepts Claude already knows and stays focused on the workflow.
actionability ██░ 2/3 The skill provides a clear step-by-step process with specific file paths, naming conventions, and concrete review gates. However, it lacks executable code/commands (e.g., no actual search commands, no example template content) and delegates key details to referenced files (references/codebase-research.md, references/prior-art-research.md, assets/research-template.md) that are not provided in the bundle.
workflow clarity ███ 3/3 The workflow is clearly sequenced with explicit validation checkpoints at steps 5, 7, and 8. There are feedback loops (step 8 mentions revising and re-running self-review), a quality checklist, and clear gates that must be passed before proceeding. The common mistakes section adds guardrails against known failure modes.
progressive disclosure ██░ 2/3 The skill references several external files (references/codebase-research.md, references/prior-art-research.md, assets/research-template.md, and other skills) which suggests good structural intent. However, none of these bundle files are provided, making it impossible to verify they exist or contain useful content. The checklist-then-process duplication also suggests content that could be better organized.

Overall: This is a well-structured workflow skill with strong sequencing, explicit review gates, and good error-prevention guidance via the common mistakes section. Its main weaknesses are the duplication between the checklist and process sections, and heavy reliance on referenced files that aren't provided in the bundle, which undermines actionability since the core investigation steps (codebase pass, prior-art pass) are entirely delegated. The quality checklist and self-review steps are a notable strength.

Suggestions:

  • Eliminate the duplication between the Checklist and Process sections — either merge them or make the checklist a pure tracking artifact that doesn't repeat process details.
  • Include at least a summary of what the codebase-research.md and prior-art-research.md references contain, or inline the key techniques/commands, so the skill is actionable without the bundle files.
  • Provide a minimal example of a completed research document or at least the template structure inline, so Claude knows the expected output format without needing to read assets/research-template.md.

plugins/ai-research-workflows/skills/running-experiments/SKILL.md

Review Details

| Dimension | Score | Detail |
| --- | --- | --- |
| conciseness | ██░ 2/3 | The skill is reasonably well-written but includes some redundancy: the 'Iron Law' section, 'Red flags' table, and 'Common Mistakes' section all reinforce the same core message (don't fabricate comparisons) three times in slightly different forms. The 'When NOT to experiment' section and interaction mode preamble add moderate overhead. However, most content earns its place given the complexity of the workflow. |
| actionability | ███ 3/3 | The skill provides concrete, executable guidance throughout: specific bash commands for benchmarking, git commands for branch isolation, exact file paths for templates and output documents, a markdown template for the recommendation format, and a detailed quality checklist. The process steps are specific enough to follow without ambiguity. |
| workflow clarity | ███ 3/3 | The 6-step process is clearly sequenced with explicit validation checkpoints: the quality checklist at the end serves as a comprehensive verification step, the 'Red flags' table provides error-detection guidance mid-workflow, and Step 3 includes explicit instructions to run benchmarks multiple times and report variance. The feedback loop of 'measure → record → compare → recommend' is well-defined. |
| progressive disclosure | ██░ 2/3 | The skill references external files (experiment-template.md, research/plan docs in docs/rse/specs/, other skills like ensuring-reproducibility), which is good progressive disclosure design. However, no bundle files are provided to verify these references exist, and the skill itself is quite long (~150+ lines) with content like the 'Red flags' table and 'Common Mistakes' that could potentially be in a supplementary file. The cross-references are clearly signaled but the main document carries substantial inline content. |

Overall: This is a well-structured, highly actionable skill for running technical experiments. Its greatest strength is the clear 6-step workflow with concrete commands, output templates, and a thorough quality checklist. Its main weakness is redundancy — the anti-pattern of fabricating comparisons is stated in the Iron Law, the Red Flags table, and the Common Mistakes section, consuming tokens on the same message three times. The skill would benefit from consolidating these overlapping sections.

Suggestions:

  • Consolidate the 'Iron Law', 'Red flags' table, and 'Common Mistakes' section into a single concise anti-patterns section to eliminate redundancy and save ~30-40 lines.
  • Consider moving the 'Red flags' table and 'Common Mistakes' into a supplementary reference file to keep the main SKILL.md focused on the workflow steps.

plugins/ai-research-workflows/skills/using-research-workflows/SKILL.md

score

Review Details

| Dimension | Score | Detail |
| --- | --- | --- |
| conciseness | ██░ 2/3 | The skill is reasonably efficient but includes some redundancy: the 'Red flags' table, 'Common Mistakes', and the opening EXTREMELY-IMPORTANT block all reinforce the same 'don't skip skills' message three times. The interaction modes section also explains concepts Claude could infer from a shorter directive. However, the decision tree and workflow patterns are lean and useful. |
| actionability | ███ 3/3 | The decision tree provides concrete, unambiguous routing for every scenario. Fully-qualified skill invocation IDs are given, workflow chains are spelled out with exact sequences, document naming conventions include specific patterns and paths, and the cross-plugin deferral table maps concerns to exact plugin IDs. This is copy-paste actionable for a meta-routing skill. |
| workflow clarity | ███ 3/3 | The skill excels at workflow sequencing: the decision tree provides clear entry points, the 'Common workflow patterns' section shows five right-sized chains with explicit ordering, and the instruction priority / sequencing section establishes a clear understand→build→verify progression. The interaction modes section adds validation checkpoints (hard stops for irreversible actions). For a meta-routing skill, this is exemplary. |
| progressive disclosure | ██░ 2/3 | The skill references references/interaction-modes.md and defers to nine specialist skills/plugins, which is good structure. However, no bundle files were provided, so we cannot verify these references resolve. The skill itself is fairly long (~150 lines of substantive content) and some sections like the full interaction modes explanation could be offloaded to the referenced file rather than partially duplicated inline. |

Overall: This is a well-structured meta-routing skill with excellent actionability and workflow clarity — the decision tree, workflow patterns, and cross-plugin deferral table are all highly concrete and useful. Its main weakness is moderate redundancy: the 'never skip a skill' message is hammered home in at least three separate sections (the EXTREMELY-IMPORTANT block, Red flags table, and Common Mistakes), and the interaction modes section partially duplicates content that it also references externally. Overall it's a strong skill that could benefit from tightening.

Suggestions:

  • Consolidate the 'don't skip skills' messaging: the EXTREMELY-IMPORTANT block, Red flags table, and first Common Mistake all say the same thing — pick one authoritative location and reference it from the others.
  • Trim the interaction modes section to just the routing rule (e.g., 'Select Collaborative or Direct per references/interaction-modes.md; explicit user phrasing wins; hard stops always require confirmation') since the full protocol is already referenced externally.
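
The routing rule in the last suggestion could be sketched roughly as follows. This is a hypothetical illustration only: `select_mode`, `needs_confirmation`, the `Mode` enum, and the trigger phrases are assumptions and are not taken from the skill or from references/interaction-modes.md. The sketch shows just the precedence described above (explicit user phrasing wins, hard stops always require confirmation).

```python
from enum import Enum


class Mode(Enum):
    COLLABORATIVE = "collaborative"
    DIRECT = "direct"


# Hypothetical trigger phrases; the real protocol lives in
# references/interaction-modes.md and may differ.
DIRECT_PHRASES = ("just do it", "don't ask", "go ahead")
COLLABORATIVE_PHRASES = ("walk me through", "let's discuss", "ask me")


def select_mode(request: str, default: Mode = Mode.COLLABORATIVE) -> Mode:
    """Pick a mode; explicit user phrasing overrides the default."""
    text = request.lower()
    # Assumption: if both kinds of phrase appear, Direct is checked first.
    if any(phrase in text for phrase in DIRECT_PHRASES):
        return Mode.DIRECT
    if any(phrase in text for phrase in COLLABORATIVE_PHRASES):
        return Mode.COLLABORATIVE
    return default


def needs_confirmation(mode: Mode, is_hard_stop: bool) -> bool:
    """Hard stops require confirmation in every mode."""
    return is_hard_stop or mode is Mode.COLLABORATIVE
```

A request such as "Just do it, refactor the loader" would select `Mode.DIRECT`, yet `needs_confirmation(Mode.DIRECT, is_hard_stop=True)` still returns `True`.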

plugins/ai-research-workflows/skills/validating-implementations/SKILL.md

score

Review Details

| Dimension | Score | Detail |
| --- | --- | --- |
| conciseness | ██░ 2/3 | The skill is moderately verbose. The 'Iron Law' section, 'Red flags' table, and 'Common Mistakes' section all hammer the same point (don't trust claims, re-run everything yourself) at least 5-6 times across different sections. The quality checklist also repeats much of what was already stated. However, the core workflow steps and report structure are reasonably efficient. |
| actionability | ███ 3/3 | The skill provides concrete, executable commands (git log, git diff, make test, pytest, mypy, npm run lint), specific file paths and naming conventions (docs/rse/specs/validation-.md), exact markdown formatting for results documentation, and clear examples of pass/fail output formatting. The slug derivation logic and provenance header are copy-paste ready. |
| workflow clarity | ███ 3/3 | The multi-step process is clearly sequenced (gather evidence → read plan → investigate in parallel → per-phase validation → report). Validation checkpoints are explicit throughout: run all automated verification commands, document pass/fail per check, investigate root causes before synthesizing, and a comprehensive quality checklist before delivering. The feedback loop for failures (read error → identify failing code → determine root cause → document finding) is well-defined. |
| progressive disclosure | ██░ 2/3 | The skill references external files (references/report-templates.md, other skills like ai-research-workflows:planning-implementations) but no bundle files are provided to verify these exist. The content is somewhat monolithic at ~200 lines with significant repetition that could be extracted. The report template details are referenced externally, which is good, but the repeated 'don't rubber-stamp' messaging across Iron Law, Red flags, and Common Mistakes could be consolidated. |

Overall: This is a well-structured validation workflow skill with strong actionability and clear step-by-step sequencing including explicit verification checkpoints. Its main weakness is redundancy: the core message of 'don't trust claims, verify everything yourself' is repeated across at least four separate sections (Iron Law, Red flags table, Common Mistakes, Quality checklist), consuming significant token budget without adding new information. The referenced bundle files (report-templates.md) are not provided, making it hard to fully assess progressive disclosure.

Suggestions:

  • Consolidate the 'Iron Law', 'Red flags', and 'Common Mistakes' sections into a single concise 'Critical Rules' section — the same principle (never trust unverified claims) is restated 5+ times across these sections.
  • Remove the 'Common Mistakes' bullet points that duplicate content already covered in the Iron Law and Red flags sections, keeping only genuinely distinct mistakes (e.g., 'stopping at the first failure', 'not separating automated vs manual results').

@lsetiawan lsetiawan force-pushed the refactor/ai-research-workflows-skills branch from 0320be7 to b62d415 June 3, 2026 16:41
lsetiawan added 9 commits June 3, 2026 10:00
…ing/iterating/validating (descriptions, references/, Common Mistakes)
…experiments/reproducibility/hardening (descriptions, mode pointer, Common Mistakes)
…handoff/meta skills (descriptions, H1, mode pointer, Common Mistakes)
… coding agents

Align skill and command content with the obra/superpowers cross-agent
interoperability pattern so the skills work beyond Claude Code:

- Replace ${CLAUDE_PLUGIN_ROOT} template/reference paths with skill-relative
  paths; point the shared interaction-modes protocol at the
  using-research-workflows skill by name instead of a plugin-root path.
- Drop "(via the Skill tool)" from the 10 thin command wrappers so skill
  invocation is platform-neutral.
- Genericize Claude Code tool names in prose (WebSearch/WebFetch, Edit,
  TodoWrite status tokens).
- Namespace internal cross-skill references as ai-research-workflows:<skill>
  (matching the plugin:skill form already used for cross-plugin refs); keep
  the meta-skill's decision tree and workflow-chain diagrams as short labels
  with a bridging note.
- Add a single-agent fallback to researching-codebases for platforms without
  parallel sub-agent support.

Agent file and README intentionally unchanged; no per-platform manifests added.
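
For illustration, the `ai-research-workflows:<skill>` namespacing described in this commit could be parsed along these lines. This is a sketch under stated assumptions: `parse_skill_ref` and `SkillRef` are hypothetical names, and treating a bare label as belonging to the current plugin is an assumption made here to mirror the "short labels with a bridging note" wording, not something the commit specifies.

```python
from typing import NamedTuple


class SkillRef(NamedTuple):
    plugin: str
    skill: str


def parse_skill_ref(ref: str, default_plugin: str = "ai-research-workflows") -> SkillRef:
    """Split a 'plugin:skill' reference; a bare label uses default_plugin."""
    plugin, sep, skill = ref.partition(":")
    if not sep:
        # Short label (as in the decision-tree diagrams): assume the current plugin.
        plugin, skill = default_plugin, ref
    if not plugin or not skill:
        raise ValueError(f"malformed skill reference: {ref!r}")
    return SkillRef(plugin, skill)
```

Under these assumptions, `parse_skill_ref("ensuring-reproducibility")` and `parse_skill_ref("ai-research-workflows:ensuring-reproducibility")` resolve to the same `SkillRef`.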
…/research to unified researching skill; bump to 0.3.0
@lsetiawan lsetiawan changed the title refactor(ai-research-workflows): skills-first redesign (11 skills, thin command wrappers, adaptive interaction) refactor(ai-research-workflows): skills-first redesign (10 skills, unified research, cross-agent portable) Jun 3, 2026
lsetiawan added 23 commits June 3, 2026 15:29
…cipline to implementing-plans (mirror executing-plans)

Mirror the discipline structure of superpowers executing-plans while keeping
this plugin's deliberate divergences (no worktrees, no subagent-driven
execution, no per-skill announce):

- Add a Checklist spine with flagged review gates (steps 2, 5, 7)
- Add 'Review the plan before building': critical pre-flight review, a branch
  hard stop (no implementing on main/master without consent), and an
  RSE assumption check (data/deps/seeds/env)
- Add a per-phase provenance trigger routing to ensuring-reproducibility
- Add a completion gate plus matching Common Mistakes and Quality-checklist items

Validated with the writing-skills subagent pressure test (RED->GREEN, Opus +
Haiku): agents cite the new branch hard-stop and provenance rules as decisive,
where the old skill left them unstated.
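
The branch hard stop added in this commit can be pictured as a small guard. The sketch below is not the skill's text (the skill is prose, not code); `current_branch`, `may_implement`, and `PROTECTED_BRANCHES` are hypothetical names, and it assumes Git 2.22 or newer for `git branch --show-current`.

```python
import subprocess

# Branches named in the commit message as requiring explicit consent.
PROTECTED_BRANCHES = {"main", "master"}


def current_branch(repo_dir: str = ".") -> str:
    """Return the checked-out branch name (empty string on a detached HEAD)."""
    result = subprocess.run(
        ["git", "branch", "--show-current"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def may_implement(branch: str, user_consented: bool = False) -> bool:
    """Hard stop: refuse protected branches unless the user explicitly consented."""
    return branch not in PROTECTED_BRANCHES or user_consented
```

A caller would check `may_implement(current_branch())` before touching any files and ask for consent when it returns `False`.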
Add the superpowers enforcement spine (Iron Law, Red Flags rationalization
table) that these skills lacked, closing loopholes witnessed via writing-skills
subagent pressure-tests (RED->GREEN verified on Opus + Haiku):

- using-research-workflows: passive router -> forcing function. Adds an
  EXTREMELY-IMPORTANT 1%-rule mandate, instruction/skill priority, and a
  research-flavored Red Flags table; reconciles the 'don't force the full
  workflow' line so it can no longer be weaponized to skip skills entirely;
  description now fires on every research-software turn.
- ensuring-reproducibility: closes the '(or explicitly deferred with reason)' +
  'Where feasible' escape hatches that let a deadline researcher skip the
  clean-room reproduction. Deferral now bounded to technical impossibility; adds
  a minimal-clean-room floor, code-commit + hardware capture, and Red Flags.
- hardening-research-code: Iron Law that a regression baseline is not a
  correctness check; MUST-check-derivable-reference gate before pinning;
  UNVERIFIED labeling; watch-it-fail-first; tolerance-selection guidance.

Diagnosis and evidence recorded in project memory.
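
To make the hardening-research-code rule concrete, here is a minimal sketch of checking a computed value against a derivable reference before pinning it as a regression baseline. The function names and the `VERIFIED` / `MISMATCH` labels are hypothetical; only the `UNVERIFIED` label comes from the commit message. The example uses the integral of sin(x) on [0, π], whose exact value is 2.

```python
import math
from typing import Callable, Optional


def trapezoid(f: Callable[[float], float], a: float, b: float, n: int) -> float:
    """Composite trapezoid rule with n equal panels."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def baseline_status(computed: float, reference: Optional[float], rel_tol: float) -> str:
    """Label a value before it is pinned as a regression baseline."""
    if reference is None:
        # No derivable reference exists: the baseline only detects change.
        return "UNVERIFIED"
    if math.isclose(computed, reference, rel_tol=rel_tol):
        return "VERIFIED"
    return "MISMATCH"


# Derivable reference: the integral of sin(x) over [0, pi] is exactly 2.
computed = trapezoid(math.sin, 0.0, math.pi, 1000)
status = baseline_status(computed, 2.0, rel_tol=1e-5)
```

The tolerance here is chosen from the method's expected error (the trapezoid rule is second-order in the step size), not tuned until the check passes; a tolerance of `1e-9` would correctly report `MISMATCH` for this step count.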
…ls (P1)

RED->GREEN verified via writing-skills subagent pressure-tests:

- planning-implementations: restore writing-plans' dropped teeth — a
  NO-placeholders blocking rule (forbids 'add appropriate error handling',
  'write tests for the above') and bite-sized test-first task granularity
  (failing test -> fail -> minimal code -> pass -> commit). plan-template now
  models test-first phases and adds a Reproducibility & Correctness criteria
  block. (Witnessed RED: the old skill produced placeholder, tests-last phases.)
- validating-implementations: Iron Law 'no verdict without fresh output you
  produced yourself' — explicitly covers a teammate's green report, not just
  checkmarks; adds a reproducibility/correctness validation step + Red Flags.
- running-experiments: Iron Law 'an experiment is real measured code, or it is
  not an experiment' — forbids estimating one side; adds Red Flags and
  reproducibility-wired benchmarks (repeated runs, variance, fixed seeds).

Known residual: plan-template still has leftover placeholder scaffolding
(Testing Strategy / Edge Cases bullets) to trim in P2.
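
As a rough illustration of the "reproducibility-wired benchmarks" mentioned for running-experiments (repeated runs, variance, fixed seeds), the sketch below measures both candidates instead of estimating one side. The `benchmark` helper, the two candidate functions, and the seed value are hypothetical choices made for this example.

```python
import random
import statistics
import time
from typing import Callable, Sequence


def benchmark(fn: Callable[[Sequence[float]], float], data: Sequence[float], repeats: int = 5) -> dict:
    """Time fn(data) several times; report mean and sample standard deviation."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        timings.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "mean_s": statistics.mean(timings),
        "stdev_s": statistics.stdev(timings),  # needs repeats >= 2
    }


def manual_sum(xs: Sequence[float]) -> float:
    """Second candidate: an explicit accumulation loop."""
    total = 0.0
    for x in xs:
        total += x
    return total


rng = random.Random(42)  # fixed seed so both candidates see identical input
data = [rng.random() for _ in range(10_000)]

# Both sides are actually executed and measured on the same data.
results = {
    "builtin_sum": benchmark(sum, data),
    "manual_sum": benchmark(manual_sum, data),
}
```

Reporting `stdev_s` alongside `mean_s` lets a reader judge whether a difference between the two candidates exceeds run-to-run noise.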
…y polish

P1 remainder:
- creating-handoffs: capture research state (seeds, env/lockfile, data
  versions/checksums, partial results, in-flight jobs) and add a 'Report true
  state' gate forbidding clean-looking handoffs that hide failing tests,
  uncommitted work, or unreproduced results. Template gains Reproducibility &
  Data State and Verification State / Known-Broken sections.
  (Application-tested: captures all research state + surfaces all broken state.)

P2:
- implementing-plans: remove the duplicate 'Phase completion workflow' section
  (the loop is already covered by the Checklist + Verification section); the
  GREEN-tested gates are untouched.
- researching: record codebase state (commit SHA + date) so file:line findings
  can be re-checked as code evolves.
- iterating-plans: add a concrete post-edit consistency scan and re-trigger of
  reproducibility/numerical success criteria when approach changes.
- plan-template: reframe Testing Strategy so it complements the in-phase
  test-first unit tests instead of encouraging a tests-last batch.

Note: emoji left as-is (used across 7 plugin files — it is the convention).
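
The research state that creating-handoffs now captures (seeds, data checksums, known-broken items) and the commit SHA that researching records could be gathered with something like the following. This is a sketch only: `research_state`, `sha256_of`, the field names, the sample file, and the placeholder commit value are all hypothetical, and the real handoff template is a markdown document, not JSON.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large datasets are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def research_state(seed: int, commit_sha: str, data_files: list, failing_tests: list) -> dict:
    """Collect state a handoff should report, including what is broken."""
    return {
        "seed": seed,
        "commit_sha": commit_sha,
        "data_checksums": {p.name: sha256_of(p) for p in data_files},
        "known_broken": list(failing_tests),  # true state, not a clean-looking summary
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "measurements.csv"  # hypothetical data file
    sample.write_bytes(b"t,value\n0,1.0\n")
    state = research_state(
        seed=42,
        commit_sha="<commit-sha>",  # placeholder; e.g. from `git rev-parse HEAD`
        data_files=[sample],
        failing_tests=["test_convergence"],
    )

print(json.dumps(state, indent=2))
```

Keeping `known_broken` as a first-class field mirrors the 'Report true state' gate: failing tests are listed next to the checksums instead of being omitted.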
@lsetiawan lsetiawan marked this pull request as ready for review June 17, 2026 05:55
@lsetiawan lsetiawan merged commit db8e73b into main Jun 17, 2026
3 checks passed