Schliff

Your AI instruction files silently degrade — and nothing catches it. A trigger phrase rots. An edge case slips. Your SKILL.md balloons past its token budget. No error, no red test — just an agent that quietly gets worse.

Schliff is a deterministic quality scorer for AI instruction files. Same input, same score, every time, on every machine. Think of it as the Ruff for SKILL.md, CLAUDE.md, and AGENTS.md. It measures the things linters miss, the same way every time, so degradation shows up as a number that drops instead of a bug you chase.

Schliff scores the instruction files that drive your AI agents — skills, system prompts, project memory — against an explicit, versioned rubric, so you can gate a release on the number in CI. No LLM judge in the critical path. No network. No randomness. Just a rule engine you can read, pin, and trust.

pip install schliff
schliff score path/to/SKILL.md

schliff v8.3.0

  structure      ████████░░   78/100  good
  triggers       ███████░░░   72/100  good
  quality        ██████░░░░   64/100  fair
  edges          █████░░░░░   55/100  fair
  efficiency     ████████░░   80/100  good
  composability  ███████░░░   70/100  good
  clarity        ██████████  100/100  perfect

  Structural Score  ██████████████░░░░░░  71.2/100  [C]

  Tokens: 740 / 1,000 (ok)

No model in the loop produced that number. Run it again on another laptop and you get 71.2 again. That is the whole point.

A real catch

A SKILL.md for ShieldClaw, a prompt-injection-defense plugin, scored 68.3 [C] — and Schliff showed exactly why: composability 20/100 (no scope boundaries, no I/O contract, no handoffs), and 3 of 7 dimensions unmeasurable because there was no eval suite. After adding the missing scope section and an eval suite, the same file scored 94.6 [A] on all 7 dimensions.

	Score	Grade	Dimensions measured
Before	68.3	C	4/7 (no eval suite)
After	94.6	A	7/7

Defects you would otherwise ship show up as a number that is too low. See the full case study.

Why deterministic?

Most "AI quality" tools ask another LLM to grade your prompt. That makes the score non-reproducible (re-run it, get a different number), un-auditable (the rubric lives in a hidden prompt), and trivially gameable (write for the judge, not the user). A score you can't reproduce isn't a measurement — it's a vibe. You can't gate a release on a number that drifts.

Schliff takes the opposite position:

Reproducible. The headline composite is computed from a canonical, versioned weight registry. Calibration is off by default, so verify, badge, and the leaderboard return the same score on your laptop and in CI.
Auditable. Every dimension is a readable scorer in scripts/scoring/. The weights are a dict you can open. There is no hidden judge prompt.
Anti-gaming by design. A dedicated guard layer (guards.py) plus per-scorer heuristics detect padding, keyword stuffing, and structure-mimicry instead of rewarding them.
Zero core dependencies. Core Schliff is stdlib-only and runs on Python ≥ 3.10. (Optional [evolve] / [judge] extras pull in LLM clients for an opt-in smoke-test only — never for scoring.)

Because the number is stable, it does real work:

Diff it across two commits to see exactly what a refactor cost or earned.
Gate a pull request on a minimum score, with a non-zero exit code below the line.
Compare two files side by side on the same rubric.

An optional LLM judge exists for exploratory work, but it is never part of the deterministic score. The number you gate on is rule-based, end to end.

The 8 scored dimensions

For the SKILL.md family, Schliff runs 8 scorers per file. 7 of them form the headline composite; security and runtime are reported as separate opt-in signals so a security warning never silently inflates or deflates your quality grade.

Dimension	Weight	In headline?
`structure`	0.15	✅
`triggers`	0.20	✅
`quality`	0.20	✅
`edges`	0.15	✅
`efficiency`	0.10	✅
`composability`	0.10	✅
`clarity`	0.05	✅
`security`	0.05	Separate signal (gate threshold 70)
`runtime`	—	Separate signal (no profile weight)

The seven headline weights are renormalized to sum to 1.0 — that is the canonical basis.

Note: security is a side signal for the SKILL.md / CLAUDE.md / .cursorrules / AGENTS.md family, but a core 0.15 headline dimension for the system_prompt format, which uses its own scorer set. Only runtime is excluded everywhere.

The composite: a full-denominator model

Schliff does not quietly renormalize across whatever you happened to measure. Unmeasured dimensions contribute 0 and stay in the denominator, so coverage gaps lower your ceiling instead of disappearing. Your score ceiling equals the combined weight of the dimensions you measured. Measure only structure, efficiency, composability, and clarity (0.40 of the 0.95 total headline weight) and your maximum possible score is capped at about 42%, with an explicit warning:

ℹ Scored 4/7 dimensions — the score can't exceed 42% until the rest
  are measured. Run /schliff:init to add an eval suite and score:
  triggers, quality, edges.

This is deliberate. A partial measurement is an honest partial score, never a flattering one. Unmeasured work is missing points, not invisible. To lift the ceiling, measure more — don't hide the gap.
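The arithmetic behind that ceiling can be sketched in a few lines. This is an illustration only, not the code in scripts/scoring/composite.py: the weights are the seven headline weights from the table above, while the function names `composite` and `ceiling` and the convention that a missing or `None` score means "unmeasured" are assumptions.

```python
# Hypothetical sketch; not Schliff's actual composite implementation.
HEADLINE_WEIGHTS = {
    "structure": 0.15, "triggers": 0.20, "quality": 0.20, "edges": 0.15,
    "efficiency": 0.10, "composability": 0.10, "clarity": 0.05,
}
_TOTAL = sum(HEADLINE_WEIGHTS.values())  # 0.95 before renormalization
WEIGHTS = {name: w / _TOTAL for name, w in HEADLINE_WEIGHTS.items()}


def composite(scores):
    """Weighted sum over ALL headline dimensions; unmeasured ones count as 0."""
    return sum(w * (scores.get(name) or 0.0) for name, w in WEIGHTS.items())


def ceiling(scores):
    """Best reachable composite given which dimensions were measured."""
    return sum(100.0 * w for name, w in WEIGHTS.items()
               if scores.get(name) is not None)


# Only the four dimensions that need no eval suite, each with a perfect score.
partial = {"structure": 100, "efficiency": 100, "composability": 100, "clarity": 100}
print(round(ceiling(partial)))       # 42
print(round(composite(partial), 1))  # 42.1
```

Because the denominator never shrinks, a perfect score on four dimensions still lands at the 42% ceiling quoted in the warning.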

Grade scale

S ≥ 95 · A ≥ 85 · B ≥ 75 · C ≥ 65 · D ≥ 50 · E ≥ 35 · F < 35
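As a rough illustration, the scale maps to a simple threshold lookup. The thresholds are copied from the line above; the name `grade` and the table layout are assumptions, not Schliff's own code.

```python
# Thresholds from the grade scale above; the function itself is a hypothetical sketch.
GRADE_FLOORS = [(95, "S"), (85, "A"), (75, "B"), (65, "C"), (50, "D"), (35, "E")]


def grade(score):
    """Return the letter grade for a 0-100 composite score."""
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


for s in (94.6, 71.2, 68.3, 34.9):
    print(s, grade(s))
```

The three scores quoted earlier in this README (94.6, 71.2, 68.3) come out as A, C, and C under this lookup.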

Multi-format support

One engine, five instruction-file formats — each with its own token budget and scorer set:

Format	Token budget	Scorers
`SKILL.md`	1,000	shared 8-scorer registry
`CLAUDE.md`	2,000	shared 8-scorer registry
`.cursorrules`	500	shared 8-scorer registry
`AGENTS.md`	3,000	shared 8-scorer registry
system prompts	1,500	dedicated set (`structure_prompt`, `output_contract`, `efficiency`, `clarity`, `security`, `composability`, `completeness`)

Format is auto-detected; override with --format (skill, claude, cursor, agents, system-prompt).
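A minimal sketch of how a format key and its budget fit together. The budgets and format keys come from the table and the --format list above. Everything else is assumed for illustration: detection purely by filename, the fallback to system-prompt for unknown names, the "over" label, and the function names. The real logic lives in scripts/scoring/formats.py and may work differently.

```python
from pathlib import Path

# Token budgets from the table above.
TOKEN_BUDGETS = {
    "skill": 1000, "claude": 2000, "cursor": 500,
    "agents": 3000, "system-prompt": 1500,
}
# ASSUMPTION: filename-only detection; Schliff may also inspect file content.
_BY_FILENAME = {
    "SKILL.md": "skill", "CLAUDE.md": "claude",
    ".cursorrules": "cursor", "AGENTS.md": "agents",
}


def detect_format(path, override=None):
    """Pick a format key; an explicit override (like --format) always wins."""
    if override is not None:
        if override not in TOKEN_BUDGETS:
            raise ValueError(f"unknown format: {override}")
        return override
    return _BY_FILENAME.get(Path(path).name, "system-prompt")


def budget_status(fmt, tokens):
    """Render a usage line in the style of the score output."""
    budget = TOKEN_BUDGETS[fmt]
    state = "ok" if tokens <= budget else "over"
    return f"Tokens: {tokens:,} / {budget:,} ({state})"


print(budget_status(detect_format("path/to/SKILL.md"), 740))
# Tokens: 740 / 1,000 (ok)
```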

Install

pip install schliff                  # core, stdlib-only
pip install "schliff[evolve,judge]"  # optional LLM-judge / evolve extras

Install	Pulls in	When you need it
`schliff`	stdlib only	Scoring, verify, badge, CI — everything that gates a release
`schliff[judge]`	LLM client	Opt-in exploratory LLM-judge smoke-test (never scoring)
`schliff[evolve]`	LLM client	Opt-in autonomous-improvement extras

GitHub Action

Gate pull requests on instruction-file quality. The action defaults to your repo-root AGENTS.md and posts a scored comment on every PR:

# .github/workflows/agents-lint.yml
name: AGENTS.md Lint
on: [pull_request]
jobs:
  score:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: Zandereins/schliff@v1
        with:
          minimum-score: '75'   # optional: fail the PR below this score

Set skill-path: to lint a SKILL.md, CLAUDE.md, or .cursorrules instead of the default AGENTS.md.

Prefer not to depend on a third-party action? The dependency-light equivalent:

      - run: pip install schliff
      - run: schliff verify AGENTS.md --min-score 75

schliff verify exits non-zero below the threshold — a clean CI gate either way.

pre-commit

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/Zandereins/schliff
    rev: v8.3.0
    hooks:
      - id: schliff-verify
        args: ['--min-score', '75']

CLI

schliff <command> [path] [options]

Command	What it does
`score`	Score a file and print the grade bar
`verify`	CI gate — exit 0/1 based on a minimum score
`doctor`	Scan and grade every installed skill
`badge`	Generate a Markdown score badge
`diff`	Explain score changes between two git commits
`compare`	Compare two files side by side
`suggest`	Rank fixes by estimated score impact
`report`	Generate a Markdown score report
`demo`	Score a built-in bad skill to see Schliff in action
`evolve`	Improve an instruction file's score
`version`	Print the version

The version is single-sourced: the CLI resolves it at runtime via importlib.metadata.version("schliff"), falling back to dev from a source checkout.
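The pattern looks roughly like this. `importlib.metadata.version` and `PackageNotFoundError` are standard-library names; the wrapper name `resolve_version` is an assumption, and the actual code in scripts/cli.py may differ in detail.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name="schliff"):
    """Return the installed distribution's version, or "dev" if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Running from a source checkout without an installed distribution.
        return "dev"


print(resolve_version())
```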

Optional: closing the loop

Beyond grading, Schliff can apply fixes. The improvement engine measures first, then fixes (not the other way around):

1. Score the file across all dimensions.
2. Generate deterministic patch gradients for the weakest dimensions.
3. Apply the safe, rule-based patches automatically. About 32% of suggested fixes apply deterministically through the apply gate (confidence=high, single-edit; canonical measurement: measure_patch_ratio.py). The rest are handed to an optional LLM.
4. Re-score and keep the change only if the score improved; otherwise revert.
5. Stop on plateau detection or when the target is reached.
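The keep-or-revert loop in these steps can be sketched as follows. This is a toy under stated assumptions, not the improvement engine itself: patches are plain callables rather than patch gradients, and `improve`, `toy_score`, `target`, and `patience` are hypothetical names.

```python
def improve(text, score_fn, patches, target=95.0, patience=2):
    """Greedy measure-then-fix loop: keep a patch only if the score rises."""
    best = score_fn(text)
    stalled = 0
    for patch in patches:
        if best >= target or stalled >= patience:
            break  # target reached, or plateau detected
        candidate = patch(text)
        new_score = score_fn(candidate)
        if new_score > best:
            text, best, stalled = candidate, new_score, 0
        else:
            stalled += 1  # revert: the candidate is simply discarded
    return text, best


# Toy scorer (assumption): share of required headings present in the file.
REQUIRED = ("## Scope", "## Inputs", "## Outputs")


def toy_score(text):
    return 100.0 * sum(h in text for h in REQUIRED) / len(REQUIRED)


patches = [
    lambda t: t + "\n## Scope\n",
    lambda t: t.upper(),           # harmful patch: lowers the score, gets reverted
    lambda t: t + "\n## Inputs\n",
]
final, score = improve("# Skill\n", toy_score, patches)
print(round(score, 1))  # 66.7
```

The second patch lowers the score, so its result never replaces the working text. That is the revert step in miniature.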

It also carries cross-session episodic memory (episodic_store.py), so improvement runs learn from prior attempts instead of repeating them. Drive it from Claude Code with /schliff:auto, or use schliff evolve directly. This is an optional convenience layer — the deterministic score is the product.

→ 7 deterministic fixes available. Run `/schliff:auto` to apply.

How it works

The full methodology — scorer internals, the full-denominator composite, the anti-gaming guards, and the calibration model — lives in docs/SCORING.md. Calibration is strictly opt-in: ambient auto-calibrated weights apply only when SCHLIFF_CALIBRATED_WEIGHTS is set and only for the interactive score command, and Schliff emits a weight_source=calibrated warning flagging that such scores are not comparable to the canonical scale. Everything that gates a release stays canonical.

scripts/
├── cli.py                  # CLI entrypoint + dynamic version resolution
├── scoring/
│   ├── registry.py         # canonical weights, scorer lists, headline exclusions
│   ├── composite.py        # full-denominator composite model
│   ├── formats.py          # format detection + token budgets
│   ├── guards.py           # anti-gaming detection
│   └── structure.py · triggers.py · quality.py · edges.py · …
├── text_gradient.py        # deterministic patch gradients (apply gate)
├── episodic_store.py       # cross-session episodic memory
└── measure_patch_ratio.py  # canonical source for the patch-ratio claim
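The opt-in calibration rule described above reduces to a small decision. A sketch, assuming that "set" means a non-empty value and that the command is identified by its CLI name; `weight_source` is a hypothetical helper, not a function from the codebase.

```python
import os


def weight_source(command, environ=None):
    """Return "calibrated" only for the interactive score command with the env var set."""
    env = os.environ if environ is None else environ
    # ASSUMPTION: any non-empty value counts as "set".
    if command == "score" and env.get("SCHLIFF_CALIBRATED_WEIGHTS"):
        return "calibrated"
    return "canonical"


print(weight_source("score", {"SCHLIFF_CALIBRATED_WEIGHTS": "1"}))   # calibrated
print(weight_source("verify", {"SCHLIFF_CALIBRATED_WEIGHTS": "1"}))  # canonical
```

Under this rule, gating commands such as verify stay on canonical weights regardless of the environment.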

Positioning

LLM-judge tools ask a model how good your prompt feels — a different answer every run. Schliff computes how good it measurably is — the same answer every run, in a number you can pin to a commit and gate a release on.

Ruff lints your Python. Biome lints your JS. Schliff lints the instruction files that drive your AI — deterministically, with no model in the loop.

Contributing & links

⭐ Star the repo: github.com/Zandereins/schliff
📖 Docs: docs/SCORING.md
🧪 Playground: schliff-playground.vercel.app — paste a SKILL.md, get a live structural score (or schliff demo in the CLI)
🏆 Leaderboard: schliff-leaderboard.vercel.app

Structural score = the composite renormalized over the dimensions Schliff can measure deterministically without an eval suite (structure, efficiency, composability, clarity). It is what the web playground reports. The full 7-dimension composite additionally folds in triggers, quality, and edges — which require an eval suite (schliff init).

Validated by 1,231 tests (unit + integration) in skills/schliff/tests, with separate self and proof suites via test-self.sh and test-integration.sh.
