76 changes: 76 additions & 0 deletions docs/evals/monthly/2026-06-22-image.md
@@ -0,0 +1,76 @@
# ahd eval-image · editorial-illustration · 2026-06-22T13:12:35.077Z

```yaml ahd-replay
schema_version: 1
kind: eval-image
ahd_version: 0.11.0
ahd_commit: 2fd291864ef39c898f6e0dc31f49973f108ae445
git_dirty: true
node_version: v22.22.2
platform: linux-x64
invoked_at: 2026-06-22T13:06:50.593Z
token:
  path: /home/runner/work/ahd/ahd/tokens/editorial-illustration.yml
  hash: sha256:c2c79dec1b06
brief:
  path: briefs/editorial-illustration.yml
  hash: sha256:ede77b5c41cf
sampling:
  n: 5
  temperature: null
  seed: null
models:
  - id: @cf/black-forest-labs/flux-1-schnell
    provider: cloudflare-workers-ai-image
    provider_request_ids: 9 captured
  - id: @cf/bytedance/stable-diffusion-xl-lightning
    provider: cloudflare-workers-ai-image
    provider_request_ids: 10 captured
  - id: @cf/stabilityai/stable-diffusion-xl-base-1.0
    provider: cloudflare-workers-ai-image
    provider_request_ids: 10 captured
  - id: @cf/lykon/dreamshaper-8-lcm
    provider: cloudflare-workers-ai-image
    provider_request_ids: 10 captured
conditions:
  requested: [raw, compiled]
  effective: [raw, compiled]
```

Replay this run:

```sh
git checkout 2fd291864ef3
npm ci && npm run build
/opt/hostedtoolcache/node/22.22.2/x64/bin/node /home/runner/work/ahd/ahd/bin/ahd.js eval-image editorial-illustration --brief briefs/editorial-illustration.yml --models cfimg:@cf/black-forest-labs/flux-1-schnell,cfimg:@cf/bytedance/stable-diffusion-xl-lightning,cfimg:@cf/stabilityai/stable-diffusion-xl-base-1.0,cfimg:@cf/lykon/dreamshaper-8-lcm --n 5 --critic anthropic --report docs/evals/monthly/2026-06-22-image.md
```
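
Before replaying, it can help to confirm that the local token and brief files still match the recorded hashes. The header above shows a 12-character prefix, and the replay JSON in the same change records the full digest. The sketch below assumes the hash is a SHA-256 over the raw file bytes, which this report does not confirm; `sha256_tag` and `matches_recorded` are hypothetical helper names, not part of ahd.

```python
import hashlib
from typing import Optional


def sha256_tag(data: bytes, length: Optional[int] = None) -> str:
    """Return 'sha256:<hex>' for data, optionally truncated to `length` hex chars."""
    digest = hashlib.sha256(data).hexdigest()
    return "sha256:" + (digest if length is None else digest[:length])


def matches_recorded(data: bytes, recorded: str) -> bool:
    """True when the full digest of `data` starts with the recorded hash.

    Works for both the truncated form in the Markdown header and the
    full form in the replay JSON, because one is a prefix of the other.
    """
    return sha256_tag(data).startswith(recorded)


# Hypothetical usage (assumes hashing of raw file bytes):
#   from pathlib import Path
#   data = Path("briefs/editorial-illustration.yml").read_bytes()
#   matches_recorded(data, "sha256:ede77b5c41cf")
```

A mismatch would only mean the file differs from the one hashed at run time, or that the hashing assumption above is wrong; with `git_dirty: true` recorded, either is plausible.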

- Brief: `briefs/editorial-illustration.yml`
- Samples per cell: **5**

## Per-model slop reduction (vision critic)

| model | raw attempted → critiqued | compiled attempted → critiqued | raw mean tells | compiled mean tells | Δ | reduction |
|---|---:|---:|---:|---:|---:|---:|
| `@cf/black-forest-labs/flux-1-schnell` | 5 → 5 | 5 → 4 | 0.40 | 1.00 | -0.60 | -150.0% |
| `@cf/bytedance/stable-diffusion-xl-lightning` | 5 → 5 | 5 → 5 | 1.00 | 0.40 | 0.60 | 60.0% |
| `@cf/stabilityai/stable-diffusion-xl-base-1.0` | 5 → 5 | 5 → 5 | 1.40 | 0.40 | 1.00 | 71.4% |
| `@cf/lykon/dreamshaper-8-lcm` | 5 → 5 | 5 → 5 | 1.60 | 0.60 | 1.00 | 62.5% |
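
The Δ and reduction columns appear to follow Δ = raw mean − compiled mean and reduction = Δ / raw mean. That formula is inferred from the table values rather than taken from the ahd source, so treat the sketch below as a reading aid; `slop_reduction` is a hypothetical name.

```python
import math


def slop_reduction(raw_mean: float, compiled_mean: float):
    """Return (delta, reduction_pct) for one model row.

    A negative reduction means the compiled condition produced
    more tells per sample than the raw condition.
    """
    delta = raw_mean - compiled_mean
    # Guard against a raw mean of zero, where the ratio is undefined.
    reduction_pct = delta / raw_mean * 100 if raw_mean else math.nan
    return delta, reduction_pct


# Rows from the table above: (raw mean tells, compiled mean tells).
rows = {
    "flux-1-schnell": (0.40, 1.00),
    "stable-diffusion-xl-lightning": (1.00, 0.40),
    "stable-diffusion-xl-base-1.0": (1.40, 0.40),
    "dreamshaper-8-lcm": (1.60, 0.60),
}

for model, (raw, compiled) in rows.items():
    delta, pct = slop_reduction(raw, compiled)
    print(f"{model}: delta={delta:.2f} reduction={pct:.1f}%")
```

With n = 5 per cell, a single sample moves a mean by 0.20, so the flux-1-schnell row (0.40 → 1.00) reflects a difference of only a few images.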

## Per-tell frequency

| tell | @cf/black-forest-labs/flux-1-schnell/raw | @cf/black-forest-labs/flux-1-schnell/compiled | @cf/bytedance/stable-diffusion-xl-lightning/raw | @cf/bytedance/stable-diffusion-xl-lightning/compiled | @cf/stabilityai/stable-diffusion-xl-base-1.0/raw | @cf/stabilityai/stable-diffusion-xl-base-1.0/compiled | @cf/lykon/dreamshaper-8-lcm/raw | @cf/lykon/dreamshaper-8-lcm/compiled |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ahd/image/no-malformed-anatomy | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 20% |
| ahd/image/no-midjourney-face-symmetry | 0% | 0% | 0% | 0% | 0% | 0% | 60% | 0% |
| ahd/image/no-stock-diversity-casting | 0% | 0% | 0% | 0% | 20% | 0% | 0% | 0% |
| ahd/no-ai-illustration | 0% | 0% | 0% | 0% | 20% | 40% | 80% | 40% |
| ahd/no-corporate-memphis | 20% | 75% | 100% | 20% | 100% | 0% | 0% | 0% |
| ahd/require-asymmetry | 20% | 25% | 0% | 20% | 0% | 0% | 20% | 0% |
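
The per-tell percentages in each column add up to that cell's mean tells in the previous table, since each percentage is the share of critiqued samples that triggered the tell. The sketch below checks this for two columns; the helper names are hypothetical, and the conversion back to counts assumes the percentages were rounded from whole-sample counts.

```python
def mean_tells(frequencies_pct):
    """Mean tells per sample: the sum of per-tell frequencies as fractions."""
    return sum(frequencies_pct) / 100


def count_from_pct(pct: float, critiqued: int) -> int:
    """Recover the number of samples that triggered a tell (assumes rounded input)."""
    return round(pct / 100 * critiqued)


# dreamshaper-8-lcm/raw column, top to bottom (5 critiqued samples).
dreamshaper_raw = [0, 60, 0, 80, 0, 20]
print(mean_tells(dreamshaper_raw))

# flux-1-schnell/compiled column (only 4 critiqued samples, hence 75% and 25%).
flux_compiled = [0, 0, 0, 0, 75, 25]
print(mean_tells(flux_compiled))
print(count_from_pct(75, 4), count_from_pct(25, 4))
```

The flux-1-schnell compiled column is the only one computed over 4 samples rather than 5, which is why its percentages are multiples of 25.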

## Caveats
- Image samples are scored by the vision critic over the AHD vision ruleset (13 rules: 9 web/graphic + 4 image-specific).
- The critic is itself an LLM, so its verdicts are not independent of model training. Run with `--critic mock` for deterministic tests and report both.
- Per-cell counts are separate: attempted (runs initiated) / errored (API errors) / critiqued (scored). A large gap indicates rate-limit or generator failures, not that a run 'passed' the taxonomy.
- Raw condition: brief as prose with no AHD style direction or forbidden list. Compiled condition: token-driven positive + negative prompts.
- The compiled negative prompt includes image-specific slop patterns (corporate memphis, malformed anatomy, iridescent blobs, decorative cursive). The raw condition does not.
114 changes: 114 additions & 0 deletions docs/evals/monthly/2026-06-22-image.replay.json
@@ -0,0 +1,114 @@
{
"schema_version": 1,
"kind": "eval-image",
"ahd_version": "0.11.0",
"ahd_commit": "2fd291864ef39c898f6e0dc31f49973f108ae445",
"git_dirty": true,
"node_version": "v22.22.2",
"platform": "linux-x64",
"invoked_at": "2026-06-22T13:06:50.593Z",
"argv": [
"/opt/hostedtoolcache/node/22.22.2/x64/bin/node",
"/home/runner/work/ahd/ahd/bin/ahd.js",
"eval-image",
"editorial-illustration",
"--brief",
"briefs/editorial-illustration.yml",
"--models",
"cfimg:@cf/black-forest-labs/flux-1-schnell,cfimg:@cf/bytedance/stable-diffusion-xl-lightning,cfimg:@cf/stabilityai/stable-diffusion-xl-base-1.0,cfimg:@cf/lykon/dreamshaper-8-lcm",
"--n",
"5",
"--critic",
"anthropic",
"--report",
"docs/evals/monthly/2026-06-22-image.md"
],
"token": {
"path": "/home/runner/work/ahd/ahd/tokens/editorial-illustration.yml",
"hash": "sha256:c2c79dec1b06fc45877d2b99be5c2a776aec12c226e0aeac38d384d698d9b721"
},
"brief": {
"path": "briefs/editorial-illustration.yml",
"hash": "sha256:ede77b5c41cf91ecac53272cf8d0e6c20ed5639ceb457c410a6cda03e7bca0f2"
},
"sampling": {
"n": 5,
"temperature": null,
"seed": null
},
"models": [
{
"id": "@cf/black-forest-labs/flux-1-schnell",
"provider": "cloudflare-workers-ai-image",
"provider_request_ids": [
"a0fb78fabf2f279a-LAX",
"a0fb790bcb5f279a-LAX",
"a0fb792108db279a-LAX",
"a0fb793a0dc0279a-LAX",
"a0fb79531d89279a-LAX",
"a0fb796d7ebc279a-LAX",
"a0fb798678c2279a-LAX",
"a0fb799deae4279a-LAX",
"a0fb79aceb78279a-LAX"
]
},
{
"id": "@cf/bytedance/stable-diffusion-xl-lightning",
"provider": "cloudflare-workers-ai-image",
"provider_request_ids": [
"a0fb79c618a7279a-LAX",
"a0fb79ec4b60279a-LAX",
"a0fb7a143842279a-LAX",
"a0fb7a39beda279a-LAX",
"a0fb7a60fd4f279a-LAX",
"a0fb7a8aa837279a-LAX",
"a0fb7aac1d84279a-LAX",
"a0fb7acdcddb279a-LAX",
"a0fb7afa9f35279a-LAX",
"a0fb7b1dc942279a-LAX"
]
},
{
"id": "@cf/stabilityai/stable-diffusion-xl-base-1.0",
"provider": "cloudflare-workers-ai-image",
"provider_request_ids": [
"a0fb7b492a50279a-LAX",
"a0fb7b993fd8279a-LAX",
"a0fb7bfd6b29279a-LAX",
"a0fb7c513d93279a-LAX",
"a0fb7cb86f711360-LAX",
"a0fb7d083b601360-LAX",
"a0fb7d520e641360-LAX",
"a0fb7da5dd0f1360-LAX",
"a0fb7df17d4f1360-LAX",
"a0fb7e427fc31360-LAX"
]
},
{
"id": "@cf/lykon/dreamshaper-8-lcm",
"provider": "cloudflare-workers-ai-image",
"provider_request_ids": [
"a0fb7ea17d1b2f73-LAX",
"a0fb7ee539b62f73-LAX",
"a0fb7f273fb12f73-LAX",
"a0fb7f76caa6cb92-LAX",
"a0fb7fb99b4ccb92-LAX",
"a0fb7fff8932cb92-LAX",
"a0fb80492e12cb92-LAX",
"a0fb8088c93acb92-LAX",
"a0fb80cbdf04cb92-LAX",
"a0fb81109e86cb92-LAX"
]
}
],
"conditions": {
"requested": [
"raw",
"compiled"
],
"effective": [
"raw",
"compiled"
]
}
}
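
One way to read this manifest is to compare the number of captured request ids per model against the number of attempts implied by `sampling.n` and `conditions.effective`. The sketch below assumes each attempt that reached the provider yields exactly one request id, which the manifest itself does not state; `missing_request_ids` is a hypothetical helper, and the ids in the example are placeholders rather than values from this run.

```python
import json


def missing_request_ids(manifest: dict) -> dict:
    """Map model id -> expected attempts minus captured request ids."""
    expected = manifest["sampling"]["n"] * len(manifest["conditions"]["effective"])
    return {
        model["id"]: expected - len(model["provider_request_ids"])
        for model in manifest["models"]
    }


# Trimmed stand-in for the manifest above; request ids are placeholders.
example = json.loads("""
{
  "sampling": {"n": 5, "temperature": null, "seed": null},
  "models": [
    {"id": "@cf/black-forest-labs/flux-1-schnell",
     "provider_request_ids": ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8", "id9"]},
    {"id": "@cf/lykon/dreamshaper-8-lcm",
     "provider_request_ids": ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8", "id9", "id10"]}
  ],
  "conditions": {"effective": ["raw", "compiled"]}
}
""")

print(missing_request_ids(example))
```

For the real file, load it with `json.loads(Path(...).read_text())` instead. Under the stated assumption, flux-1-schnell comes up one id short (9 of 10), which lines up with its `5 → 4` compiled cell in the report.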
93 changes: 93 additions & 0 deletions docs/evals/monthly/2026-06-22-source.md
@@ -0,0 +1,93 @@
# ahd eval · swiss-editorial · 2026-06-22T12:52:02.326Z

```yaml ahd-replay
schema_version: 1
kind: eval-live
ahd_version: 0.11.0
ahd_commit: 2fd291864ef39c898f6e0dc31f49973f108ae445
git_dirty: true
node_version: v22.22.2
platform: linux-x64
invoked_at: 2026-06-22T12:25:17.118Z
token:
  path: /home/runner/work/ahd/ahd/tokens/swiss-editorial.yml
  hash: sha256:380a3d833d94
brief:
  path: briefs/landing.yml
  hash: sha256:8b7d42759643
sampling:
  n: 30
  temperature: null
  seed: null
models:
  - id: @cf/google/gemma-4-26b-a4b-it
    provider: cloudflare-workers-ai
    provider_request_ids: 55 captured
  - id: @cf/meta/llama-4-scout-17b-16e-instruct
    provider: cloudflare-workers-ai
    provider_request_ids: 60 captured
  - id: @cf/mistralai/mistral-small-3.1-24b-instruct
    provider: cloudflare-workers-ai
    provider_request_ids: 60 captured
  - id: @cf/openai/gpt-oss-120b
    provider: cloudflare-workers-ai
    provider_request_ids: 58 captured
  - id: @cf/qwen/qwen3-30b-a3b-fp8
    provider: cloudflare-workers-ai
    provider_request_ids: 60 captured
conditions:
  requested: [raw, compiled]
  effective: [raw, compiled]
```

Replay this run:

```sh
git checkout 2fd291864ef3
npm ci && npm run build
/opt/hostedtoolcache/node/22.22.2/x64/bin/node /home/runner/work/ahd/ahd/bin/ahd.js eval-live swiss-editorial --brief briefs/landing.yml --models cf:@cf/google/gemma-4-26b-a4b-it,cf:@cf/meta/llama-4-scout-17b-16e-instruct,cf:@cf/mistralai/mistral-small-3.1-24b-instruct,cf:@cf/openai/gpt-oss-120b,cf:@cf/qwen/qwen3-30b-a3b-fp8 --n 30 --sample-concurrency 6 --out evals --report docs/evals/monthly/2026-06-22-source.md
```

## Run

- Brief: `briefs/landing.yml`
- Samples per cell: **30**
- Max tokens: 12000
- Models:
  - `@cf/google/gemma-4-26b-a4b-it` (cloudflare-workers-ai) · spec `cf:@cf/google/gemma-4-26b-a4b-it`
  - `@cf/meta/llama-4-scout-17b-16e-instruct` (cloudflare-workers-ai) · spec `cf:@cf/meta/llama-4-scout-17b-16e-instruct`
  - `@cf/mistralai/mistral-small-3.1-24b-instruct` (cloudflare-workers-ai) · spec `cf:@cf/mistralai/mistral-small-3.1-24b-instruct`
  - `@cf/openai/gpt-oss-120b` (cloudflare-workers-ai) · spec `cf:@cf/openai/gpt-oss-120b`
  - `@cf/qwen/qwen3-30b-a3b-fp8` (cloudflare-workers-ai) · spec `cf:@cf/qwen/qwen3-30b-a3b-fp8`

## Per-model slop reduction

| model | raw attempted → scored | compiled attempted → scored | raw mean tells | compiled mean tells | Δ | reduction |
|---|---:|---:|---:|---:|---:|---:|
| `@cf/google/gemma-4-26b-a4b-it` | 30 → 28 | 30 → 25 | 2.64 | 1.20 | 1.44 | 54.6% |
| `@cf/meta/llama-4-scout-17b-16e-instruct` | 30 → 30 | 30 → 30 | 2.23 | 2.00 | 0.23 | 10.4% |
| `@cf/mistralai/mistral-small-3.1-24b-instruct` | 30 → 30 | 30 → 30 | 3.40 | 1.27 | 2.13 | 62.7% |
| `@cf/openai/gpt-oss-120b` | 30 → 30 | 30 → 28 | 3.43 | 0.79 | 2.65 | 77.1% |
| `@cf/qwen/qwen3-30b-a3b-fp8` | 30 → 30 | 30 → 30 | 1.77 | 1.57 | 0.20 | 11.3% |

## Per-tell frequency (scored samples only)

| tell | @cf/google/gemma-4-26b-a4b-it/raw | @cf/google/gemma-4-26b-a4b-it/compiled | @cf/meta/llama-4-scout-17b-16e-instruct/raw | @cf/meta/llama-4-scout-17b-16e-instruct/compiled | @cf/mistralai/mistral-small-3.1-24b-instruct/raw | @cf/mistralai/mistral-small-3.1-24b-instruct/compiled | @cf/openai/gpt-oss-120b/raw | @cf/openai/gpt-oss-120b/compiled | @cf/qwen/qwen3-30b-a3b-fp8/raw | @cf/qwen/qwen3-30b-a3b-fp8/compiled |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| ahd/a11y/heading-skip | 0% | 12% | 0% | 0% | 0% | 7% | 0% | 0% | 0% | 0% |
| ahd/line-height-per-size | 79% | 0% | 27% | 100% | 50% | 17% | 100% | 0% | 47% | 43% |
| ahd/no-default-grotesque | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 4% | 0% | 0% |
| ahd/no-em-dashes-in-prose | 0% | 0% | 0% | 0% | 0% | 0% | 3% | 0% | 0% | 0% |
| ahd/radius-hierarchy | 64% | 4% | 23% | 100% | 100% | 3% | 80% | 0% | 17% | 33% |
| ahd/require-named-grid | 0% | 0% | 73% | 0% | 100% | 57% | 17% | 0% | 7% | 0% |
| ahd/require-type-pairing | 21% | 0% | 100% | 0% | 90% | 0% | 70% | 0% | 90% | 3% |
| ahd/tracking-per-size | 0% | 20% | 0% | 0% | 0% | 30% | 0% | 11% | 0% | 0% |
| ahd/weight-variety | 100% | 84% | 0% | 0% | 0% | 13% | 73% | 64% | 17% | 77% |
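
A net reduction can hide individual tells that got worse under the compiled condition. As a rough sketch, the helper below lists the tells whose frequency rose from raw to compiled for one model, using the llama-4-scout columns from the table above; `regressions` is a hypothetical name and the threshold of "any increase" is an arbitrary choice.

```python
def regressions(freqs: dict) -> list:
    """Return (tell, increase_in_points) for tells that fire more often when compiled.

    `freqs` maps tell id -> (raw_pct, compiled_pct); results are sorted
    with the largest increase first.
    """
    worse = [(tell, compiled - raw) for tell, (raw, compiled) in freqs.items() if compiled > raw]
    return sorted(worse, key=lambda item: -item[1])


# @cf/meta/llama-4-scout-17b-16e-instruct, non-zero rows from the table above.
llama_scout = {
    "ahd/line-height-per-size": (27, 100),
    "ahd/radius-hierarchy": (23, 100),
    "ahd/require-named-grid": (73, 0),
    "ahd/require-type-pairing": (100, 0),
}

for tell, points in regressions(llama_scout):
    print(f"{tell}: +{points} points")
```

For this model the two improved tells and the two regressed tells nearly cancel, which is consistent with its small overall reduction (2.23 → 2.00).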

## Caveats
- Scoring runs the deterministic AHD linter (38 source-level rules) over every sample that passes a basic HTML sanity check.
- Counts reported per cell: attempted (runs initiated) / errored (API / runtime errors) / extractionFailed (response contained no usable HTML) / scored (linted). A large gap between attempted and scored is a signal that the model is struggling with the instruction, not that it passed the taxonomy.
- Raw condition: the brief is expanded as plain prose (intent + audience + surfaces + mustInclude + mustAvoid) with no AHD system prompt, no style token, no forbidden list. Compiled condition: same brief plus the AHD-compiled system prompt. The only thing that differs between conditions is the AHD intervention.
- Vision-only tells (14 rules in the critic) are not scored in this pipeline; run the critic on rendered screenshots for full taxonomy coverage.
- Tells-per-page is a proxy metric: a thin page has little surface for rules to fire against. Read the Δ alongside the actual rendered HTML, not in isolation.
- Model versions change. See the run manifest for exact canonical model ids.