Ad-Astra-Computing · jasonodoom · Jun 22, 2026
diff --git a/docs/evals/monthly/2026-06-22-image.md b/docs/evals/monthly/2026-06-22-image.md
@@ -0,0 +1,76 @@
+# ahd eval-image · editorial-illustration · 2026-06-22T13:12:35.077Z
+
+```yaml ahd-replay
+schema_version: 1
+kind: eval-image
+ahd_version: 0.11.0
+ahd_commit: 2fd291864ef39c898f6e0dc31f49973f108ae445
+git_dirty: true
+node_version: v22.22.2
+platform: linux-x64
+invoked_at: 2026-06-22T13:06:50.593Z
+token:
+  path: /home/runner/work/ahd/ahd/tokens/editorial-illustration.yml
+  hash: sha256:c2c79dec1b06
+brief:
+  path: briefs/editorial-illustration.yml
+  hash: sha256:ede77b5c41cf
+sampling:
+  n: 5
+  temperature: null
+  seed: null
+models:
+  - id: "@cf/black-forest-labs/flux-1-schnell"
+    provider: cloudflare-workers-ai-image
+    provider_request_ids: 9 captured
+  - id: "@cf/bytedance/stable-diffusion-xl-lightning"
+    provider: cloudflare-workers-ai-image
+    provider_request_ids: 10 captured
+  - id: "@cf/stabilityai/stable-diffusion-xl-base-1.0"
+    provider: cloudflare-workers-ai-image
+    provider_request_ids: 10 captured
+  - id: "@cf/lykon/dreamshaper-8-lcm"
+    provider: cloudflare-workers-ai-image
+    provider_request_ids: 10 captured
+conditions:
+  requested: [raw, compiled]
+  effective: [raw, compiled]
+```
+
+Replay this run:
+
+```sh
+git checkout 2fd291864ef3
+npm ci && npm run build
+node bin/ahd.js eval-image editorial-illustration --brief briefs/editorial-illustration.yml --models cfimg:@cf/black-forest-labs/flux-1-schnell,cfimg:@cf/bytedance/stable-diffusion-xl-lightning,cfimg:@cf/stabilityai/stable-diffusion-xl-base-1.0,cfimg:@cf/lykon/dreamshaper-8-lcm --n 5 --critic anthropic --report docs/evals/monthly/2026-06-22-image.md
+```
+
+- Brief: `briefs/editorial-illustration.yml`
+- Samples per cell: **5**
+
+## Per-model slop reduction (vision critic)
+
+| model | raw attempted → critiqued | compiled attempted → critiqued | raw mean tells | compiled mean tells | Δ | reduction |
+|---|---:|---:|---:|---:|---:|---:|
+| `@cf/black-forest-labs/flux-1-schnell` | 5 → 5 | 5 → 4 | 0.40 | 1.00 | -0.60 | -150.0% |
+| `@cf/bytedance/stable-diffusion-xl-lightning` | 5 → 5 | 5 → 5 | 1.00 | 0.40 | 0.60 | 60.0% |
+| `@cf/stabilityai/stable-diffusion-xl-base-1.0` | 5 → 5 | 5 → 5 | 1.40 | 0.40 | 1.00 | 71.4% |
+| `@cf/lykon/dreamshaper-8-lcm` | 5 → 5 | 5 → 5 | 1.60 | 0.60 | 1.00 | 62.5% |
+
+## Per-tell frequency
+
+| tell | @cf/black-forest-labs/flux-1-schnell/raw | @cf/black-forest-labs/flux-1-schnell/compiled | @cf/bytedance/stable-diffusion-xl-lightning/raw | @cf/bytedance/stable-diffusion-xl-lightning/compiled | @cf/stabilityai/stable-diffusion-xl-base-1.0/raw | @cf/stabilityai/stable-diffusion-xl-base-1.0/compiled | @cf/lykon/dreamshaper-8-lcm/raw | @cf/lykon/dreamshaper-8-lcm/compiled |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| ahd/image/no-malformed-anatomy | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 20% |
+| ahd/image/no-midjourney-face-symmetry | 0% | 0% | 0% | 0% | 0% | 0% | 60% | 0% |
+| ahd/image/no-stock-diversity-casting | 0% | 0% | 0% | 0% | 20% | 0% | 0% | 0% |
+| ahd/no-ai-illustration | 0% | 0% | 0% | 0% | 20% | 40% | 80% | 40% |
+| ahd/no-corporate-memphis | 20% | 75% | 100% | 20% | 100% | 0% | 0% | 0% |
+| ahd/require-asymmetry | 20% | 25% | 0% | 20% | 0% | 0% | 20% | 0% |
+
+## Caveats
+- Image samples are scored by the vision critic over the AHD vision ruleset (13 rules: 9 web/graphic + 4 image-specific).
+- The critic is itself an LLM, so its verdicts are not independent of model training. Run with `--critic mock` for deterministic tests and report both results.
+- Per-cell counts are separate: attempted (runs initiated) / errored (API errors) / critiqued (scored). A large gap between attempted and critiqued indicates rate-limit or generator failures, not that a run 'passed' the taxonomy.
+- Raw condition: brief as prose with no AHD style direction or forbidden list. Compiled condition: token-driven positive + negative prompts.
+- The compiled negative prompt includes image-specific slop patterns (corporate memphis, malformed anatomy, iridescent blobs, decorative cursive). The raw condition does not.
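With only five samples per cell, every percentage in the per-tell table of the image report stands for a small whole number of images. The sketch below converts the `flux-1-schnell` columns back to counts and rebuilds that model's row in the per-model table. The report does not state its formulas, so the definitions used here (mean tells = total tells / critiqued samples, reduction = Δ / raw mean) are assumptions inferred from the published numbers. The helper name `tell_count` is hypothetical.

```python
from fractions import Fraction


def tell_count(pct, critiqued):
    """Convert a reported percentage back to a whole number of flagged samples."""
    return round(Fraction(pct, 100) * critiqued)


# Per-tell frequencies (percent) for flux-1-schnell, copied from the table.
# Row order: no-malformed-anatomy, no-midjourney-face-symmetry,
# no-stock-diversity-casting, no-ai-illustration, no-corporate-memphis,
# require-asymmetry.
flux_raw = [0, 0, 0, 0, 20, 20]       # 5 critiqued samples
flux_compiled = [0, 0, 0, 0, 75, 25]  # 4 critiqued samples

raw_total = sum(tell_count(p, 5) for p in flux_raw)
compiled_total = sum(tell_count(p, 4) for p in flux_compiled)

# Assumed definitions, inferred from the table rather than documented.
raw_mean = Fraction(raw_total, 5)
compiled_mean = Fraction(compiled_total, 4)
delta = raw_mean - compiled_mean
reduction_pct = 100 * delta / raw_mean

print(raw_total, compiled_total)  # 2 4
print(float(raw_mean), float(compiled_mean), float(delta), float(reduction_pct))
# 0.4 1.0 -0.6 -150.0
```

Read this way, the -150.0% figure for `flux-1-schnell` is the difference between 2 tells across 5 raw images and 4 tells across 4 compiled images, which is worth keeping in mind before treating it as a stable effect.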
diff --git a/docs/evals/monthly/2026-06-22-image.replay.json b/docs/evals/monthly/2026-06-22-image.replay.json
@@ -0,0 +1,114 @@
+{
+  "schema_version": 1,
+  "kind": "eval-image",
+  "ahd_version": "0.11.0",
+  "ahd_commit": "2fd291864ef39c898f6e0dc31f49973f108ae445",
+  "git_dirty": true,
+  "node_version": "v22.22.2",
+  "platform": "linux-x64",
+  "invoked_at": "2026-06-22T13:06:50.593Z",
+  "argv": [
+    "/opt/hostedtoolcache/node/22.22.2/x64/bin/node",
+    "/home/runner/work/ahd/ahd/bin/ahd.js",
+    "eval-image",
+    "editorial-illustration",
+    "--brief",
+    "briefs/editorial-illustration.yml",
+    "--models",
+    "cfimg:@cf/black-forest-labs/flux-1-schnell,cfimg:@cf/bytedance/stable-diffusion-xl-lightning,cfimg:@cf/stabilityai/stable-diffusion-xl-base-1.0,cfimg:@cf/lykon/dreamshaper-8-lcm",
+    "--n",
+    "5",
+    "--critic",
+    "anthropic",
+    "--report",
+    "docs/evals/monthly/2026-06-22-image.md"
+  ],
+  "token": {
+    "path": "/home/runner/work/ahd/ahd/tokens/editorial-illustration.yml",
+    "hash": "sha256:c2c79dec1b06fc45877d2b99be5c2a776aec12c226e0aeac38d384d698d9b721"
+  },
+  "brief": {
+    "path": "briefs/editorial-illustration.yml",
+    "hash": "sha256:ede77b5c41cf91ecac53272cf8d0e6c20ed5639ceb457c410a6cda03e7bca0f2"
+  },
+  "sampling": {
+    "n": 5,
+    "temperature": null,
+    "seed": null
+  },
+  "models": [
+    {
+      "id": "@cf/black-forest-labs/flux-1-schnell",
+      "provider": "cloudflare-workers-ai-image",
+      "provider_request_ids": [
+        "a0fb78fabf2f279a-LAX",
+        "a0fb790bcb5f279a-LAX",
+        "a0fb792108db279a-LAX",
+        "a0fb793a0dc0279a-LAX",
+        "a0fb79531d89279a-LAX",
+        "a0fb796d7ebc279a-LAX",
+        "a0fb798678c2279a-LAX",
+        "a0fb799deae4279a-LAX",
+        "a0fb79aceb78279a-LAX"
+      ]
+    },
+    {
+      "id": "@cf/bytedance/stable-diffusion-xl-lightning",
+      "provider": "cloudflare-workers-ai-image",
+      "provider_request_ids": [
+        "a0fb79c618a7279a-LAX",
+        "a0fb79ec4b60279a-LAX",
+        "a0fb7a143842279a-LAX",
+        "a0fb7a39beda279a-LAX",
+        "a0fb7a60fd4f279a-LAX",
+        "a0fb7a8aa837279a-LAX",
+        "a0fb7aac1d84279a-LAX",
+        "a0fb7acdcddb279a-LAX",
+        "a0fb7afa9f35279a-LAX",
+        "a0fb7b1dc942279a-LAX"
+      ]
+    },
+    {
+      "id": "@cf/stabilityai/stable-diffusion-xl-base-1.0",
+      "provider": "cloudflare-workers-ai-image",
+      "provider_request_ids": [
+        "a0fb7b492a50279a-LAX",
+        "a0fb7b993fd8279a-LAX",
+        "a0fb7bfd6b29279a-LAX",
+        "a0fb7c513d93279a-LAX",
+        "a0fb7cb86f711360-LAX",
+        "a0fb7d083b601360-LAX",
+        "a0fb7d520e641360-LAX",
+        "a0fb7da5dd0f1360-LAX",
+        "a0fb7df17d4f1360-LAX",
+        "a0fb7e427fc31360-LAX"
+      ]
+    },
+    {
+      "id": "@cf/lykon/dreamshaper-8-lcm",
+      "provider": "cloudflare-workers-ai-image",
+      "provider_request_ids": [
+        "a0fb7ea17d1b2f73-LAX",
+        "a0fb7ee539b62f73-LAX",
+        "a0fb7f273fb12f73-LAX",
+        "a0fb7f76caa6cb92-LAX",
+        "a0fb7fb99b4ccb92-LAX",
+        "a0fb7fff8932cb92-LAX",
+        "a0fb80492e12cb92-LAX",
+        "a0fb8088c93acb92-LAX",
+        "a0fb80cbdf04cb92-LAX",
+        "a0fb81109e86cb92-LAX"
+      ]
+    }
+  ],
+  "conditions": {
+    "requested": [
+      "raw",
+      "compiled"
+    ],
+    "effective": [
+      "raw",
+      "compiled"
+    ]
+  }
+}
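The manifest above lists provider request ids per model, which makes it possible to spot samples that never produced a captured request. The sketch below is a minimal check, assuming one provider request per attempted sample and no retries; neither is stated in the manifest. The function name `missing_request_ids` is hypothetical, and the embedded JSON is a trimmed copy that keeps only the fields the check reads and only the `flux-1-schnell` entry.

```python
import json

# Trimmed copy of the manifest: sampling, conditions, and one model entry.
MANIFEST_JSON = """
{
  "sampling": {"n": 5},
  "conditions": {"effective": ["raw", "compiled"]},
  "models": [
    {
      "id": "@cf/black-forest-labs/flux-1-schnell",
      "provider_request_ids": [
        "a0fb78fabf2f279a-LAX",
        "a0fb790bcb5f279a-LAX",
        "a0fb792108db279a-LAX",
        "a0fb793a0dc0279a-LAX",
        "a0fb79531d89279a-LAX",
        "a0fb796d7ebc279a-LAX",
        "a0fb798678c2279a-LAX",
        "a0fb799deae4279a-LAX",
        "a0fb79aceb78279a-LAX"
      ]
    }
  ]
}
"""


def missing_request_ids(manifest):
    """Return {model id: expected requests minus captured request ids}.

    Assumption: one provider request per attempted sample, no retries.
    """
    expected = manifest["sampling"]["n"] * len(manifest["conditions"]["effective"])
    return {
        model["id"]: expected - len(model["provider_request_ids"])
        for model in manifest["models"]
    }


gaps = missing_request_ids(json.loads(MANIFEST_JSON))
print(gaps)  # {'@cf/black-forest-labs/flux-1-schnell': 1}
```

A gap of one for `flux-1-schnell` is consistent with the `5 → 4` compiled cell in the image report, although the manifest alone does not say which condition the missing request belonged to.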
diff --git a/docs/evals/monthly/2026-06-22-source.md b/docs/evals/monthly/2026-06-22-source.md
@@ -0,0 +1,93 @@
+# ahd eval · swiss-editorial · 2026-06-22T12:52:02.326Z
+
+```yaml ahd-replay
+schema_version: 1
+kind: eval-live
+ahd_version: 0.11.0
+ahd_commit: 2fd291864ef39c898f6e0dc31f49973f108ae445
+git_dirty: true
+node_version: v22.22.2
+platform: linux-x64
+invoked_at: 2026-06-22T12:25:17.118Z
+token:
+  path: /home/runner/work/ahd/ahd/tokens/swiss-editorial.yml
+  hash: sha256:380a3d833d94
+brief:
+  path: briefs/landing.yml
+  hash: sha256:8b7d42759643
+sampling:
+  n: 30
+  temperature: null
+  seed: null
+models:
+  - id: "@cf/google/gemma-4-26b-a4b-it"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 55 captured
+  - id: "@cf/meta/llama-4-scout-17b-16e-instruct"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 60 captured
+  - id: "@cf/mistralai/mistral-small-3.1-24b-instruct"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 58 captured
+  - id: "@cf/openai/gpt-oss-120b"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 58 captured
+  - id: "@cf/qwen/qwen3-30b-a3b-fp8"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 60 captured
+conditions:
+  requested: [raw, compiled]
+  effective: [raw, compiled]
+```
+
+Replay this run:
+
+```sh
+git checkout 2fd291864ef3
+npm ci && npm run build
+node bin/ahd.js eval-live swiss-editorial --brief briefs/landing.yml --models cf:@cf/google/gemma-4-26b-a4b-it,cf:@cf/meta/llama-4-scout-17b-16e-instruct,cf:@cf/mistralai/mistral-small-3.1-24b-instruct,cf:@cf/openai/gpt-oss-120b,cf:@cf/qwen/qwen3-30b-a3b-fp8 --n 30 --sample-concurrency 6 --out evals --report docs/evals/monthly/2026-06-22-source.md
+```
+
+## Run
+
+- Brief: `briefs/landing.yml`
+- Samples per cell: **30**
+- Max tokens: 12000
+- Models:
+  - `@cf/google/gemma-4-26b-a4b-it` (cloudflare-workers-ai) · spec `cf:@cf/google/gemma-4-26b-a4b-it`
+  - `@cf/meta/llama-4-scout-17b-16e-instruct` (cloudflare-workers-ai) · spec `cf:@cf/meta/llama-4-scout-17b-16e-instruct`
+  - `@cf/mistralai/mistral-small-3.1-24b-instruct` (cloudflare-workers-ai) · spec `cf:@cf/mistralai/mistral-small-3.1-24b-instruct`
+  - `@cf/openai/gpt-oss-120b` (cloudflare-workers-ai) · spec `cf:@cf/openai/gpt-oss-120b`
+  - `@cf/qwen/qwen3-30b-a3b-fp8` (cloudflare-workers-ai) · spec `cf:@cf/qwen/qwen3-30b-a3b-fp8`
+
+## Per-model slop reduction
+
+| model | raw attempted → scored | compiled attempted → scored | raw mean tells | compiled mean tells | Δ | reduction |
+|---|---:|---:|---:|---:|---:|---:|
+| `@cf/google/gemma-4-26b-a4b-it` | 30 → 28 | 30 → 25 | 2.64 | 1.20 | 1.44 | 54.6% |
+| `@cf/meta/llama-4-scout-17b-16e-instruct` | 30 → 30 | 30 → 30 | 2.23 | 2.00 | 0.23 | 10.4% |
+| `@cf/mistralai/mistral-small-3.1-24b-instruct` | 30 → 30 | 30 → 30 | 3.40 | 1.27 | 2.13 | 62.7% |
+| `@cf/openai/gpt-oss-120b` | 30 → 30 | 30 → 28 | 3.43 | 0.79 | 2.65 | 77.1% |
+| `@cf/qwen/qwen3-30b-a3b-fp8` | 30 → 30 | 30 → 30 | 1.77 | 1.57 | 0.20 | 11.3% |
+
+## Per-tell frequency (scored samples only)
+
+| tell | @cf/google/gemma-4-26b-a4b-it/raw | @cf/google/gemma-4-26b-a4b-it/compiled | @cf/meta/llama-4-scout-17b-16e-instruct/raw | @cf/meta/llama-4-scout-17b-16e-instruct/compiled | @cf/mistralai/mistral-small-3.1-24b-instruct/raw | @cf/mistralai/mistral-small-3.1-24b-instruct/compiled | @cf/openai/gpt-oss-120b/raw | @cf/openai/gpt-oss-120b/compiled | @cf/qwen/qwen3-30b-a3b-fp8/raw | @cf/qwen/qwen3-30b-a3b-fp8/compiled |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| ahd/a11y/heading-skip | 0% | 12% | 0% | 0% | 0% | 7% | 0% | 0% | 0% | 0% |
+| ahd/line-height-per-size | 79% | 0% | 27% | 100% | 50% | 17% | 100% | 0% | 47% | 43% |
+| ahd/no-default-grotesque | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 4% | 0% | 0% |
+| ahd/no-em-dashes-in-prose | 0% | 0% | 0% | 0% | 0% | 0% | 3% | 0% | 0% | 0% |
+| ahd/radius-hierarchy | 64% | 4% | 23% | 100% | 100% | 3% | 80% | 0% | 17% | 33% |
+| ahd/require-named-grid | 0% | 0% | 73% | 0% | 100% | 57% | 17% | 0% | 7% | 0% |
+| ahd/require-type-pairing | 21% | 0% | 100% | 0% | 90% | 0% | 70% | 0% | 90% | 3% |
+| ahd/tracking-per-size | 0% | 20% | 0% | 0% | 0% | 30% | 0% | 11% | 0% | 0% |
+| ahd/weight-variety | 100% | 84% | 0% | 0% | 0% | 13% | 73% | 64% | 17% | 77% |
+
+## Caveats
+- Scoring runs the deterministic AHD linter (38 source-level rules) over every sample that passes a basic HTML sanity check.
+- Counts reported per cell: attempted (runs initiated) / errored (API / runtime errors) / extractionFailed (response contained no usable HTML) / scored (linted). A large gap between attempted and scored is a signal that the model is struggling with the instruction, not that it passed the taxonomy.
+- Raw condition: the brief is expanded as plain prose (intent + audience + surfaces + mustInclude + mustAvoid) with no AHD system prompt, no style token, no forbidden list. Compiled condition: same brief plus the AHD-compiled system prompt. The only thing that differs between conditions is the AHD intervention.
+- Vision-only tells (14 rules in the critic) are not scored in this pipeline; run the critic on rendered screenshots for full taxonomy coverage.
+- Tells-per-page is a proxy metric: a thin page has little surface for rules to fire against. Read the Δ alongside the actual rendered HTML, not in isolation.
+- Model versions change. See the run manifest for exact canonical model ids.
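The reduction column in the source report cannot always be reproduced from the two-decimal means printed next to it. For `gemma-4-26b-a4b-it`, the rounded means 2.64 and 1.20 give 54.5%, while the table says 54.6%. The sketch below recovers whole-number tell counts from the per-tell percentages and recomputes the figure from unrounded means. It assumes the report rounds percentages to whole numbers and computes reduction as (raw mean - compiled mean) / raw mean; both are inferred from the data, not documented. The helper name `tell_total` is hypothetical.

```python
def tell_total(percentages, scored):
    """Sum per-tell counts, rounding each percentage back to whole samples."""
    return sum(round(p * scored / 100) for p in percentages)


# gemma-4-26b-a4b-it columns from the per-tell table, top to bottom:
# heading-skip, line-height-per-size, no-default-grotesque,
# no-em-dashes-in-prose, radius-hierarchy, require-named-grid,
# require-type-pairing, tracking-per-size, weight-variety.
gemma_raw_pct = [0, 79, 0, 0, 64, 0, 21, 0, 100]     # 28 scored samples
gemma_compiled_pct = [12, 0, 0, 0, 4, 0, 0, 20, 84]  # 25 scored samples

raw_total = tell_total(gemma_raw_pct, 28)
compiled_total = tell_total(gemma_compiled_pct, 25)

raw_mean = raw_total / 28
compiled_mean = compiled_total / 25

# Reduction from unrounded means versus from the printed two-decimal means.
exact = 100 * (raw_mean - compiled_mean) / raw_mean
from_rounded = 100 * (2.64 - 1.20) / 2.64

print(raw_total, compiled_total)                    # 74 30
print(round(raw_mean, 2), round(compiled_mean, 2))  # 2.64 1.2
print(round(exact, 1), round(from_rounded, 1))      # 54.6 54.5
```

The same approach applies to the other rows. When a recomputed value is off by 0.1 from the table, rounding of the displayed means is the likely cause rather than an error in the report.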