TL;DR. First — the Sampler::Argmax + .constrain_with(JsonSchema) pattern is wonderful; it lets us guarantee well-formed structured output by construction, which is the right primitive. While building on it, we noticed that on Qwen3-8B the constraint appears to take effect under the canonical system → user → cue shape but not under system → user → user → cue (two consecutive user calls before cue). We think the non-alternating pattern is the root cause and is technically non-idiomatic — every pie example uses strict alternation — but the API accepts it silently and we wanted to ask whether constraint enforcement should be robust to it (or whether calling out the expected turn pattern in the docs would be enough).
Environment. pie 0.3.0 · Qwen/Qwen3-8B · portable + Metal · macOS.
Canonical pattern (works as expected)
ctx.system(SYSTEM_PROMPT);
ctx.user(USER_MESSAGE);
ctx.cue();
let text = ctx.generate(Sampler::Argmax)
    .max_tokens(N)
    .constrain_with(JsonSchema(SCHEMA))?
    .collect_text()
    .await?;
// → valid JSON, schema-conforming.
This matches every example we found in pie's inferlets/ directory (json-schema-validation, constrained-decoding, template-generation, demo-grammar, agent-react, agent-codeact, recursion-of-thought).
What we tried (and why)
PR-Agent forwards an OpenAI-shaped messages array. The initial inferlet walked the array with the appropriate role-filler, then appended a small trailing user-turn (/no_think-style directive) before ctx.cue():
for msg in &input.messages { // ends with a user-role message
    match msg.role.as_str() {
        "system" => { ctx.system(&msg.content); }
        "user" => { ctx.user(&msg.content); }
        "assistant" => { ctx.assistant(&msg.content); }
        _ => return Err("unsupported role".into()),
    }
}
ctx.user("/no_think\n\nReply with valid JSON only matching the schema above.");
ctx.cue();
// same generate(...).constrain_with(...).collect_text() as above
// → runaway `<think>` block, schema not honored.
That produces two consecutive <|im_start|>user...<|im_end|> blocks with no assistant (or cue → generate) between them. We realized afterward this isn't a pattern any pie example uses — all canonical flows maintain strict user/assistant alternation.
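To make the resulting prompt shape concrete, here is a minimal sketch of a ChatML-style renderer. `Turn` and `render_chatml` are hypothetical names for illustration, not pie APIs, and the exact whitespace that pie or the Qwen3 chat template emits may differ.

```rust
/// Hypothetical stand-in for one chat turn; pie's real Context type is not modeled here.
pub struct Turn<'a> {
    pub role: &'a str,
    pub content: &'a str,
}

/// Render turns in a ChatML-like layout, then open an assistant turn
/// (the equivalent of the cue step). Assumed layout, not pie's actual template code.
pub fn render_chatml(turns: &[Turn<'_>]) -> String {
    let mut out = String::new();
    for t in turns {
        out.push_str("<|im_start|>");
        out.push_str(t.role);
        out.push('\n');
        out.push_str(t.content);
        out.push_str("<|im_end|>\n");
    }
    // Leave the assistant turn open so generation continues from here.
    out.push_str("<|im_start|>assistant\n");
    out
}
```

Rendering a system turn followed by two user turns with this sketch yields two adjacent `<|im_start|>user` blocks ahead of the open assistant turn, which is the shape described above.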
Variant isolation
To isolate whether the trigger was PR-Agent's content, prompt size, or the turn shape, we ran six variants on the same engine/model:
| Variant | Setup | Output |
|---|---|---|
| 1 | neutral sys + 1 neutral user + cue | JSON ✓ |
| 2 | neutral sys + 2 neutral users + cue | JSON ✓ |
| 3 | neutral sys + 1 long (16k tok) user + cue | JSON ✓ |
| 4 | neutral sys + 1 long user + 1 short user + cue | JSON ✓ |
| 5 | PR-Agent sys + PR-Agent user + cue | JSON ✓ |
| 6 | PR-Agent sys + PR-Agent user + 1 short user + cue | NL ✗ |
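Variant 5 (canonical alternation, real PR-Agent content) works. Variant 6 (same content + one trailing non-alternating user turn) breaks. Variants 2 and 4 also place two user turns back to back and still produce JSON, so the failure seems to need the PR-Agent content together with the trailing user turn rather than the turn shape alone. Removing that trailing user turn restores constraint enforcement; that's the workaround we shipped as pr-review@0.1.2.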
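A possible alternative to dropping the trailing directive is to fold adjacent same-role messages into one turn before calling the role fillers, so the directive survives while alternation is preserved. The sketch below is untested against pie and is not the shipped workaround; `Msg` and `merge_adjacent` are hypothetical names, and whether the merged prompt keeps the constraint effective on Qwen3-8B is an open assumption.

```rust
/// Hypothetical message type mirroring one entry of an OpenAI-shaped `messages` array.
#[derive(Debug, Clone, PartialEq)]
pub struct Msg {
    pub role: String,
    pub content: String,
}

/// Fold each run of adjacent same-role messages into a single message,
/// joining their contents with a blank line.
pub fn merge_adjacent(msgs: &[Msg]) -> Vec<Msg> {
    let mut out: Vec<Msg> = Vec::new();
    for m in msgs {
        // Does the previously emitted message have the same role as this one?
        let same_role = out.last().map_or(false, |prev| prev.role == m.role);
        if same_role {
            let prev = out.last_mut().expect("non-empty when same_role is true");
            prev.content.push_str("\n\n");
            prev.content.push_str(&m.content);
        } else {
            out.push(m.clone());
        }
    }
    out
}
```

With this normalization, appending the directive as an extra `user` message and then merging would produce a single user turn that ends with the directive.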
The question
Grammar constraints are documented as enforcing structural validity at every generated token, which (naively) reads as "robust to any prompt shape." On the other hand, non-alternating turn sequences are non-idiomatic and out-of-distribution for chat-tuned models, so unexpected behavior on them is at least defensible.
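For reference, the mental model behind that naive reading is masked greedy decoding: at each step the grammar marks which tokens are allowed, and argmax runs over only those. The toy sketch below illustrates that general technique; it is not pie's implementation, and `constrained_argmax` and the `allowed` mask are hypothetical.

```rust
/// Pick the highest-scoring token among those the grammar currently allows.
/// `allowed[i]` is true when emitting token `i` keeps the output inside the grammar.
/// Returns None when no token is allowed.
pub fn constrained_argmax(logits: &[f32], allowed: &[bool]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, (&score, &ok)) in logits.iter().zip(allowed).enumerate() {
        // Skip disallowed tokens; otherwise keep the first strictly larger score.
        if ok && best.map_or(true, |b| score > logits[b]) {
            best = Some(i);
        }
    }
    best
}
```

Under this model the prompt only influences the logits, never the mask, which is why a prompt-shape-dependent failure is surprising.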
Possible resolutions, listed from cheapest to most invasive:
1. Document the canonical turn pattern. Note in the Context filler docs (or the JsonSchema constraint docs) that turn-fillers are designed for alternating sequences, and that constraint enforcement is most reliable on patterns matching the example inferlets.
2. Soft warn at the SDK boundary. Optionally log a debug message when two consecutive same-role fillers are appended.
3. Investigate whether the constraint mechanism could be made prompt-shape-robust. Lower priority if (1) is in place.
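The soft-warn option could be as small as the check sketched below. It works on a plain list of role names; `first_repeated_role` and `alternation_warning` are hypothetical helpers, and how such a check would hook into pie's Context is left open.

```rust
/// Return the index of the first turn whose role equals the role of the turn before it.
pub fn first_repeated_role(roles: &[&str]) -> Option<usize> {
    roles.windows(2).position(|w| w[0] == w[1]).map(|i| i + 1)
}

/// Build a debug-level warning message for a non-alternating sequence, if any.
pub fn alternation_warning(roles: &[&str]) -> Option<String> {
    first_repeated_role(roles).map(|i| {
        format!(
            "turn {} repeats role '{}' with no turn of another role in between",
            i, roles[i]
        )
    })
}
```

An SDK could call a check like this each time a filler is appended and log the message, without rejecting the call.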
We're happy with (1) — the workaround is trivial once you know to look for it, and downstream we've documented the gotcha for our users. Filing this in case the observation is useful to the project.
Reproducer
- Inferlets: pr-review and pr-review-diag* at https://github.com/shsym/pie-pr-review (current pr-review@0.1.2 ships the canonical-alternation workaround).
- Trigger content: captured PR-Agent payload at tests/e2e/payloads/2026-05-14-pr1-pr-agent.json. Feed it through the broken-code path above to reproduce.
The variant table narrows the trigger: schema, model, and sampler are constant across variants 1–6, variants 5 and 6 differ only in the trailing user turn, and only variant 6 fails.
Thanks for the great work on grammar-constrained generation — it's been a big quality-of-life win for our use case.