Pre-existing assertion failures on Qwen3.5-35B-A3B-4bit (--concurrent + --enable-grammar-constraints): streaming tool-call coercion drift + same-seed non-determinism

## Summary

Running `Scripts/test-assertions.sh --tier full` against `mlx-community/Qwen3.5-35B-A3B-4bit` with `--concurrent 4 --enable-prefix-caching --enable-grammar-constraints --tool-call-parser afm_adaptive_xml` on port 9999 reproduces the same 20 failures on `main` (`ab90ac1`) AND on its parent (`a8dbffa`, pre-PR-126 / pre-tokenize+OpenAPI). They are not regressions from the PR-122/123/124-from-PR-126 bundle that just landed — but **they are not all attributed to existing open issues**, so this issue tracks the unattributed cluster as a pre-existing-but-unowned set.

Eight of the 20 fails are already covered by **#86** (Concurrent x8 prefix cache + grammar returns empty responses) — Section 13 grammar tests + the two Concurrent x8 shared-prefix tests. This issue tracks the **remaining 12** that #86 does not name.

## Reproducer

```bash
MACAFM_MLX_MODEL_CACHE=/path/to/cache \
  afm mlx -m mlx-community/Qwen3.5-35B-A3B-4bit \
  --port 9999 --tool-call-parser afm_adaptive_xml \
  --enable-prefix-caching --enable-grammar-constraints --concurrent 4 &

./Scripts/test-assertions.sh --tier full \
  --model mlx-community/Qwen3.5-35B-A3B-4bit \
  --port 9999 --grammar-constraints
```

Expected: PASS on the listed assertions.
Actual: FAIL — same set on both `main` and `a8dbffa`.

## Failure clusters not covered by #86

### Cluster A: streaming XML tool-call argument coercion (Sections 11, 12)

The model emits XML tool calls under `afm_adaptive_xml` with malformed/duplicated argument bodies during streaming (`{"location":"Tokyo","unit":"celsius"}{"location":"Tokyo","unit":"celsius"}`) or wrong types (`celsius=str(string("22.5"))`).

| Assertion | Section | Symptom |
|---|---|---|
| Streaming: XML tool call assembles valid JSON args | 11 | `args not valid JSON: {...}{...}` (duplicated JSON object) |
| Streaming: array param is JSON array (not string) | 11 | array passed as string |
| Parameter values are correct string types | 11 | `location=str(), unit=str()` (empty) |
| Streaming tool call emits valid deltas | 12 | `assembled args not valid JSON: {...}{...}` |
| Number (float) and boolean coercion | 12 | `celsius=str(string("22.5")), enabled=str(string("True"))` |
| Streaming array coercion (incremental path) | 12 | `args not valid JSON: {...}{...}` |

CLAUDE.md already documents at the model level: "Qwen3-Coder XML format … Known issue: sometimes emits duplicate `<parameter=key>` tags or JSON objects instead of strings for tool parameters." The streaming path under `--concurrent` appears to amplify this. The non-streaming versions of the same assertions PASS.

### Cluster B: same-seed non-determinism (Sections 6, 16)

Two assertions issue the exact same prompt + seed + sampling params twice and assert byte-equal completions. They fail.

| Assertion | Section | Symptom |
|---|---|---|
| streaming parity (same seed → same output) | 16 | `mismatch` |
| cache idempotency (same seed → same output) | 6 | `mismatch` |

Possible cause: MoE expert dispatch under `--concurrent` batch decode is order-sensitive, so two requests at slightly different batch positions can take different routing paths even with identical seed + temperature. This may not be a server bug — but per the CLAUDE.md rule 7 ("never claim 'non-determinism' without proof"), it needs investigation rather than dismissal.

### Cluster C: strict-wiring streaming (Section 14)

| Assertion | Symptom |
|---|---|
| Streaming json_schema strict:true returns valid JSON | bad JSON / no completion |
| Streaming tool strict:true returns valid tool call | bad / missing tool call |

These overlap with #86's "concurrent + grammar" theme but exercise specifically the streaming + strict combination on tool calls vs json_schema, not the concurrent-x8 prefix path #86 names. Consider extending #86 or treating as separate.

### Cluster D: complex schema (Section 13)

| Assertion | Symptom |
|---|---|
| Complex schema: string + int + array + object | invalid JSON arguments: Extra data |

Borderline — could be #86 (it's a grammar test), or a separate xgrammar-on-this-shape issue.

## Evidence per CLAUDE.md rule 7

- Branch run: `test-reports/assertions-report-20260510_095413.html` (commit `ab90ac1`, all 4 PRs landed)
- Baseline run: `test-reports/assertions-report-20260510_100154.html` (commit `a8dbffa`, parent of PR-126)
- `comm -3` on the two failure lists returns empty → identical failure sets → all fails predate PR-126.

## Suggested next steps

1. Reproduce Cluster A on a smaller Qwen3-Coder model in serial mode (no `--concurrent`) to isolate whether the streaming JSON duplication is concurrency-induced or generic to Qwen3-Coder's XML format under streaming.
2. For Cluster B, add per-step expert-routing logging in `BatchScheduler` and verify whether two same-seed runs at different batch positions take divergent dispatches.
3. For Cluster C/D, decide whether to fold them into #86 (concurrent + grammar umbrella) or split into focused tickets after step 1 + 2.

## Out of scope

- Fixing #86 itself — tracked separately.
- Cluster A on non-Qwen3-Coder models — not yet measured.

## Affected versions

- `main` at `ab90ac1` (current HEAD)
- `a8dbffa` (parent of PR-126) — same fails
- Likely all earlier commits where this combination is exercised

## Test environment

- Apple Silicon (this run: M3 Ultra-class), macOS 26
- Model: `mlx-community/Qwen3.5-35B-A3B-4bit`
- Server flags: `--port 9999 --tool-call-parser afm_adaptive_xml --enable-prefix-caching --enable-grammar-constraints --concurrent 4`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-existing assertion failures on Qwen3.5-35B-A3B-4bit (--concurrent + --enable-grammar-constraints): streaming tool-call coercion drift + same-seed non-determinism #127

Summary

Reproducer

Failure clusters not covered by #86

Cluster A: streaming XML tool-call argument coercion (Sections 11, 12)

Cluster B: same-seed non-determinism (Sections 6, 16)

Cluster C: strict-wiring streaming (Section 14)

Cluster D: complex schema (Section 13)

Evidence per CLAUDE.md rule 7

Suggested next steps

Out of scope

Affected versions

Test environment

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Assertion	Section	Symptom
Streaming: XML tool call assembles valid JSON args	11	`args not valid JSON: {...}{...}` (duplicated JSON object)
Streaming: array param is JSON array (not string)	11	array passed as string
Parameter values are correct string types	11	`location=str(), unit=str()` (empty)
Streaming tool call emits valid deltas	12	`assembled args not valid JSON: {...}{...}`
Number (float) and boolean coercion	12	`celsius=str(string("22.5")), enabled=str(string("True"))`
Streaming array coercion (incremental path)	12	`args not valid JSON: {...}{...}`

Assertion	Section	Symptom
streaming parity (same seed → same output)	16	`mismatch`
cache idempotency (same seed → same output)	6	`mismatch`

Assertion	Symptom
Streaming json_schema strict:true returns valid JSON	bad JSON / no completion
Streaming tool strict:true returns valid tool call	bad / missing tool call

Pre-existing assertion failures on Qwen3.5-35B-A3B-4bit (--concurrent + --enable-grammar-constraints): streaming tool-call coercion drift + same-seed non-determinism #127

Description

Summary

Reproducer

Failure clusters not covered by #86

Cluster A: streaming XML tool-call argument coercion (Sections 11, 12)

Cluster B: same-seed non-determinism (Sections 6, 16)

Cluster C: strict-wiring streaming (Section 14)

Cluster D: complex schema (Section 13)

Evidence per CLAUDE.md rule 7

Suggested next steps

Out of scope

Affected versions

Test environment

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions