Running Scripts/test-assertions.sh --tier full against mlx-community/Qwen3.5-35B-A3B-4bit with --concurrent 4 --enable-prefix-caching --enable-grammar-constraints --tool-call-parser afm_adaptive_xml on port 9999 reproduces the same 20 failures on main (ab90ac1) AND on its parent (a8dbffa, pre-PR-126 / pre-tokenize+OpenAPI). They are not regressions from the PR-122/123/124-from-PR-126 bundle that just landed — but they are not all attributed to existing open issues, so this issue tracks the unattributed cluster as a pre-existing-but-unowned set.
Eight of the 20 fails are already covered by #86 (Concurrent x8 prefix cache + grammar returns empty responses) — Section 13 grammar tests + the two Concurrent x8 shared-prefix tests. This issue tracks the remaining 12 that #86 does not name.
Cluster A: streaming XML tool-call argument coercion (Sections 11, 12)
The model emits XML tool calls under afm_adaptive_xml with malformed/duplicated argument bodies during streaming ({"location":"Tokyo","unit":"celsius"}{"location":"Tokyo","unit":"celsius"}) or wrong types (celsius=str(string("22.5"))).
| Assertion | Section | Symptom |
|---|---|---|
| Streaming: XML tool call assembles valid JSON args | 11 | args not valid JSON: {...}{...} (duplicated JSON object) |
CLAUDE.md already documents at the model level: "Qwen3-Coder XML format … Known issue: sometimes emits duplicate <parameter=key> tags or JSON objects instead of strings for tool parameters." The streaming path under --concurrent appears to amplify this. The non-streaming versions of the same assertions PASS.
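To triage the Cluster A symptoms, it may help to separate "the same JSON object emitted twice" from other malformed argument strings. The following is a minimal sketch, not the test suite's actual check: `classify_tool_args` is a hypothetical helper, and it assumes the assembled arguments are available as a single string. It relies only on `json.JSONDecoder.raw_decode` from the Python standard library, which parses one JSON value and reports where it ended.

```python
import json


def classify_tool_args(raw: str) -> str:
    """Classify an assembled tool-call argument string (hypothetical helper).

    Returns one of:
      "valid"      - exactly one JSON value
      "duplicated" - the same JSON value repeated back to back
      "extra_data" - a valid JSON value followed by anything else
      "invalid"    - no JSON value could be parsed at the start
    """
    decoder = json.JSONDecoder()
    text = raw.strip()
    try:
        # raw_decode parses a single value and returns (value, end_index).
        first, end = decoder.raw_decode(text)
    except json.JSONDecodeError:
        return "invalid"

    rest = text[end:].strip()
    if not rest:
        return "valid"

    try:
        second, end2 = decoder.raw_decode(rest)
    except json.JSONDecodeError:
        return "extra_data"

    # Two equal values and nothing after them: the {...}{...} symptom.
    if second == first and not rest[end2:].strip():
        return "duplicated"
    return "extra_data"
```

Running this over the assembled arguments from both the streaming and non-streaming paths would show whether every streaming failure is an exact duplication or whether some are a different kind of corruption.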
Cluster B: same-seed non-determinism (Sections 6, 16)
Two assertions issue the exact same prompt + seed + sampling params twice and assert byte-equal completions. They fail.
| Assertion | Section | Symptom |
|---|---|---|
| streaming parity (same seed → same output) | 16 | mismatch |
| cache idempotency (same seed → same output) | 6 | mismatch |
Possible cause: MoE expert dispatch under --concurrent batch decode is order-sensitive, so two requests at slightly different batch positions can take different routing paths even with identical seed + temperature. This may not be a server bug — but per the CLAUDE.md rule 7 ("never claim 'non-determinism' without proof"), it needs investigation rather than dismissal.
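When investigating a same-seed mismatch, knowing where the two completions first diverge is more useful than a bare "mismatch". The sketch below is an illustration only: `first_divergence` and `describe_mismatch` are hypothetical helpers, and it assumes the two completions have already been collected as strings.

```python
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # One string is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def describe_mismatch(a: str, b: str, context: int = 20) -> str:
    """Summarize where two completions diverge, with surrounding context."""
    idx = first_divergence(a, b)
    if idx is None:
        return "identical"
    start = max(0, idx - context)
    return (
        f"diverge at char {idx}: "
        f"{a[start:idx + context]!r} vs {b[start:idx + context]!r}"
    )
```

Recording the divergence index across repeated runs is one way to gather evidence before drawing any conclusion about the cause.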
Cluster C: strict-wiring streaming (Section 14)
These overlap with #86's "concurrent + grammar" theme but exercise specifically the streaming + strict combination on tool calls vs json_schema, not the concurrent-x8 prefix path #86 names. Consider extending #86 or treating as separate.
Cluster D: complex schema (Section 13)
| Assertion | Symptom |
|---|---|
| Complex schema: string + int + array + object | invalid JSON arguments: Extra data |
Borderline — could be #86 (it's a grammar test), or a separate xgrammar-on-this-shape issue.
Evidence per CLAUDE.md rule 7
Branch run: test-reports/assertions-report-20260510_095413.html (commit ab90ac1, all 4 PRs landed)
Baseline run: test-reports/assertions-report-20260510_100154.html (commit a8dbffa, parent of PR-126)
comm -3 on the two failure lists returns empty → identical failure sets → all fails predate PR-126.
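The `comm -3` comparison can also be expressed as a set difference, which does not require the inputs to be sorted. This is a sketch under assumptions: `failure_diff` is a hypothetical helper, and how the failing assertion names are extracted from the two HTML reports is not shown here.

```python
def failure_diff(branch: list[str], baseline: list[str]) -> tuple[list[str], list[str]]:
    """Return (only_in_branch, only_in_baseline) for two failure-name lists.

    Two empty lists mean the failure sets are identical, which corresponds
    to `comm -3` printing nothing on sorted inputs.
    """
    branch_set, baseline_set = set(branch), set(baseline)
    return sorted(branch_set - baseline_set), sorted(baseline_set - branch_set)
```

A non-empty first list would indicate failures introduced on the branch; a non-empty second list would indicate failures the branch fixed.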
Suggested next steps
Reproduce Cluster A on a smaller Qwen3-Coder model in serial mode (no --concurrent) to isolate whether the streaming JSON duplication is concurrency-induced or generic to Qwen3-Coder's XML format under streaming.
For Cluster B, add per-step expert-routing logging in BatchScheduler and verify whether two same-seed runs at different batch positions take divergent dispatches.
Reproducer
    MACAFM_MLX_MODEL_CACHE=/path/to/cache \
      afm mlx -m mlx-community/Qwen3.5-35B-A3B-4bit \
      --port 9999 --tool-call-parser afm_adaptive_xml \
      --enable-prefix-caching --enable-grammar-constraints --concurrent 4 &

    ./Scripts/test-assertions.sh --tier full \
      --model mlx-community/Qwen3.5-35B-A3B-4bit \
      --port 9999 --grammar-constraints
Expected: PASS on the listed assertions.
Actual: FAIL, same set on both main and a8dbffa.
Failure clusters not covered by #86
Cluster A symptoms recorded under streaming (Sections 11, 12):
args not valid JSON: {...}{...} (duplicated JSON object)
location=str(), unit=str() (empty)
assembled args not valid JSON: {...}{...}
celsius=str(string("22.5")), enabled=str(string("True"))
args not valid JSON: {...}{...}
Affected versions
main at ab90ac1 (current HEAD)
a8dbffa (parent of PR-126), same fails
Test environment
mlx-community/Qwen3.5-35B-A3B-4bit
--port 9999 --tool-call-parser afm_adaptive_xml --enable-prefix-caching --enable-grammar-constraints --concurrent 4