Pre-existing assertion failures on Qwen3.5-35B-A3B-4bit (--concurrent + --enable-grammar-constraints): streaming tool-call coercion drift + same-seed non-determinism #127

@scouzi1966

Summary

Running Scripts/test-assertions.sh --tier full against mlx-community/Qwen3.5-35B-A3B-4bit, served on port 9999 with --concurrent 4 --enable-prefix-caching --enable-grammar-constraints --tool-call-parser afm_adaptive_xml, reproduces the same 20 failures on main (ab90ac1) and on its parent (a8dbffa, pre-PR-126 / pre-tokenize+OpenAPI). They are therefore not regressions from the PR-122/123/124-from-PR-126 bundle that just landed. Not all of them are attributed to existing open issues, however, so this issue tracks the unattributed cluster as a pre-existing but unowned set.

Eight of the 20 failures are already covered by #86 (Concurrent x8 prefix cache + grammar returns empty responses): the Section 13 grammar tests plus the two Concurrent x8 shared-prefix tests. This issue tracks the remaining 12 that #86 does not name.

Reproducer

MACAFM_MLX_MODEL_CACHE=/path/to/cache \
  afm mlx -m mlx-community/Qwen3.5-35B-A3B-4bit \
  --port 9999 --tool-call-parser afm_adaptive_xml \
  --enable-prefix-caching --enable-grammar-constraints --concurrent 4 &

./Scripts/test-assertions.sh --tier full \
  --model mlx-community/Qwen3.5-35B-A3B-4bit \
  --port 9999 --grammar-constraints

Expected: PASS on the listed assertions.
Actual: FAIL — same set on both main and a8dbffa.

Failure clusters not covered by #86

Cluster A: streaming XML tool-call argument coercion (Sections 11, 12)

The model emits XML tool calls under afm_adaptive_xml with malformed/duplicated argument bodies during streaming ({"location":"Tokyo","unit":"celsius"}{"location":"Tokyo","unit":"celsius"}) or wrong types (celsius=str(string("22.5"))).

Assertion | Section | Symptom
Streaming: XML tool call assembles valid JSON args | 11 | args not valid JSON: {...}{...} (duplicated JSON object)
Streaming: array param is JSON array (not string) | 11 | array passed as string
Parameter values are correct string types | 11 | location=str(), unit=str() (empty)
Streaming tool call emits valid deltas | 12 | assembled args not valid JSON: {...}{...}
Number (float) and boolean coercion | 12 | celsius=str(string("22.5")), enabled=str(string("True"))
Streaming array coercion (incremental path) | 12 | args not valid JSON: {...}{...}

CLAUDE.md already documents at the model level: "Qwen3-Coder XML format … Known issue: sometimes emits duplicate <parameter=key> tags or JSON objects instead of strings for tool parameters." The streaming path under --concurrent appears to amplify this. The non-streaming versions of the same assertions PASS.
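To tell the duplicated-object symptom apart from other malformed output, the assembled argument string can be scanned for back-to-back top-level JSON values. The following is a minimal sketch, not the harness's actual check. It assumes the streamed argument fragments have already been collected into a list of strings, and the names `split_top_level_json` and `classify_args` are hypothetical.

```python
import json


def split_top_level_json(text):
    """Return every top-level JSON value found back to back in text."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


def classify_args(fragments):
    """Classify assembled streaming tool-call arguments (hypothetical helper)."""
    assembled = "".join(fragments)
    try:
        values = split_top_level_json(assembled)
    except json.JSONDecodeError:
        return "invalid"
    if len(values) == 1:
        return "valid"
    if len(values) > 1 and all(v == values[0] for v in values):
        return "duplicated"  # the {...}{...} symptom reported above
    return "invalid"
```

A "duplicated" result would point at the same object being emitted twice, which is a different failure from truncated or otherwise malformed JSON.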

Cluster B: same-seed non-determinism (Sections 6, 16)

Two assertions issue the exact same prompt + seed + sampling params twice and assert byte-equal completions. They fail.

Assertion | Section | Symptom
streaming parity (same seed → same output) | 16 | mismatch
cache idempotency (same seed → same output) | 6 | mismatch

Possible cause: MoE expert dispatch under --concurrent batch decode is order-sensitive, so two requests at slightly different batch positions can take different routing paths even with identical seed + temperature. This may not be a server bug, but per CLAUDE.md rule 7 ("never claim 'non-determinism' without proof"), it needs investigation rather than dismissal.
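One way to gather that proof is to record where two same-seed completions first diverge, rather than only that they differ. The sketch below is illustrative: `complete` is a hypothetical callable that sends the request to the server and returns the completion text, and neither helper exists in the repository.

```python
def first_divergence(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One string is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def check_same_seed(complete, request, runs=2):
    """Call complete(request) `runs` times and compare each output to the first.

    Returns a list of (run_number, index, first_excerpt, other_excerpt)
    tuples, one per run that diverged. An empty list means byte-equal output.
    """
    outputs = [complete(dict(request)) for _ in range(runs)]
    report = []
    for n, out in enumerate(outputs[1:], start=1):
        idx = first_divergence(outputs[0], out)
        if idx is not None:
            report.append((n, idx, outputs[0][idx:idx + 40], out[idx:idx + 40]))
    return report
```

Running the same comparison with and without --concurrent would show whether the divergence point depends on batching.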

Cluster C: strict-wiring streaming (Section 14)

Assertion | Symptom
Streaming json_schema strict:true returns valid JSON | bad JSON / no completion
Streaming tool strict:true returns valid tool call | bad / missing tool call

These overlap with #86's "concurrent + grammar" theme, but they specifically exercise the streaming + strict:true combination (one assertion on json_schema, one on tool calls), not the concurrent-x8 prefix path that #86 names. Consider extending #86 or treating these as separate.

Cluster D: complex schema (Section 13)

Assertion | Symptom
Complex schema: string + int + array + object | invalid JSON arguments: Extra data

Borderline: this could belong to #86 (it is a grammar test), or it could be a separate xgrammar issue specific to this schema shape.
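The "Extra data" wording matches the message Python's json module raises when valid JSON is followed by more content. That suggests, though the report does not confirm it, that the arguments begin with a complete JSON value and the failure is trailing output. A small sketch for splitting the two parts, with `describe_json_error` as a hypothetical helper name:

```python
import json


def describe_json_error(text):
    """Return None if text is valid JSON, else (message, prefix, remainder).

    For an "Extra data" error, prefix is the part that parsed cleanly and
    remainder is the trailing content that the grammar should have prevented.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return (err.msg, text[:err.pos], text[err.pos:])
    return None
```

If the remainder is a second copy of the prefix, this failure likely shares a cause with the duplicated objects in Cluster A rather than with the schema shape.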

Evidence per CLAUDE.md rule 7

  • Branch run: test-reports/assertions-report-20260510_095413.html (commit ab90ac1, all 4 PRs landed)
  • Baseline run: test-reports/assertions-report-20260510_100154.html (commit a8dbffa, parent of PR-126)
  • comm -3 on the two failure lists returns empty → identical failure sets → all fails predate PR-126.
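The comm -3 comparison can also be done without pre-sorting the inputs. This is a minimal sketch that assumes the failed assertion names from each report have been extracted into plain lists of strings; the extraction step depends on the report format and is not shown, and `diff_failures` is a hypothetical name.

```python
def diff_failures(branch, baseline):
    """Return (only_in_branch, only_in_baseline) as sorted lists.

    Two empty lists mean the failure sets are identical, which is the
    condition used above to conclude that the failures predate PR-126.
    """
    branch_set, baseline_set = set(branch), set(baseline)
    return sorted(branch_set - baseline_set), sorted(baseline_set - branch_set)
```

Anything in the first list would be a candidate regression from the branch. Anything in the second would be a failure the branch fixed.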

Suggested next steps

  1. Reproduce Cluster A on a smaller Qwen3-Coder model in serial mode (no --concurrent) to isolate whether the streaming JSON duplication is concurrency-induced or generic to Qwen3-Coder's XML format under streaming.
  2. For Cluster B, add per-step expert-routing logging in BatchScheduler and verify whether two same-seed runs at different batch positions take divergent dispatches.
  3. For Cluster C/D, decide whether to fold them into #86 (Concurrent x8 prefix cache + grammar returns empty responses, the concurrent + grammar umbrella) or split them into focused tickets after steps 1 and 2.

Out of scope

Affected versions

  • main at ab90ac1 (current HEAD)
  • a8dbffa (parent of PR-126) — same fails
  • Likely all earlier commits where this combination is exercised

Test environment

  • Apple Silicon (this run: M3 Ultra-class), macOS 26
  • Model: mlx-community/Qwen3.5-35B-A3B-4bit
  • Server flags: --port 9999 --tool-call-parser afm_adaptive_xml --enable-prefix-caching --enable-grammar-constraints --concurrent 4
