Metal SSD streaming: tool-call-quality exact path misses routed expert views #455

@andreaborio

Summary

./ds4_test --tool-call-quality fails under Metal SSD streaming in the exact (quality=true) path. The fast path completes; the exact path then stops before tool-call parsing because ds4_session_eval() fails when Metal cannot wrap routed-expert model ranges.

I found this while validating #454, but it reproduces on a clean upstream/main worktree at 80ebbc3, so it appears independent of that PR.

Environment

  • Machine: Apple M5 Pro
  • RAM: 64 GiB
  • Backend: Metal SSD streaming
  • Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
  • Reproduced on: clean upstream/main at 80ebbc3

Repro

make ds4_test

env DS4_TEST_MODEL=/path/to/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
    DS4_TEST_SSD_STREAMING=1 \
    DS4_TEST_SSD_STREAMING_CACHE_GB=16 \
    ./ds4_test --tool-call-quality

Relevant log

tool-call-quality:
ds4-test: tool-call quality fast path
...
ds4-test: tool-call quality exact path
ds4: Metal SSD streaming mode enabled; full model residency and warmup are skipped
ds4: SSD streaming initial metal model map restricted to token embedding (1 spans, 0.99 GiB tensor span)
ds4: metal backend initialized for graph diagnostics
ds4: Metal model range 0.01..0.53 GiB is not covered by mapped model views
ds4: Metal model range 1.19..1.70 GiB is not covered by mapped model views
ds4: Metal model range 0.53..1.19 GiB is not covered by mapped model views
tests/ds4_test.c:2008: assertion failed: decode_ok
tests/ds4_test.c:2010: assertion failed: calls.len > 0
tests/ds4_test.c:2011: assertion failed: calls.len > 0 && !strcmp(calls.v[0].name, "list_files")
tool-call-quality: ERR

Notes

With DS4_METAL_STREAMING_MAP_TRACE=1, the exact path maps decode/static spans successfully, then fails on routed expert ranges that are not covered by the current model views.

My read is that the SSD-streaming decode spans intentionally exclude uniform routed expert tensors because the fast path serves those via the streaming expert cache. In quality=true, though, the selected-slot fast kernels are disabled, so the exact fallback asks for the full gate/up/down routed tensors through ds4_gpu_wrap_model_range(), which requires those ranges to already be covered by mapped model views.
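To illustrate the kind of coverage requirement described above, here is a minimal sketch. It is not the ds4 code: `model_view` and `range_is_covered` are hypothetical names, and the rule that a range must fit inside a single mapped view is an assumption made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one mapped model view: a byte span of the model file. */
typedef struct {
    uint64_t offset; /* first byte covered by the view */
    uint64_t length; /* number of bytes covered */
} model_view;

/* Returns true when [offset, offset + length) lies entirely inside one view.
 * Assumption: a range is not stitched together from several views. */
static bool range_is_covered(const model_view *views, size_t n_views,
                             uint64_t offset, uint64_t length) {
    for (size_t i = 0; i < n_views; i++) {
        uint64_t view_start = views[i].offset;
        uint64_t view_end = view_start + views[i].length;
        if (offset >= view_start && offset <= view_end &&
            length <= view_end - offset) {
            return true;
        }
    }
    return false; /* no view contains the range, so a wrap request would fail */
}
```

Under this model, if the initial map holds only the token-embedding span, any request for a routed-expert range outside that span returns false. That would be consistent with the "not covered by mapped model views" lines in the log, assuming the exact path requests full expert tensors that were never mapped.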

So this looks like a Metal SSD-streaming exact-path mapping issue rather than a DSML/tool-call parser issue: the calls.len assertions at tests/ds4_test.c:2010 and :2011 only fail as a consequence of the decode_ok failure at :2008.
