Summary
./ds4_test --tool-call-quality fails under Metal SSD streaming in the exact / quality=true path. The fast path completes first; the exact path stops before tool-call parsing because ds4_session_eval() fails after Metal cannot wrap routed-expert model ranges.
I found this while validating #454, but it reproduces on a clean upstream/main worktree at 80ebbc3, so it appears independent of that PR.
Environment
- Machine: Apple M5 Pro
- RAM: 64 GiB
- Backend: Metal SSD streaming
- Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
- Reproduced on: clean upstream/main at 80ebbc3
Repro
make ds4_test
env DS4_TEST_MODEL=/path/to/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
DS4_TEST_SSD_STREAMING=1 \
DS4_TEST_SSD_STREAMING_CACHE_GB=16 \
./ds4_test --tool-call-quality
Relevant log
tool-call-quality:
ds4-test: tool-call quality fast path
...
ds4-test: tool-call quality exact path
ds4: Metal SSD streaming mode enabled; full model residency and warmup are skipped
ds4: SSD streaming initial metal model map restricted to token embedding (1 spans, 0.99 GiB tensor span)
ds4: metal backend initialized for graph diagnostics
ds4: Metal model range 0.01..0.53 GiB is not covered by mapped model views
ds4: Metal model range 1.19..1.70 GiB is not covered by mapped model views
ds4: Metal model range 0.53..1.19 GiB is not covered by mapped model views
tests/ds4_test.c:2008: assertion failed: decode_ok
tests/ds4_test.c:2010: assertion failed: calls.len > 0
tests/ds4_test.c:2011: assertion failed: calls.len > 0 && !strcmp(calls.v[0].name, "list_files")
tool-call-quality: ERR
Notes
With DS4_METAL_STREAMING_MAP_TRACE=1, the exact path maps decode/static spans successfully, then fails on routed expert ranges that are not covered by the current model views.
My read is that the SSD-streaming decode spans intentionally exclude uniform routed expert tensors because the fast path serves those via the streaming expert cache. In quality=true, though, the selected-slot fast kernels are disabled, so the exact fallback asks for the full gate/up/down routed tensors through ds4_gpu_wrap_model_range(), which requires those ranges to already be covered by mapped model views.
So this looks like a Metal SSD-streaming exact-path mapping issue rather than a DSML/tool-call parser issue: the tool-call assertions are just cascading after decode stops.
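To make the suspected failure mode concrete, here is a minimal sketch of the kind of coverage check that wrapping a model range would need to pass. It is not the ds4 implementation: the `view_span` type and `range_is_covered()` are hypothetical names for illustration, and I am assuming mapped views can be treated as byte intervals sorted by start offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type (not from ds4): one mapped model view, [begin, end) in bytes. */
typedef struct {
    uint64_t begin;
    uint64_t end;
} view_span;

/* Returns true if [lo, hi) lies entirely inside the union of the views.
 * Assumes the views are sorted by begin; overlapping or adjacent views are fine. */
static bool range_is_covered(const view_span *views, size_t n,
                             uint64_t lo, uint64_t hi) {
    assert(lo <= hi);
    uint64_t cursor = lo; /* first byte not yet known to be covered */
    for (size_t i = 0; i < n && cursor < hi; i++) {
        if (views[i].end <= cursor) {
            continue;     /* this view ends before the cursor */
        }
        if (views[i].begin > cursor) {
            return false; /* gap at cursor; later views start even further right */
        }
        cursor = views[i].end; /* view covers the cursor, so advance past it */
    }
    return cursor >= hi;
}
```

Under this model, if the initial map holds only the token-embedding span, any routed-expert range outside that span fails the check, which would match the "not covered by mapped model views" lines in the log.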