What happened?
With a CUDA build using GGML_SCHED_MAX_COPIES=2, llama-server produces corrupted output when graph reuse is enabled. Almost every token is emitted twice:
The capital capital of of France France is is Paris Paris.. Paris Paris is is not not only only the the capital capital but but also also the the largest largest city city in in France France..
Adding --no-graph-reuse fixes the output. I expected graph reuse not to change the generated text.
Name and Version
llama-server.exe --version
version: 4633 (b3dfb785)
built with MSVC 19.51.36248.0
What operating system are you seeing the problem on?
Windows
Relevant log output
llama_init_from_model: graph_reuse = 1
llama_init_from_model: pipeline parallelism enabled (n_copies=2)
llama_init_from_model: graph splits = 3