Bug: GGML_SCHED_MAX_COPIES=2 produces repeated tokens when graph reuse is enabled #2006

Description

@KeinNiemand

What happened?

With a CUDA build using GGML_SCHED_MAX_COPIES=2, llama-server produces bad output when graph reuse is enabled. The output repeats almost every token:

The capital capital of of France France is is Paris Paris.. Paris Paris is is not not only only the the capital capital but but also also the the largest largest city city in in France France..

Adding --no-graph-reuse fixes the output. I expected graph reuse not to change the generated text.
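To compare runs with and without --no-graph-reuse more objectively, the doubling described above can be measured on the generated text. The following is a minimal sketch, not part of llama-server: the function name `adjacent_repeat_ratio` is hypothetical, and it works on whitespace-separated words rather than real model tokens, so it only approximates token-level repetition.

```python
import string


def adjacent_repeat_ratio(text: str) -> float:
    """Return the fraction of adjacent word pairs in which both words match.

    Words are split on whitespace, stripped of surrounding punctuation and
    lowercased. This is a word-level approximation of token-level doubling.
    """
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]  # drop items that were only punctuation
    if len(words) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(words, words[1:]) if prev == cur)
    return repeats / (len(words) - 1)


if __name__ == "__main__":
    # Paste the server output for each run here and compare the two ratios.
    sample = "The capital capital of of France France is is Paris Paris.."
    print(f"adjacent repeat ratio: {adjacent_repeat_ratio(sample):.2f}")
```

A ratio close to zero is what normal text should give, while output in which almost every word is doubled should land around one half, since roughly every second adjacent pair matches.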

Name and Version

llama-server.exe --version
version: 4633 (b3dfb785)
built with MSVC 19.51.36248.0

What operating system are you seeing the problem on?

Windows

Relevant log output

llama_init_from_model: graph_reuse   = 1
llama_init_from_model: pipeline parallelism enabled (n_copies=2)
llama_init_from_model: graph splits = 3
