Higher inference throughput on DGX possible? #469

Description

@underdogest

In this repo, https://github.com/GanyX19/deepseek-v4-1m-on-dgx-spark/blob/main/docs/benchmark-results.md, the same model at FP8 gets 37 t/s single-stream and 100 t/s aggregated. Although it uses 2x DGX Spark units, token generation speed shouldn't scale with more GPUs since decoding is sequential, right? Is there still some untapped potential to get 4 times the generation speed with this engine, or am I missing something?
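
To make the two quoted figures easier to compare, here is a minimal Python sketch of how aggregated and single-stream throughput relate. The function name is hypothetical and used only for illustration; the 37 t/s and 100 t/s values are the ones quoted above, and nothing else about the benchmark setup is assumed.

```python
def aggregate_to_single_ratio(aggregate_tps: float, single_tps: float) -> float:
    """Return how many single-stream equivalents the aggregated figure represents.

    Both arguments are throughputs in tokens per second. The result says
    nothing about how many concurrent requests were actually run; it is
    only the ratio of the two numbers.
    """
    if single_tps <= 0:
        raise ValueError("single_tps must be positive")
    return aggregate_tps / single_tps


# Figures quoted from the linked benchmark page (FP8, 2x DGX Spark).
ratio = aggregate_to_single_ratio(aggregate_tps=100.0, single_tps=37.0)
print(f"aggregated / single-stream = {ratio:.2f}x")
```

The aggregated number counts tokens across all concurrent requests, so it can be higher than the single-stream number even if each individual request decodes no faster.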
