Higher inference throughput on DGX possible? #469

In this repo, https://github.com/GanyX19/deepseek-v4-1m-on-dgx-spark/blob/main/docs/benchmark-results.md, the same model at FP8 gets 37 t/s single-stream and 100 t/s aggregate. That setup uses 2x DGX Spark, but token generation speed shouldn't scale with more GPUs since decoding is sequential, right? Is there still untapped potential to get 4 times the generation speed with this engine, or am I missing something?
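To put the two quoted figures side by side, here is a minimal sketch of the arithmetic. It uses only the 37 t/s and 100 t/s numbers from the linked benchmark page; the function name `batching_gain` is made up for illustration, and the number of concurrent streams behind the aggregate figure is not stated here, so no per-stream rate under load is derived.

```python
# Figures quoted from the linked benchmark page (2x DGX Spark, FP8).
SINGLE_STREAM_TPS = 37.0   # tokens/s for one request at a time
AGGREGATE_TPS = 100.0      # tokens/s summed over concurrent requests


def batching_gain(single_stream_tps: float, aggregate_tps: float) -> float:
    """Return how many times higher the aggregate rate is than one stream.

    Hypothetical helper: it only divides the two reported numbers and
    says nothing about how many streams produced the aggregate figure.
    """
    if single_stream_tps <= 0:
        raise ValueError("single_stream_tps must be positive")
    return aggregate_tps / single_stream_tps


gain = batching_gain(SINGLE_STREAM_TPS, AGGREGATE_TPS)
print(f"aggregate / single-stream = {gain:.2f}x")
```

The ratio only says that the aggregate rate is higher than what a single request sees. The question above is about the single-stream number.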
