In this repo, https://github.com/GanyX19/deepseek-v4-1m-on-dgx-spark/blob/main/docs/benchmark-results.md, the same model at FP8 gets 37 t/s single-stream and 100 t/s aggregated. It uses 2x DGX Sparks, but token generation speed shouldn't scale with more GPUs since generation is sequential, right? Is there still some untapped potential to get 4 times the generation speed with this engine? Or am I missing something?