Carbide needs to support large deployments with up to 10,000 managed DPUs running host-based networking in a single fabric.
Current platform limits appear to be below this target. We need to work with the host-based networking stack owners to understand the current scale limits, identify the bottlenecks, and determine what changes are required to support 10,000 managed DPUs in one deployment. This work must also account for the migration to DPF/Kubernetes.
Problem
Large deployments may require a single controller domain and fabric to manage 10,000 DPUs. If the networking stack or controller integration is limited below that scale, deployment bring-up and steady-state operations may fail or require unsupported partitioning.
We need a clear 10,000-DPU scale plan across Infra Controller, the DPU management model, the host-based networking stack, and the DPF/Kubernetes deployment model.
Potential areas to investigate
- Confirm the current supported DPU scale limits for host-based networking and the DPF/Kubernetes deployment model.
- Identify whether the current limits are caused by fabric scale, control-plane state, agent behavior, controller integration, DPF/Kubernetes scale, API limits, or operational validation coverage.
- Determine what needs to change to support 10,000 managed DPUs in a single fabric.
- Define test or benchmark coverage for validating 10,000-DPU scale.
- Document any required configuration, deployment constraints, or staged rollout guidance.
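One possible way to structure the benchmark coverage and staged rollout guidance is to validate at geometrically increasing fleet sizes, ending at the 10,000-DPU target. The sketch below is only illustrative: the function name `ramp_stages`, the starting size of 500, and the growth factor of 2 are all assumptions, not values taken from the platform or agreed with the owning teams.

```python
def ramp_stages(target: int = 10_000, start: int = 500, factor: int = 2) -> list[int]:
    """Return DPU counts to validate, growing geometrically and ending at target.

    All defaults except the 10,000-DPU target are hypothetical placeholders.
    """
    if start <= 0 or factor < 2 or target < start:
        raise ValueError("require start > 0, factor >= 2, and target >= start")

    stages = []
    size = start
    # Add each intermediate stage until the next one would reach the target.
    while size < target:
        stages.append(size)
        size *= factor
    # Always finish exactly at the target, even if it is not a power-of-factor step.
    stages.append(target)
    return stages


if __name__ == "__main__":
    for count in ramp_stages():
        print(f"validate bring-up and steady state at {count} DPUs")
```

Running each stage to completion before moving to the next should make it easier to attribute a failure to a specific scale range rather than discovering it only at the full target.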
Acceptance criteria
- Current host-based networking and DPF/Kubernetes DPU scale limits are confirmed with the owning teams.
- Required changes to support 10,000 managed DPUs are identified and tracked.
- Infra Controller assumptions about DPU scale are reviewed and updated where needed.
- A validation plan exists for 10,000 managed DPUs in a single deployment.
- Follow-up issues are filed for any implementation work discovered during investigation.
Related to #2939