Support 10,000 managed DPUs in a single Carbide deployment #2940

Description

@williampnvidia

Carbide needs to support large deployments with up to 10,000 managed DPUs running host-based networking in a single fabric.

Current platform limits appear to be below this target. We need to work with the owners of the host-based networking stack, and account for the migration to DPF/Kubernetes, to establish the current scale limits, identify the bottlenecks, and determine what changes are required to support 10,000 managed DPUs in one deployment.

Problem

Large deployments may require a single controller domain and fabric to manage 10,000 DPUs. If the networking stack or the controller integration cannot reach that scale, deployment bring-up and steady-state operations may fail, or the deployment may have to be partitioned in ways that are not supported.

We need a clear 10,000-DPU scale plan across Infra Controller, the DPU management model, the host-based networking stack, and the DPF/Kubernetes deployment model.

Potential areas to investigate

  • Confirm the current supported DPU scale limits for host-based networking and the DPF/Kubernetes deployment model.
  • Identify whether the current limits are caused by fabric scale, control-plane state, agent behavior, controller integration, DPF/Kubernetes scale, API limits, or operational validation coverage.
  • Determine what needs to change to support 10,000 managed DPUs in a single fabric.
  • Define test or benchmark coverage for validating 10,000-DPU scale.
  • Document any required configuration, deployment constraints, or staged rollout guidance.
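
As a starting point for the test coverage and staged rollout items above, the validation could be structured as a ramp that raises the managed DPU count in stages and records where it first fails. The following is a minimal sketch, not an existing Carbide or DPF tool: `check` is a hypothetical callback standing in for whatever bring-up or steady-state validation the owning teams define, and the starting size and doubling step are assumptions. Only the 10,000 target comes from this issue.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

TARGET_DPUS = 10_000  # target stated in this issue


@dataclass
class StageResult:
    dpu_count: int
    passed: bool


def ramp_stages(target: int = TARGET_DPUS, start: int = 1_000) -> List[int]:
    """Return doubling stage sizes that always end exactly at `target`.

    `start` and the doubling step are illustrative assumptions.
    """
    if target < 1 or start < 1:
        raise ValueError("target and start must be positive")
    stages = []
    size = start
    while size < target:
        stages.append(size)
        size *= 2
    stages.append(target)
    return stages


def run_ramp(
    check: Callable[[int], bool],
    target: int = TARGET_DPUS,
    start: int = 1_000,
) -> List[StageResult]:
    """Run the hypothetical `check` at each stage and stop at the first failure."""
    results = []
    for dpu_count in ramp_stages(target, start):
        passed = check(dpu_count)
        results.append(StageResult(dpu_count, passed))
        if not passed:
            break
    return results


def highest_passing(results: List[StageResult]) -> Optional[int]:
    """Return the largest DPU count that passed, or None if no stage passed."""
    passed = [r.dpu_count for r in results if r.passed]
    return max(passed) if passed else None
```

Stopping at the first failing stage gives a bracket (last passing count, first failing count) that can be used to scope the bottleneck investigation before attempting the full 10,000-DPU run.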

Acceptance criteria

  • Current host-based networking and DPF/Kubernetes DPU scale limits are confirmed with the owning teams.
  • Required changes to support 10,000 managed DPUs are identified and tracked.
  • Infra Controller assumptions about DPU scale are reviewed and updated where needed.
  • A validation plan exists for 10,000 managed DPUs in a single deployment.
  • Follow-up issues are filed for any implementation work discovered during investigation.

Related to #2939
