Scale Infra Controller to 10,000 managed nodes #2939

Description

@williampnvidia

Infra Controller needs to scale to support deployments with 10,000 managed nodes under a single controller domain. At this size, core workflows must remain reliable and operationally usable even when large numbers of nodes are discovered, provisioned, monitored, or updated concurrently.

This issue tracks the broader scale target and the work needed to validate and improve controller behavior at 10,000-node scale.

Problem

Large deployments put pressure on several parts of the system:

  1. Ingestion throughput and discovery
  2. Inventory size, persistence, and query performance
  3. Provisioning workflows such as DHCP/PXE allocation, firmware updates, OS installation, and lifecycle transitions
  4. Health and telemetry ingestion
  5. API and CLI responsiveness under high load
  6. Background monitors that poll or scrape external systems

The controller should handle this scale without requiring operators to manually stagger routine workflows or recover from avoidable timeout/backpressure failures.

Feature Description

  1. Bulk node ingestion and registration

Scenario: A site operator brings a large deployment online, causing up to 10,000 nodes and related infrastructure endpoints to check in, register, or refresh state.

Success criteria: Nodes are discovered, validated, and registered without dropped updates, database contention, discovery deadlocks, or excessive retry storms.
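
One common way to avoid retry storms during bulk check-in is capped exponential backoff with full jitter, so that thousands of nodes retrying at once do not stay synchronized. The following is a minimal sketch, not the controller's actual retry logic: `backoff_delay`, `call_with_retries`, and the `base`/`cap` defaults are hypothetical names and values chosen for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay in seconds for a zero-based retry attempt.

    The ceiling doubles with each attempt and is capped at `cap`;
    the actual delay is drawn uniformly from [0, ceiling].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(op, max_attempts=5, sleep=time.sleep, **backoff):
    """Call `op()` and retry on ConnectionError with jittered backoff.

    `op` stands in for a hypothetical registration call. The last
    failure is re-raised once `max_attempts` is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, **backoff))
```

The `sleep` parameter is injectable so the behavior can be exercised in tests without real waiting.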

  2. Fleet-wide provisioning and lifecycle workflows

Scenario: The operator runs coordinated workflows across a large fleet, such as OS provisioning, firmware updates, tenant sanitization, or other lifecycle transitions.

Success criteria: Workflows complete reliably across the fleet within operational maintenance windows, with bounded concurrency, clear progress reporting, and retry behavior that does not overload shared services.
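
Bounded concurrency with progress reporting could look roughly like the sketch below, which uses an `asyncio.Semaphore` to cap in-flight operations. This is an illustration under assumptions: `run_bounded`, `action`, and the `max_concurrency` default are hypothetical and do not reflect an existing controller API.

```python
import asyncio


async def run_bounded(node_ids, action, max_concurrency=50):
    """Run `await action(node_id)` for every node, at most
    `max_concurrency` at a time, and collect per-node outcomes."""
    sem = asyncio.Semaphore(max_concurrency)
    results = {"succeeded": [], "failed": []}

    async def worker(node_id):
        async with sem:  # blocks while max_concurrency workers are active
            try:
                await action(node_id)
                results["succeeded"].append(node_id)
            except Exception:
                # One failing node must not abort the rest of the fleet.
                results["failed"].append(node_id)

    await asyncio.gather(*(worker(n) for n in node_ids))
    return results
```

The `results` dictionary can be read while the run is in progress to report how many nodes have succeeded or failed so far.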

  3. Controller responsiveness under peak load

Scenario: The deployment is active and continuously reporting inventory, health, and telemetry data while operators and automation query the API.

Success criteria: API and CLI operations remain responsive. Inventory and health queries complete within acceptable latency even while background ingestion and monitoring are active.

  4. Efficient monitoring at large scale

Scenario: Background monitors scrape or poll BMCs, fabric managers, inventory sources, and other external systems across the fleet.

Success criteria: Monitoring completes within acceptable refresh windows, avoids unbounded full-site payloads where possible, and prevents slow or unreachable endpoints from blocking progress for healthy endpoints.
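
To keep a slow or unreachable endpoint from blocking healthy ones, each poll can carry its own timeout and run independently of the others. The sketch below assumes a hypothetical async `poll(endpoint)` coroutine; `poll_all` and `timeout_s` are likewise illustrative names, not part of any existing monitor.

```python
import asyncio


async def poll_all(endpoints, poll, timeout_s=5.0):
    """Poll every endpoint concurrently with a per-endpoint timeout.

    Returns {endpoint: result}; an endpoint that times out maps to None
    instead of delaying or failing the whole sweep.
    """

    async def poll_one(endpoint):
        try:
            value = await asyncio.wait_for(poll(endpoint), timeout=timeout_s)
            return endpoint, value
        except asyncio.TimeoutError:
            return endpoint, None

    pairs = await asyncio.gather(*(poll_one(e) for e in endpoints))
    return dict(pairs)
```

At fleet scale this would likely be combined with a concurrency limit so that a sweep does not open thousands of connections at once.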

Acceptance Criteria

  • Infra Controller has a documented 10,000-node scale target and performance profile.
  • Key workflows are tested or benchmarked at representative scale.
  • Ingestion, provisioning, monitoring, and API workflows have bounded concurrency and observable backpressure behavior.
  • Large inventory payloads are reduced or paginated where needed.
  • Slow, unreachable, or unhealthy endpoints do not disproportionately block fleet-wide progress.
  • Metrics and logs make it possible to identify scale bottlenecks.
  • Follow-up issues are linked for specific scaling improvements.
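
For the pagination criterion above, one option is cursor (keyset) pagination, where each request returns a bounded page plus a cursor for the next one. The sketch below works over an in-memory sorted list purely for illustration; `list_nodes`, the `after` cursor, and the `limit` default are hypothetical and not an existing API.

```python
from bisect import bisect_right


def list_nodes(sorted_ids, after=None, limit=500):
    """Return (page, next_cursor) from a sorted list of node IDs.

    `after` is the last ID seen by the caller, or None for the first
    page. `next_cursor` is None once no further IDs remain.
    """
    start = 0 if after is None else bisect_right(sorted_ids, after)
    page = sorted_ids[start:start + limit]
    has_more = start + limit < len(sorted_ids)
    next_cursor = page[-1] if page and has_more else None
    return page, next_cursor
```

Because the cursor is an ID rather than an offset, pages stay stable when nodes are added or removed between requests.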

Scaling follow-ups
