Scale Infra Controller to 10,000 managed nodes

Infra Controller needs to scale to support deployments with 10,000 managed nodes under a single controller domain. At this size, core workflows must remain reliable and operationally usable even when large numbers of nodes are discovered, provisioned, monitored, or updated concurrently.

This issue tracks the broader scale target and the work needed to validate and improve controller behavior at 10,000-node scale.

### Problem

Large deployments put pressure on several parts of the system:

1. Ingestion throughput and discovery
2. Inventory size, persistence, and query performance
3. Provisioning workflows such as DHCP/PXE allocation, firmware updates, OS installation, and lifecycle transitions
4. Health and telemetry ingestion
5. API and CLI responsiveness under high load
6. Background monitors that poll or scrape external systems

The controller should handle this scale without requiring operators to manually stagger routine workflows or recover from avoidable timeout/backpressure failures.

### Feature Description

1. **Bulk node ingestion and registration**

Scenario: A site operator brings a large deployment online, causing up to 10,000 nodes and related infrastructure endpoints to check in, register, or refresh state.

Success criteria: Nodes are discovered, validated, and registered without dropped updates, sustained database contention, discovery deadlocks, or retry storms.

2. **Fleet-wide provisioning and lifecycle workflows**

Scenario: The operator runs coordinated workflows across a large fleet, such as OS provisioning, firmware updates, tenant sanitization, or other lifecycle transitions.

Success criteria: Workflows complete reliably across the fleet within operational maintenance windows, with bounded concurrency, clear progress reporting, and retry behavior that does not overload shared services.
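As a rough illustration of "bounded concurrency" plus retries that do not pile onto shared services, here is a minimal Python sketch. It is not Infra Controller code: `run_bounded`, `action`, and the default limits are hypothetical names and values chosen for the example. It caps in-flight operations with a semaphore and spaces retries with full-jitter exponential backoff.

```python
import asyncio
import random


async def run_bounded(node_ids, action, max_concurrency=200,
                      max_attempts=3, base_delay=0.5):
    """Run `action(node_id)` for every node, at most `max_concurrency` at once.

    `action` is a hypothetical async callable standing in for a per-node
    workflow step (for example, a firmware update request).
    Returns {node_id: ("ok", result) | ("failed", error_text)}.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    results = {}

    async def run_one(node_id):
        for attempt in range(1, max_attempts + 1):
            try:
                # Hold a slot only while the action is actually running.
                async with semaphore:
                    results[node_id] = ("ok", await action(node_id))
                return
            except Exception as exc:
                if attempt == max_attempts:
                    results[node_id] = ("failed", repr(exc))
                    return
            # Full-jitter backoff so retries from many nodes do not align.
            ceiling = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(random.uniform(0, ceiling))

    await asyncio.gather(*(run_one(n) for n in node_ids))
    return results
```

The per-node result map also gives a simple basis for progress reporting: the number of entries versus the fleet size, split by status.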

3. **Controller responsiveness under peak load**

Scenario: The deployment is active and continuously reporting inventory, health, and telemetry data while operators and automation query the API.

Success criteria: API and CLI operations remain responsive. Inventory and health queries complete within acceptable latency even while background ingestion and monitoring are active.

4. **Efficient monitoring at large scale**

Scenario: Background monitors scrape or poll BMCs, fabric managers, inventory sources, and other external systems across the fleet.

Success criteria: Monitoring completes within acceptable refresh windows, avoids unbounded full-site payloads where possible, and prevents slow or unreachable endpoints from blocking progress for healthy endpoints.
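One way to keep slow or unreachable endpoints from blocking healthy ones is to give every poll its own deadline and record the outcome instead of waiting. The sketch below assumes a hypothetical async `poll(endpoint)` callable; `poll_all`, the timeout, and the concurrency limit are illustrative, not existing controller settings.

```python
import asyncio


async def poll_all(endpoints, poll, timeout_s=5.0, max_concurrency=100):
    """Poll every endpoint with a per-endpoint deadline.

    Returns a list of (endpoint, status, payload) tuples in input order,
    where status is "ok", "timeout", or "error".
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def poll_one(endpoint):
        async with semaphore:
            try:
                payload = await asyncio.wait_for(poll(endpoint), timeout_s)
                return endpoint, "ok", payload
            except asyncio.TimeoutError:
                # The slow poll is cancelled; its slot is freed for others.
                return endpoint, "timeout", None
            except Exception as exc:
                return endpoint, "error", repr(exc)

    return await asyncio.gather(*(poll_one(e) for e in endpoints))
```

With this shape, a monitoring pass takes roughly `ceil(len(endpoints) / max_concurrency) * timeout_s` in the worst case, which gives a starting point for reasoning about refresh windows.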

### Acceptance Criteria

- Infra Controller has a documented 10,000-node scale target and performance profile.
- Key workflows are tested or benchmarked at representative scale.
- Ingestion, provisioning, monitoring, and API workflows have bounded concurrency and observable backpressure behavior.
- Large inventory payloads are reduced or paginated where needed.
- Slow, unreachable, or unhealthy endpoints do not disproportionately block fleet-wide progress.
- Metrics and logs make it possible to identify scale bottlenecks.
- Follow-up issues are linked for specific scaling improvements.
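
For the criterion on reducing or paginating large inventory payloads, one possible shape is keyset (cursor) pagination, where each response carries the last node ID seen and the next request resumes after it. The sketch below is illustrative only: `list_nodes_page`, `iter_all_nodes`, and the in-memory `inventory` dict are hypothetical stand-ins for an indexed database query.

```python
import bisect


def list_nodes_page(inventory, cursor=None, limit=500):
    """Return (records, next_cursor) for one page of `inventory`.

    `inventory` maps node_id -> record. `cursor` is the last node_id of the
    previous page, or None for the first page. `next_cursor` is None once
    the final page has been returned.
    """
    # A real store would use an indexed "WHERE id > cursor ORDER BY id" query
    # rather than sorting in memory on every call.
    ids = sorted(inventory)
    start = bisect.bisect_right(ids, cursor) if cursor is not None else 0
    page_ids = ids[start:start + limit]
    has_more = start + limit < len(ids)
    next_cursor = page_ids[-1] if has_more else None
    return [inventory[i] for i in page_ids], next_cursor


def iter_all_nodes(inventory, limit=500):
    """Walk the whole inventory one bounded page at a time."""
    cursor = None
    while True:
        records, cursor = list_nodes_page(inventory, cursor, limit)
        yield from records
        if cursor is None:
            return
```

Because the cursor is a node ID rather than an offset, nodes registered during a walk do not shift later pages.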


### Scaling follow-ups

  - [ ] #2940

