Is this a new feature, an enhancement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request?
High
Please provide a clear description of the problem this feature solves
Today, when a tenant reports a machine issue that needs provider action, the common path is a disruptive full repair: the tenant releases the instance, NICo cleans or quarantines the machine, and repair proceeds afterward.
That path is too heavy for issues that can be investigated or resolved while the tenant keeps the instance assignment. Full repair can cause avoidable downtime, loss of assignment context, manual coordination, and ambiguity about whether the machine should remain blocked from normal allocation.
Online Repair solves this by allowing a privileged tenant admin or provider admin to place an assigned instance into the Repairing state without releasing it, while applying a repair health override that signals to the platform and to repair automation that the machine is under active repair.
Feature Description
Online Repair enables a controlled, tenant-retained repair workflow for assigned machines.
The workflow allows an authorized user to start online repair using the machine update API:
PATCH /v2/org/{org}/nico/machine/{machineId}
When online repair is enabled, NICo validates that the machine is assigned to an instance and that the assigned instance is in the Ready state. It then records online repair metadata, captures the reported health issue, applies the online repair health override, and moves the assigned instance from Ready to Repairing.
The tenant keeps the instance assignment throughout the workflow. The repair state remains active until online repair is explicitly disabled. When disabled, NICo clears the repair health override and metadata, and the instance can move back to Ready.
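The enable and disable steps described above can be sketched as a small state machine. This is an illustrative sketch only, not NICo's implementation: the `Machine` and `Instance` types, the `ONLINE_REPAIR_OVERRIDE` label, and the function names are all hypothetical. It also assumes the instance returns to Ready immediately on disable, whereas the description only says it can move back.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class InstanceState(Enum):
    READY = "Ready"
    REPAIRING = "Repairing"


class OnlineRepairError(Exception):
    """Raised when an online repair transition is not allowed."""


@dataclass
class Instance:
    instance_id: str
    state: InstanceState = InstanceState.READY


@dataclass
class Machine:
    machine_id: str
    instance: Optional[Instance] = None      # None means the machine is unassigned
    health_override: Optional[str] = None    # hypothetical override label
    repair_metadata: dict = field(default_factory=dict)


ONLINE_REPAIR_OVERRIDE = "online-repair"  # hypothetical name for the health override


def enable_online_repair(machine: Machine, health_issue: str, requested_by: str) -> None:
    # Validation: the machine must be assigned, and its instance must be Ready.
    if machine.instance is None:
        raise OnlineRepairError("machine is not assigned to an instance")
    if machine.instance.state is not InstanceState.READY:
        raise OnlineRepairError("assigned instance is not in Ready state")
    # Record metadata and the reported issue, apply the override, then transition.
    machine.repair_metadata = {"health_issue": health_issue, "requested_by": requested_by}
    machine.health_override = ONLINE_REPAIR_OVERRIDE
    machine.instance.state = InstanceState.REPAIRING


def disable_online_repair(machine: Machine) -> None:
    if machine.health_override != ONLINE_REPAIR_OVERRIDE:
        raise OnlineRepairError("online repair is not active")
    # Clear the override and metadata; the instance assignment is never touched.
    machine.health_override = None
    machine.repair_metadata = {}
    if machine.instance is not None:
        # Assumption: return to Ready right away.
        machine.instance.state = InstanceState.READY
```

Note that neither function reassigns `machine.instance`, which mirrors the requirement that the tenant keeps the assignment for the whole workflow.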
The feature also enforces safety controls: required acknowledgments, repair policy input, valid health issue details, RBAC restrictions, and mutual exclusion between online repair and full repair. If online repair cannot resolve the issue, the workflow can be escalated by first clearing online repair and then using the full repair release path.
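One possible shape for the safety checks listed above is a single validation pass that collects every violation before any state changes. The request fields, role names, and error strings below are assumptions made for illustration; the actual PATCH payload and RBAC model are not specified in this request.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical role names; the real RBAC roles are not defined in this request.
ALLOWED_ROLES = {"tenant-admin", "provider-admin"}


@dataclass
class OnlineRepairRequest:
    caller_role: str
    acknowledged: bool              # stands in for the required acknowledgments
    repair_policy: Optional[str]    # hypothetical repair policy input
    health_issue: Optional[str]     # free-text description of the reported issue


def validate_online_repair(req: OnlineRepairRequest, full_repair_active: bool) -> List[str]:
    """Return a list of violations; an empty list means the request may proceed."""
    errors: List[str] = []
    if req.caller_role not in ALLOWED_ROLES:
        errors.append("caller is not permitted to start online repair")
    if not req.acknowledged:
        errors.append("required acknowledgment is missing")
    if not req.repair_policy:
        errors.append("repair policy is required")
    if not req.health_issue or not req.health_issue.strip():
        errors.append("health issue details are required")
    if full_repair_active:
        # Mutual exclusion: a machine cannot be in both repair workflows at once.
        errors.append("online repair and full repair are mutually exclusive")
    return errors
```

Returning all violations at once, rather than failing on the first, lets the caller fix a rejected request in one round trip. The same mutual-exclusion check would apply in the other direction, which is why escalation clears online repair before entering the full repair release path.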
Describe your ideal solution
No response
Describe any alternatives you have considered
No response
Additional context
No response
Code of Conduct