What
Turn the economic plane from advisory into enforcement. Cross-repo with defilantech/infercost.
- Foreman gate: the Workload reconciler checks the referenced team's InferCost TokenBudget before emitting AgenticTasks. If exhausted, the Workload enters a BudgetExhausted phase and does not dispatch.
- ModelRouter gate: a ModelRouter policy can reference a TokenBudget; requests are HTTP 429'd when the budget is exhausted, synchronously in the request path.
- Cost-rate writeback: InferCost writes a costRatePerHour status onto each InferenceService it tracks (from its DCGM-sampled marginal cost), enabling cost-aware routing/scheduling at runtime, not just retrospective reports.
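A minimal sketch of the two gate decisions, to make the intended behavior concrete. It is illustrative only: the `limit_usd` / `spent_usd` fields, the `Dispatching` phase name, and the function names are assumptions, not the actual InferCost or Foreman schema. Only `BudgetExhausted` and the 429 status come from this issue.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    DISPATCHING = "Dispatching"            # hypothetical name for the normal phase
    BUDGET_EXHAUSTED = "BudgetExhausted"   # phase named in this issue


@dataclass
class TokenBudget:
    # Hypothetical fields; the real TokenBudget status may be shaped differently.
    limit_usd: float
    spent_usd: float

    @property
    def exhausted(self) -> bool:
        # Treat "nothing remaining" as exhausted, including exact equality.
        return self.spent_usd >= self.limit_usd


def foreman_gate(budget: TokenBudget) -> Phase:
    """Phase the Workload reconciler would set before emitting AgenticTasks."""
    return Phase.BUDGET_EXHAUSTED if budget.exhausted else Phase.DISPATCHING


def router_gate(budget: Optional[TokenBudget]) -> Optional[int]:
    """HTTP status to reject with, or None to forward the request.

    A policy without a referenced budget (None) is never gated.
    """
    if budget is not None and budget.exhausted:
        return 429
    return None
```

With a budget of `TokenBudget(limit_usd=100.0, spent_usd=100.0)`, `foreman_gate` returns `Phase.BUDGET_EXHAUSTED` and `router_gate` returns `429`; both checks are plain reads of budget state, which is what allows the router check to sit synchronously in the request path.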
Why
Today TokenBudget produces reports but enforces nothing. A Workload can run unbounded against the B200 fleet with $0 remaining budget. Enterprise chargeback to business units needs hard gates, not advisory dashboards. This is the coupling that makes the economic plane part of the governed loop.
Approach / relationships
Definition of done
An exhausted TokenBudget blocks new Foreman dispatch and 429s router traffic; each tracked InferenceService carries a live cost rate consumable by routing/scheduling.
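As one possible reading of "consumable by routing/scheduling", the sketch below picks the cheapest InferenceService that already carries a cost rate. The `InferenceService` class and `cheapest` helper are hypothetical stand-ins; only the `costRatePerHour` status comes from this issue, and a real router would likely weigh latency and capacity as well.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class InferenceService:
    name: str
    # Mirrors the costRatePerHour status described in this issue.
    # None means InferCost has not written a rate for this service yet.
    cost_rate_per_hour: Optional[float] = None


def cheapest(services: Iterable[InferenceService]) -> Optional[InferenceService]:
    """Return the priced service with the lowest hourly cost, or None."""
    priced = [s for s in services if s.cost_rate_per_hour is not None]
    if not priced:
        return None
    return min(priced, key=lambda s: s.cost_rate_per_hour)
```

Services without a rate are skipped rather than treated as free, so a service InferCost has not yet sampled cannot win on price by default.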