[FEATURE] Wire InferCost TokenBudget as a synchronous admission gate (not a report) #631

@Defilan

What

Turn the economic plane from advisory into enforcement. Cross-repo with defilantech/infercost.

  • Foreman gate: the Workload reconciler checks the referenced team's InferCost TokenBudget before emitting AgenticTasks. If exhausted, the Workload enters a BudgetExhausted phase and does not dispatch.
  • ModelRouter gate: a ModelRouter policy can reference a TokenBudget; when that budget is exhausted, requests are rejected with HTTP 429 synchronously in the request path.
  • Cost-rate writeback: InferCost writes a costRatePerHour status onto each InferenceService it tracks (from its DCGM-sampled marginal cost), enabling cost-aware routing/scheduling at runtime, not just retrospective reports.

Why

Today TokenBudget produces reports but enforces nothing. A Workload can run unbounded against the B200 fleet with $0 remaining budget. Enterprise chargeback to business units needs hard gates, not advisory dashboards. This is the coupling that makes the economic plane part of the governed loop.

Approach / relationships

Definition of done

An exhausted TokenBudget blocks new Foreman dispatch and 429s router traffic; each tracked InferenceService carries a live cost rate consumable by routing/scheduling.
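As one illustration of what "consumable by routing/scheduling" could mean, the sketch below picks the cheapest InferenceService among those that already carry a cost rate. Only the `costRatePerHour` field name comes from this issue. The Go struct, the use of a plain `float64` (a real CRD would more likely use a string or quantity type), and the `Cheapest` helper are assumptions for illustration.

```go
package main

import "fmt"

// InferenceService is a hypothetical, trimmed-down view of the resource.
// CostRatePerHour is nil until InferCost has written the status field.
type InferenceService struct {
	Name            string
	CostRatePerHour *float64
}

// rate is a small helper to take the address of a literal.
func rate(v float64) *float64 { return &v }

// Cheapest returns the service with the lowest known cost rate.
// Services without a rate are skipped; ok is false if none has one.
func Cheapest(services []InferenceService) (best InferenceService, ok bool) {
	for _, s := range services {
		if s.CostRatePerHour == nil {
			continue // not yet tracked, so there is nothing to compare
		}
		if !ok || *s.CostRatePerHour < *best.CostRatePerHour {
			best, ok = s, true
		}
	}
	return best, ok
}

func main() {
	services := []InferenceService{
		{Name: "llama-b200", CostRatePerHour: rate(4.0)},
		{Name: "llama-untracked"},
		{Name: "llama-l40s", CostRatePerHour: rate(2.5)},
	}
	if s, ok := Cheapest(services); ok {
		fmt.Println("route to", s.Name)
	}
}
```

Skipping untracked services is a deliberate choice in this sketch: a missing rate is treated as unknown, not as free.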

Labels

  • area/multi-tenancy: Multi-tenancy and resource isolation
  • area/routing: Multi-backend routing, model router CRD, policy-aware dispatch
  • component/controller: Related to the operator controller
  • enhancement: New feature or request
  • kind/feature: New feature or request