
Artanis fleet-overseer automation: autonomous control loop orchestrating stress + healing + scaling + external-yield (on artanis-administrator-tick, approval-gated) #6321

Description

@AtlantisPleb

Owner-directed strict-bug-override: GitHub issues in this repo are normally strict, reproducible bugs only (apps/openagents.com/AGENTS.md). This issue is filed under explicit owner direction. Discussion/commentary belongs in the Product Promises Forum.

Child of #6316.

Why

The continuous stress harness, external-wins admission, reliability hardening, and throughput tuning all need a continuous operator. The owner wants Artanis to be that autonomous overseer -- watching fleet health + throughput + external demand, orchestrating the stress load (start/scale/back-off), triggering heal/scale/quarantine within approval-gated authority, and reporting -- as an extension of the existing artanis-administrator-tick + artanis-approval-gates, not a new system. Today Artanis runs a bounded live autonomous loop fenced by typed schemas and approval gates, but it does not read GPU, replica, throughput, or demand state; the GPU sensor (glm-pool-heartbeat.ts) runs as a separate tick. This issue wires Artanis onto that sensor as the orchestrator.

Scope

  • New artanis-fleet-overseer-tick.ts exporting runArtanisFleetOverseerTick(db, deps) + a ...Scheduled Effect wrapper, mirroring runArtanisAdminTick/runArtanisAdminTickScheduled exactly: env-gated by ARTANIS_FLEET_OVERSEER_ENABLED, self-bounded cadence, every outcome a D1 row in artanis_fleet_overseer_decisions, schema-invalid -> blocked. Registered as one new observedEffect('ArtanisFleet.tick', ...) in the Worker scheduled Promise.all beside ArtanisAdmin.tick.
  • Watch (assembleContext): read the existing glmPoolHeartbeatRoutingStateOracle (per-replica health/warm/draining + the new live-headroom read surface), aggregate throughput/goodput from telemetry, and external demand from the token-usage ledger. Do NOT build a new fleet collector -- the GLM heartbeat stays the sensor.
  • Decide: reuse artanisMindComplete with a bounded S.Union action vocabulary validated by a typed schema.
  • Act -- authority split:
    • Autonomous (no spend, non-destructive): start/scale/back-off the internal stress load (yield to external demand using the same external-wins signal); re-admit recovered / warm idle owned replicas; emit public-safe health/throughput reports.
    • Owner-gated (pending ArtanisApprovalGateRecord, effective only with operator approval + receipt): request paid scale-out (reuse provider_call/deployment); quarantine a replica (add a new fleet_mutation risky kind to ArtanisRiskyActionKind + ARTANIS_RISKY_ACTION_KINDS + rollbackRequiredKinds, with test + INVARIANTS.md update in the same change); any wallet spend/settlement (already gated, never widened).
  • Report: fold a public-safe fleet health/throughput/goodput signal into the existing artanis-health.ts snapshot (new ArtanisHealthSignalKind) so stale/blocked fleet health structurally blocks overclaiming.
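
The Decide and Act steps above can be sketched roughly as follows. This is a dependency-free illustration, not the intended implementation: a plain TypeScript discriminated union and hand-written parser stand in for the `S.Union` schema, and every name here (`OverseerAction`, `parseAction`, `decideTick`, the action kinds, and the `skipped` status for a disabled tick) is hypothetical.

```typescript
// All names below are hypothetical illustrations, not the repo's actual types.
type OverseerAction =
  | { kind: "stress_set"; targetConcurrency: number }
  | { kind: "replica_readmit"; replicaRef: string }
  | { kind: "paid_scale_out"; replicas: number }
  | { kind: "replica_quarantine"; replicaRef: string };

type Outcome =
  | { status: "skipped"; reason: string }
  | { status: "blocked"; reason: string }
  | { status: "dispatched"; action: OverseerAction }
  | { status: "pending_approval"; action: OverseerAction };

// Kinds that must wait for an owner approval gate instead of executing.
const OWNER_GATED: ReadonlySet<string> = new Set([
  "paid_scale_out",
  "replica_quarantine",
]);

function asCount(v: unknown): number | null {
  return typeof v === "number" && Number.isInteger(v) && v >= 0 ? v : null;
}

function asRef(v: unknown): string | null {
  return typeof v === "string" && v.length > 0 ? v : null;
}

// Returns null for anything outside the bounded vocabulary.
function parseAction(raw: unknown): OverseerAction | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  const kind = r["kind"];
  if (kind === "stress_set") {
    const n = asCount(r["targetConcurrency"]);
    return n === null ? null : { kind: "stress_set", targetConcurrency: n };
  }
  if (kind === "paid_scale_out") {
    const n = asCount(r["replicas"]);
    return n === null || n === 0 ? null : { kind: "paid_scale_out", replicas: n };
  }
  if (kind === "replica_readmit") {
    const ref = asRef(r["replicaRef"]);
    return ref === null ? null : { kind: "replica_readmit", replicaRef: ref };
  }
  if (kind === "replica_quarantine") {
    const ref = asRef(r["replicaRef"]);
    return ref === null ? null : { kind: "replica_quarantine", replicaRef: ref };
  }
  return null;
}

// One decision per tick: disabled -> skipped, invalid -> blocked,
// owner-gated kinds -> pending approval, everything else -> dispatched.
function decideTick(enabled: boolean, proposal: unknown): Outcome {
  if (!enabled) return { status: "skipped", reason: "disabled" };
  const action = parseAction(proposal);
  if (action === null) return { status: "blocked", reason: "schema_invalid" };
  return OWNER_GATED.has(action.kind)
    ? { status: "pending_approval", action }
    : { status: "dispatched", action };
}
```

In this sketch each `Outcome` would correspond to one row in artanis_fleet_overseer_decisions, and only a `dispatched` outcome reaches an executor. Whether a disabled tick writes a row at all is an assumption of the sketch, not something this issue specifies.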

Acceptance (measurable)

  • The overseer tick runs on the scheduled handler, env-gated, self-bounded, with every outcome a D1 row; a schema-invalid mind proposal yields a blocked row and dispatches nothing.
  • Autonomously starts/scales/backs-off the stress harness keyed on live external demand, with 0 external-request failures during a demand spike.
  • A replica-quarantine or paid-scale-out proposal emits a pending approval gate and does NOT execute until artanisApprovalGateEffective (operator approval + receipt); wallet spend stays gated.
  • A fleet health/throughput signal appears in the Artanis health snapshot and a stale/blocked signal blocks overclaiming.
  • Public-safe: no raw origin URLs, IPs, bearer material, prompts, or wallet material in any projection.
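
Two of the acceptance criteria above lend themselves to small pure functions that can be unit-tested in isolation. The sketch below is illustrative only: `DemandSnapshot`, `stressTarget`, `GateSketch`, `gateEffectiveSketch`, and the `reserveSlots` parameter are hypothetical names, and `gateEffectiveSketch` is a stand-in for the idea behind artanisApprovalGateEffective, not its actual signature.

```typescript
// Hypothetical shapes; the real heartbeat oracle and gate record will differ.
interface DemandSnapshot {
  capacitySlots: number; // total concurrent slots across healthy replicas
  externalInFlight: number; // external requests currently being served
}

// External demand wins: stress load only takes the slots left over after
// external traffic plus a fixed reserve, and never goes below zero.
function stressTarget(s: DemandSnapshot, reserveSlots: number): number {
  const free = s.capacitySlots - s.externalInFlight - reserveSlots;
  return Math.max(0, Math.floor(free));
}

interface GateSketch {
  kind: "fleet_mutation" | "provider_call";
  operatorApproved: boolean;
  receiptRef: string | null;
}

// A gate is effective only with BOTH operator approval and a receipt.
function gateEffectiveSketch(g: GateSketch): boolean {
  return g.operatorApproved && g.receiptRef !== null && g.receiptRef.length > 0;
}
```

With 32 slots, 4 external requests in flight, and a reserve of 4, `stressTarget` returns 24; during a spike to 30 external requests it returns 0, so the harness backs off entirely. A quarantine gate that is approved but has no receipt stays ineffective.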

Refs

  • docs/inference/2026-06-25-glm-fleet-max-throughput-stress-and-artanis-overseer.md (§5)
  • apps/openagents.com/workers/api/src/artanis-administrator-tick.ts, artanis-approval-gates.ts, artanis-scheduled-runner.ts, artanis-health.ts, artanis-spend.ts, artanis-mind.ts, inference/glm-pool-heartbeat.ts
  • The continuous stress harness, external-wins admission, and reliability hardening issues; MASTER: Maximize GLM-5.2-REAP usage in Khala (full pool, durability, throughput, quality) #6316

Strict-bug-override: filed as direction work under owner mandate, not a strict bug.
