feat: managed-mode management API (status, sync-now) and fail-closed startup #190

Open
Zygimantass wants to merge 1 commit into ironsh:main from Zygimantass:feat/managed-mode-status-sync-failclosed

@Zygimantass

Problem

Two related gaps for managed (control-plane-synced) proxies:

  1. No way to observe or accelerate config application. A control plane that reassigns a proxy's principal (e.g. binding a warm sandbox to a session at claim time) has no way to know when the proxy has actually applied the new config — the only option is sleeping longer than the 5s poll interval and hoping. We hit exactly this in production: the first LLM call after a principal reassignment beat the proxy's next sync poll by ~350ms and went upstream with the placeholder credential still in the header (401 from the upstream API).
  2. Fail-open startup. A freshly started managed proxy serves requests with whatever pipeline it has — including the empty pre-first-sync pipeline — so requests during startup pass through un-transformed, leaking placeholder credentials upstream.

Changes

  • Managed-mode management server: management.listen is now allowed in managed mode, serving:
    • GET /v1/status → {config_hash, principal_id, principal_status, synced_once, last_sync_at}: the applied control-plane state, so an operator or control plane can verify which principal's config the proxy is enforcing before routing traffic through it
    • POST /v1/sync → non-blocking immediate sync request (poller Poke()); callers poll /v1/status to observe the result
    • /v1/reload stays standalone-only (404 in managed mode — no file to re-read)
  • IRON_MANAGEMENT_LISTEN env override, consistent with the other IRON_* vars (managed proxies have no config file)
  • SyncResponse now parses the control plane's status / principal_id fields; the poller records them in a snapshot (retained across hash-match responses, seeded from the startup sync)
  • Fail-closed startup: new proxy.Options.Ready gates request handling with 503 until the first control-plane config is applied (wired in managed mode; standalone unaffected)
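
As a rough illustration of the fail-closed gate described above, the sketch below wraps a handler so that it answers 503 until a readiness flag flips. It assumes `Ready` behaves like a `func() bool`; the names `readyGate`, `probe`, and `gateDemo` are hypothetical and are not taken from the iron-proxy code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readyGate wraps next and answers 503 until ready() reports true.
// Hypothetical helper: the real proxy.Options.Ready may have a different shape.
func readyGate(ready func() bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !ready() {
			http.Error(w, "awaiting first control-plane sync", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// probe sends one in-memory request through h and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

// gateDemo returns the status codes seen before and after the first sync.
func gateDemo() (before, after int) {
	var synced atomic.Bool // would be set by the poller once a config is applied
	upstream := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := readyGate(synced.Load, upstream)

	before = probe(h)
	synced.Store(true)
	after = probe(h)
	return before, after
}

func main() {
	before, after := gateDemo()
	fmt.Println(before, after) // prints "503 200"
}
```

Reading the flag on every request keeps the gate cheap, and because the flag is only ever set and never cleared, a request cannot pass through the empty pre-sync pipeline.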

Tests

  • poller: poke triggers immediate sync; status snapshot updates and retains principal across hash-match responses; seed semantics
  • management: auth/method/availability matrix for /v1/status and /v1/sync; /v1/reload 404s in managed mode
  • proxy: requests 503 while not ready, flow once ready

go test ./... — 28/28 packages pass.

Managed proxies previously ran with no management server at all, and a
proxy served requests with whatever pipeline it had - including the empty
pre-first-sync pipeline - passing placeholder credentials upstream
un-transformed.

- allow management.listen in managed mode: it now serves GET /v1/status
  (the applied control-plane state: config hash, principal, last sync)
  and POST /v1/sync (request an immediate out-of-band sync). /v1/reload
  stays standalone-only since there is no file to re-read.
- IRON_MANAGEMENT_LISTEN env override, matching the other IRON_* vars.
- the sync poller records a status snapshot (config_hash, principal_id,
  principal_status, synced_once, last_sync_at) and accepts a non-blocking
  Poke() that wakes the poll loop early.
- fail closed during startup: proxy.Options.Ready gates request handling
  with 503 until the first control-plane config has been applied, so a
  freshly started or restarted managed proxy can never leak placeholder
  credentials to upstream APIs.
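
One common way to get a non-blocking `Poke()` that wakes a poll loop early is a buffered channel of capacity one, sketched below. This is an assumption about the general technique, not the PR's actual poller; the `poller` type, its fields, and `pokeDemo` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// poller is a hypothetical stand-in for the proxy's sync poller.
type poller struct {
	poke   chan struct{} // capacity 1: at most one pending wake-up
	doSync func()
}

func newPoller(doSync func()) *poller {
	return &poller{poke: make(chan struct{}, 1), doSync: doSync}
}

// Poke requests an immediate sync without blocking the caller.
// If a wake-up is already pending, the new request is coalesced into it.
func (p *poller) Poke() {
	select {
	case p.poke <- struct{}{}:
	default:
	}
}

// run syncs every interval, or sooner when poked, until stop is closed.
func (p *poller) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
		case <-p.poke:
			ticker.Reset(interval) // restart the cadence after an early sync
		}
		p.doSync()
	}
}

// pokeDemo reports whether a poke triggered a sync long before the interval.
func pokeDemo() bool {
	synced := make(chan struct{}, 1)
	p := newPoller(func() {
		select {
		case synced <- struct{}{}:
		default:
		}
	})
	stop := make(chan struct{})
	defer close(stop)
	go p.run(time.Hour, stop) // interval far longer than the demo

	p.Poke()
	select {
	case <-synced:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println(pokeDemo()) // prints "true"
}
```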

The motivating incident: a sandbox control plane reassigned a proxy's
principal and routed traffic immediately; the first LLM call beat the
proxy's next 5s sync poll by ~350ms and went upstream with a placeholder
key (401). With this change the control plane can POST /v1/sync and poll
GET /v1/status until principal_id matches before routing traffic.
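
A caller-side sketch of that barrier follows: POST /v1/sync once, then poll GET /v1/status until `principal_id` matches. The endpoint paths and JSON field names come from this description; the bearer-token header, the field types, the 50ms poll cadence, and the in-memory fake proxy are assumptions made for illustration.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// status holds the two /v1/status fields the barrier needs; types are assumed.
type status struct {
	PrincipalID string `json:"principal_id"`
	SyncedOnce  bool   `json:"synced_once"`
}

// call issues one authenticated management request.
func call(ctx context.Context, method, url, key string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+key) // assumed auth scheme
	return http.DefaultClient.Do(req)
}

// waitForPrincipal pokes a sync, then polls status until want is applied
// or ctx ends.
func waitForPrincipal(ctx context.Context, base, key, want string) error {
	resp, err := call(ctx, http.MethodPost, base+"/v1/sync", key)
	if err != nil {
		return err
	}
	resp.Body.Close()

	tick := time.NewTicker(50 * time.Millisecond)
	defer tick.Stop()
	for {
		resp, err := call(ctx, http.MethodGet, base+"/v1/status", key)
		if err == nil {
			var st status
			decodeErr := json.NewDecoder(resp.Body).Decode(&st)
			resp.Body.Close()
			if decodeErr == nil && resp.StatusCode == http.StatusOK &&
				st.SyncedOnce && st.PrincipalID == want {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-tick.C:
		}
	}
}

// barrierDemo runs the barrier against a fake proxy that applies the new
// principal as soon as it is poked.
func barrierDemo() error {
	var poked atomic.Bool
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Path {
		case "/v1/sync":
			poked.Store(true)
		case "/v1/status":
			id := "old-principal"
			if poked.Load() {
				id = "new-principal"
			}
			json.NewEncoder(w).Encode(map[string]any{"principal_id": id, "synced_once": true})
		default:
			http.NotFound(w, r)
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return waitForPrincipal(ctx, srv.URL, "test-key", "new-principal")
}

func main() {
	fmt.Println(barrierDemo())
}
```

Bounding the wait with a context deadline lets the caller decide what a timeout means, instead of baking a fixed sleep into the barrier.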

Amp-Thread-ID: https://ampcode.com/threads/T-019eb82e-5bc9-707f-9d5c-cdb6c9d16926
Co-authored-by: Amp <amp@ampcode.com>
Zygimantass added a commit to paradigmxyz/centaur that referenced this pull request Jun 12, 2026
…rier (#526)

Claiming a warm sandbox reassigns its iron-proxy principal in
iron-control, but managed proxies only pick the change up on their next
/proxy/sync poll (5s cadence). The claim path papered over this with a
blind 6s sleep; the Rust harness fires its first LLM call within
milliseconds of stdin, and any gap sends the placeholder credential
upstream (observed as Anthropic 401s in prod when the first call beat
the poll by ~350ms, exe_42484b5303b9401db9d5af4a9112fd91).

Replace the sleep with a barrier against the proxy's managed-mode
management API (ironsh/iron-proxy#190): POST /v1/sync pokes an
immediate out-of-band sync, then GET /v1/status is polled until the
proxy reports the claimed principal's config applied (typically well
under a second instead of 6s). Wiring:

- proxy pods get IRON_MANAGEMENT_LISTEN=:9092 and a per-pod random
  IRON_MANAGEMENT_API_KEY; the barrier reads both back off the live pod
  so it survives api-rs restarts and env overrides
- the proxy NetworkPolicy gains an api-rs -> :9092 ingress rule
  (sandboxes still cannot reach the management port)
- proxy images without the management API never answer on the port;
  after a 2s probe window the barrier falls back to the previous fixed
  delay, so behavior is unchanged until the image is bumped
- the barrier never fails the claim: post-#190 proxies fail closed
  (503) until their first sync, so a timeout degrades to a brief
  retryable window instead of a failed execution
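
The fallback behavior in the list above can be sketched as a small decision function. This is written in Go for illustration only and is not the centaur implementation; `claimBarrier`, its parameters, and the outcome strings are hypothetical, and the durations in the demo are shortened stand-ins for the 2s probe window and the fixed delay.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// claimBarrier never returns an error, so it cannot fail the claim.
//   - probe gets no answer within probeWindow: sleep legacyDelay ("fallback")
//   - wait succeeds within waitTimeout: config applied ("applied")
//   - wait times out: proceed anyway ("timeout"), relying on the proxy
//     failing closed with 503 until its sync lands
func claimBarrier(probe, wait func(context.Context) error,
	probeWindow, waitTimeout, legacyDelay time.Duration) string {

	pctx, cancelProbe := context.WithTimeout(context.Background(), probeWindow)
	defer cancelProbe()
	if err := probe(pctx); err != nil {
		time.Sleep(legacyDelay) // image without the management API
		return "fallback"
	}

	wctx, cancelWait := context.WithTimeout(context.Background(), waitTimeout)
	defer cancelWait()
	if err := wait(wctx); err != nil {
		return "timeout"
	}
	return "applied"
}

// demoOutcomes exercises the three paths with fake probe and wait functions.
func demoOutcomes() [3]string {
	ok := func(context.Context) error { return nil }
	hang := func(ctx context.Context) error {
		<-ctx.Done() // simulates a port that never answers
		return ctx.Err()
	}
	const short = 20 * time.Millisecond
	return [3]string{
		claimBarrier(ok, ok, short, short, short),
		claimBarrier(hang, ok, short, short, short),
		claimBarrier(ok, hang, short, short, short),
	}
}

func main() {
	fmt.Println(demoOutcomes()) // prints "[applied fallback timeout]"
}
```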

Amp-Thread-ID: https://ampcode.com/threads/T-019eb82e-5bc9-707f-9d5c-cdb6c9d16926

Co-authored-by: Amp <amp@ampcode.com>