
fix: 100% CPU hang on many mappings; auth tokens fetch in parallel #94

Merged
bryantbiggs merged 3 commits into main from fix/93-and-token-coalescing on May 17, 2026

Conversation

@bryantbiggs

Member

Summary

  • Resolves #93 (Performance: sync hangs silently at 100% CPU). elect_leaders reran a pairwise O(n²·b) scan on every greedy round, giving O(n³·b) in the number of distinct source manifests. With hundreds of mappings the loop took minutes on a single-threaded runtime, surfacing as a silent 100% CPU hang after discovery. It is rewritten around an inverted index, blob_remaining_count, so each round costs O(n·b̄) and the total is O(n²·b̄). An n=150 stress run drops from ~2.4s to <50ms in a debug build.
  • Per-scope token coalescing across all 6 Bearer auth providers. The per-provider token cache mutex was held across token_exchange::exchange(), serializing distinct-scope fetches even when their cache entries were independent — a real throughput funnel under high-fanout Chainguard / Docker Hardened sync. A new TokenCache helper in auth/token_cache.rs owns the live token map plus a per-scope Arc<Mutex<()>> map; same-scope fetches coalesce via the per-scope mutex with a double-checked cache read, distinct scopes run in parallel. basic, anonymous, docker-config, ecr-public, gcp, and acr all route through it.
  • +6 tests, all gates green. 1 regression test for the cubic scaling (timing-bounded), 5 helper unit tests covering same-scope coalescing / distinct-scope parallelism (deadlock-detected via Barrier) / failure-path retry / clear() behavior, and 1 wiremock integration test on basic.rs for end-to-end failure-path wiring.

Test plan

  • CI: cargo test --workspace --locked, cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo deny check
  • Review the TokenCache helper contract (auth/token_cache.rs) — it owns the parallelism guarantee; per-provider tests only verify wiring
  • Spot-check provider migrations: basic is the canonical example, anonymous / docker-config mirror it 1:1, ecr-public / gcp thread their SDK credential refresh through the fetch closure, acr does the same with the AAD refresh+access two-step
  • Behavioral verification on the #93 reproducer environment (Chainguard + DHI source, ECR FIPS target, 27 mappings); the `latest: 20` workaround should no longer be required

The greedy multi-leader election ran a pairwise scan inside its outer
loop, giving O(n^3 * b) where n is the number of distinct source
manifests entering the promotion phase. With ~hundreds of mappings the
loop took minutes on a single thread, fully starving the current_thread
runtime and surfacing as a silent hang at 100% CPU after discovery.

Replace the inner pairwise scan with an inverted index that maps each
blob to the number of remaining groups containing it. A candidate's
marginal coverage is then the sum of `count[d] - 1` over its blobs not
already in the leader union. Total cost drops to O(n^2 * b_avg) and an
n=150 stress run drops from ~2.4s to under 50ms in a debug build.
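The inverted-index scoring described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `BlobId` alias, the function signature, the first-wins tie-break, and the stop-at-zero-coverage rule are all assumptions made for the example. Only the `count[d] - 1` scoring and the `blob_remaining_count` name come from the description.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a blob digest.
type BlobId = u32;

/// Greedy leader election over groups of blobs (illustrative signature).
/// Returns the indices of the elected groups in election order.
fn elect_leaders(groups: &[HashSet<BlobId>]) -> Vec<usize> {
    // Inverted index: blob -> number of remaining (unelected) groups containing it.
    let mut blob_remaining_count: HashMap<BlobId, usize> = HashMap::new();
    for group in groups {
        for &blob in group {
            *blob_remaining_count.entry(blob).or_insert(0) += 1;
        }
    }

    let mut remaining: Vec<usize> = (0..groups.len()).collect();
    let mut leader_union: HashSet<BlobId> = HashSet::new();
    let mut leaders = Vec::new();

    loop {
        // One round: score every remaining candidate in O(b) each, so O(n * b_avg) per round.
        let mut best: Option<(usize, usize)> = None; // (position in `remaining`, score)
        for (pos, &idx) in remaining.iter().enumerate() {
            // Marginal coverage: sum of (count - 1) over blobs not yet in the leader union.
            // The candidate itself is still counted in the index, hence the `- 1`.
            let score: usize = groups[idx]
                .iter()
                .filter(|blob| !leader_union.contains(*blob))
                .map(|blob| blob_remaining_count[blob] - 1)
                .sum();
            if best.map_or(true, |(_, s)| score > s) {
                best = Some((pos, score));
            }
        }

        // Assumed stopping rule: stop once no candidate adds any coverage.
        let pos = match best {
            Some((pos, score)) if score > 0 => pos,
            _ => break,
        };

        // Elect the winner: it leaves the remaining set, so update the index and the union.
        let idx = remaining.remove(pos);
        for &blob in &groups[idx] {
            *blob_remaining_count.get_mut(&blob).expect("blob was indexed") -= 1;
            leader_union.insert(blob);
        }
        leaders.push(idx);
    }
    leaders
}
```

With at most n rounds and O(n * b_avg) work per round, the sketch stays within the O(n^2 * b_avg) bound quoted above; no pairwise group comparison is needed because the per-blob counts already summarize how many other groups share each blob.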

The per-provider token cache mutex was held across the
`token_exchange::exchange()` HTTP call, serializing every concurrent
fetch even when distinct scopes were independent. Under high-fanout
sync against scope-keyed registries (Chainguard, Docker Hardened) this
became a throughput funnel.

Introduce a shared `TokenCache` helper in `auth/token_cache.rs` that
owns both the live token map and a per-scope `Arc<Mutex<()>>` map.
Concurrent callers for the same scope coalesce through the per-scope
mutex with a double-checked cache read; distinct scopes hold distinct
mutexes and run in parallel. The helper carries the contract: provider
tests verify wiring only.
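A rough sketch of the per-scope coalescing pattern, for readers reviewing the contract. It is not the `TokenCache` from `auth/token_cache.rs`: the field names, the `get_or_fetch` signature, and the use of blocking `std::sync` primitives in place of an async runtime are assumptions made so the example is self-contained.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical, synchronous stand-in for the helper described above.
#[derive(Default)]
pub struct TokenCache {
    /// Live tokens keyed by scope.
    tokens: Mutex<HashMap<String, String>>,
    /// One lock per scope; held only while that scope's fetch is in flight.
    scope_locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl TokenCache {
    /// Short critical section: the map lock is released before returning.
    fn cached(&self, scope: &str) -> Option<String> {
        self.tokens.lock().unwrap().get(scope).cloned()
    }

    /// Returns the cached token for `scope`, or runs `fetch` at most once
    /// among concurrent same-scope callers. Distinct scopes do not block each other.
    pub fn get_or_fetch<F, E>(&self, scope: &str, fetch: F) -> Result<String, E>
    where
        F: FnOnce() -> Result<String, E>,
    {
        // First check, without taking the per-scope lock.
        if let Some(token) = self.cached(scope) {
            return Ok(token);
        }

        // Get or create this scope's lock; the lock-map mutex is dropped right away.
        let scope_lock = {
            let mut locks = self.scope_locks.lock().unwrap();
            locks.entry(scope.to_string()).or_default().clone()
        };
        let _guard = scope_lock.lock().unwrap();

        // Second check: another caller may have fetched while we waited.
        if let Some(token) = self.cached(scope) {
            return Ok(token);
        }

        // The slow call runs with only the per-scope lock held.
        // On error nothing is cached, so the next caller retries.
        let token = fetch()?;
        self.tokens
            .lock()
            .unwrap()
            .insert(scope.to_string(), token.clone());
        Ok(token)
    }

    /// Drops all cached tokens; per-scope locks are kept.
    pub fn clear(&self) {
        self.tokens.lock().unwrap().clear();
    }
}
```

The key design point is that neither map mutex is ever held across `fetch`; only the scope-specific lock is, which is what lets distinct scopes proceed in parallel while same-scope callers queue and then hit the second cache check.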

Migrate the `basic` provider as the canonical caller. Helper unit
tests cover same-scope coalescing, distinct-scope parallelism (deadlock
detection via Barrier), error-path retry semantics, and clear()
behavior. An integration-level failure-path test in basic.rs verifies
the wiring end-to-end through wiremock.
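One way a Barrier can prove distinct-scope parallelism is sketched below; the actual helper tests may be structured differently. `KeyedLocks`, `with_key`, and `distinct_keys_overlap` are hypothetical names, and the per-key lock table is a simplified stand-in for the cache under test. Each critical section waits on a two-party barrier, so the check only passes if both sections are active at the same time; a timeout turns the would-be deadlock into a failed assertion.

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, Barrier, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical minimal per-key lock table standing in for the cache under test.
#[derive(Default)]
struct KeyedLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl KeyedLocks {
    /// Runs `f` while holding only the lock for `key`.
    fn with_key<T>(&self, key: &str, f: impl FnOnce() -> T) -> T {
        // The table mutex is released at the end of this statement.
        let lock = self
            .locks
            .lock()
            .unwrap()
            .entry(key.to_string())
            .or_default()
            .clone();
        let _guard = lock.lock().unwrap();
        f()
    }
}

/// Returns true if two calls on distinct keys were inside their
/// critical sections at the same time, false if they timed out.
fn distinct_keys_overlap(table: Arc<KeyedLocks>, timeout: Duration) -> bool {
    let barrier = Arc::new(Barrier::new(2));
    let (tx, rx) = mpsc::channel();
    for key in ["scope-a", "scope-b"] {
        let (table, barrier, tx) = (Arc::clone(&table), Arc::clone(&barrier), tx.clone());
        thread::spawn(move || {
            // Both threads must reach this point concurrently to pass the barrier.
            table.with_key(key, || {
                barrier.wait();
            });
            let _ = tx.send(());
        });
    }
    // If the two sections were serialized, neither thread would get past
    // the barrier and both receives would time out.
    (0..2).all(|_| rx.recv_timeout(timeout).is_ok())
}
```

A test would then assert that `distinct_keys_overlap(Arc::new(KeyedLocks::default()), Duration::from_secs(5))` returns true. With a single global lock in place of the per-key table, the same check would time out rather than hang the test run.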

Mechanical migration of the remaining five Bearer-token providers to
the shared `TokenCache` helper introduced in the previous commit:

- anonymous (Docker token exchange, no creds)
- docker-config (Docker config.json credentials)
- ecr-public (AWS SDK credentials, OCI exchange)
- gcp (Google ADC credentials, OCI exchange)
- acr (Azure AD refresh + access token, ACR OAuth2 endpoints)

All six providers now share the same per-scope coalescing contract:
distinct scopes fetch tokens in parallel while same-scope concurrent
callers coalesce to one exchange. The contract is owned by the helper
unit tests; per-provider tests verify the wiring through wiremock or
mock SDK clients.
@bryantbiggs changed the title from "fix: sync hang (#93) and per-scope auth token coalescing" to "fix: sync hang and per-scope auth token coalescing" on May 17, 2026
@bryantbiggs changed the title from "fix: sync hang and per-scope auth token coalescing" to "fix: 100% CPU hang on many mappings; auth tokens fetch in parallel" on May 17, 2026
@bryantbiggs merged commit c4e1e34 into main on May 17, 2026
16 checks passed
@bryantbiggs deleted the fix/93-and-token-coalescing branch on May 17, 2026 22:02
Development

Successfully merging this pull request may close these issues.

Performance: sync hangs silently at 100% CPU