fix: 100% CPU hang on many mappings; auth tokens fetch in parallel #94
Merged
The greedy multi-leader election ran a pairwise scan inside its outer loop, giving O(n^3 * b), where n is the number of distinct source manifests entering the promotion phase and b is the number of blobs per manifest. With hundreds of mappings the loop took minutes on a single thread, fully starving the current_thread runtime and surfacing as a silent hang at 100% CPU after discovery. Replace the inner pairwise scan with an inverted index that maps each blob to the number of remaining groups containing it. A candidate's marginal coverage is then the sum of `count[d] - 1` over its blobs not already in the leader union. Total cost drops to O(n^2 * b_avg), and an n=150 stress run drops from ~2.4s to under 50ms in a debug build.
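The inverted-index idea can be illustrated with a minimal sketch. This is not the code from this PR: the function signature, the `max_leaders` cap, the stop-when-gain-is-zero rule, and the lowest-index tie-break are all assumptions made for illustration. Only the `count[d] - 1` marginal-coverage formula comes from the description above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch: each group is the set of blob digests in one source
/// manifest. Returns the indices of the elected leaders in election order.
pub fn elect_leaders(groups: &[HashSet<String>], max_leaders: usize) -> Vec<usize> {
    // Inverted index: blob -> number of remaining (non-leader) groups containing it.
    let mut count: HashMap<&str, usize> = HashMap::new();
    for g in groups {
        for d in g {
            *count.entry(d.as_str()).or_insert(0) += 1;
        }
    }

    let mut remaining = vec![true; groups.len()];
    let mut leader_union: HashSet<&str> = HashSet::new();
    let mut leaders = Vec::new();

    while leaders.len() < max_leaders {
        // One round costs O(n * b_avg): a single pass over every remaining group.
        let mut best: Option<(usize, usize)> = None; // (gain, group index)
        for (i, g) in groups.iter().enumerate() {
            if !remaining[i] {
                continue;
            }
            // Marginal coverage: for each blob not yet in the leader union,
            // the number of *other* remaining groups that also contain it.
            // count[d] >= 1 here because group i itself is still remaining.
            let gain: usize = g
                .iter()
                .filter(|d| !leader_union.contains(d.as_str()))
                .map(|d| count[d.as_str()] - 1)
                .sum();
            if gain > 0 && best.map_or(true, |(best_gain, _)| gain > best_gain) {
                best = Some((gain, i));
            }
        }

        // Assumed stop rule: no candidate adds coverage any more.
        let i = match best {
            Some((_, i)) => i,
            None => break,
        };

        // Promote group i: it leaves the remaining set and joins the union.
        remaining[i] = false;
        for d in &groups[i] {
            *count.get_mut(d.as_str()).unwrap() -= 1;
            leader_union.insert(d.as_str());
        }
        leaders.push(i);
    }
    leaders
}
```

With at most n rounds of O(n * b_avg) work each, the sketch matches the O(n^2 * b_avg) bound quoted above; the pairwise scan it replaces would compare every candidate against every other group in each round.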
The per-provider token cache mutex was held across the `token_exchange::exchange()` HTTP call, serializing every concurrent fetch even when distinct scopes were independent. Under high-fanout sync against scope-keyed registries (Chainguard, Docker Hardened) this became a throughput funnel. Introduce a shared `TokenCache` helper in `auth/token_cache.rs` that owns both the live token map and a per-scope `Arc<Mutex<()>>` map. Concurrent callers for the same scope coalesce through the per-scope mutex with a double-checked cache read; distinct scopes hold distinct mutexes and run in parallel. The helper carries the contract: provider tests verify wiring only. Migrate the `basic` provider as the canonical caller. Helper unit tests cover same-scope coalescing, distinct-scope parallelism (deadlock detection via Barrier), error-path retry semantics, and clear() behavior. An integration-level failure-path test in basic.rs verifies the wiring end-to-end through wiremock.
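The per-scope coalescing pattern can be sketched as follows. This is a simplified, synchronous stand-in built on `std::sync` only, so that it compiles without an async runtime; the real helper in `auth/token_cache.rs` is not shown in this PR text. The method name `get_or_fetch`, the `String` token and error types, and the closure-based fetch are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical synchronous sketch of a scope-keyed token cache.
#[derive(Default)]
pub struct TokenCache {
    /// Live tokens, keyed by scope.
    tokens: Mutex<HashMap<String, String>>,
    /// One lock per scope; serializes fetches for that scope only.
    scope_locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl TokenCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Short critical section: the token map lock is never held across a fetch.
    fn cached(&self, scope: &str) -> Option<String> {
        self.tokens.lock().unwrap().get(scope).cloned()
    }

    pub fn get_or_fetch<F>(&self, scope: &str, fetch: F) -> Result<String, String>
    where
        F: FnOnce() -> Result<String, String>,
    {
        // First check: fast path without touching the per-scope lock.
        if let Some(token) = self.cached(scope) {
            return Ok(token);
        }

        // Look up (or create) this scope's lock; the map lock is released
        // at the end of this block, before the slow fetch starts.
        let scope_lock = {
            let mut locks = self.scope_locks.lock().unwrap();
            locks.entry(scope.to_string()).or_default().clone()
        };
        let _guard = scope_lock.lock().unwrap();

        // Second check: another caller for the same scope may have filled
        // the cache while this one waited on the per-scope lock.
        if let Some(token) = self.cached(scope) {
            return Ok(token);
        }

        // Slow path. Errors are returned without being cached, so the next
        // caller retries the fetch.
        let token = fetch()?;
        self.tokens
            .lock()
            .unwrap()
            .insert(scope.to_string(), token.clone());
        Ok(token)
    }

    /// Drops all cached tokens; per-scope locks are kept.
    pub fn clear(&self) {
        self.tokens.lock().unwrap().clear();
    }
}
```

Because only the per-scope lock is held while `fetch` runs, two callers with different scopes never wait on each other, while callers sharing a scope queue behind a single fetch and then hit the second cache check.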
Mechanical migration of the remaining five Bearer-token providers to the shared `TokenCache` helper introduced in the previous commit:
- anonymous (Docker token exchange, no creds)
- docker-config (Docker config.json credentials)
- ecr-public (AWS SDK credentials, OCI exchange)
- gcp (Google ADC credentials, OCI exchange)
- acr (Azure AD refresh + access token, ACR OAuth2 endpoints)

All six providers now share the same per-scope coalescing contract: distinct scopes fetch tokens in parallel while same-scope concurrent callers coalesce to one exchange. The contract is owned by the helper unit tests; per-provider tests verify the wiring through wiremock or mock SDK clients.
Summary
- `elect_leaders` reran a pairwise O(n²·b) scan per greedy round, giving O(n³·b) in the number of distinct source manifests. With hundreds of mappings the loop took minutes on a single-thread runtime, surfacing as a silent 100% CPU hang after discovery. Rewritten with an inverted index `blob_remaining_count` so each round costs O(n·b̄); total O(n²·b̄). An n=150 stress run drops from ~2.4s to <50ms in a debug build.
- The per-provider token cache mutex was held across `token_exchange::exchange()`, serializing distinct-scope fetches even when their cache entries were independent, a real throughput funnel under high-fanout Chainguard / Docker Hardened sync. A new `TokenCache` helper in `auth/token_cache.rs` owns the live token map plus a per-scope `Arc<Mutex<()>>` map; same-scope fetches coalesce via the per-scope mutex with a double-checked cache read, distinct scopes run in parallel. `basic`, `anonymous`, `docker-config`, `ecr-public`, `gcp`, and `acr` all route through it.
- Tests: helper unit tests cover same-scope coalescing / distinct-scope parallelism (deadlock detection via `Barrier`) / failure-path retry / `clear()` behavior, and 1 wiremock integration test on `basic.rs` for end-to-end failure-path wiring.

Test plan
- `cargo test --workspace --locked`, `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo deny check`
- `TokenCache` helper contract (`auth/token_cache.rs`): it owns the parallelism guarantee; per-provider tests only verify wiring
- `basic` is the canonical example, `anonymous` / `docker-config` mirror it 1:1, `ecr-public` / `gcp` thread their SDK credential refresh through the fetch closure, `acr` does the same with the AAD refresh+access two-step
- `latest: 20` should no longer be required