[FEATURE] Handle 429 rate-limit responses from CF CAPI #269

Description

@gergely-szabo-sap

Context

In large landscapes (500+ managed resources), the CF provider can hit CF CAPI rate limits. CF CAPI has three independent rate limiters, all operator-configured and disabled by default (CF CAPI rate limiting docs):

| Rate limiter | BOSH config key | Unit | Algorithm | When it applies |
|---|---|---|---|---|
| Global | cc.rate_limiter.general.per_minute | requests/min across all users | Counter reset every reset_interval seconds | All requests |
| Unauthenticated | cc.rate_limiter.unauthenticated.per_minute | requests/min for unauthenticated callers | Counter reset every reset_interval seconds | Unauthenticated requests only |
| Per-user | cc.rate_limiter.per_user.per_minute | requests/min per authenticated user | Token bucket (tokens replenished every reset_interval seconds) | Authenticated requests, per user |

The cc.rate_limiter.reset_interval property (default: 60 seconds) controls how frequently counters reset (global) or tokens are replenished (per-user). Rate limiting is off by default — operators must set cc.rate_limiter.enabled: true and configure limits. The rate limiter uses a shared store (CC database), so limits are enforced globally across all Cloud Controller instances.

The Crossplane provider authenticates as a single CF user, so the per-user rate limiter is the one most likely to trigger. However, in deployments where other clients also hit the same CC, the global rate limiter may be hit first.

When CF CAPI returns HTTP 429, the provider treats it like any other error — Crossplane's managed reconciler returns a ReconcileError and the workqueue applies generic exponential backoff (1s→60s). This means: (1) the resource condition shows ReconcileError even though the issue is transient rate limiting, (2) the provider may retry before the rate limit resets or wait longer than necessary, (3) operators cannot distinguish rate-limit issues from genuine failures.
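To make point (2) concrete, here is a minimal sketch of a generic exponential backoff with the 1s→60s bounds mentioned above. It is a simplified model for illustration only, not the workqueue's actual implementation; the function name `backoffDelay` is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay models a generic per-item exponential backoff: the delay
// starts at base, doubles with every previous failure, and is capped at limit.
// Simplified model for illustration, not the real workqueue code.
func backoffDelay(base, limit time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= limit {
			return limit
		}
	}
	return d
}

func main() {
	// Print the wait before each of the first eight retries.
	for n := 0; n < 8; n++ {
		fmt.Printf("retry %d: wait %s\n", n+1, backoffDelay(time.Second, 60*time.Second, n))
	}
}
```

Under this model the first retries come after 1s, 2s, 4s, 8s, 16s and 32s. With a 60-second reset_interval, several of those are likely to land inside the same rate-limit window, while later retries wait the full 60s regardless of when the window actually resets.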

CF CAPI has two distinct 429 mechanisms with different headers and retry semantics:

Time-based rate limits (error codes 10013, 10014, 10018) — the CC's own rate limiters. Responses include:

| Header | On 429 | On success (authenticated) | Format |
|---|---|---|---|
| X-RateLimit-Limit | Configured limit value | Per-user limit value | Integer |
| X-RateLimit-Remaining | 0 | Remaining requests in window | Integer |
| X-RateLimit-Reset | Window reset time | Window reset time for user | Unix timestamp (seconds) |
| Retry-After | Seconds to wait before retrying | | Integer (seconds) |

Concurrent request limit (error code 10016, CF-ServiceBrokerRateLimitExceeded) — limits concurrent in-flight requests from CC to service brokers, per user, per CC instance. This is NOT a time-based rate limit; the limit frees up as in-flight requests complete. Responses include a Retry-After header with an absolute time (not relative seconds) suggesting when to retry — computed as current time plus a random value between 0.5x and 1.5x of cc.broker_client_timeout_seconds (default 60s, so 30–90s range). Unlike the general rate limiters, this applies per CC instance rather than globally across the CC database.
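Because Retry-After carries relative seconds for the time-based limits but an absolute time for 10016, a parser has to accept both. The sketch below shows one way to do that with the standard library. It assumes the absolute form is an HTTP-date (the format `http.ParseTime` understands), which this issue has not verified against CC's actual output; `parseRetryAfter` and `referenceNow` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// referenceNow is a fixed clock used only to keep the demo deterministic.
var referenceNow = time.Date(2024, time.January, 1, 12, 0, 0, 0, time.UTC)

// parseRetryAfter converts a Retry-After value into a wait duration.
// It accepts relative seconds ("30") or an absolute HTTP-date
// (assumed format, e.g. "Mon, 01 Jan 2024 12:00:45 GMT").
// An absolute time that is already in the past yields a zero wait.
func parseRetryAfter(value string, now time.Time) (time.Duration, bool) {
	value = strings.TrimSpace(value)
	if value == "" {
		return 0, false
	}
	// Relative form: a non-negative integer number of seconds.
	if secs, err := strconv.Atoi(value); err == nil {
		if secs < 0 {
			return 0, false
		}
		return time.Duration(secs) * time.Second, true
	}
	// Absolute form: subtract the current time to get the wait.
	if at, err := http.ParseTime(value); err == nil {
		if wait := at.Sub(now); wait > 0 {
			return wait, true
		}
		return 0, true
	}
	return 0, false
}

func main() {
	for _, v := range []string{"30", "Mon, 01 Jan 2024 12:00:45 GMT", "soon"} {
		wait, ok := parseRetryAfter(v, referenceNow)
		fmt.Printf("%q -> wait=%s ok=%t\n", v, wait, ok)
	}
}
```

Passing the clock in as a parameter keeps the function testable; production code would pass `time.Now()`.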

Why This Requires Provider Code Changes

The managed reconciler only sees a generic error from Reconcile(), so it cannot distinguish HTTP 429 from other failures. To return RequeueAfter with precise timing (e.g., based on X-RateLimit-Reset), crossplane-runtime would need a new error interface like RateLimitedError that carries a RetryAfter duration. Without that, the provider can still add 429 detection, but it must rely on generic exponential backoff rather than precise RequeueAfter timing.
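For illustration, such an interface could look roughly like the sketch below. Everything here is hypothetical: crossplane-runtime defines no `RateLimitedError` today, and `capiRateLimited` and `requeueAfter` are made-up names showing how a reconciler might extract the wait from a wrapped error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RateLimitedError is a hypothetical interface; it does not exist in
// crossplane-runtime. It marks an error as a rate-limit deferral.
type RateLimitedError interface {
	error
	RetryAfter() time.Duration
}

// capiRateLimited is an illustrative implementation a provider could return.
type capiRateLimited struct {
	wait time.Duration
}

func (e capiRateLimited) Error() string {
	return fmt.Sprintf("rate limited by CF CAPI, retry after %s", e.wait)
}

func (e capiRateLimited) RetryAfter() time.Duration { return e.wait }

// requeueAfter reports the wait carried by any RateLimitedError in err's chain.
func requeueAfter(err error) (time.Duration, bool) {
	var rl RateLimitedError
	if errors.As(err, &rl) {
		return rl.RetryAfter(), true
	}
	return 0, false
}

// Sample errors used by the demo below.
var (
	errWrapped = fmt.Errorf("cannot observe space: %w", capiRateLimited{wait: 30 * time.Second})
	errOther   = errors.New("connection refused")
)

func main() {
	fmt.Println(requeueAfter(errWrapped))
	fmt.Println(requeueAfter(errOther))
}
```

Using `errors.As` means the wait survives wrapping with `%w`, so controllers could keep adding context to errors without losing the rate-limit signal.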

Scope

  • Detect HTTP 429 and CF rate-limit error codes (10013, 10014, 10016, 10018) from CF CAPI responses
  • For time-based rate limits (10013, 10014, 10018): parse Retry-After header (preferred — directly gives seconds to wait) and fall back to X-RateLimit-Reset (Unix seconds) to compute reconcile.Result{RequeueAfter: duration}. Return this instead of an error.
  • For concurrent-request rate limits (10016): parse the Retry-After header (absolute time — differs from time-based 429s where it's relative seconds). Compute requeueAfter from the absolute time. If the header is absent, fall back to a short RequeueAfter (e.g., 2–5s) since the limit frees up as in-flight requests complete.
  • Set a specific condition or event on the managed resource indicating rate-limit deferral

Out of scope

  • Client-side token bucket rate limiters (longer-term)
  • Proactive throttling based on X-RateLimit-Remaining from successful responses (longer-term optimization — could reduce 429s before they happen)
  • Rate-limit handling for non-CAPI CF endpoints
  • Ecosystem-wide RateLimitedError interface in crossplane-runtime (separate cross-cutting concern)

Technical Steps

  1. Add an isRateLimited(err error) (time.Duration, bool) helper in internal/clients/error.go
  2. Use go-cfclient's resource.IsRateLimitExceededError(err) to detect CF rate limit errors (codes 10013, 10014, 10016, 10018)
  3. For time-based limits (10013, 10014, 10018): parse the Retry-After header from the 429 response (preferred, since it directly gives the seconds to wait). Fall back to X-RateLimit-Reset (Unix seconds, spelled with a capital L; BTP's X-Ratelimit-Reset is in milliseconds). Compute: requeueAfter = time.Until(time.Unix(resetTimestamp, 0))
  4. For concurrent-request limits (10016): parse the Retry-After header (an absolute time, not relative seconds as for time-based 429s). Compute: requeueAfter = parsedAbsoluteTime.Sub(time.Now()). The value is current time + random(0.5x–1.5x of cc.broker_client_timeout_seconds, default 60s), so it lies 30–90s in the future. If the header is absent, fall back to a short backoff (2–5s).
  5. In each controller's Observe()/Create()/Update()/Delete(), check for rate limits before returning errors
  6. Add unit tests with mocked 429 responses: (a) time-based with Retry-After (relative seconds) and X-RateLimit-Reset headers, (b) concurrent-request (10016) with Retry-After (absolute time), (c) 10016 with missing Retry-After (fallback to short backoff)
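The fallback chain in steps 3 and 4 could be sketched as follows, using only the standard library. This is not the proposed helper itself: it takes the CF error code and response headers directly instead of a go-cfclient error, the 3s `shortBackoff` is an arbitrary pick from the 2–5s range, the absolute Retry-After is assumed to be an HTTP-date, and `requeueDelay`, `nonNegative` and `referenceNow` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// shortBackoff is an assumed fallback within the 2–5s range from step 4.
const shortBackoff = 3 * time.Second

// referenceNow is a fixed clock used only to keep the demo deterministic.
var referenceNow = time.Date(2024, time.January, 1, 12, 0, 0, 0, time.UTC)

// nonNegative clamps waits that are already in the past to zero.
func nonNegative(d time.Duration) time.Duration {
	if d < 0 {
		return 0
	}
	return d
}

// requeueDelay maps a CF error code plus response headers to a wait.
// ok is false when the code is not a rate-limit code, or when a time-based
// limit carries no usable header; the caller would then return the original
// error and let generic backoff apply.
func requeueDelay(code int, h http.Header, now time.Time) (time.Duration, bool) {
	switch code {
	case 10013, 10014, 10018: // time-based limits
		// Preferred: Retry-After in relative seconds.
		if secs, err := strconv.Atoi(h.Get("Retry-After")); err == nil && secs >= 0 {
			return time.Duration(secs) * time.Second, true
		}
		// Fallback: X-RateLimit-Reset as a Unix timestamp in seconds.
		if ts, err := strconv.ParseInt(h.Get("X-RateLimit-Reset"), 10, 64); err == nil {
			return nonNegative(time.Unix(ts, 0).Sub(now)), true
		}
		return 0, false
	case 10016: // concurrent service broker requests
		// Retry-After as an absolute time (HTTP-date format is an assumption).
		if at, err := http.ParseTime(h.Get("Retry-After")); err == nil {
			return nonNegative(at.Sub(now)), true
		}
		return shortBackoff, true
	}
	return 0, false
}

func main() {
	h := http.Header{}
	h.Set("X-RateLimit-Reset", strconv.FormatInt(referenceNow.Add(time.Minute).Unix(), 10))
	fmt.Println(requeueDelay(10013, h, referenceNow))             // only the reset header is set
	fmt.Println(requeueDelay(10016, http.Header{}, referenceNow)) // header missing
}
```

The three test cases from step 6 map directly onto the three branches: relative Retry-After or X-RateLimit-Reset, absolute Retry-After, and the missing-header fallback.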

Workarounds & Alternatives

  • Set --max-reconcile-rate conservatively to stay below the CF CAPI per-user limit
  • Increase --poll interval (e.g., 3-5m) to reduce background polling pressure

Additional context

  • CF CAPI rate limits: operator-configured via cc.rate_limiter.*, disabled by default (docs, capi-release BOSH spec)
  • Configuration: cc.rate_limiter.general.per_minute (global), cc.rate_limiter.unauthenticated.per_minute, cc.rate_limiter.per_user.per_minute, cc.rate_limiter.reset_interval (seconds, default 60), cc.rate_limiter.enabled (default: false)
  • Rate limiting uses shared CC database — enforced globally across all CC instances
  • All V3 API endpoints are subject to rate limiting with no per-endpoint exemptions
  • CF error codes (go-cfclient/resource/error_cf.go): CF-RateLimitExceeded (10013), CF-IPBasedRateLimitExceeded (10014), CF-ServiceBrokerRateLimitExceeded (10016), CF-RateLimitV2APIExceeded (10018)
  • CF CAPI has two distinct 429 mechanisms: time-based rate limits (10013, 10014, 10018) include Retry-After (relative seconds) + X-RateLimit-* headers; concurrent request limit (10016) includes Retry-After (absolute time) but no X-RateLimit-* headers
  • Header naming: CF time-based 429s use X-RateLimit-Reset (capital L, Unix seconds) and Retry-After (seconds); BTP uses X-Ratelimit-Reset (lowercase l, Unix milliseconds, no Retry-After). Code parsing both must handle both variants.
  • 10016 applies per CC instance (not globally via CC database like the general rate limiters), and applies only to service broker endpoints (v2 service instances/bindings/keys, v3 parameters endpoints)
  • 10016 Retry-After value: current time + random(0.5x–1.5x of cc.broker_client_timeout_seconds, default 60s) → typically 30–90s range
  • The managed reconciler cannot handle 429s generically because it sees only an error interface. Generic 429 handling requires either a RateLimitedError interface in crossplane-runtime (ecosystem coordination) or per-provider controller changes (this issue).
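On the header-naming point: in Go, `http.Header.Get` canonicalizes the name it is given, so X-RateLimit-Reset and X-Ratelimit-Reset resolve to the same key for headers parsed by net/http. What code shared between CF and BTP really has to distinguish is the unit. A minimal sketch, with `resetTime` and its `unixMillis` flag as hypothetical names:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// resetTime reads the rate-limit reset header and interprets the value as
// Unix seconds (CF CAPI) or Unix milliseconds (BTP), as selected by unixMillis.
func resetTime(h http.Header, unixMillis bool) (time.Time, bool) {
	// Get canonicalizes the name, so this also matches "X-Ratelimit-Reset".
	raw := h.Get("X-RateLimit-Reset")
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return time.Time{}, false
	}
	if unixMillis {
		return time.UnixMilli(n), true
	}
	return time.Unix(n, 0), true
}

func main() {
	cf := http.Header{}
	cf.Set("X-RateLimit-Reset", "1704110460") // CF: seconds
	btp := http.Header{}
	btp.Set("X-Ratelimit-Reset", "1704110460000") // BTP: milliseconds

	cfReset, _ := resetTime(cf, false)
	btpReset, _ := resetTime(btp, true)
	fmt.Println(cfReset.Equal(btpReset))
}
```

Treating a millisecond value as seconds (or the reverse) would produce a reset time decades off, so the unit should be chosen by the API being called rather than guessed from the header spelling.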
