Context
In large landscapes (500+ managed resources), the CF provider can hit CF CAPI rate limits. CF CAPI has three independent rate limiters, all operator-configured and disabled by default (CF CAPI rate limiting docs):

| Rate limiter | BOSH config key | Unit | Algorithm | When it applies |
|---|---|---|---|---|
| Global | cc.rate_limiter.general.per_minute | requests/min across all users | Counter reset every reset_interval seconds | All requests |
| Unauthenticated | cc.rate_limiter.unauthenticated.per_minute | requests/min for unauthenticated callers | Counter reset every reset_interval seconds | Unauthenticated requests only |
| Per-user | cc.rate_limiter.per_user.per_minute | requests/min per authenticated user | Token bucket (tokens replenished every reset_interval seconds) | Authenticated requests, per user |

The cc.rate_limiter.reset_interval property (default: 60 seconds) controls how frequently counters reset (global) or tokens are replenished (per-user). Rate limiting is off by default — operators must set cc.rate_limiter.enabled: true and configure limits. The rate limiter uses a shared store (CC database), so limits are enforced globally across all Cloud Controller instances.
The Crossplane provider authenticates as a single CF user, so the per-user rate limiter is the one most likely to trigger. However, in deployments where other clients also hit the same CC, the global rate limiter may be hit first.
When CF CAPI returns HTTP 429, the provider treats it like any other error — Crossplane's managed reconciler returns a ReconcileError and the workqueue applies generic exponential backoff (1s→60s). This means: (1) the resource condition shows ReconcileError even though the issue is transient rate limiting, (2) the provider may retry before the rate limit resets or wait longer than necessary, (3) operators cannot distinguish rate-limit issues from genuine failures.
CF CAPI has two distinct 429 mechanisms with different headers and retry semantics:
Time-based rate limits (error codes 10013, 10014, 10018) — the CC's own rate limiters. Responses include:

| Header | On 429 | On success (authenticated) | Format |
|---|---|---|---|
| X-RateLimit-Limit | Configured limit value | Per-user limit value | Integer |
| X-RateLimit-Remaining | 0 | Remaining requests in window | Integer |
| X-RateLimit-Reset | Window reset time | Window reset time for user | Unix timestamp (seconds) |
| Retry-After | Seconds to wait before retrying | Not present | Integer (seconds) |

Concurrent request limit (error code 10016, CF-ServiceBrokerRateLimitExceeded) — limits concurrent in-flight requests from CC to service brokers, per user, per CC instance. This is NOT a time-based rate limit; the limit frees up as in-flight requests complete. Responses include a Retry-After header with an absolute time (not relative seconds) suggesting when to retry — computed as current time plus a random value between 0.5x and 1.5x of cc.broker_client_timeout_seconds (default 60s, so 30–90s range). Unlike the general rate limiters, this applies per CC instance rather than globally across the CC database.
Why This Requires Provider Code Changes
The managed reconciler only sees a generic error from Reconcile(), so it cannot distinguish HTTP 429 from other failures. To return RequeueAfter with precise timing (e.g., based on X-RateLimit-Reset), crossplane-runtime would need a new error interface like RateLimitedError that carries a RetryAfter duration. Without that, the provider can still add 429 detection, but it must rely on generic exponential backoff rather than precise RequeueAfter timing.
Scope
- Detect HTTP 429 and CF rate-limit error codes (10013, 10014, 10016, 10018) from CF CAPI responses
- For time-based rate limits (10013, 10014, 10018): parse the Retry-After header (preferred, since it directly gives the seconds to wait) and fall back to X-RateLimit-Reset (Unix seconds) to compute reconcile.Result{RequeueAfter: duration}. Return this instead of an error.
- For concurrent-request rate limits (10016): parse the Retry-After header (an absolute time, unlike time-based 429s where it is relative seconds). Compute requeueAfter from the absolute time. If the header is absent, fall back to a short RequeueAfter (e.g., 2–5s), since the limit frees up as in-flight requests complete.
- Set a specific condition or event on the managed resource indicating rate-limit deferral
Out of scope
- Client-side token bucket rate limiters (longer-term)
- Proactive throttling based on X-RateLimit-Remaining from successful responses (longer-term optimization that could reduce 429s before they happen)
- Rate-limit handling for non-CAPI CF endpoints
- Ecosystem-wide RateLimitedError interface in crossplane-runtime (separate cross-cutting concern)
Technical Steps
- Add an isRateLimited(err error) (time.Duration, bool) helper in internal/clients/error.go
- Use go-cfclient's resource.IsRateLimitExceededError(err) to detect CF rate limit errors (codes 10013, 10014, 10016, 10018)
- For time-based limits (10013, 10014, 10018): parse the Retry-After header from the 429 response (preferred, since it directly gives the seconds to wait). Fall back to X-RateLimit-Reset (Unix seconds; BTP's X-Ratelimit-Reset carries Unix milliseconds instead). Compute: requeueAfter = time.Until(time.Unix(resetTimestamp, 0))
- For concurrent-request limits (10016): parse the Retry-After header (an absolute time, not relative seconds as in time-based 429s). Compute: requeueAfter = parsedAbsoluteTime.Sub(time.Now()). The value is current time + random(0.5x–1.5x of cc.broker_client_timeout_seconds, default 60s, so a 30–90s range). If the header is absent, fall back to a short backoff (2–5s).
- In each controller's Observe()/Create()/Update()/Delete(), check for rate limits before returning errors
- Add unit tests with mocked 429 responses: (a) time-based with Retry-After (relative seconds) and X-RateLimit-Reset headers, (b) concurrent-request (10016) with Retry-After (absolute time), (c) 10016 with a missing Retry-After (fallback to short backoff)
Workarounds & Alternatives
- Set --max-reconcile-rate conservatively to stay below the CF CAPI per-user limit
- Increase the --poll interval (e.g., 3–5m) to reduce background polling pressure
Additional context
- CF CAPI rate limits: operator-configured via cc.rate_limiter.*, disabled by default (docs, capi-release BOSH spec)
- Configuration: cc.rate_limiter.general.per_minute (global), cc.rate_limiter.unauthenticated.per_minute, cc.rate_limiter.per_user.per_minute, cc.rate_limiter.reset_interval (seconds, default 60), cc.rate_limiter.enabled (default: false)
- Rate limiting uses the shared CC database, so limits are enforced globally across all CC instances
- All V3 API endpoints are subject to rate limiting with no per-endpoint exemptions
- CF error codes: go-cfclient/resource/error_cf.go defines CF-RateLimitExceeded (10013), CF-IPBasedRateLimitExceeded (10014), CF-ServiceBrokerRateLimitExceeded (10016), CF-RateLimitV2APIExceeded (10018)
- CF CAPI has two distinct 429 mechanisms: time-based rate limits (10013, 10014, 10018) include Retry-After (relative seconds) + X-RateLimit-* headers; the concurrent request limit (10016) includes Retry-After (absolute time) but no X-RateLimit-* headers
- Header naming: CF time-based 429s use X-RateLimit-Reset (capital L, Unix seconds) and Retry-After (seconds); BTP uses X-Ratelimit-Reset (lowercase l, Unix milliseconds, no Retry-After). HTTP header names are case-insensitive, so the spelling difference does not matter to a compliant parser; code shared between both must handle the unit difference (seconds vs. milliseconds) and the missing Retry-After.
- 10016 applies per CC instance (not globally via the CC database like the general rate limiters), and applies only to service broker endpoints (v2 service instances/bindings/keys, v3 parameters endpoints)
- 10016 Retry-After value: current time + random(0.5x–1.5x of cc.broker_client_timeout_seconds, default 60s) → typically a 30–90s range
- The managed reconciler cannot handle 429s generically because it sees only an error value. Generic 429 handling requires either a RateLimitedError interface in crossplane-runtime (ecosystem coordination) or per-provider controller changes (this issue).