[FEATURE] Handle 429 rate-limit responses from CF CAPI

**Context**

In large landscapes (500+ managed resources), the CF provider can hit CF CAPI rate limits. CF CAPI has three independent rate limiters, all operator-configured and **disabled by default** ([CF CAPI rate limiting docs](https://docs.cloudfoundry.org/running/rate-limit-cloud-controller-api.html)):

| Rate limiter | BOSH config key | Unit | Algorithm | When it applies |
|---|---|---|---|---|
| Global | `cc.rate_limiter.general.per_minute` | requests/min across all users | Counter reset every `reset_interval` seconds | All requests |
| Unauthenticated | `cc.rate_limiter.unauthenticated.per_minute` | requests/min for unauthenticated callers | Counter reset every `reset_interval` seconds | Unauthenticated requests only |
| Per-user | `cc.rate_limiter.per_user.per_minute` | requests/min per authenticated user | Token bucket (tokens replenished every `reset_interval` seconds) | Authenticated requests, per user |

The `cc.rate_limiter.reset_interval` property (default: 60 seconds) controls how frequently counters reset (global) or tokens are replenished (per-user). Rate limiting is off by default — operators must set `cc.rate_limiter.enabled: true` and configure limits. The rate limiter uses a shared store (CC database), so limits are enforced globally across all Cloud Controller instances.

The Crossplane provider authenticates as a single CF user, so the **per-user rate limiter** is the one most likely to trigger. However, in deployments where other clients also hit the same CC, the **global rate limiter** may be hit first.

When CF CAPI returns HTTP 429, the provider treats it like any other error: Crossplane's managed reconciler sets a `ReconcileError` condition and the workqueue applies generic exponential backoff (1s→60s). This means: (1) the resource condition shows `ReconcileError` even though the issue is transient rate limiting, (2) the provider may retry before the rate limit resets or wait longer than necessary, and (3) operators cannot distinguish rate-limit issues from genuine failures.

CF CAPI has two distinct 429 mechanisms with different headers and retry semantics:

**Time-based rate limits (error codes 10013, 10014, 10018)** — the CC's own rate limiters. Responses include:

| Header | On 429 | On success (authenticated) | Format |
|---|---|---|---|
| `X-RateLimit-Limit` | Configured limit value | Per-user limit value | Integer |
| `X-RateLimit-Remaining` | 0 | Remaining requests in window | Integer |
| `X-RateLimit-Reset` | Window reset time | Window reset time for user | Unix timestamp (seconds) |
| `Retry-After` | Seconds to wait before retrying | — | Integer (seconds) |

**Concurrent request limit (error code 10016, `CF-ServiceBrokerRateLimitExceeded`)** — limits concurrent in-flight requests from CC to service brokers, per user, per CC instance. This is NOT a time-based rate limit; the limit frees up as in-flight requests complete. Responses include a `Retry-After` header with an **absolute time** (not relative seconds) suggesting when to retry — computed as current time plus a random value between 0.5x and 1.5x of `cc.broker_client_timeout_seconds` (default 60s, so 30–90s range). Unlike the general rate limiters, this applies **per CC instance** rather than globally across the CC database.

**Why This Requires Provider Code Changes**

The managed reconciler only sees a generic `error` from the provider's `Observe()`/`Create()`/`Update()`/`Delete()` calls, so it cannot distinguish HTTP 429 from other failures. To return `RequeueAfter` with precise timing (e.g., based on `X-RateLimit-Reset`), `crossplane-runtime` would need a new error interface like `RateLimitedError` that carries a `RetryAfter` duration. Without that, the provider can still add 429 detection, but it must rely on generic exponential backoff rather than precise `RequeueAfter` timing.

**Scope**

- Detect HTTP 429 and CF rate-limit error codes (10013, 10014, 10016, 10018) from CF CAPI responses
- For time-based rate limits (10013, 10014, 10018): parse `Retry-After` header (preferred — directly gives seconds to wait) and fall back to `X-RateLimit-Reset` (Unix seconds) to compute `reconcile.Result{RequeueAfter: duration}`. Return this instead of an error.
- For concurrent-request rate limits (10016): parse the `Retry-After` header (absolute time — differs from time-based 429s where it's relative seconds). Compute `requeueAfter` from the absolute time. If the header is absent, fall back to a short `RequeueAfter` (e.g., 2–5s) since the limit frees up as in-flight requests complete.
- Set a specific condition or event on the managed resource indicating rate-limit deferral

**Out of scope**


- Client-side token bucket rate limiters (longer-term)
- Proactive throttling based on `X-RateLimit-Remaining` from successful responses (longer-term optimization — could reduce 429s before they happen)
- Rate-limit handling for non-CAPI CF endpoints
- Ecosystem-wide `RateLimitedError` interface in `crossplane-runtime` (separate cross-cutting concern)

**Technical Steps**

1. Add an `isRateLimited(err error) (time.Duration, bool)` helper in `internal/clients/error.go`
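2. Use go-cfclient's `resource.IsRateLimitExceededError(err)` to detect CF rate limit errors (codes 10013, 10014, 10016, 10018). Check in `resource/error_cf.go` whether this helper matches only 10013; if so, the other three codes need their own checks.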
3. For time-based limits (10013, 10014, 10018): parse the `Retry-After` header from the 429 response (preferred, since it directly gives the seconds to wait). Fall back to `X-RateLimit-Reset` (Unix seconds, spelled with a capital L; BTP's `X-Ratelimit-Reset` uses a lowercase l and milliseconds). For the fallback, compute: `requeueAfter = time.Until(time.Unix(resetTimestamp, 0))`
4. For concurrent-request limits (10016): parse the `Retry-After` header (an absolute time, not relative seconds as in time-based 429s). Compute: `requeueAfter = parsedAbsoluteTime.Sub(time.Now())`. The header value is the current time plus a random 0.5x–1.5x of `cc.broker_client_timeout_seconds` (default 60s, so 30–90s ahead). If the header is absent, fall back to a short backoff (2–5s).
5. In each controller's `Observe()`/`Create()`/`Update()`/`Delete()`, check for rate limits before returning errors
6. Add unit tests with mocked 429 responses: (a) time-based with `Retry-After` (relative seconds) and `X-RateLimit-Reset` headers, (b) concurrent-request (10016) with `Retry-After` (absolute time), (c) 10016 with missing `Retry-After` (fallback to short backoff)
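Steps 2 to 4 depend on telling the two 429 mechanisms apart. A minimal sketch of that dispatch is shown here, keyed on the CF error codes listed in this issue. The `limitKind` type and `classify` function are hypothetical, and treating a 429 with an unrecognized code as time-based is an assumption rather than documented CAPI behavior.

```go
package main

import "net/http"

// limitKind distinguishes the two 429 mechanisms described in this issue.
type limitKind int

const (
	notRateLimited   limitKind = iota
	timeBased                  // 10013, 10014, 10018: relative Retry-After plus X-RateLimit-* headers
	concurrentBroker           // 10016: absolute Retry-After, no X-RateLimit-* headers
)

// classify maps an HTTP status and CF error code to a limitKind.
func classify(status, cfCode int) limitKind {
	switch cfCode {
	case 10013, 10014, 10018:
		return timeBased
	case 10016:
		return concurrentBroker
	}
	if status == http.StatusTooManyRequests {
		// Assumption: an unrecognized 429 is handled like a time-based limit.
		return timeBased
	}
	return notRateLimited
}
```

The unit tests in step 6 could then assert one `limitKind` per mocked response before checking the computed delay.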

**Workarounds & Alternatives**

- Set `--max-reconcile-rate` conservatively to stay below the CF CAPI per-user limit
- Increase `--poll` interval (e.g., 3-5m) to reduce background polling pressure

**Additional context**

- CF CAPI rate limits: operator-configured via `cc.rate_limiter.*`, disabled by default ([docs](https://docs.cloudfoundry.org/running/rate-limit-cloud-controller-api.html), [capi-release BOSH spec](https://github.com/cloudfoundry/capi-release/blob/main/jobs/cloud_controller_ng/spec))
- Configuration: `cc.rate_limiter.general.per_minute` (global), `cc.rate_limiter.unauthenticated.per_minute`, `cc.rate_limiter.per_user.per_minute`, `cc.rate_limiter.reset_interval` (seconds, default 60), `cc.rate_limiter.enabled` (default: false)
- Rate limiting uses shared CC database — enforced globally across all CC instances
- All V3 API endpoints are subject to rate limiting with no per-endpoint exemptions
- CF error codes: [`go-cfclient/resource/error_cf.go`](https://github.com/cloudfoundry/go-cfclient/blob/main/resource/error_cf.go) — `CF-RateLimitExceeded` (10013), `CF-IPBasedRateLimitExceeded` (10014), `CF-ServiceBrokerRateLimitExceeded` (10016), `CF-RateLimitV2APIExceeded` (10018)
- CF CAPI has two distinct 429 mechanisms: time-based rate limits (10013, 10014, 10018) include `Retry-After` (relative seconds) + `X-RateLimit-*` headers; concurrent request limit (10016) includes `Retry-After` (absolute time) but no `X-RateLimit-*` headers
- Header naming: CF time-based 429s use `X-RateLimit-Reset` (capital L, Unix seconds) and `Retry-After` (seconds); BTP uses `X-Ratelimit-Reset` (lowercase l, Unix milliseconds, no `Retry-After`). Code parsing both must handle both variants.
- 10016 applies per CC instance (not globally via CC database like the general rate limiters), and applies only to service broker endpoints (v2 service instances/bindings/keys, v3 parameters endpoints)
- 10016 `Retry-After` value: current time + random(0.5x–1.5x of `cc.broker_client_timeout_seconds`, default 60s) → typically 30–90s range
- The managed reconciler cannot handle 429s generically because it sees only an `error` value. Generic 429 handling requires either a `RateLimitedError` interface in `crossplane-runtime` (ecosystem coordination) or per-provider controller changes (this issue).
