fix: wait for operator readiness before creating Instrumentation #25

Merged: prathamesh-sonpatki merged 1 commit into main from fix/instrumentation-webhook-readiness on Jun 1, 2026
karthikeyangs9 (Contributor) commented on Jun 1, 2026

Problem

Customers on slower clusters saw the setup script fail with:

[INFO] Attempt 1 of 5 to create instrumentation...
[WARN] Attempt 1 failed. Waiting before retry...
... (x5) ...
[ERROR] Failed to create instrumentation after 5 attempts.

The Instrumentation resource is gated by the OpenTelemetry Operator's mutating admission webhook, which only serves once the operator pod is Ready. The script gated readiness on a blind sleep 30, then ran kubectl apply -f instrumentation.yaml 2>/dev/null. On Autopilot the operator takes longer than 30s to come up (it provisions a node first), so every apply attempt raced an unready webhook — and 2>/dev/null discarded the real error, so the failure was undiagnosable.
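To make the stderr point concrete, here is a minimal sketch of the two redirections. `fake_apply` is a hypothetical stand-in for a failing `kubectl apply`, and its message is illustrative only.

```shell
# Hypothetical stand-in for a failing `kubectl apply`; the message is illustrative.
fake_apply() {
  echo "failed calling webhook: connection refused" >&2
  return 1
}

# Old pattern: stderr is discarded, so only the exit status survives.
discarded=$(fake_apply 2>/dev/null) || true

# New pattern: stderr is folded into stdout and captured for logging.
captured=$(fake_apply 2>&1) || true

echo "discarded: '${discarded}'"
echo "captured:  '${captured}'"
```

With `2>/dev/null` the captured string is empty, which is why the old retry loop could only report that an attempt had failed.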

Changes

  • install_operator: replaced the blind sleep 30 with kubectl rollout status deployment/opentelemetry-operator --timeout=${OPERATOR_READY_TIMEOUT:-180s}. Returns the moment the operator is Ready (faster than a flat 30s on healthy clusters) and blocks up to the timeout on slow ones.
  • create_instrumentation: waits on the operator deployment before applying, and captures stderr (2>&1) so a persistent failure surfaces the real API-server error (e.g. failed calling webhook ... connection refused) instead of a blind "Attempt failed". Also moved the "try manually" hint before log_error (which exits with status 1) so it actually prints.

OPERATOR_READY_TIMEOUT is overridable for very slow clusters.
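The two changes above could look roughly like the sketch below. It is not the script's actual code: the namespace default, `RETRY_DELAY`, `MAX_ATTEMPTS`, `wait_for_operator`, the log helper bodies, and the log wording are all assumptions made for illustration. Only the `kubectl rollout status ... --timeout` call, the `2>&1` capture, and the hint-before-`log_error` ordering come from the description.

```shell
# Sketch only. The namespace default, retry delay, helper names, and log
# wording are assumptions, not the script's actual code.
OPERATOR_READY_TIMEOUT="${OPERATOR_READY_TIMEOUT:-180s}"
OPERATOR_NAMESPACE="${OPERATOR_NAMESPACE:-opentelemetry-operator-system}"  # assumed default
RETRY_DELAY="${RETRY_DELAY:-10}"                                           # assumed, in seconds
MAX_ATTEMPTS=5

log_info()  { echo "[INFO] $*"; }
log_warn()  { echo "[WARN] $*"; }
log_error() { echo "[ERROR] $*" >&2; exit 1; }  # exits, so nothing after it runs

# Block until the operator deployment has rolled out, or the timeout expires.
wait_for_operator() {
  kubectl rollout status deployment/opentelemetry-operator \
    -n "$OPERATOR_NAMESPACE" --timeout="$OPERATOR_READY_TIMEOUT"
}

create_instrumentation() {
  wait_for_operator || log_warn "Operator not Ready after $OPERATOR_READY_TIMEOUT; trying apply anyway."

  attempt=1
  last_error=""
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    log_info "Attempt $attempt of $MAX_ATTEMPTS to create instrumentation..."
    # Capture stderr together with stdout so the API-server error is kept.
    if last_error=$(kubectl apply -f instrumentation.yaml 2>&1); then
      log_info "Instrumentation created."
      return 0
    fi
    log_warn "Attempt $attempt failed: $last_error"
    if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then sleep "$RETRY_DELAY"; fi
    attempt=$((attempt + 1))
  done

  # The hint must come before log_error, because log_error exits.
  log_warn "Try manually: kubectl apply -f instrumentation.yaml"
  log_error "Failed to create instrumentation after $MAX_ATTEMPTS attempts. Last error: $last_error"
}
```

On a very slow cluster the timeout would be raised at invocation time, for example `OPERATOR_READY_TIMEOUT=600s` set in the environment before running the setup script.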

Tests

  • New tests/test_create_instrumentation.py (3 tests): waits before apply, surfaces the real error on persistent failure, succeeds on happy path.
  • tests/test_install_operator.py: added a readiness-wait assertion.
  • Verified end-to-end on a kind cluster: with both fixes, instrumentation now succeeds on the first attempt (previously it failed attempt 1 and scraped by on attempt 2).

Note for the diagnosable-but-unfixable case

If the operator never becomes Ready (ImagePullBackOff, Pending, CrashLoopBackOff, x509), no wait can fix that — but these changes now surface the real reason instead of failing silently. Check kubectl describe pods -n <ns> -l app.kubernetes.io/name=opentelemetry-operator.
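For that case, a small triage helper along these lines may help. It is a sketch: the function names and hint wording are assumptions, and the hints are general Kubernetes guidance, not output from the setup script.

```shell
# Sketch: function names and hint wording are assumptions, not part of the script.
OPERATOR_SELECTOR="app.kubernetes.io/name=opentelemetry-operator"

# Map a pod state or error string to a first thing to check.
hint_for_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image pull failing: check image name, tag, and registry credentials" ;;
    Pending)
      echo "pod not scheduled yet: check node capacity and scheduling events" ;;
    CrashLoopBackOff)
      echo "container keeps exiting: check logs with kubectl logs --previous" ;;
    *x509*)
      echo "TLS verification failing: check the webhook serving certificate" ;;
    *)
      echo "unrecognized state: inspect kubectl describe output" ;;
  esac
}

# Print pod status, the full description, and recent events for the operator.
diagnose_operator() {
  ns="$1"
  kubectl get pods -n "$ns" -l "$OPERATOR_SELECTOR" -o wide
  kubectl describe pods -n "$ns" -l "$OPERATOR_SELECTOR"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```

Usage would be `diagnose_operator <ns>` with the namespace the operator was installed into, followed by `hint_for_reason` on whatever state the pod reports.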

The Instrumentation resource is gated by the OpenTelemetry Operator's
mutating admission webhook, which only serves once the operator pod is
Ready. The script gated on a blind `sleep 30` and then applied with stderr
discarded (2>/dev/null), so on slow clusters (e.g. GKE Autopilot, which
provisions a node before the pod starts) all 5 apply attempts raced an
unready webhook and failed with no visible error.

- install_operator: replace the blind `sleep 30` with `kubectl rollout
  status` on the operator deployment. Returns as soon as it is Ready, and
  waits up to OPERATOR_READY_TIMEOUT (default 180s) on slow clusters.
- create_instrumentation: wait on the operator deployment before applying,
  and capture stderr (2>&1) so a persistent failure surfaces the real
  API-server error instead of a blind "Attempt failed". Also moved the
  "try manually" hint before log_error (which exits 1) so it actually prints.

Tests: new tests/test_create_instrumentation.py and a readiness assertion in
tests/test_install_operator.py. Verified end-to-end on kind: instrumentation
now succeeds on the first attempt.
prathamesh-sonpatki merged commit 4818bfe into main on Jun 1, 2026
16 of 18 checks passed
prathamesh-sonpatki deleted the fix/instrumentation-webhook-readiness branch on June 1, 2026 11:02