fix: wait for operator readiness before creating Instrumentation by karthikeyangs9 · Pull Request #25 · last9/last9-k8s-observability

karthikeyangs9 · 2026-06-01T10:33:18Z

Problem

Customers on slower clusters saw the setup script fail with:

[INFO] Attempt 1 of 5 to create instrumentation...
[WARN] Attempt 1 failed. Waiting before retry...
... (x5) ...
[ERROR] Failed to create instrumentation after 5 attempts.

The Instrumentation resource is gated by the OpenTelemetry Operator's mutating admission webhook, which only serves once the operator pod is Ready. The script's only readiness check was a blind sleep 30, after which it ran kubectl apply -f instrumentation.yaml 2>/dev/null. On GKE Autopilot the operator takes longer than 30s to come up (it provisions a node first), so every apply attempt raced an unready webhook. Because 2>/dev/null discarded the real error, the failure was also undiagnosable.
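To illustrate why the discarded stderr made this undiagnosable, here is a minimal sketch that contrasts the two redirections. `fake_apply` is a hypothetical stand-in for a `kubectl apply` that hits an unready webhook; it is not part of the script, and the message it prints is the abbreviated example from this PR, not a full API-server error.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `kubectl apply` failing against an unready webhook.
fake_apply() {
  echo "failed calling webhook ... connection refused" >&2
  return 1
}

# Old pattern: stderr is thrown away, so only the exit status survives.
old_msg=""
fake_apply 2>/dev/null || old_msg="Attempt failed"

# New pattern: stderr is folded into stdout and captured for logging.
new_msg=$(fake_apply 2>&1) || true

echo "old: $old_msg"   # prints "old: Attempt failed"
echo "new: $new_msg"   # prints "new: failed calling webhook ... connection refused"
```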

Changes

install_operator: replaced the blind sleep 30 with kubectl rollout status deployment/opentelemetry-operator --timeout=${OPERATOR_READY_TIMEOUT:-180s}. The command returns as soon as the operator is Ready (faster than a flat 30s on healthy clusters) and blocks up to the timeout on slow ones.
create_instrumentation: now waits on the operator deployment before applying, and captures stderr (2>&1) so a persistent failure surfaces the real API-server error (e.g. failed calling webhook ... connection refused) instead of a blind "Attempt failed". Also moved the "try manually" hint before log_error (which calls exit 1) so that it actually prints.

OPERATOR_READY_TIMEOUT is overridable for very slow clusters.

Tests

New tests/test_create_instrumentation.py (3 tests): waits before apply, surfaces the real error on persistent failure, succeeds on happy path.
tests/test_install_operator.py: added a readiness-wait assertion.
Verified end-to-end on a kind cluster: with both fixes, instrumentation now succeeds on the first attempt (previously it failed attempt 1 and scraped by on attempt 2).

Note for the diagnosable-but-unfixable case

If the operator never becomes Ready (ImagePullBackOff, Pending, CrashLoopBackOff, x509), no wait can fix that, but these changes now surface the real reason instead of failing silently. To investigate, check kubectl describe pods -n <ns> -l app.kubernetes.io/name=opentelemetry-operator.
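As a possible starting point for that investigation, the sketch below wraps the describe command from this note together with two other standard kubectl queries. `diagnose_operator` is a hypothetical helper, not part of the script; the namespace is passed in by the caller because this PR does not say which one the operator uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: gather the usual evidence for an operator pod that never becomes Ready.
diagnose_operator() {
  local ns="${1:?usage: diagnose_operator <namespace>}"
  local selector="app.kubernetes.io/name=opentelemetry-operator"

  # Pod phase and node placement (Pending, ImagePullBackOff, CrashLoopBackOff show here).
  kubectl get pods -n "$ns" -l "$selector" -o wide

  # Conditions and per-pod events, as suggested in the note above.
  kubectl describe pods -n "$ns" -l "$selector"

  # Namespace events in time order, useful for scheduling and image-pull failures.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```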

The Instrumentation resource is gated by the OpenTelemetry Operator's mutating admission webhook, which only serves once the operator pod is Ready. The script gated on a blind `sleep 30` and then applied with stderr discarded (2>/dev/null), so on slow clusters (e.g. GKE Autopilot, which provisions a node before the pod starts) all 5 apply attempts raced an unready webhook and failed with no visible error.

- install_operator: replace the blind `sleep 30` with `kubectl rollout status` on the operator deployment. Returns as soon as it is Ready, and waits up to OPERATOR_READY_TIMEOUT (default 180s) on slow clusters.
- create_instrumentation: wait on the operator deployment before applying, and capture stderr (2>&1) so a persistent failure surfaces the real API-server error instead of a blind "Attempt failed". Also moved the "try manually" hint before log_error (which exits 1) so it actually prints.

Tests: new tests/test_create_instrumentation.py and a readiness assertion in tests/test_install_operator.py. Verified end-to-end on kind: instrumentation now succeeds on the first attempt.

prathamesh-sonpatki merged commit 4818bfe into main Jun 1, 2026
16 of 18 checks passed

prathamesh-sonpatki deleted the fix/instrumentation-webhook-readiness branch June 1, 2026 11:02
