fix: wait for operator readiness before creating Instrumentation#25
Merged
Merged
Conversation
The Instrumentation resource is gated by the OpenTelemetry Operator's mutating admission webhook, which only serves once the operator pod is Ready. The script gated on a blind `sleep 30` and then applied with stderr discarded (2>/dev/null), so on slow clusters (e.g. GKE Autopilot, which provisions a node before the pod starts) all 5 apply attempts raced an unready webhook and failed with no visible error. - install_operator: replace the blind `sleep 30` with `kubectl rollout status` on the operator deployment. Returns as soon as it is Ready, and waits up to OPERATOR_READY_TIMEOUT (default 180s) on slow clusters. - create_instrumentation: wait on the operator deployment before applying, and capture stderr (2>&1) so a persistent failure surfaces the real API-server error instead of a blind "Attempt failed". Also moved the "try manually" hint before log_error (which exits 1) so it actually prints. Tests: new tests/test_create_instrumentation.py and a readiness assertion in tests/test_install_operator.py. Verified end-to-end on kind: instrumentation now succeeds on the first attempt.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Customers on slower clusters saw the setup script fail with:
The
Instrumentationresource is gated by the OpenTelemetry Operator's mutating admission webhook, which only serves once the operator pod is Ready. The script gated readiness on a blindsleep 30, then rankubectl apply -f instrumentation.yaml 2>/dev/null. On Autopilot the operator takes longer than 30s to come up (it provisions a node first), so every apply attempt raced an unready webhook — and2>/dev/nulldiscarded the real error, so the failure was undiagnosable.Changes
install_operator: replaced the blindsleep 30withkubectl rollout status deployment/opentelemetry-operator --timeout=${OPERATOR_READY_TIMEOUT:-180s}. Returns the moment the operator is Ready (faster than a flat 30s on healthy clusters) and blocks up to the timeout on slow ones.create_instrumentation: waits on the operator deployment before applying, and captures stderr (2>&1) so a persistent failure surfaces the real API-server error (e.g.failed calling webhook ... connection refused) instead of a blind "Attempt failed". Also moved the "try manually" hint beforelog_error(whichexit 1s) so it actually prints.OPERATOR_READY_TIMEOUTis overridable for very slow clusters.Tests
tests/test_create_instrumentation.py(3 tests): waits before apply, surfaces the real error on persistent failure, succeeds on happy path.tests/test_install_operator.py: added a readiness-wait assertion.Note for the diagnosable-but-unfixable case
If the operator never becomes Ready (ImagePullBackOff, Pending, CrashLoopBackOff, x509), no wait can fix that — but these changes now surface the real reason instead of failing silently. Check
kubectl describe pods -n <ns> -l app.kubernetes.io/name=opentelemetry-operator.