Skip to content

fix(telemetry): make ThreadingInstrumentor opt-in with env-var precedence#2167

Open
jpr5 wants to merge 1 commit into
strands-agents:mainfrom
jpr5:fix/threading-instrumentor-opt-in
Open

fix(telemetry): make ThreadingInstrumentor opt-in with env-var precedence#2167
jpr5 wants to merge 1 commit into
strands-agents:mainfrom
jpr5:fix/threading-instrumentor-opt-in

Conversation

@jpr5

@jpr5 jpr5 commented Apr 20, 2026

Copy link
Copy Markdown

strands>=1.35.0 unconditionally calls ThreadingInstrumentor().instrument() in Tracer.__init__, monkey-patching concurrent.futures.ThreadPoolExecutor.submit process-wide for every strands user. It ignores the standard OTEL_PYTHON_DISABLED_INSTRUMENTATIONS=threading opt-out, double-wraps when another OTel distro (Azure Monitor, opentelemetry-distro, AWS OTel) already instrumented threading, and crashes the host if instrument() raises.

This PR is additive and non-breaking: threading instrumentation stays ON by default (unchanged behavior). It adds supported opt-out/opt-in levers and hardens the call path.

Resolver precedence (highest → lowest):

  1. OTEL_PYTHON_DISABLED_INSTRUMENTATIONS containing threading → always disabled (matches OpenTelemetry auto-loader semantics).
  2. Explicit Tracer(instrument_threading=True|False) kwarg.
  3. STRANDS_INSTRUMENT_THREADING env var (1/true/yes/on enable; 0/false/no/off disable; unrecognized non-empty → WARNING + default).
  4. Default: ENABLED.

Also adds:

  • an idempotency guard (reads _is_instrumented_by_opentelemetry to avoid stacking wrappers when the host already instrumented threading), and
  • graceful-failure handlinginstrument() exceptions are caught and logged (ERROR when explicitly requested, WARNING when auto-enabled) so telemetry failure never crashes the host.

Tests run in isolated subprocesses (BaseInstrumentor is a process-wide singleton): on-by-default, env opt-in/opt-out (incl. off), kwarg opt-in/out, disable-env precedence over kwarg, idempotency, graceful failure, ERROR-vs-WARNING severity, and warn-on-unrecognized.

Flipping the default to off is a breaking change deferred to v2 (tracked on this thread).

@jpr5 jpr5 had a problem deploying to manual-approval April 20, 2026 18:48 — with GitHub Actions Failure
@jpr5 jpr5 had a problem deploying to manual-approval April 20, 2026 18:48 — with GitHub Actions Failure
@yonib05 yonib05 added area-otel Open-telemetry related python Pull requests that update python code bug Something isn't working labels May 27, 2026
maxmilian added a commit to maxmilian/harness-sdk that referenced this pull request Jun 24, 2026
…_DISABLED / enabled (strands-agents#1059)

StrandsTelemetry had no way to turn instrumentation off, so applications that
already run their own OpenTelemetry setup still got Strands' tracer/meter
providers registered globally and its spans/metrics emitted — even when they
didn't want them (see strands-agents#1059, e.g. metrics that a downstream CloudWatch pipeline
can't flush).

Add an `enabled` switch to StrandsTelemetry that defaults to honoring the
OTel-standard `OTEL_SDK_DISABLED=true` environment variable, with an explicit
`StrandsTelemetry(enabled=False)` override for programmatic control. When
disabled, no global tracer/meter provider is registered and the `setup_*`
methods are no-ops, leaving the host application's OpenTelemetry setup
untouched. The default (enabled) path is unchanged, so this is backward
compatible.

Scoped to `telemetry/config.py` (the location @zastrowm suggested on the issue);
the separate ThreadingInstrumentor opt-in work in strands-agents#2167 is untouched.

@lizradway lizradway left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flipping this to false would be a breaking change for users that we cannot do in a minor version, but can consider for v2.

however, would be happy to keep the rest if you are able to cherry pick that out.

thanks for your contribution!

strands>=1.35.0 unconditionally calls ThreadingInstrumentor().instrument() in
Tracer.__init__, monkey-patching concurrent.futures.ThreadPoolExecutor.submit
process-wide, ignoring OTEL_PYTHON_DISABLED_INSTRUMENTATIONS, double-wrapping
when another OTel distro already instrumented, and crashing the host if
instrument() raises.

This is additive and non-breaking: threading instrumentation stays ON by
default (unchanged behavior). It adds supported opt-out/opt-in levers and
hardens the call. Resolver precedence (highest->lowest):
  1. OTEL_PYTHON_DISABLED_INSTRUMENTATIONS containing 'threading' -> disabled
     (matches OpenTelemetry auto-loader semantics).
  2. Explicit Tracer(instrument_threading=True|False) kwarg.
  3. STRANDS_INSTRUMENT_THREADING env ('1/true/yes/on' enable; '0/false/no/off'
     disable; unrecognized non-empty -> WARNING + default).
  4. Default: ENABLED.

Also: an idempotency guard (reads _is_instrumented_by_opentelemetry to avoid
stacking wrappers when the host already instrumented) and graceful-failure
handling (instrument() exceptions are caught and logged -- ERROR when
explicitly requested, WARNING when auto-enabled -- so telemetry never crashes
the host).

Tests run in isolated subprocesses (BaseInstrumentor is a process-wide
singleton): on-by-default, env opt-in/opt-out incl. 'off', kwarg opt-in/out,
disable-env precedence over kwarg, idempotency, graceful failure, ERROR-vs-
WARNING severity, and warn-on-unrecognized.

Default->off is deferred to v2 (breaking).
@jpr5 jpr5 force-pushed the fix/threading-instrumentor-opt-in branch from 365c290 to 9498dc5 Compare June 28, 2026 05:30
@github-actions github-actions Bot added size/l and removed size/m labels Jun 28, 2026
@jpr5

jpr5 commented Jun 28, 2026

Copy link
Copy Markdown
Author

Makes sense — keeping the default on so this stays additive. I left in the kwarg, the env var, and the OTEL_PYTHON_DISABLED_INSTRUMENTATIONS handling, plus a guard so we don't double-wrap when something's already instrumented threading, and a try/except so a telemetry failure can't take the host down. Just pushed that.

LMK what you think @lizradway 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-otel Open-telemetry related bug Something isn't working python Pull requests that update python code size/l

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants