fix(telemetry): deliver call_completed before process exit via drain() in disconnect() #176
Merged
…) in disconnect()
Found live in an E2E call test: call_started reached the collector but
call_completed — the one event carrying duration/cost/latency/outcome —
never did, for every short-lived script ("place call, wait, exit", the
main outbound use case).
Root cause (Python): disconnect() used a fire-and-forget flush; the
process exited right after, asyncio.run() closed the loop cancelling the
flush task mid-POST, and the 0.25s atexit fallback was too short for a
cold TLS handshake.
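
The failure mode can be reproduced with the standard library alone. The sketch below is illustrative, not the SDK's code: `fake_post` is a hypothetical stand-in for the telemetry POST and the `delivered` list stands in for the collector. It shows that `asyncio.run()` cancels a task that is still pending when the main coroutine returns, while a bounded await lets it finish.

```python
import asyncio

delivered = []  # stands in for the collector


async def fake_post(event: str) -> None:
    # Hypothetical stand-in for the HTTP POST; the sleep models network latency.
    await asyncio.sleep(0.05)
    delivered.append(event)


async def fire_and_forget_script() -> None:
    # Schedule the flush and return immediately, as a fire-and-forget flush does.
    asyncio.create_task(fake_post("call_completed"))


async def awaiting_script() -> None:
    # Await the flush under a bound before returning.
    await asyncio.wait_for(fake_post("call_completed"), timeout=2.0)


asyncio.run(fire_and_forget_script())
lost = list(delivered)  # the pending task was cancelled at loop shutdown

asyncio.run(awaiting_script())
kept = list(delivered)  # the awaited send completed before exit

print("fire-and-forget delivered:", lost)
print("awaited delivered:", kept)
```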
- Add TelemetryClient.drain() to both SDKs: flush buffered events and
await delivery (bounded), keeping the client reusable — unlike
aclose()/close(), a subsequent serve() still emits.
- disconnect() now awaits drain() in both SDKs (TS was structurally
safer — an in-flight fetch keeps the loop alive — but mirrors Python
for explicitness and parity).
- Bump Python _ATEXIT_TIMEOUT_S 0.25 -> 1.0s as a second line of defense
for scripts that never call disconnect().
- Authentic tests (real local HTTP collector) in both SDKs: drain
delivers the final event AND the client stays usable afterwards.
- Rename test_build_event_v5_new_events_and_dims -> v6 (stale name after
the schema bump).
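
A minimal sketch of the drain() idea described above, under stated assumptions: `SketchTelemetryClient`, `emit`, the `send` hook, and `_drain_once` are hypothetical names for illustration, not the SDK's actual internals. Only `drain()`, `flush_pending()`, the bounded wait, and the swallow-all behaviour come from this PR's description.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Sender = Callable[[List[dict]], Awaitable[None]]


class SketchTelemetryClient:
    """Illustrative only; not the SDK's TelemetryClient."""

    def __init__(self, send: Sender) -> None:
        self._send = send  # hypothetical transport hook
        self._buffer: List[dict] = []
        self._inflight: Optional[asyncio.Task] = None

    def emit(self, event: dict) -> None:
        self._buffer.append(event)

    def flush_pending(self) -> None:
        # Fire-and-forget: start a send and return without waiting for it.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._inflight = asyncio.create_task(self._send(batch))

    async def drain(self, timeout: float = 2.0) -> None:
        # Bounded wait; errors and timeouts are swallowed so telemetry
        # can never break the caller. The client stays usable afterwards.
        try:
            await asyncio.wait_for(self._drain_once(), timeout)
        except Exception:
            pass

    async def _drain_once(self) -> None:
        # First wait for any in-flight flush, then send what is still buffered.
        if self._inflight is not None:
            task, self._inflight = self._inflight, None
            await task
        if self._buffer:
            batch, self._buffer = self._buffer, []
            await self._send(batch)


async def demo() -> list:
    received = []

    async def send(batch: List[dict]) -> None:
        await asyncio.sleep(0.01)  # models network latency
        received.extend(e["type"] for e in batch)

    client = SketchTelemetryClient(send)
    client.emit({"type": "call_started"})
    client.flush_pending()
    client.emit({"type": "call_completed"})
    await client.drain()
    client.emit({"type": "call_started"})  # still usable after drain()
    await client.drain()
    return received


result = asyncio.run(demo())
print(result)
```

Unlike a close-style method, nothing here tears down the transport, which is why a later emit followed by another drain() still delivers.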
Field-proven: re-running the same real outbound call after the fix,
call_completed landed in the collector <1s after disconnect with full
metrics (outcome=completed, duration=25s, cost=$0.06, turns=4_6).
This was referenced Jun 12, 2026
Summary
- call_started reached Axiom but call_completed (the one event carrying duration/cost/latency/outcome) never did, for every short-lived script ("place call, wait, exit": the main outbound use case). The connect→complete funnel read as a 100% failure rate.
- disconnect() used a fire-and-forget flush; the process exited right after, asyncio.run() closed the loop, cancelling the flush task mid-POST, and the 0.25s atexit fallback was too short for a cold TLS handshake. (This confirms the "atexit 0.25s too short" P2 from the hardening audit: now field-proven, with a worse blast radius than assumed.)
- TelemetryClient.drain() in both SDKs: flush buffered events and await delivery (bounded), keeping the client reusable (unlike aclose()/close(), a subsequent serve() still emits). disconnect() now awaits it. Python's atexit timeout bumped 0.25 → 1.0s as a second line of defense for scripts that never call disconnect().

Implementation
- libraries/python/getpatter/telemetry/client.py: drain(timeout) (awaits in-flight flush task + flushes remainder, swallow-all); _ATEXIT_TIMEOUT_S 0.25 → 1.0.
- libraries/python/getpatter/client.py: disconnect() awaits self._telemetry.drain() (was fire-and-forget flush_pending()).
- libraries/typescript/src/telemetry/client.ts: drain() mirror (TS was structurally safer, since an in-flight fetch keeps the Node loop alive, but now mirrors Python for explicitness and parity).
- libraries/typescript/src/client.ts: disconnect() awaits this.telemetry.drain().
- test_build_event_v5_new_events_and_dims → ..._v6_... (stale name after the schema bump, flagged by review).

Field evidence (real calls, Axiom dataset getpatter_usage)
- Before the fix: call_started in Axiom, call_completed missing (3+ min wait).
- After the fix: call_completed landed <1s after disconnect with full metrics: outcome=completed, duration_seconds=25, cost_usd=0.06, turn_count_bucket=4_6.

Breaking change?
No. drain() is additive; disconnect() gains a bounded (≤2s, typically ~100ms) wait that only runs when telemetry is enabled and events are pending. Opt-out paths unchanged.

Test plan
- pytest tests/: 2656 passed, 8 skipped, 2 xfailed
- npm test (2102 passed) + npm run lint + npm run build: green
- drain() same name/semantics in both SDKs, disconnect() parity

Docs updates
- docs/telemetry.mdx event semantics unchanged

Known follow-ups (out of scope, noted during E2E)
- direction present on call_started but empty on call_completed (call-end payload doesn't propagate it).
- SCHEMA_VERSION not exported from the package root (internal-only).