Skip to content

fix(telemetry): deliver call_completed before process exit via drain() in disconnect()#176

Merged
nicolotognoni merged 1 commit into
mainfrom
fix/telemetry-drain-on-disconnect
Jun 12, 2026
Merged

fix(telemetry): deliver call_completed before process exit via drain() in disconnect()#176
nicolotognoni merged 1 commit into
mainfrom
fix/telemetry-drain-on-disconnect

Conversation

@nicolotognoni

Copy link
Copy Markdown
Collaborator

Summary

  • E2E-testing the schema-v6 telemetry with a real outbound phone call exposed a data-loss bug: call_started reached Axiom but call_completed — the one event carrying duration/cost/latency/outcome — never did, for every short-lived script ("place call, wait, exit": the main outbound use case). The connect→complete funnel read as a 100% failure rate.
  • Root cause (Python): disconnect() used a fire-and-forget flush; the process exited right after, asyncio.run() closed the loop cancelling the flush task mid-POST, and the 0.25s atexit fallback was too short for a cold TLS handshake. (This confirms the "atexit 0.25s too short" P2 from the hardening audit — now field-proven, with a worse blast radius than assumed.)
  • Fix: new TelemetryClient.drain() in both SDKs — flush buffered events and await delivery (bounded), keeping the client reusable (unlike aclose()/close(), a subsequent serve() still emits). disconnect() now awaits it. Python's atexit timeout bumped 0.25 → 1.0s as a second line of defense for scripts that never call disconnect().

Implementation

  • libraries/python/getpatter/telemetry/client.pydrain(timeout) (awaits in-flight flush task + flushes remainder, swallow-all); _ATEXIT_TIMEOUT_S 0.25 → 1.0.
  • libraries/python/getpatter/client.pydisconnect() awaits self._telemetry.drain() (was fire-and-forget flush_pending()).
  • libraries/typescript/src/telemetry/client.tsdrain() mirror (TS was structurally safer — an in-flight fetch keeps the Node loop alive — but now mirrors Python for explicitness and parity).
  • libraries/typescript/src/client.tsdisconnect() awaits this.telemetry.drain().
  • Authentic tests in both SDKs (real local HTTP collector, no mocks): drain delivers the final event AND the client stays usable afterwards.
  • Test rename: test_build_event_v5_new_events_and_dims..._v6_... (stale name after the schema bump, flagged by review).

Field evidence (real calls, Axiom dataset getpatter_usage)

  • Run 1 (pre-fix, 13:21Z): call answered, 32s, CallResult complete locally — call_started in Axiom, call_completed missing (3+ min wait).
  • Run 2 (post-fix, 13:34Z): call answered, 25s — call_completed landed <1s after disconnect with full metrics: outcome=completed, duration_seconds=25, cost_usd=0.06, turn_count_bucket=4_6.

Breaking change?

No. drain() is additive; disconnect() gains a bounded (≤2s, typically ~100ms) wait that only runs when telemetry is enabled and events are pending. Opt-out paths unchanged.

Test plan

  • Python: pytest tests/ — 2656 passed, 8 skipped, 2 xfailed
  • TypeScript: npm test (2102 passed) + npm run lint + npm run build — green
  • /parity-check — drain() same name/semantics both SDKs, disconnect() parity
  • E2E smoke — two REAL outbound calls (see field evidence above)

Docs updates

  • N/A (internal telemetry delivery mechanics; docs/telemetry.mdx event semantics unchanged)

Known follow-ups (out of scope, noted during E2E)

  • direction present on call_started but empty on call_completed (call-end payload doesn't propagate it).
  • TS SCHEMA_VERSION not exported from the package root (internal-only).

…) in disconnect()

Found live in an E2E call test: call_started reached the collector but
call_completed — the one event carrying duration/cost/latency/outcome —
never did, for every short-lived script ("place call, wait, exit", the
main outbound use case).

Root cause (Python): disconnect() used a fire-and-forget flush; the
process exited right after, asyncio.run() closed the loop cancelling the
flush task mid-POST, and the 0.25s atexit fallback was too short for a
cold TLS handshake.

- Add TelemetryClient.drain() to both SDKs: flush buffered events and
  await delivery (bounded), keeping the client reusable — unlike
  aclose()/close(), a subsequent serve() still emits.
- disconnect() now awaits drain() in both SDKs (TS was structurally
  safer — an in-flight fetch keeps the loop alive — but mirrors Python
  for explicitness and parity).
- Bump Python _ATEXIT_TIMEOUT_S 0.25 -> 1.0s as a second line of defense
  for scripts that never call disconnect().
- Authentic tests (real local HTTP collector) in both SDKs: drain
  delivers the final event AND the client stays usable afterwards.
- Rename test_build_event_v5_new_events_and_dims -> v6 (stale name after
  the schema bump).

Field-proven: re-running the same real outbound call after the fix,
call_completed landed in the collector <1s after disconnect with full
metrics (outcome=completed, duration=25s, cost=$0.06, turns=4_6).
@nicolotognoni nicolotognoni merged commit 66f7099 into main Jun 12, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant