Skip to content

feat(analytics): BI events for the hosted auth/HTTPS flow (GAP#1/#2/#3)#148

Merged
awkoy merged 10 commits into
mainfrom
awkoy/bi-events-auth-https
Jun 8, 2026
Merged

feat(analytics): BI events for the hosted auth/HTTPS flow (GAP#1/#2/#3)#148
awkoy merged 10 commits into
mainfrom
awkoy/bi-events-auth-https

Conversation

@awkoy

@awkoy awkoy commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Extends opik-mcp's BI/analytics so the hosted HTTP/OAuth deployment is no longer dark and every event carries the metadata needed for the most useful usage charts. Closes three gaps in the analytics layer relative to the recently-added OAuth resource-server + transport-security work.

GAP#1 — lifecycle events were dark in hosted mode. server_started/server_shutdown/startup_error were only emitted from __main__.main(), but the Docker/Helm entrypoint runs uvicorn opik_mcp.server:build_app --factory, which bypasses main(). build_app() now composes an analytics lifespan around FastMCP's session-manager lifespan and emits them itself — guarded by an env-var sentinel so a main()-driven HTTP boot is never double-counted (verified inherited by --reload spawn workers).

GAP#2 — per-request OAuth identity never reached BI. _build_event() runs synchronously in the request task, so the auth_context ContextVars are live: events now carry per-request auth_mode, token_sha256 (hash of the opik_at_ bearer — raw token never stored), request_workspace, and installation_type. Extraction mirrors resolve_opik_config exactly so BI agrees with the outbound credential.

GAP#3 — auth failures were invisible. New opik_mcp_auth_rejected event from a pure-ASGI AuthRejectionMiddleware (never buffers SSE) for 401/421/403 on authenticated paths, with bucketed rejection_reason, auth_mode, path_bucket, oauth_configured, resource_metadata_url_present. OAuth-proxy/discovery paths are skipped (a proxied-AS 401 isn't our rejection).

Also fixes a pre-existing privacy-contract bug: the McpHost/HostLlmFamily Literal allowlists were out of sync with the classifiers (codex/gemini-cli/openai/google were shipping undeclared).

Per-event additions

  • server_started: installation_type, oauth_configured, resource_uri_scheme, dns_rebinding_protection, allowed_hosts_is_default, auth_mode, lifecycle_source
  • server_shutdown: lifecycle_source; startup_error: oauth_configured (+ installation_type via common)
  • tool_called / session_initialized / ask_ollie_completed: auth_mode, token_sha256, request_workspace, installation_type
  • new opik_mcp_auth_rejected event

Notes

  • New boot_props module is the single source of truth shared by main() and the lifespan so the two boot paths cannot drift. The pure installation_type classifier moved to config (leaf) to break an import cycle.
  • Cross-team dependency: token_sha256 is only a useful BI join key once the backend exposes an opik_at_ → user mapping; until then dashboards should treat auth_mode='oauth' rows as identity_pending.
  • Privacy contract upheld: every new property is a bool-string / allowlisted Literal / bucket, or a SHA-256 identity hash; raw tokens, paths, and user prose never leave the process. Enforced by direct _build_event tests + the cross-event canary sweep.
  • Design/plan doc lives at docs/bi-events-https-plan.md (gitignored, local).

Test Plan

  • uv run pytest — 1061 passed, 2 skipped (clean env)
  • uv run ruff check . + ruff format --check . — clean
  • uv run mypy — clean (107 files)
  • Each change developed test-first (red → green) and code-reviewed; findings fixed
  • Staging: confirm server_started (lifecycle_source=lifespan) fires within minutes of a Docker --factory boot, and auth_rejected fires on an unauthenticated request
  • Backend: confirm the opik_at_ → user mapping is BI-queryable before relying on token_sha256

🤖 Generated with Claude Code

awkoy and others added 9 commits June 8, 2026 14:28
The privacy-allowlist contract in events.py promises the McpHost and
HostLlmFamily Literals stay in sync with the classifiers in
mcp_client_info.py, but the classifiers already emit buckets the Literals
never declared: McpHost was missing zed/vscode/goose/librechat/5ire/
opencode/codex/gemini-cli, and HostLlmFamily was missing openai (codex)
and google (gemini-cli). These values ship in BI today, violating the
contract.

Add sync tests that drive the classifier functions (covering the hardcoded
'other'/'unknown' fallbacks, not just the pattern tables) to pin the
invariant, and add the missing values. Also correct the stale module
docstring that pointed at wrappers.py instead of mcp_client_info.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Schema foundation for the new auth-rejection BI event (GAP#3). Adds the
event-name constant (exported from the analytics package surface), the
PathBucket Literal, and bucket_path() which collapses a request path to a
low-cardinality {mcp,health,well_known,other} bucket — never emitting the
raw path. bucket_path honours a configurable MCP mount path so bucketing
stays correct behind a path-prefix proxy, and uses exact-or-subpath
matching so siblings like /mcpfoo or /healthz aren't mis-bucketed.
bucket_path is exported alongside the other bucket_* helpers.

The remaining new Literals (AuthMode, InstallationType, LifecycleSource,
ResourceUriScheme, AuthRejectionReason) and the privacy-docstring
clarification are introduced in later steps alongside their consumers,
test-first.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Single source of truth for the new auth/transport properties stamped on
server_started / startup_error / auth_rejected, so __main__.main() and the
build_app() lifespan (wired in Phases 3-4) emit identical values and cannot
drift.

Provides installation_type (delegates to error_tracking, 'self-hosted'
hyphenated), resource_uri_scheme, oauth_configured(+from_env),
dns_rebinding_protection, allowed_hosts_is_default (whitespace-insensitive,
default read from the pydantic schema with an import-time type guard),
auth_mode_at_boot (settings/outbound default — read alongside oauth_configured
for hybrid deploys; per-request mode is resolved separately), and
collect_boot_props(). Also defines LIFECYCLE_SENTINEL (env-var name used by
Phases 3-4 to prevent double-emit) and the InstallationType/AuthMode/
ResourceUriScheme Literals these values are validated against.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
In hosted HTTP/OAuth mode the per-request identity lives in the auth_context
ContextVars but never reached BI — events fell back to the device install_id
with no api_key_sha256. _build_event runs synchronously in the caller task,
so those ContextVars are live at build time: derive auth_mode (always set, so
the stdio/no-auth cohort is visible not dark), token_sha256 (sha256 of the
opik_at_ bearer; raw token never stored), and request_workspace (plaintext,
matching the existing workspace posture). installation_type is added to the
common block so every event can be split cloud/self-hosted.

Token extraction mirrors opik_client.resolve_opik_config exactly (same
partition(" ") + lstrip + opik_at_ prefix check) so BI's auth_mode/token_sha256
agree with the credential actually forwarded outbound.

Merge precedence is now {per_request, properties, common}: common stays
authoritative; caller properties win over per-request so server_started's
settings-derived auth_mode (Phase 3) beats the contextvar value.

Privacy is guarded by direct _build_event tests — the recorder-based privacy
tests intercept at track_event and never exercise _build_event. Documents the
two declared privacy exceptions (identity hashes, plaintext workspace).

Cross-team dependency: token_sha256 is a useful BI join key only once the
backend exposes an opik_at_ -> user mapping; until then dashboards treat
auth_mode='oauth' rows as identity_pending.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
server_started now spreads collect_boot_props(settings) (oauth_configured,
resource_uri_scheme, dns_rebinding_protection, allowed_hosts_is_default,
auth_mode) and lifecycle_source='main'. Spreading into the caller tier means
the settings-derived auth_mode wins over _build_event's contextvar fallback,
so a pure-OAuth deployment's server_started correctly reports auth_mode='oauth'
(not 'none').

server_shutdown carries lifecycle_source; startup_error carries
oauth_configured (read from raw env on the config-fail path). installation_type
is intentionally NOT duplicated onto these events or into collect_boot_props:
_build_event's common block is its single source of truth, stamped on every
event (a duplicate would be dead — common wins the merge).

The lifecycle sentinel set/get live in boot_props (mark_lifecycle_owned_by_main
/ lifecycle_owned_by_main) so main() and the Phase 4 build_app() lifespan share
one source; main() sets it after setup_sentry, before build_app can run, so the
lifespan skips its own emit and boots are never double-counted. conftest clears
the sentinel (via the constant) between tests. Adds the LifecycleSource Literal
and a sentinel round-trip test that locks the contract Phase 4 depends on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hosted entrypoint runs 'uvicorn opik_mcp.server:build_app --factory',
which calls build_app() directly and never runs __main__.main() — so
server_started/server_shutdown were 100% dark for the entire hosted fleet,
exactly the deployment the OAuth/HTTPS work targets.

build_app() now composes an analytics lifespan AROUND FastMCP's
session-manager lifespan (captured before overwriting so the session manager
still starts). It emits server_started on enter and server_shutdown on exit
(reason=transport_error if the body raises), draining the flush off the event
loop via run_in_executor. The shutdown finally guards with except BaseException
(matching __main__) so an executor CancelledError during loop teardown can't
mask the real shutdown reason. The lifespan-seconds anchor is captured at
lifespan enter, not build_app() time, to exclude uvicorn startup latency.

When main() owns lifecycle (the sentinel is set — and verified inherited by
uvicorn --reload spawn workers via the OS environment), the lifespan runs the
inner lifespan and emits nothing, so a main()-driven HTTP boot is never
double-counted. The environment fingerprint is computed synchronously in
build_app() (it shells out on macOS) and only when this process will emit.

server_started/server_shutdown payloads come from the shared
boot_props.server_started_props / server_shutdown_props helpers, used by BOTH
main() and the lifespan so the two boot paths cannot drift. lifecycle_source
distinguishes 'main' vs 'lifespan' in BI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Auth rejections were invisible to BI: BearerAuthMiddleware returns 401
(missing/malformed bearer) and the SDK transport-security guard returns
421/403 (Host/Origin) before any tool runs, so no tool_called ever fired.
Auth-rejection rate is the single most important HTTPS-flow health signal.

Adds AuthRejectionMiddleware — pure ASGI (never buffers streaming SSE; reads
only the first response status line) wrapped outermost in build_app(). For
401/421/403 on AUTHENTICATED paths it emits opik_mcp_auth_rejected with a
bucketed rejection_reason (missing_header/not_bearer/empty_token/token_rejected/
host_rejected/origin_rejected), auth_mode, path_bucket, oauth_configured,
resource_metadata_url_present, and the cached env cohort. _UNAUTH_PATHS (health,
OAuth-proxy, discovery) are skipped — a 401 proxied from the AS during the OAuth
dance is not opik-mcp's rejection and would pollute the chart. Non-http scopes
pass through so the composed lifespan still runs.

auth_mode is derived from the request header inside the middleware, because it
runs AFTER BearerAuthMiddleware reset the inbound-auth ContextVar — so a valid
OAuth bearer rejected by the Host guard correctly reports auth_mode='oauth'
(not the settings fallback). The bearer classification is factored into the
shared auth_context.classify_bearer, used by both _build_event and the
middleware so they cannot drift. rejection_reason is derived from status +
header SHAPE only — the raw token never enters the event (canary test).
_has_absolute_resource_metadata_url fixes the always-truthy _resource_metadata_url.
build_app() now returns ASGIApp.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rt cycle

Extends the cross-event privacy sweep (test_analytics_privacy) to cover
opik_mcp_auth_rejected with a canary OAuth bearer, asserting neither the token
nor env values leak.

Moves the pure installation_type(settings) derivation (cloud/local/self-hosted
+ host sets) from error_tracking into config (a leaf module). boot_props.
installation_type previously delegated to error_tracking, which imports
analytics.identity -> analytics.__init__ -> client -> (lazy) boot_props — a
static import cycle mypy followed and reported as 'opik_mcp has no attribute
server/error_tracking'. With the classifier in config, both error_tracking
(Sentry tag) and boot_props (BI) import it cycle-free. The taxonomy test now
targets config.installation_type at its source.

Also makes the new test Settings(**base) helpers mypy-clean under the project's
no_implicit_reexport config. Whole project green: mypy (107 files), ruff check
+ format, and 1061 tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h modes)

The Docker image ran 'uvicorn opik_mcp.server:build_app --factory', bypassing
main() — so opik_mcp_startup_error (and the preflight bind check + OAuth-config
guard) never fired in hosted mode, the one event the build_app() lifespan does
not cover. Switch the entrypoint to 'python -m opik_mcp' so main() runs in
hosted HTTP exactly as in stdio: it emits server_started/server_shutdown/
startup_error, runs the guards, then serves via uvicorn (single worker).

The _OPIK_MCP_LIFECYCLE_OWNED_BY_MAIN sentinel (set before build_app()) makes
the composed lifespan defer to main(), so boots are never double-counted — the
new test asserts the sentinel is set at build_app() time and that lifecycle is
emitted with lifecycle_source='main'. main()'s HTTP run now passes
access_log=False (parity with the removed --no-access-log; keeps OAuth-flow
query strings, which can carry tokens, out of stdout). Helm inherits the image
entrypoint, so no chart change is needed.

Net: every BI event now fires identically for stdio (local) and HTTP (hosted)
installations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@awkoy

awkoy commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up (commit d6bc779): made the hosted entrypoint run main() (python -m opik_mcp instead of uvicorn … build_app --factory).

Reason: opik_mcp_startup_error (plus the preflight bind check and the OAuth-config guard) is emitted only from main(), so under the old --factory entrypoint it never fired in hosted mode — the one event the build_app() lifespan doesn't cover. Now every BI event fires identically for stdio (local) and HTTP (hosted).

  • The _OPIK_MCP_LIFECYCLE_OWNED_BY_MAIN sentinel (set before build_app()) makes the lifespan defer to main() → no double-count (new test asserts the ordering + lifecycle_source='main').
  • Parity preserved: single worker (uvicorn default), access_log=False (replaces --no-access-log; also keeps OAuth-flow query strings out of stdout), host/port/transport from ENV.
  • Helm inherits the image entrypoint — no chart change.
  • tini stays PID 1 for SIGTERM forwarding → graceful shutdown → server_shutdown(reason=clean_exit).

From a full review of the branch:
- auth_mode: the no-credential fallback in _build_event and AuthRejectionMiddleware
  was 2-way (api_key/none), missing 'oauth' for OAuth-only deploys. Extracted the
  settings-derived 3-way logic into auth_context.settings_auth_mode, now shared by
  client, middleware, and boot_props.auth_mode_at_boot so per-call and boot agree.
- installation_type: computed ONCE in AnalyticsClient.__init__ (was an uncached
  urlparse + lazy import per event on the hot path); falls back to 'unknown'
  instead of silently dropping the key.
- transport: lowercased in the common block so a mixed-case OPIK_MCP_TRANSPORT
  still emits a canonical BI value.
- Extracted server._is_unauth_path(), shared by BearerAuthMiddleware and
  AuthRejectionMiddleware, so the skip-set can't drift between them.
- Reload HTTP path now passes access_log=False too (was only on the non-reload
  path) — keeps OAuth-flow query strings out of dev logs.
- _build_fallback_analytics_client re-reads COMET_URL_OVERRIDE/OPIK_URL so a
  self-hosted install's config-fail startup_error isn't mislabelled 'cloud'.
- Lifespan flush captures the analytics client before run_in_executor (no
  in-thread singleton re-fetch); de-dup a double ws_header.strip().

Left intentionally as-is (documented): opik_client.resolve_opik_config still has
its own opik_at_ detection (core auth path — out of scope to refactor here), and
startup_error remains main()-only (the hosted entrypoint runs main(), so it is
covered; the lifespan fallback path reports server_shutdown(transport_error)).

Whole project green: ruff, ruff format, mypy (107 files), 1067 tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@awkoy awkoy merged commit 72cf05f into main Jun 8, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant