feat(analytics): BI events for the hosted auth/HTTPS flow (GAP#1/#2/#3)#148
Merged
Conversation
The privacy-allowlist contract in events.py promises the McpHost and HostLlmFamily Literals stay in sync with the classifiers in mcp_client_info.py, but the classifiers already emit buckets the Literals never declared: McpHost was missing zed/vscode/goose/librechat/5ire/ opencode/codex/gemini-cli, and HostLlmFamily was missing openai (codex) and google (gemini-cli). These values ship in BI today, violating the contract. Add sync tests that drive the classifier functions (covering the hardcoded 'other'/'unknown' fallbacks, not just the pattern tables) to pin the invariant, and add the missing values. Also correct the stale module docstring that pointed at wrappers.py instead of mcp_client_info.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Schema foundation for the new auth-rejection BI event (GAP#3). Adds the
event-name constant (exported from the analytics package surface), the
PathBucket Literal, and bucket_path() which collapses a request path to a
low-cardinality {mcp,health,well_known,other} bucket — never emitting the
raw path. bucket_path honours a configurable MCP mount path so bucketing
stays correct behind a path-prefix proxy, and uses exact-or-subpath
matching so siblings like /mcpfoo or /healthz aren't mis-bucketed.
bucket_path is exported alongside the other bucket_* helpers.
The remaining new Literals (AuthMode, InstallationType, LifecycleSource,
ResourceUriScheme, AuthRejectionReason) and the privacy-docstring
clarification are introduced in later steps alongside their consumers,
test-first.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Single source of truth for the new auth/transport properties stamped on server_started / startup_error / auth_rejected, so __main__.main() and the build_app() lifespan (wired in Phases 3-4) emit identical values and cannot drift. Provides installation_type (delegates to error_tracking, 'self-hosted' hyphenated), resource_uri_scheme, oauth_configured(+from_env), dns_rebinding_protection, allowed_hosts_is_default (whitespace-insensitive, default read from the pydantic schema with an import-time type guard), auth_mode_at_boot (settings/outbound default — read alongside oauth_configured for hybrid deploys; per-request mode is resolved separately), and collect_boot_props(). Also defines LIFECYCLE_SENTINEL (env-var name used by Phases 3-4 to prevent double-emit) and the InstallationType/AuthMode/ ResourceUriScheme Literals these values are validated against. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
In hosted HTTP/OAuth mode the per-request identity lives in the auth_context
ContextVars but never reached BI — events fell back to the device install_id
with no api_key_sha256. _build_event runs synchronously in the caller task,
so those ContextVars are live at build time: derive auth_mode (always set, so
the stdio/no-auth cohort is visible not dark), token_sha256 (sha256 of the
opik_at_ bearer; raw token never stored), and request_workspace (plaintext,
matching the existing workspace posture). installation_type is added to the
common block so every event can be split cloud/self-hosted.
Token extraction mirrors opik_client.resolve_opik_config exactly (same
partition(" ") + lstrip + opik_at_ prefix check) so BI's auth_mode/token_sha256
agree with the credential actually forwarded outbound.
Merge precedence is now {per_request, properties, common}: common stays
authoritative; caller properties win over per-request so server_started's
settings-derived auth_mode (Phase 3) beats the contextvar value.
Privacy is guarded by direct _build_event tests — the recorder-based privacy
tests intercept at track_event and never exercise _build_event. Documents the
two declared privacy exceptions (identity hashes, plaintext workspace).
Cross-team dependency: token_sha256 is a useful BI join key only once the
backend exposes an opik_at_ -> user mapping; until then dashboards treat
auth_mode='oauth' rows as identity_pending.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
server_started now spreads collect_boot_props(settings) (oauth_configured, resource_uri_scheme, dns_rebinding_protection, allowed_hosts_is_default, auth_mode) and lifecycle_source='main'. Spreading into the caller tier means the settings-derived auth_mode wins over _build_event's contextvar fallback, so a pure-OAuth deployment's server_started correctly reports auth_mode='oauth' (not 'none'). server_shutdown carries lifecycle_source; startup_error carries oauth_configured (read from raw env on the config-fail path). installation_type is intentionally NOT duplicated onto these events or into collect_boot_props: _build_event's common block is its single source of truth, stamped on every event (a duplicate would be dead — common wins the merge). The lifecycle sentinel set/get live in boot_props (mark_lifecycle_owned_by_main / lifecycle_owned_by_main) so main() and the Phase 4 build_app() lifespan share one source; main() sets it after setup_sentry, before build_app can run, so the lifespan skips its own emit and boots are never double-counted. conftest clears the sentinel (via the constant) between tests. Adds the LifecycleSource Literal and a sentinel round-trip test that locks the contract Phase 4 depends on. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hosted entrypoint runs 'uvicorn opik_mcp.server:build_app --factory', which calls build_app() directly and never runs __main__.main() — so server_started/server_shutdown were 100% dark for the entire hosted fleet, exactly the deployment the OAuth/HTTPS work targets. build_app() now composes an analytics lifespan AROUND FastMCP's session-manager lifespan (captured before overwriting so the session manager still starts). It emits server_started on enter and server_shutdown on exit (reason=transport_error if the body raises), draining the flush off the event loop via run_in_executor. The shutdown finally guards with except BaseException (matching __main__) so an executor CancelledError during loop teardown can't mask the real shutdown reason. The lifespan-seconds anchor is captured at lifespan enter, not build_app() time, to exclude uvicorn startup latency. When main() owns lifecycle (the sentinel is set — and verified inherited by uvicorn --reload spawn workers via the OS environment), the lifespan runs the inner lifespan and emits nothing, so a main()-driven HTTP boot is never double-counted. The environment fingerprint is computed synchronously in build_app() (it shells out on macOS) and only when this process will emit. server_started/server_shutdown payloads come from the shared boot_props.server_started_props / server_shutdown_props helpers, used by BOTH main() and the lifespan so the two boot paths cannot drift. lifecycle_source distinguishes 'main' vs 'lifespan' in BI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Auth rejections were invisible to BI: BearerAuthMiddleware returns 401 (missing/malformed bearer) and the SDK transport-security guard returns 421/403 (Host/Origin) before any tool runs, so no tool_called ever fired. Auth-rejection rate is the single most important HTTPS-flow health signal. Adds AuthRejectionMiddleware — pure ASGI (never buffers streaming SSE; reads only the first response status line) wrapped outermost in build_app(). For 401/421/403 on AUTHENTICATED paths it emits opik_mcp_auth_rejected with a bucketed rejection_reason (missing_header/not_bearer/empty_token/token_rejected/ host_rejected/origin_rejected), auth_mode, path_bucket, oauth_configured, resource_metadata_url_present, and the cached env cohort. _UNAUTH_PATHS (health, OAuth-proxy, discovery) are skipped — a 401 proxied from the AS during the OAuth dance is not opik-mcp's rejection and would pollute the chart. Non-http scopes pass through so the composed lifespan still runs. auth_mode is derived from the request header inside the middleware, because it runs AFTER BearerAuthMiddleware reset the inbound-auth ContextVar — so a valid OAuth bearer rejected by the Host guard correctly reports auth_mode='oauth' (not the settings fallback). The bearer classification is factored into the shared auth_context.classify_bearer, used by both _build_event and the middleware so they cannot drift. rejection_reason is derived from status + header SHAPE only — the raw token never enters the event (canary test). _has_absolute_resource_metadata_url fixes the always-truthy _resource_metadata_url. build_app() now returns ASGIApp. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rt cycle Extends the cross-event privacy sweep (test_analytics_privacy) to cover opik_mcp_auth_rejected with a canary OAuth bearer, asserting neither the token nor env values leak. Moves the pure installation_type(settings) derivation (cloud/local/self-hosted + host sets) from error_tracking into config (a leaf module). boot_props. installation_type previously delegated to error_tracking, which imports analytics.identity -> analytics.__init__ -> client -> (lazy) boot_props — a static import cycle mypy followed and reported as 'opik_mcp has no attribute server/error_tracking'. With the classifier in config, both error_tracking (Sentry tag) and boot_props (BI) import it cycle-free. The taxonomy test now targets config.installation_type at its source. Also makes the new test Settings(**base) helpers mypy-clean under the project's no_implicit_reexport config. Whole project green: mypy (107 files), ruff check + format, and 1061 tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h modes) The Docker image ran 'uvicorn opik_mcp.server:build_app --factory', bypassing main() — so opik_mcp_startup_error (and the preflight bind check + OAuth-config guard) never fired in hosted mode, the one event the build_app() lifespan does not cover. Switch the entrypoint to 'python -m opik_mcp' so main() runs in hosted HTTP exactly as in stdio: it emits server_started/server_shutdown/ startup_error, runs the guards, then serves via uvicorn (single worker). The _OPIK_MCP_LIFECYCLE_OWNED_BY_MAIN sentinel (set before build_app()) makes the composed lifespan defer to main(), so boots are never double-counted — the new test asserts the sentinel is set at build_app() time and that lifecycle is emitted with lifecycle_source='main'. main()'s HTTP run now passes access_log=False (parity with the removed --no-access-log; keeps OAuth-flow query strings, which can carry tokens, out of stdout). Helm inherits the image entrypoint, so no chart change is needed. Net: every BI event now fires identically for stdio (local) and HTTP (hosted) installations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Author
|
Follow-up (commit Reason:
|
From a full review of the branch: - auth_mode: the no-credential fallback in _build_event and AuthRejectionMiddleware was 2-way (api_key/none), missing 'oauth' for OAuth-only deploys. Extracted the settings-derived 3-way logic into auth_context.settings_auth_mode, now shared by client, middleware, and boot_props.auth_mode_at_boot so per-call and boot agree. - installation_type: computed ONCE in AnalyticsClient.__init__ (was an uncached urlparse + lazy import per event on the hot path); falls back to 'unknown' instead of silently dropping the key. - transport: lowercased in the common block so a mixed-case OPIK_MCP_TRANSPORT still emits a canonical BI value. - Extracted server._is_unauth_path(), shared by BearerAuthMiddleware and AuthRejectionMiddleware, so the skip-set can't drift between them. - Reload HTTP path now passes access_log=False too (was only on the non-reload path) — keeps OAuth-flow query strings out of dev logs. - _build_fallback_analytics_client re-reads COMET_URL_OVERRIDE/OPIK_URL so a self-hosted install's config-fail startup_error isn't mislabelled 'cloud'. - Lifespan flush captures the analytics client before run_in_executor (no in-thread singleton re-fetch); de-dup a double ws_header.strip(). Left intentionally as-is (documented): opik_client.resolve_opik_config still has its own opik_at_ detection (core auth path — out of scope to refactor here), and startup_error remains main()-only (the hosted entrypoint runs main(), so it is covered; the lifespan fallback path reports server_shutdown(transport_error)). Whole project green: ruff, ruff format, mypy (107 files), 1067 tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extends opik-mcp's BI/analytics so the hosted HTTP/OAuth deployment is no longer dark and every event carries the metadata needed for the most useful usage charts. Closes three gaps in the analytics layer relative to the recently-added OAuth resource-server + transport-security work.
GAP#1 — lifecycle events were dark in hosted mode.
server_started/server_shutdown/startup_errorwere only emitted from__main__.main(), but the Docker/Helm entrypoint runsuvicorn opik_mcp.server:build_app --factory, which bypassesmain().build_app()now composes an analytics lifespan around FastMCP's session-manager lifespan and emits them itself — guarded by an env-var sentinel so amain()-driven HTTP boot is never double-counted (verified inherited by--reloadspawn workers).GAP#2 — per-request OAuth identity never reached BI.
_build_event()runs synchronously in the request task, so theauth_contextContextVars are live: events now carry per-requestauth_mode,token_sha256(hash of theopik_at_bearer — raw token never stored),request_workspace, andinstallation_type. Extraction mirrorsresolve_opik_configexactly so BI agrees with the outbound credential.GAP#3 — auth failures were invisible. New
opik_mcp_auth_rejectedevent from a pure-ASGIAuthRejectionMiddleware(never buffers SSE) for 401/421/403 on authenticated paths, with bucketedrejection_reason,auth_mode,path_bucket,oauth_configured,resource_metadata_url_present. OAuth-proxy/discovery paths are skipped (a proxied-AS 401 isn't our rejection).Also fixes a pre-existing privacy-contract bug: the
McpHost/HostLlmFamilyLiteralallowlists were out of sync with the classifiers (codex/gemini-cli/openai/googlewere shipping undeclared).Per-event additions
server_started:installation_type,oauth_configured,resource_uri_scheme,dns_rebinding_protection,allowed_hosts_is_default,auth_mode,lifecycle_sourceserver_shutdown:lifecycle_source;startup_error:oauth_configured(+installation_typevia common)tool_called/session_initialized/ask_ollie_completed:auth_mode,token_sha256,request_workspace,installation_typeopik_mcp_auth_rejectedeventNotes
boot_propsmodule is the single source of truth shared bymain()and the lifespan so the two boot paths cannot drift. The pureinstallation_typeclassifier moved toconfig(leaf) to break an import cycle.token_sha256is only a useful BI join key once the backend exposes anopik_at_ → usermapping; until then dashboards should treatauth_mode='oauth'rows asidentity_pending.Literal/ bucket, or a SHA-256 identity hash; raw tokens, paths, and user prose never leave the process. Enforced by direct_build_eventtests + the cross-event canary sweep.docs/bi-events-https-plan.md(gitignored, local).Test Plan
uv run pytest— 1061 passed, 2 skipped (clean env)uv run ruff check .+ruff format --check .— cleanuv run mypy— clean (107 files)server_started(lifecycle_source=lifespan) fires within minutes of a Docker--factoryboot, andauth_rejectedfires on an unauthenticated requestopik_at_ → usermapping is BI-queryable before relying ontoken_sha256🤖 Generated with Claude Code