feat: add OpenTelemetry tracing for tool calls#21
Conversation
ec68142 to
bec06a0
Compare
|
Hey @priyank766 , this looks great 🚀 |
| "sphinx-design>=0.5", | ||
| ] | ||
| otel = [ | ||
| "opentelemetry-exporter-otlp>=1.25.0", |
There was a problem hiding this comment.
| "opentelemetry-exporter-otlp>=1.25.0", | |
| "opentelemetry-exporter-otlp-proto-http>=1.25.0", |
this avoids installation of unnecessary GRPC subpackage
There was a problem hiding this comment.
Thanks for the Suggestion
I will change the package as well
| exc_info=True, | ||
| ) | ||
| raise | ||
| with tracer.start_as_current_span("tool_call") as span: |
There was a problem hiding this comment.
I tried it just now and it seems span name "tool_call" makes it hard to filter by tool in tracing UIs, Maybe we can consider naming it after the tool: f"tool:{tool_name}" or even just tool_name. The attribute tool.name is still good to keep for structured querying, but the name gives context at a glance.
wdyt?
There was a problem hiding this comment.
Thanks for Reviewing @abhijeet-dhumal
I will push both changes and then you can review it later whenever you get time
Yes I think this naming scheme is much better I will change it .. 👍🏻
There was a problem hiding this comment.
Pull request overview
Adds optional OpenTelemetry tracing for Kubeflow MCP tool calls, wiring tracing through configuration, CLI, runtime instrumentation, tests, and docs.
Changes:
- Adds telemetry setup/no-op helpers and an
oteloptional dependency extra. - Instruments
_audit_wrapspans with tool/persona/correlation/success/duration attributes. - Wires
--otel-endpoint, config/env loading, startup logging, tests, and README documentation.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
kubeflow_mcp/core/telemetry.py |
Adds OpenTelemetry setup, validation, provider reuse, and no-op fallback helpers. |
kubeflow_mcp/core/server.py |
Adds span creation and attributes around audited tool calls. |
kubeflow_mcp/cli.py |
Adds --otel-endpoint and invokes tracing setup during server startup. |
kubeflow_mcp/core/config.py |
Adds observability.otel_endpoint config/env support. |
kubeflow_mcp/core/logging.py |
Includes tracing_enabled in structured startup logs. |
tests/unit/core/test_telemetry.py |
Adds telemetry helper and span attribute tests. |
kubeflow_mcp/cli_test.py |
Adds CLI tracing setup wiring tests. |
README.md |
Documents optional tracing setup and span attributes. |
pyproject.toml |
Adds the otel optional dependency extra. |
uv.lock |
Locks OpenTelemetry optional dependencies and related resolution changes. |
Comments suppressed due to low confidence (1)
kubeflow_mcp/core/server.py:126
- The circuit-open early-return tracing path is not covered by the new span behavior tests. Add coverage for a breaker with
can_execute() == Falseso thetool.success=falseand duration attributes remain verified for this failure mode.
if not breaker.can_execute():
duration_ms = int((time.monotonic() - start) * 1000)
span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
logger.warning("circuit_open", extra={"tool": tool_name})
return {
"error": f"Circuit breaker open for '{tool_name}' — K8s API may be degraded. Retries automatically after recovery timeout.",
"error_code": ErrorCode.CIRCUIT_OPEN,
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| with tracer.start_as_current_span(f"tool:{tool_name}") as span: | ||
| span.set_attribute("tool.name", tool_name) | ||
| span.set_attribute("kubeflow.persona", persona) | ||
| span.set_attribute("correlation_id", cid) |
| breaker.record_failure() | ||
| span.set_attribute("tool.success", False) | ||
| span.set_attribute("tool.duration_ms", duration_ms) | ||
| span.record_exception(exc) |
There was a problem hiding this comment.
@priyank766 can we add span.set_status(StatusCode.ERROR) on exception here ?
span.set_status(Status(StatusCode.ERROR, str(exc)))
There was a problem hiding this comment.
this will make sure to spot failures easily in the trace list, wdyt?
| observability_file = file_config.get("observability", {}) | ||
| observability = ObservabilityConfig( | ||
| otel_endpoint=os.getenv( | ||
| "KUBEFLOW_MCP_OTEL_ENDPOINT", | ||
| observability_file.get("otel_endpoint"), | ||
| ) | ||
| ) |
| if _rate_limiter is not None and not _rate_limiter.acquire(): | ||
| duration_ms = int((time.monotonic() - start) * 1000) | ||
| span.set_attribute("tool.success", False) | ||
| span.set_attribute("tool.duration_ms", duration_ms) | ||
| logger.warning("rate_limited", extra={"tool": tool_name}) | ||
| return { | ||
| "error": "Rate limit exceeded. Retry after a brief pause.", | ||
| "error_code": ErrorCode.RATE_LIMITED, | ||
| } |
| - Install optional dependencies: `pip install ".[otel]"` | ||
| - Enable tracing with CLI flag or env var: | ||
|
|
||
| ```bash | ||
| kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces | ||
| # or | ||
| export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces | ||
| kubeflow-mcp serve |
|
|
||
| cid = with_correlation_id() | ||
| masked = mask_sensitive_data(kwargs) if kwargs else {} | ||
| tracer = get_tracer("kubeflow_mcp.tools") |
There was a problem hiding this comment.
@priyank766 can you move this outside wrapper closure?
get_tracer is cheap but calling it per-invocation seems semantically wrong, wdyt?
| exc_info=True, | ||
| ) | ||
| raise | ||
| with tracer.start_as_current_span(f"tool:{tool_name}") as span: |
There was a problem hiding this comment.
@priyank766 thinking can we use SpanKind.CLIENT for tool spans here?
Tool calls invoke an external service i.e. K8s API. SpanKind.CLIENT is semantically correct and enables better Jaeger dependency graph rendering..
| return None | ||
|
|
||
|
|
||
| class _NoopTracer: |
There was a problem hiding this comment.
_NoopTracer should accept **kwargs here
Future callers passing kind=SpanKind.CLIENT will otherwise get a TypeError in no-op mode.
|
Thanks for the review @abhijeet-dhumal All suggestions have been addressed. Please take another look when you get a chance. |
86d30a5 to
6428ffe
Compare
| ```bash | ||
| kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces | ||
| # or | ||
| export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces |
There was a problem hiding this comment.
| export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces | |
| export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/traces |
| observability_file = file_config.get("observability", {}) | ||
| observability = ObservabilityConfig( | ||
| otel_endpoint=os.getenv( | ||
| "KUBEFLOW_MCP_OTEL_ENDPOINT", |
There was a problem hiding this comment.
| "KUBEFLOW_MCP_OTEL_ENDPOINT", | |
| "OTEL_EXPORTER_OTLP_ENDPOINT", |
| "--otel-endpoint", | ||
| default=None, | ||
| help="OpenTelemetry OTLP HTTP endpoint for tracing. " | ||
| "Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.", |
There was a problem hiding this comment.
| "Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.", | |
| "Falls back to OTEL_EXPORTER_OTLP_ENDPOINT env var, config file.", |
| with tracer.start_as_current_span( | ||
| f"tool:{tool_name}", **span_kwargs | ||
| ) as span: | ||
| span.set_attribute("tool.name", tool_name) |
There was a problem hiding this comment.
Minor nit: would be great to also surface user.id and mcp.session_id on these spans for per-session/per-user filtering in Jaeger. That will need identity propagation middleware (ContextVars populated from the MCP request context) as a prerequisite..
There was a problem hiding this comment.
happy to follow up with a separate PR for that once this lands. Marking as a non-blocking suggestion.
There was a problem hiding this comment.
Makes sense, having user.id and session_id on the spans would definitely help with debugging. Happy to take a crack at it once this lands, just let me know.
|
Hey @priyank766 , I tried this out locally against OTel collector stack, works cleanly end-to-end. putting few nits otherwise lgtm.. |
| provider.add_span_processor(processor) | ||
| _otel_trace.set_tracer_provider(provider) | ||
|
|
||
| _tracing_initialized = True |
There was a problem hiding this comment.
Missing atexit.register(provider.shutdown) here. Without it, in-flight spans in the BatchSpanProcessor queue get silently dropped on server shutdown, and the background thread can deadlock if the collector is unreachable. One line:
import atexit
atexit.register(provider.shutdown)
| with tracer.start_as_current_span( | ||
| f"tool:{tool_name}", **span_kwargs | ||
| ) as span: | ||
| span.set_attribute("tool.name", tool_name) |
There was a problem hiding this comment.
Can we also add tool.args_preview here? Without it you can't reconstruct what parameters caused a failure from Jaeger alone.. you'd have to cross-reference the audit log. Something like:
import json
span.set_attribute("tool.args_preview", json.dumps(mask_sensitive_data(kwargs), default=str)[:300])
| span.set_attribute("tool.success", False) | ||
| span.set_attribute("tool.duration_ms", duration_ms) | ||
| logger.warning("circuit_open", extra={"tool": tool_name}) | ||
| return { |
There was a problem hiding this comment.
This path sets tool.success=False on the span but there's no test covering it.
Worth adding a test with _FakeBreaker(can_execute=False) asserting the attributes are set before the early return.
| - Enable tracing with CLI flag or env var: | ||
|
|
||
| ```bash | ||
| kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces |
There was a problem hiding this comment.
Heads up: setup_tracing() auto-appends /v1/traces to whatever is passed, so the example in the README produces http://localhost:4318/v1/traces/v1/traces — double path. Either change the README example to the base URL (http://localhost:4318) or stop auto-appending in code and accept the full path as-is. The latter is less surprising.
|
|
||
| Each tool invocation emits a span with attributes: | ||
| `tool.name`, `tool.success`, `tool.duration_ms`, `kubeflow.persona`, and `correlation_id`. | ||
|
|
There was a problem hiding this comment.
Worth a one-liner noting that kubeflow-mcp agent --otel-endpoint ... emits a separate kubeflow-mcp-agent service in Jaeger. Without this, users running the agent will wonder why they only see server-side spans and think tracing is broken.
|
I've addressed all the nits in the latest commit. Let me know if everything looks good!! |
|
|
||
| @functools.wraps(tool_func) | ||
| def wrapper(**kwargs): | ||
| def wrapper(ctx: Context | None = _MCP_CTX_DEFAULT, **kwargs): |
There was a problem hiding this comment.
CurrentContext() DI on sync wrapper may not inject reliably.. prefer AuditIdentityMiddleware + ContextVars for user.id / mcp.session.id fallback
| pass | ||
|
|
||
| # Custom Kubeflow enrichment | ||
| span.set_attribute("kubeflow.persona", persona) |
There was a problem hiding this comment.
Add user.id from middleware identity bridge
There was a problem hiding this comment.
we can add Add middleware (core/middleware.py - AuditIdentityMiddleware) to bridge FastMCP async context to sync _audit_wrap
|
@priyank766 Maybe you also need to re-sign commits to fix DCO pr check |
6957f3a to
947ced4
Compare
Signed-off-by: priyank <priyank8445@gmail.com>
Signed-off-by: priyank <priyank8445@gmail.com>
…hoist tracer, noop kwargs - Use SpanKind.CLIENT for tool spans (K8s API = external service) - Replace manual record_exception() with set_status(Status(StatusCode.ERROR)) to avoid duplicate events and improve Jaeger trace visibility - Move get_tracer() to _audit_wrap scope (once per tool, not per call) - Add **kwargs and set_status() to _NoopTracer/_NoopSpan for API compatibility Signed-off-by: priyank <priyank8445@gmail.com>
Replace custom KUBEFLOW_MCP_OTEL_ENDPOINT with the official OpenTelemetry env var across README, CLI help text, and config loader. Signed-off-by: priyank <priyank8445@gmail.com>
…zation, circuit breaker test - Register atexit.shutdown on TracerProvider to flush in-flight spans - Add tool.args_preview span attribute with masked kwargs (truncated to 300 chars) - Auto-append /v1/traces for base URLs to match OTel SDK convention - Add test for circuit breaker open path (tool.success=False) - Document kubeflow-mcp-agent as separate Jaeger service Signed-off-by: priyank <priyank8445@gmail.com>
…tions Signed-off-by: priyank <priyank8445@gmail.com>
947ced4 to
04d258f
Compare
|
/ok-to-test |
|
Thanks @priyank766 for the quick turnaround. It looks great ! |
04d258f to
ab8b1c0
Compare
…uts, env var fallback, service rename Signed-off-by: priyank <priyank8445@gmail.com>
ab8b1c0 to
d70b7a4
Compare
|
This is amazing work 🚀 |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhijeet-dhumal The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
New changes are detected. LGTM label has been removed. |
55e5136 to
d70b7a4
Compare
Description
Adds optional OpenTelemetry tracing for tool calls in Kubeflow MCP Server.
core/telemetry.pywithsetup_tracing()andget_tracer()plus safe no-op fallback when OTel deps are unavailable.core.server._audit_wrapto create one span per tool invocation, set tool/persona/duration/success attributes, attachcorrelation_id, and recordexceptions.
--otel-endpointKUBEFLOW_MCP_OTEL_ENDPOINTobservability.otel_endpointImportant compatibility note:
correlation_idsemantics are preserved and exposed as a span attribute; it is not remapped to OTel trace ID.Type of Change
Checklist
make test-python)make verify)Related Issues
Fixes #18