feat: add OpenTelemetry tracing for tool calls by priyank766 · Pull Request #21 · kubeflow/mcp-server

priyank766 · 2026-05-15T16:50:36Z

Description

Adds optional OpenTelemetry tracing for tool calls in Kubeflow MCP Server.

- Adds core/telemetry.py with setup_tracing() and get_tracer() plus safe no-op fallback when OTel deps are unavailable.
- Instruments core.server._audit_wrap to create one span per tool invocation, set tool/persona/duration/success attributes, attach correlation_id, and record exceptions.
- Wires tracing config through CLI/env/config:
  - --otel-endpoint
  - KUBEFLOW_MCP_OTEL_ENDPOINT
  - observability.otel_endpoint
- Adds unit tests for no-op path, provider setup/reuse, endpoint validation, and span behavior.
- Updates README with an Observability section.

Important compatibility note: correlation_id semantics are preserved and exposed as a span attribute; it is not remapped to OTel trace ID.

Type of Change

feat: New feature

Checklist

Tests pass locally (make test-python)
Linting passes (make verify)
Documentation updated (if applicable)
Commit messages follow conventional format

Related Issues

Fixes #18

abhijeet-dhumal · 2026-05-18T15:15:27Z

Hey @priyank766, this looks great 🚀
Thanks for working on this!

abhijeet-dhumal · 2026-05-18T15:18:20Z

    "sphinx-design>=0.5",
 ]
+otel = [
+    "opentelemetry-exporter-otlp>=1.25.0",


Suggested change

"opentelemetry-exporter-otlp>=1.25.0",

"opentelemetry-exporter-otlp-proto-http>=1.25.0",

This avoids installing the unnecessary gRPC subpackage.

Thanks for the suggestion.
I will change the package as well.

abhijeet-dhumal · 2026-05-18T15:20:49Z

-                exc_info=True,
-            )
-            raise
+        with tracer.start_as_current_span("tool_call") as span:


I tried it just now, and the span name "tool_call" makes it hard to filter by tool in tracing UIs. Maybe we can name it after the tool instead: f"tool:{tool_name}", or even just tool_name. The tool.name attribute is still worth keeping for structured querying, but the span name gives context at a glance.
wdyt?

Thanks for reviewing, @abhijeet-dhumal.
I will push both changes, and you can review them whenever you get time.
Yes, I think this naming scheme is much better. I will change it. 👍🏻

Copilot

Pull request overview

Adds optional OpenTelemetry tracing for Kubeflow MCP tool calls, wiring tracing through configuration, CLI, runtime instrumentation, tests, and docs.

Changes:

Adds telemetry setup/no-op helpers and an otel optional dependency extra.
Instruments _audit_wrap spans with tool/persona/correlation/success/duration attributes.
Wires --otel-endpoint, config/env loading, startup logging, tests, and README documentation.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `kubeflow_mcp/core/telemetry.py` | Adds OpenTelemetry setup, validation, provider reuse, and no-op fallback helpers. |
| `kubeflow_mcp/core/server.py` | Adds span creation and attributes around audited tool calls. |
| `kubeflow_mcp/cli.py` | Adds `--otel-endpoint` and invokes tracing setup during server startup. |
| `kubeflow_mcp/core/config.py` | Adds `observability.otel_endpoint` config/env support. |
| `kubeflow_mcp/core/logging.py` | Includes `tracing_enabled` in structured startup logs. |
| `tests/unit/core/test_telemetry.py` | Adds telemetry helper and span attribute tests. |
| `kubeflow_mcp/cli_test.py` | Adds CLI tracing setup wiring tests. |
| `README.md` | Documents optional tracing setup and span attributes. |
| `pyproject.toml` | Adds the `otel` optional dependency extra. |
| `uv.lock` | Locks OpenTelemetry optional dependencies and related resolution changes. |

Comments suppressed due to low confidence (1)

kubeflow_mcp/core/server.py:126

The circuit-open early-return tracing path is not covered by the new span behavior tests. Add coverage for a breaker with can_execute() == False so the tool.success=false and duration attributes remain verified for this failure mode.

            if not breaker.can_execute():
                duration_ms = int((time.monotonic() - start) * 1000)
                span.set_attribute("tool.success", False)
                span.set_attribute("tool.duration_ms", duration_ms)
                logger.warning("circuit_open", extra={"tool": tool_name})
                return {
                    "error": f"Circuit breaker open for '{tool_name}' — K8s API may be degraded. Retries automatically after recovery timeout.",
                    "error_code": ErrorCode.CIRCUIT_OPEN,
                }

+        with tracer.start_as_current_span(f"tool:{tool_name}") as span:
+            span.set_attribute("tool.name", tool_name)
+            span.set_attribute("kubeflow.persona", persona)
+            span.set_attribute("correlation_id", cid)


abhijeet-dhumal · 2026-05-23T10:28:01Z

+                breaker.record_failure()
+                span.set_attribute("tool.success", False)
+                span.set_attribute("tool.duration_ms", duration_ms)
+                span.record_exception(exc)


@priyank766 can we also set an error status on the span when an exception is raised here?
span.set_status(Status(StatusCode.ERROR, str(exc)))

This makes failures easy to spot in the trace list, wdyt?

+    observability_file = file_config.get("observability", {})
+    observability = ObservabilityConfig(
+        otel_endpoint=os.getenv(
+            "KUBEFLOW_MCP_OTEL_ENDPOINT",
+            observability_file.get("otel_endpoint"),
+        )
+    )
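
The lookup order described in this PR (CLI flag, then environment variable, then config file) can be sketched as a single function. This is a minimal illustration, not the PR's code: the name `resolve_otel_endpoint` and its parameters are hypothetical, and it treats an empty string as "not set", which `os.getenv` with a default does not do. The `env_var` default matches the snippet above; later in this thread the variable is renamed to `OTEL_EXPORTER_OTLP_ENDPOINT`.

```python
import os


def resolve_otel_endpoint(cli_value, file_config, environ=None,
                          env_var="KUBEFLOW_MCP_OTEL_ENDPOINT"):
    """Return the first endpoint found: CLI flag, then env var, then config file."""
    environ = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    env_value = environ.get(env_var)
    if env_value:
        return env_value
    # A missing section or key yields None; the caller can treat that as "tracing off".
    return file_config.get("observability", {}).get("otel_endpoint")
```

Passing `environ` explicitly keeps the function easy to test without patching the process environment.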


+            if _rate_limiter is not None and not _rate_limiter.acquire():
+                duration_ms = int((time.monotonic() - start) * 1000)
+                span.set_attribute("tool.success", False)
+                span.set_attribute("tool.duration_ms", duration_ms)
+                logger.warning("rate_limited", extra={"tool": tool_name})
+                return {
+                    "error": "Rate limit exceeded. Retry after a brief pause.",
+                    "error_code": ErrorCode.RATE_LIMITED,
+                }


+- Install optional dependencies: `pip install ".[otel]"`
+- Enable tracing with CLI flag or env var:
+
+```bash
+kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces
+# or
+export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces
+kubeflow-mcp serve


abhijeet-dhumal · 2026-05-23T10:32:36Z

-
        cid = with_correlation_id()
-        masked = mask_sensitive_data(kwargs) if kwargs else {}
+        tracer = get_tracer("kubeflow_mcp.tools")


@priyank766 can you move this outside the wrapper closure?
get_tracer is cheap, but calling it on every invocation seems semantically wrong, wdyt?

abhijeet-dhumal · 2026-05-23T10:38:05Z

-                exc_info=True,
-            )
-            raise
+        with tracer.start_as_current_span(f"tool:{tool_name}") as span:


@priyank766 could we use SpanKind.CLIENT for tool spans here?
Tool calls invoke an external service, i.e. the K8s API. SpanKind.CLIENT is semantically correct and enables better dependency graph rendering in Jaeger.

abhijeet-dhumal · 2026-05-23T10:39:48Z

+        return None
+
+
+class _NoopTracer:


_NoopTracer should accept **kwargs here.
Otherwise, future callers passing kind=SpanKind.CLIENT will get a TypeError in no-op mode.

priyank766 · 2026-05-23T11:18:26Z

Thanks for the review @abhijeet-dhumal All suggestions have been addressed. Please take another look when you get a chance.

abhijeet-dhumal · 2026-05-25T07:47:24Z

+```bash
+kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces
+# or
+export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces


Suggested change

export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/traces

abhijeet-dhumal · 2026-05-25T07:47:38Z

+    observability_file = file_config.get("observability", {})
+    observability = ObservabilityConfig(
+        otel_endpoint=os.getenv(
+            "KUBEFLOW_MCP_OTEL_ENDPOINT",


Suggested change

"KUBEFLOW_MCP_OTEL_ENDPOINT",

"OTEL_EXPORTER_OTLP_ENDPOINT",

abhijeet-dhumal · 2026-05-25T07:48:07Z

+    "--otel-endpoint",
+    default=None,
+    help="OpenTelemetry OTLP HTTP endpoint for tracing. "
+    "Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.",


Suggested change

"Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.",

"Falls back to OTEL_EXPORTER_OTLP_ENDPOINT env var, config file.",

abhijeet-dhumal · 2026-05-25T07:53:34Z

+        with tracer.start_as_current_span(
+            f"tool:{tool_name}", **span_kwargs
+        ) as span:
+            span.set_attribute("tool.name", tool_name)


Minor nit: it would be great to also surface user.id and mcp.session_id on these spans for per-session/per-user filtering in Jaeger. That needs identity propagation middleware (ContextVars populated from the MCP request context) as a prerequisite.

Happy to follow up with a separate PR for that once this lands. Marking as a non-blocking suggestion.

Makes sense, having user.id and session_id on the spans would definitely help with debugging. Happy to take a crack at it once this lands, just let me know.

abhijeet-dhumal · 2026-06-02T14:07:03Z

Hey @priyank766, I tried this out locally against an OTel collector stack, and it works cleanly end to end.
Spans show up in Jaeger with the right service name, and the tool: prefix naming is much better for filtering.
Thank you for being consistent and addressing all the previous feedback so promptly. 🚀 🙌

A few nits below, otherwise LGTM.

abhijeet-dhumal · 2026-06-02T14:08:47Z

+            provider.add_span_processor(processor)
+            _otel_trace.set_tracer_provider(provider)
+
+        _tracing_initialized = True


Missing atexit.register(provider.shutdown) here. Without it, in-flight spans in the BatchSpanProcessor queue get silently dropped on server shutdown, and the background thread can deadlock if the collector is unreachable. A two-line fix:

    import atexit
    atexit.register(provider.shutdown)
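
The registration pattern can be tried without any OTel packages. In this sketch `FakeProvider` and `register_shutdown` are hypothetical names; a real tracer provider's shutdown() would flush queued spans, which the stand-in only models with a flag.

```python
import atexit


class FakeProvider:
    """Hypothetical stand-in for a tracer provider that buffers spans."""

    def __init__(self):
        self.flushed = False

    def shutdown(self):
        # A real provider would flush queued spans and stop its worker thread.
        self.flushed = True


def register_shutdown(provider):
    """Register provider.shutdown to run at interpreter exit and return the hook."""
    hook = provider.shutdown
    atexit.register(hook)
    return hook
```

Returning the hook lets tests call it directly instead of waiting for interpreter exit.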

abhijeet-dhumal · 2026-06-02T14:10:13Z

+        with tracer.start_as_current_span(
+            f"tool:{tool_name}", **span_kwargs
+        ) as span:
+            span.set_attribute("tool.name", tool_name)


Can we also add tool.args_preview here? Without it, you can't reconstruct which parameters caused a failure from Jaeger alone; you'd have to cross-reference the audit log. Something like:

    import json
    span.set_attribute("tool.args_preview", json.dumps(mask_sensitive_data(kwargs), default=str)[:300])
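
A self-contained sketch of the preview logic follows. `mask_sensitive` is a hypothetical stand-in for the project's mask_sensitive_data(), whose real rules are not shown in this thread, and the key names in `SENSITIVE_KEYS` are assumptions.

```python
import json

SENSITIVE_KEYS = {"token", "password", "secret"}  # assumed key names


def mask_sensitive(kwargs):
    """Hypothetical stand-in for mask_sensitive_data(): redact values by key name."""
    return {k: "***" if k.lower() in SENSITIVE_KEYS else v for k, v in kwargs.items()}


def args_preview(kwargs, limit=300):
    """Serialize masked kwargs and cap the length so the attribute stays small."""
    # default=str keeps non-JSON values (paths, datetimes) from raising TypeError.
    return json.dumps(mask_sensitive(kwargs), default=str)[:limit]
```

Masking happens before serialization, so truncation can never cut a secret in half and leak part of it. A truncated preview is not guaranteed to be valid JSON, so it is best read as a debugging hint rather than parsed.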

abhijeet-dhumal · 2026-06-02T14:11:25Z

+                span.set_attribute("tool.success", False)
+                span.set_attribute("tool.duration_ms", duration_ms)
+                logger.warning("circuit_open", extra={"tool": tool_name})
+                return {


This path sets tool.success=False on the span but there's no test covering it.
Worth adding a test with _FakeBreaker(can_execute=False) asserting the attributes are set before the early return.
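
The shape of such a test might look like the sketch below. It exercises a simplified model of the call path rather than the real _audit_wrap: `call_tool`, `RecordingSpan`, `FakeBreaker`, and the tool name `list_jobs` are all hypothetical, and the error code is a plain string instead of ErrorCode.CIRCUIT_OPEN.

```python
import time


class RecordingSpan:
    """Test double that stores attributes instead of exporting them."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


class FakeBreaker:
    """Hypothetical breaker double with a fixed can_execute() answer."""

    def __init__(self, can_execute):
        self._can_execute = can_execute

    def can_execute(self):
        return self._can_execute


def call_tool(tool_func, tool_name, breaker, span):
    """Simplified model of the audited call path, covering only the breaker check."""
    start = time.monotonic()
    if not breaker.can_execute():
        # Early return: set attributes first so the span still describes the failure.
        span.set_attribute("tool.success", False)
        span.set_attribute("tool.duration_ms", int((time.monotonic() - start) * 1000))
        return {"error": f"Circuit breaker open for '{tool_name}'",
                "error_code": "CIRCUIT_OPEN"}
    result = tool_func()
    span.set_attribute("tool.success", True)
    span.set_attribute("tool.duration_ms", int((time.monotonic() - start) * 1000))
    return result


def test_circuit_open_sets_failure_attributes():
    span = RecordingSpan()
    result = call_tool(lambda: {"ok": True}, "list_jobs", FakeBreaker(False), span)
    assert result["error_code"] == "CIRCUIT_OPEN"
    assert span.attributes["tool.success"] is False
    assert span.attributes["tool.duration_ms"] >= 0
```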

abhijeet-dhumal · 2026-06-02T14:12:19Z

+- Enable tracing with CLI flag or env var:
+
+```bash
+kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces


Heads up: setup_tracing() auto-appends /v1/traces to whatever is passed, so the example in the README produces http://localhost:4318/v1/traces/v1/traces — double path. Either change the README example to the base URL (http://localhost:4318) or stop auto-appending in code and accept the full path as-is. The latter is less surprising.
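
One way to make either input form safe is to append the path only when it is missing. The helper below is a sketch under that assumption; `normalize_traces_endpoint` is a hypothetical name and not the PR's setup_tracing() logic.

```python
from urllib.parse import urlsplit, urlunsplit

TRACES_PATH = "/v1/traces"


def normalize_traces_endpoint(endpoint):
    """Return an OTLP/HTTP traces URL, appending /v1/traces only when missing."""
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"invalid OTLP HTTP endpoint: {endpoint!r}")
    path = parts.path.rstrip("/")
    if not path.endswith(TRACES_PATH):
        path += TRACES_PATH
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

With this approach both `http://localhost:4318` and `http://localhost:4318/v1/traces` resolve to the same URL, so the README example and the base-URL form behave identically.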

abhijeet-dhumal · 2026-06-02T14:13:19Z

+
+Each tool invocation emits a span with attributes:
+`tool.name`, `tool.success`, `tool.duration_ms`, `kubeflow.persona`, and `correlation_id`.
+


Worth a one-liner noting that kubeflow-mcp agent --otel-endpoint ... emits a separate kubeflow-mcp-agent service in Jaeger. Without this, users running the agent will wonder why they only see server-side spans and think tracing is broken.

priyank766 · 2026-06-02T16:58:08Z

I've addressed all the nits in the latest commit. Let me know if everything looks good!!
Thanks for testing it out and for the kind words ! 🚀
@abhijeet-dhumal

abhijeet-dhumal · 2026-06-24T06:11:19Z


    @functools.wraps(tool_func)
-    def wrapper(**kwargs):
+    def wrapper(ctx: Context | None = _MCP_CTX_DEFAULT, **kwargs):


CurrentContext() DI on a sync wrapper may not inject reliably. Prefer AuditIdentityMiddleware + ContextVars as the fallback for user.id / mcp.session.id.

abhijeet-dhumal · 2026-06-24T06:11:51Z

+                    pass
+
+            # Custom Kubeflow enrichment
+            span.set_attribute("kubeflow.persona", persona)


Add user.id from the middleware identity bridge.

We can add middleware (core/middleware.py, AuditIdentityMiddleware) to bridge the FastMCP async context to the sync _audit_wrap.
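
The ContextVars bridge can be sketched with the standard library alone. Everything here is hypothetical (the variable names, the dict-shaped request, and the `identity_middleware` signature); it does not use FastMCP's middleware API and only shows how an async layer can publish identity that sync code reads later.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical variable names; None means "no identity known".
current_user_id: ContextVar = ContextVar("current_user_id", default=None)
current_session_id: ContextVar = ContextVar("current_session_id", default=None)


def audited_tool():
    """Sync code reads identity from ContextVars, not from an injected argument."""
    return {"user.id": current_user_id.get(), "mcp.session.id": current_session_id.get()}


async def identity_middleware(request, call_next):
    """Async layer: publish identity for the duration of one request."""
    user_token = current_user_id.set(request.get("user_id"))
    session_token = current_session_id.set(request.get("session_id"))
    try:
        return await call_next()
    finally:
        # Reset so identity never leaks into later work in the same context.
        current_session_id.reset(session_token)
        current_user_id.reset(user_token)


async def handle(request):
    async def call_next():
        return audited_tool()  # sync call made inside the request's context

    return await identity_middleware(request, call_next)
```

For example, `asyncio.run(handle({"user_id": "alice", "session_id": "s-1"}))` returns both values as span-ready attributes, and the variables fall back to None outside a request.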

abhijeet-dhumal · 2026-06-24T06:15:10Z

@priyank766 Maybe you also need to re-sign the commits to fix the DCO PR check.

Signed-off-by: priyank <priyank8445@gmail.com>

…hoist tracer, noop kwargs
- Use SpanKind.CLIENT for tool spans (K8s API = external service)
- Replace manual record_exception() with set_status(Status(StatusCode.ERROR)) to avoid duplicate events and improve Jaeger trace visibility
- Move get_tracer() to _audit_wrap scope (once per tool, not per call)
- Add **kwargs and set_status() to _NoopTracer/_NoopSpan for API compatibility
Signed-off-by: priyank <priyank8445@gmail.com>

Replace custom KUBEFLOW_MCP_OTEL_ENDPOINT with the official OpenTelemetry env var across README, CLI help text, and config loader. Signed-off-by: priyank <priyank8445@gmail.com>

…zation, circuit breaker test
- Register atexit.shutdown on TracerProvider to flush in-flight spans
- Add tool.args_preview span attribute with masked kwargs (truncated to 300 chars)
- Auto-append /v1/traces for base URLs to match OTel SDK convention
- Add test for circuit breaker open path (tool.success=False)
- Document kubeflow-mcp-agent as separate Jaeger service
Signed-off-by: priyank <priyank8445@gmail.com>

…tions Signed-off-by: priyank <priyank8445@gmail.com>

abhijeet-dhumal · 2026-06-24T12:16:37Z

/ok-to-test

abhijeet-dhumal · 2026-06-24T12:22:17Z

Thanks @priyank766 for the quick turnaround. It looks great!
LGTM, we just need to resolve the PR checks: can you rebase on the latest main and run ruff format so they pass?

…uts, env var fallback, service rename Signed-off-by: priyank <priyank8445@gmail.com>

abhijeet-dhumal · 2026-06-24T13:14:21Z

This is amazing work 🚀
Thanks a lot @priyank766 for your patience!!
/lgtm

abhijeet-dhumal · 2026-06-24T13:18:01Z

/approve
/hold in case @andreyvelich @jaiakash @Krishna-kg732 have additional comments.

google-oss-prow · 2026-06-24T13:18:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhijeet-dhumal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhijeet-dhumal]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow · 2026-06-25T11:06:58Z

New changes are detected. LGTM label has been removed.

google-oss-prow Bot requested review from astefanutti, kramaranya and szaher May 15, 2026 16:50

google-oss-prow Bot added the size/XL label May 15, 2026

priyank766 force-pushed the feat/otel-tracing branch from ec68142 to bec06a0 Compare May 15, 2026 16:51

abhijeet-dhumal reviewed May 18, 2026

Copilot AI review requested due to automatic review settings May 18, 2026 15:53

Copilot started reviewing on behalf of priyank766 May 18, 2026 15:54 View session

Copilot AI reviewed May 18, 2026

priyank766 requested a review from abhijeet-dhumal May 20, 2026 15:02

abhijeet-dhumal reviewed May 23, 2026

priyank766 force-pushed the feat/otel-tracing branch from 86d30a5 to 6428ffe Compare May 23, 2026 11:19

abhijeet-dhumal reviewed May 25, 2026

priyank766 requested a review from abhijeet-dhumal May 25, 2026 17:30

abhijeet-dhumal reviewed Jun 2, 2026

priyank766 requested a review from abhijeet-dhumal June 3, 2026 16:28

abhijeet-dhumal reviewed Jun 24, 2026

priyank766 force-pushed the feat/otel-tracing branch from 6957f3a to 947ced4 Compare June 24, 2026 11:44

google-oss-prow Bot added size/XXL and removed size/XL labels Jun 24, 2026

priyank766 added 6 commits June 24, 2026 17:19

feat: add OpenTelemetry tracing for tool calls

47dac18

Signed-off-by: priyank <priyank8445@gmail.com>

fix: address telemetry review feedback

3e10b15

Signed-off-by: priyank <priyank8445@gmail.com>

fix: use standard OTEL_EXPORTER_OTLP_ENDPOINT env var

72b65f7

Replace custom KUBEFLOW_MCP_OTEL_ENDPOINT with the official OpenTelemetry env var across README, CLI help text, and config loader. Signed-off-by: priyank <priyank8445@gmail.com>

feat(telemetry): align OpenTelemetry tracing with MCP semantic conven…

ce8ef2a

…tions Signed-off-by: priyank <priyank8445@gmail.com>

priyank766 force-pushed the feat/otel-tracing branch from 947ced4 to 04d258f Compare June 24, 2026 11:51

priyank766 requested a review from abhijeet-dhumal June 24, 2026 11:55

google-oss-prow Bot added the ok-to-test Approve CI for external contributors label Jun 24, 2026

priyank766 force-pushed the feat/otel-tracing branch from 04d258f to ab8b1c0 Compare June 24, 2026 12:26

fix(telemetry): address review — middleware ContextVars, export timeo…

d70b7a4

…uts, env var fallback, service rename Signed-off-by: priyank <priyank8445@gmail.com>

priyank766 force-pushed the feat/otel-tracing branch from ab8b1c0 to d70b7a4 Compare June 24, 2026 12:41

google-oss-prow Bot assigned abhijeet-dhumal Jun 24, 2026

google-oss-prow Bot added the lgtm Looks good to me — approved by a reviewer label Jun 24, 2026

google-oss-prow Bot added the do-not-merge/hold Blocked — do not merge label Jun 24, 2026

google-oss-prow Bot added approved Approved by an approver in OWNERS and removed lgtm Looks good to me — approved by a reviewer labels Jun 24, 2026

priyank766 force-pushed the feat/otel-tracing branch from 55e5136 to d70b7a4 Compare June 25, 2026 11:07