Skip to content

feat: add OpenTelemetry tracing for tool calls#21

Open
priyank766 wants to merge 7 commits into
kubeflow:mainfrom
priyank766:feat/otel-tracing
Open

feat: add OpenTelemetry tracing for tool calls#21
priyank766 wants to merge 7 commits into
kubeflow:mainfrom
priyank766:feat/otel-tracing

Conversation

@priyank766

Copy link
Copy Markdown

Description

Adds optional OpenTelemetry tracing for tool calls in Kubeflow MCP Server.

  • Adds core/telemetry.py with setup_tracing() and get_tracer() plus safe no-op fallback when OTel deps are unavailable.
  • Instruments core.server._audit_wrap to create one span per tool invocation, set tool/persona/duration/success attributes, attach correlation_id, and record
    exceptions.
  • Wires tracing config through CLI/env/config:
    • --otel-endpoint
    • KUBEFLOW_MCP_OTEL_ENDPOINT
    • observability.otel_endpoint
  • Adds unit tests for no-op path, provider setup/reuse, endpoint validation, and span behavior.
  • Updates README with an Observability section.

Important compatibility note: correlation_id semantics are preserved and exposed as a span attribute; it is not remapped to OTel trace ID.

Type of Change

  • feat: New feature

Checklist

  • Tests pass locally (make test-python)
  • Linting passes (make verify)
  • Documentation updated (if applicable)
  • Commit messages follow conventional format

Related Issues

Fixes #18

@abhijeet-dhumal

Copy link
Copy Markdown
Member

Hey @priyank766 , this looks great 🚀
Thanks for working on this !

Comment thread pyproject.toml Outdated
"sphinx-design>=0.5",
]
otel = [
"opentelemetry-exporter-otlp>=1.25.0",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"opentelemetry-exporter-otlp>=1.25.0",
"opentelemetry-exporter-otlp-proto-http>=1.25.0",

this avoids installation of unnecessary GRPC subpackage

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the Suggestion
I will change the package as well

Comment thread kubeflow_mcp/core/server.py Outdated
exc_info=True,
)
raise
with tracer.start_as_current_span("tool_call") as span:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it just now and it seems span name "tool_call" makes it hard to filter by tool in tracing UIs, Maybe we can consider naming it after the tool: f"tool:{tool_name}" or even just tool_name. The attribute tool.name is still good to keep for structured querying, but the name gives context at a glance.
wdyt?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for Reviewing @abhijeet-dhumal
I will push both changes and then you can review it later whenever you get time
Yes I think this naming scheme is much better I will change it .. 👍🏻

Copilot AI review requested due to automatic review settings May 18, 2026 15:53

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds optional OpenTelemetry tracing for Kubeflow MCP tool calls, wiring tracing through configuration, CLI, runtime instrumentation, tests, and docs.

Changes:

  • Adds telemetry setup/no-op helpers and an otel optional dependency extra.
  • Instruments _audit_wrap spans with tool/persona/correlation/success/duration attributes.
  • Wires --otel-endpoint, config/env loading, startup logging, tests, and README documentation.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
kubeflow_mcp/core/telemetry.py Adds OpenTelemetry setup, validation, provider reuse, and no-op fallback helpers.
kubeflow_mcp/core/server.py Adds span creation and attributes around audited tool calls.
kubeflow_mcp/cli.py Adds --otel-endpoint and invokes tracing setup during server startup.
kubeflow_mcp/core/config.py Adds observability.otel_endpoint config/env support.
kubeflow_mcp/core/logging.py Includes tracing_enabled in structured startup logs.
tests/unit/core/test_telemetry.py Adds telemetry helper and span attribute tests.
kubeflow_mcp/cli_test.py Adds CLI tracing setup wiring tests.
README.md Documents optional tracing setup and span attributes.
pyproject.toml Adds the otel optional dependency extra.
uv.lock Locks OpenTelemetry optional dependencies and related resolution changes.
Comments suppressed due to low confidence (1)

kubeflow_mcp/core/server.py:126

  • The circuit-open early-return tracing path is not covered by the new span behavior tests. Add coverage for a breaker with can_execute() == False so the tool.success=false and duration attributes remain verified for this failure mode.
            if not breaker.can_execute():
                duration_ms = int((time.monotonic() - start) * 1000)
                span.set_attribute("tool.success", False)
                span.set_attribute("tool.duration_ms", duration_ms)
                logger.warning("circuit_open", extra={"tool": tool_name})
                return {
                    "error": f"Circuit breaker open for '{tool_name}' — K8s API may be degraded. Retries automatically after recovery timeout.",
                    "error_code": ErrorCode.CIRCUIT_OPEN,
                }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread kubeflow_mcp/core/server.py Outdated
Comment on lines +102 to +105
with tracer.start_as_current_span(f"tool:{tool_name}") as span:
span.set_attribute("tool.name", tool_name)
span.set_attribute("kubeflow.persona", persona)
span.set_attribute("correlation_id", cid)
Comment thread kubeflow_mcp/core/server.py Outdated
breaker.record_failure()
span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
span.record_exception(exc)

@abhijeet-dhumal abhijeet-dhumal May 23, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyank766 can we add span.set_status(StatusCode.ERROR) on exception here ?
span.set_status(Status(StatusCode.ERROR, str(exc)))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will make sure to spot failures easily in the trace list, wdyt?

Comment on lines +242 to +248
observability_file = file_config.get("observability", {})
observability = ObservabilityConfig(
otel_endpoint=os.getenv(
"KUBEFLOW_MCP_OTEL_ENDPOINT",
observability_file.get("otel_endpoint"),
)
)
Comment on lines +107 to +115
if _rate_limiter is not None and not _rate_limiter.acquire():
duration_ms = int((time.monotonic() - start) * 1000)
span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
logger.warning("rate_limited", extra={"tool": tool_name})
return {
"error": "Rate limit exceeded. Retry after a brief pause.",
"error_code": ErrorCode.RATE_LIMITED,
}
Comment thread README.md
Comment on lines +173 to +180
- Install optional dependencies: `pip install ".[otel]"`
- Enable tracing with CLI flag or env var:

```bash
kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces
# or
export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces
kubeflow-mcp serve
Comment thread kubeflow_mcp/core/server.py Outdated

cid = with_correlation_id()
masked = mask_sensitive_data(kwargs) if kwargs else {}
tracer = get_tracer("kubeflow_mcp.tools")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyank766 can you move this outside wrapper closure?
get_tracer is cheap but calling it per-invocation seems semantically wrong, wdyt?

Comment thread kubeflow_mcp/core/server.py Outdated
exc_info=True,
)
raise
with tracer.start_as_current_span(f"tool:{tool_name}") as span:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyank766 thinking can we use SpanKind.CLIENT for tool spans here?
Tool calls invoke an external service i.e. K8s API. SpanKind.CLIENT is semantically correct and enables better Jaeger dependency graph rendering..

return None


class _NoopTracer:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_NoopTracer should accept **kwargs here
Future callers passing kind=SpanKind.CLIENT will otherwise get a TypeError in no-op mode.

@priyank766

Copy link
Copy Markdown
Author

Thanks for the review @abhijeet-dhumal All suggestions have been addressed. Please take another look when you get a chance.

@priyank766 priyank766 force-pushed the feat/otel-tracing branch from 86d30a5 to 6428ffe Compare May 23, 2026 11:19
Comment thread README.md Outdated
```bash
kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces
# or
export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/traces

Comment thread kubeflow_mcp/core/config.py Outdated
observability_file = file_config.get("observability", {})
observability = ObservabilityConfig(
otel_endpoint=os.getenv(
"KUBEFLOW_MCP_OTEL_ENDPOINT",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"KUBEFLOW_MCP_OTEL_ENDPOINT",
"OTEL_EXPORTER_OTLP_ENDPOINT",

Comment thread kubeflow_mcp/cli.py Outdated
"--otel-endpoint",
default=None,
help="OpenTelemetry OTLP HTTP endpoint for tracing. "
"Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.",
"Falls back to OTEL_EXPORTER_OTLP_ENDPOINT env var, config file.",

Comment thread kubeflow_mcp/core/server.py Outdated
with tracer.start_as_current_span(
f"tool:{tool_name}", **span_kwargs
) as span:
span.set_attribute("tool.name", tool_name)

@abhijeet-dhumal abhijeet-dhumal May 25, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: would be great to also surface user.id and mcp.session_id on these spans for per-session/per-user filtering in Jaeger. That will need identity propagation middleware (ContextVars populated from the MCP request context) as a prerequisite..

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to follow up with a separate PR for that once this lands. Marking as a non-blocking suggestion.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, having user.id and session_id on the spans would definitely help with debugging. Happy to take a crack at it once this lands, just let me know.

@abhijeet-dhumal

abhijeet-dhumal commented Jun 2, 2026

Copy link
Copy Markdown
Member

Hey @priyank766 , I tried this out locally against OTel collector stack, works cleanly end-to-end.
Spans show up in Jaeger with the right service name, tool: prefix naming is much better for filtering.
Thanks you for being consistent and addressing all the previous feedback so promptly. 🚀 🙌

putting few nits otherwise lgtm..

provider.add_span_processor(processor)
_otel_trace.set_tracer_provider(provider)

_tracing_initialized = True

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing atexit.register(provider.shutdown) here. Without it, in-flight spans in the BatchSpanProcessor queue get silently dropped on server shutdown, and the background thread can deadlock if the collector is unreachable. One line:

import atexit
atexit.register(provider.shutdown)

Comment thread kubeflow_mcp/core/server.py Outdated
with tracer.start_as_current_span(
f"tool:{tool_name}", **span_kwargs
) as span:
span.set_attribute("tool.name", tool_name)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add tool.args_preview here? Without it you can't reconstruct what parameters caused a failure from Jaeger alone.. you'd have to cross-reference the audit log. Something like:

import json
span.set_attribute("tool.args_preview", json.dumps(mask_sensitive_data(kwargs), default=str)[:300])

span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
logger.warning("circuit_open", extra={"tool": tool_name})
return {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path sets tool.success=False on the span but there's no test covering it.
Worth adding a test with _FakeBreaker(can_execute=False) asserting the attributes are set before the early return.

Comment thread README.md Outdated
- Enable tracing with CLI flag or env var:

```bash
kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up: setup_tracing() auto-appends /v1/traces to whatever is passed, so the example in the README produces http://localhost:4318/v1/traces/v1/traces — double path. Either change the README example to the base URL (http://localhost:4318) or stop auto-appending in code and accept the full path as-is. The latter is less surprising.

Comment thread README.md

Each tool invocation emits a span with attributes:
`tool.name`, `tool.success`, `tool.duration_ms`, `kubeflow.persona`, and `correlation_id`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a one-liner noting that kubeflow-mcp agent --otel-endpoint ... emits a separate kubeflow-mcp-agent service in Jaeger. Without this, users running the agent will wonder why they only see server-side spans and think tracing is broken.

@priyank766

Copy link
Copy Markdown
Author

I've addressed all the nits in the latest commit. Let me know if everything looks good!!
Thanks for testing it out and for the kind words ! 🚀
@abhijeet-dhumal

Comment thread kubeflow_mcp/core/server.py Outdated

@functools.wraps(tool_func)
def wrapper(**kwargs):
def wrapper(ctx: Context | None = _MCP_CTX_DEFAULT, **kwargs):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CurrentContext() DI on sync wrapper may not inject reliably.. prefer AuditIdentityMiddleware + ContextVars for user.id / mcp.session.id fallback

pass

# Custom Kubeflow enrichment
span.set_attribute("kubeflow.persona", persona)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add user.id from middleware identity bridge

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can add Add middleware (core/middleware.py - AuditIdentityMiddleware) to bridge FastMCP async context to sync _audit_wrap

@abhijeet-dhumal

Copy link
Copy Markdown
Member

@priyank766 Maybe you also need to re-sign commits to fix DCO pr check

Signed-off-by: priyank <priyank8445@gmail.com>
Signed-off-by: priyank <priyank8445@gmail.com>
…hoist tracer, noop kwargs

- Use SpanKind.CLIENT for tool spans (K8s API = external service)

- Replace manual record_exception() with set_status(Status(StatusCode.ERROR)) to avoid duplicate events and improve Jaeger trace visibility

- Move get_tracer() to _audit_wrap scope (once per tool, not per call)

- Add **kwargs and set_status() to _NoopTracer/_NoopSpan for API compatibility

Signed-off-by: priyank <priyank8445@gmail.com>
Replace custom KUBEFLOW_MCP_OTEL_ENDPOINT with the official OpenTelemetry env var across README, CLI help text, and config loader.

Signed-off-by: priyank <priyank8445@gmail.com>
…zation, circuit breaker test

- Register atexit.shutdown on TracerProvider to flush in-flight spans

- Add tool.args_preview span attribute with masked kwargs (truncated to 300 chars)

- Auto-append /v1/traces for base URLs to match OTel SDK convention

- Add test for circuit breaker open path (tool.success=False)

- Document kubeflow-mcp-agent as separate Jaeger service

Signed-off-by: priyank <priyank8445@gmail.com>
…tions

Signed-off-by: priyank <priyank8445@gmail.com>
@abhijeet-dhumal

Copy link
Copy Markdown
Member

/ok-to-test

@google-oss-prow google-oss-prow Bot added the ok-to-test Approve CI for external contributors label Jun 24, 2026
@abhijeet-dhumal

Copy link
Copy Markdown
Member

Thanks @priyank766 for the quick turnaround. It looks great !
LGTM .. just need to resolve pr checks : Can you Rebase on latest main and run ruff format to pass pr checks ?

…uts, env var fallback, service rename

Signed-off-by: priyank <priyank8445@gmail.com>
@abhijeet-dhumal

Copy link
Copy Markdown
Member

This is amazing work 🚀
Thanks a lot @priyank766 for your patience!!
/lgtm

@google-oss-prow google-oss-prow Bot added the lgtm Looks good to me — approved by a reviewer label Jun 24, 2026
@abhijeet-dhumal

Copy link
Copy Markdown
Member

/approve
/hold in case @andreyvelich @jaiakash @Krishna-kg732 has additional comments.

@google-oss-prow google-oss-prow Bot added the do-not-merge/hold Blocked — do not merge label Jun 24, 2026
@google-oss-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhijeet-dhumal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot added approved Approved by an approver in OWNERS and removed lgtm Looks good to me — approved by a reviewer labels Jun 24, 2026
@google-oss-prow

Copy link
Copy Markdown

New changes are detected. LGTM label has been removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Approved by an approver in OWNERS do-not-merge/hold Blocked — do not merge ok-to-test Approve CI for external contributors size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add OpenTelemetry tracing to tool calls

3 participants