experiment: Agent evaluation via MLflow + OpenTelemetry #30

Open
ascerra wants to merge 2 commits into main from experiment/agent-eval-mlflow-otel

@ascerra ascerra commented Jun 9, 2026

Contributor

Summary

  • Adds agent-eval-mlflow-otel/ experiment validating MLflow 3.x + OTLP as a complete eval platform for autonomous AI agents
  • Includes simplified example scripts: mechanical scorers, LLM-as-judge scorers, trace export, prompt registration, regression detection
  • All 5 hypotheses validated: trace capture, scoring, PR gates, regression detection, prompt versioning

Contents

| File | Purpose |
|---|---|
| README.md | Full experiment write-up with architecture, results, and analysis |
| examples/scorer_mechanical.py | 5 pure-Python scorers (validation, cost, efficiency, confidence, iterations) |
| examples/scorer_llm_judge.py | 4 Claude Opus semantic quality scorers via Vertex AI |
| examples/run_eval.py | Score traces via mlflow.genai.evaluate() |
| examples/check_regression.py | Compare recent traces against golden baselines |
| examples/register_prompts.py | MLflow Prompts Registry with @staging/@production aliases |
| examples/send_trace_example.py | Minimal OTLP trace export to MLflow |
| examples/harness-explore.yaml | Example harness config with eval section |
| fixtures/ | Example fixture input and LLM judge rubric |

Security

  • No hardcoded secrets — all credentials via environment variables
  • No internal IPs or hostnames
  • .gitignore covers .env, venv/, results/, output/

Made with Cursor

@ascerra ascerra requested a review from a team as a code owner June 9, 2026 17:23
Experiment validating MLflow 3.x + OTLP as a complete eval platform
for autonomous AI agents: trace capture, mechanical + LLM-judge scoring,
PR quality gates, regression detection, and prompt versioning.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Adam Scerra <ascerra@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@ascerra ascerra force-pushed the experiment/agent-eval-mlflow-otel branch from 02dfefe to 9203dbc Compare June 9, 2026 17:24
@fullsend-ai-review

🤖 Review · Started 5:25 PM UTC
Commit: ba204cb · View workflow run →

Signed-off-by: Adam Scerra <ascerra@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@fullsend-ai-review

fullsend-ai-review Bot commented Jun 9, 2026

🤖 Finished Review · ✅ Success · Started 5:28 PM UTC · Completed 5:43 PM UTC
Commit: ba204cb · View workflow run →

@fullsend-ai-review

Review

Findings

Medium

  • [error-handling] agent-eval-mlflow-otel/examples/scorer_llm_judge.py:34 — _llm_judge() parses the LLM response as JSON with no error handling. If the judge model returns malformed JSON, json.loads raises JSONDecodeError and crashes the entire evaluation run. An unexpected structure (e.g., a missing score key) passes json.loads but fails with KeyError when the key is accessed. The markdown-fence stripping logic is also fragile.
    Remediation: Wrap json.loads(content) in a try/except and return a fallback Feedback (e.g., score=0 with rationale indicating parse failure). Validate the returned dict contains the expected score key before accessing it.

  • [error-handling] agent-eval-mlflow-otel/examples/scorer_mechanical.py:28 — In tool_efficiency, int() cast on get_attribute values may raise ValueError if the attribute is a non-numeric string. The or 0 fallback only handles None/falsy, not arbitrary strings.
    Remediation: Use try/except around the int() casts, e.g., try: tools = int(...) except (ValueError, TypeError): tools = 0.

  • [api-contract] agent-eval-mlflow-otel/examples/run_eval.py:93 — mlflow.log_param and mlflow.log_metrics are called after mlflow.genai.evaluate() returns without an active MLflow run context. If evaluate() manages its own internal run, these calls will fail with MlflowException.
    Remediation: Wrap the evaluation and logging in a with mlflow.start_run(): block.
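
    The two error-handling remediations above could look roughly like the following sketch. This is not the code in the PR: parse_judge_response, safe_int, and the plain-dict return shape are hypothetical, and the actual scorer would presumably wrap the result in an MLflow Feedback object rather than a dict.

```python
import json
import re
from typing import Any

# Build the fence marker programmatically so this snippet can itself live
# inside a fenced code block.
_TICKS = "`" * 3
_FENCE = re.compile(rf"^{_TICKS}(?:json)?\s*(.*?)\s*{_TICKS}$", re.DOTALL)


def _fallback(reason: str) -> dict[str, Any]:
    """Hypothetical fallback result: score 0 plus the reason parsing failed."""
    return {"score": 0, "rationale": reason}


def parse_judge_response(content: Any) -> dict[str, Any]:
    """Parse a judge reply into {"score", "rationale"} without ever raising."""
    if not isinstance(content, str):
        return _fallback("judge response was not text")
    text = content.strip()
    fenced = _FENCE.match(text)
    if fenced:
        # Drop an optional markdown fence wrapped around the JSON payload.
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return _fallback("judge response was not valid JSON")
    if not isinstance(data, dict) or "score" not in data:
        return _fallback("judge response missing 'score' key")
    return {"score": data["score"], "rationale": str(data.get("rationale", ""))}


def safe_int(value: Any, default: int = 0) -> int:
    """Cast a span attribute to int; fall back on None or non-numeric input."""
    try:
        return int(value)
    except (ValueError, TypeError, OverflowError):
        return default
```

    With helpers like these, a malformed judge reply or a non-numeric span attribute degrades a single score instead of aborting the whole evaluation run.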

Low

  • [logic-error] agent-eval-mlflow-otel/examples/check_regression.py:80 — The regressions list is always empty (comparison logic is commented out). The script prints "To complete: fetch recent traces..." acknowledging the stub, but still reports "All scorers within threshold" which could be misleading if used in CI without reading the output carefully.

  • [prompt-injection] agent-eval-mlflow-otel/examples/scorer_llm_judge.py:72 — _get_trace_summary() interpolates trace data (reasoning text, agent name) directly into LLM judge prompts without delimiting. An adversarial trace could influence scoring. Risk is low since this is an internal evaluation tool scoring the team's own agent traces.

  • [credential-handling] agent-eval-mlflow-otel/examples/check_regression.py:35 — connect() hardcodes admin as default MLflow username via setdefault. Could lead to unintended admin-level access if MLFLOW_TRACKING_USERNAME is unset while MLFLOW_OTLP_TOKEN is set.

  • [edge-case] agent-eval-mlflow-otel/examples/register_prompts.py:68 — client.search_prompt_versions() may raise RestException for non-existent prompt names rather than returning an empty list. First-time registration could fail.

  • [logic-error] agent-eval-mlflow-otel/examples/harness-explore.yaml:22 — iteration_count returns a raw count (e.g., 3) not a normalized 0–1 score. Gating logic like min_quality_score: 3.0 would interact confusingly with unnormalized values in metrics aggregation.

  • [naming-convention] agent-eval-mlflow-otel/examples/send_trace_example.py — The _example suffix is redundant when the file is already in the examples/ directory.

  • [missing-authorization] agent-eval-mlflow-otel/README.md — This PR adds 12 new files with no linked issue. For an experiments repo this is a minor process gap — the thorough README provides sufficient context — but linking to an authorizing issue improves traceability.
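
    For the check_regression.py stub noted above, one option is to make "nothing was compared" a distinct outcome so that CI cannot read it as a pass. The sketch below is only an illustration under assumptions: RegressionReport, check_regressions, the per-scorer mean dictionaries, and the 0.1 default threshold are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class RegressionReport:
    """Outcome of a baseline comparison (hypothetical shape)."""
    status: str  # "pass", "fail", or "no_data"
    regressions: list[str] = field(default_factory=list)


def check_regressions(
    baseline: dict[str, float],
    recent: dict[str, float],
    threshold: float = 0.1,
) -> RegressionReport:
    """Flag scorers whose recent mean fell more than `threshold` below baseline."""
    compared = [name for name in baseline if name in recent]
    if not compared:
        # No overlapping scorers: say so explicitly instead of claiming
        # that every scorer is within threshold.
        return RegressionReport(status="no_data")
    regressions = sorted(
        name for name in compared if baseline[name] - recent[name] > threshold
    )
    return RegressionReport("fail" if regressions else "pass", regressions)
```

    A CI wrapper could then exit non-zero on both "fail" and "no_data", and print "All scorers within threshold" only when the status is "pass".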

Info

  • [secrets-handling] .gitignore correctly excludes .env, venv/, results/, output/. All credentials loaded from environment variables. No hardcoded secrets found.

  • [scope-alignment] README clearly states production versions live at fullsend-ai/features and these are simplified standalone excerpts. Scope is well-documented.

  • [architectural-coherence] Post-hoc trace export design (avoiding coupling agents to observability libraries) is architecturally sound and well-justified.

@fullsend-ai-review fullsend-ai-review Bot added the requires-manual-review label (Review requires human judgment) Jun 9, 2026