experiment: Agent evaluation via MLflow + OpenTelemetry by ascerra · Pull Request #30 · fullsend-ai/experiments

ascerra · 2026-06-09T17:23:26Z

Summary

Adds agent-eval-mlflow-otel/ experiment validating MLflow 3.x + OTLP as a complete eval platform for autonomous AI agents
Includes simplified example scripts: mechanical scorers, LLM-as-judge scorers, trace export, prompt registration, regression detection
All 5 hypotheses validated: trace capture, scoring, PR gates, regression detection, prompt versioning

Contents

| File | Purpose |
| --- | --- |
| `README.md` | Full experiment write-up with architecture, results, and analysis |
| `examples/scorer_mechanical.py` | 5 pure-Python scorers (validation, cost, efficiency, confidence, iterations) |
| `examples/scorer_llm_judge.py` | 4 Claude Opus semantic quality scorers via Vertex AI |
| `examples/run_eval.py` | Score traces via `mlflow.genai.evaluate()` |
| `examples/check_regression.py` | Compare recent traces against golden baselines |
| `examples/register_prompts.py` | MLflow Prompts Registry with `@staging`/`@production` aliases |
| `examples/send_trace_example.py` | Minimal OTLP trace export to MLflow |
| `examples/harness-explore.yaml` | Example harness config with eval section |
| `fixtures/` | Example fixture input and LLM judge rubric |

Security

No hardcoded secrets — all credentials via environment variables
No internal IPs or hostnames
.gitignore covers .env, venv/, results/, output/

Made with Cursor

Experiment validating MLflow 3.x + OTLP as a complete eval platform for autonomous AI agents: trace capture, mechanical + LLM-judge scoring, PR quality gates, regression detection, and prompt versioning.
Signed-off-by: Adam Scerra <ascerra@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

fullsend-ai-review · 2026-06-09T17:25:11Z

🤖 Review · Started 5:25 PM UTC
Commit: ba204cb

fullsend-ai-review · 2026-06-09T17:28:26Z

🤖 Finished Review · ✅ Success · Started 5:28 PM UTC · Completed 5:43 PM UTC
Commit: ba204cb

fullsend-ai-review · 2026-06-09T17:43:13Z

Review

Findings

Medium

[error-handling] agent-eval-mlflow-otel/examples/scorer_llm_judge.py:34 — `_llm_judge()` parses the LLM response as JSON with no error handling. If the judge model returns malformed JSON, `json.loads` raises `JSONDecodeError`; if it returns valid JSON with an unexpected structure (e.g., a missing `score` key), the subsequent key access fails instead. Either case crashes the entire evaluation run. The markdown-fence stripping logic is also fragile.
Remediation: Wrap `json.loads(content)` in a try/except and return a fallback `Feedback` (e.g., `score=0` with a rationale indicating parse failure). Validate that the returned dict contains the expected `score` key before accessing it.
[error-handling] agent-eval-mlflow-otel/examples/scorer_mechanical.py:28 — In `tool_efficiency`, the `int()` cast on `get_attribute` values may raise `ValueError` if the attribute is a non-numeric string. The `or 0` fallback only handles `None`/falsy values, not arbitrary strings.
Remediation: Use try/except around the `int()` casts, e.g., `try: tools = int(...)` followed by `except (ValueError, TypeError): tools = 0`.
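[api-contract] agent-eval-mlflow-otel/examples/run_eval.py:93 — `mlflow.log_param` and `mlflow.log_metrics` are called after `mlflow.genai.evaluate()` returns, outside any explicit MLflow run context. If `evaluate()` manages its own internal run, these calls will not attach to it: the fluent API implicitly starts a new run when none is active, so the parameter and metrics may land in a separate run from the evaluation results, or fail with `MlflowException` if the run state conflicts.
Remediation: Wrap the evaluation and logging in a `with mlflow.start_run():` block.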
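The two parsing remediations above could look roughly like the following. This is a minimal sketch, not the code in the PR: `parse_judge_response` and `safe_int` are hypothetical helper names, a plain dict stands in for the `Feedback` object, and the `score`/`rationale` response shape is assumed from the finding's description.

```python
import json
import re
from typing import Any

# Build the fence marker programmatically so this example stays readable.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_judge_response(content: str) -> dict[str, Any]:
    """Parse a judge reply into a score/rationale dict without raising.

    A plain dict stands in for the scorer's Feedback object (assumption).
    """
    # Unwrap an optional markdown code fence around the JSON payload.
    match = FENCED_JSON.search(content)
    payload = match.group(1) if match else content.strip()
    try:
        data = json.loads(payload)
    except json.JSONDecodeError as exc:
        return {"score": 0.0, "rationale": f"judge parse failure: {exc.msg}"}
    # Valid JSON can still have the wrong shape, so validate before access.
    if not isinstance(data, dict) or "score" not in data:
        return {"score": 0.0, "rationale": "judge response missing 'score' key"}
    try:
        score = float(data["score"])
    except (TypeError, ValueError):
        return {"score": 0.0, "rationale": "judge 'score' is not numeric"}
    return {"score": score, "rationale": str(data.get("rationale", ""))}


def safe_int(value: Any, default: int = 0) -> int:
    """Coerce a trace attribute to int, falling back on None or junk strings."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default
```

Returning a zero score with an explanatory rationale keeps one bad judge reply from aborting the whole run, while still leaving the failure visible in the results.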

Low

[logic-error] agent-eval-mlflow-otel/examples/check_regression.py:80 — The `regressions` list is always empty because the comparison logic is commented out. The script prints "To complete: fetch recent traces..." to acknowledge the stub, but it still reports "All scorers within threshold", which could mislead anyone using it in CI without reading the output carefully.
[prompt-injection] agent-eval-mlflow-otel/examples/scorer_llm_judge.py:72 — `_get_trace_summary()` interpolates trace data (reasoning text, agent name) directly into LLM judge prompts without delimiting it. An adversarial trace could influence scoring. Risk is low since this is an internal evaluation tool scoring the team's own agent traces.
[credential-handling] agent-eval-mlflow-otel/examples/check_regression.py:35 — `connect()` hardcodes `admin` as the default MLflow username via `setdefault`. This could lead to unintended admin-level access if `MLFLOW_TRACKING_USERNAME` is unset while `MLFLOW_OTLP_TOKEN` is set.
[edge-case] agent-eval-mlflow-otel/examples/register_prompts.py:68 — `client.search_prompt_versions()` may raise `RestException` for non-existent prompt names rather than returning an empty list. First-time registration could fail.
[logic-error] agent-eval-mlflow-otel/examples/harness-explore.yaml:22 — `iteration_count` returns a raw count (e.g., 3), not a normalized 0–1 score. Gating logic like `min_quality_score: 3.0` would interact confusingly with unnormalized values in metrics aggregation.
[naming-convention] agent-eval-mlflow-otel/examples/send_trace_example.py — The `_example` suffix is redundant when the file is already in the `examples/` directory.
[missing-authorization] agent-eval-mlflow-otel/README.md — This PR adds 12 new files with no linked issue. For an experiments repo this is a minor process gap — the thorough README provides sufficient context — but linking to an authorizing issue improves traceability.
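For the `check_regression.py` finding, one way to avoid a misleading pass is to treat "nothing was compared" as its own outcome rather than as success. The sketch below is an illustration under assumptions: the function names, the per-scorer mean-score dicts, the 0.1 threshold, and the exit codes are all hypothetical and not taken from the PR.

```python
def find_regressions(
    baseline: dict[str, float],
    recent: dict[str, float],
    threshold: float = 0.1,  # assumed tolerance; the PR's value is not shown
) -> tuple[dict[str, float], int]:
    """Return (scorer -> score drop, number of scorers actually compared)."""
    regressions: dict[str, float] = {}
    compared = 0
    for name, base_score in baseline.items():
        if name not in recent:
            continue  # no recent data for this scorer, so nothing to compare
        compared += 1
        drop = base_score - recent[name]
        if drop > threshold:
            regressions[name] = drop
    return regressions, compared


def report(
    baseline: dict[str, float],
    recent: dict[str, float],
    threshold: float = 0.1,
) -> tuple[int, str]:
    """Map the comparison to a (hypothetical) CI exit code and message."""
    regressions, compared = find_regressions(baseline, recent, threshold)
    if compared == 0:
        # Fail closed: an empty comparison must not read as a pass.
        return 2, "No scorers compared; result is inconclusive"
    if regressions:
        names = ", ".join(sorted(regressions))
        return 1, f"Regressions detected: {names}"
    return 0, f"All {compared} scorers within threshold"
```

A CI step could pass the first element of the returned tuple to `sys.exit()`, so a stubbed or empty comparison fails the gate instead of passing silently.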

Info

[secrets-handling] .gitignore correctly excludes .env, venv/, results/, output/. All credentials loaded from environment variables. No hardcoded secrets found.
[scope-alignment] README clearly states production versions live at fullsend-ai/features and these are simplified standalone excerpts. Scope is well-documented.
[architectural-coherence] Post-hoc trace export design (avoiding coupling agents to observability libraries) is architecturally sound and well-justified.

ascerra requested a review from a team as a code owner June 9, 2026 17:23

ascerra force-pushed the experiment/agent-eval-mlflow-otel branch from 02dfefe to 9203dbc June 9, 2026 17:24

Add architecture diagram to README

28f6e39

Signed-off-by: Adam Scerra <ascerra@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fullsend-ai-review Bot added the requires-manual-review label (Review requires human judgment) Jun 9, 2026

ascerra wants to merge 2 commits into main from experiment/agent-eval-mlflow-otel
