Binlog Investigation Eval Results

📊 Dashboard — live scoring trends across configurations

Public results and dashboard for evaluating MSBuild binary log investigation quality across multiple AI tooling configurations.

The evaluation infrastructure, test cases, and scoring scripts live in the private dotnet/binlog-tools-eval repository. This public repo hosts:

results branch — full evaluation data pushed after each pipeline run
GitHub Pages dashboard — interactive charts deployed to gh-pages
Tracking issue — summary comments appended per run

Configurations

ID	Configuration	Description
A	Plain Copilot	No skills, no MCP. Agent uses `dotnet msbuild` replay + grep
B	Copilot + dotnet/skills	`binlog-failure-analysis` skill (replay-to-text-logs workflow)
C	Copilot + baronfel MCP	`baronfel.binlog.mcp` — structured queries
D	Copilot + BinlogInsights MCP	`BinlogInsights.Mcp` — 26 tools
E	Copilot + SQLite Logger	Binlog → SQLite conversion, agent queries with `sqlite3` CLI
F	Copilot + Picasso	`baronfel.binlog.cli` via `dnx` batch mode
G	Copilot + AndyG MCP	`BinlogMCP` — 52 tools
H	Copilot + BinlogMcp	`BinlogMcp` — MSBuildStructuredLog MCP server
I	Copilot + AITools.BinlogMcp	`AITools.BinlogMcp` — dotnet/ai-tools MCP server

Test Cases

Case	Description	Difficulty
01	Project not in solution (Debug/Release config mismatch)	Medium
02	App.config binding redirect poisoning (MSB3277)	Very Hard
03	Shared distrib folder signing failure	Hard

Each case has 4 scenario tiers: Surface (error extraction), Analysis (root cause), Insight (deep MSBuild inspection), Deep (actionable fix).

How Results Are Produced

The private binlog-evals repo contains:

MSBuild binary logs (.binlog) with known build failures
Evaluation scenarios with ground-truth rubrics
Scripts that run GitHub Copilot against each case × configuration
An LLM judge that scores responses on a 0–1 scale per rubric item

A CI pipeline in the private repo runs evaluations daily, then cross-publishes results and dashboard updates to this public repo.

Browsing Results

Dashboard: Visit the live dashboard for interactive score trends
Raw data: Switch to the results branch to browse runs/<timestamp>/<config>/<case>/ directories
Issue tracker: See Issue #1 for per-run summary comments

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Binlog Investigation Eval Results

Configurations

Test Cases

How Results Are Produced

Browsing Results

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Binlog Investigation Eval Results

Configurations

Test Cases

How Results Are Produced

Browsing Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages