feat(eval): seed unlabelled candidate probes from conversation logs #23

Merged: peterkimpmp merged 1 commit into main from feat/eval-probe-seeder on Jun 2, 2026
Conversation

@peterkimpmp

Contributor

What

Adds sillok/eval/probe_seeder.py, which turns a conversation export into
unlabelled candidate probe stubs: the redacted first user message of each deep
conversation (one that meets the min_messages threshold), ready for a human to
assign expected_pack.

Why

Real usage is the best source of eval probes. This seeds the probe set from
actual questions while keeping a human in the loop for labelling — and keeping
private data out via redaction.

How

  • seed_probes(conversations, min_messages=10) -> list[ProbeStub]; ProbeStub.expected_pack is always None (human-assigned).
  • Generic conversation shape (role + text, incl. content list/str); no vendor-specific parser.
  • redact() with a pluggable RedactionRule list; defaults are generic PII only (email / phone-like / URL) — no domain or organisation patterns.
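
For readers without the diff open, here is a minimal sketch of how the three points above could fit together. It is not the merged code. The names seed_probes, ProbeStub, expected_pack, redact, RedactionRule and min_messages come from this description; everything else is an assumption made for illustration: the "messages" key on each conversation, the ProbeStub.query field, the max_chars truncation parameter, the _text_of helper, the placeholder tokens, and the exact regexes.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Any, Iterable, Optional, Sequence


@dataclass(frozen=True)
class RedactionRule:
    """One pluggable rule: a compiled pattern and the token that replaces it."""
    name: str
    pattern: re.Pattern[str]
    replacement: str


# Assumed defaults: generic PII only (URL, email, phone-like). The patterns are
# illustrative, not the ones shipped in sillok/eval/probe_seeder.py.
DEFAULT_RULES: tuple[RedactionRule, ...] = (
    RedactionRule("url", re.compile(r"https?://\S+"), "[URL]"),
    RedactionRule("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    RedactionRule("phone", re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
)


def redact(text: str, rules: Sequence[RedactionRule] = DEFAULT_RULES) -> str:
    """Apply each rule in order; callers can pass their own rule list."""
    for rule in rules:
        text = rule.pattern.sub(rule.replacement, text)
    return text


@dataclass(frozen=True)
class ProbeStub:
    query: str  # hypothetical field name for the redacted message text
    expected_pack: Optional[str] = None  # always None here; a human assigns it


def _text_of(message: dict[str, Any]) -> str:
    """Read message text from 'content' (str or list of parts) or 'text'."""
    content = message.get("content", message.get("text", ""))
    if isinstance(content, str):
        return content
    parts: list[str] = []
    for part in content or []:
        if isinstance(part, str):
            parts.append(part)
        elif isinstance(part, dict):
            parts.append(str(part.get("text", "")))
    return " ".join(p for p in parts if p)


def seed_probes(
    conversations: Iterable[dict[str, Any]],
    min_messages: int = 10,
    max_chars: int = 500,  # hypothetical truncation limit
) -> list[ProbeStub]:
    """Return one unlabelled stub per sufficiently deep conversation."""
    stubs: list[ProbeStub] = []
    for conversation in conversations:
        messages = conversation.get("messages", [])  # assumed container key
        if len(messages) < min_messages:
            continue  # shallow conversation: skip
        first_user = next((m for m in messages if m.get("role") == "user"), None)
        if first_user is None:
            continue  # no user turn: skip
        # Redact before truncating so a cut never leaves half a match behind.
        text = redact(_text_of(first_user).strip())[:max_chars]
        if text:
            stubs.append(ProbeStub(query=text))
    return stubs
```

In this sketch a domain-specific pattern would be supplied by the caller as an extra RedactionRule rather than added to the defaults, which matches the intent of keeping the defaults generic.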

Validation

  • ruff / mypy sillok/eval/probe_seeder.py clean
  • pytest tests/unit/eval/test_probe_seeder.py → 8 passed (deep/shallow, no-user skip, redaction, content-list shape, truncation, and a guardrail asserting the defaults carry no clinical/domain patterns)
  • All test conversation data is synthetic.

Note for reviewer

Also adds an import block to sillok/eval/__init__.py. If #20 (calibration)
merges first, this needs a one-line rebase of that __init__ (both only add
adjacent import groups).

Provenance

Reduced from an internal history→golden tool whose domain/intent labels and
clinical-variable redactions were operator-specific. Those were dropped;
only the generic conversation→redacted-stub skeleton remains. PII gate: 0 hits.

Upstream: peterkimpmp/aipm#660 · Plan: peterkimpmp/aipm#647,#656 · Follows: #22

Turn a conversation export into redacted, unlabelled probe stubs (the redacted
first user message of each sufficiently deep conversation) so a human can assign
expected_pack. Automatic domain/intent labelling is deliberately out of scope —
those labels were one operator's personal vocabulary and do not generalise.
Redaction is a pluggable rule list; defaults cover only generic PII
(email/phone-like/URL). Privacy-first, provider-neutral; tests use synthetic
conversations only.

Upstream: peterkimpmp/aipm#660
Signed-off-by: peterkimpmp <tykim21@gmail.com>
peterkimpmp force-pushed the feat/eval-probe-seeder branch from 12992c0 to 2980fe0 on June 2, 2026 03:36
peterkimpmp merged commit 8b0a501 into main on Jun 2, 2026
6 checks passed
peterkimpmp deleted the feat/eval-probe-seeder branch on June 2, 2026 03:37