feat(eval): seed unlabelled candidate probes from conversation logs by peterkimpmp · Pull Request #23 · sillok-os/sillok

peterkimpmp · 2026-06-02T03:20:40Z

What

Adds sillok/eval/probe_seeder.py — turns a conversation export into
unlabelled candidate probe stubs (redacted first-user-messages from deep
conversations), ready for a human to assign expected_pack.

Why

Real usage is the best source of eval probes. This seeds the probe set from
actual questions while keeping a human in the loop for labelling — and keeping
private data out via redaction.

How

seed_probes(conversations, min_messages=10) -> list[ProbeStub]; ProbeStub.expected_pack is always None (human-assigned).
Generic conversation shape (role + text, incl. content list/str); no vendor-specific parser.
redact() with a pluggable RedactionRule list; defaults are generic PII only (email / phone-like / URL) — no domain or organisation patterns.
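
A minimal sketch of the shape described above, not the code in sillok/eval/probe_seeder.py. Only seed_probes, min_messages, ProbeStub, expected_pack, redact and RedactionRule come from this description. Everything else is an assumption made for illustration: the field names on RedactionRule, the three regexes, the DEFAULT_RULES name, the max_chars truncation parameter, and the choice to model a conversation as a list of message dicts.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Any, Iterable, Optional, Sequence


@dataclass(frozen=True)
class RedactionRule:
    # Field names are hypothetical; the PR only says the rule list is pluggable.
    name: str
    pattern: re.Pattern[str]
    replacement: str


# Illustrative generic-PII defaults (email / URL / phone-like). These regexes
# are assumptions, not the patterns shipped in the PR.
DEFAULT_RULES: tuple[RedactionRule, ...] = (
    RedactionRule("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    RedactionRule("url", re.compile(r"https?://\S+"), "[URL]"),
    RedactionRule("phone", re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
)


def redact(text: str, rules: Sequence[RedactionRule] = DEFAULT_RULES) -> str:
    """Apply each rule in order, replacing every match with its placeholder."""
    for rule in rules:
        text = rule.pattern.sub(rule.replacement, text)
    return text


@dataclass
class ProbeStub:
    text: str
    # Always None at seeding time; a human assigns the label later.
    expected_pack: Optional[str] = None


def _message_text(message: dict[str, Any]) -> str:
    """Return message text whether the content is a plain str or a list of parts."""
    content = message.get("content", message.get("text", ""))
    if isinstance(content, str):
        return content
    parts: list[str] = []
    for part in content or []:
        if isinstance(part, str):
            parts.append(part)
        elif isinstance(part, dict):
            parts.append(str(part.get("text", "")))
    return " ".join(p for p in parts if p)


def seed_probes(
    conversations: Iterable[Sequence[dict[str, Any]]],
    min_messages: int = 10,
    rules: Sequence[RedactionRule] = DEFAULT_RULES,
    max_chars: int = 500,  # hypothetical truncation limit
) -> list[ProbeStub]:
    """Emit one unlabelled stub per deep conversation that has a user message."""
    stubs: list[ProbeStub] = []
    for messages in conversations:
        if len(messages) < min_messages:
            continue  # shallow conversation: skip
        first_user = next((m for m in messages if m.get("role") == "user"), None)
        if first_user is None:
            continue  # no user turn: skip
        # Redact before truncating so a cut never leaves half of a match behind.
        text = redact(_message_text(first_user).strip(), rules)[:max_chars]
        if text:
            stubs.append(ProbeStub(text=text))
    return stubs
```

In this sketch an operator who needs organisation-specific redaction would pass their own list as rules rather than editing the defaults, which keeps the shipped defaults generic.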

Validation

ruff / mypy sillok/eval/probe_seeder.py clean
pytest tests/unit/eval/test_probe_seeder.py → 8 passed (deep/shallow, no-user skip, redaction, content-list shape, truncation, and a guardrail asserting the defaults carry no clinical/domain patterns)
All test conversation data is synthetic.
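
The guardrail mentioned above could look roughly like the following. This is a self-contained sketch under assumptions, not the test in tests/unit/eval/test_probe_seeder.py: the DEFAULT_PATTERNS mapping, its regexes, and the sample strings are all hypothetical stand-ins, since the PR does not show how the module exposes its defaults.

```python
import re

# Hypothetical stand-in for the module's default rules (name -> regex).
DEFAULT_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "url": r"https?://\S+",
    "phone": r"\+?\d[\d\s().-]{7,}\d",
}

# Synthetic domain-flavoured strings that generic PII rules should leave alone.
DOMAIN_SAMPLES = [
    "HbA1c 7.2%",
    "systolic blood pressure",
    "Acme Corp ward B",
]


def defaults_touch(sample: str) -> bool:
    """True if any default pattern matches somewhere in the sample."""
    return any(re.search(p, sample) for p in DEFAULT_PATTERNS.values())


def test_defaults_are_generic_only() -> None:
    # The default set is exactly the generic PII categories, nothing more.
    assert set(DEFAULT_PATTERNS) == {"email", "url", "phone"}
    # None of the defaults reacts to clinical or organisation vocabulary.
    assert not any(defaults_touch(s) for s in DOMAIN_SAMPLES)
```

A check of this kind fails as soon as someone adds a domain-specific rule to the defaults, which is the regression the guardrail is meant to catch.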

Note for reviewer

Also adds an import block to sillok/eval/__init__.py. If #20 (calibration)
merges first, this needs a one-line rebase of that __init__ (both only add
adjacent import groups).

Provenance

Reduced from an internal history→golden tool whose domain/intent labels and
clinical-variable redactions were operator-specific. Those were dropped;
only the generic conversation→redacted-stub skeleton remains. PII gate: 0 hits.

Upstream: peterkimpmp/aipm#660 · Plan: peterkimpmp/aipm#647,#656 · Follows: #22

Turn a conversation export into redacted, unlabelled probe stubs (the redacted first user message of each sufficiently deep conversation) so a human can assign expected_pack.

Automatic domain/intent labelling is deliberately out of scope: those labels were one operator's personal vocabulary and do not generalise. Redaction is a pluggable rule list; defaults cover only generic PII (email/phone-like/URL). Privacy-first, provider-neutral; tests use synthetic conversations only.

Upstream: peterkimpmp/aipm#660
Signed-off-by: peterkimpmp <tykim21@gmail.com>

peterkimpmp force-pushed the feat/eval-probe-seeder branch from 12992c0 to 2980fe0 June 2, 2026 03:36

peterkimpmp merged commit 8b0a501 into main Jun 2, 2026
6 checks passed

peterkimpmp deleted the feat/eval-probe-seeder branch June 2, 2026 03:37

peterkimpmp mentioned this pull request Jun 2, 2026

release: 0.3.0a1 — Wave 2 modules + functional unified CLI #24

Merged

peterkimpmp merged 1 commit into main from feat/eval-probe-seeder