[MC-462-B] feat(compare): surface forms LingPy could not cluster #652

Merged
TrueNorth49 merged 1 commit into main from feat/mc-462-b-skipped-form-observability
Jun 3, 2026
@TrueNorth49

Collaborator

MC Task: MC-462 — Phonetic-similarity job robustness / Lane MC-462-B

Why

Follow-up to #651. That PR stopped the cognate job from crashing by skipping forms LingPy cannot tokenise into known sounds. But those forms were dropped silently, and an un-clusterable form is usually a real data-quality signal (a placeholder, or text accidentally entered in the form field), not noise. The reviewer flagged the silent drop as the one gap to close.

What

_compute_cognate_sets_with_lingpy now:

  • accepts an optional skipped_out collector and records each skipped form as {concept_id, speaker, form};
  • logs a one-line summary to stderr, which surfaces in the per-job stderr section of the crash-log modal.
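
For readers unfamiliar with the optional-collector pattern, here is a minimal sketch of the idea. It is not the code from this PR: `cluster_forms`, `can_cluster`, and the log prefix are hypothetical stand-ins, and only the `skipped_out` parameter and the `{concept_id, speaker, form}` record shape come from the description above.

```python
import sys
from typing import Callable, Optional


def cluster_forms(
    rows: list[dict],
    can_cluster: Callable[[str], bool],          # hypothetical stand-in for the LingPy check
    skipped_out: Optional[list[dict]] = None,    # optional collector, defaults to None
) -> list[dict]:
    """Return the rows that can be clustered and report the rest."""
    kept: list[dict] = []
    skipped: list[dict] = []
    for row in rows:
        if can_cluster(row["form"]):
            kept.append(row)
        else:
            # Record shape taken from the PR description.
            skipped.append(
                {
                    "concept_id": row["concept_id"],
                    "speaker": row["speaker"],
                    "form": row["form"],
                }
            )
    if skipped_out is not None:
        # Callers that pass a list get the details; others are unaffected.
        skipped_out.extend(skipped)
    if skipped:
        # One summary line on stderr (the exact wording here is an assumption).
        print(
            f"[cognates] skipped {len(skipped)} form(s) LingPy could not cluster",
            file=sys.stderr,
        )
    return kept
```

Because the collector defaults to `None`, a caller that never passes it behaves exactly as before, which is what keeps a signature change of this kind backward compatible.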

The compute route threads a collector through and exposes the result on the job:

  • a progress message when anything was skipped (Skipped N form(s) LingPy could not cluster);
  • skippedFormCount and a capped skippedForms sample in the job result.
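
A rough sketch of how a route might turn the collector into these job fields follows. The helper name `summarize_skipped` and the cap value are assumptions (the PR does not state the real cap); the field names and the message text are taken from the bullets above.

```python
from typing import Optional

SAMPLE_CAP = 50  # assumed value; the actual cap is not given in this PR


def summarize_skipped(
    skipped: list[dict], cap: int = SAMPLE_CAP
) -> tuple[dict, Optional[str]]:
    """Build the observability fields and an optional progress message."""
    result_fields = {
        "skippedFormCount": len(skipped),   # always the full count
        "skippedForms": skipped[:cap],      # capped sample only
    }
    message = None
    if skipped:
        message = f"Skipped {len(skipped)} form(s) LingPy could not cluster"
    return result_fields, message
```

Keeping the count uncapped while capping the sample means a large batch of bad forms is still visible in the result without bloating the job payload.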

Not changed

  • Persisted parse-enrichments.json shape is untouched.
  • Displayed/stored forms are untouched.
  • Clustering behaviour is identical — this only adds observability fields and a log line. The function signature change is backward compatible (skipped_out defaults to None; the existing CLI call site is unaffected).

Verification

  • compare/test_cognate_lingpy_safe.py: 11 passed with real LingPy (2 new tests assert the collector records the skipped form and stays empty when all forms cluster).
  • Against a full real workspace (585 concepts): the job completes with skippedFormCount=1, and the collector and stderr both report the single placeholder form. This is one example only; the mechanism is dataset-agnostic.

🤖 Generated with Claude Code

Follow-up to #651. Un-clusterable forms (e.g. a bare "?" placeholder, or
mis-entered text in the form field) are skipped from cognate clustering
so the job no longer crashes -- but they were dropped silently, hiding a
real data-quality signal.

_compute_cognate_sets_with_lingpy now takes an optional skipped_out
collector and records each skipped form as {concept_id, speaker, form};
it also logs a one-line stderr summary (visible in the job's per-job
stderr). The compute route threads a collector through, emits a job
progress message when anything was skipped, and returns skippedFormCount
plus a capped skippedForms sample in the job result.

No change to persisted enrichments or displayed forms -- this only adds
observability fields to the job result and a log line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TrueNorth49 added the feat (Feature work) and MC-462 (Mission Control MC-462) labels Jun 3, 2026
TrueNorth49 merged commit 702c500 into main Jun 3, 2026
4 checks passed
TrueNorth49 deleted the feat/mc-462-b-skipped-form-observability branch June 3, 2026 21:39
TrueNorth49 added a commit that referenced this pull request Jun 3, 2026
… forms (#653)

#651 hardened the cognate job by *predicting* which forms LingPy would
reject (a sound-class pre-check). The prediction had false negatives: a
"/" variant separator (e.g. "qap/gaza girtin") survives tokenising as an
unknown segment that LexStat rejects, so the job still crashed with
"Could not convert item ID: <row>" on larger speaker selections (observed
at row 3235 with a 14-speaker run).

Replace the guess with two robust layers:

1. _lingpy_safe_form now also folds "/" into the "_" word boundary, so
   variant forms tokenise and still cluster (no longer dropped).
2. LexStat itself is the authority: build the wordlist, and if it raises
   "Could not convert item ID: <row>", drop that row, rebuild with fresh
   CONTIGUOUS ids, and retry. The full rebuild is needed because deleting
   a key in place leaves a gap that corrupts LingPy's row numbering. This
   cannot crash on any unforeseen character. The sound-class pre-check
   helper is removed.
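
The two layers above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from #653: `fold_variant_separator` and `build_with_retry` are hypothetical names, `build` is an injected stand-in for constructing LexStat, and catching `ValueError` is an assumption about how the "Could not convert item ID" error is raised.

```python
import re
from typing import Any, Callable

# Error text quoted in the commit message; <row> is parsed as an integer id.
_ITEM_ID_ERROR = re.compile(r"Could not convert item ID: (\d+)")


def fold_variant_separator(form: str) -> str:
    """Layer 1 (simplified): treat "/" as the "_" word boundary."""
    return form.replace("/", "_")


def build_with_retry(
    rows: list[dict],
    build: Callable[[dict[int, dict]], Any],  # stand-in for the LexStat constructor
) -> tuple[Any, list[dict]]:
    """Layer 2: let the builder decide which rows are bad, then retry."""
    remaining = list(rows)
    skipped: list[dict] = []
    while remaining:
        # Rebuild from scratch so ids are contiguous (1..n) on every attempt.
        wordlist = {i: row for i, row in enumerate(remaining, start=1)}
        try:
            return build(wordlist), skipped
        except ValueError as exc:  # assumed exception type
            match = _ITEM_ID_ERROR.search(str(exc))
            if match is None:
                raise  # unrelated failure: do not swallow it
            bad_id = int(match.group(1))
            if bad_id not in wordlist:
                raise
            skipped.append(wordlist[bad_id])
            # Drop the rejected row; the next loop pass renumbers the rest.
            remaining = [row for i, row in wordlist.items() if i != bad_id]
    return None, skipped  # every row was rejected
```

Each failed attempt removes exactly one row, so the loop terminates after at most as many attempts as there are rows.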

Skipped forms are still recorded/surfaced via the #652 collector. No
change to persisted enrichments or displayed forms. Verified against a
full real workspace (14 speakers / 3464 forms) that previously crashed at
row 3235: now completes, the "/"-variant form clusters, and only the lone
"?" placeholder is skipped.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>