
[MC-462-C] fix(compare): make LexStat authoritative on un-clusterable forms #653

Merged
TrueNorth49 merged 1 commit into main from feat/mc-462-c-lexstat-authoritative on Jun 3, 2026

@TrueNorth49 (Collaborator)

MC Task: MC-462 — Phonetic-similarity job robustness / Lane MC-462-C

Problem

#651 stopped the cognate job from crashing by predicting which forms LingPy would reject (a sound-class pre-check). That prediction has false negatives: a "/" variant separator (e.g. "qap/gaza girtin") survives tokenising as an unknown segment, which LexStat then refuses to convert. The job therefore still aborted with Could not convert item ID: <row>; this was reproduced live at row 3235 on a 14-speaker selection.

Fix

Stop guessing; let LexStat be the authority, in two layers:

  1. _lingpy_safe_form now folds "/" into "_" (the word-boundary symbol), as it already does for whitespace, so variant forms tokenise and still cluster rather than being dropped.
  2. Authoritative retry: build the wordlist; if LexStat raises Could not convert item ID: <row>, drop that row, rebuild with fresh contiguous ids, and retry. The rebuild must be a full one, because deleting a key in place leaves a gap that corrupts LingPy's row numbering. An unforeseen character can therefore no longer crash the job. The sound-class pre-check helper is removed.
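A minimal sketch of the retry layer described in item 2, assuming the rejected row id can be parsed from the error text quoted above. The names build_with_retry, Row and the injected build callable are hypothetical stand-ins for illustration (build represents whatever constructs the LexStat wordlist); the exception type LingPy raises is not stated here, so the sketch catches Exception and re-raises anything that does not match.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed shape of the error text, taken from the message quoted in this PR.
_BAD_ROW = re.compile(r"Could not convert item ID: (\d+)")

# Hypothetical row layout: (doculect, concept, form).
Row = Tuple[str, str, str]


def build_with_retry(
    rows: List[Row],
    build: Callable[[Dict[int, Row]], object],
) -> Tuple[object, List[Row]]:
    """Call build() until it succeeds, dropping one rejected row per attempt."""
    remaining = list(rows)
    skipped: List[Row] = []
    while remaining:
        # Renumber from 1 on every attempt so the ids stay contiguous.
        numbered = {i: row for i, row in enumerate(remaining, start=1)}
        try:
            return build(numbered), skipped
        except Exception as exc:
            match = _BAD_ROW.search(str(exc))
            if match is None:
                raise  # not the conversion error; do not swallow it
            bad_id = int(match.group(1))
            if bad_id not in numbered:
                raise  # reported id is unknown; give up rather than loop
            skipped.append(numbered[bad_id])
            # Full rebuild without the rejected row (no in-place delete).
            remaining = [row for i, row in numbered.items() if i != bad_id]
    raise ValueError("every row was rejected")
```

Each failed attempt removes exactly one row, so the loop ends after at most len(rows) attempts. The skipped list is returned alongside the result so the caller can pass it to whatever reporting it uses.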

Skipped forms are still recorded and surfaced via the collector merged in #652 (skippedFormCount / skippedForms / per-job stderr).

No user-facing data change

Sanitising and exclusion touch only the in-memory LexStat wordlist. Persisted parse-enrichments.json and displayed forms are unchanged.

Verification

  • compare/test_cognate_lingpy_safe.py: 13 passed with real LingPy (adds slash-collapse + slash-clusters-not-skipped cases).
  • Full real workspace, the exact 14-speaker selection that crashed at row 3235: now completes with 716 concepts, the "/"-variant form clusters (group A), and only the lone "?" placeholder is skipped. Example only; the mechanism is dataset-agnostic.
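The slash-collapse case listed above might be checked along these lines. This is only a sketch: safe_form is a hypothetical stand-in for _lingpy_safe_form, whose real implementation is not shown in this PR, and it assumes that a run of whitespace and "/" characters collapses to a single "_".

```python
import re

# Assumption: whitespace and "/" are the only separators folded, and a run of
# them collapses to one "_" (the word-boundary symbol named in this PR).
_SEPARATORS = re.compile(r"[\s/]+")


def safe_form(form: str) -> str:
    """Return a copy of form with each separator run folded into "_"."""
    return _SEPARATORS.sub("_", form.strip())


def test_slash_folds_to_word_boundary():
    # The example form quoted in the problem description.
    assert safe_form("qap/gaza girtin") == "qap_gaza_girtin"


def test_spaced_slash_yields_single_boundary():
    # " / " is one separator run, so it becomes one "_" rather than three.
    assert safe_form("qap / gaza") == "qap_gaza"
```

Because the function returns a new string, the stored form it was given is left untouched, which matches the statement that sanitising affects only the in-memory wordlist.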

🤖 Generated with Claude Code

… forms

#651 hardened the cognate job by *predicting* which forms LingPy would
reject (a sound-class pre-check). The prediction had false negatives: a
"/" variant separator (e.g. "qap/gaza girtin") survives tokenising as an
unknown segment that LexStat rejects, so the job still crashed with
"Could not convert item ID: <row>" on larger speaker selections (observed
at row 3235 with a 14-speaker run).

Replace the guess with two robust layers:

1. _lingpy_safe_form now also folds "/" into the "_" word boundary, so
   variant forms tokenise and still cluster (no longer dropped).
2. LexStat itself is the authority: build the wordlist, and if it raises
   "Could not convert item ID: <row>", drop that row, rebuild with fresh
   CONTIGUOUS ids, and retry. This cannot crash on any unforeseen
   character (deleting a key in place leaves a gap that corrupts LingPy's
   row numbering, hence the full rebuild). The sound-class pre-check
   helper is removed.

Skipped forms are still recorded/surfaced via the #652 collector. No
change to persisted enrichments or displayed forms. Verified against a
full real workspace (14 speakers / 3464 forms) that previously crashed at
row 3235: now completes, the "/"-variant form clusters, and only the lone
"?" placeholder is skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@TrueNorth49 added the bugfix (Bug fix) and MC-462 (Mission Control MC-462) labels on Jun 3, 2026
@TrueNorth49 merged commit 97a84fc into main on Jun 3, 2026
4 checks passed
@TrueNorth49 deleted the feat/mc-462-c-lexstat-authoritative branch on June 3, 2026 22:03