[MC-462-C] fix(compare): make LexStat authoritative on un-clusterable forms #653
Merged
… forms

#651 hardened the cognate job by *predicting* which forms LingPy would reject (a sound-class pre-check). The prediction had false negatives: a "/" variant separator (e.g. "qap/gaza girtin") survives tokenising as an unknown segment that LexStat rejects, so the job still crashed with "Could not convert item ID: <row>" on larger speaker selections (observed at row 3235 with a 14-speaker run).

Replace the guess with two robust layers:

1. _lingpy_safe_form now also folds "/" into the "_" word boundary, so variant forms tokenise and still cluster (no longer dropped).
2. LexStat itself is the authority: build the wordlist, and if it raises "Could not convert item ID: <row>", drop that row, rebuild with fresh CONTIGUOUS ids, and retry. This cannot crash on any unforeseen character (deleting a key in place leaves a gap that corrupts LingPy's row numbering, hence the full rebuild).

The sound-class pre-check helper is removed. Skipped forms are still recorded/surfaced via the #652 collector. No change to persisted enrichments or displayed forms.

Verified against a full real workspace (14 speakers / 3464 forms) that previously crashed at row 3235: now completes, the "/"-variant form clusters, and only the lone "?" placeholder is skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MC Task: MC-462 — Phonetic-similarity job robustness / Lane MC-462-C
Problem
#651 stopped the cognate job crashing by predicting which forms LingPy would reject (a sound-class pre-check). The prediction has false negatives: a `/` variant separator (e.g. `qap/gaza girtin`) survives tokenising as an unknown segment that LexStat then refuses to convert. So the job still aborted with `Could not convert item ID: <row>`, reproduced live at row 3235 on a 14-speaker selection.
Fix
Stop guessing; let LexStat be the authority, in two layers:
1. `_lingpy_safe_form` also folds `/` into `_` (the word-boundary symbol), alongside whitespace, so variant forms tokenise and still cluster rather than being dropped.
2. LexStat itself is the authority: build the wordlist, and if it raises `Could not convert item ID: <row>`, drop that row, rebuild with fresh contiguous ids, and retry. This cannot crash on any unforeseen character. (Deleting a key in place leaves a gap that corrupts LingPy's row numbering, hence the full rebuild.) The sound-class pre-check helper is removed.
Skipped forms are still recorded and surfaced via the collector merged in #652 (`skippedFormCount` / `skippedForms` / per-job stderr).
No user-facing data change
Sanitising and exclusion touch only the in-memory LexStat wordlist. Persisted `parse-enrichments.json` and displayed forms are unchanged.
Verification
- `compare/test_cognate_lingpy_safe.py`: 13 passed with real LingPy (adds slash-collapse + slash-clusters-not-skipped cases).
- Full real workspace (14 speakers / 3464 forms) that previously crashed at row 3235: now completes, the `/`-variant form clusters (group A), and only the lone `?` placeholder is skipped. Example only; the mechanism is dataset-agnostic.
🤖 Generated with Claude Code
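For reviewers who want the shape of the two layers described under Fix without opening the diff, here is a minimal sketch. It is not the code in this PR. `safe_form`, `to_wordlist`, `build_with_retry`, and `fake_build` are hypothetical names, the collapsing of repeated separators is an assumption, and the builder is injected so the example runs without LingPy. In real use `build` would presumably wrap the LexStat constructor. The exception type LingPy raises is not stated here, so the sketch matches on the message text only.

```python
import re

# Message quoted in this PR; the exception type is an assumption, so match on text.
_ITEM_ID_ERROR = re.compile(r"Could not convert item ID:\s*(\d+)")


def safe_form(form):
    """Hypothetical stand-in for _lingpy_safe_form: fold whitespace and '/' into '_'."""
    return re.sub(r"[\s/]+", "_", form.strip()).strip("_")


def to_wordlist(rows):
    """Build a wordlist dict: header at key 0, data rows numbered 1..n with no gaps."""
    data = {0: ["doculect", "concept", "ipa"]}
    for idx, (doculect, concept, form) in enumerate(rows, start=1):
        data[idx] = [doculect, concept, safe_form(form)]
    return data


def build_with_retry(rows, build):
    """Call build(wordlist); on an item-ID error drop that row, renumber, and retry."""
    rows = list(rows)
    skipped = []
    while rows:
        try:
            # to_wordlist renumbers from scratch on every attempt, so ids stay contiguous.
            return build(to_wordlist(rows)), skipped
        except Exception as exc:
            match = _ITEM_ID_ERROR.search(str(exc))
            if match is None:
                raise  # unrelated failure: do not swallow it
            bad = int(match.group(1))
            if not 1 <= bad <= len(rows):
                raise
            skipped.append(rows.pop(bad - 1))
    return None, skipped


def fake_build(data):
    """Test double that rejects a lone '?' the way the PR says LexStat does."""
    for key, row in data.items():
        if key != 0 and row[2] == "?":
            raise ValueError(f"Could not convert item ID: {key}")
    return data


rows = [
    ("S1", "hand", "qap/gaza girtin"),
    ("S2", "hand", "?"),
    ("S3", "hand", "dest"),
]
result, skipped = build_with_retry(rows, fake_build)
print(result[1][2])  # qap_gaza_girtin
print(skipped)       # [('S2', 'hand', '?')]
```

Each failure removes exactly one row, so the loop ends after at most `len(rows)` attempts, and every attempt rebuilds the dict instead of deleting a key in place.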