[MC-462-B] feat(compare): surface forms LingPy could not cluster #652
Merged
Follow-up to #651. Un-clusterable forms (e.g. a bare "?" placeholder, or mis-entered text in the form field) are skipped from cognate clustering so the job no longer crashes -- but they were dropped silently, hiding a real data-quality signal.

_compute_cognate_sets_with_lingpy now takes an optional skipped_out collector and records each skipped form as {concept_id, speaker, form}; it also logs a one-line stderr summary (visible in the job's per-job stderr). The compute route threads a collector through, emits a job progress message when anything was skipped, and returns skippedFormCount plus a capped skippedForms sample in the job result.

No change to persisted enrichments or displayed forms -- this only adds observability fields to the job result and a log line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
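For readers who want the collector pattern in isolation, here is a minimal sketch, not the code from this PR. The names filter_clusterable and looks_clusterable are hypothetical, the predicate is only a stand-in for LingPy's own verdict, and the stderr wording is an assumption; only the skipped_out parameter and the {concept_id, speaker, form} record shape come from the description above.

```python
import sys
from typing import List, Optional, Tuple

Row = Tuple[str, str, str]  # (concept_id, speaker, form)


def looks_clusterable(form: str) -> bool:
    # Hypothetical stand-in for LingPy's verdict: a form with at least
    # one letter is treated as clusterable.
    return any(ch.isalpha() for ch in form)


def filter_clusterable(rows: List[Row],
                       skipped_out: Optional[list] = None) -> List[Row]:
    """Return the rows that can be clustered.

    Skipped rows are appended to skipped_out when a collector is given;
    with the default None the caller sees the same return value as before.
    """
    kept = []
    skipped_count = 0
    for concept_id, speaker, form in rows:
        if looks_clusterable(form):
            kept.append((concept_id, speaker, form))
            continue
        skipped_count += 1
        if skipped_out is not None:
            skipped_out.append(
                {"concept_id": concept_id, "speaker": speaker, "form": form}
            )
    if skipped_count:
        # Assumed wording for the one-line stderr summary.
        print(f"skipped {skipped_count} un-clusterable form(s)", file=sys.stderr)
    return kept


rows = [("c1", "spk1", "qap"), ("c1", "spk2", "?"), ("c2", "spk1", "girtin")]
skipped: list = []
kept = filter_clusterable(rows, skipped_out=skipped)
```

Making the collector an optional out-parameter keeps existing call sites untouched: callers that do not pass skipped_out get the same return value, and the stderr line is the only new side effect.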
TrueNorth49 added a commit that referenced this pull request Jun 3, 2026
… forms (#653)

#651 hardened the cognate job by *predicting* which forms LingPy would reject (a sound-class pre-check). The prediction had false negatives: a "/" variant separator (e.g. "qap/gaza girtin") survives tokenising as an unknown segment that LexStat rejects, so the job still crashed with "Could not convert item ID: <row>" on larger speaker selections (observed at row 3235 with a 14-speaker run).

Replace the guess with two robust layers:

1. _lingpy_safe_form now also folds "/" into the "_" word boundary, so variant forms tokenise and still cluster (no longer dropped).
2. LexStat itself is the authority: build the wordlist, and if it raises "Could not convert item ID: <row>", drop that row, rebuild with fresh CONTIGUOUS ids, and retry. This cannot crash on any unforeseen character (deleting a key in place leaves a gap that corrupts LingPy's row numbering, hence the full rebuild).

The sound-class pre-check helper is removed. Skipped forms are still recorded/surfaced via the #652 collector. No change to persisted enrichments or displayed forms.

Verified against a full real workspace (14 speakers / 3464 forms) that previously crashed at row 3235: now completes, the "/"-variant form clusters, and only the lone "?" placeholder is skipped.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
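The two layers described in this commit message could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: fold_variant_separator, build_with_retry and fake_lexstat are hypothetical names, fake_lexstat stands in for the real LexStat constructor, and both the ValueError exception type and the 1-based row ids are assumptions. Only the error text "Could not convert item ID: <row>" and the rebuild-with-contiguous-ids idea come from the message above.

```python
import re

ITEM_ID_ERROR = re.compile(r"Could not convert item ID: (\d+)")


def fold_variant_separator(form: str) -> str:
    # Layer 1: fold the "/" variant separator into the "_" word boundary.
    return form.replace("/", "_")


def build_with_retry(rows, build, skipped_out=None):
    """Layer 2: let the builder decide, dropping one rejected row per retry.

    rows  -- list of row dicts
    build -- callable taking {row_id: row}; assumed to raise ValueError
             with "Could not convert item ID: <row>" for a bad row
    """
    remaining = list(rows)
    while True:
        # Rebuild from scratch so ids are always contiguous (1..n);
        # deleting a key in place would leave a gap in the numbering.
        wordlist = {i: row for i, row in enumerate(remaining, start=1)}
        try:
            return build(wordlist), remaining
        except ValueError as exc:
            match = ITEM_ID_ERROR.search(str(exc))
            if match is None or int(match.group(1)) not in wordlist:
                raise  # not the error we know how to recover from
            bad_row = remaining.pop(int(match.group(1)) - 1)
            if skipped_out is not None:
                skipped_out.append(bad_row)


def fake_lexstat(wordlist):
    # Hypothetical stand-in for LexStat: rejects any form containing "?".
    for row_id, row in wordlist.items():
        if "?" in row["form"]:
            raise ValueError(f"Could not convert item ID: {row_id}")
    return sorted(wordlist)


rows = [
    {"concept_id": "c1", "speaker": "s1",
     "form": fold_variant_separator("qap/gaza girtin")},
    {"concept_id": "c1", "speaker": "s2", "form": "?"},
    {"concept_id": "c2", "speaker": "s1", "form": "dest"},
]
skipped = []
ids, kept = build_with_retry(rows, fake_lexstat, skipped)
```

Each failed attempt removes exactly one row, so the loop performs at most one rebuild per rejected form and always terminates; any error that does not match the known message is re-raised rather than swallowed.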
MC Task: MC-462 — Phonetic-similarity job robustness / Lane MC-462-B
Why
Follow-up to #651. That PR stopped the cognate job crashing by skipping forms LingPy cannot tokenise into known sounds. But those forms were dropped silently — and an un-clusterable form is usually a real data-quality signal (a placeholder, or text accidentally entered in the form field), not noise. The reviewer flagged the silent drop as the one gap to close.
What
_compute_cognate_sets_with_lingpy now:
- takes an optional skipped_out collector and records each skipped form as {concept_id, speaker, form};
- logs a one-line stderr summary (visible in the job's per-job stderr).

The compute route threads a collector through and exposes the result on the job:
- emits a job progress message when anything was skipped (Skipped N form(s) LingPy could not cluster);
- returns skippedFormCount and a capped skippedForms sample in the job result.

Not changed
- No change to persisted enrichments or displayed forms; the parse-enrichments.json shape is untouched.
- skipped_out defaults to None; the existing CLI call site is unaffected.

Verification
- compare/test_cognate_lingpy_safe.py: 11 passed with real LingPy (2 new tests assert the collector records the skipped form and stays empty when all forms cluster).
- skippedFormCount=1, collector + stderr both report the single placeholder form (example only; the mechanism is dataset-agnostic).

🤖 Generated with Claude Code
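As an illustration of the job-result fields named above, the following sketch shows one way the route could turn the collector into skippedFormCount, a capped skippedForms sample, and a progress message. The helper names and the cap of 20 are assumptions made for this example; the PR text does not state the actual cap.

```python
from typing import Optional

SAMPLE_CAP = 20  # assumed value; the real cap is not given in this PR


def build_job_result(base_result: dict, skipped: list,
                     sample_cap: int = SAMPLE_CAP) -> dict:
    # Copy so the caller's result dict is not mutated.
    result = dict(base_result)
    result["skippedFormCount"] = len(skipped)         # full count
    result["skippedForms"] = list(skipped[:sample_cap])  # capped sample
    return result


def progress_message(skipped: list) -> Optional[str]:
    # Only emit a message when something was actually skipped.
    if not skipped:
        return None
    return f"Skipped {len(skipped)} form(s) LingPy could not cluster"


skipped_forms = [
    {"concept_id": f"c{i}", "speaker": "s1", "form": "?"} for i in range(25)
]
job_result = build_job_result({"status": "done"}, skipped_forms)
message = progress_message(skipped_forms)
```

Reporting the full count alongside a capped sample keeps the job payload small on messy datasets while still showing how many forms were affected.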