Joint Tagger to support probability-aware POS tagging#3717
Open
marcbachmann wants to merge 27 commits into
Open
Joint Tagger to support probability-aware POS tagging#3717marcbachmann wants to merge 27 commits into
marcbachmann wants to merge 27 commits into
Conversation
…_utils
Five training-data extraction sites — the Brill tagger (`epoch`,
`FreqDictBuilder`) and the Brill / Burn / UPOS-frequency chunkers — each
hand-rolled their own CoNLL-U sentence walk: merging contraction suffixes with
a copy-pasted list, iterating raw token rows, and (for the chunkers) locating
noun phrases over those raw rows. That last part shifts every noun-phrase span
in a sentence containing a multiword-token range (e.g. English "cannot",
"don't") or an ellipsis node, so the extracted tokens, tags and NP labels are
wrong for such sentences.
Move the extraction into one place — `conllu_utils` — built on a single
multiword-token/contraction-aware sentence walk:
- `syntactic_words` / `walk_sentence` merge UD component rows back into the
surface tokens Harper actually tokenizes at runtime ("don't" is one token),
keeping a span map so noun-phrase flags (computed on the syntactic words)
fold back onto the merged tokens without shifting.
- `sentence_to_training_pair` / `sentence_to_training_record` and the
`extract_pairs_from_files` / `extract_records_from_files` batch helpers
return `(tokens, tags[, NP mask])`.
- `locate_noun_phrases_in_words` (moved here from `chunker::np_extraction`,
now operating on syntactic words) is the shared NP locator.
All five sites call these helpers; the duplicated loops are gone and the
multiword-token / contraction span-shift is fixed.
Inference and the shipped models are unaffected — these are training-data
paths only. Regenerating the Brill tagger, Burn chunker or frequency
dictionaries will produce corrected data on sentences with multiword tokens
or contractions.
Introduce `harper_pos_utils::joint`: a single neural model that produces both per-token UPOS tags and noun-phrase membership from one shared encoder, replacing the separate Brill tagger + Burn NP chunker for English. Architecture: a character encoder (parallel convolutions over char embeddings + a word-final suffix embedding) feeds a BiLSTM, which fans out to two heads — a UPOS classification head and an NP-membership head. Inference runs on burn's NdArray (CPU) backend; the training path (`train`, `inject`, `eval`, `conllu_utils`) is gated behind the existing `training` feature so the default build stays GPU-free. Tag classes are 0-based and positional: class `i` is the `i`-th `UPOS` variant in declaration order (ADJ..VERB), with class 16 reserved for padding. Also add `Tagger::tag_sentence_topk`, a default-provided method returning the set of *plausible* tags per token (argmax plus close runner-ups). The joint runtime overrides it with the model's softmax distribution; this is what later lets linters resolve genuine homographs the argmax alone can't.
Replace the separately-loaded Brill tagger and Burn NP chunker with the single embedded joint model. `brill_tagger()` and `burn_chunker()` now both return the one memoized, per-thread `JointRuntime` (an `Rc<dyn …>`, since burn's NdArray model is `!Sync`). The trained artifacts (`finished_joint/model.mpk` + the char/suffix vocabularies) ship embedded; inference is CPU-only and needs no extra dependencies. Reproducibility: the `regenerate_english_joint` example (re)trains the model end-to-end and is gated behind a new `training` feature that pulls burn's autodiff + GPU backend. It auto-fetches the UD treebanks, trains deterministically (seeded), and writes the embedded artifacts; `joint_determinism_check` verifies the training is reproducible. The `diagnose_lint_words` / `eval_dev` / `probe_pos_probs` examples are inference-only diagnostics.
The joint model assigns different (and overall more accurate) UPOS tags than the Brill tagger, so the `pos_tags` snapshots are regenerated to match. Pure tagging output; no linting behavior is involved here. This is the boundary of the model layer: English tagging + chunking now run on the joint model and the `pos_tags` suite is green. The lint suite is not yet green at this commit — swapping the tagger shifts a handful of genuine homographs (NOUN/VERB, ADP/SCONJ, ADV/ADJ) onto the model's rank-2 reading, which the strict-argmax linters discard. The following "linting layer" commits expose the model's distribution to those rules and close the gap.
A neural tagger resolving a genuine homograph ("books" = NOUN or VERB,
"for" = ADP or SCONJ) often ranks the contextually-correct tag a hair below
the argmax. Strict-argmax linters discard that reading and miss the error.
Instead of perturbing the model, expose its distribution to the linters:
- `DictWordMetadata.pos_tag_topk` carries the plausible-tag set (argmax plus
close runner-ups) alongside `pos_tag`. It is runtime-only (`#[serde(skip)]`)
and stores a tag *set*, so the type keeps its `Eq`/`Hash`/`Ord` derives.
- `DictWordMetadata::could_be_pos` / `TokenKind::could_be_upos`: "is this a
plausible reading?" — the probability-aware counterpart to `is_upos`.
- `UPOSSet::new_loose`: a `UPOSSet` that matches on `could_be_pos` rather
than the strict argmax.
- `Document::parse` populates `pos_tag_topk` from the tagger's top-k.
This is strictly opt-in: every existing call site keeps strict-argmax
matching until it deliberately switches. Loosening globally over-fires the
precision guardrails that depend on the argmax, so it is applied per rule.
Expose probability-aware POS matching in the weir rule DSL. Writing `~NOUN` (the existing tilde punctuation followed by a UPOS tag) compiles to a `UPOSSet::new_loose`, so a weir rule slot matches a plausible reading of the token rather than only the strict argmax — the DSL equivalent of `could_be_upos`. Like the Rust API, it is opt-in per slot.
…omograph heads
The possessed head noun is frequently a noun/verb homograph ("song",
"books", "shop") that the joint tagger ranks VERB#1, NOUN#2 in this
out-of-distribution possessive-mistake frame. Switch the head slot from
`UPOSSet::new` to `new_loose` so the plausible NOUN reading still matches.
The rule's trigger is rare and specific, so loosening here introduces no
prose false positives.
…raph modifiers
In "deep breathe" / "sound affect", the modifier ("deep", "sound") is a
homograph the joint tagger ranks ADJ#2 / NOUN#2 behind a verb reading.
- verb_instead_of_noun: loosen the leading ADJ/ADP slot to `new_loose`; the
existing AUX guard and curated verb list keep it from over-firing.
- affect_to_effect: treat a VERB-tagged token that could also be a NOUN as
valid preceding context (it modifies the affect/effect head rather than
governing it).
… clausal "for" In "its common for users to …", the "for" introducing the infinitival clause is ranked ADP#1, SCONJ#2 by the joint tagger. Match the plausible SCONJ/PART reading (`new_loose` in the pattern, `could_be_upos` in the follow-up check) so the "its → it's" contraction error is still caught.
… clauses
Add an `imperative_were_clause` arm: a sentence-initial imperative verb
("Check were the error occurred" → "where") is mis-tagged NOUN/PROPN with
VERB only a low-but-plausible reading, so loose-match the leading VERB — but
anchored to sentence start. The anchor is what makes this precise: it
excludes existential "There were …" (no plausible VERB reading on "There")
and mid-sentence "…people there were…" (the start anchor never matches),
the two false positives a blanket loose match produced.
Also loosen the "you where <verb>" slot so adverb/adjective homographs
("right", "sure") match.
…ctive "too stiff/slippery/polite" is correct excess-degree usage, but the joint tagger sometimes argmaxes the adjective as VERB. Add a probability-aware suppress guard: don't rewrite "too → to" when the following token could plausibly be an ADJ (`could_be_upos`). Probability tolerance in reverse — the rule fires only when the adjective reading is implausible.
…djectives
Only fire when the predicate adjective is one that idiomatically takes a
prepositional complement (interested IN, famous FOR, afraid OF) — a curated
`PP_TAKING_ADJECTIVES` allowlist. A general descriptive adjective merely
modifies the following noun attributively ("live hedgehogs") or heads a
clause ("sure he'd start"); both are structurally identical to the real
error, so the adjective itself is the only signal that separates them.
Deliberate trade-off: precision over recall. Also clears pre-existing
baseline false positives (high time / close friends / careless people).
Add the tagger-resistant possessed nouns (ladder, answer, server, understanding) to the rule's existing literal escape-hatch array (alongside mittens/pawn/…). Precise by construction — it only ever matches "there <these exact nouns>", so it cannot fire on locative "there".
…ith `~TAG` These rules trigger on rare, specific constructions, so probability tolerance is safe. Switch their adjective/adverb slots to the loose `~ADJ` / `~ADV` operator so the plausible reading matches when the joint tagger ranks the adjective/adverb a hair below the argmax.
…fident noun
The permissive pattern matches on dictionary verb-lemma *capability*, so a
controller word can pair with any word that is also a verb in the dictionary
("expected feature" — "feature" is a verb lemma). Add a probability-aware
guard: if the tagger is confident the following word is a non-verb noun (no
verb reading even among its plausible top-k tags), there is no infinitive to
complete, so don't fire — "a standard and expected feature" is a nominal
phrase, not "expected *to* feature". A genuine homograph ("site selling …",
where "selling" keeps its VERB reading) is unaffected and still fires.
…d, 0 added) Regenerate the prose lint snapshots for the joint model + probability-aware linting. The net change is purely the removal of four pre-existing false positives on the prose corpus — "high time" / "close friends" / "careless people" (missing_preposition) and "that there crystal glass" (there_to_their). No new lints are introduced: the diff is removals only.
Fold tagging, probability-aware top-k, and NP chunking of a sentence into a single forward pass instead of two, and expose it through one generic helper: - `JointRuntime::infer` derives the argmax tags, the plausible-tag (top-k) sets, and the NP flags from one model run, cached per sentence. Previously `tag_sentence` and `tag_topk` each ran their own forward pass; only the chunk call was a cache hit. The top-k set now lists the argmax first, so the chosen tag and the set share one source of truth. - The cache lookup borrows the `&[String]` slice (`Vec<String>: Borrow<[String]>`, identical hash), so a hit costs only the hash — no key clone; the owned key is allocated only on a miss, alongside the forward pass that dwarfs it. - `Document::parse` calls one generic `annotate(tagger, chunker, sentence)` returning `(plausible-tag sets, NP flags)`. It works for any Tagger/Chunker: the joint model collapses the two trait calls to the single cached forward; a separate tagger + chunker run independently. `pos_tag` is `tags[i].first()`. Behavior-preserving: the full suite stays green and the POS/lint snapshots do not change.
…ached pass Introduce `Annotator: Tagger + Chunker` so a sentence's plausible-tag sets and NP-membership flags come from a single `annotate()` call backed by one cached forward pass, replacing the separate tag_sentence_topk + chunk_sentence path. Per-token tag sets are now `TagSet = SmallVec<[UPOS; 4]>` (argmax first), which flows straight into harper-core's `pos_tag_topk` without re-collecting and keeps the common 1-3 tag set inline (no heap alloc). Drops the duplicate argmax `Vec` and removes the now-redundant `tag_topk` / inherent `annotate` methods. ~7% faster parse_essay (warm) from the dropped allocations.
… cold inference) Gate burn-ndarray's `simd` (and the `std` it transitively needs) to non-wasm32 targets via a target-specific dependency table. SIMD nearly halves cold per-sentence inference (parse_essay cache-off ~1.26s -> ~0.68s on aarch64); the gemm was already NEON-vectorized through matrixmultiply, so the win is in the elementwise / LSTM-gate ops the `simd` feature accelerates. The feature is not no_std-clean in burn-ndarray 0.19.1 (its ops/simd modules use Vec/vec! without `use alloc::vec::Vec`), so it also requires `std`; the wasm32 build keeps the no_std scalar path. Full lint suite stays green with SIMD active (no FLOOR-sensitive top-k membership flips), and the snapshots — originally generated on the scalar path — still match, so native-SIMD and wasm-scalar agree across the test corpus.
Two changes to JointRuntime::infer's per-token decode: - index_to_upos is now a direct `match` (jump table) instead of scanning `UPOS::iter().find(...)`, dropping the `strum::IntoEnumIterator` import. The arms mirror the enum's declaration order; the test is reworked into a drift guard that checks every arm against the real discriminant, so a reorder or new variant fails loudly instead of silently mis-tagging. - The chosen tag is now derived from the logits in-loop via `argmax_first` rather than a separate `argmax(2)` tensor op: this reads `tag_logits` once (no clone, no second into_data, no class buffer) and folds the argmax and the softmax max-shift into one per-row scan. `argmax_first` is a strictly-greater, first-max scan that mirrors burn-ndarray's argmax tie-break exactly (pinned by a unit test), so the chosen tag — and every lint — stays bit-identical. Verified zero lint drift: full harper-core suite (5349 lib + integration) green, clippy clean. Cold parse_essay is unchanged-to-marginally-faster (~1%, within the run-to-run noise floor; the decode is a negligible fraction of the forward).
Two allocation cuts on the cache-miss inference path: - Split JointBatch::build into build_inference (char-id + suffix-id buffers only — the inputs forward() reads) and build (which calls build_inference then fills the gold tag/np/mask label buffers used solely by the training loss). JointRuntime::infer now calls build_inference, dropping the two dummy tag/np vectors plus the three dead label-buffer allocations and their fill per sentence. build keeps its signature, so the training path is unchanged; the label accessors gain a debug_assert that fires if they are called on an inference batch (empty label buffers). - The per-token softmax writes into a fixed `[f32; TAG_CLASSES]` stack buffer instead of allocating a `Vec<f32>` per token. Output bit-identical (full harper-core suite green, zero lint drift; clippy clean; training compiles), confirmed by an adversarial review of the diff. Cold-path only and below the parse_essay noise floor, but removes real dead work.
507d373 to
03b8968
Compare
…ence
Upgrade burn 0.19.1 -> 0.21.0 and switch the joint tagger's runtime
inference backend from burn-ndarray to burn-flex, a fast pure-Rust CPU
backend (gemm + SIMD). flex is now the only inference backend, on every
target including wasm32 (no rayon + critical-section atomics shim there).
- cold inference ~5.6x faster (0.53ms vs 2.97ms/sentence), byte-identical
output; real-prose parse_essay 5.9x faster, lint_essay_uncached 3.4x.
- retrain the embedded model on 0.21 (UPOS 0.9463 / NP F1 0.8872); the
0.19 model.mpk can't load on 0.21 (version guard + new LSTM
gate_activation field).
- burn 0.21 moved Backend::Device to a BackendTypes supertrait; name the
concrete device type in runtime.rs instead.
- regenerate_english_joint: AutoGraphicsApi instead of hardcoded Metal so
training runs on any wgpu GPU (Metal/Vulkan/DX12).
- save_then_load_preserves_forward: rewrite as an fp16->fp16 round-trip
equality check (the old fp32-vs-fp16 threshold failed on both backends
under 0.21's RNG, not a flex issue).
- char/suffix vocab JSON now pretty-printed with ordered keys for
diff-friendly artifacts; key order is cosmetic (ids are the values).
burn-ndarray remains only for the legacy BurnChunker and tests.
03b8968 to
0f0a336
Compare
Collaborator
This is also from The Great Gatsby, chapter IV, not from Alice in Wonderland:
|
…n tooling - train_joint gains a periodic-eval hook: every N epochs (and the last) it hands the validated inner model to a callback. The regenerate example scores it on the production fp16 + flex path (eval_via_flex), so the logged UPOS/NP-F1 curve reflects what ships, not the fp32 training graph. - JOINT_OUT / JOINT_EPOCHS env overrides so sweeps train into a scratch dir without clobbering the committed model. - Default epochs 100 -> 30, chosen from the dev plateau: dev UPOS peaks at epoch ~25-30 and declines past it (mild overfit). A 30-epoch cosine schedule reaches UPOS 0.9479 / NP F1 0.8972 vs the old 100-epoch 0.9463 / 0.8872, in a third of the wall time. Keep AutoGraphicsApi (GPU-agnostic) with a comment.
…ty-aware
Regenerate the embedded joint model at 30 epochs (UPOS 0.9479 / NP F1 0.8972).
The model's argmax POS tags shift on a number of borderline tokens, so make the
lints that keyed off the strict argmax read the tagger's top-k instead, and add
a dictionary escape where the right tag fell below the top-k floor entirely:
- missing_to: gate the want/need noun-suppression on could_be_verb (top-k).
- theses_these: match the head noun with UPOSSet::new_loose.
- its_possessive: exclude adjective-only words (dict adjective, not dict noun)
from the possessed-noun slot ('it's insincere' stays a contraction).
- to_adverb: add 'maybe' to the by-word adverb escape (tagged VERB here).
- weir HaveNegNoAny / RainbowColoredHyphen: loosen the final NOUN slot to ~NOUN.
- weir MissingDeterminer: guard the ~ADJ modifier with <~ADJ, !DET> so 'another'
(argmax DET, inflated ADJ runner-up) isn't read as a bare noun phrase.
- weir ThereToTheir: add (there <VERB, ~NOUN> [VERB, AUX]) for 'There <noun>
<verb>' errors where the subject noun tags VERB, without firing on existential
'there' (there must be / here and there / that there crystal).
Also: run_tests_for_weir_rules now aggregates every rule's failures instead of
panicking on the first (which had masked later regressions), and the POS-tagger
snapshots are regenerated for the new model. Lint snapshots are unchanged.
Full suite: 5351 passed, 0 failed; test_most_lints + test_pos_tagger green.
41602cb to
3865f1e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issues
Fixes #3680 —
MissingPrepositionfalse positive on predicate nouns/adjectives ("We were best friends").Fixes #3204 —
MissingPrepositionfalse positive ("It's been many years since I've…").Both are verified before/after against
master(see How Has This Been Tested?). This PR also removesfour real-world
MissingPreposition/ThereToTheirfalse positives from the Great Gatsby / Alice inWonderland test corpus, and hardens a class of homograph-driven false positives/negatives that the same
linter family is repeatedly reported for (e.g. #1585, #3089, #3087, #2856, #3700, #1947, #3203, #2301).
I've intentionally not written
Fixesfor those, because against currentmasterthey either alreadypass or need a separate look — see the table in How Has This Been Tested? for exactly which ones I could
prove and which I couldn't (e.g. #3214 still false-positives and is out of scope here).
Description
Replaces Harper's two separate English models — the JSON Brill POS tagger and the Burn NP chunker — with a single small neural joint model that produces both per-token UPOS tags and noun-phrase membership from one shared encoder, and uses the model's probability distribution to fix a whole class of homograph linter bugs the old argmax-only tagger couldn't.
Sorry, this is a lot to take in at once. 🙈
But I'm opening this PR for now to receive input for a first step and to figure out which steps this project is willing to go. The motivation for me is to introduce other language support like in #3402 while keeping the binaries small for those.
What Changed, in Iterations
1. The joint model (
harper-pos-utils::joint)trainingfeature so the default build stays lean.harper-brill::brill_tagger()andburn_chunker()are now both served by the one embedded, memoized per-threadJointRuntime. Newannotator()returns tags + plausible-tag sets + NP flags from a single cached forward pass. Some cleanup should probably still happen before this lands completely (If you think nobody needs the old setup, I can remove those completely. The code was built for compatibility while removing the old models already out of the build. The assets are still there).2. Probability-aware POS matching (the reason this is worth doing)
pos_tag_topk(runtime-only,#[serde(skip)]) carries the plausible-tag set;could_be_pos/could_be_uposandUPOSSet::new_looselet a linter opt into matching a plausible reading.~TAGoperator (e.g.~NOUN).3. Per-rule linter fixes built on that infrastructure (each commit explains the exact homograph frame):
possessive_noun,noun_verb_confusion,its_contraction,missing_preposition(now restricted to a curatedPP_TAKING_ADJECTIVESallowlist),to_two_too,were_where,there_to_their,all_ready,missing_determiner,missing_to.4. Performance (cold inference is the bottleneck for one-shot CLI/LSP runs):
burn-ndarray simd): ~1.8× faster cold inference.SmallVectag sets: ~7% faster warmparse_essay, fewer allocations.burn-flex: ~5.6× faster cold inference (0.53 ms vs 2.97 ms / sentence), byte-identical output. Real-proseparse_essay5.9× faster,lint_essay_uncached3.4× faster.Model quality
model.mpk(752 KiB) + two small vocab JSONs, replacing the old tagger JSON + chunker model + chunker vocab + chunker JSON. Good for the WASM bundle and theharper-lsbinary (cf. Reduce binary size #1945).Reproducibility / faster iteration
regenerate_english_jointexample (gated behindtraining): it auto-fetches the UD treebanks, trains deterministically (seeded), and writes the embedded artifacts.joint_determinism_checkverifies the training is reproducible.Demo
Before/after on two reported false positives (American dialect, default rule set):
Real-prose corpus diff (
harper-core/tests/text/linters/*.snap.yml) — 4 false positives removed, 0 added:MissingPrepositionMissingPrepositionMissingPrepositionThereToTheirHow Has This Been Tested?
Full
harper-coresuite (5349 lib + integration tests) green; clippy clean; POS-tagger and prose-lint snapshots regenerated and reviewed (the prose-lint snapshot diff is removals only).New unit tests per touched linter, using the reported sentences and real prose.
Before/after built
harper-clifrommaster(2.6.0) vs this branch and diffed the lint output on the reported sentences:MissingPrepositionMissingPrepositiontriggered by "It's been many years since I've..." #3204MissingPrepositionMissingPrepositionMissingPrepositionMissingToMissingToWereWhereWereWherewant->want topreceeding e.g. 'phase one' (MissingTo) #3089 / False Positive: 'wanted revenge' -> 'wanted to revenge' (MissingTo#3087 / False positive inToTwoToo#2856 / False positive flagging "there" to change to "their" #3700WereWhere) #3214WereWhere(FP)WereWhere(FP)The regression-guard rows matter: the tagger swap does not cost recall on the real errors.
AI Disclosure
If Your PR Implements or Enhances a Linter
Checklist