feat(accent): local UniDic + POS-driven patches by torrid-fish · Pull Request #53 · sessatakuma/API-tools

torrid-fish · 2026-05-27T11:15:37Z

Purpose

Closes #45, Closes #48, Closes #50, Closes #57, Closes #58.
Supersedes (now-closed) PRs #47 and #49.
Replaces #51 (same branch, renamed spike/local-unidic → feat/local-unidic;
the rename closed #51, so this is its continuation).

Issue #50 was explicitly framed as an alternative architecture to #48 ("two architectures targeting the same goal. Production will likely pick ONE"). This PR is the GO outcome: local fugashi + NINJAL UniDic CWJ 2025-12-31 in-process replaces the Yahoo MA HTTP round-trip, while keeping OJAD for surface-level phrase pitch and keeping every POS-driven rule originally introduced in PR #49.

This is the single, complete migration PR — it targets main directly and folds in the foundation that was previously proposed as the standalone PR #47 (regex reading-override layer, Needleman-Wunsch DP aligner, /MarkAccent/stream/ NDJSON endpoint). Since the Yahoo-backed intermediate from #47 is a stepping stone that won't ship independently, #47 was closed and its commits are the base of this branch rather than a separate merge.

PR structure (19 commits, base `ci/cd-ghcr` → `main`)

Stacked on top of PR #55 (ci/cd-ghcr → main), which carries the generic CI/CD basework (GHCR workflow, Node 24 bumps, non-root Dockerfile). Once #55 merges, this PR's base auto-retargets back to main and the diff stays the same — only the foundation + UniDic + UniDic-specific Docker tweak remain here.

ci/cd-ghcr → foundation (3) → UniDic migration (15) → UniDic Docker tweak (1):

Foundation (was #47): Needleman-Wunsch DP aligner (+ rendaku fold), regex reading-override layer + URL/non-JP preprocessing, and the /MarkAccent/stream/ NDJSON endpoint.

UniDic migration (#50): swap Yahoo MA → fugashi+UniDic, drop Yahoo config, add tokenizer / preprocess / postprocess / user_patches modules, UniDic strong-mode fields, English README.

UniDic Docker tweak: scripts/download_unidic.sh at build time (depends on the unidic pip dep that lands in this PR — that's why it can't live in #55).

Rebased onto current main (which includes the #52 package refactor and the #46 Docker compose setup). The Docker-scaffolding commit from the original spike was dropped during rebase — already upstream via #46. Commit subjects were reworded to stay under 75 chars (so exact hashes aren't pinned here).

Architecture

flowchart LR
    text([text input])
    fugashi["fugashi (MeCab) + UniDic CWJ 2025-12-31<br/>surface · kana · lemma<br/>pos · conjugation · aType"]
    ojad["OJAD scrape (suzukikun)<br/>per-mora surface pitch contour"]
    align["DP alignment<br/><code>align_accent</code>"]
    patches["<code>apply_accent_patches</code><br/>POS-driven rules"]
    out([WordAccentResult])

    text --> fugashi
    text --> ojad
    fugashi -- tokens + POS --> align
    ojad -- per-mora pitch --> align
    align --> patches
    fugashi -. POS metadata .-> patches
    patches --> out

Simplified schematic. For the full pipeline data-flow diagram see api/accent/README.md → "Data flow".

Yahoo MA HTTP endpoint fully removed (api/accent/furigana.py and the /MarkFurigana/ route deleted). Everything #49 introduced (POS-driven apply_accent_patches, pos_match override predicate, 5 POS columns on WordResult/WordAccentResult) carries over to UniDic identically — same UniDic schema, only the upstream changed.

Package layout

fugashi + UniDic tokenizer → api/accent/tokenizer.py
preprocessing (URL / comma / × strip + restore) → api/accent/preprocess.py
postprocessing passes (heiban-particle flatten, 助詞 furigana suppression, punct suppression, katakana toggle) → api/accent/postprocess.py
POS-driven patches + user-maintained patch table → api/accent/reading_overrides.py + api/accent/user_patches.py
strong-mode fields → api/accent/models.py
chunked orchestrator (_build_chunks, _schedule_chunks) → api/accent/pipeline.py

Local UniDic replaces Yahoo MA

_fetch_yahoo_raw → _tag_local using singleton fugashi.Tagger() + UniDic CWJ 2025-12-31
Field mapping: feat.pos1 → pos, feat.pos2 → pos1, feat.cType → conjugation_type, feat.cForm → conjugation_form, feat.lemma → base (with -gloss suffix stripped), feat.aType → lexical_kernel (parsed for multi-reading via _parse_atype)
Reading uses feat.kana over feat.pron — UniDic stores 忙しい as kana=イソガシイ (matches OJAD's ortho-kana) while pron=イソガシー with chōonpu would never align
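The field mapping above can be sketched as a pure function over a UniDic feature object. This is only an illustration: `TokenFields` and `map_unidic_features` are hypothetical names, `feat` is assumed to expose the UniDic attributes named in the mapping, and treating `*` as a missing value is an assumption about how nulls arrive.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class TokenFields:
    """Hypothetical stand-in for the fields the mapping fills in."""
    surface: str
    reading: Optional[str]
    base: Optional[str]
    pos: Optional[str]
    pos1: Optional[str]
    conjugation_type: Optional[str]
    conjugation_form: Optional[str]


def _clean(value):
    # Assumption: missing UniDic features show up as "*", "" or None.
    return None if value in (None, "", "*") else value


def map_unidic_features(surface, feat):
    lemma = _clean(getattr(feat, "lemma", None))
    if lemma is not None:
        # Drop a "-gloss" suffix; keep the lemma if stripping would empty it.
        lemma = lemma.split("-", 1)[0] or lemma
    return TokenFields(
        surface=surface,
        # kana (orthographic reading) is used instead of pron, so long
        # vowels stay spelled out (イソガシイ, not イソガシー).
        reading=_clean(getattr(feat, "kana", None)),
        base=lemma,
        pos=_clean(getattr(feat, "pos1", None)),
        pos1=_clean(getattr(feat, "pos2", None)),
        conjugation_type=_clean(getattr(feat, "cType", None)),
        conjugation_form=_clean(getattr(feat, "cForm", None)),
    )


# A hand-built feature object standing in for a real tagger result.
sample = SimpleNamespace(
    pos1="形容詞", pos2="一般", cType="形容詞", cForm="連体形-一般",
    lemma="忙しい", kana="イソガシイ", pron="イソガシー",
)
token = map_unidic_features("忙しい", sample)
```

With a real tagger the same function would be applied to each word's `feature` object; the sample above only stands in for that.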

Strong-mode fields

lexical_kernel: int | None — aType primary (0=heiban, N≥1=kernel on mora N)
lexical_kernel_alts: list[int] | None — multi-reading alternates (e.g. aType="2,0" → [2, 0])
kernel_absorbed: bool — UniDic says kernel exists but OJAD has no FALL in this word's range (connected-speech sandhi case, e.g. 忙しい inside お忙しい中)
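A minimal sketch of how the three strong-mode fields could be derived. `parse_atype` and `is_kernel_absorbed` are hypothetical helpers, not the PR's `_parse_atype`; returning `None` for alternates when there is a single value, and using accent-marking type 2 for FALL, are assumptions drawn from the examples in this PR.

```python
FALL = 2  # assumed accent-marking type for a pitch fall


def parse_atype(raw):
    """Return (lexical_kernel, lexical_kernel_alts) from an aType string."""
    if raw is None or raw in ("", "*"):
        return None, None
    kernels = [int(part) for part in (p.strip() for p in raw.split(",")) if part.isdigit()]
    if not kernels:
        return None, None
    # The first value is treated as primary; the full list is kept only
    # when the dictionary offers more than one reading.
    alts = kernels if len(kernels) > 1 else None
    return kernels[0], alts


def is_kernel_absorbed(lexical_kernel, accent_types):
    """True when the dictionary has a kernel but the surface contour shows no fall."""
    if lexical_kernel is None or lexical_kernel == 0:
        return False  # heiban or unknown: there is nothing to absorb
    return FALL not in accent_types
```

For example, `parse_atype("2,0")` yields `(2, [2, 0])`, and a word with kernel 3 whose aligned morae are all LOW or HIGH would be flagged as absorbed.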

Dictionary variant selection

NINJAL publishes two UniDic variants — the download script supports both:

| Tag | Variant | Training corpus | Use case |
| --- | --- | --- | --- |
| `cwj-2025-12-31` | CWJ (現代書き言葉) | BCCWJ (written) | Articles, novels, web text (default) |
| `csj-2025-12-31` | CSJ (現代話し言葉) | CEJC (spoken) | Conversational / transcribed speech |

./scripts/download_unidic.sh cwj-2025-12-31   # default — written
./scripts/download_unidic.sh csj-2025-12-31   # spoken-language variant
./scripts/download_unidic.sh cwj-2021-08-31   # older CWJ 3.1.0

CWJ is the default because this service primarily processes written input. CSJ may be preferable when processing conversational or transcribed speech. The Dockerfile defaults to CWJ; override by changing the script argument in the builder stage.

Hide internal MA metadata from response

base/pos/pos1/conjugation_type/conjugation_form marked Field(exclude=True) — kept on the model for in-pipeline use by apply_accent_patches, excluded from JSON serialization. Cleaner client contract; nothing functional changed.

Unify chunking between two endpoints

_build_chunks / _schedule_chunks shared by /MarkAccent/ and /MarkAccent/stream/. Both endpoints emit byte-identical per-chunk word lists; only delivery shape differs (collected vs streamed NDJSON).
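The shared-chunking idea can be illustrated with a small sketch: one chunk builder feeds both a collected response and an NDJSON stream, so the per-chunk word lists cannot diverge. The function names, the `fake_process` stand-in, and the placeholder `status` value are assumptions for illustration; the split on newlines and then on 。！？． follows the commit notes further down.

```python
import json
import re

_TERMINATORS = "。！？．"
# A sentence is a run of non-terminators followed by its terminators,
# or a trailing run that has no terminator at all.
_SENTENCE_RE = re.compile(f"[^{_TERMINATORS}]*[{_TERMINATORS}]+|[^{_TERMINATORS}]+")


def build_chunks(text):
    """Split into (line_idx, sub_idx, sentence) triples."""
    chunks = []
    for line_idx, line in enumerate(text.split("\n")):
        for sub_idx, sentence in enumerate(_SENTENCE_RE.findall(line)):
            chunks.append((line_idx, sub_idx, sentence))
    return chunks


def fake_process(sentence):
    # Stand-in for the real per-chunk tokenise + align work.
    return [{"surface": sentence}]


def collected(text):
    """Collected delivery: one flat word list."""
    return [word for _, _, s in build_chunks(text) for word in fake_process(s)]


def ndjson_lines(text):
    """Streamed delivery: one JSON object per chunk."""
    for line_idx, sub_idx, s in build_chunks(text):
        yield json.dumps(
            {"chunk": line_idx, "subchunk": sub_idx, "status": "ok",
             "result": fake_process(s), "error": None},
            ensure_ascii=False,
        )
```

Because both deliveries iterate the same `build_chunks` output, concatenating the streamed `result` lists reproduces the collected list.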

Rendering polish

| What | Why |
| --- | --- |
| Polish furigana/accent on punct, symbols, toggles | Existing edge cases pre-spike |
| Katakana with `render_katakana_furigana=False` keeps `accent` (only clears `furigana`) | Per-mora pitch overlay can still render against the surface text |
| 助詞 (`pos=助詞`) tokens clear their redundant top-level `furigana` | Hiragana ruby on a hiragana surface is visual duplication; was crowding out pitch overlay |
| After heiban predecessor, の／な／は／が pitch flattened to LOW | OJAD's HIGH-plateau continuation was visual noise without contour change |
| `\d × \d` swapped → `\d/\d` pre-pipeline, `×` restored on surface post-alignment | OJAD merges `19×19` → 1919 (千九百十九); swap forces independent reading per number |
| Numeric `_match_cost` adds tiny per-empty-OJAD-entry cost | DP was splitting `19/19` as 1+7 morae instead of 4+4 because all cost-0 ties picked an arbitrary path |
| User-maintained patch table (`user_patches.py`) with explicit per-mora int tuples | Hand-fix OJAD/UniDic misreads without touching pipeline code |
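The `×` swap-and-restore row can be sketched as follows. `swap_times` and `restore_times` are hypothetical names, and matching `×` only when it sits directly between two digits is an assumption. Since the swap replaces one character with one character, recorded offsets stay valid for the restore step.

```python
import re

# "×" directly between two digits (assumed to be the pattern of interest).
_TIMES_RE = re.compile(r"(?<=\d)×(?=\d)")


def swap_times(text):
    """Replace digit×digit with digit/digit and remember where."""
    offsets = [m.start() for m in _TIMES_RE.finditer(text)]
    return _TIMES_RE.sub("/", text), offsets


def restore_times(surfaces, offsets):
    """Put "×" back on token surfaces, using offsets into the joined text."""
    marked = set(offsets)
    restored, pos = [], 0
    for surface in surfaces:
        chars = list(surface)
        for i, ch in enumerate(chars):
            # Only slashes that were swapped in are restored; a "/" the
            # user actually typed is left alone.
            if ch == "/" and pos + i in marked:
                chars[i] = "×"
        restored.append("".join(chars))
        pos += len(surface)
    return restored
```

The restore step assumes the token surfaces still concatenate to the swapped text; if tokens were merged or dropped in between, the offsets would need remapping.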

Verified locally

ruff check, ruff format --check pass (17 files); import main + all accent submodules import clean
学校に行きます。今日は寒いですね。19×19の格子。 — heiban-particle flatten (に stays 1, は／が／の／な flatten after heiban), 助詞 furigana cleared, 19×19 splits 4+0+4 not 1+7
Response JSON contains zero of base/pos/pos1/conjugation_type/conjugation_form (Field(exclude=True))
/MarkAccent/stream/ returns NDJSON with {chunk, subchunk, status, result, error}

Disambiguation probe

All 11 cases from #49's table still pass — POS gates fire identically because the UniDic schema is the same upstream of the rule layer.

UniDic Docker tweak (the only Docker/CI commit still in this PR)

UniDic dict baked into the image: the unidic pip package ships only the loader, not the ~1.3GB dicdir, so fugashi.Tagger() would fail at runtime without the dictionary. Dockerfile now runs scripts/download_unidic.sh in the builder stage as its own cache layer → the image is self-contained, no runtime download.

The CD workflow, Node 24 bumps, and non-root runtime are in #55 (the base of this stacked PR).

The published CD image after both #55 and this PR is ~2.1GB (base + venv + UniDic CWJ 2025-12-31). Smoke-tested end-to-end before the split: build + push to GHCR green, image runs as uid 10001, and POST /api/MarkAccent/ returns 200 with correct accent output.

Out of scope

i-adjective た形 cross-token (高かった = 高 + かっ + た)
Loanword 3-mora rule (バナナ, ピアノ)
数詞 + counter irregulars beyond date/age (4時, 7時, 9時, 1分, 1人, 2人)
Replacing OJAD entirely — still needed for surface-level phrase pitch; UniDic aType is lexical (dictionary-form), not connected-speech contour

Notes

Docker image disk cost: UniDic CWJ 2025-12-31 ≈ 1.3GB on disk. Trade-off accepted per the measurement in the spike issue #50 (Evaluate local UniDic (fugashi/Sudachi) for in-process accent + lower latency).
Singleton tagger: per-request construction would re-load ~1.3GB and add ~1-2s; tagger is cached as module-level lazy init.

🤖 Generated with Claude Code

The greedy aligner had two failure modes that cascaded across whole sentences: a numeric anchor that over-consumed when Yahoo and OJAD disagreed on phrase boundary, and a +1 fallback path that turned a single mismatch into type-0 fallback for every downstream token. Replaces it with a global DP over (yahoo_token, ojad_entry) pairs: each Yahoo token consumes k ∈ [0, K_MAX] contiguous OJAD entries, with per-token cost computed via shape (punct/numeric/kana) and edit distance over rendaku-folded strings for kana tokens. Sub cost (0.4) is lower than ins/del (1.0) so the DP prefers same-length spans with substitutions over shorter spans with deletions — fixes the case where OJAD's `う` from `等→とう` leaked onto the next token. Adds a voicing-fold table so Yahoo's dictionary-form readings (ふんかん) align against OJAD's pronounced readings with rendaku (ぷんかん). All comparisons under this fold; ぱ/ば/ぷ/ぶ all alias to は/ふ. Refs #47.
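The per-token kana cost described above can be sketched as a weighted edit distance over voicing-folded strings. This is an illustration under stated assumptions: the commit uses a voicing-fold table, while this sketch swaps in Unicode NFD decomposition to strip dakuten and handakuten; the function names are hypothetical. Only the 0.4 and 1.0 costs come from the commit text.

```python
import unicodedata

SUB_COST = 0.4    # substitution, per the commit message
INDEL_COST = 1.0  # insertion / deletion, per the commit message
_VOICING_MARKS = {"\u3099", "\u309a"}  # combining dakuten / handakuten


def fold_voicing(kana):
    """Map voiced and semi-voiced kana to their plain form (ぷ → ふ, ば → は)."""
    decomposed = unicodedata.normalize("NFD", kana)
    stripped = "".join(ch for ch in decomposed if ch not in _VOICING_MARKS)
    return unicodedata.normalize("NFC", stripped)


def weighted_edit_distance(a, b):
    """Edit distance where substitutions are cheaper than insertions/deletions."""
    a, b = fold_voicing(a), fold_voicing(b)
    prev = [j * INDEL_COST for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * INDEL_COST]
        for j, cb in enumerate(b, 1):
            substitute = prev[j - 1] + (0.0 if ca == cb else SUB_COST)
            delete = prev[j] + INDEL_COST
            insert = curr[j - 1] + INDEL_COST
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]
```

With these weights a same-length span with one differing mora costs 0.4, while a span that is one mora short costs 1.0, which is the preference the commit describes.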

Add api/accent/reading_overrides.py — a context-blind correction layer sitting between Yahoo Furigana and OJAD alignment. Each override is a regex on the concatenated surface text plus the replacement tokens that should appear instead. Covers:
- 曜日 brackets: (月)/（月）→ げつ, (土) → ど, etc. for all 7 weekdays.
- All 31 day-of-month readings: 1日 → ついたち (atamadaka), 5日 → いつか, 14日 → じゅうよっか, 20日 → はつか, etc.
- N日間 durations 1-31: 1日間 → いちにちかん (NOT ついたちかん since the 1st-of-month reading is impossible for a duration), 7日間 → しちにちかん (modern technical writing preference over なのかかん).
- 20歳 / 二十歳 / 20才 → はたち (the only irregular age reading).
Patterns accept arabic / full-width / kanji numeral variants of the same N so `3月5日(土)` / `３月５日（土）` / `三月五日（土）` all trigger the same overrides.
Order-of-overrides matters: duration list precedes date list so `N日間` wins over `N日` at the same start (longer match breaks ties in _collect_matches).
apply_furigana_overrides runs BEFORE align_accent so merged spans like `5日→いつか` reach OJAD as a single token whose furigana matches OJAD's phrase reading (the numeric-anchor logic in align_accent otherwise cascade-fails because numeric tokens lack any Yahoo furigana). apply_accent_overrides runs AFTER align to re-stamp both furigana and accent on the same matched spans, so the response is consistent.
Adds URL preprocessing: each https?:// is swapped for the placeholder "URLPLACEHOLDER" before the pipeline runs (Yahoo fragments URLs across several alphabet tokens; OJAD's phrasing scraper produces noise for Latin punctuation runs — both drag alignment off-rail). Placeholders are walked back to the originals in order after alignment. URL body stops at whitespace, any Japanese char, or `,()<>[]"'` so embedded URLs strip cleanly.
Adds a non-Japanese short-circuit: if (after URL stripping) the chunk contains no hiragana / katakana / CJK ideograph, skip Yahoo + OJAD entirely and echo the chunk back as a single token. Lets pure-URL / pure-English lines stream through cheaply.
Also adds stream_accent_chunks() to pipeline.py as a helper used by the streaming endpoint added in the next commit. Splits the input on \n then on full-width sentence terminators (。！？．) — long paragraphs degrade OJAD's phrasing predictor and parallelising across sentences caps the latency. In-flight work is bounded by a semaphore (concurrency=4) because OJAD's u-tokyo backend falls over with 30+ parallel scrapes. main.py docstring updated to reflect /MarkAccent/stream/. Refs #47.
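The URL placeholder round-trip and the non-Japanese short-circuit from this commit can be sketched as below. The function names and the exact Unicode ranges used for "Japanese char" are assumptions; the placeholder string and the stop characters are taken from the commit text.

```python
import re

PLACEHOLDER = "URLPLACEHOLDER"
# Assumed ranges: hiragana + katakana, CJK Extension A, CJK Unified Ideographs.
_JP = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff"
# URL body stops at whitespace, any Japanese char, or one of ,()<>[]"'
_URL_RE = re.compile(r"https?://[^\s,()<>\[\]\"'" + _JP + r"]+")
_JP_RE = re.compile(f"[{_JP}]")


def strip_urls(text):
    """Swap each URL for the placeholder and return the originals in order."""
    urls = _URL_RE.findall(text)
    return _URL_RE.sub(PLACEHOLDER, text), urls


def restore_urls(surfaces, urls):
    """Walk placeholders back to the original URLs, in order of appearance."""
    remaining = iter(urls)
    return [
        # If the originals run out, the placeholder text is left in place.
        re.sub(PLACEHOLDER, lambda m: next(remaining, m.group(0)), surface)
        for surface in surfaces
    ]


def has_japanese(text):
    """Gate for the short-circuit: any hiragana, katakana, or CJK ideograph."""
    return bool(_JP_RE.search(text))
```

A chunk for which `has_japanese` is false after stripping would be echoed back as a single token without any tokeniser or OJAD call.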

Add a streaming variant of /MarkAccent/ that processes the input as a sequence of (line, sentence) chunks and emits one NDJSON object per chunk in input order. Each line carries `{"chunk": line_idx, "subchunk": sub_idx, ...AccentResponse}` so clients can render output incrementally while keeping document position. Underlying chunk-fanout and concurrency limiting live in pipeline.stream_accent_chunks; the route is a thin StreamingResponse wrapper. Streaming benefits compound: OJAD's phrasing predictor degrades on long inputs (a single misaligned mora cascades across the paragraph), so per-sentence chunks both stay short enough for OJAD to handle and fan out under the bounded semaphore. Also adds test.sh — a small bash smoke-test helper that POSTs a sample text to either /MarkAccent/ or /MarkFurigana/ and pretty-prints the per-moji (surface|furigana|accent_marking_type) rows. STREAM=1 switches to the streaming endpoint, ENDPOINT= picks which router. Useful while iterating on overrides; not wired into CI. .gitignore adds data/ and output/ for ad-hoc test fixtures we don't want committed. Refs #47.
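A minimal asyncio sketch of the bounded fan-out with in-order emission described above. It is not the PR's `stream_accent_chunks`; `stream_chunks` and the fake per-chunk coroutine are hypothetical, and error handling is omitted. All chunks are scheduled up front, a semaphore caps how many run at once, and results are awaited in input order so the NDJSON lines keep document position.

```python
import asyncio
import json

CONCURRENCY = 4  # the commit bounds in-flight OJAD work at 4


async def stream_chunks(chunks, process, concurrency=CONCURRENCY):
    """Yield one NDJSON line per (line_idx, sub_idx, sentence), in input order."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run(sentence):
        async with semaphore:
            return await process(sentence)

    tasks = [asyncio.create_task(run(sentence)) for _, _, sentence in chunks]
    for (line_idx, sub_idx, _), task in zip(chunks, tasks):
        result = await task  # later chunks may already be finished
        yield json.dumps(
            {"chunk": line_idx, "subchunk": sub_idx, "result": result},
            ensure_ascii=False,
        ) + "\n"


async def _demo():
    in_flight = peak = 0

    async def fake_process(sentence):
        # Stand-in for the real per-chunk work; later chunks finish sooner.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.001 * (10 - int(sentence[1:])))
        in_flight -= 1
        return sentence.upper()

    chunks = [(0, i, f"s{i}") for i in range(10)]
    lines = [line async for line in stream_chunks(chunks, fake_process)]
    return lines, peak


lines, peak = asyncio.run(_demo())
```

In a web route, an async generator like this could be handed to a streaming response object; the demo only collects the lines to show ordering and the concurrency cap.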

Replace the Yahoo Furigana HTTP path with in-process fugashi + NINJAL UniDic 3.1.0. The migration adds three new layers inside the `api/accent/` package plus a sentence-level chunked streaming endpoint:
* `tokenizer.py` — singleton `fugashi.Tagger`; maps UniDic features into the existing `WordResult` shape plus new strong-mode fields `lexical_kernel` / `lexical_kernel_alts` (parsed from `aType`).
* `preprocess.py` — pre-alignment text rewrites (URL strip, western-grouped thousands `1,234`→`1234`, `\d×\d`→`\d/\d`), `has_japanese` short-circuit gate, sentence splitting, readable-symbol (`2%`, `15℃`) pre-merge.
* `postprocess.py` — rendering passes: heiban-particle accent flatten (の/な/は/が after a 平板調 word), pure-punct furigana suppression, English / katakana toggle handling, 助詞 furigana suppression.
* `reading_overrides.py` — moved into the package; regex overrides for 日付/N日間/20歳/曜日 plus the POS-driven `apply_accent_patches` rule for ます / たい first-mora FALL.
`align.py` is upgraded to a Needleman-Wunsch DP over (token, OJAD-entry) pairs with weighted edit distance, rendaku voicing fold, an OJAD-punct guard, and a numeric tiebreaker that fixes the `19×19` 1+7-split bug.
`models.py` extends `Request` with `render_english_furigana` / `render_katakana_furigana` toggles, adds POS metadata fields (excluded from serialization) plus strong-mode lexical-accent fields exposed in JSON, and drops the standalone `FuriganaResponse`.
`routes.py` exposes `/api/MarkAccent/` (collected) and `/api/MarkAccent/stream/` (NDJSON per chunk); both share `pipeline.build_chunks` + `pipeline.schedule_chunks` for byte-identical per-chunk results. The standalone MarkFurigana endpoint is removed — there is no in-process equivalent for the Yahoo Furigana service.
`main.py` drops the slowapi rate limiter, CORS, trusted-host, and X-API-KEY middleware — the service is now expected to run behind the parent backend on a private network. `config/settings.py` is reduced to just `load_dotenv()` accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refresh the data-flow diagram, file-responsibilities table, and dependency graph to reflect the new layout (tokenizer / preprocess / postprocess / reading_overrides). Update the alignment-algorithm section to describe the DP / `_match_cost` / voicing fold / OJAD-punct guard / numeric tiebreaker that replaced the old greedy implementation.
Append three new sections documenting layers that didn't exist when the README was first written:
* **Surface overrides + POS patches** — regex `OVERRIDES` list shape, apply_furigana_overrides vs apply_accent_overrides, POS-driven `_is_masu_auxiliary` / `_is_tai_auxiliary` predicates.
* **Postprocess passes** — the four idempotent passes that run after align + overrides + patches, with the rationale for their order.
* **Local UniDic tokeniser** — feature → WordResult mapping, `feat.kana` vs `feat.pron` choice, `*` null handling, Field(exclude=True) on POS metadata.
Also drops the MarkFurigana row from the endpoint table (the endpoint was removed in the local-UniDic migration) and updates the "Adding endpoints / overrides" section to reference the new file names.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Snapshot the spike's investigation artefacts so future readers can reconstruct the GO/NO-GO decision and the test cases that drove the DP-aligner and POS-patch design:
* docs/spike-local-unidic.md — phased measurement report (verb forms, て-form, long sentences) culminating in the GO recommendation.
* docs/spike-local-unidic-runbook.md — runbook for replaying the spike with `uv run scripts/spike_local_unidic.py`.
* scripts/spike_local_unidic.py — end-to-end Yahoo-vs-UniDic comparison harness against the existing OJAD pipeline.
* scripts/probe_verb_forms.py — generates verb-form coverage matrices for the DP-aligner regression suite.
* scripts/probe_te_and_long.py — exercises te-form chains and long sentences where OJAD's CRF was most likely to absorb kernels.
* scripts/smoke_test_partial.py — minimal in-process smoke test against `_process_accent_chunk` for quick iteration.
These scripts still reference the pre-refactor `api.accent_marker` monolith paths intentionally — they were the artefacts the spike produced, and rewriting them would lose the audit trail. Rerunning them in the new layout would require trivial import updates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fugashi's split of G2P / PSP-1000 / Wifi.7 / 12.5 into single-piece tokens lets the DP shuffle OJAD's morae onto the wrong child — `ーにピ` floating above `2` in G2P, `てん` leaking onto `5` in `12.5`, and the accent CRF collapsing on everything after `Wifi.7`. Fuse those runs into one token before alignment so each kind flows through one branch.
- tokenizer.tag_local: glue contiguous (alpha|digit) runs, bridging `-` / `_` / `.` between alpha/digit pieces via look-ahead. Letter-less runs whose joined surface matches NUMERIC_PATTERN (`12.5`, `0.5`) get a decimal merge instead. fugashi's `white_space` attribute gates the merge so `Hello world` and `API key` stay split.
- align: new `is_english_compound` free-consume branch in `_match_cost`, reordered ahead of the OJAD-punct guard so a merged acronym can swallow the `。` OJAD inserts when it normalises `.`. `_build_word_result` filters those punct entries from the rendered accent so they don't surface as ruby when the English toggle is on.
- preprocess.strip_acronym_dots_for_ojad: OJAD-only strip — OJAD's `.` → `。` normalisation collapses its prosody CRF on the rest of the sentence, so the OJAD query gets `Wifi7` while fugashi keeps the original `.` and the tokenizer merge preserves the user-visible `Wifi.7` surface.
- postprocess._is_pure_english_surface: accept `-` / `_` / `.` so the toggle wipe agrees with the aligner on which fused surfaces qualify.
- pipeline: thread the OJAD-only stripped text into `get_ojad_result`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OJAD silently elides English when interleaved with kana (probe in spike-only scripts: Whisper inside `ふりがなWhisper`, satochin inside `深掘りライターsatochin氏`, URLPLACEHOLDER after strip_urls all come back with 0 OJAD morae). The aligner charged _FALLBACK_COST for an english_compound token taking k=0, so the cheapest DP path was to steal 1 mora from the neighbouring kana token to dodge the 3.0 penalty — paying ~1.0 edit-distance on the kana side instead. That cascade left ふりがな missing trailing な (test_1 ×7), コメント empty-spanned and falling through to the collapsed single-entry fallback (test_0), ライター missing the trailing chōon ー (test_0), and テスト missing the leading テ after a URL token (test_0).
Lower k=0 to 0.0 in the english_compound branch. Spelled-out cases (`G2P` → ジーツーピー) still align correctly because forcing those katakana morae onto a neighbouring kana token costs more edit-distance than letting the english token absorb them at k≥1 cost 0.
Verified end-to-end against all 30 fixtures: 0 under-mora anomalies remaining (previously 10 across test_0 and test_1). Also adds scripts/run_10_tests.sh as a kept regression harness driving the full corpus via a TESTS env override.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After the local fugashi + UniDic migration, the Yahoo-era scaffolding is fully unreferenced:
- config/settings.py loaded YAHOO_API_KEY via dotenv but nothing imports it. config/__init__.py is empty.
- .env / .env.example only carried YAHOO_API_KEY (and an unread API_TOOLS_PORT). The application reads neither.
- scripts/spike_local_unidic.py was the Yahoo↔local comparison spike; the comparison is the merged work itself.
- scripts/probe_*.py and scripts/smoke_test_partial.py were spike-only debug tools.
- docs/spike-local-unidic*.md narrate work now landed.
Dockerfile drops `config` from the compileall/COPY lines. docker-compose.yml drops the `env_file: .env` block (the ${API_TOOLS_PORT:-8000} fallback still works from shell env). README.md trims the false "obtain a Yahoo API key" paragraph.
scripts/run_10_tests.sh stays — it's the 30-fixture regression harness committed in the previous commit, not a spike artefact.
Verified post-prune: server reloads cleanly (HTTP 200 on a fresh MarkAccent POST) and test_0 / test_15 / test_29 all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four follow-on features and a doc refresh on top of the spike fix.
1. SYMBOL_READINGS table in preprocess.py, consumed by tokenizer.tag_local. Standalone symbols (#, %, @, &, +, =, $, ¥, €, ℃, °, *, ~, §, plus full-width siblings) now get their spoken katakana reading instead of an empty furigana. The aligner's edit-distance branch matches the OJAD span at cost 0 rather than refusing it; the `#病` cascade that stole one mora from the next particle (test_0 idx 1411) is gone. suppress_punct_furigana also learns to skip these surfaces so the symbol's furigana + accent survive the post-alignment scrub.
2. split_okurigana in postprocess.py populates WordResult.subword when a token mixes kanji and kana. `聞き分け` → subword=[(聞,き),(き,""),(分,わ),(け,"")]. Top-level surface, furigana, and accent are unchanged — clients that ignore subword get the previous behaviour bit-for-bit. Irregular readings that can't be aligned against the surface kana fall back to no subword (no garbled segments). Across the 30-fixture corpus, 1045 tokens in 30/30 files gain segments.
3. New `script` request arg: hiragana (default), katakana, or romaji. convert_furigana_script in postprocess.py rewrites every furigana field (top-level + per-mora + subword) before serialisation. Internal alignment stays hiragana. Default "hiragana" also normalises per-mora morae that OJAD echoed back as katakana (e.g. `ラ`/`イ` on ライター's accent[]) — the per-mora script is now consistent across surface types.
4. README rewritten in English: covers all five live endpoints (MarkAccent + UsageQuery + DictQuery + SentenceQuery), the full MarkAccent request body with the three new fields, response shape, examples, the regression harness, and the four known UniDic-vs-OJAD reading-mismatch tokens.
Re-profiled against the 30-fixture corpus after the changes: 0 under-mora anomalies (was 0 after the spike fix), 4 over-mora cases (was 5 — the `#病→と` leak is fixed by the symbol table). The remaining four over-mora cases are pre-existing UniDic context-reading mismatches (世, 本当, 他, 寺) unrelated to this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
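The okurigana split in item 2 can be approximated with a regex built from the surface: kanji runs become capture groups and kana runs become literals, then the pattern is matched against the reading. This is a sketch of one possible approach, not the PR's `split_okurigana`; it assumes hiragana surfaces and readings, groups adjacent kana into one segment, and uses non-greedy matching, which picks the shortest reading for the first kanji run when several splits are possible.

```python
import re

_KANJI_RUN_RE = re.compile(r"[\u4e00-\u9fff々]+")


def split_okurigana(surface, reading):
    """Return [(segment_surface, segment_reading), ...] or None on failure."""
    runs, pos = [], 0
    for m in _KANJI_RUN_RE.finditer(surface):
        if m.start() > pos:
            runs.append((surface[pos:m.start()], False))
        runs.append((m.group(), True))
        pos = m.end()
    if pos < len(surface):
        runs.append((surface[pos:], False))

    kinds = {is_kanji for _, is_kanji in runs}
    if kinds != {True, False}:
        return None  # pure kanji or pure kana: nothing to split

    pattern = "".join(
        "(.+?)" if is_kanji else re.escape(text) for text, is_kanji in runs
    )
    match = re.fullmatch(pattern, reading)
    if match is None:
        return None  # irregular reading: fall back to no subword
    readings = iter(match.groups())
    # Kana segments carry an empty reading, as in the commit's example.
    return [(text, next(readings) if is_kanji else "") for text, is_kanji in runs]
```

For `聞き分け` with reading `ききわけ` the pattern is `(.+?)き(.+?)け`, which yields the four segments shown in the commit message.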

The merge step was added before standalone symbols had a reading — it glued `(2, %)` into one token whose surface (`2%`) matched OJAD's phrase boundary so the パーセント morae wouldn't leak onto the digit. Side-effect: the merged token's furigana came out as `ごじゅうてんさんぱーせんと` for `50.3%`, with no way for a client to render ruby specifically over `%`. After the SYMBOL_READINGS work in the previous commit, `%` (and its siblings `@`, `&`, `+`, `$`, `¥`, `€`, `℃`, `°`, …) already carry their spoken katakana reading. The DP aligner matches each symbol's furigana against the OJAD span at edit-distance 0, so the パーセント morae no longer leak — the merge is redundant. Removing it gives the user's preferred shape: `50.3%とは` → [50.3|ごじゅうてんさん] [%|ぱーせんと] [と] [は] `READABLE_COMPOUND_RE` and the `is_readable_compound` branch in align.py stay in place — nothing wired produces a compound surface any more, but the dead branches are harmless and reading_overrides could in principle still synthesise one. Re-profiled all 30 fixtures: row counts unchanged, under-mora anomalies 0, over-mora 4 (same UniDic context-reading mismatches as before). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`apply_furigana_toggles` was clearing both furigana AND accent on any surface matching `_is_pure_english_surface` — but fugashi+UniDic hand back proper Japanese readings for unit compounds whose surface happens to look ASCII (`53mm` → みりめーとる, `33m/s` → めーとるまいびょう, `3kg` → きろぐらむ). With `render_english_furigana` off (default), those unit tokens came back with empty furigana and empty accent — the user saw `53mm` "escaped" entirely.
Skip the english wipe when the token's furigana already contains any hiragana/katakana char. UniDic only fills a kana reading when the surface IS a recognised Japanese unit / loanword token, so truly foreign english (`Whisper`, `G2P`, `Apple`) still has furigana==surface (no kana) and continues to be cleared.
Verified:
- `53mm` → surface=`53mm`, furi=`53みりめーとる`, accent=[ご,じゅ,う,さ,ん,み,り,め,ー,と,る] with marks
- `m/s` → surface=`m/s`, furi=`めーとるまいびょう`, full accent
- `Whisper`, `G2P` → still wiped (no kana in furigana)
30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 4 (same pre-existing UniDic context-reading mismatches). 7 fixtures gained rows where unit tokens previously were stripped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
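The exemption can be expressed as a small predicate. The names and the ASCII character set accepted as a "pure English surface" are assumptions for illustration; the rule itself (skip the wipe when the furigana already contains kana) is the one described in this commit.

```python
import re

# Assumed definition of an ASCII-looking surface, including - _ . and /.
_ASCII_SURFACE_RE = re.compile(r"[A-Za-z0-9\-_./]+")
# Hiragana, katakana, and the long-vowel mark.
_KANA_RE = re.compile(r"[\u3041-\u3096\u30a1-\u30fa\u30fc]")


def should_wipe_english(surface, furigana, render_english_furigana=False):
    """True when furigana and accent should be cleared for this token."""
    if render_english_furigana:
        return False  # toggle on: keep everything
    if not _ASCII_SURFACE_RE.fullmatch(surface):
        return False  # not an English-looking surface
    # A kana reading means the dictionary recognised a unit or loanword.
    return _KANA_RE.search(furigana) is None
```

Under this rule `Whisper` (furigana equal to the surface) is still wiped, while `53mm` with furigana `53みりめーとる` keeps its reading.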

Some inputs hit deterministic upstream quirks the local pipeline can't fix (OJAD reading `33m/s` as `さんじゅうみっめーとるまいびょう`, with a stray `みっ` from a CRF sound-change; UniDic giving a kanji its lemma reading instead of the contextual one). Rather than chase each with bespoke align/postprocess logic, give the caller a maintenance file they can grow over time. `api/accent/user_patches.py` exposes USER_PATCHES: a dict of literal-match surface fragments to a tuple of (segment_surface, segment_furigana) pairs. `reading_overrides._user_patch_overrides` compiles those into FuriganaOverride entries appended to the existing OVERRIDES list, so both the pre-OJAD furigana pass and the post-alignment accent pass pick them up — the second pass rewrites the contour with the prescribed reading. Accent defaults to heiban via a new `_mora_seq` helper that splits the reading into actual morae (so じゅ stays one entry, not two). Power users can drop full FuriganaOverride objects into the existing OVERRIDES section for atamadaka / per-mora custom marks. Seeded with one entry for `33m/s` as a working example. Edit the dict and re-run `./scripts/run_10_tests.sh` after each addition. Verified: `33m/s` now comes back as [33|さんじゅうさん] [m/s|めーとるまいびょう] instead of `33|さんじゅうみっ` + `m/s|めーとるまいびょう`. 30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 4 (same pre-existing UniDic-context mismatches — addressable by adding USER_PATCHES entries case-by-case). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extend the patch schema with an optional third element per segment so users can prescribe a non-heiban contour:
(surf, furi) → heiban (default)
(surf, furi, "heiban") → heiban (explicit)
(surf, furi, "atamadaka") → first-mora FALL, rest LOW
(surf, furi, "low") → all-LOW
(surf, furi, (0, 1, 2)) → explicit per-mora types
The shape names live in `_accent_from_spec` in reading_overrides.py; unknown specs warn and fall back to heiban. `_split_morae` is now factored out so both `_mora_seq` and the new helper share the same 小さな仮名-attach mora splitter.
Seeded three patches for the pre-existing UniDic-vs-OJAD context-reading mismatches in the 30-fixture corpus:
- `本当の` → ほんとう / の (heiban)
- `他の` → ほか (atamadaka) / の
- `世にも` → よ / に / も (heiban; demonstrates flatten-after-heiban naturally drops the trailing に to LOW)
Re-profile: under-mora 0 (unchanged), **over-mora 4 → 1**. The remaining `寺` case (test_16, after `永昌寺という`) involves a compound-boundary mis-tokenisation, not addressable by a simple literal patch — left for a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous schema accepted 2-tuples (heiban default) plus three named shapes ("heiban" / "atamadaka" / "low") as the accent_spec. Convenience came at the cost of one rule per shape and a wall of docs explaining which shape maps to what. Drop all of that — every segment is now exactly `(surface, furigana, accent_ints)` with the int tuple required and one entry per mora. The shapes are trivially expressible as tuples:
heiban → (1, 1, 1, ...)
atamadaka → (2, 0, 0, ...)
low → (0, 0, 0, ...)
`_accent_from_spec` now returns `None` on any malformed spec and the caller skips the whole patch entry (no per-segment fallback). All four seeded patches are rewritten in the strict form.
Re-profile: under-mora 0, over-mora 1 (same `寺` compound boundary case remains).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
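A sketch of the strict 3-tuple validation, assuming the mora splitter attaches small kana to the preceding character and that valid accent ints are 0, 1, and 2 (the values that appear in the shapes above). The helper names mirror the ones mentioned in the commits but the bodies are illustrative, not the PR's code.

```python
_SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")
_VALID_TYPES = {0, 1, 2}  # assumed: LOW, HIGH, FALL


def split_morae(reading):
    """Split a kana reading into morae; じゅ stays one entry."""
    morae = []
    for ch in reading:
        if ch in _SMALL_KANA and morae:
            morae[-1] += ch
        else:
            morae.append(ch)
    return morae


def accent_from_spec(furigana, accent_ints):
    """Return [(mora, type), ...] or None when the spec is malformed."""
    morae = split_morae(furigana)
    if not isinstance(accent_ints, tuple) or len(accent_ints) != len(morae):
        return None
    if any(type(v) is not int or v not in _VALID_TYPES for v in accent_ints):
        return None
    return list(zip(morae, accent_ints))


def compile_patch(segments):
    """Compile one patch entry; any bad segment rejects the whole entry."""
    compiled = []
    for segment in segments:
        if len(segment) != 3:
            return None
        surface, furigana, accent_ints = segment
        accent = accent_from_spec(furigana, accent_ints)
        if accent is None:
            return None
        compiled.append((surface, furigana, accent))
    return compiled
```

Rejecting the whole entry rather than a single segment matches the "no per-segment fallback" behaviour the commit describes.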

`_age_overrides` merges `20歳` into a single WordResult with the prescribed furigana `はたち` (3 morae), but OJAD still pronounces the same surface as `にじゅっさい` (5 morae). The DP aligner's kana branch couldn't grant the merged token 5 morae cheaply (the edit-distance between `はたち` and `にじゅっさい` is huge), so it allocated only 3 morae and the leftover `さい` cascaded onto the following kana tokens — `20歳の私達へ` ended up with `の` getting acc=[さ] and `私` getting acc=[い,の,わ,た,し].
Override-merged tokens carry no UniDic backing (both `base` and `pos` are None — `ReplacementToken` doesn't set MA metadata). Detect that combination in `_match_cost` and give the same free-consume treatment as numeric / readable_compound: k=0 returns _FALLBACK_COST so the DP prefers absorption, k≥1 returns 0 up to a generous upper. `apply_accent_overrides` rewrites the accent post-align so whatever DP picked up from OJAD is discarded.
Verified `20歳の私達へ`:
20歳 → はたち [(は,2),(た,0),(ち,0)]
の → の [(の,0)]
私 → わたくし [(わ,0),(た,1),(し,2)]
達 → たち [(た,0),(ち,0)]
へ → へ [(へ,0)]
30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 1 (same `寺` compound boundary case).
Also refreshes `api/accent/README.md`:
- Adds `user_patches.py` to file map + new section documenting the strict 3-tuple schema and accent_ints shapes.
- Documents the new synthesized branch in `_match_cost`.
- Adds Request toggle table (render_english/katakana_furigana, script) and the unit-compound exception for english toggle.
- Adds `split_okurigana` + `convert_furigana_script` to the postprocess pass list and updates the data-flow diagram.
- Removes references to merge_readable_symbol_compounds (gone since the SYMBOL_READINGS refactor).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously `apply_furigana_toggles` cleared the top-level `furigana` on pure-katakana tokens when `render_katakana_furigana=False`, but left every `AccentInfo.furigana` populated (with hiragana morae). Clients that draw ruby from the per-mora field rendered hiragana copies (`ふ・ら・ん・つ`) on top of katakana surfaces (`フランツ`) despite the toggle saying "no furigana" — the user-visible symptom on inputs like `フランツ・ヨーゼフ・ハイドン` was katakana names gaining unwanted ruby. Clear every `AccentInfo.furigana` to `""` for those tokens while keeping `accent_marking_type` and `length` intact, so clients that draw pitch overlay against the surface chars can still do so (length-aware iteration handles small kana like `ァ` / `ェ`). `render_katakana_furigana=True` is unaffected — both top-level and per-mora furigana flow through normally. 30-fixture regression: 30/30 HTTP 200, anomalies unchanged (one false-positive in the heuristic dropped because the cleared per-mora field stops triggering the "collapsed entry" check). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Full translation of the package-level documentation from Traditional Chinese to English. Structure and section ordering preserved; same mermaid data-flow diagram, same tables.

Also folds in the changes since the last refresh:
- Request toggle table documents the per-mora-furigana clear for the katakana toggle (the フランツ/Franz ruby-on-katakana fix).
- _match_cost branches list now includes the synthesized free-consume rule (override-merged 20歳 → はたち) alongside the english-compound k=0=0 rule.
- Postprocess pass list calls out unit-compound exemption from the english toggle wipe (53mm, 33m/s, 3kg keep their reading).
- User-patches section uses the strict 3-tuple schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The `unidic` pip package ships the loader but not the ~770MB dicdir, so fugashi.Tagger() failed at runtime (missing mecabrc) and /api/MarkAccent/ returned 500. Run `unidic download` in the builder stage; the venv copy into the final image carries the dict along. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The version-selectable dict script (01fcd71) switched the UniDic download to curl, but python:3.11-slim ships without it — the docker build died with exit 127 at the download step. Add curl next to unzip in the builder-stage apt install (multi-stage, so the runtime image is unchanged). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The collected endpoint rebuilds AccentResponse from per-chunk results and silently dropped the new `warning` field (#60), so OJAD-degraded responses looked like full results. Keep the first chunk warning, mirroring the first_error convention. The stream endpoint already passes it through via model_dump(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
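The "keep the first chunk warning" convention could look something like the following sketch. The helper name `first_warning` and the warning strings are hypothetical; the only thing taken from the commit is that the first non-empty warning wins, mirroring `first_error`.

```python
from typing import Iterable, Optional


def first_warning(chunk_warnings: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first non-empty warning among per-chunk results, else None."""
    return next((w for w in chunk_warnings if w), None)


# Hypothetical per-chunk warnings; None means the chunk had no warning.
merged = first_warning([None, "OJAD degraded", "OJAD timeout"])
```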

The global+if-None lazy init let two concurrent first requests each see _TAGGER as None and build their own fugashi.Tagger(), reloading the ~1.3GB UniDic dictionary twice (raised in PR #53 review). functools.lru_cache(maxsize=1) makes the lazy init atomic so only one tagger is ever constructed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
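As an illustration of the cached lazy init, here is a sketch that uses a stand-in class instead of `fugashi.Tagger` so it runs without the dictionary installed. `FakeTagger` and `get_tagger` are hypothetical names.

```python
import functools


class FakeTagger:
    """Stand-in for fugashi.Tagger; counts how many instances get built."""

    instances = 0

    def __init__(self) -> None:
        FakeTagger.instances += 1


@functools.lru_cache(maxsize=1)
def get_tagger() -> FakeTagger:
    """Build the tagger on first use and hand back the cached one afterwards."""
    return FakeTagger()


first = get_tagger()
second = get_tagger()  # served from the cache, no second construction
```

One caveat worth keeping in mind: the `functools` documentation notes that the wrapped function may still run more than once if another thread calls it before the first call has completed and been cached. If the tagger can be requested from several threads at once, guarding the construction with a `threading.Lock` is one way to rule that out.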

A leftover '# extra う onto the following の' line was sitting between the all-LOW and nakadaka rows of the accent-tuple table in the module docstring (flagged in PR #53 review). Remove it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

f-strings build the message eagerly even when the debug level is disabled; pass the value as a lazy %-arg instead (PR #53 review).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
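The difference can be made visible with a small probe object that counts how often it is rendered to text. This is a sketch only: the logger name and the `CountingRepr` class are made up for the demonstration.

```python
import logging

logger = logging.getLogger("accent.tokenizer")  # hypothetical logger name
logger.setLevel(logging.INFO)                   # DEBUG is disabled


class CountingRepr:
    """Counts how often it is converted to a string."""

    def __init__(self) -> None:
        self.rendered = 0

    def __str__(self) -> str:
        self.rendered += 1
        return "<tokens>"


eager = CountingRepr()
logger.debug(f"tokeniser output: {eager}")  # the f-string renders before debug() is called

lazy = CountingRepr()
logger.debug("tokeniser output: %s", lazy)  # rendered only if DEBUG were enabled
```

Note that the arguments themselves are still evaluated at the call site (for example a `len(tokens)` call); only the string formatting is deferred.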

schedule_chunks created detached asyncio tasks that kept scraping OJAD even after the client went away (PR #53 review): on the streaming endpoint a disconnect just stopped consuming the generator, and on the collected endpoint a cancelled handler orphaned its tasks.

Add a shared cancel_pending helper and call it from a finally in both endpoints, so a disconnect (GeneratorExit into the stream, or the collected handler being cancelled) tears down any still-pending chunk.

A TaskGroup would scope the tasks automatically, but async with TaskGroup() inside the streaming async generator wraps the aclose() GeneratorExit into a BaseExceptionGroup, so explicit cancellation is the only shape that closes the stream cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
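The cancel-in-`finally` shape described above might be sketched as follows. Only the name `cancel_pending` comes from the commit message; its signature, the `stream_results` generator and the demo coroutines are assumptions made for illustration.

```python
import asyncio
from typing import AsyncIterator, Iterable, List


def cancel_pending(tasks: Iterable[asyncio.Task]) -> int:
    """Request cancellation of every task that has not finished; return the count."""
    cancelled = 0
    for task in tasks:
        if not task.done():
            task.cancel()
            cancelled += 1
    return cancelled


async def stream_results(tasks: List[asyncio.Task]) -> AsyncIterator[object]:
    """Yield chunk results in order; tear down leftovers if the consumer leaves."""
    try:
        for task in tasks:
            yield await task
    finally:
        # Runs on normal exhaustion and on aclose() (GeneratorExit) alike.
        cancel_pending(tasks)


async def demo():
    async def fast() -> str:
        return "first"

    async def slow() -> str:
        await asyncio.sleep(10)  # stands in for a long OJAD scrape
        return "late"

    tasks = [asyncio.create_task(fast()),
             asyncio.create_task(slow()),
             asyncio.create_task(slow())]
    gen = stream_results(tasks)
    first = await gen.__anext__()  # consume one chunk
    await gen.aclose()             # simulate the client disconnecting
    await asyncio.gather(*tasks, return_exceptions=True)  # let cancellations land
    return first, [t.cancelled() for t in tasks]
```

Running `asyncio.run(demo())` consumes the first chunk, closes the generator early, and leaves the two slow tasks cancelled rather than running in the background.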

github-actions · 2026-06-04T08:08:25Z

🛡️ PR Quality Check Summary

✅ PR Title: Passed (Length: 47/75, Format: OK). feat(accent): local UniDic + POS-driven patches
✅ Branch Name: Follows naming convention (feat/local-unidic)
✅ Commit Messages: All 29 commit(s) passed (Length, Format, Case)
✅ Conflicts: No merge conflict markers found
✅ Python Quality: All checks passed.

🎉 All checks passed!

torrid-fish requested review from minto1226 and wade00754 as code owners May 27, 2026 11:15

torrid-fish mentioned this pull request May 27, 2026

feat(accent): local UniDic + POS-driven patches (closes #48, #50) #51

Closed

torrid-fish changed the title ~~feat(accent): local UniDic + POS-driven patches (closes #48, #50)~~ feat(accent): local UniDic + POS-driven patches May 27, 2026

torrid-fish marked this pull request as draft May 27, 2026 11:21

sessatakuma deleted a comment from chatgpt-codex-connector Bot May 27, 2026

torrid-fish force-pushed the feat/local-unidic branch from 95c7ae1 to 80c55b9 Compare May 27, 2026 11:25

torrid-fish mentioned this pull request May 28, 2026

ci: add CD pipeline to GHCR with hardened Dockerfile #55

Merged

torrid-fish force-pushed the feat/local-unidic branch from ae50410 to 41620f8 Compare May 28, 2026 13:36

torrid-fish changed the base branch from main to ci/cd-ghcr May 28, 2026 13:36

This was referenced May 28, 2026

Rule-based post accent correction via Yahoo MA POS metadata #48

Open

Evaluate local UniDic (fugashi/Sudachi) for in-process accent + lower latency #50

Open

torrid-fish marked this pull request as ready for review May 29, 2026 12:11

Base automatically changed from ci/cd-ghcr to main May 30, 2026 11:40

torrid-fish force-pushed the feat/local-unidic branch from 55b2117 to 01fcd71 Compare May 30, 2026 11:53

wade00754 requested changes May 30, 2026

Comment thread api/accent/tokenizer.py Outdated

Comment thread api/accent/pipeline.py

Comment thread api/accent/routes.py

Comment thread api/accent/pipeline.py Outdated

Comment thread api/accent/user_patches.py

torrid-fish self-assigned this May 30, 2026

torrid-fish and others added 13 commits June 4, 2026 08:00

torrid-fish and others added 16 commits June 4, 2026 08:06

chore: update ignore file and remove unused test scripts

a6ab1e3

chore: remove unused test

e746364

docs: update root readme

7bf63ec

feat: add version-selectable dict cwj and csj script

c8f7811

perf(accent): defer tokeniser-count debug log formatting

eb72d9d

torrid-fish force-pushed the feat/local-unidic branch from ef5f097 to c509f70 Compare June 4, 2026 08:07

torrid-fish requested a review from wade00754 June 4, 2026 08:22

This was referenced Jun 4, 2026

Fallback to UniDic aType-derived lexical pitch when OJAD is unavailable #61

Open

[Epic] Make MarkAccent a standalone pitch-accent service (drop OJAD dependency) #62

Open

feat(accent): local UniDic + POS-driven patches #53

torrid-fish wants to merge 29 commits into main from feat/local-unidic

Purpose

PR structure (19 commits, base ci/cd-ghcr → main)

Architecture

Package layout

Local UniDic replaces Yahoo MA

Strong-mode fields

Dictionary variant selection

Hide internal MA metadata from response

Unify chunking between two endpoints

Rendering polish

Verified locally

Disambiguation probe

UniDic Docker tweak (the only Docker/CI commit still in this PR)

Out of scope

Notes

PR structure (19 commits, base `ci/cd-ghcr` → `main`)