feat(accent): local UniDic + POS-driven patches#53

Open
torrid-fish wants to merge 29 commits into main from feat/local-unidic

Conversation

@torrid-fish torrid-fish commented May 27, 2026

Member

Purpose

Closes #45, Closes #48, Closes #50, Closes #57, Closes #58.
Supersedes (now-closed) PRs #47 and #49.
Replaces #51 (same branch, renamed spike/local-unidic → feat/local-unidic;
the rename closed #51, so this is its continuation).

Issue #50 was explicitly framed as an alternative architecture to #48 ("two architectures targeting the same goal. Production will likely pick ONE"). This PR is the GO outcome: local fugashi + NINJAL UniDic CWJ 2025-12-31 in-process replaces the Yahoo MA HTTP round-trip, while keeping OJAD for surface-level phrase pitch and keeping every POS-driven rule originally introduced in PR #49.

This is the single, complete migration PR — it targets main directly and folds in the foundation that was previously proposed as the standalone PR #47 (regex reading-override layer, Needleman-Wunsch DP aligner, /MarkAccent/stream/ NDJSON endpoint). Since the Yahoo-backed intermediate from #47 is a stepping stone that won't ship independently, #47 was closed and its commits are the base of this branch rather than a separate merge.

PR structure (19 commits, base ci/cd-ghcr → main)

Stacked on top of PR #55 (ci/cd-ghcr → main), which carries the generic CI/CD groundwork (GHCR workflow, Node 24 bumps, non-root Dockerfile). Once #55 merges, this PR's base auto-retargets back to main and the diff stays the same: only the foundation, the UniDic migration, and the UniDic-specific Docker tweak remain here.

ci/cd-ghcr → foundation (3) → UniDic migration (15) → UniDic Docker tweak (1):

Foundation (was #47): Needleman-Wunsch DP aligner (+ rendaku fold), regex reading-override layer + URL/non-JP preprocessing, and the /MarkAccent/stream/ NDJSON endpoint.

UniDic migration (#50): swap Yahoo MA → fugashi+UniDic, drop Yahoo config, add tokenizer / preprocess / postprocess / user_patches modules, UniDic strong-mode fields, English README.

UniDic Docker tweak: scripts/download_unidic.sh at build time (depends on the unidic pip dep that lands in this PR — that's why it can't live in #55).

Rebased onto current main (which includes the #52 package refactor and the #46 Docker compose setup). The Docker-scaffolding commit from the original spike was dropped during rebase — already upstream via #46. Commit subjects were reworded to stay under 75 chars (so exact hashes aren't pinned here).

Architecture

flowchart LR
    text([text input])
    fugashi["fugashi (MeCab) + UniDic CWJ 2025-12-31<br/>surface · kana · lemma<br/>pos · conjugation · aType"]
    ojad["OJAD scrape (suzukikun)<br/>per-mora surface pitch contour"]
    align["DP alignment<br/><code>align_accent</code>"]
    patches["<code>apply_accent_patches</code><br/>POS-driven rules"]
    out([WordAccentResult])

    text --> fugashi
    text --> ojad
    fugashi -- tokens + POS --> align
    ojad -- per-mora pitch --> align
    align --> patches
    fugashi -. POS metadata .-> patches
    patches --> out

Simplified schematic. For the full pipeline data-flow diagram see api/accent/README.md → "Data flow".

Yahoo MA HTTP endpoint fully removed (api/accent/furigana.py and the /MarkFurigana/ route deleted). Everything #49 introduced (POS-driven apply_accent_patches, pos_match override predicate, 5 POS columns on WordResult/WordAccentResult) carries over to UniDic identically — same UniDic schema, only the upstream changed.

Package layout

  • fugashi + UniDic tokenizer → api/accent/tokenizer.py
  • preprocessing (URL / comma / × strip + restore) → api/accent/preprocess.py
  • postprocessing passes (heiban-particle flatten, 助詞 furigana suppression, punct suppression, katakana toggle) → api/accent/postprocess.py
  • POS-driven patches + user-maintained patch table → api/accent/reading_overrides.py + api/accent/user_patches.py
  • strong-mode fields → api/accent/models.py
  • chunked orchestrator (_build_chunks, _schedule_chunks) → api/accent/pipeline.py

Local UniDic replaces Yahoo MA

  • _fetch_yahoo_raw → tag_local, using a singleton fugashi.Tagger() + UniDic CWJ 2025-12-31
  • Field mapping: feat.pos1 → pos, feat.pos2 → pos1, feat.cType → conjugation_type, feat.cForm → conjugation_form, feat.lemma → base (with -gloss suffix stripped), feat.aType → lexical_kernel (parsed for multi-reading via _parse_atype)
  • Reading uses feat.kana over feat.pron — UniDic stores 忙しい as kana=イソガシイ (matches OJAD's ortho-kana) while pron=イソガシー with chōonpu would never align

Strong-mode fields

  • lexical_kernel: int | None — aType primary (0=heiban, N≥1=kernel on mora N)
  • lexical_kernel_alts: list[int] | None — multi-reading alternates (e.g. aType="2,0" → [2, 0])
  • kernel_absorbed: bool — UniDic says kernel exists but OJAD has no FALL in this word's range (connected-speech sandhi case, e.g. 忙しい inside お忙しい中)
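
The aType parsing described above can be sketched as follows. This is a minimal illustration, not the PR's `_parse_atype`: the function name `parse_atype`, the treatment of `*` as "no value", and returning `None` for the alternates when only one reading exists are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def parse_atype(raw: Optional[str]) -> Tuple[Optional[int], Optional[List[int]]]:
    """Return (lexical_kernel, lexical_kernel_alts) from a UniDic aType string.

    Hypothetical helper: "2,0" -> (2, [2, 0]); "0" -> (0, None).
    """
    if not raw or raw == "*":  # assumption: "*" marks a missing feature
        return None, None
    values = [int(part) for part in (p.strip() for p in raw.split(",")) if part.isdigit()]
    if not values:
        return None, None
    # Assumption: alternates are only reported when more than one reading exists.
    alts = values if len(values) > 1 else None
    return values[0], alts
```

With this shape, `lexical_kernel` is always the first listed reading, and a client that ignores `lexical_kernel_alts` still gets a usable primary value.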

Dictionary variant selection

NINJAL publishes two UniDic variants — the download script supports both:

| Tag | Variant | Training corpus | Use case |
| --- | --- | --- | --- |
| cwj-2025-12-31 | CWJ (現代書き言葉) | BCCWJ (written) | Articles, novels, web text (default) |
| csj-2025-12-31 | CSJ (現代話し言葉) | CEJC (spoken) | Conversational / transcribed speech |

./scripts/download_unidic.sh cwj-2025-12-31   # default (written)
./scripts/download_unidic.sh csj-2025-12-31   # spoken-language variant
./scripts/download_unidic.sh cwj-2021-08-31   # older CWJ 3.1.0

CWJ is the default because this service primarily processes written input. CSJ may be preferable when processing conversational or transcribed speech. The Dockerfile defaults to CWJ; override by changing the script argument in the builder stage.

Hide internal MA metadata from response

base/pos/pos1/conjugation_type/conjugation_form marked Field(exclude=True) — kept on the model for in-pipeline use by apply_accent_patches, excluded from JSON serialization. Cleaner client contract; nothing functional changed.

Unify chunking between two endpoints

_build_chunks / _schedule_chunks shared by /MarkAccent/ and /MarkAccent/stream/. Both endpoints emit byte-identical per-chunk word lists; only delivery shape differs (collected vs streamed NDJSON).

Rendering polish

| What | Why |
| --- | --- |
| Polish furigana/accent on punct, symbols, toggles | Existing edge cases pre-spike |
| Katakana with render_katakana_furigana=False keeps accent (only clears furigana) | Per-mora pitch overlay can still render against the surface text |
| 助詞 (pos=助詞) tokens clear their redundant top-level furigana | hiragana ruby on hiragana surface is visual duplication; was crowding out pitch overlay |
| After heiban predecessor, の/な/は/が pitch flattened to LOW | OJAD's HIGH-plateau continuation was visual noise without contour change |
| \d × \d swapped → \d/\d pre-pipeline, × restored on surface post-alignment | OJAD merges 19×19 → 1919 (千九百十九); swap forces independent reading per number |
| Numeric _match_cost adds tiny per-empty-OJAD-entry cost | DP was splitting 19/19 as 1+7 morae instead of 4+4 because all cost-0 ties picked an arbitrary path |
| User-maintained patch table (user_patches.py) with explicit per-mora int tuples | Hand-fix OJAD/UniDic misreads without touching pipeline code |
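
As an illustration of the × swap and restore step, here is a rough sketch. The names `swap_times` and `restore_times` are hypothetical, the pattern assumes no spaces around `×`, and restoring by character offset is one possible approach rather than what the pipeline necessarily does.

```python
import re
from typing import List, Tuple

# "×" between two digits only; other uses of "×" are left alone.
_TIMES = re.compile(r"(?<=\d)×(?=\d)")


def swap_times(text: str) -> Tuple[str, List[int]]:
    """Replace digit×digit with digit/digit and remember where it happened."""
    positions = [m.start() for m in _TIMES.finditer(text)]
    return _TIMES.sub("/", text), positions


def restore_times(surfaces: List[str], positions: List[int]) -> List[str]:
    """Put "×" back on token surfaces, using character offsets into the text.

    Works because "×" and "/" are both one character, so offsets are stable.
    A "/" that was in the input all along is not touched.
    """
    wanted = set(positions)
    restored, offset = [], 0
    for surface in surfaces:
        chars = list(surface)
        for i, ch in enumerate(chars):
            if ch == "/" and offset + i in wanted:
                chars[i] = "×"
        restored.append("".join(chars))
        offset += len(surface)
    return restored
```

The sketch assumes the token surfaces concatenate back to the swapped text without gaps.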

Verified locally

  • ruff check, ruff format --check pass (17 files); import main + all accent submodules import clean
  • 学校に行きます。今日は寒いですね。19×19の格子。 — heiban-particle flatten (に stays 1, は/が/の/な flatten after heiban), 助詞 furigana cleared, 19×19 splits 4+0+4 not 1+7
  • Response JSON contains zero of base/pos/pos1/conjugation_type/conjugation_form (Field(exclude=True))
  • /MarkAccent/stream/ returns NDJSON with {chunk, subchunk, status, result, error}

Disambiguation probe

All 11 cases from #49's table still pass — POS gates fire identically because the UniDic schema is the same upstream of the rule layer.

UniDic Docker tweak (the only Docker/CI commit still in this PR)

  • UniDic dict baked into the image: the unidic pip package ships only the loader, not the ~1.3GB dicdir, so fugashi.Tagger() would fail at runtime without the dictionary. Dockerfile now runs scripts/download_unidic.sh in the builder stage as its own cache layer → the image is self-contained, no runtime download.

The CD workflow, Node 24 bumps, and non-root runtime are in #55 (the base of this stacked PR).

The published CD image after both #55 and this PR is ~2.1GB (base + venv + UniDic CWJ 2025-12-31). Smoke-tested end-to-end before the split: build + push to GHCR green, image runs as uid 10001, and POST /api/MarkAccent/ returns 200 with correct accent output.

Out of scope

  • i-adjective た形 cross-token (高かった = 高 + かっ + た)
  • Loanword 3-mora rule (バナナ, ピアノ)
  • 数詞 + counter irregulars beyond date/age (4時, 7時, 9時, 1分, 1人, 2人)
  • Replacing OJAD entirely — still needed for surface-level phrase pitch; UniDic aType is lexical (dictionary-form), not connected-speech contour

Notes

🤖 Generated with Claude Code

@torrid-fish torrid-fish changed the title from "feat(accent): local UniDic + POS-driven patches (closes #48, #50)" to "feat(accent): local UniDic + POS-driven patches" May 27, 2026
@torrid-fish torrid-fish marked this pull request as draft May 27, 2026 11:21
@sessatakuma sessatakuma deleted a comment from chatgpt-codex-connector Bot May 27, 2026
@torrid-fish torrid-fish changed the base branch from main to ci/cd-ghcr May 28, 2026 13:36
@torrid-fish torrid-fish marked this pull request as ready for review May 29, 2026 12:11
Base automatically changed from ci/cd-ghcr to main May 30, 2026 11:40
Comment thread api/accent/tokenizer.py Outdated
Comment thread api/accent/pipeline.py
Comment thread api/accent/routes.py
Comment thread api/accent/pipeline.py Outdated
Comment thread api/accent/user_patches.py
@torrid-fish torrid-fish self-assigned this May 30, 2026
torrid-fish and others added 13 commits June 4, 2026 08:00
The greedy aligner had two failure modes that cascaded across whole
sentences: a numeric anchor that over-consumed when Yahoo and OJAD
disagreed on phrase boundary, and a +1 fallback path that turned a
single mismatch into type-0 fallback for every downstream token.

Replaces it with a global DP over (yahoo_token, ojad_entry) pairs:
each Yahoo token consumes k ∈ [0, K_MAX] contiguous OJAD entries,
with per-token cost computed via shape (punct/numeric/kana) and edit
distance over rendaku-folded strings for kana tokens. Sub cost
(0.4) is lower than ins/del (1.0) so the DP prefers same-length
spans with substitutions over shorter spans with deletions — fixes
the case where OJAD's `う` from `等→とう` leaked onto the next token.
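
A stripped-down sketch of that DP, for readers who want the shape of the recurrence. It keeps only the kana branch (weighted edit distance with the 0.4 / 1.0 costs quoted above); the `K_MAX` value, the function names, and treating each OJAD entry as a short kana string are assumptions, and the punct/numeric shapes are omitted.

```python
from typing import List

SUB, INDEL = 0.4, 1.0  # costs quoted in the commit message
K_MAX = 4              # hypothetical cap on entries per token


def edit_cost(a: str, b: str) -> float:
    """Weighted Levenshtein distance between two kana strings."""
    prev = [j * INDEL for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * INDEL]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + INDEL,                           # drop ca
                cur[j - 1] + INDEL,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else SUB),  # match / substitute
            ))
        prev = cur
    return prev[-1]


def align(tokens: List[str], entries: List[str]) -> List[int]:
    """Return k per token: how many contiguous entries each token consumes."""
    n, m = len(tokens), len(entries)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in range(min(K_MAX, j) + 1):
                if best[i - 1][j - k] == inf:
                    continue
                span = "".join(entries[j - k:j])
                cost = best[i - 1][j - k] + edit_cost(tokens[i - 1], span)
                if cost < best[i][j]:
                    best[i][j], back[i][j] = cost, k
    if best[n][m] == inf:
        raise ValueError("entries cannot be covered within K_MAX per token")
    spans, j = [], m
    for i in range(n, 0, -1):  # walk the back-pointers from the end
        spans.append(back[i][j])
        j -= back[i][j]
    return spans[::-1]
```

Because the optimum is global, a single mismatch raises the total cost locally instead of shifting every later token, which is the cascade the greedy aligner suffered from.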

Adds a voicing-fold table so Yahoo's dictionary-form readings
(ふんかん) align against OJAD's pronounced readings with rendaku
(ぷんかん). All comparisons under this fold; ぱ/ば/ぷ/ぶ all alias
to は/ふ.
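
One way to get the same fold without a hand-written table is Unicode canonical decomposition, sketched below. Note this swaps in a different technique from the table the commit describes, and `fold_voicing` is a hypothetical name.

```python
import unicodedata

# Combining dakuten (U+3099) and handakuten (U+309A).
_VOICING_MARKS = {"\u3099", "\u309a"}


def fold_voicing(kana: str) -> str:
    """Map voiced / semi-voiced kana to their unvoiced base (ぷ -> ふ, ば -> は)."""
    decomposed = unicodedata.normalize("NFD", kana)
    return "".join(ch for ch in decomposed if ch not in _VOICING_MARKS)
```

Comparing `fold_voicing(a)` with `fold_voicing(b)` then treats rendaku pairs such as ふんかん / ぷんかん as equal.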

Refs #47.
Add api/accent/reading_overrides.py — a context-blind correction layer
sitting between Yahoo Furigana and OJAD alignment. Each override is a
regex on the concatenated surface text plus the replacement tokens that
should appear instead. Covers:

- 曜日 brackets: (月)/(月)→ げつ, (土) → ど, etc. for all 7 weekdays.
- All 31 day-of-month readings: 1日 → ついたち (atamadaka), 5日 → いつか,
  14日 → じゅうよっか, 20日 → はつか, etc.
- N日間 durations 1-31: 1日間 → いちにちかん (NOT ついたちかん since
  the 1st-of-month reading is impossible for a duration), 7日間 →
  しちにちかん (modern technical writing preference over なのかかん).
- 20歳 / 二十歳 / 20才 → はたち (the only irregular age reading).

Patterns accept arabic / full-width / kanji numeral variants of the
same N so `3月5日(土)` / `3月5日(土)` / `三月五日(土)` all trigger
the same overrides. Order-of-overrides matters: duration list precedes
date list so `N日間` wins over `N日` at the same start (longer match
breaks ties in _collect_matches).
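
The numeral variants and the longest-match tie-break can be sketched like this. Only the two readings for 1 quoted above are used; `numeral_variants`, `collect_matches`, and the tuple layout are illustrative and not the real `_collect_matches`, and the sketch does not guard against matching inside a longer number such as `11日`.

```python
import re
from typing import List, Tuple


def numeral_variants(ascii_digits: str, kanji: str) -> str:
    """Regex alternation for ASCII, full-width, and kanji spellings of one number."""
    full_width = ascii_digits.translate({ord(d): ord(d) + 0xFEE0 for d in "0123456789"})
    return "(?:" + "|".join(re.escape(v) for v in (ascii_digits, full_width, kanji)) + ")"


ONE = numeral_variants("1", "一")
OVERRIDES = [  # duration listed before date, as in the commit
    (re.compile(ONE + "日間"), "いちにちかん"),
    (re.compile(ONE + "日"), "ついたち"),
]


def collect_matches(text: str) -> List[Tuple[str, str]]:
    """Pick non-overlapping matches; at the same start the longer match wins."""
    candidates = []
    for order, (pattern, reading) in enumerate(OVERRIDES):
        for m in pattern.finditer(text):
            candidates.append((m.start(), -(m.end() - m.start()), order, m.end(), reading))
    candidates.sort()  # by start, then longest first, then list order
    chosen, cursor = [], 0
    for start, _neg_len, _order, end, reading in candidates:
        if start >= cursor:
            chosen.append((text[start:end], reading))
            cursor = end
    return chosen
```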

apply_furigana_overrides runs BEFORE align_accent so merged spans like
`5日→いつか` reach OJAD as a single token whose furigana matches OJAD's
phrase reading (the numeric-anchor logic in align_accent otherwise
cascades-fails because numeric tokens lack any Yahoo furigana).
apply_accent_overrides runs AFTER align to re-stamp both furigana and
accent on the same matched spans, so the response is consistent.

Adds URL preprocessing: each https?:// is swapped for the placeholder
"URLPLACEHOLDER" before the pipeline runs (Yahoo fragments URLs across
several alphabet tokens; OJAD's phrasing scraper produces noise for
Latin punctuation runs — both drag alignment off-rail). Placeholders
are walked back to the originals in order after alignment. URL body
stops at whitespace, any Japanese char, or `,()<>[]"'` so embedded
URLs strip cleanly.
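
A rough sketch of the swap-and-restore described above. The function names and the exact character ranges used for "any Japanese char" are assumptions; only the placeholder string and the stop characters come from this message.

```python
import re
from typing import List, Tuple

PLACEHOLDER = "URLPLACEHOLDER"

# URL body stops at whitespace, kana / CJK / full-width forms, or , ( ) < > [ ] " '
_URL_RE = re.compile(
    r"""https?://[^\s\u3000-\u30ff\u3400-\u9fff\uff00-\uffef,()<>\[\]"']+"""
)


def strip_urls(text: str) -> Tuple[str, List[str]]:
    """Swap every URL for the placeholder and return the originals in order."""
    urls: List[str] = []

    def _swap(match: "re.Match[str]") -> str:
        urls.append(match.group(0))
        return PLACEHOLDER

    return _URL_RE.sub(_swap, text), urls


def restore_urls(surfaces: List[str], urls: List[str]) -> List[str]:
    """Walk token surfaces and put the original URLs back in input order."""
    pending = iter(urls)
    restored = []
    for surface in surfaces:
        parts = surface.split(PLACEHOLDER)
        rebuilt = parts[0]
        for tail in parts[1:]:
            rebuilt += next(pending, PLACEHOLDER) + tail
        restored.append(rebuilt)
    return restored
```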

Adds a non-Japanese short-circuit: if (after URL stripping) the chunk
contains no hiragana / katakana / CJK ideograph, skip Yahoo + OJAD
entirely and echo the chunk back as a single token. Lets pure-URL /
pure-English lines stream through cheaply.
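
The short-circuit amounts to a character-class check before any expensive work. A minimal sketch, assuming these Unicode ranges and a hypothetical `process_chunk` wrapper with a simplified token dict:

```python
import re
from typing import Callable, Dict, List

# Hiragana, katakana letters, CJK Extension A, CJK Unified Ideographs.
_JP_RE = re.compile(r"[\u3041-\u3096\u30a1-\u30fa\u3400-\u4dbf\u4e00-\u9fff]")


def has_japanese(text: str) -> bool:
    return bool(_JP_RE.search(text))


def process_chunk(text: str, run_pipeline: Callable[[str], List[Dict]]) -> List[Dict]:
    """Echo non-Japanese chunks as one token; otherwise run the full pipeline."""
    if not has_japanese(text):
        # Simplified token shape, for illustration only.
        return [{"surface": text, "furigana": "", "accent": []}]
    return run_pipeline(text)
```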

Also adds stream_accent_chunks() to pipeline.py as a helper used by
the streaming endpoint added in the next commit. Splits the input on
\n then on full-width sentence terminators (。!?.) — long
paragraphs degrade OJAD's phrasing predictor and parallelising across
sentences caps the latency. In-flight work is bounded by a semaphore
(concurrency=4) because OJAD's u-tokyo backend falls over with 30+
parallel scrapes.
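
In outline, the chunk fan-out could look like the sketch below. It is not the body of `stream_accent_chunks`: the names, the choice of the full-width terminators 。!?., and the async-generator shape are assumptions. Only the split order and `concurrency=4` come from this message.

```python
import asyncio
import re
from typing import AsyncIterator, Awaitable, Callable, List, Tuple

# Split after a full-width sentence terminator, keeping it on the sentence.
_AFTER_TERMINATOR = re.compile(r"(?<=[。!?.])")


def build_chunks(text: str) -> List[Tuple[int, int, str]]:
    """Return (line_idx, sub_idx, sentence) triples in input order."""
    chunks = []
    for line_idx, line in enumerate(text.split("\n")):
        pieces = [p for p in _AFTER_TERMINATOR.split(line) if p]
        for sub_idx, piece in enumerate(pieces):
            chunks.append((line_idx, sub_idx, piece))
    return chunks


async def stream_chunks(
    text: str,
    process: Callable[[str], Awaitable[object]],
    concurrency: int = 4,
) -> AsyncIterator[Tuple[int, int, object]]:
    """Process chunks concurrently (bounded) but yield results in input order."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(piece: str) -> object:
        async with semaphore:  # at most `concurrency` calls in flight
            return await process(piece)

    chunks = build_chunks(text)
    tasks = [asyncio.create_task(guarded(piece)) for _, _, piece in chunks]
    for (line_idx, sub_idx, _), task in zip(chunks, tasks):
        yield line_idx, sub_idx, await task
```

Awaiting the tasks in submission order keeps document order even when a later sentence finishes first.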

main.py docstring updated to reflect /MarkAccent/stream/.

Refs #47.
Add a streaming variant of /MarkAccent/ that processes the input as a
sequence of (line, sentence) chunks and emits one NDJSON object per
chunk in input order. Each line carries `{"chunk": line_idx,
"subchunk": sub_idx, ...AccentResponse}` so clients can render output
incrementally while keeping document position. Underlying chunk-fanout
and concurrency limiting live in pipeline.stream_accent_chunks; the
route is a thin StreamingResponse wrapper.
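
For reference, one NDJSON record per chunk can be produced as below. A minimal sketch: `ndjson_line` is a hypothetical helper, and the response is modelled as a plain dict rather than the real AccentResponse model.

```python
import json
from typing import Any, Dict


def ndjson_line(line_idx: int, sub_idx: int, response: Dict[str, Any]) -> str:
    """Serialise one chunk as a single JSON object terminated by a newline."""
    record = {"chunk": line_idx, "subchunk": sub_idx, **response}
    # ensure_ascii=False keeps kana readable; json.dumps never emits raw newlines.
    return json.dumps(record, ensure_ascii=False) + "\n"
```

A client can then split the stream on `\n` and call `json.loads` on each line as it arrives.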

Streaming benefits compound: OJAD's phrasing predictor degrades on
long inputs (a single misaligned mora cascades across the paragraph),
so per-sentence chunks both stay short enough for OJAD to handle and
fan out under the bounded semaphore.

Also adds test.sh — a small bash smoke-test helper that POSTs a sample
text to either /MarkAccent/ or /MarkFurigana/ and pretty-prints the
per-moji (surface|furigana|accent_marking_type) rows. STREAM=1 switches
to the streaming endpoint, ENDPOINT= picks which router. Useful while
iterating on overrides; not wired into CI.

.gitignore adds data/ and output/ for ad-hoc test fixtures we don't
want committed.

Refs #47.
Replace the Yahoo Furigana HTTP path with in-process fugashi + NINJAL
UniDic 3.1.0. The migration adds three new layers inside the `api/accent/`
package plus a sentence-level chunked streaming endpoint:

  * `tokenizer.py` — singleton `fugashi.Tagger`; maps UniDic features into
    the existing `WordResult` shape plus new strong-mode fields
    `lexical_kernel` / `lexical_kernel_alts` (parsed from `aType`).
  * `preprocess.py` — pre-alignment text rewrites (URL strip, western-
    grouped thousands `1,234`→`1234`, `\d×\d`→`\d/\d`), `has_japanese`
    short-circuit gate, sentence splitting, readable-symbol (`2%`, `15℃`)
    pre-merge.
  * `postprocess.py` — rendering passes: heiban-particle accent flatten
    (の/な/は/が after a 平板調 word), pure-punct furigana suppression,
    English / katakana toggle handling, 助詞 furigana suppression.
  * `reading_overrides.py` — moved into the package; regex overrides for
    日付/N日間/20歳/曜日 plus the POS-driven `apply_accent_patches` rule
    for ます / たい first-mora FALL.

`align.py` is upgraded to a Needleman-Wunsch DP over (token, OJAD-entry)
pairs with weighted edit distance, rendaku voicing fold, an OJAD-punct
guard, and a numeric tiebreaker that fixes the `19×19` 1+7-split bug.

`models.py` extends `Request` with `render_english_furigana` /
`render_katakana_furigana` toggles, adds POS metadata fields (excluded
from serialization) plus strong-mode lexical-accent fields exposed in
JSON, and drops the standalone `FuriganaResponse`.

`routes.py` exposes `/api/MarkAccent/` (collected) and
`/api/MarkAccent/stream/` (NDJSON per chunk); both share
`pipeline.build_chunks` + `pipeline.schedule_chunks` for byte-identical
per-chunk results. The standalone MarkFurigana endpoint is removed —
there is no in-process equivalent for the Yahoo Furigana service.

`main.py` drops the slowapi rate limiter, CORS, trusted-host, and
X-API-KEY middleware — the service is now expected to run behind the
parent backend on a private network. `config/settings.py` is reduced to
just `load_dotenv()` accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refresh the data-flow diagram, file-responsibilities table, and
dependency graph to reflect the new layout (tokenizer / preprocess /
postprocess / reading_overrides). Update the alignment-algorithm
section to describe the DP / `_match_cost` / voicing fold / OJAD-punct
guard / numeric tiebreaker that replaced the old greedy implementation.

Append three new sections documenting layers that didn't exist when the
README was first written:

  * **Surface overrides + POS patches** — regex `OVERRIDES` list shape,
    apply_furigana_overrides vs apply_accent_overrides, POS-driven
    `_is_masu_auxiliary` / `_is_tai_auxiliary` predicates.
  * **Postprocess passes** — the four idempotent passes that run after
    align + overrides + patches, with the rationale for their order.
  * **Local UniDic tokeniser** — feature → WordResult mapping,
    `feat.kana` vs `feat.pron` choice, `*` null handling, Field(
    exclude=True) on POS metadata.

Also drops the MarkFurigana row from the endpoint table (the endpoint
was removed in the local-UniDic migration) and updates the "Adding
endpoints / overrides" section to reference the new file names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot the spike's investigation artefacts so future readers can
reconstruct the GO/NO-GO decision and the test cases that drove the
DP-aligner and POS-patch design:

  * docs/spike-local-unidic.md — phased measurement report (verb forms,
    て-form, long sentences) culminating in the GO recommendation.
  * docs/spike-local-unidic-runbook.md — runbook for replaying the
    spike with `uv run scripts/spike_local_unidic.py`.
  * scripts/spike_local_unidic.py — end-to-end Yahoo-vs-UniDic
    comparison harness against the existing OJAD pipeline.
  * scripts/probe_verb_forms.py — generates verb-form coverage
    matrices for the DP-aligner regression suite.
  * scripts/probe_te_and_long.py — exercises te-form chains and long
    sentences where OJAD's CRF was most likely to absorb kernels.
  * scripts/smoke_test_partial.py — minimal in-process smoke test
    against `_process_accent_chunk` for quick iteration.

These scripts still reference the pre-refactor `api.accent_marker`
monolith paths intentionally — they were the artefacts the spike
produced, and rewriting them would lose the audit trail. Rerunning
them in the new layout would require trivial import updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fugashi's split of G2P / PSP-1000 / Wifi.7 / 12.5 into single-piece
tokens lets the DP shuffle OJAD's morae onto the wrong child — `ーにピ`
floating above `2` in G2P, `てん` leaking onto `5` in `12.5`, and the
accent CRF collapsing on everything after `Wifi.7`. Fuse those runs
into one token before alignment so each kind flows through one branch.

- tokenizer.tag_local: glue contiguous (alpha|digit) runs, bridging
  `-` / `_` / `.` between alpha/digit pieces via look-ahead.
  Letter-less runs whose joined surface matches NUMERIC_PATTERN
  (`12.5`, `0.5`) get a decimal merge instead. fugashi's `white_space`
  attribute gates the merge so `Hello world` and `API key` stay split.
- align: new `is_english_compound` free-consume branch in `_match_cost`,
  reordered ahead of the OJAD-punct guard so a merged acronym can
  swallow the `。` OJAD inserts when it normalises `.`.
  `_build_word_result` filters those punct entries from the rendered
  accent so they don't surface as ruby when the English toggle is on.
- preprocess.strip_acronym_dots_for_ojad: OJAD-only strip — OJAD's
  `.` → `。` normalisation collapses its prosody CRF on the rest of
  the sentence, so the OJAD query gets `Wifi7` while fugashi keeps
  the original `.` and the tokenizer merge preserves the user-visible
  `Wifi.7` surface.
- postprocess._is_pure_english_surface: accept `-` / `_` / `.` so the
  toggle wipe agrees with the aligner on which fused surfaces qualify.
- pipeline: thread the OJAD-only stripped text into `get_ojad_result`.
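
The tokenizer-side merge in the first bullet can be pictured with the sketch below. Tokens are modelled as `(surface, preceded_by_space)` pairs instead of fugashi nodes, the function name is hypothetical, and the separate decimal branch is not distinguished here (a `12.5` run simply comes out as one surface).

```python
import re
from typing import List, Tuple

_ALNUM = re.compile(r"[A-Za-z0-9]+")
_BRIDGES = {"-", "_", "."}


def merge_alnum_runs(tokens: List[Tuple[str, bool]]) -> List[str]:
    """Glue contiguous alpha/digit pieces, bridging - _ . via one-token look-ahead."""
    merged: List[str] = []
    in_run = False  # True while the last output item is an open alnum run
    i = 0
    while i < len(tokens):
        surface, spaced = tokens[i]
        is_alnum = bool(_ALNUM.fullmatch(surface))
        if in_run and not spaced:  # whitespace always closes the run
            if is_alnum:
                merged[-1] += surface
                i += 1
                continue
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if (surface in _BRIDGES and nxt is not None
                    and not nxt[1] and _ALNUM.fullmatch(nxt[0])):
                merged[-1] += surface + nxt[0]  # bridge char plus following piece
                i += 2
                continue
        merged.append(surface)
        in_run = is_alnum
        i += 1
    return merged
```

The look-ahead is what keeps a sentence-final `.` after an acronym from being pulled into the run.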

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OJAD silently elides English when interleaved with kana (probe in
spike-only scripts: Whisper inside `ふりがなWhisper`, satochin
inside `深掘りライターsatochin氏`, URLPLACEHOLDER after strip_urls
all come back with 0 OJAD morae). The aligner charged _FALLBACK_COST
for an english_compound token taking k=0, so the cheapest DP path
was to steal 1 mora from the neighbouring kana token to dodge the
3.0 penalty — paying ~1.0 edit-distance on the kana side instead.
That cascade left ふりがな missing trailing な (test_1 ×7), コメント
empty-spanned and falling through to the collapsed single-entry
fallback (test_0), ライター missing the trailing chōon ー (test_0),
and テスト missing the leading テ after a URL token (test_0).

Lower k=0 to 0.0 in the english_compound branch. Spelled-out cases
(`G2P` → ジーツーピー) still align correctly because forcing those
katakana morae onto a neighbouring kana token costs more edit-
distance than letting the english token absorb them at k≥1 cost 0.

Verified end-to-end against all 30 fixtures: 0 under-mora anomalies
remaining (previously 10 across test_0 and test_1). Also adds
scripts/run_10_tests.sh as a kept regression harness driving the
full corpus via a TESTS env override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the local fugashi + UniDic migration, the Yahoo-era scaffolding
is fully unreferenced:

- config/settings.py loaded YAHOO_API_KEY via dotenv but nothing
  imports it. config/__init__.py is empty.
- .env / .env.example only carried YAHOO_API_KEY (and an unread
  API_TOOLS_PORT). The application reads neither.
- scripts/spike_local_unidic.py was the Yahoo↔local comparison
  spike; the comparison is the merged work itself.
- scripts/probe_*.py and scripts/smoke_test_partial.py were
  spike-only debug tools.
- docs/spike-local-unidic*.md narrate work now landed.

Dockerfile drops `config` from the compileall/COPY lines.
docker-compose.yml drops the `env_file: .env` block (the
${API_TOOLS_PORT:-8000} fallback still works from shell env).
README.md trims the false "obtain a Yahoo API key" paragraph.

scripts/run_10_tests.sh stays — it's the 30-fixture regression
harness committed in the previous commit, not a spike artefact.

Verified post-prune: server reloads cleanly (HTTP 200 on a fresh
MarkAccent POST) and test_0 / test_15 / test_29 all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four follow-on features and a doc refresh on top of the spike fix.

1. SYMBOL_READINGS table in preprocess.py, consumed by
   tokenizer.tag_local. Standalone symbols (#, %, @, &, +, =, $, ¥,
   €, ℃, °, *, ~, §, plus full-width siblings) now get their spoken
   katakana reading instead of an empty furigana. The aligner's
   edit-distance branch matches the OJAD span at cost 0 rather than
   refusing it; the `#病` cascade that stole one mora from the next
   particle (test_0 idx 1411) is gone. suppress_punct_furigana also
   learns to skip these surfaces so the symbol's furigana + accent
   survive the post-alignment scrub.

2. split_okurigana in postprocess.py populates WordResult.subword
   when a token mixes kanji and kana. `聞き分け` →
   subword=[(聞,き),(き,""),(分,わ),(け,"")]. Top-level surface,
   furigana, and accent are unchanged — clients that ignore subword
   get the previous behaviour bit-for-bit. Irregular readings that
   can't be aligned against the surface kana fall back to no
   subword (no garbled segments). Across the 30-fixture corpus,
   1045 tokens in 30/30 files gain segments.

3. New `script` request arg: hiragana (default), katakana, or
   romaji. convert_furigana_script in postprocess.py rewrites every
   furigana field (top-level + per-mora + subword) before
   serialisation. Internal alignment stays hiragana. Default
   "hiragana" also normalises per-mora morae that OJAD echoed back
   as katakana (e.g. `ラ`/`イ` on ライター's accent[]) — the
   per-mora script is now consistent across surface types.

4. README rewritten in English: covers all five live endpoints
   (MarkAccent + UsageQuery + DictQuery + SentenceQuery), the full
   MarkAccent request body with the three new fields, response
   shape, examples, the regression harness, and the four known
   UniDic-vs-OJAD reading-mismatch tokens.
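
To make item 2 concrete, here is one way such a split could work. It is a sketch under assumptions, not the PR's `split_okurigana`: kanji runs become lazy regex groups, kana in the surface is matched literally against a hiragana reading, and consecutive kana (or consecutive kanji) are kept as one segment.

```python
import re
from typing import List, Optional, Tuple

_KANJI_RUN = re.compile(r"[\u4e00-\u9fff\u3005]+")  # CJK ideographs plus 々


def split_okurigana(surface: str, reading: str) -> Optional[List[Tuple[str, str]]]:
    """Return [(segment, furigana), ...] or None when no clean split exists."""
    parts: List[Tuple[str, bool]] = []  # (text, is_kanji)
    pattern, pos = "", 0
    for m in _KANJI_RUN.finditer(surface):
        if m.start() > pos:
            kana = surface[pos:m.start()]
            parts.append((kana, False))
            pattern += re.escape(kana)
        parts.append((m.group(0), True))
        pattern += "(.+?)"  # reading of this kanji run
        pos = m.end()
    if pos < len(surface):
        kana = surface[pos:]
        parts.append((kana, False))
        pattern += re.escape(kana)
    kinds = {is_kanji for _, is_kanji in parts}
    if kinds != {True, False}:
        return None  # only mixed kanji + kana tokens get segments
    match = re.fullmatch(pattern, reading)
    if match is None:
        return None  # irregular reading: fall back to no subword
    groups = iter(match.groups())
    return [(text, next(groups) if is_kanji else "") for text, is_kanji in parts]
```

Returning `None` rather than a best-effort split mirrors the "no garbled segments" fallback described in item 2.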

Re-profiled against the 30-fixture corpus after the changes:
0 under-mora anomalies (was 0 after the spike fix), 4 over-mora
cases (was 5 — the `#病→と` leak is fixed by the symbol table).
The remaining four over-mora cases are pre-existing UniDic context-
reading mismatches (世, 本当, 他, 寺) unrelated to this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The merge step was added before standalone symbols had a reading —
it glued `(2, %)` into one token whose surface (`2%`) matched
OJAD's phrase boundary so the パーセント morae wouldn't leak onto
the digit. Side-effect: the merged token's furigana came out as
`ごじゅうてんさんぱーせんと` for `50.3%`, with no way for a client
to render ruby specifically over `%`.

After the SYMBOL_READINGS work in the previous commit, `%` (and
its siblings `@`, `&`, `+`, `$`, `¥`, `€`, `℃`, `°`, …) already
carry their spoken katakana reading. The DP aligner matches each
symbol's furigana against the OJAD span at edit-distance 0, so the
パーセント morae no longer leak — the merge is redundant.

Removing it gives the user's preferred shape:
  `50.3%とは` → [50.3|ごじゅうてんさん] [%|ぱーせんと] [と] [は]

`READABLE_COMPOUND_RE` and the `is_readable_compound` branch in
align.py stay in place — nothing wired produces a compound surface
any more, but the dead branches are harmless and reading_overrides
could in principle still synthesise one.

Re-profiled all 30 fixtures: row counts unchanged, under-mora
anomalies 0, over-mora 4 (same UniDic context-reading mismatches
as before).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`apply_furigana_toggles` was clearing both furigana AND accent on
any surface matching `_is_pure_english_surface` — but fugashi+UniDic
hand back proper Japanese readings for unit compounds whose surface
happens to look ASCII (`53mm` → みりめーとる, `33m/s` →
めーとるまいびょう, `3kg` → きろぐらむ). With `render_english_furigana`
off (default), those unit tokens came back with empty furigana and
empty accent — the user saw `53mm` "escaped" entirely.

Skip the english wipe when the token's furigana already contains
any hiragana/katakana char. UniDic only fills a kana reading when
the surface IS a recognised Japanese unit / loanword token, so
truly foreign english (`Whisper`, `G2P`, `Apple`) still has
furigana==surface (no kana) and continues to be cleared.
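
The new guard reduces to two checks, sketched here. The function name and both regexes are illustrative; the real `_is_pure_english_surface` may accept a different character set.

```python
import re

_ASCII_SURFACE = re.compile(r"[A-Za-z0-9\-_.]+")
_KANA = re.compile(r"[\u3041-\u3096\u30a1-\u30fa]")  # hiragana / katakana letters


def should_wipe_english(surface: str, furigana: str) -> bool:
    """Wipe furigana + accent only for ASCII-looking tokens with no kana reading."""
    if not _ASCII_SURFACE.fullmatch(surface):
        return False
    return _KANA.search(furigana) is None
```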

Verified:
  - `53mm` → surface=`53mm`, furi=`53みりめーとる`,
    accent=[ご,じゅ,う,さ,ん,み,り,め,ー,と,る] with marks
  - `m/s` → surface=`m/s`, furi=`めーとるまいびょう`, full accent
  - `Whisper`, `G2P` → still wiped (no kana in furigana)

30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 4
(same pre-existing UniDic context-reading mismatches). 7 fixtures
gained rows where unit tokens previously were stripped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some inputs hit deterministic upstream quirks the local pipeline
can't fix (OJAD reading `33m/s` as `さんじゅうみっめーとるまいびょう`,
with a stray `みっ` from a CRF sound-change; UniDic giving a kanji
its lemma reading instead of the contextual one). Rather than
chase each with bespoke align/postprocess logic, give the caller a
maintenance file they can grow over time.

`api/accent/user_patches.py` exposes USER_PATCHES: a dict of
literal-match surface fragments to a tuple of (segment_surface,
segment_furigana) pairs. `reading_overrides._user_patch_overrides`
compiles those into FuriganaOverride entries appended to the
existing OVERRIDES list, so both the pre-OJAD furigana pass and the
post-alignment accent pass pick them up — the second pass rewrites
the contour with the prescribed reading.

Accent defaults to heiban via a new `_mora_seq` helper that splits
the reading into actual morae (so じゅ stays one entry, not two).
Power users can drop full FuriganaOverride objects into the
existing OVERRIDES section for atamadaka / per-mora custom marks.

Seeded with one entry for `33m/s` as a working example. Edit the
dict and re-run `./scripts/run_10_tests.sh` after each addition.

Verified: `33m/s` now comes back as
  [33|さんじゅうさん] [m/s|めーとるまいびょう]
instead of `33|さんじゅうみっ` + `m/s|めーとるまいびょう`.

30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 4
(same pre-existing UniDic-context mismatches — addressable by
adding USER_PATCHES entries case-by-case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
torrid-fish and others added 16 commits June 4, 2026 08:06
Extend the patch schema with an optional third element per segment so
users can prescribe a non-heiban contour:

    (surf, furi)                  → heiban (default)
    (surf, furi, "heiban")        → heiban (explicit)
    (surf, furi, "atamadaka")     → first-mora FALL, rest LOW
    (surf, furi, "low")           → all-LOW
    (surf, furi, (0, 1, 2))       → explicit per-mora types

The shape names live in `_accent_from_spec` in reading_overrides.py;
unknown specs warn and fall back to heiban. `_split_morae` is now
factored out so both `_mora_seq` and the new helper share the same
小さな仮名-attach mora splitter.

Seeded three patches for the pre-existing UniDic-vs-OJAD context-
reading mismatches in the 30-fixture corpus:

  - `本当の` → ほんとう / の  (heiban)
  - `他の`   → ほか (atamadaka) / の
  - `世にも` → よ / に / も  (heiban; demonstrates flatten-after-heiban
              naturally drops the trailing に to LOW)

Re-profile: under-mora 0 (unchanged), **over-mora 4 → 1**. The
remaining `寺` case (test_16, after `永昌寺という`) involves a
compound-boundary mis-tokenisation, not addressable by a simple
literal patch — left for a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous schema accepted 2-tuples (heiban default) plus three
named shapes ("heiban" / "atamadaka" / "low") as the accent_spec.
Convenience came at the cost of one rule per shape and a wall of
docs explaining which shape maps to what. Drop all of that — every
segment is now exactly `(surface, furigana, accent_ints)` with the
int tuple required and one entry per mora.

The shapes are trivially expressible as tuples:
  heiban    →  (1, 1, 1, ...)
  atamadaka →  (2, 0, 0, ...)
  low       →  (0, 0, 0, ...)

`_accent_from_spec` now returns `None` on any malformed spec and
the caller skips the whole patch entry (no per-segment fallback).
All four seeded patches are rewritten in the strict form.
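
A sketch of the strict validation, assuming the rule is simply "one int per mora". The function names are hypothetical stand-ins for `_split_morae` / `_accent_from_spec`, and the small-kana set is an assumption.

```python
from typing import List, Optional, Sequence, Tuple

_SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_morae(reading: str) -> List[str]:
    """Attach small kana to the preceding kana so じゅ stays one mora."""
    morae: List[str] = []
    for ch in reading:
        if ch in _SMALL_KANA and morae:
            morae[-1] += ch
        else:
            morae.append(ch)
    return morae


def accent_from_spec(furigana: str, accent_ints: object) -> Optional[List[Tuple[str, int]]]:
    """Return [(mora, type), ...] or None when the spec is malformed."""
    morae = split_morae(furigana)
    if not isinstance(accent_ints, tuple) or len(accent_ints) != len(morae):
        return None
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in accent_ints):
        return None
    return list(zip(morae, accent_ints))


def compile_patch(segments: Sequence[tuple]) -> Optional[list]:
    """Compile every segment or reject the whole entry (no per-segment fallback)."""
    compiled = []
    for segment in segments:
        if len(segment) != 3:
            return None
        surface, furigana, accent_ints = segment
        accent = accent_from_spec(furigana, accent_ints)
        if accent is None:
            return None
        compiled.append((surface, furigana, accent))
    return compiled
```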

Re-profile: under-mora 0, over-mora 1 (same `寺` compound boundary
case remains).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_age_overrides` merges `20歳` into a single WordResult with the
prescribed furigana `はたち` (3 morae), but OJAD still pronounces
the same surface as `にじゅっさい` (5 morae). The DP aligner's
kana branch couldn't grant the merged token 5 morae cheaply (the
edit-distance between `はたち` and `にじゅっさい` is huge), so it
allocated only 3 morae and the leftover `さい` cascaded onto the
following kana tokens — `20歳の私達へ` ended up with `の` getting
acc=[さ] and `私` getting acc=[い,の,わ,た,し].

Override-merged tokens carry no UniDic backing (both `base` and
`pos` are None — `ReplacementToken` doesn't set MA metadata).
Detect that combination in `_match_cost` and give the same
free-consume treatment as numeric / readable_compound: k=0 returns
_FALLBACK_COST so the DP prefers absorption, k≥1 returns 0 up to
a generous upper bound. `apply_accent_overrides` rewrites the accent
post-align, so whatever the DP picked up from OJAD is discarded.
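
The branch described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual `_match_cost`: the `Token` class, the constant values, and the infinite cost above the bound are hypothetical; only the `base`/`pos` check and the k=0 / k≥1 behaviour come from the description.

```python
from dataclasses import dataclass
from typing import Optional

FALLBACK_COST = 5.0     # hypothetical penalty value
SYNTH_MAX_MORAE = 12    # hypothetical "generous upper bound"


@dataclass
class Token:
    surface: str
    furigana: str
    base: Optional[str] = None   # UniDic lemma; None on override-merged tokens
    pos: Optional[str] = None    # UniDic part of speech; None on override-merged tokens


def is_synthesized(tok: Token) -> bool:
    """Override-merged tokens carry neither base nor pos."""
    return tok.base is None and tok.pos is None


def synthesized_cost(tok: Token, k: int) -> Optional[float]:
    """Cost of letting `tok` consume k OJAD morae.

    Returns None when the token is not synthesized, so a caller could fall
    through to its other branches.
    """
    if not is_synthesized(tok):
        return None
    if k == 0:
        return FALLBACK_COST          # consuming nothing is penalised
    if k <= SYNTH_MAX_MORAE:
        return 0.0                    # free consume, whatever the reading says
    return float("inf")               # assumption: beyond the bound is disallowed
```

Under these assumptions the merged `20歳` token can take all 5 morae of `にじゅっさい` at zero cost, which leaves nothing to spill onto `の` and `私`.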

Verified `20歳の私達へ`:
  20歳 → はたち [(は,2),(た,0),(ち,0)]
  の   → の    [(の,0)]
  私   → わたくし [(わ,0),(た,1),(し,2)]
  達   → たち  [(た,0),(ち,0)]
  へ   → へ    [(へ,0)]

30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 1
(same `寺` compound boundary case).

Also refreshes `api/accent/README.md`:
  - Adds `user_patches.py` to file map + new section documenting
    the strict 3-tuple schema and accent_ints shapes.
  - Documents the new synthesized branch in `_match_cost`.
  - Adds Request toggle table (render_english/katakana_furigana,
    script) and the unit-compound exception for english toggle.
  - Adds `split_okurigana` + `convert_furigana_script` to the
    postprocess pass list and updates the data-flow diagram.
  - Removes references to merge_readable_symbol_compounds (gone
    since the SYMBOL_READINGS refactor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `apply_furigana_toggles` cleared the top-level `furigana`
on pure-katakana tokens when `render_katakana_furigana=False`, but
left every `AccentInfo.furigana` populated (with hiragana morae).
Clients that draw ruby from the per-mora field rendered hiragana
copies (`ふ・ら・ん・つ`) on top of katakana surfaces (`フランツ`)
despite the toggle saying "no furigana" — the user-visible symptom
on inputs like `フランツ・ヨーゼフ・ハイドン` was katakana names
gaining unwanted ruby.

Clear every `AccentInfo.furigana` to `""` for those tokens while
keeping `accent_marking_type` and `length` intact, so clients
that draw pitch overlay against the surface chars can still
do so (length-aware iteration handles small kana like `ァ` / `ェ`).
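
A small sketch of the clearing step, assuming simplified dataclasses in place of the real response models. `AccentInfo.furigana`, `accent_marking_type`, and `length` are named above; the `WordResult.accent` list field, the katakana check, and the function name `apply_katakana_toggle` are assumptions made for the example:

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class AccentInfo:
    furigana: str
    accent_marking_type: int
    length: int


@dataclass
class WordResult:
    surface: str
    furigana: str
    accent: List[AccentInfo]   # assumed field name for the per-mora list


def is_pure_katakana(text: str) -> bool:
    """True if every char is katakana (U+30A1..U+30FA) or the long-vowel mark."""
    return bool(text) and all("ァ" <= ch <= "ヺ" or ch == "ー" for ch in text)


def apply_katakana_toggle(word: WordResult, render_katakana_furigana: bool) -> WordResult:
    """Clear top-level and per-mora furigana on pure-katakana tokens when the
    toggle is off; accent_marking_type and length are left untouched."""
    if render_katakana_furigana or not is_pure_katakana(word.surface):
        return word
    cleared = [replace(info, furigana="") for info in word.accent]
    return replace(word, furigana="", accent=cleared)
```

Because `length` survives, a client can still walk the surface string mora by mora and draw the pitch overlay without any ruby text.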

`render_katakana_furigana=True` is unaffected — both top-level
and per-mora furigana flow through normally.

30-fixture regression: 30/30 HTTP 200, anomalies unchanged (one
false-positive in the heuristic dropped because the cleared
per-mora field stops triggering the "collapsed entry" check).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full translation of the package-level documentation from Traditional
Chinese to English. Structure and section ordering preserved; same
mermaid data-flow diagram, same tables. Also folds in the changes
since the last refresh:

- Request toggle table documents the per-mora-furigana clear for
  the katakana toggle (the フランツ/Franz ruby-on-katakana fix).
- _match_cost branches list now includes the synthesized
  free-consume rule (override-merged 20歳 → はたち) alongside the
  english-compound rule (k=0 costs 0).
- Postprocess pass list calls out unit-compound exemption from the
  english toggle wipe (53mm, 33m/s, 3kg keep their reading).
- User-patches section uses the strict 3-tuple schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `unidic` pip package ships the loader but not the ~770MB dicdir, so
fugashi.Tagger() failed at runtime (missing mecabrc) and /api/MarkAccent/
returned 500. Run `unidic download` in the builder stage; the venv copy
into the final image carries the dict along.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The version-selectable dict script (01fcd71) switched the UniDic download
to curl, but python:3.11-slim ships without it — the docker build died
with exit 127 at the download step. Add curl next to unzip in the
builder-stage apt install (multi-stage, so the runtime image is unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The collected endpoint rebuilds AccentResponse from per-chunk results and
silently dropped the new `warning` field (#60), so OJAD-degraded responses
looked like full results. Keep the first chunk's warning, mirroring the
first_error convention. The stream endpoint already passes it through via
model_dump().
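
The "first one wins" merge can be illustrated with plain dicts. This is only a sketch: the real endpoint rebuilds a pydantic `AccentResponse`, and the `result` / `error` keys and both helper names here are assumptions.

```python
from typing import Any, Dict, List, Optional


def first_set(chunks: List[Dict[str, Any]], key: str) -> Optional[Any]:
    """Return the first truthy value stored under `key`, or None."""
    for chunk in chunks:
        value = chunk.get(key)
        if value:
            return value
    return None


def collect(chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-chunk results into one response, keeping the first error
    and the first warning instead of silently dropping them."""
    words: List[Any] = []
    for chunk in chunks:
        words.extend(chunk.get("result", []))
    return {
        "result": words,
        "error": first_set(chunks, "error"),
        "warning": first_set(chunks, "warning"),
    }
```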

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The global+if-None lazy init let two concurrent first requests each see
_TAGGER as None and build their own fugashi.Tagger(), reloading the
~1.3GB UniDic dictionary twice (raised in PR #53 review).
`functools.lru_cache(maxsize=1)` makes the lazy init atomic so only one
tagger is ever constructed.
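
For illustration, the pattern looks roughly like this. `FakeTagger` is a stand-in for `fugashi.Tagger` so the sketch runs without UniDic installed, and `get_tagger` is a hypothetical accessor name:

```python
import functools


class FakeTagger:
    """Stand-in for fugashi.Tagger; counts how many times it is constructed."""

    instances = 0

    def __init__(self) -> None:
        FakeTagger.instances += 1


@functools.lru_cache(maxsize=1)
def get_tagger() -> FakeTagger:
    """Build the tagger on first use and hand back the cached one afterwards."""
    return FakeTagger()


first = get_tagger()
second = get_tagger()   # served from the cache, no second construction
```

One caveat from the `functools` documentation: the cache itself is thread-safe, but the wrapped function can still run more than once if another thread calls it before the first call has been cached. If the tagger may be built from worker threads, wrapping the call in a `threading.Lock` would be one way to close that window.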

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A leftover '# extra う onto the following の' line was sitting between the
all-LOW and nakadaka rows of the accent-tuple table in the module
docstring (flagged in PR #53 review). Remove it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f-strings build the message eagerly even when the debug level is
disabled; pass the value as a lazy %-arg instead (PR #53 review).
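
The difference is easy to see with an object that counts how often it is rendered. The logger name and the `Expensive` class are made up for the example:

```python
import logging

logger = logging.getLogger("accent.demo")
logger.setLevel(logging.INFO)   # DEBUG is disabled for this logger


class Expensive:
    """Counts how many times it is converted to a string."""

    def __init__(self) -> None:
        self.renders = 0

    def __str__(self) -> str:
        self.renders += 1
        return "payload"


eager = Expensive()
lazy = Expensive()

# The f-string is evaluated before debug() is even called.
logger.debug(f"aligned: {eager}")

# The %-arg is only formatted if the record is actually emitted.
logger.debug("aligned: %s", lazy)
```

After both calls, `eager` has been rendered once and `lazy` not at all.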

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schedule_chunks created detached asyncio tasks that kept scraping OJAD
even after the client went away (PR #53 review): on the streaming
endpoint a disconnect just stopped consuming the generator, and on the
collected endpoint a cancelled handler orphaned its tasks.

Add a shared cancel_pending helper and call it from a finally in both
endpoints, so a disconnect (GeneratorExit into the stream, or the
collected handler being cancelled) tears down any still-pending chunk.

A TaskGroup would scope the tasks automatically, but async with
TaskGroup() inside the streaming async generator wraps the aclose()
GeneratorExit into a BaseExceptionGroup, so explicit cancellation is the
only shape that closes the stream cleanly.
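
A rough sketch of the explicit-cancellation shape on the streaming side. `schedule_chunks` and `cancel_pending` are named above, but their signatures here are guesses, and `fake_chunk` merely stands in for an OJAD scrape:

```python
import asyncio
from typing import AsyncIterator, List, Tuple


def cancel_pending(tasks: List[asyncio.Task]) -> None:
    """Request cancellation of every task that has not finished yet."""
    for task in tasks:
        if not task.done():
            task.cancel()


async def fake_chunk(index: int, delay: float) -> str:
    await asyncio.sleep(delay)          # stands in for a slow OJAD request
    return f"chunk-{index}"


def schedule_chunks(delays: List[float]) -> List[asyncio.Task]:
    """Start one detached task per chunk (must run inside an event loop)."""
    return [asyncio.create_task(fake_chunk(i, d)) for i, d in enumerate(delays)]


async def stream(tasks: List[asyncio.Task]) -> AsyncIterator[str]:
    try:
        for task in tasks:
            yield await task
    finally:
        # Runs on normal exhaustion and on GeneratorExit (client disconnect).
        cancel_pending(tasks)


async def demo() -> Tuple[str, List[bool]]:
    tasks = schedule_chunks([0.0, 10.0, 10.0])
    gen = stream(tasks)
    first = await gen.__anext__()
    await gen.aclose()                  # simulate the client going away
    # Wait for the cancellations to land; this returns without the 10 s sleeps.
    await asyncio.gather(*tasks, return_exceptions=True)
    return first, [t.cancelled() for t in tasks]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The collected endpoint would use the same helper from its own `finally`, so a cancelled handler does not leave chunk tasks running.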

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jun 4, 2026

🛡️ PR Quality Check Summary

PR Title: Passed (Length: 47/75, Format: OK). feat(accent): local UniDic + POS-driven patches
Branch Name: Follows naming convention (feat/local-unidic)
Commit Messages: All 29 commit(s) passed (Length, Format, Case)
Conflicts: No merge conflict markers found
Python Quality: All checks passed.


🎉 All checks passed!
