Epic / tracking issue. Long-running theme, not a single PR. Individual work items get their own issues/PRs and link back here.
Vision
API-tools is converging on pitch-accent marking (/api/MarkAccent/) as its primary purpose; the other endpoints (DictQuery, SentenceQuery, UsageQuery) see little use. The goal of this epic is to make the accent service standalone — no runtime calls to external services — so it is robust, reproducible, low-latency, and deployable offline.
The tokeniser half is already standalone (fugashi + NINJAL UniDic CWJ, PR #53). The remaining external dependency in the accent path is OJAD (gavo.t.u-tokyo.ac.jp), which supplies the connected-speech pitch contour. This epic tracks removing it.
Why
Latency / scale: in-process inference instead of a per-chunk HTTP round-trip plus a semaphore to avoid hammering OJAD.
Reproducibility & offline deploy: pinned local model + dictionary, no network at request time.
The technical gap
OJAD does connected-speech contour: accent-phrase (アクセント句) segmentation, compound accent (複合語アクセント), particle attachment, downstep. UniDic's aType (already on every token as lexical_kernel) is only lexical / dictionary-form pitch — it cannot express phrasing. So the standalone replacement is not "use aType"; it is a local prosody engine.
Workstreams
Spike (local prosody engine): evaluate pyopenjtalk (OpenJTalk full-context accent), optionally with run_marine=True (MARINE neural accent), as an OJAD replacement. The dictionary is naist-jdic (≈tens of MB, ≪ UniDic). Deliverable: accent-nucleus agreement against an OJAD gold set, plus latency and image-size delta.
Mapping layer: OpenJTalk full-context labels (/A:, accent-phrase fields) → the service's per-mora accent_marking_type (0/1/2). Reuse the existing mora-grouping convention.
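A minimal sketch of what this mapping could look like, not the service's actual code. It reads the mora position (the second `/A:` value) and the accent-phrase accent type (the second `/F:` value) from one full-context label and derives a marking from the standard Tokyo-accent rule. The meaning assigned to 0/1/2 here (low / high / nucleus) is an assumption about `accent_marking_type`, and `MoraContext`, `parse_label`, and `marking_for` are hypothetical names. Labels are per phoneme, so collapsing them to moras is left to the existing mora-grouping convention. How unaccented phrases are encoded in real pyopenjtalk output should be checked during the spike.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: the service's accent_marking_type values mean
# 0 = low, 1 = high, 2 = accent nucleus (high, followed by a drop).
LOW, HIGH, NUCLEUS = 0, 1, 2


class MoraContext(NamedTuple):
    position: int     # /A: second value, 1-based mora position in the accent phrase
    mora_count: int   # /F: first value, number of moras in the accent phrase
    accent_type: int  # /F: second value, accent type of the phrase (0 = no nucleus)


_A_FIELD = re.compile(r"/A:(-?\d+)\+(\d+)\+(\d+)/")
_F_FIELD = re.compile(r"/F:(\d+)_(\d+)#")


def parse_label(label: str) -> Optional[MoraContext]:
    """Extract the accent context from one label; None for silence/pause labels."""
    a = _A_FIELD.search(label)
    f = _F_FIELD.search(label)
    if a is None or f is None:
        return None  # fields are "xx" on silence, so the numeric patterns do not match
    return MoraContext(
        position=int(a.group(2)),
        mora_count=int(f.group(1)),
        accent_type=int(f.group(2)),
    )


def marking_for(ctx: MoraContext) -> int:
    """Tokyo-accent rule for one mora, given its position and the phrase accent type."""
    if ctx.accent_type == 0:
        # No nucleus: first mora low, the rest high.
        return LOW if ctx.position == 1 else HIGH
    if ctx.position == ctx.accent_type:
        return NUCLEUS
    if ctx.position < ctx.accent_type:
        # Before the nucleus: first mora low, then high.
        return LOW if ctx.position == 1 else HIGH
    return LOW  # after the nucleus


# Hand-written, trimmed label for illustration only (not captured engine output).
example = "xx^sil-k+o=N/A:-1+1+3/B:xx-xx_xx/F:3_2#0_xx@1_1|1_3"
ctx = parse_label(example)
```

With the example label, `ctx` describes the first mora of a three-mora phrase whose accent type is 2, so `marking_for(ctx)` yields the low marking under the assumed convention.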
Integration: introduce the engine behind the existing get_ojad_result seam so the pipeline (align_accent, overrides, POS patches, postprocess) stays unchanged. Make the contour source pluggable: openjtalk | ojad | lexical.
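One way the pluggable contour source could be sketched is a small registry keyed by the three names above, with a fallback chain. Everything here is hypothetical: the real signature of `get_ojad_result` is not shown in this issue, so `ContourSource`, `register_source`, and `get_contour` are illustrative names, and the two registered backends are stubs rather than real engines.

```python
from typing import Callable, Dict, List, Sequence

# ASSUMPTION: a contour source maps sentence text to one marking (0/1/2) per mora.
ContourSource = Callable[[str], List[int]]

_SOURCES: Dict[str, ContourSource] = {}


def register_source(name: str) -> Callable[[ContourSource], ContourSource]:
    """Decorator that registers a backend under a name such as 'openjtalk'."""
    def decorator(fn: ContourSource) -> ContourSource:
        _SOURCES[name] = fn
        return fn
    return decorator


def get_contour(
    text: str,
    source: str = "openjtalk",
    fallbacks: Sequence[str] = ("lexical",),
) -> List[int]:
    """Try the requested source first, then each fallback in order."""
    errors = []
    for name in (source, *fallbacks):
        backend = _SOURCES.get(name)
        if backend is None:
            errors.append(f"{name}: not registered")
            continue
        try:
            return backend(text)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no contour source succeeded: " + "; ".join(errors))


@register_source("ojad")
def _ojad_stub(text: str) -> List[int]:
    # Stub standing in for the remote backend; always fails here.
    raise ConnectionError("network unavailable in this sketch")


@register_source("lexical")
def _lexical_stub(text: str) -> List[int]:
    # Placeholder only: a flat contour with one value per character.
    return [0] * len(text)


result = get_contour("はし", source="ojad")
```

Because the `ojad` stub raises, the call falls through to `lexical`. Keeping the selection in one function means the rest of the pipeline would not need to know which backend produced the contour.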
OJAD → optional: keep OJAD selectable for comparison, then default to local; eventually remove the scrape.
Evaluation harness: a gold set of sentences with OJAD output captured on a network that reaches OJAD, to score any local engine offline (nucleus-position agreement %, by word class).
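The scoring step could be as small as the sketch below, which computes nucleus-position agreement per word class. The gold-set record shape (`GoldWord`) and the function names are hypothetical, and the code assumes that marking value 2 denotes the accent nucleus.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

NUCLEUS = 2  # ASSUMPTION: marking value used for the accent nucleus


@dataclass(frozen=True)
class GoldWord:
    word_class: str       # e.g. "noun", "verb"
    gold: List[int]       # per-mora markings captured from OJAD
    predicted: List[int]  # per-mora markings from the local engine


def nucleus_position(markings: List[int]) -> int:
    """1-based position of the first nucleus, or 0 if there is none."""
    for index, marking in enumerate(markings, start=1):
        if marking == NUCLEUS:
            return index
    return 0


def agreement_by_class(words: Iterable[GoldWord]) -> Dict[str, float]:
    """Percentage of words per class whose nucleus position matches the gold set."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for word in words:
        totals[word.word_class] += 1
        if nucleus_position(word.gold) == nucleus_position(word.predicted):
            hits[word.word_class] += 1
    return {cls: 100.0 * hits[cls] / totals[cls] for cls in totals}


# Made-up records for illustration only, not measured results.
sample = [
    GoldWord("noun", gold=[0, 2, 0], predicted=[0, 2, 0]),
    GoldWord("noun", gold=[2, 0, 0], predicted=[0, 1, 1]),
    GoldWord("verb", gold=[0, 1, 1], predicted=[0, 1, 1]),
]
scores = agreement_by_class(sample)
```

Comparing only the nucleus position, rather than every mora, keeps the metric aligned with the deliverable named in the spike and avoids penalising a single misplaced nucleus several times.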
Open decisions
Quality bar: "not worse than OJAD" (→ likely need MARINE + eval) vs "good enough, simpler" (OpenJTalk only)?
MARINE: pulls in torch (heavy image). Worth it?
Image-size budget: already ~1.3 GB (UniDic). OpenJTalk dict is small; torch is not.
Repo split: spin the accent service into its own repo, leaving Dict/Sentence/Usage behind? (Under consideration — affects packaging/CD.)
Keep OJAD at all post-migration (as an optional high-accuracy backend) or remove entirely?
Workstream tracked separately (#61, "Fallback to UniDic aType-derived lexical pitch when OJAD is unavailable"): use lexical_kernel instead of returning flat. Lowest-quality tier, but a guaranteed baseline.
Explicitly out of scope (parked)
DictQuery / SentenceQuery / UsageQuery, and their known bugs (#54: DictQuery returns 404 for valid words (先生): EDRDG/JMdictDB markup changed name="e" → name="g"; #57: Invalid symbols block the whole result parsing; #58: Furigana parsing for date with number tend to drift). These stay external for now; revisit only if the repo-split decision pulls them along.
References
api/accent/ojad.py, consumed in api/accent/pipeline.py