
[Epic] Make MarkAccent a standalone pitch-accent service (drop OJAD dependency) #62

@torrid-fish


Epic / tracking issue. Long-running theme, not a single PR. Individual work items get their own issues/PRs and link back here.

Vision

API-tools is converging on pitch-accent marking (/api/MarkAccent/) as its primary purpose; the other endpoints (DictQuery, SentenceQuery, UsageQuery) see little use. The goal of this epic is to make the accent service standalone — no runtime calls to external services — so it is robust, reproducible, low-latency, and deployable offline.

The tokeniser half is already standalone (fugashi + NINJAL UniDic CWJ, PR #53). The remaining external dependency in the accent path is OJAD (gavo.t.u-tokyo.ac.jp), which supplies the connected-speech pitch contour. This epic tracks removing it.

Why

The technical gap

OJAD models the connected-speech contour: accent-phrase (アクセント句) segmentation, compound accent (複合語アクセント), particle attachment, and downstep. UniDic's aType (already on every token as lexical_kernel) carries only the lexical, dictionary-form pitch of each word, so it cannot express phrasing. The standalone replacement is therefore not "use aType"; it is a local prosody engine.
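
To make the gap concrete, here is a minimal sketch of what a kernel-only approach can produce, using the standard Tokyo-accent pattern for a word in isolation. The function name `lexical_pitch` and the 'H'/'L' output are illustrative assumptions, not the service's actual representation, and `kernel` is assumed to be a single integer nucleus position (0 for heiban).

```python
from typing import List


def lexical_pitch(n_morae: int, kernel: int) -> List[str]:
    """Return one 'H' or 'L' per mora for a single word in isolation.

    kernel is the 1-based mora index of the accent nucleus; 0 means no fall.
    """
    if n_morae < 1:
        raise ValueError("a word needs at least one mora")
    if not 0 <= kernel <= n_morae:
        raise ValueError("kernel must lie between 0 and n_morae")
    if kernel == 1:
        # Head-high: first mora high, everything after it low.
        return ["H"] + ["L"] * (n_morae - 1)
    # Otherwise the first mora is low and the pitch rises on the second.
    pitch = ["L"] + ["H"] * (n_morae - 1)
    if kernel >= 2:
        # Drop to low after the nucleus mora.
        for i in range(kernel, n_morae):
            pitch[i] = "L"
    return pitch
```

Concatenating these per-word patterns restarts the initial rise on every word, which is exactly the phrasing information a lexical kernel cannot supply.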

Quality ladder, worst to best:

  • flat (today's behaviour when OJAD is down): worst
  • #61 aType lexical: approximation
  • rule-based compound accent (複合語アクセント): partial
  • OpenJTalk: replaces OJAD
  • OpenJTalk + marine: closest to / beyond OJAD
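
The ladder above also suggests a runtime fallback order: try the best tier first and step down when it is unavailable. A rough sketch of that idea follows; `ContourSource`, `first_available`, and the list-of-ints contour are hypothetical names and shapes chosen for illustration, not the service's existing interfaces.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical shape: a source maps a sentence to one value per mora,
# or returns None when it cannot serve the request.
ContourSource = Callable[[str], Optional[List[int]]]


def first_available(
    text: str, sources: Sequence[Tuple[str, ContourSource]]
) -> Tuple[str, List[int]]:
    """Return (tier name, contour) from the first source that succeeds."""
    for name, source in sources:
        try:
            contour = source(text)
        except Exception:
            # In this sketch any failure (network, missing dict) means "unavailable".
            contour = None
        if contour is not None:
            return name, contour
    raise RuntimeError("no contour source produced a result")
```

Ordering `sources` from best to worst tier, with the lexical floor last, would mean a flat contour is only returned if every tier fails.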

Workstreams

  • Floor: lexical fallback ("Fallback to UniDic aType-derived lexical pitch when OJAD is unavailable", #61): when neither a local engine nor OJAD is available, synthesise per-mora pitch from lexical_kernel instead of returning a flat contour. Lowest-quality tier, but a guaranteed baseline. Tracked separately in #61.
  • Spike: local prosody engine. Evaluate pyopenjtalk (OpenJTalk full-context accent), optionally with run_marine=True (MARINE neural accent), as an OJAD replacement. The dictionary is naist-jdic (tens of MB, far smaller than UniDic). Deliverable: accent-nucleus agreement against an OJAD gold set, plus latency and image-size delta.
  • Mapping layer: OpenJTalk full-context labels (/A:, accent-phrase fields) → the service's per-mora accent_marking_type (0/1/2). Reuse the existing mora-grouping convention.
  • Integration: introduce the engine behind the existing get_ojad_result seam so the pipeline (align_accent, overrides, POS patches, postprocess) stays unchanged. Make the contour source pluggable: openjtalk | ojad | lexical.
  • OJAD → optional: keep OJAD selectable for comparison, then default to local; eventually remove the scrape.
  • Evaluation harness: a gold set of sentences with OJAD output captured on a network that reaches OJAD, to score any local engine offline (nucleus-position agreement %, by word class).
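
As a starting point for the mapping layer, the sketch below reads only the `/A:a1+a2+a3` field of each full-context label (such as those returned by `pyopenjtalk.extract_fullcontext`) and derives per-mora high/low levels grouped by accent phrase. It rests on assumptions that should be checked against the OpenJTalk label documentation during the spike: that a1 is the mora position relative to the nucleus (0 on the nucleus), that a2 is the 1-based mora position within the accent phrase, and that a mora ends on a vowel, `N`, or `cl`. The function name and the 'H'/'L' output are illustrative; mapping onto accent_marking_type (0/1/2) is left open.

```python
import re
from typing import List

_A_FIELD = re.compile(r"/A:(-?\d+)\+(\d+)\+(\d+)")
_PHONEME = re.compile(r"-([^+]+)\+")  # current phoneme sits between '-' and '+'
_MORA_FINAL = set("aiueoAIUEO") | {"N", "cl"}  # assumed mora-closing phonemes


def mora_levels(labels: List[str]) -> List[List[str]]:
    """Turn per-phoneme labels into accent phrases of per-mora 'H'/'L'."""
    phrases: List[List[str]] = []
    for label in labels:
        a = _A_FIELD.search(label)
        p = _PHONEME.search(label)
        if a is None or p is None or p.group(1) not in _MORA_FINAL:
            continue  # silence/pause (A field is 'xx') or a consonant inside a mora
        a1, a2 = int(a.group(1)), int(a.group(2))
        if a2 == 1 or not phrases:
            phrases.append([])  # mora index restarts, so a new accent phrase begins
        if a2 == 1:
            level = "H" if a1 == 0 else "L"  # first mora is high only if it is the nucleus
        else:
            level = "H" if a1 <= 0 else "L"  # high up to the nucleus, low after it
        phrases[-1].append(level)
    return phrases


# Labels abbreviated to the fields this sketch reads; real labels carry /B: to /K: too.
demo = [
    "xx^xx-sil+k=o/A:xx+xx+xx",
    "xx^sil-k+o=k/A:-1+1+3",
    "sil^k-o+k=o/A:-1+1+3",
    "k^o-k+o=r/A:0+2+2",
    "o^k-o+r=o/A:0+2+2",
    "k^o-r+o=sil/A:1+3+1",
    "o^r-o+sil=xx/A:1+3+1",
    "r^o-sil+xx=xx/A:xx+xx+xx",
]
print(mora_levels(demo))  # [['L', 'H', 'L']]
```

Because this reads only a1 and a2, a single-mora phrase and a phrase whose nucleus falls on its last mora need extra care; the spike should confirm how the engine encodes unaccented phrases before the mapping is trusted.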

Open decisions

  1. Quality bar: "not worse than OJAD" (→ likely need MARINE + eval) vs "good enough, simpler" (OpenJTalk only)?
  2. MARINE: pulls in torch (heavy image). Worth it?
  3. Image-size budget: already ~1.3 GB (UniDic). OpenJTalk dict is small; torch is not.
  4. Repo split: spin the accent service into its own repo, leaving Dict/Sentence/Usage behind? (Under consideration — affects packaging/CD.)
  5. Keep OJAD at all post-migration (as an optional high-accuracy backend) or remove entirely?
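
Decision 1 is easier to settle with a number attached. A minimal sketch of the nucleus-position agreement metric named in the evaluation-harness workstream, broken down by word class, might look like this; the `Sample` tuple layout and the "ALL" summary key are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (word class, nucleus position from the OJAD gold set,
# nucleus position predicted by the local engine).
Sample = Tuple[str, int, int]


def nucleus_agreement(samples: Iterable[Sample]) -> Dict[str, float]:
    """Return percentage agreement per word class, plus an overall 'ALL' entry."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for word_class, gold, predicted in samples:
        match = int(gold == predicted)
        for key in (word_class, "ALL"):
            totals[key] += 1
            hits[key] += match
    return {key: 100.0 * hits[key] / totals[key] for key in totals}
```

Running the same gold set through OpenJTalk alone and through OpenJTalk with MARINE would give a per-class comparison to weigh against the image-size cost in decisions 2 and 3.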

Explicitly out of scope (parked)

References
