Fallback to UniDic aType-derived lexical pitch when OJAD is unavailable #61

@torrid-fish

Summary

When OJAD is unavailable, MarkAccent currently degrades to flat pitch (accent_marking_type=0 on every mora) — see the graceful-degradation path added in #59 / PR #53. Meanwhile UniDic's per-morpheme accent kernel (aType) is already tokenised and carried on every word as lexical_kernel / lexical_kernel_alts, but it is not used to synthesise any pitch.

This issue proposes an OJAD-down fallback that converts lexical_kernel into a per-mora pitch contour, so degraded responses carry an approximate (dictionary-form) accent instead of nothing.

Likely a separate PR on top of PR #53.

Current behaviour

When OJAD returns nothing, align_accent takes its m == 0 branch and every token goes through _fallback_word_build_word_result(..., ojad_span=[]), which emits a single hard-coded flat entry:

# api/accent/align.py:423
fallback_accent = [
    AccentInfo(furigana=token_furigana, accent_marking_type=0, length=len(token_furigana))
]

lexical_kernel is only passed through as metadata (api/accent/align.py:444); it never drives accent_marking_type. So in degraded mode the client sees furigana + flat pitch (plus whatever apply_accent_patches POS rules fire independently, e.g. ます→FALL).

Confirmed locally: with OJAD unreachable, all four test articles return 200 with accent_marking_type=0 throughout and the warning field set.

Proposed feature

In the empty-OJAD-span path only, when lexical_kernel is not None, derive a per-mora contour from the kernel using the standard Tokyo-dialect mapping (same 0=LOW / 1=HIGH / 2=FALL-kernel convention documented in api/accent/user_patches.py):

| aType (kernel) | Type | Per-mora pattern (M morae) | Codes |
|---|---|---|---|
| 0 | 平板 (heiban) | [LOW, HIGH, HIGH, …, HIGH] | [0,1,1,…,1] |
| 1 | 頭高 (atamadaka) | [FALL, LOW, …, LOW] | [2,0,…,0] |
| N≥2 | 中高/尾高 (nakadaka/odaka) | [LOW, HIGH, …, FALL on mora N, LOW, …] | [0,1,…,2,0,…] |

None kernel (particles, auxiliaries, override-constructed tokens) stays flat as today.
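
A minimal sketch of the mapping in the table above, operating on a mora count rather than on real tokens. The function name kernel_to_codes is hypothetical, and treating a kernel larger than the mora count as "stay flat" is an assumption of this sketch, not something the codebase is known to do:

```python
from typing import List, Optional

# Per-mora codes, following the 0=LOW / 1=HIGH / 2=FALL convention above.
LOW, HIGH, FALL = 0, 1, 2


def kernel_to_codes(mora_count: int, kernel: Optional[int]) -> List[int]:
    """Map a UniDic aType kernel to a per-mora contour (hypothetical helper)."""
    if mora_count <= 0:
        return []
    # None kernel stays flat as today. Out-of-range kernels also stay flat
    # here; that choice is an assumption of this sketch.
    if kernel is None or kernel < 0 or kernel > mora_count:
        return [LOW] * mora_count
    if kernel == 0:  # heiban: LOW, then HIGH to the end
        return [LOW] + [HIGH] * (mora_count - 1)
    if kernel == 1:  # atamadaka: FALL on the first mora, LOW afterwards
        return [FALL] + [LOW] * (mora_count - 1)
    # nakadaka / odaka: LOW, HIGH up to mora N-1, FALL on mora N, then LOW
    return [LOW] + [HIGH] * (kernel - 2) + [FALL] + [LOW] * (mora_count - kernel)
```

For the two-mora reading はし, this gives [0, 1] for kernel 0, [2, 0] for kernel 1, and [0, 2] for kernel 2.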

Possible implementation

  • Add a helper, e.g. _kernel_to_accents(furigana, kernel) -> list[AccentInfo], in api/accent/align.py, mapping the kernel position to per-mora accent_marking_type.
    • Count morae with the existing convention (small kana ゃゅょ etc. attach to the preceding mora — reuse whatever user_patches mora-counting uses so the kernel index lands on the right mora).
    • Use lexical_kernel (primary); ignore lexical_kernel_alts for the first cut.
  • Call it from the empty-span branches in _build_word_result (align.py:431) and have align_accent's m == 0 branch (align.py:~534) route through it, instead of the flat fallback_accent.
  • Gate it so it only affects the OJAD-down / empty-span case. OJAD-normal behaviour must stay byte-identical (the per-mora OJAD contour still wins whenever a voiced span exists).
  • Consider exposing it as opt-in (request flag, e.g. lexical_pitch_fallback, or config) so callers who prefer "flat + lexical_kernel metadata" can keep today's behaviour.
  • Keep / adjust the existing warning wording to say pitch is lexical-form approximated rather than absent.
  • Verify interaction with apply_accent_patches (runs after align in pipeline.py): the POS ます/たい patches must not double-mark a FALL that the kernel mapping already produced.
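
For the mora-counting step, something along these lines may be enough if no reusable helper exists in user_patches. This is only a sketch: split_morae and SMALL_KANA are hypothetical names, and the rule shown (small ゃゅょ-style kana attach to the preceding kana, while っ, ん and ー count as morae of their own) is the usual convention, which may differ from what the repository actually does:

```python
from typing import List

# Small kana that merge into the preceding mora (hiragana and katakana).
# っ/ッ is deliberately absent: the sokuon counts as its own mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_morae(furigana: str) -> List[str]:
    """Split a kana reading into morae (hypothetical helper)."""
    morae: List[str] = []
    for ch in furigana:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch  # e.g. き + ょ -> きょ
        else:
            morae.append(ch)
    return morae
```

With this, len(split_morae(furigana)) is the M used when placing the kernel, so a reading like きょう counts as two morae rather than three.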

Caveats / non-goals

  • UniDic aType is lexical (isolated dictionary-form) pitch, not a connected-speech contour. Cross-word phrasing (downstep deletion across a phrase, particle (助詞) attachment, compound-phrase (連語) sandhi) is not modelled. This is a best-effort approximation, strictly inferior to OJAD, meant only to beat "completely flat".
  • kernel_absorbed is about OJAD-present cases and is orthogonal here.
  • Not a replacement for OJAD (see the "Out of scope" section of PR #53, feat(accent): local UniDic + POS-driven patches).

Acceptance criteria

  • With OJAD reachable: output unchanged (diff-clean against current behaviour).
  • With OJAD unavailable: words with lexical_kernel = N render a kernel-consistent contour (heiban / atamadaka / nakadaka / odaka verified on a few fixtures, e.g. 端 = heiban, 箸 = atamadaka, 橋 = odaka).
  • None-kernel tokens (particles) stay flat.
  • ruff check / ruff format clean; existing import sanity holds.
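
One possible way to assert "kernel-consistent" in tests, which would also catch a double-marked FALL, is a small structural check on the emitted codes. The helper name is_kernel_consistent is hypothetical, and the None branch assumes the check runs on align output before apply_accent_patches has had a chance to add its own FALL:

```python
from typing import Optional, Sequence

LOW, HIGH, FALL = 0, 1, 2  # same 0/1/2 convention as above


def is_kernel_consistent(codes: Sequence[int], kernel: Optional[int]) -> bool:
    """Check a per-mora contour against its kernel (hypothetical test helper)."""
    falls = [i for i, code in enumerate(codes) if code == FALL]
    if kernel is None:
        # None-kernel tokens are expected to stay flat.
        return all(code == LOW for code in codes)
    if kernel == 0:
        # heiban: no FALL anywhere in the word.
        return not falls
    # Exactly one FALL, on mora N (index N-1), and only LOW after it.
    if falls != [kernel - 1]:
        return False
    return all(code == LOW for code in codes[kernel:])
```

A contour such as [0, 2, 2] for kernel 2 fails this check, which is the double-FALL case the patch interaction above is meant to rule out.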

References

  • Graceful degradation: #59 (OJAD graceful fallback) and PR #53 (feat(accent): local UniDic + POS-driven patches); pipeline.py wraps the OJAD call in try/except OJADUnavailableError
  • Flat fallback today: api/accent/align.py:423, :431, m == 0 branch :~534
  • Kernel metadata fields: api/accent/models.py lexical_kernel / lexical_kernel_alts / kernel_absorbed
  • Per-mora 0/1/2 convention: api/accent/user_patches.py schema docstring

Labels: enhancement