[Epic] Make MarkAccent a standalone pitch-accent service (drop OJAD dependency) #62

> **Epic / tracking issue.** Long-running theme, not a single PR. Individual work items get their own issues/PRs and link back here.

### Vision

`API-tools` is converging on **pitch-accent marking** (`/api/MarkAccent/`) as its primary purpose; the other endpoints (`DictQuery`, `SentenceQuery`, `UsageQuery`) see little use. The goal of this epic is to make the accent service **standalone** — no runtime calls to external services — so it is robust, reproducible, low-latency, and deployable offline.

The tokeniser half is already standalone (fugashi + NINJAL UniDic CWJ, PR #53). The remaining external dependency in the accent path is **OJAD** (`gavo.t.u-tokyo.ac.jp`), which supplies the connected-speech pitch contour. This epic tracks removing it.

### Why

- **Robustness**: OJAD is an HTML scrape against a university server: unreachable from some networks (observed: TCP connect timeout), no SLA, and a terms-of-service grey area. This is the same fragility class that produced the scrape-drift bugs on the *other* endpoints (#54, #57, #58).
- **Latency / scale**: in-process inference replaces a per-chunk HTTP round-trip plus the semaphore needed to avoid hammering OJAD.
- **Reproducibility & offline deploy**: pinned local model + dictionary, no network at request time.

### The technical gap

OJAD does **connected-speech contour**: accent-phrase (アクセント句) segmentation, compound accent (複合語アクセント), particle attachment, downstep. UniDic's `aType` (already on every token as `lexical_kernel`) is only **lexical / dictionary-form** pitch — it cannot express phrasing. So the standalone replacement is not "use aType"; it is a **local prosody engine**.

```
flat (today's OJAD-down)  →  #61 aType lexical  →  rule-based 複合語アクセント  →  OpenJTalk  →  OpenJTalk + marine
      worst                    approximation          partial                      replaces OJAD     closest to / beyond OJAD
```
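To make the gap concrete, here is a minimal sketch of what the lexical tier can produce on its own. It assumes the standard Tokyo-type reading of an accent kernel and that `lexical_kernel` has already been reduced to a single integer. `lexical_contour` and the `H`/`L` output are hypothetical names for illustration, not the service's `accent_marking_type` encoding.

```python
from typing import List


def lexical_contour(mora_count: int, kernel: int) -> List[str]:
    """Return an H/L level per mora for one word in isolation.

    Assumed convention (Tokyo-type accent):
      kernel == 0: low first mora, then high to the end (heiban)
      kernel == 1: high first mora, then low (atamadaka)
      kernel >= 2: low first mora, high through the kernel mora, low after it
    """
    if mora_count < 1:
        raise ValueError("mora_count must be >= 1")
    if not 0 <= kernel <= mora_count:
        raise ValueError("kernel must be in 0..mora_count")

    levels = []
    for pos in range(1, mora_count + 1):
        if pos == 1:
            # The first mora is high only when the nucleus sits on it.
            high = kernel == 1
        else:
            # Later moras stay high until the nucleus has been passed.
            high = kernel == 0 or pos <= kernel
        levels.append("H" if high else "L")
    return levels
```

Under these assumptions 箸 (はし, kernel 1) gives `H L`, while 橋 (はし, kernel 2) and 端 (はし, kernel 0) both give `L H`. The last two only differ once a particle follows, which is the phrase-level information a per-word kernel cannot carry.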

### Workstreams

- [ ] **Floor — lexical fallback (#61)**: when no engine/OJAD is available, synthesise per-mora pitch from `lexical_kernel` instead of returning flat. Lowest-quality tier, but a guaranteed baseline. *(separate issue: #61)*
- [ ] **Spike — local prosody engine**: evaluate **pyopenjtalk** (OpenJTalk full-context accent), optionally with `run_marine=True` (MARINE neural accent), as an OJAD replacement. The dictionary is naist-jdic (tens of MB, far smaller than UniDic). Deliverable: accent-nucleus agreement against an OJAD gold set, plus latency and image-size delta.
- [ ] **Mapping layer**: OpenJTalk full-context labels (`/A:`, accent-phrase fields) → the service's per-mora `accent_marking_type` (`0/1/2`). Reuse the existing mora-grouping convention.
- [ ] **Integration**: introduce the engine behind the existing `get_ojad_result` seam so the pipeline (`align_accent`, overrides, POS patches, postprocess) stays unchanged. Make the contour source pluggable: `openjtalk` | `ojad` | `lexical`.
- [ ] **OJAD → optional**: keep OJAD selectable for comparison, then default to local; eventually remove the scrape.
- [ ] **Evaluation harness**: a gold set of sentences with OJAD output captured **on a network that reaches OJAD**, to score any local engine offline (nucleus-position agreement %, by word class).
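The mapping-layer item can be sketched without committing to an engine, by parsing label strings directly. This assumes the HTS-style Japanese full-context format, where the `/A:a1+a2+a3/` field holds the mora's offset from the accent nucleus (`a1`, 0 on the nucleus mora) and its forward position in the accent phrase (`a2`). `MoraAccent`, `moras_from_labels` and `nucleus_by_phrase` are hypothetical names. The final step to `accent_marking_type` is left out because the meaning of `0/1/2` is not spelled out here.

```python
import re
from typing import Iterable, List, NamedTuple


class MoraAccent(NamedTuple):
    phrase: int       # 0-based accent-phrase index
    position: int     # 1-based mora position inside the phrase (a2)
    is_nucleus: bool  # True when a1 == 0


# Current phoneme sits between "-" and "+" at the start of the label.
_PHONEME = re.compile(r"-(.+?)\+")
# Pauses and silences carry "xx" here, so they simply fail to match.
_A_FIELD = re.compile(r"/A:(-?\d+)\+(\d+)\+(\d+)/")
# Assumption: a mora is closed by a vowel (voiced or devoiced), N, or cl.
_MORA_FINAL = set("aiueoAIUEON") | {"cl"}


def moras_from_labels(labels: Iterable[str]) -> List[MoraAccent]:
    """Collapse per-phoneme labels into one record per mora."""
    moras: List[MoraAccent] = []
    phrase = -1
    for label in labels:
        phoneme = _PHONEME.search(label)
        a_field = _A_FIELD.search(label)
        if phoneme is None or a_field is None:
            continue  # silence, pause, or a line that is not a label
        if phoneme.group(1) not in _MORA_FINAL:
            continue  # consonant: wait for the phoneme that closes the mora
        a1, a2 = int(a_field.group(1)), int(a_field.group(2))
        if a2 == 1 or phrase < 0:
            phrase += 1  # first mora of a new accent phrase
        moras.append(MoraAccent(phrase, a2, a1 == 0))
    return moras


def nucleus_by_phrase(moras: Iterable[MoraAccent]) -> List[int]:
    """Nucleus mora position per accent phrase; 0 means none was marked."""
    result: List[int] = []
    for mora in moras:
        while len(result) <= mora.phrase:
            result.append(0)
        if mora.is_nucleus:
            result[mora.phrase] = mora.position
    return result
```

With pyopenjtalk installed, the label list would come from `pyopenjtalk.extract_fullcontext(text)`. How closely the resulting phrases and nuclei match OJAD's is what the spike has to measure.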

### Open decisions

1. **Quality bar**: "not worse than OJAD" (→ likely need MARINE + eval) vs "good enough, simpler" (OpenJTalk only)?
2. **MARINE**: pulls in torch (heavy image). Worth it?
3. **Image-size budget**: already ~1.3 GB (UniDic). OpenJTalk dict is small; torch is not.
4. **Repo split**: spin the accent service into its own repo, leaving Dict/Sentence/Usage behind? (Under consideration — affects packaging/CD.)
5. **Keep OJAD at all** post-migration (as an optional high-accuracy backend) or remove entirely?
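Decision 1 only becomes answerable with a number attached. A minimal sketch of the nucleus-position agreement metric named in the evaluation-harness workstream follows. It assumes gold and predicted nuclei have already been aligned word by word; `ScoredWord`, `nucleus_agreement` and the `"ALL"` bucket are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class ScoredWord(NamedTuple):
    word_class: str    # e.g. a POS label from the tokeniser
    gold_nucleus: int  # nucleus position captured from OJAD (0 = none)
    pred_nucleus: int  # nucleus position from the local engine


def nucleus_agreement(words: Iterable[ScoredWord]) -> Dict[str, float]:
    """Percentage of words whose nucleus position matches, per word class.

    The extra "ALL" key aggregates over every word; it would collide with
    a real word class of the same name, which this sketch does not guard.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for word in words:
        matched = int(word.gold_nucleus == word.pred_nucleus)
        for key in (word.word_class, "ALL"):
            totals[key] += 1
            hits[key] += matched
    return {key: 100.0 * hits[key] / totals[key] for key in totals}
```

Running the same gold set through OpenJTalk with and without MARINE would give two such tables, which is one way to weigh decisions 1 and 2 against the image-size cost.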

### Explicitly out of scope (parked)

- Standalone-ifying `DictQuery` / `SentenceQuery` / `UsageQuery`, and their known bugs (#54, #57, #58). These stay external for now; revisit only if the repo-split decision pulls them along.

### References

- Tokeniser already local: PR #53 (#50)
- OJAD graceful degradation: #59
- Lexical fallback floor: #61
- OJAD scrape seam: `api/accent/ojad.py`, consumed in `api/accent/pipeline.py`

