Offline EN↔JA travel translator, Q4_K_M GGUF, runs fully on-device via llama.cpp / PocketPal.
Fine-tuned from cyberagent/CAT-Translate-0.8b
(itself built on sbintuitions/sarashina2.2-0.5b)
via LoRA domain adaptation on a curated travel corpus, then converted to GGUF via a
patched convert_hf_to_gguf.py and quantized with
importance-matrix (imatrix) calibration on travel data.
| File | Size | Notes |
|---|---|---|
cat-translate-0.8b-travel-Q4_K_M.gguf |
503.7 MB | RECOMMENDED — iPhone target |
cat-translate-0.8b-travel-Q5_K_M.gguf |
569.2 MB | Alternate |
cat-translate-0.8b-travel-Q8_0.gguf |
806.2 MB | Near-lossless reference |
All three files are imatrix-calibrated PTQ (importance-matrix post-training quantization),
calibrated on a 534-example travel corpus weighted toward emergency (134 examples, all
available) and food (200 examples) domains. Quantized with llama-quantize --imatrix.
Target use case: consumer travel translation — ordering food, navigating transportation, accommodation, shopping, and emergency communication in Japan.
Food and emergency domains are explicitly weighted in both fine-tuning (food is 42.5% of training pairs) and in the importance matrix calibration (emergency is the heaviest-weighted domain by proportion). The motivating scenario is a foreigner with no internet access facing a Japanese-only emergency line — "Please call an ambulance" and "I am having an allergic reaction" must be correct.
This model is a credible demonstrator with rigorous safety-aware evaluation. It is not a deployable medical, legal, or defense product. The consumer travel app is the near-term target. See the safety architecture note below for how life-critical phrases are handled in the intended deployment.
Not in scope: medical advice, legal documents, regulated translation workflows, or any context where a confident-wrong output carries high-stakes consequences.
The intended deployment is a hybrid phrasebook + LLM system:
- Deterministic, human-verified phrasebook handles life-critical phrases: calling an ambulance, reporting an allergic reaction, "I can't breathe", "I'm lost". These are fixed strings — no probability, no hallucination risk.
- The LLM handles the open-ended long tail: menus, negotiations, descriptions — anything that formulaic phrases don't cover.
A 4-bit quantized LLM alone must not be relied on for life-critical phrases. Confident-wrong is the worst failure mode in an emergency; the phrasebook is the safety net.
Measured on a 156-item held-out gold test set (78 EN→JA + 78 JA→EN), food- and emergency-weighted, with all 44 emergency references human-verified. Evaluated at the desktop inference stage using llama-server with plain-text prompts on stock llama.cpp b9716.
| Label | Size (MB) | COMET (all) | Δ vs bf16 | food COMET | emerg COMET | BLEU en→ja | BLEU ja→en |
|---|---|---|---|---|---|---|---|
| bf16 | 1,515.2 | 0.9001 | ±0.0000 | 0.8960 | 0.8913 | 48.87 | 39.99 |
| Q8_0 | 806.2 | 0.9022 | +0.0021 | 0.8971 | 0.8925 | 49.79 | 40.35 |
| Q5_K_M | 569.2 | 0.9000 | −0.0001 | 0.8933 | 0.8925 | 48.92 | 41.32 |
| Q4_K_M | 503.7 | 0.8988 | −0.0013 | 0.8892 | 0.8986 | 47.96 | 38.97 |
COMET model: Unbabel/wmt22-comet-da.
BLEU tokenizers: char for EN→JA (sacrebleu), 13a for JA→EN (sacrebleu).
Why Q4_K_M: COMET delta −0.0013 vs bf16 is inside the 0.005 negligible threshold, at 67% smaller. Emergency gate clears: Q4 emerg COMET 0.8986 vs bf16 0.8913 — zero emergency degradation at 4-bit (the small apparent gain is within n=44 noise; the honest claim is no degradation). The entire quant ladder sits within ~0.003 COMET; quant choice collapses to smallest-that-fits-the-phone.
- COMET ≈ parity is the reliable signal. Do not read small BLEU swings between quant levels as quality conclusions — the variation is within small-sample noise.
- These 156-item numbers are NOT comparable to Phase 3 (fine-tune) numbers. The earlier 40-item eval set contained shorter, more formulaic sentences. The 156-item set includes longer, more varied sentences; BLEU naturally drops on harder sets. The difference is not a regression.
- EN→JA BLEU uses
chartokenizer; JA→EN uses13a. Mixing tokenizers makes numbers incomparable. If you re-evaluate, pin these tokenizers explicitly.
The value proposition is pragmatics, not BLEU. Incumbent offline NMT (Google Translate offline ~47 MB, Apple ~100 MB) wins literal character-BLEU. This model's edge:
- Pro-drop resolution — correctly inferring dropped subjects/objects from context in Japanese
- Keigo — appropriate politeness register for service interactions (restaurant, hotel, shop)
- Cultural and food context — menu descriptions, allergen terms, regional food vocabulary
Do not frame this as a BLEU race against Google or Apple. Frame it on pragmatic correctness in travel scenarios that require cultural and register awareness.
This GGUF was built with a patched convert_hf_to_gguf.py and tokenizes correctly only
in llama.cpp builds that use the resulting tokenizer.ggml.model = "t5" path.
Background: cyberagent/CAT-Translate-0.8b uses a SentencePiece Unigram tokenizer
(inherited from sbintuitions/sarashina2.2-0.5b). The stock llama.cpp conversion script
hardcodes tokenizer.ggml.model = "llama" for all SPM models, routing the runtime into the
greedy-merge SPM path. SPM greedy-merge ≠ Unigram Viterbi — it produces silently wrong token
splits. All known community GGUFs of this model family (mradermacher, mmnga-o) have this
bug.
The fix (applied at conversion time): llama.cpp/conversion/base.py::_set_vocab_sentencepiece()
was modified to (1) detect tokenizer.json → model.type == "Unigram" and emit
tokenizer.ggml.model = "t5", routing llama.cpp into its LLAMA_VOCAB_TYPE_UGM Viterbi
path; and (2) selectively promote BYTE tokens to NORMAL type only for byte characters that
have no existing NORMAL token — so control characters like \n tokenize correctly without
triggering the duplicate-text assertion crash (GGML_ASSERT(id_to_token.size() == token_to_id.size())).
Validation on stock llama.cpp b9716 (2026-06-19): /tokenize on the template string:
25/25 tokens match HF output exactly, including \n\n = [25, 25]. Broad parity test over
30 diverse strings: all match. COMET ≈ HF parity in both directions on the 40-sentence eval.
Verify your build: run /tokenize on a Japanese string and compare against HF tokenizer
output. If your llama.cpp or PocketPal build uses an unpatched conversion (model string
"llama" instead of "t5"), tokenization will be silently wrong and translation quality
will degrade. PocketPal validation on a real device remains a required gate before any app
deployment — b9716 desktop testing is a strong proxy, not a guarantee.
The patch is published at patches/llama_cpp_ugm_tokenizer.patch
(against llama.cpp b9716, commit db52540).
llama-cli \
-m cat-translate-0.8b-travel-Q4_K_M.gguf \
--no-warmup \
-p "Translate the following English text into Japanese.\n\nI'd like a table for two."# Start server
llama-server -m cat-translate-0.8b-travel-Q4_K_M.gguf --port 8080
# Query
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{"prompt": "Translate the following English text into Japanese.\n\nI have a peanut allergy.", "n_predict": 128}'Use plain-text prompts only — no chat template, no system message. Template:
"Translate the following {src_lang} text into {tgt_lang}.\n\n{source}"
For JA→EN: "Translate the following Japanese text into English.\n\n救急車を呼んでください。"
Open PocketPal → Models → Add Model → enter the Hugging Face repo ID
kaiish/capynode → select cat-translate-0.8b-travel-Q4_K_M.gguf.
Enable airplane mode after download to confirm fully offline operation.
This model is single-turn and uses no system message. Chat UIs that wrap input in a default system prompt, or that accumulate conversation history across turns, will break it — the model drops out of translation mode and responds conversationally instead. This is expected behaviour: CAT-Translate is invoked purely by the instruction in a single user turn, not by a persistent persona or context.
Recipe validated working fully offline on iPhone 11 in PocketPal:
- System prompt: empty — clear any default system message in the model settings.
- One translation per conversation — start a fresh chat for each translation; do not carry history between requests.
- Temperature 0 (greedy) for clean, deterministic output.
- Jinja chat-template ON — use the embedded template (enabled by default in PocketPal
when the GGUF contains a
tokenizer.chat_templatekey, which this one does).
For app integration, the cleaner path is to call the model with the instruction as a single user turn (or as a raw text completion) and no history — which is natural for a translation app since each translation is independent anyway.
Validated offline on iPhone 11 (4 GB) via PocketPal: correct EN→JA and JA→EN including emergency and food/allergen phrases; UGM tokenizer confirmed intact on the device build.
| Source | Pairs | License |
|---|---|---|
| Gemini distillation (gemini-3.1-pro-preview / gemini-2.5-flash) | ~2,550 | Synthetic, Google AI Studio free tier |
| Tatoeba | ~1,230 | CC BY 2.0 |
| BSD | dropped | CC BY-NC-SA 4.0 — excluded (non-commercial clause) |
Final training corpus: 6,458 train / 718 validation examples (bidirectional, sarashina2.2 instruct template). Domain distribution: food 42.5%, shopping 17.5%, transportation 15%, accommodation 15%, emergency 5%, general 5%.
License: MIT. See release/LICENSE for this repository's terms.
This GGUF is built on two upstream MIT-licensed components. Their copyright and permission notices are preserved verbatim in the bundled license files below — these are the authoritative texts that must be carried forward in any distribution or deployment (e.g., in an app's About or credits screen):
release/LICENSE.sarashina2.2-0.5b— Copyright (c) 2025 SB Intuitionsrelease/LICENSE.CAT-Translate-0.8b— Copyright (c) 2026 CyberAgent AI Lab
sbintuitions/sarashina2.2-0.5b (MIT — SB Intuitions, 2025)
│
└── cyberagent/CAT-Translate-0.8b (MIT — CyberAgent AI Lab, 2026)
│
└── LoRA fine-tune (travel domain adaptation)
│
└── imatrix Q4_K_M GGUF (this repo — MIT, Kaito Ishiguro, 2026)
Cleared for: quantizing, re-hosting on HuggingFace, and shipping in a closed-source paid, subscription, or ad-supported app, provided both bundled license files above are preserved.