English → Japanese and English → French translation models, built on top of a multilingual BERT encoder-decoder. This started as a "can I fine-tune BERT to translate?" experiment and ended up with two working models, a pile of training logs, and a couple of throwaway plotting scripts that somehow survived.
    src/
      train_japanese.py       # trains the EN→JA model
      train_french.py         # trains the EN→FR model
      translate_japanese.py   # loads the saved JA model and translates a sentence
      translate_french.py     # same, for FR
      plot_training.py        # quick batch-loss / accuracy plot from a log file
    data/
      jpn.txt                 # Tatoeba EN/JA pairs
      fra.txt                 # Tatoeba EN/FR pairs
    logs/
      training_results_*.txt  # per-batch loss + accuracy for both runs
      graph_*.txt             # filtered logs used by plot_training.py

The trained .pth files aren't in the repo (they're big and the .gitignore keeps them out).
Train your own with the scripts, or wire it up to one you already have.
The model is a transformers.EncoderDecoderModel with bert-base-multilingual-cased on both sides. The
same tokenizer handles English on the way in and Japanese/French on the way out, which is the
whole reason for using the multilingual checkpoint instead of plain BERT.
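
For reference, a minimal sketch of how a BERT-to-BERT model like this is usually assembled with the transformers library. This is not copied from the training scripts: the function name and the choice of special tokens are assumptions, and the real code may set things up differently.

```python
MODEL_NAME = "bert-base-multilingual-cased"


def build_model_and_tokenizer(model_name: str = MODEL_NAME):
    """Build a BERT-to-BERT encoder-decoder and its shared tokenizer."""
    # Imported lazily so this file can be read or imported without transformers.
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # The same checkpoint initialises both halves. The decoder's
    # cross-attention layers are new and only learn during fine-tuning.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(model_name, model_name)

    # BERT has no dedicated BOS/EOS tokens, so reuse [CLS] and [SEP]
    # (an assumption here, though a common choice for this setup).
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.eos_token_id = tokenizer.sep_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling `build_model_and_tokenizer()` downloads the checkpoint on first use, so expect it to take a while the first time.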
Training settings are the same for both languages:
- batch size 16
- 3 epochs
- AdamW, lr=5e-5
- 90/10 train/val split, max sequence length 128
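
The settings above, written out as a small config plus a 90/10 split. This is a sketch only: `TrainConfig`, `split_pairs`, and the fixed seed are illustrative names and choices, not what the training scripts necessarily use.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 16
    epochs: int = 3
    learning_rate: float = 5e-5  # passed to AdamW
    val_fraction: float = 0.1    # 90/10 train/val split
    max_length: int = 128        # max sequence length in tokens


def split_pairs(pairs, val_fraction=0.1, seed=0):
    """Shuffle (source, target) pairs and return (train, val) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


pairs = [(f"en {i}", f"fr {i}") for i in range(100)]
train, val = split_pairs(pairs, TrainConfig().val_fraction)
print(len(train), len(val))  # 90 10
```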
Pairs come from Tatoeba (tatoeba.org). The cleaning step in both training scripts trims
everything after the first sentence terminator (. ? ! 。 ？ ！) so the model isn't trying to
emit multi-sentence outputs.
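
A rough equivalent of that cleaning step, assuming the terminator set is the ASCII `. ? !` plus the Japanese full stop and the full-width question and exclamation marks. The function name is made up for this example.

```python
import re

# ASCII terminators plus the Japanese full stop and full-width ? and !
# (the exact set used by the training scripts is an assumption).
_TERMINATOR = re.compile(r"[.?!。？！]")


def first_sentence(text: str) -> str:
    """Return the text up to and including the first sentence terminator."""
    text = text.strip()
    match = _TERMINATOR.search(text)
    return text if match is None else text[: match.end()]


print(first_sentence("I'm home. Where are you?"))  # I'm home.
print(first_sentence("ただいま。どこにいるの？"))      # ただいま。
```

A rule this simple also cuts at abbreviations and decimals ("Mr. Smith" becomes "Mr."), which is mostly harmless on short Tatoeba sentences but worth knowing about.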
Numbers below are pulled from logs/training_results_japanese.txt and logs/training_results_french.txt.
| | Train loss | Val loss | Val token accuracy |
|---|---|---|---|
| EN → JA | 0.115 | 0.135 | 0.694 |
| EN → FR | 0.053 | 0.055 | 0.850 |
The French run converges noticeably faster and lower — same architecture, same hyperparameters. Japanese is just a harder target with this tokenizer (subword splitting on kana/kanji is uneven) and the dataset has more of the "one English sentence maps to several different Japanese translations" problem, which the model can't really win at.
The training script computes a corpus BLEU at the end, but the references / hypotheses lists
never actually get populated during validation (oversight from an earlier refactor), so the BLEU
in the log files reads as 0.0 and should be ignored. The per-token accuracy and val loss are the
real signal.
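
What the fix could look like, as a sketch: create the two lists once before the validation loop and append to them per batch. `collect_bleu_inputs`, the toy vocabulary, and the `decode` stand-in are all hypothetical. The nested-list shape of `references` assumes a corpus-level BLEU function that accepts several references per sentence, such as NLTK's `corpus_bleu`; the README does not say which implementation the scripts call.

```python
def collect_bleu_inputs(batches, decode):
    """Accumulate tokenised references and hypotheses over all validation batches.

    `batches` yields (pred_ids, label_ids), each a list of id lists (one per
    sentence). `decode` turns one id list into a string. Both are stand-ins
    for whatever the real validation loop produces.
    """
    references, hypotheses = [], []  # created once, outside the loop
    for pred_ids, label_ids in batches:
        for pred, label in zip(pred_ids, label_ids):
            hypotheses.append(decode(pred).split())
            # One reference per sentence, wrapped in a list because
            # corpus-level BLEU usually allows several references each.
            references.append([decode(label).split()])
    return references, hypotheses


# Toy vocabulary and two fake batches, purely for illustration.
vocab = {1: "je", 2: "suis", 3: "ici", 4: "là"}


def decode(ids):
    return " ".join(vocab[i] for i in ids)


batches = [([[1, 2, 3]], [[1, 2, 3]]), ([[1, 2, 4]], [[1, 2, 3]])]
references, hypotheses = collect_bleu_inputs(batches, decode)
print(len(references), len(hypotheses))  # 2 2
```

Whitespace splitting is fine for French but not for Japanese, which has no spaces between words; there you would want character-level or tokenizer-level tokens instead.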
    pip install -r requirements.txt
    cd src
    python train_japanese.py   # or train_french.py

The scripts read data from ../data/ and write logs to ../logs/, so run them from inside src/
(or edit the paths). Model weights get saved to the working directory; move them to models/
if you want the inference scripts to find them with their default path.
To translate a sentence, point model_path in translate_japanese.py or translate_french.py
at your saved .pth and run it. The example sentence is hardcoded at the bottom of the file —
edit it, or refactor into a CLI if you want to be fancy.
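
If you do want the CLI, a minimal sketch with argparse could look like this. The flag name and the `translate` placeholder are assumptions; the placeholder stands in for the load-and-generate code already in the translate scripts.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Translate one English sentence.")
    parser.add_argument("sentence", help="English sentence to translate")
    parser.add_argument(
        "--model-path",
        required=True,
        help="path to the saved .pth weights",
    )
    return parser.parse_args(argv)


def translate(sentence: str, model_path: str) -> str:
    """Placeholder: call the script's existing translation code here."""
    raise NotImplementedError


def main(argv=None):
    args = parse_args(argv)
    print(translate(args.sentence, args.model_path))

# In the script, call main() under the usual `if __name__ == "__main__":` guard.
```

Passing `argv` through explicitly keeps `parse_args` easy to call from a test or a notebook without touching `sys.argv`.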
- The two training scripts are 90% the same file. They should share a module and just pass the language pair as an argument.
- BLEU evaluation needs the references/hypotheses lists populated during the validation loop (decode preds and labels per batch instead of trying to compute BLEU after the lists got reset to [] inside the loop).
- The example test sentences inside the translate_*.py files should be a CLI argument, not hardcoded.
- Longer training (3 epochs was just what fit on the GPU I had at the time).
- The French run looks suspiciously good — worth checking whether the val split is actually unseen data and not duplicates of the training set, since Tatoeba has a lot of near-duplicates.
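
For the duplicate check in the last item, one cheap way to start is to normalize the English side and look for validation sentences that also appear in training. A sketch under that assumption; the function names and the normalization rules are illustrative, and stricter or fuzzier matching may be needed for real near-duplicates.

```python
import re


def normalize(sentence: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    sentence = re.sub(r"[^\w\s]", "", sentence.lower())
    return " ".join(sentence.split())


def leaked_sources(train_pairs, val_pairs):
    """Return val pairs whose normalized source sentence also occurs in train."""
    seen = {normalize(src) for src, _ in train_pairs}
    return [(src, tgt) for src, tgt in val_pairs if normalize(src) in seen]


train = [("I'm home.", "Je suis à la maison."), ("Go away!", "Va-t'en !")]
val = [("Im home", "Je suis chez moi."), ("Thank you.", "Merci.")]
print(leaked_sources(train, val))  # [('Im home', 'Je suis chez moi.')]
```

If a large share of the validation set shows up here, the reported validation numbers are closer to training numbers than they look.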
MIT. See LICENSE.
Dataset credit goes to the Tatoeba project (CC-BY 2.0 FR) — attributions are in the data files themselves.