Shruichan/TranslationTransformer

TranslationTransformer

English → Japanese and English → French translation models, built on top of a multilingual BERT encoder-decoder. This started as a "can I fine-tune BERT to translate?" experiment and ended up with two working models, a pile of training logs, and a couple of throwaway plotting scripts that somehow survived.

What's in here

src/
  train_japanese.py        # trains the EN→JA model
  train_french.py          # trains the EN→FR model
  translate_japanese.py    # loads the saved JA model and translates a sentence
  translate_french.py      # same, for FR
  plot_training.py         # quick batch-loss / accuracy plot from a log file
data/
  jpn.txt                  # Tatoeba EN/JA pairs
  fra.txt                  # Tatoeba EN/FR pairs
logs/
  training_results_*.txt   # per-batch loss + accuracy for both runs
  graph_*.txt              # filtered logs used by plot_training.py

The trained .pth files aren't in the repo (they're big and the .gitignore keeps them out). Train your own with the scripts, or point the inference scripts at weights you already have.

The model

It's a transformers.EncoderDecoderModel with bert-base-multilingual-cased on both sides. The same tokenizer handles English on the way in and Japanese/French on the way out, which is the whole reason for using the multilingual checkpoint instead of plain BERT.
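A minimal sketch of how such a model can be assembled with the Hugging Face API. The function name and the special-token wiring are assumptions for illustration (the training scripts may set these differently); the import is done lazily so the snippet can be read and imported without downloading anything.

```python
CHECKPOINT = "bert-base-multilingual-cased"


def build_model_and_tokenizer(checkpoint: str = CHECKPOINT):
    """Build a BERT-to-BERT encoder-decoder plus its shared tokenizer.

    Hypothetical helper, not taken from the repo's scripts.
    """
    # Imported here so that merely importing this module needs no heavy deps.
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Same checkpoint on both sides; the decoder gets cross-attention layers
    # added and is switched to causal (left-to-right) mode by this call.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)

    # BERT has no dedicated BOS/EOS tokens, so a common convention
    # (an assumption here) is to reuse [CLS] and [SEP].
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.eos_token_id = tokenizer.sep_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling `build_model_and_tokenizer()` downloads the checkpoint on first use.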

Training settings are the same for both languages:

  • batch size 16
  • 3 epochs
  • AdamW, lr=5e-5
  • 90/10 train/val split, max sequence length 128
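The settings above, written out as constants together with one plausible way to do the 90/10 split. The seed, the shuffle-then-slice approach, and the names are assumptions; the actual scripts may split differently.

```python
import random
from typing import List, Sequence, Tuple

# Hyperparameters as listed above.
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 5e-5
MAX_SEQ_LEN = 128
VAL_FRACTION = 0.1

Pair = Tuple[str, str]  # (english, target-language) sentence pair


def split_pairs(
    pairs: Sequence[Pair], val_fraction: float = VAL_FRACTION, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle deterministically, then slice off the validation share."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

A fixed seed keeps the validation set stable between runs, which makes the loss numbers comparable.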

Pairs come from Tatoeba (tatoeba.org). The cleaning step in both training scripts trims everything after the first sentence terminator (. ? ! 。 ？ ！) so the model isn't trying to emit multi-sentence outputs.
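The trimming step could look roughly like this. It is a sketch rather than the code from the training scripts: the function name is made up, and the terminator set assumes both ASCII and full-width punctuation.

```python
import re

# ASCII terminators plus the Japanese full stop and full-width ? and !
TERMINATORS = ".?!。？！"

# Lazily match from the start up to and including the first terminator.
_FIRST_SENTENCE = re.compile(rf"^.*?[{re.escape(TERMINATORS)}]", re.DOTALL)


def trim_to_first_sentence(text: str) -> str:
    """Keep only the first sentence; return the text unchanged if no terminator is found."""
    text = text.strip()
    match = _FIRST_SENTENCE.match(text)
    return match.group(0) if match else text
```

Note that a rule this simple also cuts at abbreviations ("Mr. Smith" becomes "Mr."), which is a trade-off of keeping the cleaning step to a single pattern.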

Results

Numbers below are pulled from logs/training_results_japanese.txt and logs/training_results_french.txt.

| Model | Train loss | Val loss | Val token accuracy |
|---|---|---|---|
| EN → JA | 0.115 | 0.135 | 0.694 |
| EN → FR | 0.053 | 0.055 | 0.850 |
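For reference, "Val token accuracy" is presumably the fraction of non-padding target tokens that the model predicts exactly. A plain-Python sketch of that metric, assuming ignored label positions are marked with -100 (the usual PyTorch/Hugging Face convention, not confirmed for these scripts):

```python
from typing import Sequence

IGNORE_INDEX = -100  # assumption: label value for positions to skip


def token_accuracy(
    pred_ids: Sequence[Sequence[int]],
    label_ids: Sequence[Sequence[int]],
    ignore_index: int = IGNORE_INDEX,
) -> float:
    """Share of non-ignored label positions where prediction == label."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(pred_ids, label_ids):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue  # padding does not count either way
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0
```

Because it compares position by position, this metric is harsh on outputs that are correct but shifted by one token.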

The French run converges noticeably faster and to a lower loss, with the same architecture and the same hyperparameters. Japanese is just a harder target with this tokenizer (subword splitting on kana/kanji is uneven), and the dataset has more of the "one English sentence maps to several different Japanese translations" problem, which the model can't really win at when it is scored against a single reference.

The training script computes a corpus BLEU at the end, but the references / hypotheses lists never actually get populated during validation (oversight from an earlier refactor), so the BLEU in the log files reads as 0.0 and should be ignored. The per-token accuracy and val loss are the real signal.
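A sketch of the fix, assuming predictions and labels are available per batch as lists of token ids. The names are illustrative, and `decode` stands in for something like the tokenizer's decode method. The key point is that both lists are created once, before the loop, and only appended to inside it.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

IGNORE_INDEX = -100  # assumption: label value used for ignored positions

Batch = Tuple[Sequence[Sequence[int]], Sequence[Sequence[int]]]


def collect_bleu_inputs(
    batches: Iterable[Batch],
    decode: Callable[[Sequence[int]], str],
    pad_id: int = 0,
) -> Tuple[List[List[List[str]]], List[List[str]]]:
    """Return (references, hypotheses) in the nested-list shape that
    corpus-level BLEU implementations commonly expect."""
    references: List[List[List[str]]] = []  # created once, outside the loop
    hypotheses: List[List[str]] = []
    for pred_ids, label_ids in batches:
        for pred_row, label_row in zip(pred_ids, label_ids):
            # Ignored positions cannot be decoded; map them to padding first.
            clean_labels = [pad_id if t == IGNORE_INDEX else t for t in label_row]
            hypotheses.append(decode(pred_row).split())
            references.append([decode(clean_labels).split()])  # one reference each
    return references, hypotheses
```

Whitespace splitting is a reasonable tokenization for French but not for Japanese, where character-level or morphological tokenization would be needed before BLEU means much.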

Running it

pip install -r requirements.txt
cd src
python train_japanese.py     # or train_french.py

The scripts read data from ../data/ and write logs to ../logs/, so run them from inside src/ (or edit the paths). Model weights get saved to the working directory; move them to models/ if you want the inference scripts to find them with their default path.
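If editing the paths, one option is to resolve them relative to the script file instead of the working directory. This is a sketch with a hypothetical helper, assuming the layout shown earlier (src/, data/, logs/) plus the models/ folder mentioned above:

```python
from pathlib import Path
from typing import Dict


def project_dirs(script_file: str) -> Dict[str, Path]:
    """Derive data/logs/models directories from a script living in src/."""
    src_dir = Path(script_file).resolve().parent
    root = src_dir.parent  # repository root, one level above src/
    return {
        "data": root / "data",
        "logs": root / "logs",
        "models": root / "models",
    }
```

Inside a training script this would be called as `project_dirs(__file__)`, after which the script runs the same from any directory.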

To translate a sentence, point model_path in translate_japanese.py or translate_french.py at your saved .pth and run it. The example sentence is hardcoded at the bottom of the file — edit it, or refactor into a CLI if you want to be fancy.
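For the "fancy" version, a small argparse wrapper is enough. The flag and argument names below are illustrative assumptions, and the actual model loading and translation call are left out because they depend on the script's internals.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the sentence and weights path from the command line."""
    parser = argparse.ArgumentParser(
        description="Translate one English sentence with a saved model."
    )
    parser.add_argument("sentence", help="English sentence to translate")
    parser.add_argument(
        "--model-path",
        required=True,
        help="path to the saved .pth weights",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # Hand args.model_path and args.sentence to the existing translation code.
    print(f"model: {args.model_path}")
    print(f"input: {args.sentence}")
```

Taking `argv` as a parameter keeps the parser testable without touching `sys.argv`.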

Things I'd change if I picked this up again

  • The two training scripts are 90% the same file. They should share a module and just pass the language pair as an argument.
  • BLEU evaluation needs the references/hypotheses lists populated during the validation loop (decode preds and labels per batch instead of trying to compute BLEU after the lists got reset to [] inside the loop).
  • The example test sentences inside the translate_*.py files should be a CLI argument, not hardcoded.
  • Longer training (3 epochs was just what fit on the GPU I had at the time).
  • The French run looks suspiciously good — worth checking whether the val split is actually unseen data and not duplicates of the training set, since Tatoeba has a lot of near-duplicates.
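The duplicate check from the last bullet can be sketched in a few lines: normalize the English side and count validation pairs whose source already appears in training. The normalization rule (lowercase, drop punctuation, collapse whitespace) is an assumption, and stricter or fuzzier matching may be more appropriate.

```python
import re
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (english, target-language)


def normalize(sentence: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", sentence.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def leaked_pairs(train: Iterable[Pair], val: Iterable[Pair]) -> List[Pair]:
    """Validation pairs whose normalized source sentence also occurs in train."""
    seen = {normalize(src) for src, _ in train}
    return [(src, tgt) for src, tgt in val if normalize(src) in seen]
```

A high ratio of `len(leaked_pairs(train, val))` to `len(val)` would suggest that the validation numbers overstate how well the model generalizes.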

License

MIT. See LICENSE.

Dataset credit goes to the Tatoeba project (CC-BY 2.0 FR) — attributions are in the data files themselves.
