mrtineu/fix-diacritic

Slovak Diacritic Restoration

This project is a machine learning solution for automatically restoring missing diacritics (mäkčene, dĺžne) in Slovak text. I created it in an 8-hour sprint (completed within a day) as a submission for the Slovak AI Olympics 2025/26.

The goal was to build a pipeline capable of taking raw, non-diacritic Slovak text (e.g., "Mame radi slovensky jazyk") and restoring its proper grammatical form ("Máme radi slovenský jazyk").

Project goal

The original idea was to:

  • prepare an NLP dataset from scratch without using specialized NLP libraries,
  • implement a zero-shot baseline using a Masked Language Model,
  • fine-tune a Token Classification model for the specific task of diacritic generation,
  • compare the speed and accuracy of both approaches.

What the project does

The pipeline follows this workflow:

  1. Dataset Preparation: Downloads raw Slovak Wikipedia dumps and cleans the text using only regular expressions (external NLP tools like spaCy/Stanza were forbidden by competition rules).
  2. Text Degradation: Automatically strips diacritics from the text to create pairs of (undiacritized input, true diacritic target).
  3. Zero-shot Inference: Uses gerulata/slovakbert as a Masked Language Model to predict missing diacritics iteratively.
  4. Fine-tuning: Fine-tunes the same model with a Token Classification head. Instead of predicting raw text, the model learns per-token edit labels (e.g., "dazd" -> "1:á,2:ž,3:ď").
  5. Evaluation: Compares zero-shot and fine-tuned accuracy across a test set of over 30,000 sentences.
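The text degradation step (step 2) can be sketched with the standard library alone. This is a minimal illustration, not the code from the repository; the function names `strip_diacritics` and `make_pair` are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove diacritics, e.g. 'á' -> 'a', 'ď' -> 'd', 'ô' -> 'o'."""
    # NFD splits every accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence: str) -> tuple[str, str]:
    """Build one (undiacritized input, diacritized target) training pair."""
    return strip_diacritics(sentence), sentence


if __name__ == "__main__":
    print(make_pair("Máme radi slovenský jazyk"))
```

Every Slovak diacritic letter decomposes into one base letter plus one combining mark, so the stripped text has the same length as the original. This keeps character indices aligned between input and target.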

Dataset

Because external linguistic tools were restricted by the Olympiad's rules, I built a custom parser in data_preparation.ipynb.

  • Source: Slovak Wikipedia data dumps.
  • Processing: I used strict Regex patterns to extract clean paragraphs, split them into sentences, and build input-label pairs.
  • Volume: The final dataset contains ~30,000 sentences (almost 4 million characters).

Note on volume: I originally planned a much larger dataset of 150,000 sentences. However, zero-shot inference was very slow: predicting 3,000 sentences took 28 minutes, so the full 15,000-sentence validation set would have needed roughly 140 minutes at the same rate, long enough to risk Google Colab shutting down the instance. To finish within the deadline, I scaled the total size down to 30,000.
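As a rough illustration of regex-only processing, the sketch below strips a few common wiki markup constructs and splits the result into sentences. The patterns are assumptions made for this example, not the ones used in data_preparation.ipynb, and they ignore harder cases such as nested templates and abbreviations.

```python
import re

# Uppercase letters that may start a Slovak sentence.
UPPER = "A-ZÁÄČĎÉÍĹĽŇÓÔŔŠŤÚÝŽ"


def clean_wikitext(raw: str) -> str:
    """Strip a handful of wiki markup constructs using regular expressions only."""
    text = re.sub(r"<ref[^>]*?/>", "", raw)                            # <ref name="x"/>
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)   # <ref>...</ref>
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                         # non-nested {{templates}}
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)      # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                                  # bold/italic quotes
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ? when the next word starts with an uppercase letter."""
    return re.split(rf"(?<=[.!?])\s+(?=[{UPPER}])", text)


if __name__ == "__main__":
    raw = ("'''Bratislava''' je [[hlavné mesto]] [[Slovensko|Slovenska]]."
           "<ref>zdroj</ref> {{Infobox}} Leží na Dunaji.")
    print(split_sentences(clean_wikitext(raw)))
```

A splitter this simple will break on abbreviations followed by a capitalized word, which is one reason dedicated NLP tooling usually does better.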

Models

I compared two approaches using the SlovakBERT (gerulata/slovakbert) foundation model:

1. Zero-shot (Masked LM)

This approach masks words that lack diacritics and asks the model to fill in the blank with the most statistically probable original Slovak word. This requires no additional training, just clever masking logic. While theoretically sound, it is extremely slow in practice because each missing diacritic requires passing the sentence through the model again.
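One way to picture the masking logic is shown below. This is a sketch under assumptions rather than the code in zeroshot_solution.py: `predict_masked` is a hypothetical callable that wraps the Masked LM and returns ranked candidate words for a sentence containing one mask token, and the rule "accept the first candidate that matches the input once diacritics are removed" is an assumed selection strategy.

```python
import unicodedata
from typing import Callable, Sequence


def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def pick_restoration(plain_word: str, ranked_predictions: Sequence[str]) -> str:
    """Return the best-ranked prediction that equals plain_word without diacritics."""
    for candidate in ranked_predictions:
        candidate = candidate.strip()
        if strip_diacritics(candidate) == plain_word:
            return candidate
    return plain_word  # no compatible prediction: keep the input word unchanged


def restore_sentence(words: Sequence[str],
                     predict_masked: Callable[[str], Sequence[str]],
                     mask_token: str = "<mask>") -> str:
    """Mask one word at a time; every word costs one call to the model."""
    restored = list(words)
    for i, word in enumerate(words):
        masked = restored[:i] + [mask_token] + restored[i + 1:]
        restored[i] = pick_restoration(word, predict_masked(" ".join(masked)))
    return " ".join(restored)
```

Because the loop calls the model once per word, the cost grows with sentence length, which is consistent with the slow runtime described above.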

2. Fine-tuned (Token Classification)

Instead of generating text, I configured the model to perform Token Classification. The model takes a token and predicts a label that encodes character-level edits. For example, if the input token is dazd, the model outputs a classification label that translates to "replace index 1 with á, index 2 with ž, and index 3 with ď", which yields dážď.
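The label scheme can be sketched as a pair of helper functions, one that derives a label from a training pair and one that applies a predicted label. The names `encode_label` and `apply_label` and the no-op label "KEEP" are hypothetical; the "index:character" format follows the example in this README. Note that fully restoring dazd to dážď takes three edits (indices 1, 2 and 3).

```python
import unicodedata

KEEP = "KEEP"  # hypothetical label for tokens that need no change


def encode_label(plain: str, target: str) -> str:
    """Describe how to turn `plain` into `target` as 'index:char' edits."""
    plain = unicodedata.normalize("NFC", plain)
    target = unicodedata.normalize("NFC", target)
    if len(plain) != len(target):
        raise ValueError("plain and target must have the same length")
    ops = [f"{i}:{t}" for i, (p, t) in enumerate(zip(plain, target)) if p != t]
    return ",".join(ops) if ops else KEEP


def apply_label(plain: str, label: str) -> str:
    """Apply a predicted edit label to an undiacritized token."""
    chars = list(unicodedata.normalize("NFC", plain))
    if label == KEEP:
        return "".join(chars)
    for op in label.split(","):
        index, char = op.split(":", 1)
        position = int(index)
        if position < len(chars):  # ignore edits that fall outside the token
            chars[position] = char
    return "".join(chars)


if __name__ == "__main__":
    label = encode_label("dazd", "dážď")
    print(label, "->", apply_label("dazd", label))
```

Because each distinct edit string becomes one class, restoration reduces to a single classification decision per token.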

Architectural Decision: I chose Token Classification over a translation-style Seq2Seq approach. A Seq2Seq model would "translate" undiacritized text into diacritized text, but it generates its output token by token, which makes inference noticeably slower. Because I was under a strict 4-hour deadline, the single forward pass of Token Classification made it much faster and a better fit for this task.

Results

I trained and ran all models exclusively on Google Colab's standard T4 GPUs.

One of my biggest surprises in this project was the size of the speed gap. I expected Token Classification to be faster than the Masked LM baseline, but not by this much. Predicting 3,000 sentences via the zero-shot approach took 28 minutes. In contrast, the Token Classification model took only 7 minutes in total, and that included both fine-tuning the model for 2 epochs and running the validation predictions.

The fact that it achieved 97.5% accuracy after just 2 epochs of training completely validated the architectural choice.

| Approach   | Accuracy | Time to predict (3k sentences)  |
|------------|----------|---------------------------------|
| Zero-shot  | 75.0%    | ~28 mins                        |
| Fine-tuned | 97.5%    | ~7 mins (includes fine-tuning)  |
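The README does not state how accuracy was measured, so the sketch below assumes word-level exact match between prediction and reference. It also converts the reported timings into throughput. Both helper names are hypothetical.

```python
def word_accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of reference words reproduced exactly (assumed metric)."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        pred_words, ref_words = pred.split(), ref.split()
        total += len(ref_words)
        # zip stops at the shorter sequence, so missing words count as errors.
        correct += sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / total if total else 0.0


def sentences_per_second(n_sentences: int, minutes: float) -> float:
    return n_sentences / (minutes * 60)


if __name__ == "__main__":
    zero_shot = sentences_per_second(3000, 28)   # about 1.8 sentences/s
    fine_tuned = sentences_per_second(3000, 7)   # about 7.1 sentences/s, training included
    print(f"zero-shot {zero_shot:.2f}/s, fine-tuned {fine_tuned:.2f}/s, "
          f"ratio {fine_tuned / zero_shot:.1f}x")
```

On the reported numbers the fine-tuned run is about four times faster end to end, even though its 7 minutes include training.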

Visual Performance

Speed Comparison

Figure: Prediction time of the zero-shot and fine-tuned models.

Accuracy Comparison

Figure: Fine-tuning on token alterations showed a dominant margin over zero-shot guessing.

The "Zero-shot Inference Bug" (Learning Experience)

While the fine-tuning approach was incredibly successful, the zero-shot baseline involved a major struggle.

When testing the zero-shot model on a small batch of 100 sentences via my evaluate_test_set function, I achieved a promising 75% accuracy. The function correctly handled both long and short sentences.

However, when I deployed the exact same core logic to predict the entire validation dataset and generate predictions_zeroshot.tsv, the final output results were completely nonsensical. I spent a significant portion of my sprint trying to isolate the problem. The core takeaway from this bug was a lesson in scaling logic: a function that perfectly passes a small, isolated test suite does not guarantee success when integrated into a massive batched pipeline. The pipeline's logic for iterating through the validation loop simply failed.
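A cheap invariant can catch this class of failure early: a restored sentence with its diacritics removed must equal its input exactly. The sketch below is a generic sanity check, not a diagnosis of the actual bug; `find_misaligned` is a hypothetical helper that could run on the predictions before they are written to predictions_zeroshot.tsv.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def find_misaligned(inputs: list[str], predictions: list[str]) -> list[int]:
    """Return indices where the prediction is not a diacritized form of its input."""
    if len(inputs) != len(predictions):
        raise ValueError(
            f"{len(inputs)} inputs but {len(predictions)} predictions"
        )
    return [
        i for i, (source, predicted) in enumerate(zip(inputs, predictions))
        if strip_diacritics(predicted) != source
    ]


if __name__ == "__main__":
    inputs = ["Mame radi", "slovensky jazyk"]
    shuffled = ["slovenský jazyk", "Máme radi"]  # rows out of order
    print(find_misaligned(inputs, shuffled))
```

Running the check on the first few hundred rows of a long job takes seconds and flags shifted, truncated, or reordered output before hours of compute are spent.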

Project structure

  • data_preparation.ipynb - Wikipedia parsing, Regex cleaning, and dataset pair generation.
  • zeroshot_solution.py - Zero-shot pipeline using Masked LM.
  • finetuning_solution.py - Token Classification model structure and training loop.
  • stats.json - Dataset parsing metrics.
  • REPORT.MD - Slovak summary of my findings and struggles.

How to run

The project is designed to run in the Google Colab environment with T4 GPUs.

  1. Clone the repository and upload the .py and .ipynb files to your Google Drive.
  2. Install dependencies:
    pip install torch transformers numpy mwparserfromhell
  3. Run data_preparation.ipynb to download and clean the Wikipedia dump.
  4. Run finetuning_solution.py to train the classification model.
  5. (Optional) Run zeroshot_solution.py to attempt the Masked LM pipeline.

Main takeaways

  • I learned how large the speed difference between Masked LMs and Token Classification models can be.
  • I practiced building a dataset from raw XML/Wikipedia dumps using only Regex (dedicated NLP tools would have been much better).
  • I learned how a pipeline logic bug can destroy results even if the isolated prediction function achieves 75% accuracy.

Future improvements

If I were to revisit this codebase, I would prioritize:

  • scaling the training dataset beyond 30,000 sentences (up to my original goal of 150,000+), given the speed of the fine-tuned model,
  • experimenting with character-level CNNs as an alternative to transformer models.
