Slovak Diacritic Restoration

This project is a machine learning solution for automatically restoring missing diacritics (mäkčene, dĺžne) in Slovak text. I built it in an 8-hour sprint (completed within a day) as a submission for the Slovak AI Olympics 2025/26.

The goal was to build a pipeline that takes raw Slovak text without diacritics (e.g., "Mame radi slovensky jazyk") and restores its correct spelling ("Máme radi slovenský jazyk").

Project goal

The original idea was to:

prepare an NLP dataset from scratch without using specialized NLP libraries,
implement a zero-shot baseline using a Masked Language Model,
fine-tune a Token Classification model for the specific task of diacritic generation,
compare the speed and accuracy of both approaches.

What the project does

The pipeline follows this workflow:

Dataset Preparation: Parses raw Slovak Wikipedia dumps and cleans the text entirely with regular expressions (external NLP tools like spaCy/Stanza were forbidden by competition rules).
Text Degradation: Automatically strips diacritics from the text to create pairs of (undiacritized input, true diacritic target).
Zero-shot Inference: Uses gerulata/slovakbert as a Masked Language Model to predict missing diacritics iteratively.
Fine-tuning: Fine-tunes the same model with a Token Classification head. Instead of predicting raw text, the model learns per-token edit labels (e.g., "dazd" -> "1:á,2:ž,3:ď", which yields "dážď").
Evaluation: Compares zero-shot and fine-tuned accuracy across a test set of over 30,000 sentences.
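The Text Degradation step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function names `strip_diacritics` and `make_pair` are hypothetical, and it assumes the target text is plain Unicode where every Slovak accented letter decomposes into a base letter plus a combining mark.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks so that 'Máme' becomes 'Mame'."""
    # NFD splits each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence: str) -> tuple[str, str]:
    """Return (input without diacritics, original target)."""
    return strip_diacritics(sentence), sentence


print(make_pair("Máme radi slovenský jazyk"))
```

Because the input is derived from the target, the two strings always have the same length and differ only at accented positions, which is what makes index-based labels possible later on.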

Dataset

Because external linguistic tools were restricted by the Olympiad's rules, I built a custom parser in data_preparation.ipynb.

Source: Slovak Wikipedia data dumps.
Processing: I used strict Regex patterns to extract clean paragraphs, split them into sentences, and build input-label pairs.
Volume: The final dataset contains ~30,000 sentences (almost 4 million characters).
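As a rough idea of what regex-only cleaning and sentence splitting can look like, here is a small sketch. The patterns and the names `clean_wikitext` and `split_sentences` are assumptions made for illustration; the rules used in data_preparation.ipynb are not shown in this README and are likely stricter.

```python
import re

# Hypothetical patterns, not the ones from the notebook.
REF_TAG = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", re.DOTALL)
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
WIKI_LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")
QUOTES = re.compile(r"'{2,}")
# Split after ., ! or ? when the next word starts with an uppercase letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-ZÁÄČĎÉÍĹĽŇÓÔŔŠŤÚÝŽ])")


def clean_wikitext(text: str) -> str:
    """Remove references, templates and link/bold markup from wikitext."""
    text = REF_TAG.sub("", text)
    # Templates can nest, so strip the innermost ones until none remain.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub("", text)
    text = WIKI_LINK.sub(r"\1", text)  # keep only the visible link text
    text = QUOTES.sub("", text)        # drop ''italic'' / '''bold''' markers
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence splitter; abbreviations such as 'napr.' will fool it."""
    return [s for s in SENTENCE_END.split(paragraph) if s]


raw = "'''Bratislava''' je [[hlavné mesto|hlavným mestom]] [[Slovensko|Slovenska]].{{cit|x}}"
print(clean_wikitext(raw))
print(split_sentences("Bratislava je mesto. Žije tu veľa ľudí! Áno?"))
```

A splitter this simple trades accuracy for speed of implementation, which matches the constraint of working without spaCy or Stanza.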

Note on volume: I originally planned a much larger dataset of 150,000 sentences. However, zero-shot inference was very slow: predicting 3,000 sentences took 28 minutes, so at that rate the full 15,000-sentence validation set would have needed roughly 140 minutes and risked Google Colab shutting down the instance. To finish within the deadline, I scaled the total size down to 30,000 sentences.

Models

I compared two approaches using the SlovakBERT (gerulata/slovakbert) foundation model:

1. Zero-shot (Masked LM)

This approach masks words that lack diacritics and asks the model to fill in the blank with the most statistically probable original Slovak word. This requires no additional training, just clever masking logic. While theoretically sound, it is extremely slow in practice because each missing diacritic requires passing the sentence through the model again.
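The masking loop described above might look roughly like the sketch below. It is not the code from zeroshot_solution.py. The model is passed in as a callable `predict_candidates` (a hypothetical name) that returns ranked fill-in words for a masked sentence, and `<mask>` is assumed to be the mask token. A stub stands in for the model so the example runs offline.

```python
import unicodedata
from typing import Callable, List

MASK = "<mask>"  # assumption: RoBERTa-style mask token


def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def restore_zero_shot(sentence: str,
                      predict_candidates: Callable[[str], List[str]]) -> str:
    """Mask one word at a time and accept the best-ranked candidate
    that equals the original word once its diacritics are stripped."""
    words = sentence.split()
    for i, word in enumerate(words):
        masked = " ".join(words[:i] + [MASK] + words[i + 1:])
        # One model call per word: this is what makes the approach slow.
        for candidate in predict_candidates(masked):
            if strip_diacritics(candidate) == word:
                words[i] = candidate
                break
    return " ".join(words)


def fake_predict(masked: str) -> List[str]:
    """Offline stand-in for the Masked LM (made-up rankings)."""
    table = {
        "<mask> radi slovensky jazyk": ["Máme", "Mame"],
        "Máme radi <mask> jazyk": ["pekný", "slovenský"],
    }
    return table.get(masked, [])


print(restore_zero_shot("Mame radi slovensky jazyk", fake_predict))
```

With a real model, `predict_candidates` could wrap a `transformers` fill-mask pipeline loaded with gerulata/slovakbert and return the `token_str` of each prediction. The sketch ignores punctuation attached to words and capitalization differences, both of which a real pipeline would need to handle.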

2. Fine-tuned (Token Classification)

Instead of predicting whole words from context, I configured the model to perform Token Classification. The model takes a token and predicts the specific character edits it needs. For example, if the input token is dazd, the model outputs a classification label that translates to "replace index 1 with á, index 2 with ž, and index 3 with ď", which produces dážď.
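To make the label format concrete, here is a small sketch of how such "index:character" labels could be built from a training pair and applied at prediction time. The function names and the `KEEP` label for unchanged tokens are assumptions for illustration; finetuning_solution.py may encode labels differently.

```python
KEEP = "KEEP"  # hypothetical label for tokens that need no change


def encode_label(plain: str, target: str) -> str:
    """Build a label such as '0:č,2:č' from a (plain, target) pair."""
    if len(plain) != len(target):
        raise ValueError("plain and target must have the same length")
    ops = [f"{i}:{t}" for i, (p, t) in enumerate(zip(plain, target)) if p != t]
    return ",".join(ops) if ops else KEEP


def apply_label(plain: str, label: str) -> str:
    """Apply a predicted label to a token without diacritics."""
    if label == KEEP:
        return plain
    chars = list(plain)
    for op in label.split(","):
        index, char = op.split(":", 1)
        chars[int(index)] = char
    return "".join(chars)


label = encode_label("cucoriedka", "čučoriedka")
print(label)
print(apply_label("cucoriedka", label))
```

The set of distinct labels seen in the training data becomes the class inventory of the classifier, so prediction reduces to picking one class per token and applying it.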

Architectural Decision: I chose Token Classification over a translation-style Seq2Seq approach. A Seq2Seq model would "translate" text without diacritics into text with diacritics, but it generates its output one token at a time, which makes inference noticeably slower. Because I was under a strict 4-hour deadline, Token Classification, which labels every token in a single forward pass, was much faster and a good fit for this task.

Results

I ran all training and inference on Google Colab's standard T4 GPUs.

One of my biggest surprises in this project was the size of the speed gap. I expected Token Classification to be faster than the Masked LM baseline, but not by this much. Predicting 3,000 sentences with the zero-shot approach took 28 minutes. In contrast, the Token Classification model took only 7 minutes in total, and that included both fine-tuning the model for 2 epochs and running the validation predictions!

The fact that it achieved 97.5% accuracy after just 2 epochs of training completely validated the architectural choice.

Approach	Accuracy	Time to Predict (3k sentences)
Zero-shot	75.0%	~28 mins
Fine-tuned	97.5%	~7 mins (Includes fine-tuning!)
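The README does not state how accuracy was counted. Assuming it is word-level accuracy (the share of words restored exactly), a metric like this could be computed as in the sketch below. The function name `word_accuracy` is hypothetical.

```python
from typing import Iterable


def word_accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Share of reference words that were restored exactly.

    Assumption: prediction and reference have the same number of
    whitespace-separated words, which holds when only diacritics change.
    """
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        pred_words, ref_words = pred.split(), ref.split()
        total += len(ref_words)
        correct += sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / total if total else 0.0


print(word_accuracy(["Máme radi slovensky jazyk"], ["Máme radi slovenský jazyk"]))
```

A character-level variant would score higher on the same output, since most characters in Slovak text carry no diacritic, so the choice of unit matters when comparing numbers.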

Visual Performance

Speed Comparison

Prediction time of the zero-shot and fine-tuned models.

Accuracy Comparison

Fine-tuning on token edit labels beat zero-shot guessing by a wide margin.

The "Zero-shot Inference Bug" (Learning Experience)

While the fine-tuning approach was incredibly successful, the zero-shot baseline involved a major struggle.

When testing the zero-shot model on a small batch of 100 sentences via my evaluate_test_set function, I achieved a promising 75% accuracy. The function correctly handled both long and short sentences.

However, when I deployed the exact same core logic to predict the entire validation dataset and generate predictions_zeroshot.tsv, the final output results were completely nonsensical. I spent a significant portion of my sprint trying to isolate the problem. The core takeaway from this bug was a lesson in scaling logic: a function that perfectly passes a small, isolated test suite does not guarantee success when integrated into a massive batched pipeline. The pipeline's logic for iterating through the validation loop simply failed.
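The README does not identify the root cause, so the following is only a generic guard, not a diagnosis. It relies on one invariant of this task: a prediction with its diacritics stripped must equal the input it was made from. Checking that before writing a predictions file can expose misaligned or corrupted rows early. The name `find_misaligned` is hypothetical.

```python
import unicodedata
from typing import List


def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_misaligned(inputs: List[str], predictions: List[str]) -> List[int]:
    """Return indices where the prediction does not match its input
    once diacritics are removed."""
    if len(inputs) != len(predictions):
        raise ValueError(
            f"row count mismatch: {len(inputs)} inputs, {len(predictions)} predictions"
        )
    return [
        i for i, (src, pred) in enumerate(zip(inputs, predictions))
        if strip_diacritics(pred) != src
    ]


inputs = ["Mame radi", "slovensky jazyk"]
predictions = ["Máme radi", "Máme radi"]  # second row belongs to another input
print(find_misaligned(inputs, predictions))
```

Running a check like this on the first few hundred rows of a long job costs almost nothing compared with a 28-minute inference run.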

Project structure

data_preparation.ipynb - Wikipedia parsing, Regex cleaning, and dataset pair generation.
zeroshot_solution.py - Zero-shot pipeline using Masked LM.
finetuning_solution.py - Token Classification model structure and training loop.
stats.json - Dataset parsing metrics.
REPORT.MD - Slovak summary of my findings and struggles.

How to run

The project is designed to run in Google Colab with a T4 GPU.

Clone the repository and upload the .py and .ipynb files to your Google Drive.

Install dependencies:

pip install torch transformers numpy mwparserfromhell

Run data_preparation.ipynb to download and clean the Wikipedia dump.
Run finetuning_solution.py to train the classification model.
(Optional) Run zeroshot_solution.py to attempt the Masked LM pipeline.

Main takeaways

I learned how large the speed difference between Masked LMs and Token Classification models can be.
I practiced building a dataset from raw XML/Wikipedia dumps using only Regex (dedicated NLP tools would have done this much better).
I learned how a pipeline logic bug can destroy results even if the isolated prediction function achieves 75% accuracy.

Future improvements

If I were to revisit this codebase, I would prioritize:

scaling the training dataset beyond 30,000 sentences (up to my original goal of 150,000+), given the speed of the fine-tuned model,
experimenting with character-level CNNs as an alternative to transformer models.
