This software was developed as a group project for the Natural Language Processing class, under the mentorship of Vuk Batanović, PhD at the University of Belgrade - School of Electrical Engineering, in October 2025.
The goal of this project was to build, analyze and compare named entity recognition (NER) systems for the Serbian language on texts coming from three different domains: books, newspaper articles and Twitter/X posts. In order to do this in a controlled way, we first created a manually annotated gold‑standard corpus with PER, LOC and ORG labels, and then evaluated several models under identical conditions. This project covers the complete process from data collection to final evaluation.
Texts from article and Twitter/X domains were automatically collected from one news portal and multiple public Twitter/X profiles, whilst book excerpts were carefully selected by hand from very different sources. All texts were cleaned, deduplicated, and tokenized using the reldi-tokeniser, and then annotated according to the shared set of annotation rules. Before annotating the full corpus, a separate calibration set was prepared and independently labeled by multiple annotators in order to refine the guidelines and measure agreement.
Afterwards, a simple feature‑based baseline model (Multinomial Naive Bayes) was implemented, trained, and evaluated using the previously created corpus, along with publicly available NER models for Serbian (CLASSLA – standardized and non‑standardized variants, and BERTic). Their performance was compared on the same test data using both BIO and collapsed label variants, as well as across the three source domains. The rest of this document briefly summarizes the setup, while a more detailed description of the motivation, annotation process, experiments and results is given in the Serbian language PDF available in the Documentation directory.
- Python 3.10+
- Git (to clone the repository).
- For scraping news and tweets: network access and a stable internet connection.
Clone the repository:
git clone https://github.com/Zeljko103/natural-language-processing.git
cd natural-language-processing
Create and activate a virtual environment, then install dependencies:
- Windows:
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
- Linux:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
One-time download of CLASSLA models for Serbian (run in a Python shell):
import classla
classla.download("sr")
classla.download("sr", type="nonstandard")
For the X/Twitter scraper, install Playwright browsers (first run only):
playwright install chromium
News articles (Scrapy):
cd scrapers/article_scraper
scrapy crawl article_spider
This saves raw articles under sources/articles/original/.
Tweets (X/Twitter, Playwright):
cd scrapers/x_scraper
python x_scraper.py
This writes tweets to sources/tweets/original/tweets.txt. To keep only tweets posted after a certain date, uncomment the filter_tweets_from_date(...) call in main.py (Phase 1) and set the cutoff timestamp.
From the project root (with the virtual environment activated):
python main.py # preprocessing, tokenization, CLASSLA + BERTic and evaluation
python baseline_model.py # baseline NB model cross-validation and evaluation
All intermediate files and evaluation reports are written to the folders defined in paths.py (datasets under dataset/, model outputs and scores under models/).
Below is a list of short explanations about the key directories and files in this project:
- scrapers — contains programs for automated collection of texts from news portals and Twitter.
- sources — contains all collected and filtered texts.
- dataset — contains tokenized and annotated data from collected texts.
- callibration_set — individually annotated subsets of collected texts, that were used to define and improve annotation rules before applying them on all collected data.
- gold_standard_set — completely annotated data from collected texts.
- models — contains NER results and evaluations for all used models.
- helper_functions — contains all functions used for filtering texts, tokenization, annotation, evaluation and working with publicly available models.
- paths.py — defines paths to all files that are used and generated during program execution.
- baseline_model.py — implementation of a baseline approach and functions for its evaluation.
- main.py — filters collected text and prepares it for annotation, performs NER using publicly available models and evaluates their predictions.
To cover both formal and informal text, data was collected from three fundamentally different source domains: books (5,514 tokens), newspaper articles (5,288 tokens), and Twitter posts (5,267 tokens). Books were carefully selected to be as diverse as possible in terms of genres, authors, audiences, time periods, and story locations. News articles and Twitter posts were scraped using the Scrapy framework and the Playwright library.
Collected texts were tokenized using reldi-tokeniser and manually annotated with PER, LOC and ORG labels. The number of identified and marked named entities in all texts together is shown in the image below:
Three types of NER systems were evaluated on the same manually annotated Serbian corpus and their performances were compared across domains and label variants.
The baseline is a simple feature-based token classifier trained with Multinomial Naive Bayes (baseline_model.py). It uses:
- Local token features (lowercased form, capitalization flags, presence of digits, token position in the sentence).
- Left context features (lowercased forms and casing information for a fixed number of preceding tokens).
- Short prefixes and suffixes (1–3 characters) and character trigrams, which help with rich morphology.
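The feature groups listed above can be sketched roughly as follows. This is a minimal illustration and not the code in baseline_model.py: the function name `token_features`, the feature names, the `<BOS>` padding value and the window size of 2 are all assumptions made for the example. The output is a list of plain dicts, which is the kind of input DictVectorizer accepts.

```python
def token_features(tokens, i, window=2):
    """Build a feature dict for tokens[i] (hypothetical names and window size)."""
    tok = tokens[i]
    low = tok.lower()
    feats = {
        "lower": low,                                  # lowercased form
        "is_title": tok.istitle(),                     # capitalization flags
        "is_upper": tok.isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),  # presence of digits
        "position": i,                                 # position in the sentence
    }
    # Prefixes and suffixes of 1-3 characters.
    for n in (1, 2, 3):
        if len(low) >= n:
            feats[f"prefix{n}"] = low[:n]
            feats[f"suffix{n}"] = low[-n:]
    # Character trigrams as binary indicator features.
    for j in range(len(low) - 2):
        feats[f"tri={low[j:j + 3]}"] = True
    # Left context: lowercased form and casing of the preceding tokens.
    for k in range(1, window + 1):
        if i - k >= 0:
            prev = tokens[i - k]
            feats[f"prev{k}_lower"] = prev.lower()
            feats[f"prev{k}_is_title"] = prev.istitle()
        else:
            feats[f"prev{k}_lower"] = "<BOS>"  # assumed padding marker
    return feats


sentence = ["Novak", "Đoković", "živi", "u", "Beogradu", "."]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

All values are strings, booleans or non-negative integers, so the resulting vectors stay non-negative, which Multinomial Naive Bayes requires.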
These features are converted to sparse vectors using DictVectorizer, and a separate model is trained in each fold of a 10-fold GroupKFold cross-validation, where sentences are used as groups to avoid leakage between train and test. The model is evaluated in two variants:
- BIO labels: full BIO tags as they appear in the gold standard.
- Collapsed labels: entity type only (PER/LOC/ORG vs. O), using the collapse_label helper.
For each variant the code reports per-class precision, recall and F1 per fold (CSV) and aggregated reports over all tokens and domains.
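To make the difference between the two label variants concrete, here is one way collapsing could work. The body below is an assumed re-implementation written for illustration; the project's own collapse_label helper may differ in signature and edge-case handling.

```python
def collapse_label(tag):
    """Drop the B-/I- prefix so that only the entity type remains.

    Hypothetical sketch; not copied from the project's helper.
    """
    if tag == "O":
        return "O"
    prefix, sep, entity_type = tag.partition("-")
    # Tags without a recognized BIO prefix are returned unchanged.
    return entity_type if sep and prefix in ("B", "I") else tag


bio = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
collapsed = [collapse_label(t) for t in bio]
```

In the collapsed variant, a prediction of I-PER where the gold tag is B-PER counts as correct, so boundary errors inside an entity are no longer penalized.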
The project used two pretrained CLASSLA NER models for Serbian:
- Standardized Serbian: tuned for standard language.
- Non-standardized Serbian: tuned for non-standard and social media text.
Both models were applied to:
- Book texts.
- Newspaper articles.
- Twitter/X posts.
For each domain (and for all domains together), evaluation was done in BIO and collapsed modes. The results were written to models/evaluations/ and include the usual classification tables, as well as files listing token-level mismatches between predictions and the gold standard.
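The two kinds of output mentioned above, per-class scores and token-level mismatch lists, can be sketched in plain Python as follows. This is only an illustration under assumed names (`per_class_scores`, `mismatches`) and a made-up example sentence; the project's evaluation code and file formats may look different.

```python
from collections import Counter


def per_class_scores(gold, pred):
    """Token-level precision, recall and F1 for every label (illustrative)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def mismatches(tokens, gold, pred):
    """List (index, token, gold, predicted) for every disagreement."""
    return [
        (i, tok, g, p)
        for i, (tok, g, p) in enumerate(zip(tokens, gold, pred))
        if g != p
    ]


# Made-up example with collapsed labels.
tokens = ["Ana", "radi", "u", "Telekomu", "u", "Nišu"]
gold = ["PER", "O", "O", "ORG", "O", "LOC"]
pred = ["PER", "O", "O", "LOC", "O", "LOC"]

scores = per_class_scores(gold, pred)
errors = mismatches(tokens, gold, pred)
```

Here the single ORG/LOC confusion lowers LOC precision and ORG recall at the same time, which is the kind of pattern the mismatch files make easy to inspect.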
The third evaluated system was a transformer-based NER model based on BERTic for South Slavic languages, loaded via simpletransformers as an ELECTRA model (classla/bcms-bertic-ner). In this project, it was used as an off-the-shelf model with a predefined label set and maximum sequence length, without additional fine-tuning on the project data.
BERTic was run on the same tokenized gold standard sentences, and its predictions were post-processed so that the labels were mapped into the same tag set as the corpus (including collapsing variants when requested). As was the case with the other models, performance was reported per domain and over all domains together, in both BIO and collapsed setups.
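The label post-processing step might look roughly like the sketch below. It assumes that the off-the-shelf model can emit entity types outside the corpus tag set (a MISC type is used here as an example) and that such tags are mapped to O; both the function name `map_to_corpus_tag` and this mapping rule are assumptions, not a description of the project's code.

```python
# Entity types annotated in the gold-standard corpus.
CORPUS_TYPES = {"PER", "LOC", "ORG"}


def map_to_corpus_tag(model_tag, collapse=False):
    """Map a model tag to the corpus tag set (hypothetical helper).

    Tags whose entity type is not annotated in the corpus become "O".
    With collapse=True the B-/I- prefix is dropped as well.
    """
    if model_tag == "O" or "-" not in model_tag:
        return "O"
    prefix, entity_type = model_tag.split("-", 1)
    if entity_type not in CORPUS_TYPES:
        return "O"  # assumed rule: unsupported types count as non-entities
    return entity_type if collapse else f"{prefix}-{entity_type}"


predicted = ["B-PER", "I-PER", "O", "B-MISC", "I-MISC", "B-LOC"]
bio_tags = [map_to_corpus_tag(t) for t in predicted]
flat_tags = [map_to_corpus_tag(t, collapse=True) for t in predicted]
```

Mapping unsupported types to O keeps the comparison with the gold standard fair, since the corpus never marks those tokens as entities.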
The following graph compares the resulting performances of all used NER models when tasked with recognizing persons, organizations and locations in texts written in the Serbian language:


