Named Entity Recognition (NER) in Texts Written in the Serbian Language

This software was developed as a group project for the Natural Language Processing class, under the mentorship of Vuk Batanović, PhD, at the University of Belgrade - School of Electrical Engineering, in October 2025.

Table of Contents

  1. Introduction
  2. Software Requirements and Execution
  3. Project Structure
  4. Collected Data
  5. Models
  6. Results

1. Introduction

The goal of this project was to build, analyze and compare named entity recognition (NER) systems for the Serbian language on texts coming from three different domains: books, newspaper articles and Twitter/X posts. In order to do this in a controlled way, we first created a manually annotated gold‑standard corpus with PER, LOC and ORG labels, and then evaluated several models under identical conditions. This project covers the complete process from data collection to final evaluation.

Texts from the article and Twitter/X domains were collected automatically from one news portal and multiple public Twitter/X profiles, while book excerpts were selected by hand from a diverse range of sources. All texts were cleaned, deduplicated, and tokenized using reldi-tokeniser, and then annotated according to a shared set of annotation rules. Before the full corpus was annotated, a separate calibration set was prepared and independently labeled by multiple annotators in order to refine the guidelines and measure agreement.
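The text above does not say which agreement measure was used on the calibration set. As one common option, the sketch below computes Cohen's kappa over two annotators' token-level labels. The function name, the choice of kappa and the example labels are illustrative assumptions, not the project's code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty token list")
    n = len(labels_a)
    # Observed agreement: share of tokens given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six tokens (assumption, not project data).
annotator_1 = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
annotator_2 = ["B-PER", "O", "O", "O", "B-LOC", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Because most tokens in NER data are labelled O, raw percentage agreement tends to look high; a chance-corrected measure such as kappa is less flattering and usually more informative.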

Afterwards, a simple feature-based baseline model (Multinomial Naive Bayes) was implemented, trained, and evaluated on this corpus. Publicly available NER models for Serbian (CLASSLA, in its standardized and non-standardized variants, and BERTic) were evaluated alongside it. All models were compared on the same test data using both BIO and collapsed label variants, and across the three source domains. The rest of this document briefly summarizes the setup; a more detailed description of the motivation, annotation process, experiments and results is given in the Serbian-language PDF in the Documentation directory.


2. Software Requirements and Execution

2.1. Requirements

  • Python 3.10+
  • Git (to clone the repository).
  • For scraping news and tweets: network access and a stable internet connection.

2.2. Installation and environment setup

Clone the repository:

git clone https://github.com/Zeljko103/natural-language-processing.git
cd natural-language-processing

Create and activate a virtual environment, then install dependencies:

  • Windows:

    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt
  • Linux:

    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

One-time download of CLASSLA models for Serbian (run in a Python shell):

import classla
classla.download("sr")
classla.download("sr", type="nonstandard")

For the X/Twitter scraper, install Playwright browsers (first run only):

playwright install chromium

2.3. Running the scrapers

News articles (Scrapy):

cd scrapers/article_scraper
scrapy crawl article_spider

This saves raw articles under sources/articles/original/.

Tweets (X/Twitter, Playwright):

cd scrapers/x_scraper
python x_scraper.py

This writes tweets to sources/tweets/original/tweets.txt. If you want to keep only tweets after a certain date, you can uncomment the filter_tweets_from_date(...) call in main.py (Phase 1) and set the cutoff timestamp.

2.4. Running preprocessing and models

From the project root (with the virtual environment activated):

python main.py          # preprocessing, tokenization, CLASSLA + BERTic and evaluation
python baseline_model.py  # baseline NB model cross-validation and evaluation

All intermediate files and evaluation reports are written to the folders defined in paths.py (datasets under dataset/, model outputs and scores under models/).


3. Project Structure

The key directories and files in this project are:

  • scrapers: contains programs for automated collection of texts from news portals and Twitter/X.
  • sources: contains all collected and filtered texts.
  • dataset: contains tokenized and annotated data from collected texts.
    • callibration_set: individually annotated subsets of the collected texts, which were used to define and improve the annotation rules before applying them to all collected data.
    • gold_standard_set: fully annotated data from the collected texts.
  • models: contains NER results and evaluations for all models used.
  • helper_functions: contains all functions used for filtering texts, tokenization, annotation, evaluation and working with publicly available models.
  • paths.py: defines paths to all files that are used and generated during program execution.
  • baseline_model.py: implementation of the baseline approach and functions for its evaluation.
  • main.py: filters collected text and prepares it for annotation, performs NER using publicly available models and evaluates their predictions.

4. Collected Data

To cover both formal and informal text, data was collected from three fundamentally different source domains: books (5,514 tokens), newspaper articles (5,288 tokens), and Twitter/X posts (5,267 tokens). Books were selected to be as diverse as possible in terms of genres, authors, audiences, time periods, and locations of their stories. News articles and Twitter/X posts were scraped using the Scrapy framework and the Playwright library.
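As a quick check of how evenly the three domains are represented, the token counts quoted above can be turned into shares of the whole corpus. The dictionary keys are illustrative names chosen for this sketch.

```python
# Token counts per domain, as reported in the paragraph above.
token_counts = {"books": 5514, "articles": 5288, "tweets": 5267}

total = sum(token_counts.values())
shares = {domain: count / total for domain, count in token_counts.items()}

for domain, share in shares.items():
    print(f"{domain:<9}{token_counts[domain]:>6} tokens  {share:.1%}")
print(f"{'total':<9}{total:>6} tokens")
```

Each domain contributes roughly a third of the 16,069 tokens, so aggregate scores over all domains are not dominated by any single source.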

Figure: Token sources graph

Collected texts were tokenized using reldi-tokeniser and manually annotated with PER, LOC and ORG labels. The image below shows the total number of named entities annotated across all texts:

Figure: Labeled entities


5. Models

Three types of NER systems were evaluated on the same manually annotated Serbian corpus and their performances were compared across domains and label variants.

5.1. Baseline model (Multinomial Naive Bayes)

The baseline is a simple feature-based token classifier trained with Multinomial Naive Bayes (baseline_model.py). It uses:

  • Local token features (lowercased form, capitalization flags, presence of digits, token position in the sentence).
  • Left context features (lowercased forms and casing information for a fixed number of preceding tokens).
  • Short prefixes and suffixes (1–3 characters) and character trigrams, which help with rich morphology.
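The feature groups listed above can be sketched as a function that builds one feature dictionary per token. This is a minimal illustration, not the code in baseline_model.py: the function name, the feature keys and the context window of two tokens are all assumptions.

```python
def token_features(sentence, i, window=2):
    """Build a feature dict for sentence[i]; `sentence` is a list of token strings."""
    token = sentence[i]
    lower = token.lower()
    feats = {
        # Local token features.
        "lower=" + lower: 1,
        "is_title": int(token.istitle()),
        "is_upper": int(token.isupper()),
        "has_digit": int(any(ch.isdigit() for ch in token)),
        "is_first": int(i == 0),
        "is_last": int(i == len(sentence) - 1),
    }
    # Left context: form and casing of the preceding `window` tokens.
    for k in range(1, window + 1):
        if i - k >= 0:
            prev = sentence[i - k]
            feats[f"prev{k}_lower=" + prev.lower()] = 1
            feats[f"prev{k}_is_title"] = int(prev.istitle())
        else:
            feats[f"prev{k}=<BOS>"] = 1  # padding before the sentence start
    # Prefixes and suffixes of 1 to 3 characters.
    for n in range(1, 4):
        if len(lower) >= n:
            feats[f"prefix{n}=" + lower[:n]] = 1
            feats[f"suffix{n}=" + lower[-n:]] = 1
    # Character trigrams.
    for j in range(len(lower) - 2):
        feats["tri=" + lower[j:j + 3]] = 1
    return feats


sentence = ["Petar", "živi", "u", "Beogradu", "."]
feats = token_features(sentence, 3)
print(sorted(feats))
```

All values are 0 or 1, which keeps the vectors non-negative as Multinomial Naive Bayes requires. The suffix features let inflected forms such as "Beogradu" share evidence with other words carrying the same case ending.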

These features are converted to sparse vectors using DictVectorizer. A separate model is trained in each fold of a 10-fold GroupKFold cross-validation, with sentences used as groups so that tokens from one sentence never appear in both the training and the test set. The model is evaluated in two variants:

  • BIO labels: full BIO tags as they appear in the gold standard.
  • Collapsed labels: entity type only (PER/LOC/ORG vs. O), using the collapse_label helper.
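The difference between the two variants comes down to whether the B-/I- prefix is kept. The helper name collapse_label comes from the project, but the body below is only an assumed, minimal version of what such a helper might do.

```python
def collapse_label(label):
    """Drop the BIO prefix: 'B-PER' and 'I-PER' both become 'PER'; 'O' stays 'O'."""
    if label == "O":
        return "O"
    prefix, sep, entity_type = label.partition("-")
    if sep and prefix in ("B", "I"):
        return entity_type
    return label  # already collapsed or in an unknown format: leave unchanged


bio = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"]
collapsed = [collapse_label(tag) for tag in bio]
print(collapsed)
```

In the collapsed variant a model is no longer penalized for confusing B- and I- tags of the same type, so collapsed scores are typically at least as high as BIO scores.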

For each variant, the code writes per-class precision, recall and F1 for each fold to a CSV file, along with reports aggregated over all tokens and domains.
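The per-fold reports depend on how tokens are assigned to folds. The project uses scikit-learn's GroupKFold for this; the dependency-free stand-in below only illustrates the idea of sentence-grouped splitting. Its name, its greedy balancing strategy and the toy sentence ids are assumptions for this sketch.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=10):
    """Yield (train_idx, test_idx) so that no group is split across both sides."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    if len(by_group) < n_splits:
        raise ValueError("need at least as many groups as folds")
    # Assign the largest groups first, always to the currently smallest fold,
    # which keeps fold sizes roughly balanced.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(by_group, key=lambda g: -len(by_group[g])):
        smallest = min(folds, key=len)
        smallest.extend(by_group[group])
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# One entry per token: the id of the sentence it belongs to (toy data).
sentence_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4]
splits = list(group_k_fold(sentence_ids, n_splits=3))
for train_idx, test_idx in splits:
    print("test tokens:", test_idx)
```

Splitting by sentence matters because the left-context features of one token include its neighbours; a plain token-level split would let fragments of a test sentence leak into training.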

5.2. CLASSLA NER models

The project used two pretrained CLASSLA NER models for Serbian:

  • Standardized Serbian: tuned for standard language.
  • Non-standardized Serbian: tuned for non-standard and social media text.

Both models were applied to:

  • Book texts.
  • Newspaper articles.
  • Twitter/X posts.

For each domain (and for all domains together), evaluation was done in BIO and collapsed modes. The results were written to models/evaluations/ and include the usual classification tables, as well as files listing token-level mismatches between predictions and the gold standard.
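The two kinds of output mentioned above, classification tables and mismatch listings, can be sketched in plain Python. The function names are hypothetical, and excluding the O class from the score table is an assumption of this sketch rather than a documented choice of the project.

```python
from collections import Counter


def per_class_scores(gold, pred):
    """Token-level precision, recall and F1 for every label except 'O'."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in sorted((set(gold) | set(pred)) - {"O"}):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def token_mismatches(tokens, gold, pred):
    """Return (index, token, gold_label, predicted_label) for every disagreement."""
    if not (len(tokens) == len(gold) == len(pred)):
        raise ValueError("tokens, gold and pred must be aligned one-to-one")
    return [
        (i, tok, g, p)
        for i, (tok, g, p) in enumerate(zip(tokens, gold, pred))
        if g != p
    ]


# Toy example with collapsed labels (not project data).
tokens = ["Novak", "Đoković", "igra", "u", "Beogradu", "."]
gold = ["PER", "PER", "O", "O", "LOC", "O"]
pred = ["PER", "O", "O", "O", "LOC", "O"]
scores = per_class_scores(gold, pred)
mismatches = token_mismatches(tokens, gold, pred)
for i, tok, g, p in mismatches:
    print(f"{i}\t{tok}\tgold={g}\tpred={p}")
```

The score table shows how often a model errs; the mismatch list shows where, which is what makes error analysis across domains practical.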

5.3. BERTic (Transformer-based NER)

The third evaluated system was a transformer-based NER model based on BERTic for South Slavic languages, loaded via simpletransformers as an ELECTRA model (classla/bcms-bertic-ner). In this project, it was used as an off-the-shelf model with a predefined label set and maximum sequence length, without additional fine-tuning on the project data.

BERTic was run on the same tokenized gold standard sentences, and its predictions were post-processed so that the labels were mapped into the same tag set as the corpus (including collapsing variants when requested). As was the case with the other models, performance was reported per domain and over all domains together, in both BIO and collapsed setups.
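The label mapping step can be sketched as below, assuming the model emits BIO-style tags and that any entity type outside the corpus tag set should be treated as O. The function name is hypothetical, and "MISC" is only an illustrative extra type, not a statement about the model's actual label set.

```python
# Entity types annotated in the gold-standard corpus.
CORPUS_TYPES = {"PER", "LOC", "ORG"}


def to_corpus_tagset(label, collapse=False):
    """Map a model's BIO label onto the corpus tag set; unknown types become 'O'."""
    if label == "O":
        return "O"
    prefix, _, entity_type = label.partition("-")
    if prefix not in ("B", "I") or entity_type not in CORPUS_TYPES:
        return "O"
    # Keep the BIO prefix unless the collapsed variant was requested.
    return entity_type if collapse else label


model_output = ["B-PER", "I-PER", "O", "B-MISC", "B-LOC"]
bio_tags = [to_corpus_tagset(tag) for tag in model_output]
flat_tags = [to_corpus_tagset(tag, collapse=True) for tag in model_output]
print(bio_tags)
print(flat_tags)
```

Mapping unknown types to O means such predictions never count as errors against PER, LOC or ORG, so an off-the-shelf model is scored only on the classes the corpus actually contains.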


6. Results

The following graph compares the performance of all evaluated NER models at recognizing persons, organizations and locations in Serbian-language texts:

Figure: All domains collapsed
