Italian Reviews Sentiment Analysis

This project implements a complete Data Engineering and Natural Language Processing (NLP) pipeline for the sentiment analysis of reviews written in Italian.

The system extracts reviews from a dataset, cleans the text and normalizes the dates, infers the sentiment of each review with a fine-tuned UmBERTo model, and visualizes the results and key metrics in an interactive analytical dashboard.

Dataset

The project currently uses the Italian TripAdvisor Dataset available on Kaggle. It contains thousands of authentic reviews written in Italian, providing a rich base for sentiment classification and NLP testing.

Objectives and Main Features

The project goes beyond the basic assignment of a sentiment (Positive, Negative, or Neutral) and also includes advanced quality-control heuristics:

Dirty Text Handling: Deep cleaning of HTML entities and of double-encoded UTF-8 sequences (mojibake) in accented or special characters, which often corrupt Italian datasets.
Local AI Enrichment: Uses a Transformer-based model (Frabbate/umberto-commoncrawl-cased-sentiment) executed entirely locally on the CPU, ensuring data control and no costs for external APIs.
Safety & Consistency Checks: Cross-references the original numerical user ratings (stars) with the model's predictions to identify anomalous behaviors, such as "high rating but negative text" or potential sarcasm ("low rating but positive text").
Data Visualization: An exploratory dashboard with pie charts of sentiment distributions, breakdowns of model confidence (score), and a view that highlights reviews requiring manual attention.

Data Pipeline Workflow

The entire application is a logically structured "staged" process:

1. Data Ingestion (`ingestion.py`)

Raw reviews are first read from a source CSV file and stored in a Landing Zone in the PostgreSQL database (stg_reviews_raw table). In this phase, essential loading-related metadata is added, preserving the data as-is.
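The landing step can be sketched as follows. This is a minimal illustration using only the standard library, not the actual `ingestion.py`: the metadata column names (`source_file`, `loaded_at`) and the sample CSV columns are assumptions, and the write to `stg_reviews_raw` is left out.

```python
import csv
import io
from datetime import datetime, timezone


def read_raw_reviews(csv_file, source_name):
    """Read CSV rows unchanged and attach load metadata to each one.

    `source_file` and `loaded_at` are hypothetical metadata column names.
    """
    loaded_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for row in csv.DictReader(csv_file):
        record = dict(row)  # keep every original column as-is
        record["source_file"] = source_name
        record["loaded_at"] = loaded_at
        rows.append(record)
    return rows


# Usage with an in-memory CSV standing in for the real dataset file.
sample = io.StringIO("rating,text\n5,Ottimo ristorante!\n1,Pessimo servizio.\n")
raw_rows = read_raw_reviews(sample, "reviews.csv")
print(raw_rows[0]["text"], raw_rows[0]["source_file"])
```

Values are kept as strings on purpose, since this stage is meant to preserve the data as-is; type conversion belongs to the cleaning stage.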

2. Preprocessing and Cleaning (`preprocessing.py`)

Raw data is extracted from the database and goes through a cleaning pipeline:

Unescaping of any remaining HTML entities (e.g., `&agrave;` converted to à).
Normalization and repair of text decoding via the ftfy library.
Parsing of Italian dates (e.g., "26 ottobre 2016" translated to standard YYYY-MM-DD).
Reliability Filter: Discarding comments that are too short (e.g., fewer than 59 characters), since their semantic value for sentiment analysis is poor or nil.

The pre-processed data is saved in the stg_reviews_clean table.
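The cleaning steps listed above might look roughly like this. It is a sketch under stated assumptions rather than the real `preprocessing.py`: the function names are hypothetical, and `repair_mojibake` is a standard-library stand-in for ftfy that only handles strings mis-decoded as Latin-1 throughout (ftfy covers many more cases, including mixed ones).

```python
import html
import re
from datetime import date

# Italian month names mapped to month numbers.
ITALIAN_MONTHS = {
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}

MIN_LENGTH = 59  # example threshold quoted in the list above


def repair_mojibake(text):
    """Undo one round of UTF-8 text mis-decoded as Latin-1.

    Stand-in for ftfy; returns the text unchanged if the round trip fails.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean_text(text):
    """Unescape HTML entities first, then repair the encoding."""
    return repair_mojibake(html.unescape(text))


def parse_italian_date(value):
    """Convert e.g. '26 ottobre 2016' to '2016-10-26'; None if unparseable."""
    match = re.fullmatch(r"\s*(\d{1,2})\s+([a-z]+)\s+(\d{4})\s*", value.lower())
    if not match:
        return None
    day, month_name, year = match.groups()
    month = ITALIAN_MONTHS.get(month_name)
    if month is None:
        return None
    try:
        return date(int(year), month, int(day)).isoformat()
    except ValueError:  # impossible calendar date, e.g. "31 febbraio 2016"
        return None


def is_reliable(text, min_length=MIN_LENGTH):
    """Keep only reviews long enough to carry a usable sentiment signal."""
    return len(text) >= min_length


# "\u00c3\u00a8" is how a UTF-8 "è" looks after being mis-decoded as Latin-1.
print(clean_text("caff\u00c3\u00a8 ottimo"))
print(clean_text("Citt&agrave; bellissima"))
print(parse_italian_date("26 ottobre 2016"))  # 2016-10-26
```

Unescaping runs before the encoding repair so that mojibake hidden behind entities is also caught.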

3. AI Enrichment (`ai_enrichment.py`)

This is the NLP heart of the project:

Clean records are passed in batches to UmBERTo, an Italian language model fine-tuned for Sequence Classification.
Predictions and confidence levels (sentiment_score) are integrated into the data.
The consistency business logic then flags discrepancies between the star rating and the sentiment of the text.
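The batching described above can be sketched independently of the model. In this illustration, `classify_in_batches` and `fake_classifier` are hypothetical names, the batch size is arbitrary, and the labels are placeholders, since the label names emitted by the fine-tuned model are not stated here.

```python
def classify_in_batches(texts, classifier, batch_size=16):
    """Run `classifier` over `texts` in fixed-size batches.

    `classifier` is any callable that takes a list of strings and returns
    one {"label": ..., "score": ...} dict per string, which is the shape a
    Hugging Face text-classification pipeline returns.
    """
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for text, prediction in zip(batch, classifier(batch)):
            results.append({
                "text": text,
                "sentiment": prediction["label"],
                "sentiment_score": prediction["score"],
            })
    return results


def fake_classifier(batch):
    """Stand-in model so the sketch runs without downloading UmBERTo."""
    return [
        {"label": "NEGATIVE" if "pessimo" in text.lower() else "POSITIVE",
         "score": 0.9}
        for text in batch
    ]


enriched = classify_in_batches(
    ["Ottimo ristorante", "Servizio pessimo", "Camera pulita"],
    fake_classifier,
    batch_size=2,
)
print([row["sentiment"] for row in enriched])
```

With the real model, the callable could be created with `transformers.pipeline("text-classification", model="Frabbate/umberto-commoncrawl-cased-sentiment", device=-1)`, where `device=-1` selects the CPU.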

The final, "enriched" data is sent to the target analytical table slv_reviews_ai_enriched_finetuned.
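As an illustration of the stars-versus-text check, the sketch below flags the two anomalies mentioned earlier. The thresholds (4 or more stars counts as high, 2 or fewer as low), the flag names, and the function name are assumptions, not values taken from `ai_enrichment.py`.

```python
def consistency_flag(stars, sentiment, high=4, low=2):
    """Return a flag describing a stars/text mismatch, or None if consistent.

    Thresholds and flag names are illustrative assumptions.
    """
    label = sentiment.lower()
    if stars >= high and label == "negative":
        return "high_rating_negative_text"
    if stars <= low and label == "positive":
        return "low_rating_positive_text"  # possible sarcasm
    return None


print(consistency_flag(5, "Negative"))  # high_rating_negative_text
print(consistency_flag(1, "Positive"))  # low_rating_positive_text
print(consistency_flag(3, "Neutral"))   # None
```

Returning a named flag rather than a boolean keeps the two anomaly types distinguishable when filtering later.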

4. Interactive Visualization (`dashboard.py`)

An interactive web app built with Streamlit connects to the database and lets you browse all results. It provides headline KPIs, Plotly charts of sentiment proportions, "Safety" (content safety) filters, and a detailed table of the model's output.
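The headline numbers behind such a dashboard can be computed with a few lines of plain Python. This sketch assumes each enriched row is a dict with the hypothetical keys `sentiment`, `sentiment_score`, and `flag`; the Streamlit and Plotly rendering is omitted.

```python
from collections import Counter


def sentiment_kpis(rows):
    """Summarize enriched rows into dashboard-style KPIs.

    Each row is a dict with hypothetical keys `sentiment`,
    `sentiment_score` and `flag` (None when stars and text agree).
    """
    total = len(rows)
    if total == 0:
        return {"total_reviews": 0, "sentiment_share": {},
                "mean_score": None, "flagged": 0}
    counts = Counter(row["sentiment"] for row in rows)
    return {
        "total_reviews": total,
        # Fraction of reviews per sentiment label (pie chart input).
        "sentiment_share": {label: n / total for label, n in counts.items()},
        "mean_score": sum(row["sentiment_score"] for row in rows) / total,
        # Reviews that need manual attention.
        "flagged": sum(1 for row in rows if row["flag"] is not None),
    }


rows = [
    {"sentiment": "Positive", "sentiment_score": 1.0, "flag": None},
    {"sentiment": "Positive", "sentiment_score": 0.5, "flag": "low_rating_positive_text"},
    {"sentiment": "Negative", "sentiment_score": 1.0, "flag": None},
    {"sentiment": "Neutral", "sentiment_score": 0.5, "flag": None},
]
kpis = sentiment_kpis(rows)
print(kpis["sentiment_share"])
```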

Requirements and Environment Setup

To correctly launch the pipeline, you need a local Postgres instance and the Python ecosystem.

Starting the Database: The database runs in Docker via the provided docker-compose.yml.
```
docker compose up -d
```

Python Virtual Environment: Always use the virtual environment to isolate packages. Using uv is recommended.

```
# Activate the virtual environment
source .venv/bin/activate

# Install dependencies
uv pip install pandas sqlalchemy psycopg python-dotenv transformers torch ftfy streamlit plotly tqdm
```

Environment Variables: Make sure to fill out an .env file containing the correct PostgreSQL credentials (POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB).
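One way these variables could be turned into a connection string is sketched below. It assumes the values are already loaded into the environment and that SQLAlchemy is used with the psycopg driver; `POSTGRES_HOST` and `POSTGRES_PORT` are hypothetical extras with local defaults, and the function name is illustrative.

```python
import os
from urllib.parse import quote


def build_database_url(env=os.environ):
    """Build a SQLAlchemy-style PostgreSQL URL from environment variables.

    POSTGRES_HOST and POSTGRES_PORT are hypothetical optional variables;
    only the other three names come from the text above.
    """
    # Percent-encode credentials so characters like "@" or spaces are safe.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = env["POSTGRES_DB"]
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    return f"postgresql+psycopg://{user}:{password}@{host}:{port}/{database}"


# Example values only; never commit real credentials.
example_env = {
    "POSTGRES_USER": "reviews",
    "POSTGRES_PASSWORD": "p@ss word",
    "POSTGRES_DB": "reviews_db",
}
db_url = build_database_url(example_env)
print(db_url)
```

A missing required variable raises `KeyError`, which surfaces a misconfigured .env file before any connection attempt is made.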

Pipeline Execution

All processes must be executed sequentially to rebuild the pipeline end-to-end:

```
# Phase 1
python ingestion.py

# Phase 2
python preprocessing.py

# Phase 3 (The model will be downloaded in the background during the first run)
python ai_enrichment.py

# Phase 4 (Starts the Streamlit dashboard)
streamlit run dashboard.py
```
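Since the phases depend on each other, a small runner that stops at the first failure can replace typing the commands by hand. This is a hypothetical helper, not part of the repository; the demo uses trivial commands so it runs anywhere.

```python
import subprocess
import sys

# The three batch phases in order; the dashboard is left out because
# `streamlit run` keeps running until it is stopped.
PIPELINE_STAGES = [
    [sys.executable, "ingestion.py"],
    [sys.executable, "preprocessing.py"],
    [sys.executable, "ai_enrichment.py"],
]


def run_stages(stages):
    """Run each command in order and stop at the first failure.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for command in stages:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"Stage failed: {' '.join(command)}", file=sys.stderr)
            break
        completed.append(command)
    return completed


# Demo with trivial commands: the second one fails, so the third never runs.
demo = [
    [sys.executable, "-c", "print('stage 1 ok')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],
    [sys.executable, "-c", "print('never reached')"],
]
finished = run_stages(demo)
print(len(finished))  # 1
```

Calling `run_stages(PIPELINE_STAGES)` from the project root would run the real phases in the order given above.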

Future Developments (Work in Progress)

To take the NLP analysis to the next level, the following continuous improvements are planned for the pipeline:

Aspect-Based Topic Modeling & Keyword Extraction: Instead of just knowing if a review is negative, the model will extract why. By integrating libraries like KeyBERT or specialized spaCy text clustering, the system will automatically group reviews into topics (e.g., "shipping delays," "customer service," "broken items"), providing precise actionable insights on the dashboard.
Emotion Detection (Beyond Polarity): Recognizing that sentiment is only one facet of feedback, a secondary Italian-optimized NLP model (such as MilaNLProc/feel-it-italian-emotion) will be added to the enrichment phase. Segmenting output by discrete psychological feelings (Anger, Joy, Fear, Sadness) will let us immediately distinguish mildly disappointed clients from completely furious ones.
Data Pipeline Orchestration (Airflow & dbt): Transitioning the architecture from standalone sequential Python scripts to a fully integrated and scheduled pipeline. By introducing Apache Airflow for orchestration and dbt for in-database SQL transformations, the system will achieve production-grade reliability and scalability.
Multi-Platform Web Scraping Integration: Extending the data ingestion phase by deploying custom web scrapers. This will allow the pipeline to continuously fetch fresh reviews from external platforms like Amazon, Trustpilot, and App Stores, providing a unified and real-time sentiment overview.
