This project implements a complete Data Engineering and Natural Language Processing (NLP) pipeline for the sentiment analysis of reviews written in Italian.
The system is responsible for extracting reviews from a dataset, processing them to clean the text and normalize the dates, using an advanced AI model (fine-tuned UmBERTo) to infer the sentiment of the text, and finally visualizing the results and key metrics through an interactive analytical dashboard.
The project currently uses the Italian TripAdvisor Dataset available on Kaggle. It contains thousands of authentic reviews written in Italian, providing a rich base for sentiment classification and NLP testing.
The project goes beyond assigning a basic sentiment (Positive, Negative, or Neutral) and also includes advanced quality-control heuristics:
- Dirty Text Handling: Deep cleaning of HTML entities and double encoding (UTF-8) of accented or special characters that often corrupt Italian datasets.
- Local AI Enrichment: Uses a Transformer-based model (`Frabbate/umberto-commoncrawl-cased-sentiment`) executed entirely locally on the CPU, ensuring data control and no costs for external APIs.
- Safety & Consistency Checks: Cross-references the original numerical user ratings (stars) with the model's predictions to identify anomalous behaviors, such as "high rating but negative text" or potential sarcasm ("low rating but positive text").
- Data Visualization: An exploratory platform with pie charts of sentiment distributions, breakdowns of model confidence (score), and highlighting of reviews that require manual attention.
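The consistency check described in the list above can be sketched as a small pure function. This is a minimal illustration, not the project's actual rule: the star thresholds (4 and above as "high", 2 and below as "low" on a 1-5 scale), the flag names, and the lowercase `positive`/`negative` labels are all assumptions.

```python
from typing import Optional

# Assumed thresholds on a 1-5 star scale; the project may use different cut-offs.
HIGH_RATING_MIN = 4
LOW_RATING_MAX = 2


def consistency_flag(stars: int, sentiment: str) -> Optional[str]:
    """Return a flag name when rating and predicted sentiment disagree, else None."""
    label = sentiment.strip().lower()
    if stars >= HIGH_RATING_MIN and label == "negative":
        return "high_rating_negative_text"  # hypothetical flag name
    if stars <= LOW_RATING_MAX and label == "positive":
        return "low_rating_positive_text"   # hypothetical flag name, possible sarcasm
    return None
```

Returning `None` for consistent reviews keeps the flag column sparse, so the dashboard can filter on "flag is not empty" to find reviews that need manual attention.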
The entire application is a logically structured "staged" process:
Phase 1 (ingestion): Raw reviews are first read from a source CSV file and stored in a Landing Zone in the PostgreSQL database (stg_reviews_raw table). In this phase, essential load metadata is added and the data itself is preserved as-is.
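As a rough sketch of the ingestion step, the function below reads CSV rows unchanged and attaches load metadata. The metadata column names (`_source_file`, `_source_row`, `_loaded_at`) and the sample CSV columns are assumptions made for illustration; the real script may use different ones.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Iterator, TextIO


def read_raw_rows(csv_file: TextIO, source_name: str) -> Iterator[dict]:
    """Yield each CSV row untouched, plus hypothetical load-metadata columns."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        yield {
            **row,                        # original columns, preserved as-is
            "_source_file": source_name,  # assumed metadata column
            "_source_row": row_number,    # assumed metadata column
            "_loaded_at": loaded_at,      # assumed metadata column
        }


# Tiny in-memory sample standing in for the real source CSV.
sample = io.StringIO("review_text,stars\nOttimo hotel!,5\n")
rows = list(read_raw_rows(sample, "reviews.csv"))
```

With pandas and SQLAlchemy installed, rows like these could then be appended to the landing table, for example with `pandas.DataFrame(rows).to_sql("stg_reviews_raw", engine, if_exists="append", index=False)`.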
Phase 2 (preprocessing): Raw data is extracted from the database and goes through a cleaning pipeline:
- Unescaping of any remaining HTML entities (e.g., `&agrave;` converted to `à`).
- Normalization and repair of text decoding via the `ftfy` library.
- Parsing of Italian dates (e.g., "26 ottobre 2016" translated to standard `YYYY-MM-DD`).
- Reliability Filter: Discarding comments that are too short (e.g., less than 59 characters) whose semantic validity for sentiment analysis is poor or nil.
The pre-processed data is saved in the stg_reviews_clean table.
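The cleaning steps above can be sketched with the standard library alone. Note the assumptions: `repair_mojibake` is a simplified stand-in for `ftfy.fix_text` that only handles UTF-8 bytes mis-read as Latin-1 (ftfy covers many more cases), and all function names here are hypothetical.

```python
import html
from datetime import date
from typing import Optional

ITALIAN_MONTHS = {
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}
MIN_CHARS = 59  # example threshold quoted in the list above


def repair_mojibake(text: str) -> str:
    """Stdlib stand-in for ftfy: undo UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # text was already fine, or has a different kind of damage


def clean_text(raw: str) -> str:
    """Unescape HTML entities, then try to repair double-encoded characters."""
    return repair_mojibake(html.unescape(raw)).strip()


def parse_italian_date(raw: str) -> Optional[str]:
    """Turn '26 ottobre 2016' into '2016-10-26'; return None if unparseable."""
    parts = raw.strip().lower().split()
    if len(parts) != 3 or parts[1] not in ITALIAN_MONTHS:
        return None
    try:
        return date(int(parts[2]), ITALIAN_MONTHS[parts[1]], int(parts[0])).isoformat()
    except ValueError:
        return None  # non-numeric day/year, or an impossible calendar date


def is_reliable(text: str) -> bool:
    """Keep only reviews long enough to carry a usable sentiment signal."""
    return len(text) >= MIN_CHARS
```

Returning `None` for unparseable dates, instead of raising, lets the pipeline keep the review and leave the date column empty.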
Phase 3 (AI enrichment) is the NLP heart of the project:
- Clean records are passed in batches to UmBERTo, an Italian language model fine-tuned for Sequence Classification.
- Predictions and confidence levels (`sentiment_score`) are integrated into the data.
- The Consistency business logic comes into play, where logical stars-text discrepancies are flagged.
The final, "enriched" data is sent to the target analytical table slv_reviews_ai_enriched_finetuned.
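A minimal sketch of the batched enrichment loop follows. The classifier is passed in as a plain callable so the batching logic can be exercised without downloading the model. The record keys `text` and `sentiment`, the batch size, and the helper names are assumptions; `sentiment_score` is the field named above. The label strings depend on the model's own configuration and are passed through unchanged.

```python
from typing import Callable, Iterator, List

# A classifier maps a list of texts to a list of {"label": ..., "score": ...} dicts.
Classifier = Callable[[List[str]], List[dict]]


def batched(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich(records: List[dict], classify: Classifier, batch_size: int = 16) -> List[dict]:
    """Attach the predicted label and confidence to each record, batch by batch."""
    enriched = []
    for batch in batched(records, batch_size):
        predictions = classify([record["text"] for record in batch])
        for record, prediction in zip(batch, predictions):
            enriched.append({
                **record,
                "sentiment": prediction["label"],        # assumed column name
                "sentiment_score": prediction["score"],
            })
    return enriched


def load_classifier() -> Classifier:
    """Build a CPU classifier with Hugging Face transformers (downloads the model)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    clf = pipeline(
        "text-classification",
        model="Frabbate/umberto-commoncrawl-cased-sentiment",
        device=-1,  # -1 selects the CPU
    )
    return lambda texts: clf(texts, truncation=True)
```

A typical call would be `enrich(clean_records, load_classifier())`; in tests, any function with the same shape can replace the real model.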
Phase 4 (dashboard): An interactive web app developed in Streamlit connects to the database and allows agile navigation through all results. It provides headline KPIs, Plotly charts that highlight sentiment proportions, filters for "Safety" (content safety), and a detailed table for inspecting the model's output.
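The dashboard's KPI layer might look roughly like the sketch below. It assumes each row is a dict with `sentiment`, `sentiment_score`, and an optional `consistency_flag` key; these names, the KPI selection, and the layout are illustrative guesses, not the actual `dashboard.py`.

```python
from collections import Counter
from typing import List


def compute_kpis(rows: List[dict]) -> dict:
    """Aggregate the headline numbers shown at the top of the dashboard."""
    total = len(rows)
    by_sentiment = Counter(row["sentiment"] for row in rows)
    flagged = sum(1 for row in rows if row.get("consistency_flag"))
    avg_confidence = sum(row["sentiment_score"] for row in rows) / total if total else 0.0
    return {
        "total": total,
        "by_sentiment": dict(by_sentiment),
        "flagged": flagged,
        "avg_confidence": avg_confidence,
    }


def render(rows: List[dict]) -> None:
    """Draw KPIs, a sentiment pie chart, and the detail table in Streamlit."""
    import plotly.express as px  # imported lazily so compute_kpis has no UI dependency
    import streamlit as st

    kpis = compute_kpis(rows)
    col_total, col_flagged, col_conf = st.columns(3)
    col_total.metric("Reviews", kpis["total"])
    col_flagged.metric("Flagged for review", kpis["flagged"])
    col_conf.metric("Mean confidence", f"{kpis['avg_confidence']:.2f}")

    sentiments = kpis["by_sentiment"]
    st.plotly_chart(px.pie(names=list(sentiments), values=list(sentiments.values())))
    st.dataframe(rows)
```

Keeping the aggregation in a plain function separates the numbers from the UI, so they can be checked without starting Streamlit.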
To correctly launch the pipeline, you need a local Postgres instance and the Python ecosystem.
- Starting the Database:
The DB runs in Docker using the provided `docker-compose.yml`. Start it with `docker compose up -d`.
- Python Virtual Environment:
Always use the virtual environment to isolate packages. Using `uv` is recommended. Activate the virtual environment with `source .venv/bin/activate`, then install the dependencies with `uv pip install pandas sqlalchemy psycopg python-dotenv transformers torch ftfy streamlit plotly tqdm`.
- Environment Variables:
Make sure to fill out a `.env` file containing the correct PostgreSQL credentials (`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`).
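One way the scripts could turn these variables into a SQLAlchemy connection URL is sketched below. The three credential variables come from the list above; `POSTGRES_HOST`, `POSTGRES_PORT`, their defaults, and the function name are assumptions for a local Docker setup.

```python
import os
from typing import Mapping
from urllib.parse import quote_plus


def build_database_url(env: Mapping[str, str] = os.environ) -> str:
    """Assemble a SQLAlchemy URL for the psycopg (v3) driver from env variables."""
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])  # escape characters such as '@'
    database = env["POSTGRES_DB"]
    host = env.get("POSTGRES_HOST", "localhost")  # assumed variable and default
    port = env.get("POSTGRES_PORT", "5432")       # assumed variable and default
    return f"postgresql+psycopg://{user}:{password}@{host}:{port}/{database}"
```

Calling `load_dotenv()` from the `python-dotenv` package before this function copies the `.env` values into `os.environ`; the resulting string can be passed to `sqlalchemy.create_engine`.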
All processes must be executed sequentially to rebuild the pipeline end-to-end:

    # Phase 1
    python ingestion.py
    # Phase 2
    python preprocessing.py
    # Phase 3 (The model will be downloaded in the background during the first run)
    python ai_enrichment.py
    # Phase 4 (Starts the Streamlit dashboard)
    streamlit run dashboard.py

To take the NLP analysis to the next level, the following continuous improvements are planned for the pipeline:
- Aspect-Based Topic Modeling & Keyword Extraction: Instead of just knowing whether a review is negative, the model will extract why. By integrating libraries like `KeyBERT` or specialized `spaCy` text clustering, the system will automatically group reviews into topics (e.g., "shipping delays," "customer service," "broken items"), providing precise, actionable insights on the dashboard.
- Emotion Detection (Beyond Polarity): Because sentiment is only one facet of feedback, a secondary Italian-optimized NLP model (such as `MilaNLProc/feel-it-italian-emotion`) will be added to the enrichment phase. Segmenting the output by discrete emotions (Anger, Joy, Fear, Sadness) will make it possible to immediately distinguish mildly disappointed clients from completely furious ones.
- Data Pipeline Orchestration (Airflow & dbt): The architecture will move from standalone sequential Python scripts to a fully integrated, scheduled pipeline. Apache Airflow will handle orchestration and dbt will handle in-database SQL transformations, with the goal of production-grade reliability and scalability.
- Multi-Platform Web Scraping Integration: The data ingestion phase will be extended with custom web scrapers. This will allow the pipeline to continuously fetch fresh reviews from external platforms like Amazon, Trustpilot, and App Stores, providing a unified, near-real-time sentiment overview.
