
stormtroober/Italian_Reviews_SentimentAnalysis


Italian Reviews Sentiment Analysis

NLP Sentiment Analysis Dashboard

This project implements a complete Data Engineering and Natural Language Processing (NLP) pipeline for the sentiment analysis of reviews written in Italian.

The system extracts reviews from a dataset, cleans the text and normalizes the dates, infers the sentiment of each review with a fine-tuned UmBERTo model, and visualizes the results and key metrics in an interactive analytical dashboard.

Dataset

The project currently uses the Italian TripAdvisor Dataset available on Kaggle. It contains thousands of authentic reviews written in Italian, providing a rich base for sentiment classification and NLP testing.

Objectives and Main Features

The project focuses not only on the basic assignment of a sentiment (Positive, Negative, or Neutral), but also includes advanced quality-control heuristics:

  1. Dirty Text Handling: Deep cleaning of HTML entities and of double-encoded UTF-8 sequences in accented or special characters, which often corrupt Italian datasets.
  2. Local AI Enrichment: Uses a Transformer-based model (Frabbate/umberto-commoncrawl-cased-sentiment) executed entirely locally on the CPU, ensuring data control and no costs for external APIs.
  3. Safety & Consistency Checks: Cross-references the original numerical user ratings (stars) with the model's predictions to identify anomalous behaviors, such as "high rating but negative text" or potential sarcasm ("low rating but positive text").
  4. Data Visualization: An exploratory dashboard with pie charts of sentiment distributions, breakdowns of model confidence (score), and a view of the reviews that require manual attention.

Data Pipeline Workflow

The application runs as a sequence of four stages:

1. Data Ingestion (ingestion.py)

Raw reviews are first read from a source CSV file and stored in a Landing Zone in the PostgreSQL database (stg_reviews_raw table). In this phase the review data is stored as-is, and only essential load-related metadata is added.
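
The landing-zone idea can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and the column names (review_text, rating, review_date) and metadata fields (source_file, loaded_at) are hypothetical, not taken from ingestion.py.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone


def ingest_raw(csv_file, conn, source_name):
    """Load CSV rows unchanged into a landing table, adding load metadata."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS stg_reviews_raw (
            review_text TEXT,  -- hypothetical column name
            rating      TEXT,  -- kept as raw text, no casting at this stage
            review_date TEXT,  -- e.g. '26 ottobre 2016', parsed in a later stage
            source_file TEXT,  -- assumed load metadata
            loaded_at   TEXT   -- assumed load metadata
        )
        """
    )
    loaded_at = datetime.now(timezone.utc).isoformat()
    rows = [
        (r["review_text"], r["rating"], r["review_date"], source_name, loaded_at)
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO stg_reviews_raw VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Usage with an in-memory CSV and database (sqlite3 stands in for PostgreSQL)
sample = io.StringIO(
    "review_text,rating,review_date\n"
    '"Ottimo ristorante, torneremo!",5,26 ottobre 2016\n'
)
conn = sqlite3.connect(":memory:")
print(ingest_raw(sample, conn, "reviews.csv"))
```

Keeping every column as text at this stage means a malformed rating or date cannot make the load fail; validation is deferred to the cleaning stage.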

2. Preprocessing and Cleaning (preprocessing.py)

Raw data is extracted from the database and goes through a cleaning pipeline:

  • Unescaping of any remaining HTML entities (e.g., &agrave; converted to à).
  • Repair of incorrectly decoded text (mojibake) via the ftfy library.
  • Parsing of Italian dates (e.g., "26 ottobre 2016" converted to the standard YYYY-MM-DD format).
  • Reliability Filter: Discarding comments that are too short (e.g., fewer than 59 characters) to carry a reliable sentiment signal.

The pre-processed data is saved in the stg_reviews_clean table.
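
A minimal sketch of these cleaning steps, assuming nothing about the actual contents of preprocessing.py: the function names, the month table, and the MIN_CHARS constant are hypothetical. html.unescape and ftfy.fix_text are real library calls; ftfy is imported optionally so the sketch still runs where it is not installed.

```python
import html
from datetime import date

try:
    import ftfy  # third-party; fix_text repairs mojibake in decoded text
except ImportError:  # keep the sketch runnable without ftfy
    ftfy = None

ITALIAN_MONTHS = {
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}
MIN_CHARS = 59  # threshold taken from the example in the text above


def clean_text(raw):
    """Unescape HTML entities, then repair encoding problems if ftfy is available."""
    text = html.unescape(raw)
    if ftfy is not None:
        text = ftfy.fix_text(text)
    return text.strip()


def parse_italian_date(value):
    """Convert a date such as '26 ottobre 2016' to 'YYYY-MM-DD'."""
    day, month_name, year = value.strip().lower().split()
    return date(int(year), ITALIAN_MONTHS[month_name], int(day)).isoformat()


def is_reliable(text, min_chars=MIN_CHARS):
    """Keep only reviews long enough to be worth classifying."""
    return len(text) >= min_chars


print(clean_text("Citt&agrave; bellissima"))
print(parse_italian_date("26 ottobre 2016"))
```

parse_italian_date raises ValueError or KeyError on input it does not recognize; a real pipeline would need to decide whether to drop such rows or store a null date.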

3. AI Enrichment (ai_enrichment.py)

This is the NLP heart of the project:

  • Clean records are passed in batches to UmBERTo, an Italian language model fine-tuned for Sequence Classification.
  • Predictions and confidence levels (sentiment_score) are integrated into the data.
  • The consistency business logic then flags discrepancies between the star rating and the sentiment predicted from the text.

The final, "enriched" data is sent to the target analytical table slv_reviews_ai_enriched_finetuned.
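
The stars-versus-text consistency check could look roughly like this. The thresholds (4 stars and above counts as a high rating, 2 and below as a low one), the flag names, and the lowercase label strings are all assumptions made for illustration; the labels actually emitted by the model may be named differently.

```python
def consistency_flag(stars, sentiment):
    """Return a flag name when rating and predicted sentiment disagree, else None."""
    label = sentiment.strip().lower()
    if stars >= 4 and label == "negative":   # assumed threshold for "high rating"
        return "high_rating_negative_text"
    if stars <= 2 and label == "positive":   # assumed threshold for "low rating"
        return "low_rating_positive_text"    # possible sarcasm
    return None


reviews = [
    {"stars": 5, "sentiment": "Negative"},
    {"stars": 1, "sentiment": "Positive"},
    {"stars": 4, "sentiment": "Positive"},
]
for review in reviews:
    review["flag"] = consistency_flag(review["stars"], review["sentiment"])
    print(review)
```

Three-star reviews and neutral predictions are never flagged in this sketch, since neither side makes a strong enough claim to contradict the other.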

4. Interactive Visualization (dashboard.py)

An interactive Web App developed in Streamlit connects to the database and allows for agile navigation through all results. It provides direct KPIs, Plotly charts to highlight sentiment proportions, filters for "Safety" (content safety), and allows inspection of the detailed table containing the model's output.
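
As a rough illustration of the numbers behind the KPIs and the sentiment pie chart, the aggregation can be written in plain Python. The row layout (dicts with sentiment and sentiment_score keys) and the function name are assumptions; dashboard.py may well compute these with pandas or SQL instead.

```python
from collections import Counter, defaultdict


def sentiment_kpis(rows):
    """Per-label count, share of all reviews, and mean model confidence."""
    counts = Counter(row["sentiment"] for row in rows)
    total = sum(counts.values())
    scores = defaultdict(list)
    for row in rows:
        scores[row["sentiment"]].append(row["sentiment_score"])
    return {
        label: {
            "count": n,
            "share": n / total,
            "mean_score": sum(scores[label]) / n,
        }
        for label, n in counts.items()
    }


rows = [
    {"sentiment": "Positive", "sentiment_score": 1.0},
    {"sentiment": "Positive", "sentiment_score": 0.5},
    {"sentiment": "Negative", "sentiment_score": 0.75},
    {"sentiment": "Neutral", "sentiment_score": 0.5},
]
for label, kpi in sentiment_kpis(rows).items():
    print(label, kpi)
```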

Requirements and Environment Setup

To correctly launch the pipeline, you need a local Postgres instance and the Python ecosystem.

  1. Starting the Database: The DB relies on Docker using the provided docker-compose.yml.
    docker compose up -d
  2. Python Virtual Environment: Always use the virtual environment to isolate packages. Using uv is recommended.
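    # Create the virtual environment (first time only)
    uv venv

    # Activate the virtual environment
    source .venv/bin/activate
    
    # Install dependencies (the PyPI package that provides the dotenv module is python-dotenv)
    uv pip install pandas sqlalchemy psycopg python-dotenv transformers torch ftfy streamlit plotly tqdm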
  3. Environment Variables: Create a .env file containing the correct PostgreSQL credentials (POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB).
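
One way these three variables might be turned into a database URL is sketched below. The function name, the localhost host, and port 5432 are assumptions, as is the postgresql+psycopg scheme (the SQLAlchemy dialect name for psycopg 3); check them against docker-compose.yml and the project's own connection code. In the real scripts the variables would typically be loaded from .env with python-dotenv's load_dotenv() before being read.

```python
import os
from urllib.parse import quote_plus

REQUIRED = ("POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DB")


def postgres_url(env=os.environ, host="localhost", port=5432):
    """Build a SQLAlchemy-style URL from the three required variables."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    # Percent-encode credentials so characters such as '@' do not break the URL
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])
    database = quote_plus(env["POSTGRES_DB"])
    return f"postgresql+psycopg://{user}:{password}@{host}:{port}/{database}"


# Usage with an explicit mapping instead of the real environment
example_env = {
    "POSTGRES_USER": "reviews",
    "POSTGRES_PASSWORD": "p@ss",
    "POSTGRES_DB": "reviews_db",
}
print(postgres_url(example_env))
```

Failing early with the list of missing names is usually easier to debug than a connection error raised later by the database driver.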

Pipeline Execution

All processes must be executed sequentially to rebuild the pipeline end-to-end:

# Phase 1
python ingestion.py

# Phase 2
python preprocessing.py

# Phase 3 (the model is downloaded automatically on the first run, which can take a while)
python ai_enrichment.py

# Phase 4 (Starts the Streamlit dashboard)
streamlit run dashboard.py
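
If running the three batch phases by hand becomes tedious, a small wrapper along these lines could run them in order and stop at the first failure. It is a hypothetical helper, not part of the repository; the dashboard is left out because streamlit run starts a long-lived server.

```python
import subprocess
import sys

# Batch phases in execution order; file names are the ones listed above
PHASES = ["ingestion.py", "preprocessing.py", "ai_enrichment.py"]


def run_pipeline(scripts=PHASES, runner=subprocess.run):
    """Run each script with the current interpreter, stopping on the first error."""
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later phases never run on stale data
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed


# To execute the real scripts from the repository root:
#     run_pipeline()
# followed, in a shell, by:
#     streamlit run dashboard.py
```

The runner parameter exists only so the function can be exercised without launching real processes, for example in a unit test.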

Future Developments (Work in Progress)

The following improvements are planned to extend the NLP analysis and the pipeline:

  1. Aspect-Based Topic Modeling & Keyword Extraction: Instead of just knowing if a review is negative, the model will extract why. By integrating libraries like KeyBERT or specialized spaCy text clustering, the system will automatically group reviews into topics (e.g., "shipping delays," "customer service," "broken items"), providing precise, actionable insights on the dashboard.

  2. Emotion Detection (Beyond Polarity): Recognizing that sentiment is only one facet of feedback, a secondary Italian-optimized NLP model (such as MilaNLProc/feel-it-italian-emotion) will be added to the enrichment phase. Segmenting output by discrete emotions (Anger, Joy, Fear, Sadness) will make it possible to distinguish mildly disappointed clients from completely furious ones.

  3. Data Pipeline Orchestration (Airflow & dbt): Transitioning the architecture from standalone sequential Python scripts to a fully integrated and scheduled pipeline. Apache Airflow will handle orchestration and dbt will handle in-database SQL transformations, with the goal of production-grade reliability and scalability.

  4. Multi-Platform Web Scraping Integration: Extending the data ingestion phase by deploying custom web scrapers. This will allow the pipeline to continuously fetch fresh reviews from external platforms like Amazon, Trustpilot, and App Stores, providing a unified and near-real-time sentiment overview.

