This repository contains an NLP pipeline designed to distinguish between real and fake news headlines using various machine learning architectures, ranging from traditional linear models to modern Transformer-based fine-tuning.
The objective was to build a robust binary classifier capable of identifying "Fake News" based solely on textual headlines.
While initial experiments focused on Scikit-Learn classifiers, the project was later extended with Transformer-based transfer learning. Fine-tuning DistilBERT (a distilled version of BERT) reached 98.04% accuracy, compared with 94.17% for the best traditional model (LinearSVC), most likely because it captures deeper semantic relationships in the headlines.
| Model | Vectorizer / Architecture | Accuracy | F1-Score / Loss |
|---|---|---|---|
| DistilBERT (Fine-tuned) | AutoTokenizer (Transformer) | 98.04% | 0.0706 (Loss) |
| LinearSVC | TfidfVectorizer | 94.17% | 0.9416 |
| Logistic Regression | TfidfVectorizer | 94.14% | 0.9413 |
| XGBClassifier | TfidfVectorizer | 93.30% | 0.9328 |
| Multinomial NB | CountVectorizer | 93.10% | 0.9308 |
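Accuracy differences near the top of the scale are easier to judge as error rates. The short sketch below recomputes them from the table; it assumes all models were scored on the same evaluation split, which the table does not state, and the helper names are illustrative only.

```python
# Accuracy figures (in percent) copied from the results table above.
accuracies = {
    "DistilBERT (Fine-tuned)": 98.04,
    "LinearSVC": 94.17,
    "Logistic Regression": 94.14,
    "XGBClassifier": 93.30,
    "Multinomial NB": 93.10,
}


def error_rate(accuracy_pct: float) -> float:
    """Return the error rate in percent for an accuracy given in percent."""
    return 100.0 - accuracy_pct


def relative_error_reduction(baseline_pct: float, improved_pct: float) -> float:
    """Fraction of the baseline's errors that the improved model removes."""
    baseline_err = error_rate(baseline_pct)
    improved_err = error_rate(improved_pct)
    return (baseline_err - improved_err) / baseline_err


best_linear = accuracies["LinearSVC"]
distilbert = accuracies["DistilBERT (Fine-tuned)"]
reduction = relative_error_reduction(best_linear, distilbert)

print(f"LinearSVC error:  {error_rate(best_linear):.2f}%")
print(f"DistilBERT error: {error_rate(distilbert):.2f}%")
print(f"Relative error reduction: {reduction:.1%}")
```

Under that assumption, the fine-tuned model makes roughly one third as many mistakes as the best linear model, which is a larger gap than the four-point accuracy difference suggests at first glance.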
Key Takeaways:
- Transformers vs. Linear Models: The jump from roughly 94% to 98% accuracy suggests that contextual embeddings help on this task. Unlike TF-IDF features, which largely discard word order, DistilBERT takes word order and context into account.
- Efficiency: Using `distilbert-base-uncased` allowed for high performance with a smaller memory footprint compared to full BERT, completing 3 epochs with an evaluation speed of ~9182 samples/sec.
- Modern NLP Pipeline: The latest iteration utilizes the Hugging Face ecosystem (`Transformers` library), implementing `DataCollatorWithPadding` for efficient dynamic batching during training.
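To make the dynamic batching point concrete, here is a plain-Python illustration of the idea: each batch is padded only to the length of its own longest sequence rather than to a global maximum. This is not the Hugging Face implementation, only a sketch of the concept; the token ids, the padding id of 0, and the fixed length of 16 are assumptions chosen for the example.

```python
from typing import List

PAD_ID = 0  # assumed padding token id, for illustration only


def pad_dynamic(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad every sequence to the length of the longest sequence in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


def pad_fixed(batch: List[List[int]], max_length: int, pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad (or truncate) every sequence to one fixed length."""
    return [(seq + [pad_id] * max_length)[:max_length] for seq in batch]


# Two made-up tokenized headlines of different lengths.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]

dynamic = pad_dynamic(batch)
fixed = pad_fixed(batch, max_length=16)

dynamic_tokens = sum(len(seq) for seq in dynamic)
fixed_tokens = sum(len(seq) for seq in fixed)
print(f"tokens processed with dynamic padding: {dynamic_tokens}")
print(f"tokens processed with fixed padding:   {fixed_tokens}")
```

Headlines are short and vary in length, so padding per batch tends to leave far fewer padding tokens for the model to process than padding everything to one fixed length.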
The models were trained on a roughly balanced dataset of news headlines:
- Total Samples: 34,152 headlines.
- Class 0 (Fake News): 17,572 headlines.
- Class 1 (Real News): 16,580 headlines.
- Preprocessing: For Transformers, we use the `AutoTokenizer` for `distilbert-base-uncased`. Traditional models used lemmatization and Regex cleaning.
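The class counts above can be checked with a few lines of arithmetic. The sketch below derives the class shares and the accuracy of a trivial majority-class baseline, which is a useful floor when reading the reported scores; the variable names are illustrative.

```python
# Class counts as listed above: 0 = fake news, 1 = real news.
counts = {0: 17_572, 1: 16_580}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class.
majority_baseline = max(shares.values())

# Ratio of the larger class to the smaller one (1.0 means perfectly balanced).
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"total headlines: {total}")
for label, share in shares.items():
    print(f"class {label}: {share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
print(f"imbalance ratio: {imbalance_ratio:.3f}")
```

The two counts sum to the stated 34,152 headlines, and a classifier that always predicts "fake" would score about 51.5%, so accuracy is a reasonable headline metric for this dataset.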
- Deep Learning: Hugging Face Transformers (`AutoModelForSequenceClassification`, `AutoTokenizer`), PyTorch/TensorFlow.
- Machine Learning: Scikit-Learn (Pipelines, LinearSVC, LogisticRegression), XGBoost.
- NLP: DistilBERT, NLTK (Lemmatization), TF-IDF.
- Core: Python 3.12, Pandas, NumPy.
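As a rough idea of how the Scikit-Learn pieces listed above fit together, here is a minimal TF-IDF plus LinearSVC pipeline. It is a sketch, not the project's actual code in `src/`: the toy headlines, their labels, and the `ngram_range` setting are assumptions made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up headlines for illustration only: 0 = fake, 1 = real.
headlines = [
    "Scientists confirm the moon is made of cheese",
    "Celebrity reveals miracle cure doctors hate",
    "Aliens endorse local mayor in shock announcement",
    "Central bank raises interest rates by a quarter point",
    "Parliament passes annual budget after long debate",
    "City council approves new public transport plan",
]
labels = [0, 0, 0, 1, 1, 1]

# Vectorizer and classifier chained so that fit/predict accept raw text.
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
        ("clf", LinearSVC()),
    ]
)

pipeline.fit(headlines, labels)
predictions = pipeline.predict(headlines)
print(list(predictions))
```

Wrapping both steps in a `Pipeline` keeps the vectorizer's vocabulary tied to the training data, so the same object can be saved and reused for prediction without refitting the TF-IDF step separately.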
- Python 3.12 or higher.
- GPU recommended for running `notebooks/modern_nlp.ipynb`.
- Clone the repository:

      git clone git@github.com:coffeedrunkpanda/news-credibility-classifier.git
      cd news-credibility-classifier

- Install dependencies:

      pip install -r requirements.txt

- Prepare the environment: Download the dataset and best-trained models and place them in the following structure:
  - CSVs go into `./data/`
  - `.joblib` files go into `./outputs/models/`
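Before opening the notebooks, it can help to confirm that the files landed where the setup steps expect them. The helper below is a hypothetical convenience script, not part of the repository; it only assumes the `./data/` and `./outputs/models/` locations named above and makes no assumption about file names.

```python
from pathlib import Path
from typing import Dict, List


def check_layout(root: Path) -> Dict[str, List[str]]:
    """List the CSV and .joblib files found in the expected folders under root."""
    data_dir = root / "data"
    models_dir = root / "outputs" / "models"
    return {
        "csv": sorted(p.name for p in data_dir.glob("*.csv")) if data_dir.is_dir() else [],
        "joblib": sorted(p.name for p in models_dir.glob("*.joblib")) if models_dir.is_dir() else [],
    }


# Run from the repository root.
found = check_layout(Path("."))
for kind, names in found.items():
    status = ", ".join(names) if names else "none found"
    print(f"{kind}: {status}")
```

An empty list for either entry means the corresponding download step still needs to be done, or the files were placed in a different folder.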
├── notebooks/ # EDA and model experimentation
├── src/ # Modular Python scripts for the ML pipeline
├── outputs/ # Saved models and evaluation artifacts
├── reports/ # Final project documentation (PDF)
└── scripts/ # Automation for hyperparameter optimization
Built with passion by @coffeedrunkpanda and @harmandeep2993 during the Ironhack Bootcamp. We combined independent research to compare various NLP methodologies.