An AI-powered web application that uses Natural Language Processing (NLP) and Machine Learning to detect fake news articles, with reported per-model accuracy between 83.2% and 90.0%. This project demonstrates end-to-end ML model development, from data preprocessing to web deployment.
- 🤖 Ensemble ML Models: 4 models (Naive Bayes, Random Forest, Logistic Regression, SVM) with majority voting
- 🔍 AI Fact-Checker: Entity verification via Wikipedia, numerical claim validation, scam pattern detection
- 🌐 Google Fact Check API (Optional): Professional fact-checking from Snopes, PolitiFact, and other verified sources
- 🧠 Advanced NLP: Text preprocessing with tokenization, stemming, stopword removal, and TF-IDF vectorization
- 🎨 Modern Web Interface: Beautiful, responsive Flask-based UI with dark/light themes
- ⚡ Real-time Analysis: Instant classification with confidence scores and detailed insights
- 📊 Comprehensive Analytics: Probability distributions, model performance metrics, and fact-check warnings
- 🛡️ Input Validation: Smart edge case handling for URLs, non-English text, and invalid inputs
- 🧪 Thorough Testing: Unit tests, model evaluation suite, and edge case analysis
fake-news-detection/
├── app.py # Flask web application
├── config.py # Configuration settings
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── .gitignore # Git ignore rules
├── templates/ # HTML templates
│ ├── index.html # Main page
│ └── about.html # About page
├── src/ # Source code modules
│ ├── __init__.py
│ ├── data_processing.py # NLP preprocessing utilities
│ ├── model.py # ML model implementation
│ ├── fact_checker.py # AI fact-checking module (NEW!)
│ └── utils.py # Helper functions
├── data/ # Dataset storage
│ ├── raw/ # Raw datasets
│ ├── processed/ # Processed datasets
│ └── sample_data.csv # Sample dataset
├── models/ # Trained models
│ ├── naive_bayes_model.joblib
│ ├── vectorizer.joblib
│ └── training_results.txt
├── notebooks/ # Jupyter notebooks
│ ├── data_exploration.ipynb
│ └── text_processing_basics.ipynb
└── tests/ # Unit tests
    ├── __init__.py
    └── test_model.py
- Python 3.8 or higher
- pip package manager
- Git (for cloning)
git clone https://github.com/Ujjwal-eng/fake_news_detection.git
cd fake_news_detection
# Windows
python -m venv .venv
.venv\Scripts\activate
# macOS/Linux
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('omw-1.4')"
python -m spacy download en_core_web_sm
For enhanced fact-checking from professional sources like Snopes and PolitiFact:
- Get a free API key from Google Cloud Console
  - Follow the detailed setup guide: GOOGLE_API_SETUP.md
  - Free tier includes 10,000 requests per day
- Add the API key to your environment:
# Windows PowerShell
$env:GOOGLE_FACT_CHECK_API_KEY="your_google_api_key_here"
# Windows Command Prompt
set GOOGLE_FACT_CHECK_API_KEY=your_google_api_key_here
# Linux/Mac
export GOOGLE_FACT_CHECK_API_KEY="your_google_api_key_here"
- Or create a .env file (recommended for permanent storage):
# Create .env file in project root
echo "GOOGLE_FACT_CHECK_API_KEY=your_google_api_key_here" > .env
Note: The .env file is automatically ignored by git for security.
- Or add it to your shell profile:
# Linux/Mac - add to ~/.bashrc or ~/.zshrc
echo 'export GOOGLE_FACT_CHECK_API_KEY="your_key_here"' >> ~/.bashrc
source ~/.bashrc
# Windows - use System Environment Variables (GUI)
# Settings → System → About → Advanced system settings → Environment Variables
Note: The app works perfectly without the Google API using Wikipedia + pattern matching. The Google API is an optional enhancement that provides fact-checks from professional fact-checkers.
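The optional nature of the key can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `load_fact_check_key` and `fact_check_backends` and the backend labels are hypothetical; only the environment variable name comes from the setup steps above.

```python
import os

# Variable name taken from the setup steps above.
ENV_VAR = "GOOGLE_FACT_CHECK_API_KEY"


def load_fact_check_key(environ=None):
    """Return the API key, or None when it is unset or blank."""
    env = os.environ if environ is None else environ
    key = env.get(ENV_VAR, "").strip()
    return key or None


def fact_check_backends(environ=None):
    """List which fact-check sources would be active (hypothetical labels)."""
    backends = ["wikipedia", "pattern_matching"]
    # The Google backend is added only when a key is present.
    if load_fact_check_key(environ) is not None:
        backends.append("google_fact_check")
    return backends


print(fact_check_backends())
```

Passing a plain dict as `environ` makes the behaviour easy to check without touching the real environment.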
python app.py
Then open your browser and navigate to: http://localhost:5000
To train new models with a custom dataset:
# Place dataset in data/raw/
# Then run the training script
python -m src.model
For experimentation and analysis:
jupyter notebook
# Open notebooks/ directory
The models are trained on 11,632 professionally labeled articles spanning multiple years and contexts:
- 2016-2017: Political news from ISOT Fake News Dataset (5,816 articles)
- 2020-2021: COVID-19 misinformation and health news (3,581 articles)
- 2022-2023: Recent fake news from diverse sources (6,299 articles)
Dataset Composition:
- Real News: 5,816 articles (50%)
- Fake News: 5,816 articles (50%)
- Total: 11,632 perfectly balanced articles
Labeled fake news datasets require:
- ✅ Professional fact-checking (6-12 months)
- ✅ Expert verification from multiple sources
- ✅ Legal review to avoid defamation
- ✅ Consensus from fact-checking organizations
This makes 2016-2023 the most recent professionally verified data available for training.
text,label
"News article text here...",0
"Another news article...",1
- text: The news article content (string)
- label: 0 for real news, 1 for fake news (integer)
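A custom dataset in this format could be loaded and validated roughly like this. The function name `load_labeled_rows` is an assumption for illustration and is not taken from the project's source; it only uses the standard `csv` module.

```python
import csv
import io


def load_labeled_rows(handle):
    """Parse a text,label CSV and return a list of (text, label) tuples."""
    reader = csv.DictReader(handle)
    rows = []
    for row_no, record in enumerate(reader, start=1):
        text = (record.get("text") or "").strip()
        raw_label = (record.get("label") or "").strip()
        # Labels follow the convention above: 0 = real, 1 = fake.
        if raw_label not in {"0", "1"}:
            raise ValueError(f"row {row_no}: label must be 0 or 1, got {raw_label!r}")
        if not text:
            raise ValueError(f"row {row_no}: empty text")
        rows.append((text, int(raw_label)))
    return rows


sample = io.StringIO(
    'text,label\n'
    '"News article text here...",0\n'
    '"Another news article...",1\n'
)
rows = load_labeled_rows(sample)
print(rows)
```

Rejecting unknown labels early keeps a malformed file from silently skewing the class balance during training.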
Trained on 11,632 balanced articles (2016-2023):
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Naive Bayes | 83.2% | 83.7% | 83.2% | 83.1% |
| Logistic Regression | 90.0% | 90.0% | 90.0% | 90.0% |
| Random Forest | 87.8% | 87.9% | 87.8% | 87.8% |
| SVM | 89.7% | 89.7% | 89.7% | 89.7% |
Ensemble Voting System: Combines all 4 models with majority voting and confidence-based tie-breaking for optimal accuracy.
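One way to read "majority voting with confidence-based tie-breaking" is sketched below. The exact tie-breaking rule used by the project is not documented here, so this sketch assumes that a tie in vote count goes to the label with the higher summed confidence; `ensemble_vote` and the sample confidences are hypothetical.

```python
from collections import defaultdict


def ensemble_vote(predictions):
    """Combine per-model (label, confidence) pairs into one verdict.

    predictions: dict mapping model name -> (label, confidence).
    Returns (winning_label, mean confidence of the models that voted for it).
    """
    votes = defaultdict(list)
    for label, confidence in predictions.values():
        votes[label].append(confidence)
    # Rank by vote count first; summed confidence breaks ties (assumed rule).
    winner = max(votes, key=lambda lab: (len(votes[lab]), sum(votes[lab])))
    confidences = votes[winner]
    return winner, sum(confidences) / len(confidences)


# A 2-2 split: REAL wins because its voters are more confident overall.
tied = {
    "naive_bayes": ("FAKE", 0.62),
    "random_forest": ("REAL", 0.55),
    "logistic_regression": ("FAKE", 0.71),
    "svm": ("REAL", 0.90),
}
print(ensemble_vote(tied))
```

With four models a 2-2 split is possible, which is why some tie-breaking rule is needed at all; a 3-1 split is decided by vote count alone.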
- ISOT Fake News Dataset: 44,898 political articles (2016-2017) from Kaggle
- COVID-19 Fake News Dataset: 10,700 health-related articles (2020-2021)
- Recent Fake News Dataset: 6,335 diverse articles (2022-2023)
Final training set: 11,632 articles after deduplication and balancing
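Deduplication and balancing can be sketched as below. This is an assumption about how such a step might look, not the project's actual procedure: duplicates are detected by case- and whitespace-normalized text, and balancing downsamples the larger class at random. Both function names are hypothetical.

```python
import random


def deduplicate(rows):
    """Drop rows whose normalized text was already seen (keeps the first)."""
    seen = set()
    unique = []
    for text, label in rows:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


def balance(rows, seed=42):
    """Downsample the larger class so both labels have the same count."""
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for row in rows:
        by_label[row[1]].append(row)
    n = min(len(by_label[0]), len(by_label[1]))
    sampled = rng.sample(by_label[0], n) + rng.sample(by_label[1], n)
    rng.shuffle(sampled)
    return sampled


raw = [("A b", 0), ("a  B", 0), ("c", 0), ("d", 0), ("e", 1), ("f", 1)]
clean = balance(deduplicate(raw))
print(len(clean), sum(label for _, label in clean))
```

Downsampling discards data from the larger class, which is one plausible reason the final set is much smaller than the combined source datasets.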
- Python 3.8+: Primary programming language
- Flask: Web framework for the application
- scikit-learn: Machine learning models and evaluation
- NLTK: Natural language processing toolkit
- Pandas & NumPy: Data manipulation and analysis
- Joblib: Model serialization
- Naive Bayes Classifier
- Random Forest Classifier
- Logistic Regression
- Support Vector Machine (SVM)
- TF-IDF Vectorization
- HTML5 & CSS3
- JavaScript (Vanilla)
- Responsive Design
Raw Text → Lowercasing → Tokenization → Stop Word Removal → Stemming → Clean Text
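The pipeline above can be sketched with the standard library alone. Note the assumptions: the stop word list is a tiny illustrative subset, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (it is not the Porter algorithm); the project itself lists NLTK for this step.

```python
import re

# Tiny illustrative stop word list; a real pipeline would use a full list.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were",
              "of", "in", "to", "and", "that", "this"}
# Crude stand-in for a real stemmer; checked in this order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stop words, stem, and rejoin."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)  # letters-only tokenization
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in kept)


print(preprocess("The scientists were SHOCKED by this claim!"))
# → scientist shock by claim
```

The order of the steps matters: stop words are removed before stemming so that the lookup runs on unmodified word forms.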
- TF-IDF Vectorization: Converts text to numerical features
- N-gram Analysis: Captures word patterns (unigrams, bigrams)
- Feature Selection: Identifies most informative features
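To make the TF-IDF and n-gram steps concrete, here is a small pure-Python sketch. It is an illustration under stated assumptions rather than the project's code: it uses the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, which matches scikit-learn's default, but skips the L2 normalization that `TfidfVectorizer` also applies. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return unigrams and bigrams as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return grams


def tfidf(documents):
    """Map each document to {term: tf * idf} over unigrams and bigrams."""
    term_lists = [ngrams(doc.split()) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(documents)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


docs = ["shock claim goes viral", "official claim confirmed"]
vectors = tfidf(docs)
print(sorted(vectors[0].items(), key=lambda kv: -kv[1])[:3])
```

Terms that appear in every document (here "claim") get the minimum weight, while terms unique to one document score higher, which is what makes them informative features.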
- Multiple ML models trained on labeled datasets
- Ensemble predictions for improved accuracy
- Probability estimation for confidence scoring
- User submits news text
- Backend processes and vectorizes text
- Model predicts and returns result with confidence
- ✅ End-to-end ML project development
- ✅ Natural Language Processing techniques
- ✅ Model training, evaluation, and optimization
- ✅ Web application development with Flask
- ✅ RESTful API design
- ✅ Version control with Git/GitHub
- ✅ Code organization and best practices
Screenshots (captions):
- Clean and intuitive interface for news analysis
- Modern dark theme for comfortable viewing
- Detailed prediction with confidence scores and probability distributions
- Ensemble model predictions with individual model breakdowns
- Fake news detection with ensemble voting in dark mode
- Real news verification with confidence analysis in dark mode
- Deep learning models (LSTM, BERT, Transformers)
- Multi-language support (Hindi, Spanish, etc.)
- Source credibility analysis
- Real-time news monitoring and alerts
- Browser extension for instant verification
- Mobile application (iOS/Android)
- Integration with fact-checking APIs (Alt News, BOOM Live)
- User authentication and history tracking
- Temporal model updates with 2024+ data as it becomes available
This system is designed as an educational tool and ML demonstration project. It should not be used as the sole method for verifying news authenticity.
The models analyze writing patterns and linguistic features:
- ✅ Sensational language and emotional manipulation
- ✅ Conspiracy theory rhetoric patterns
- ✅ Poor grammar and structure (common in low-quality fake news)
- ✅ Clickbait-style headlines
- ✅ Absence of proper attribution and sources
- ✅ Vague or missing details
The fact-checker performs content verification and claim validation:
- ✅ Entity Verification: Cross-references organizations, locations, and infrastructure with Wikipedia
- ✅ Numerical Validation: Detects unrealistic claims (impossible percentages, distances, speeds)
- ✅ Scam Pattern Detection: Identifies viral message patterns ("forward this", "share urgently")
- ✅ Confidence Override: Automatically flags articles with verifiable false claims as FAKE
How it works: If the fact-checker detects contradictions (e.g., non-existent metro lines, impossible statistics), it overrides the ML prediction and classifies the article as FAKE NEWS, even if the writing style appears professional.
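The override logic can be sketched as follows. This is a simplified illustration, not the contents of `fact_checker.py`: Wikipedia entity verification is omitted, the only "contradiction" checked is a percentage above 100 (an assumption, since some contexts legitimately exceed 100%), and scam phrases are treated as warnings that do not trigger the override. All function names are hypothetical.

```python
import re

# Phrases quoted in the scam pattern bullet above.
SCAM_PHRASES = ("forward this", "share urgently")
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def fact_check(text):
    """Return (warnings, contradictions) found in the text."""
    lowered = text.lower()
    warnings = [f"scam pattern: {p!r}" for p in SCAM_PHRASES if p in lowered]
    contradictions = []
    for match in PERCENT_RE.finditer(text):
        # Assumption: percentages above 100 count as unrealistic here.
        if float(match.group(1)) > 100:
            contradictions.append(f"unrealistic percentage: {match.group(0)}")
    return warnings, contradictions


def final_verdict(ml_label, ml_confidence, text):
    """Override the ML label with FAKE when a contradiction is found."""
    warnings, contradictions = fact_check(text)
    overridden = bool(contradictions) and ml_label != "FAKE"
    return {
        "label": "FAKE" if contradictions else ml_label,
        "ml_confidence": ml_confidence,
        "overridden": overridden,
        "warnings": warnings,
        "contradictions": contradictions,
    }


result = final_verdict("REAL", 0.93, "The new line cut travel time by 250% overnight.")
print(result["label"], result["overridden"])
```

Keeping the ML confidence in the result lets the interface show that the style-based models disagreed with the fact-check, rather than hiding the override.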
Despite the dual-layer approach, some limitations remain:
- ❌ Cannot verify very recent events not yet documented on Wikipedia
- ❌ Cannot access paywalled sources or private databases
- ❌ Limited to English language content
- ❌ Cannot verify claims requiring real-time data or government databases
Example: A fake news article about a non-existent Delhi Metro line, written in professional journalistic style with specific details, would be misclassified as REAL by the ML models alone because it matches the writing patterns of legitimate news. The models detect patterns, not facts.
Solution: For production systems, combine these ML models with:
- Fact-checking APIs (Alt News, BOOM Live, Snopes)
- Knowledge graphs (Wikipedia, Wikidata)
- Government database integration
- Source credibility analysis
See MODEL_LIMITATIONS.md for detailed analysis.
- Training Period: 2016-2023 (most recent professionally labeled data)
- Performance: Optimized for news from the training period
- Modern Terms: May have limited exposure to very recent terminology (2024-2025 specific events/technologies)
Always:
- ✅ Cross-reference with multiple reliable sources
- ✅ Check the original source's credibility and reputation
- ✅ Consider the context, date, and author of publication
- ✅ Consult professional fact-checkers for important decisions
- ✅ Verify through established fact-checking organizations
This project demonstrates understanding of:
- Data labeling challenges in ML
- Temporal dataset drift and model limitations
- Professional ML project development
- Real-world constraints in fake news detection
Ujjwal Bansal
- GitHub: @Ujjwal-eng
- Project: Fake News Detection
This project is licensed under the MIT License - see the LICENSE file for details.
- scikit-learn team for excellent ML libraries
- NLTK developers for NLP tools
- Flask community for the lightweight web framework
- Kaggle for providing datasets
- Research papers on fake news detection for inspiration
If you have questions or suggestions:
- Open an issue on GitHub
- Fork the project and submit a pull request
- Star ⭐ the repository if you find it helpful!
Made with ❤️ and Python
