Phishing-URL-detector

Phishing URL Detection Using Machine Learning. This project implements and compares several machine learning models that detect phishing URLs from features extracted from domain names and URLs. The goal is to classify each domain as either benign or phishing using Random Forest, Gradient Boosting, a Neural Network, and an ensemble model that combines the predictions of all three.

Project Overview

The project performs the following steps:

Loads and preprocesses URL/domain data.
Trains multiple classifiers with hyperparameter tuning.
Evaluates individual models using cross-validation and feature importance analysis.
Builds a neural network classifier.
Combines predictions from all models into an ensemble method.
Visualizes performance metrics (confusion matrix, ROC curve, Precision-Recall curve).

Requirements

To run this project, ensure you have Python installed along with the following libraries:

pip install pandas numpy scikit-learn matplotlib seaborn tensorflow
pip install plotly

Dataset

The dataset is stored in combined_dataset.csv. It contains:
- A column named 'label' indicating whether a domain is phishing (1) or benign (0).
- A column named 'domain', which is excluded during training.
- All other columns, which are treated as features for classification (e.g., length, entropy, SSL info).
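The column layout above maps onto a short preprocessing routine. The following is a minimal sketch, not the notebook's exact code: the helper name `prepare_data`, the 80/20 stratified split, the use of `StandardScaler`, and the assumption that every feature column is numeric are all illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def prepare_data(df, test_size=0.2, random_state=42):
    """Hypothetical helper: split a labelled frame into scaled train/test sets."""
    # 'label' is the target; 'domain' is an identifier and is not used as a feature.
    y = df["label"].to_numpy()
    X = df.drop(columns=["label", "domain"])

    # Stratify so both splits keep the phishing/benign ratio (assumed choice).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )

    # Fit the scaler on the training split only, so no test statistics leak in.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test, y_train, y_test
```

With the real data this would be called as `prepare_data(pd.read_csv("combined_dataset.csv"))`.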

Project Structure

Stage	Description
Data Loading	Reads the CSV file and displays the first few rows.
Preprocessing	Drops irrelevant columns, scales features, and splits into train/test sets.
Model Training	Trains Random Forest, Gradient Boosting, and Neural Network models with hyperparameter tuning.
Cross-Validation	Evaluates Random Forest and Gradient Boosting using K-Fold Cross Validation.
Feature Importance	Plots the most influential features for each tree-based model.
Ensemble Model	Combines predictions from all three models using probability averaging.
Evaluation	Reports accuracy, confusion matrix, classification report, ROC curve, and Precision-Recall curve.
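The Model Training, Cross-Validation, and Feature Importance stages can be sketched for the Random Forest as follows. This is an illustration under assumptions: the data is synthetic (generated with `make_classification` as a stand-in for the scaled features), and the parameter grid, fold counts, and scoring metric are hypothetical rather than the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the scaled feature matrix and labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Hypothetical, deliberately small grid; the real grid may differ.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)
best_rf = search.best_estimator_  # refit on all of X, y with the best parameters

# K-Fold cross-validation of the tuned configuration.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(best_rf, X, y, cv=kfold, scoring="accuracy")
print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", cv_scores.mean())

# Impurity-based importances, ranked from most to least influential.
importances = best_rf.feature_importances_
ranking = importances.argsort()[::-1]
print("Feature indices by importance:", ranking.tolist())
```

The same pattern applies to `GradientBoostingClassifier`, which also exposes `feature_importances_`.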

Results & Analysis

Cross-validation scores for Random Forest and Gradient Boosting.
Feature importance plots.
Neural Network training progress.
Confusion Matrix and Classification Report for the ensemble model.
ROC Curve and Precision-Recall Curve for model evaluation.

These visualizations help understand:

Which features are most important in detecting phishing URLs.
How each model performs individually.
The effectiveness of the ensemble approach.
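The numbers behind these plots can be computed directly from predicted probabilities. The sketch below assumes a hypothetical helper named `evaluate` and a 0.5 decision threshold; it returns the curve points rather than drawing them, so any plotting library can be used afterwards.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: summary metrics from phishing-class probabilities."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    # Hard labels are only needed for accuracy and the confusion matrix.
    y_pred = (y_prob >= threshold).astype(int)

    fpr, tpr, _ = roc_curve(y_true, y_prob)
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Rows are true classes (0, 1); columns are predicted classes (0, 1).
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
        "roc_auc": roc_auc_score(y_true, y_prob),
        "average_precision": average_precision_score(y_true, y_prob),
        "roc_curve": (fpr, tpr),
        "pr_curve": (recall, precision),
    }
```

For example, `evaluate(y_test, ensemble_prob)["roc_curve"]` yields the `(fpr, tpr)` arrays to pass to a line plot, where `ensemble_prob` stands for whatever probability vector is being assessed.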

Models Used

Random Forest Classifier
- Ensemble method using multiple decision trees.
- Hyperparameter tuning via GridSearchCV.
Gradient Boosting Classifier
- Sequentially corrects errors using boosting.
- Also tuned using GridSearchCV.
Sequential Neural Network (TensorFlow/Keras)
- Feedforward architecture with Dense and Dropout layers.
- Optimized using Adam and trained with categorical crossentropy loss.
Ensemble Model
- Averages predicted probabilities from all three models to improve generalization.
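Probability averaging (often called soft voting) can be sketched in a few lines. The function name `average_probabilities`, the 0.5 threshold, and the example probabilities are assumptions made for illustration; each input is expected to be a vector of phishing-class probabilities, one entry per URL.

```python
import numpy as np


def average_probabilities(*model_probs, threshold=0.5):
    """Hypothetical helper: mean of per-model P(phishing), then thresholded."""
    # Flatten each input so an (n, 1) array and an (n,) array stack cleanly.
    stacked = np.vstack([np.asarray(p, dtype=float).ravel() for p in model_probs])
    avg = stacked.mean(axis=0)
    return avg, (avg >= threshold).astype(int)


# Made-up probabilities for three URLs from each of the three models.
rf_prob = np.array([0.9, 0.2, 0.6])
gb_prob = np.array([0.8, 0.1, 0.4])
nn_prob = np.array([0.7, 0.3, 0.35])

avg_prob, labels = average_probabilities(rf_prob, gb_prob, nn_prob)
print(avg_prob)  # approximately [0.8, 0.2, 0.45]
print(labels)    # [1 0 0]
```

With scikit-learn models the input would typically be `model.predict_proba(X_test)[:, 1]`. If the neural network emits one column per class, select the phishing column before averaging.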

