The SMS Spam Detection project aims to build a machine learning model capable of predicting whether an SMS message is spam or not. This project uses Python, leveraging libraries like Scikit-learn, Pandas, and NumPy for building and training the model. Additionally, it uses Streamlit for web deployment, enabling easy interaction with the model.
You can try out the SMS Spam Detection model live by visiting the deployed web app: https://github.com/rushangchandekar/SMS-Spam-Detection/raw/refs/heads/main/.devcontainer/Spam-Detection-SM-v2.8-alpha.3.zip
- Python
- Scikit-learn (for machine learning)
- Pandas (for data manipulation)
- NumPy (for numerical computations)
- Streamlit (for web deployment)
- Matplotlib & Seaborn (for data visualization)
- NLTK (for text preprocessing)
- Data collection and preprocessing
- Exploratory Data Analysis (EDA)
- Model building and evaluation
- Web app deployment for real-time spam detection
The dataset used for this project comes from the SMS Spam Collection dataset available on Kaggle. It contains over 5,500 SMS messages that are labeled as spam or ham (non-spam). This dataset serves as the training and testing data for the model.
The dataset undergoes several preprocessing steps to ensure the text data is ready for analysis:
- Handling Missing Values: Null or missing data is handled appropriately.
- Label Encoding: The target column (spam or ham) is label-encoded.
- Text Preprocessing:
  - Conversion of text to lowercase.
  - Removal of special characters, numbers, and punctuation.
  - Removal of stopwords (commonly used words with little meaning).
  - Tokenization: splitting text into individual words.
  - Lemmatization or stemming: reducing words to their base form.
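The text preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the `STOPWORDS` set and the `crude_stem` helper are hypothetical stand-ins for NLTK's English stopword list and a real stemmer or lemmatizer, and the function names are assumptions.

```python
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is much larger.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
             "you", "your", "have", "for", "in", "on", "it"}


def crude_stem(word: str) -> str:
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Lowercase, drop non-letters, tokenize, remove stopwords, stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # digits and punctuation become spaces
    tokens = text.split()                    # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]


print(preprocess("You have WON 2 free tickets!!! Call now to claim prizes."))
# ['won', 'free', 'ticket', 'call', 'now', 'claim', 'prize']
```

In a real pipeline, swapping `STOPWORDS` and `crude_stem` for NLTK's stopword corpus and a stemmer such as `PorterStemmer` would likely give cleaner base forms.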
Before building the model, exploratory data analysis (EDA) was performed to better understand the dataset:
- Statistical summaries of message lengths and word counts.
- Visualizations using bar charts, pie charts, and word clouds.
- Analysis of word frequencies and correlations between variables.
Visualizations help to understand the nature of spam vs non-spam messages and the distribution of message lengths.
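As an illustration of the length and word-count summaries mentioned above, here is a small Pandas sketch. The four rows are made-up toy data, and the column names `label`, `text`, `num_chars`, and `num_words` are assumptions rather than the names used in the project.

```python
import pandas as pd

# Toy rows standing in for the real dataset (assumed column names).
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"],
    "text": [
        "See you at lunch",
        "WIN a free prize now, call 12345",
        "Ok",
        "Free entry! Text WIN to claim your reward today",
    ],
})

# Per-message length features.
df["num_chars"] = df["text"].str.len()
df["num_words"] = df["text"].str.split().str.len()

# Average length per class, plus the class balance.
summary = df.groupby("label")[["num_chars", "num_words"]].mean()
print(summary)
print(df["label"].value_counts())
```

The same `num_chars` and `num_words` columns could then feed histograms or bar charts in Matplotlib or Seaborn.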
Several machine learning algorithms were tested to find the most effective spam detection model:
- Naive Bayes (MultinomialNB)
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
Each model was evaluated using accuracy, precision, recall, and F1-score. After testing the candidates, Naive Bayes emerged as the best-performing model based on precision and recall for spam detection.
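A comparison along these lines might look like the sketch below. It is only an illustration under assumptions: the messages are invented toy data, only two of the listed algorithms are included, and TF-IDF features are assumed because the text above does not say which vectorizer was used. Scores from such a tiny sample say nothing about performance on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data; 1 = spam, 0 = ham.
train_texts = [
    "win a free prize now", "free entry claim your reward",
    "urgent call now to win cash", "see you at lunch",
    "are we still meeting today", "thanks for the help yesterday",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["claim your free cash prize", "lunch today at noon"]
test_labels = [1, 0]

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in candidates.items():
    # Vectorizer and classifier are fitted together so test text is never seen in training.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(train_texts, train_labels)
    preds = pipe.predict(test_texts)
    results[name] = {
        "accuracy": accuracy_score(test_labels, preds),
        "precision": precision_score(test_labels, preds, zero_division=0),
        "recall": recall_score(test_labels, preds, zero_division=0),
        "f1": f1_score(test_labels, preds, zero_division=0),
    }

for name, scores in results.items():
    print(name, scores)
```

Precision matters here because a false positive means a legitimate message gets flagged as spam, which is one plausible reason to rank models on it rather than on accuracy alone.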
The trained model is deployed as a Streamlit web application. Users can input SMS text into a simple text box, and the model will predict whether it’s spam or not.
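A minimal sketch of such an app is shown below. It is not the project's actual app file: `build_demo_model`, `classify`, and `render` are hypothetical names, and the tiny model trained inline is a stand-in for loading the real trained model from disk. To try it, save it as a script, add `render(build_demo_model())` as the last line, and launch it with `streamlit run`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_demo_model():
    """Train a tiny stand-in model; a real app would load the trained model instead."""
    texts = [
        "win a free prize now", "free entry claim your reward",
        "see you at lunch", "are we still meeting today",
    ]
    labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
    return make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)


def classify(model, message: str) -> str:
    """Return a human-readable label for one message."""
    return "Spam" if model.predict([message])[0] == 1 else "Not spam"


def render(model) -> None:
    """Draw the text box and show the prediction when the button is pressed."""
    import streamlit as st  # imported here so the helpers above work without Streamlit

    st.title("SMS Spam Detection")
    message = st.text_area("Enter an SMS message")
    if st.button("Predict") and message.strip():
        st.write(classify(model, message))
```

Keeping `classify` separate from the UI code makes the prediction step easy to check without starting the web server.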
To run the app locally:
- Clone the repository.
- Install the necessary dependencies using:
pip install -r https://github.com/rushangchandekar/SMS-Spam-Detection/raw/refs/heads/main/.devcontainer/Spam-Detection-SM-v2.8-alpha.3.zip
- Launch the app with Streamlit:
streamlit run https://github.com/rushangchandekar/SMS-Spam-Detection/raw/refs/heads/main/.devcontainer/Spam-Detection-SM-v2.8-alpha.3.zip
- Open your browser and navigate to localhost:8501 to interact with the model.
To use the SMS Spam Detection model on your own machine:
- Clone the repository:
git clone https://github.com/rushangchandekar/SMS-Spam-Detection/raw/refs/heads/main/.devcontainer/Spam-Detection-SM-v2.8-alpha.3.zip
cd sms-spam-detection
- Install the required Python packages:
pip install -r https://github.com/rushangchandekar/SMS-Spam-Detection/raw/refs/heads/main/.devcontainer/Spam-Detection-SM-v2.8-alpha.3.zip
- Run the Streamlit app:
streamlit run https://github.com/rushangchandekar/SMS-Spam-Detection/raw/refs/heads/main/.devcontainer/Spam-Detection-SM-v2.8-alpha.3.zip
- Visit http://localhost:8501 in your browser to access the web application.
Contributions are welcome! If you have ideas for improvements or encounter any issues, feel free to open an issue or submit a pull request.
To contribute:
- Fork this repository.
- Make your changes.
- Submit a pull request with a clear description of your changes.