This repository contains the solution to Task 1 of the Data Analyst Internship, focusing on data cleaning and preprocessing using Python (Pandas) in Google Colab.
Clean and prepare a raw dataset by:
- Identifying and handling missing values
- Removing duplicates
- Standardizing column names and text data
- Ensuring consistent data types and formats
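
The first two objectives above can be sketched in Pandas as follows. This is a minimal illustration on a small made-up table, not the notebook's actual code; the `raw` frame and its `Name` and `Age` columns are hypothetical and do not come from Mall_Customers.csv.

```python
import pandas as pd

# Hypothetical toy data (not from Mall_Customers.csv): one missing name
# and one exact duplicate row.
raw = pd.DataFrame({
    "Name": ["Ann", "Bob", "Bob", None],
    "Age": [34, 41, 41, 29],
})

# Count missing values per column and fully duplicated rows.
missing_per_column = raw.isna().sum()
duplicate_rows = int(raw.duplicated().sum())

# One simple handling strategy: drop incomplete rows, then drop duplicates.
cleaned = raw.dropna().drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(cleaned)
```

Dropping rows is only one option; filling missing values with `fillna` may be preferable when losing rows would bias the analysis.
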
Dataset Name: Mall Customer Segmentation Data
Source: Provided during the internship task
File: Mall_Customers.csv
| Step | Description |
|---|---|
| Missing Values Check | No missing values found |
| Duplicate Check | No duplicate rows present |
| Column Renaming | Standardized to lowercase with underscores |
| Text Standardization | Gender values standardized to title case |
| Data Type Check | All data types confirmed appropriate |
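
The renaming and text standardization steps in the table above might look roughly like the sketch below. It assumes the column headers commonly seen in the public Mall Customers dataset (`CustomerID`, `Gender`, `Age`, `Annual Income (k$)`, `Spending Score (1-100)`) and uses a few illustrative rows rather than the real file; the exact names produced in the notebook may differ.

```python
import pandas as pd

# Illustrative rows only; the column headers are an assumption about
# Mall_Customers.csv, and the values are made up for this sketch.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["male", "FEMALE", " Female "],
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

# Column renaming: lowercase, with runs of non-alphanumeric characters
# collapsed into a single underscore and stray underscores trimmed.
sample.columns = (
    sample.columns.str.strip()
    .str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_")
)

# Text standardization: trim whitespace and convert gender to title case.
sample["gender"] = sample["gender"].str.strip().str.title()

# Data type check: inspect the dtypes after cleaning.
print(sample.dtypes)
```

With the real file, the same steps would start from `pd.read_csv("Mall_Customers.csv")` and could end with `to_csv("cleaned_mall_customers.csv", index=False)`.
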
| File | Description |
|---|---|
| task1_data_cleaning.ipynb | Google Colab notebook with the complete cleaning process |
| Mall_Customers.csv | Original dataset |
| cleaned_mall_customers.csv | Cleaned and processed version of the dataset |
| README.md | Summary and documentation of the task |
- Python 3
- Pandas
- Google Colab
- GitHub
- Hands-on experience with Pandas for cleaning real-world datasets
- Techniques to detect and handle common data issues
- Understanding the importance of standardization and preprocessing before analysis
This task is submitted as part of the internship program.
To view the solution notebook or the cleaned dataset, explore the files above.