This project focuses on building and evaluating machine learning classification models to predict thyroid disease status based on a publicly available medical dataset from the UCI Machine Learning Repository (Garavan Institute, Sydney). The goal is to establish a robust prediction baseline and identify the most critical clinical features driving the classification.
| Category | Technology | Purpose |
|---|---|---|
| Language | Python (3.x) | Core programming language. |
| Data Handling | Pandas, NumPy | Data loading, manipulation, and numerical operations. |
| Modeling | Scikit-learn (sklearn) | Implementing Decision Tree and K-Nearest Neighbors classifiers. |
| Visualization | Matplotlib, Seaborn | Generating model comparison and feature importance plots. |
| Environment | Jupyter Notebook/IDE | Interactive development and execution. |
- Dataset: `thyroid.csv` (UCI Thyroid Disease Data Set)
- Data Size (Raw): 9,172 patient records.
- Data Challenges:
  - Missing values encoded as `?`.
  - Categorical features encoded as non-numeric strings (`t`/`f`, `F`/`M`, etc.).
  - Unlabeled columns (due to missing header).
  - Highly sparse/redundant measurement flag columns.
The project followed a standard, sequential machine learning pipeline, focusing heavily on robust data preparation to ensure clean inputs for the models.
| Step | Action Taken | Rationale |
|---|---|---|
| Standardization | Replaced all `?` values with `np.nan`. | Standardizing missing values for proper handling. |
| Binary Encoding | Converted boolean columns (`t`/`f`) to numerical (1/0). | Preparing categorical flags for machine learning algorithms. |
| Sex Encoding & Imputation | Encoded `F` as 2, `M` as 1. Imputed missing sex values using the mode (most frequent category). | Correct statistical approach for imputing categorical features. |
| Column Dropping | Removed the six `*_measured` flag columns and the highly sparse `TBG` column. | The `*_measured` columns are redundant after imputation, and `TBG` had over 95% missing data. |
| Feature/Target Split | Separated the cleaned data into Features (`X`) and Target (`y`). | Prepared data for supervised learning. |
| Train/Test Split | Split `X` and `y` into training and testing sets (e.g., 70/30 split). | Ensuring model performance is evaluated on unseen data. |
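The preparation steps in the table can be sketched roughly as follows. This is an illustrative outline rather than the project's actual notebook code: the tiny in-memory frame stands in for `thyroid.csv`, and the column names (including `target`) are assumptions, since the raw file has no header. The numeric conversion step is also an assumption, added because columns containing `?` load as strings. The sketch stops before any imputation of numeric lab values, which the table does not describe.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for thyroid.csv; column names and values are hypothetical.
raw = pd.DataFrame({
    "age": ["41", "23", "?", "70"],
    "sex": ["F", "M", "?", "F"],
    "on_thyroxine": ["f", "t", "f", "f"],
    "TSH_measured": ["t", "t", "f", "t"],
    "TSH": ["1.3", "4.1", "?", "0.98"],
    "TBG": ["?", "?", "?", "?"],
    "target": ["negative", "positive", "negative", "negative"],
})

# Standardization: treat "?" as a real missing value.
df = raw.replace("?", np.nan)

# Binary encoding: t/f flags become 1/0.
df["on_thyroxine"] = df["on_thyroxine"].map({"t": 1, "f": 0})

# Sex encoding (F -> 2, M -> 1), then mode imputation for missing entries.
df["sex"] = df["sex"].map({"F": 2, "M": 1})
df["sex"] = df["sex"].fillna(df["sex"].mode()[0]).astype(int)

# Column dropping: the *_measured flags and the sparse TBG column.
measured_flags = [c for c in df.columns if c.endswith("_measured")]
df = df.drop(columns=measured_flags + ["TBG"])

# Assumed extra step: lab values arrive as strings, so convert them to numbers.
for col in ["age", "TSH"]:
    df[col] = pd.to_numeric(df[col])

# Feature/target split.
X = df.drop(columns="target")
y = df["target"]
print(X)
```

Imputing `sex` with the mode keeps the column categorical in meaning; a mean would produce a value such as 1.7 that corresponds to no category.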
Two different classification algorithms were chosen for baseline comparison:
- Decision Tree Classifier: Trained with a maximum depth of `3` to prioritize interpretability and prevent overfitting.
- K-Nearest Neighbors (K-NN): Trained with a default number of neighbors (often `k=5`) to provide a distance-based classification baseline.
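A minimal sketch of this baseline comparison is shown below. It uses synthetic data from `make_classification` as a stand-in for the cleaned thyroid features, so the accuracies it prints are not the project's results. The `random_state` values and the stratified split are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix and target.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# 70/30 split, as in the data preparation table.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),  # n_neighbors defaults to 5
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: {scores[name]:.2%}")
```

K-NN is distance-based, so on the real data its score is sensitive to feature scale; an unscaled baseline like this one tends to favor whichever lab value has the largest numeric range.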
| Model Algorithm | Hyperparameters | Accuracy Score (on Test Set) |
|---|---|---|
| Decision Tree | `max_depth=3` | 67.96% |
| K-Nearest Neighbors | Default `k` | 61.91% |
Conclusion: The Decision Tree outperforms the baseline K-NN model by about six percentage points (67.96% vs. 61.91%), suggesting that a simple rule-based structure suits this dataset better than an untuned distance-based model.
The Decision Tree model provided significant insights into the predictive power of the features.
| Rank | Feature Name | Relative Importance Score |
|---|---|---|
| 1 | `psych` | 0.495 |
| 2 | `T3` | 0.300 |
| 3 | `on_thyroxine` | 0.202 |
| 4+ | All Others | |
Crucial Finding: Zero Importance

Features such as `age`, `sex`, `sick`, `pregnant`, and notably, `on_antithyroid_medication`, were assigned an importance score of zero.
Confirmed Reason for `on_antithyroid_medication` (and similar flags): Data Sparsity.

- Dataset Analysis: The raw data for `on_antithyroid_medication` shows that 98.74% of records are 'f' (False/No).
- Model Impact: The Decision Tree avoids this feature because it cannot find a balanced, effective split. Splitting 98.74% of the data away from 1.26% provides negligible gain in purity compared to splitting on highly variable features like `T3` or `psych`.
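The purity argument can be made concrete with a small Gini calculation. The sketch below is a worked illustration, not output from the project: only the 1.26% flag rate comes from the dataset analysis above, while the 10,000-record total, the 30% positive rate, and the class counts on each side of the two splits are hypothetical numbers chosen to show the effect.

```python
def gini(pos, neg):
    """Gini impurity of a node holding `pos` positive and `neg` negative records."""
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 2 * p * (1 - p)


def gini_gain(left, right):
    """Impurity decrease from splitting a parent into two (pos, neg) children."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    parent = gini(left[0] + right[0], left[1] + right[1])
    children = (n_left / n) * gini(*left) + (n_right / n) * gini(*right)
    return parent - children


# Hypothetical parent node: 10,000 records, 3,000 of them positive.
# Sparse flag: 126 records (1.26%) are flagged; assume the best case
# where every flagged record is positive, so that child is perfectly pure.
sparse_gain = gini_gain(left=(126, 0), right=(2874, 7000))

# Hypothetical balanced feature: splits the records in half, and the
# two halves differ clearly in their positive rate (50% vs. 10%).
balanced_gain = gini_gain(left=(2500, 2500), right=(500, 4500))

print(f"sparse flag gain:      {sparse_gain:.4f}")
print(f"balanced feature gain: {balanced_gain:.4f}")
```

Even when the rare flag isolates a perfectly pure group, that group is so small that the weighted impurity of the whole node barely moves, so a greedy tree with only a few splits available picks other features first.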
The next phase of the project is dedicated to hyperparameter tuning and model refinement to improve the current accuracy scores.
- K-NN Tuning: Identify the optimal number of neighbors ($k$) using a validation curve to maximize accuracy and close the gap with the Decision Tree.
- Decision Tree Tuning: Test various `max_depth` values to potentially increase performance without sacrificing generalization (avoiding overfitting).
- Model Selection: Retrain the best-performing algorithm (Decision Tree or tuned K-NN) with its optimal hyperparameters for the final result.
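One way the tuning steps above might look with scikit-learn's `validation_curve` is sketched here. The data is synthetic, and the candidate ranges for `n_neighbors` and `max_depth`, the 5-fold setting, and the helper name `best_param` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training portion of the cleaned data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)


def best_param(estimator, param_name, param_range):
    """Return the candidate with the highest mean cross-validated accuracy."""
    _, val_scores = validation_curve(
        estimator, X, y,
        param_name=param_name, param_range=param_range,
        cv=5, scoring="accuracy",
    )
    mean_val = val_scores.mean(axis=1)  # average over the CV folds
    return param_range[int(np.argmax(mean_val))], mean_val


k_range = [1, 3, 5, 7, 9, 11, 15]   # hypothetical candidates for k
depth_range = [2, 3, 4, 5, 6, 8]    # hypothetical candidates for max_depth

best_k, k_scores = best_param(KNeighborsClassifier(), "n_neighbors", k_range)
best_depth, depth_scores = best_param(
    DecisionTreeClassifier(random_state=42), "max_depth", depth_range
)

print(f"best k: {best_k}, best max_depth: {best_depth}")
```

Running the curve on the training split only, and keeping the test split untouched until the final retrain, avoids leaking test information into the hyperparameter choice.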
After establishing the baseline, the project was scaled up using a gradient boosting framework and a production-ready interface.
To improve upon the 67.96% accuracy of the Decision Tree, the model was upgraded to XGBoost (Extreme Gradient Boosting).
- New Algorithm: XGBoost Classifier.
- Feature Engineering: Expanded feature set to 20 clinical markers, including scaled numerical values for TSH, T3, TT4, T4U, and FTI.
- Result: The model achieved significantly higher precision and recall compared to the baseline models.
A professional-grade web interface was developed to allow healthcare providers to interact with the model in real time.
| Feature | Description |
|---|---|
| Framework | Streamlit (Python-based web framework). |
| Real-time Inputs | 12+ clinical inputs including age, sex, and laboratory blood levels (TSH, T3, etc.). |
| Instant Diagnosis | Automated prediction with a visual confidence score (Probability percentage). |
| Report Generation | Integration with ReportLab to generate a downloadable PDF Medical Report for patients. |
The final deployment includes the following core files:
- `main.py`: The Streamlit web application logic.
- `thyroid_model.json`: The trained XGBoost model "brain."
- `scaler.joblib`: The numerical scaler used to normalize patient data.
- `requirements.txt`: List of dependencies (XGBoost, Streamlit, ReportLab, etc.).
To run the diagnostic system locally:
- Install dependencies: `pip install -r requirements.txt`
- Launch the app: `streamlit run main.py`
The final phase focused on transforming the "Black Box" model into an Explainable AI system. This ensures that every diagnosis can be audited and understood by a medical professional.
The application now generates a real-time Feature Importance Chart for every diagnosis using the XGBoost gain metric.
| Metric | Description |
|---|---|
| Feature Gain | Measures the "Information Value" each clinical marker contributes to the final decision. |
| Top Drivers | Automatically identifies and ranks the top 10 markers (e.g., TSH, T3, Psych) influencing the current result. |
| Visualization | Built using Matplotlib and Seaborn for high-clarity data storytelling. |
The chart serves as a "Reasoning Map" for the AI's decision:
- The X-Axis (Gain): Represents the average improvement in the model's training objective from the splits that use that feature. Longer bars indicate that the model relied more heavily on that specific laboratory value.
- The Y-Axis (Clinical Markers): Lists the patient's data points. If TSH is at the top, it confirms the AI is following standard medical protocols where TSH is the primary indicator of thyroid health.
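- The "Why" Factor: This chart helps answer the patient's question: "Why did the AI give me this result?" It shows which parts of the blood work or medical history carry the most weight in the model. Because gain is computed over the whole trained model, the ranking reflects what the model relies on in general rather than a breakdown specific to one patient.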
- Logic: Integrated XGBoost's `get_score(importance_type='gain')` directly into the Streamlit UI.
- Library Update: Added `matplotlib` and `seaborn` to the production environment for on-the-fly rendering.
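The "top drivers" ranking can be illustrated with plain Python. In the app, the input dictionary would presumably come from calling `get_score(importance_type='gain')` on the trained booster, which returns a mapping from feature name to gain. Here the helper name `top_drivers` and the gain values are hypothetical, so the snippet runs without a trained model.

```python
def top_drivers(gain_scores, n=10):
    """Rank features by gain, highest first, and keep the top `n`."""
    ranked = sorted(gain_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical gain values shaped like the dict returned by get_score.
example_scores = {
    "TSH": 12.4,
    "T3": 7.9,
    "psych": 3.1,
    "age": 0.6,
    "on_thyroxine": 2.2,
}

for name, gain in top_drivers(example_scores, n=3):
    print(f"{name:<14} {gain:.1f}")
```

Features that the model never splits on do not appear in the `get_score` output at all, so a chart built this way can show fewer than ten bars.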
- Phase 1: Baseline models (Decision Tree/K-NN) and data cleaning.
- Phase 2: Upgrading to XGBoost, creating the Web App, and PDF reporting.
- Phase 3: Implementing Explainable AI (XAI) to show diagnostic reasoning.



