This repository contains the complete implementation and analysis for a university assignment on classification problems, focusing on corporate bankruptcy prediction using financial indicators.
The project evaluates and compares multiple machine learning classifiers under class imbalance conditions and answers specific performance-related questions defined in the assignment.
The dataset consists of financial ratios, binary activity indicators, company status (healthy or bankrupt), and the corresponding year for each company. Each row represents a different company.
The implementation follows the full assignment specification, including:
- data loading from Excel,
- exploratory data analysis and visualization,
- missing value checks,
- Min–Max normalization,
- Stratified K-Fold cross-validation (k=4),
- class imbalance handling with undersampling (3:1 ratio),
- training and evaluation of multiple classification models,
- generation of confusion matrices,
- storage of experimental results in CSV/Excel format,
- additional analysis and visualization using pivot tables in Excel.
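The preprocessing and resampling steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. It assumes label `1` means bankrupt, that the 3:1 ratio means three healthy companies per bankrupt one, and that undersampling is applied only to the training part of each fold. The helper names `undersample` and `make_folds` are hypothetical, and the data is synthetic. The scaler is fitted before splitting to mirror the order listed above; fitting it inside each training fold instead would avoid information leaking from the test part.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler


def undersample(X, y, ratio=3, seed=0):
    """Keep every minority row (y == 1) and at most `ratio` times as many majority rows."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_keep = min(len(majority), ratio * len(minority))
    kept = rng.choice(majority, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([minority, kept]))
    return X[idx], y[idx]


def make_folds(X, y, n_splits=4, ratio=3, seed=0):
    """Yield (X_train, y_train, X_test, y_test); only the training part is undersampled."""
    X_scaled = MinMaxScaler().fit_transform(X)  # assumption: scaled once, before splitting
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X_scaled, y):
        X_tr, y_tr = undersample(X_scaled[train_idx], y[train_idx], ratio, seed)
        yield X_tr, y_tr, X_scaled[test_idx], y[test_idx]


# Synthetic stand-in data: 400 companies, 5 features, 5% bankrupt.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 5))
y_demo = np.zeros(400, dtype=int)
y_demo[:20] = 1
folds = list(make_folds(X_demo, y_demo))
```

The test part of each fold keeps its original class distribution, so the reported test metrics reflect the imbalance a model would face in practice.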
- notebooks/
  Executable Jupyter notebook developed in Google Colab. This is the main implementation and contains all experiments, figures and outputs.
- data/
  Input dataset provided by the assignment.
- results/
  Output files generated from the Python code and further analyzed in Excel (e.g. balancedDataOutcomes.csv/.xlsx).
- report/
  Final project report (PDF), written in Greek, answering all assignment questions.
The project was originally developed in Google Colab. The Jupyter notebook can be executed cell-by-cell in a notebook environment.
An exported .py version of the notebook is also included for reference, but the
notebook is the recommended way to run the code.
The following eight (8) classification models were trained and evaluated:
- Linear Discriminant Analysis (LDA)
- Logistic Regression
- Decision Tree
- Random Forest
- k-Nearest Neighbors (k-NN)
- Naive Bayes
- Support Vector Machine (SVM)
- XGBoost (additional model)
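One way to organize such a model set is a dictionary of scikit-learn estimators, sketched below. The hyperparameters are library defaults and placeholders, not the values used in the notebook. `GaussianNB` is an assumed Naive Bayes variant, `build_models` is a hypothetical helper, and XGBoost is added only if the `xgboost` package is installed.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return a name -> estimator mapping for the classifiers listed above."""
    models = {
        "LDA": LinearDiscriminantAnalysis(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "k-NN": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),  # assumption: Gaussian variant
        # probability=True enables predict_proba, which ROC-AUC scoring can use
        "SVM": SVC(probability=True, random_state=seed),
    }
    try:
        from xgboost import XGBClassifier  # optional dependency
        models["XGBoost"] = XGBClassifier(random_state=seed)
    except ImportError:
        pass  # skip XGBoost when the package is not available
    return models


models = build_models()
```

Keeping the estimators in one mapping lets the same training and evaluation loop run over every model and fold.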
Model performance was evaluated on both training and test sets using:
- Accuracy
- Precision
- Recall (Sensitivity)
- F1 Score
- ROC-AUC
- Specificity (computed during Excel analysis)
Because the classes are imbalanced, accuracy is dominated by the majority (healthy) class, so F1 Score was selected as the primary metric for model comparison.
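As an illustration of how these metrics relate to the confusion matrix, the sketch below computes all six in Python with scikit-learn. In the project itself, specificity was computed later in Excel; here it is derived as TN / (TN + FP) for comparison. The `evaluate` helper and the toy labels are hypothetical, and label `1` is assumed to mean bankrupt.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)


def evaluate(y_true, y_pred, y_score):
    """Compute the listed metrics for one binary-classification run."""
    # With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_score),
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy example: 6 healthy (0) and 4 bankrupt (1) companies.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.7, 0.4]
metrics = evaluate(y_true, y_pred, y_score)
```

In this toy case there are 3 true positives, 1 false positive, 1 false negative and 5 true negatives, so precision, recall and F1 are all 0.75 and specificity is 5/6.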
The Python code generates a CSV file (balancedDataOutcomes.csv) containing
detailed results for all folds, models and datasets.
This file was later converted to Excel and used to compute additional metrics and
create comparison plots using pivot tables.
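The Excel pivot-table step has a close equivalent in pandas, sketched below for readers who prefer to stay in Python. The column names (`model`, `dataset`, `fold`, `f1`) are assumptions and may not match the actual header of balancedDataOutcomes.csv. The rows are made-up placeholder values, not results from this project.

```python
import pandas as pd

# Placeholder rows standing in for the per-fold results file.
rows = [
    {"model": "Model A", "dataset": "train", "fold": 1, "f1": 0.80},
    {"model": "Model A", "dataset": "train", "fold": 2, "f1": 0.90},
    {"model": "Model A", "dataset": "test", "fold": 1, "f1": 0.50},
    {"model": "Model A", "dataset": "test", "fold": 2, "f1": 0.60},
    {"model": "Model B", "dataset": "train", "fold": 1, "f1": 0.70},
    {"model": "Model B", "dataset": "train", "fold": 2, "f1": 0.80},
    {"model": "Model B", "dataset": "test", "fold": 1, "f1": 0.60},
    {"model": "Model B", "dataset": "test", "fold": 2, "f1": 0.70},
]
results = pd.DataFrame(rows)

# Average F1 over folds, one row per model and one column per dataset.
pivot = results.pivot_table(index="model", columns="dataset",
                            values="f1", aggfunc="mean")

# Model with the highest mean F1 on the test folds.
best_model = pivot["test"].idxmax()
```

With real data, `pd.read_csv` on the results file would replace the hand-written rows, after adjusting the column names to the file's header.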
The final report summarizes the results and answers:
- which model performs best overall, and
- whether the required performance constraints are satisfied.