Binary classification of NASA Near-Earth Objects (NEOs) as hazardous or non-hazardous, with probability calibration, threshold tuning, and SHAP-based explainability.
| Document | Link |
|---|---|
| Wiki (EN) | Home_EN.md |
| Wiki (ZH) | Home_ZH.md |
| Paper (EN) | Paper_EN.md |
| Paper (ZH) | Paper_ZH.md |
| Proposal (EN) | Proposal_Report_EN.md |
| Proposal (ZH) | Proposal_Report_ZH.md |
| Dataset (Kaggle) | NASA Nearest Earth Objects |

| Tool | Version | Install |
|---|---|---|
| Python | ≥ 3.12 | python.org |
| uv | any recent | pip install uv or docs.astral.sh/uv |

git clone <repo-url>
cd Data-Mining-Projects

Download neo.csv from Kaggle and put it at:

data/neo.csv

uv sync

This creates a virtual environment and installs all packages listed in pyproject.toml.
Note (shared / restricted environments): If the default cache locations are not writable, prefix every uv command with:

UV_CACHE_DIR=/tmp/uv-cache MPLCONFIGDIR=/tmp/mpl-cache uv ...

Execute the three stages in order:
# Stage 1 — Exploratory Data Analysis
uv run neo-eda
# Stage 2 — Model training, tuning, and evaluation
uv run neo-train
# Stage 3 — Permutation importance and SHAP explanations
# (requires neo-train to have run first)
uv run neo-explain

Each stage prints a short summary and the paths of the files it wrote.
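
The ordering constraint (neo-explain needs the output of neo-train) can be enforced with a small wrapper. The following is a minimal sketch and not part of the repository: the names `stage_commands` and `run_pipeline` are hypothetical, and only the three `uv run` commands are taken from this README.

```python
import os
import subprocess

# Stage entry points, in the order this README says to run them.
STAGES = ("neo-eda", "neo-train", "neo-explain")


def stage_commands(stages=STAGES):
    """Return the `uv run <stage>` argument lists in execution order."""
    return [["uv", "run", stage] for stage in stages]


def run_pipeline(extra_env=None):
    """Run each stage in order and stop at the first failure.

    `extra_env` is an optional dict of environment overrides
    (hypothetical parameter), merged on top of the current environment.
    """
    env = {**os.environ, **(extra_env or {})}
    for cmd in stage_commands():
        # check=True raises CalledProcessError on a non-zero exit code,
        # so neo-explain never runs if neo-train failed.
        subprocess.run(cmd, check=True, env=env)
```

In a restricted environment, the cache overrides from the note above could be passed as `run_pipeline({"UV_CACHE_DIR": "/tmp/uv-cache", "MPLCONFIGDIR": "/tmp/mpl-cache"})`.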
Optionally, benchmark training and prediction efficiency across all base models (shared features and split for a fair comparison):
uv run neo-benchmark  # writes reports/tables/efficiency_benchmark.csv
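
To illustrate what timing models on a shared split means, here is a hedged sketch. It is not the code behind neo-benchmark: the `benchmark` function, the `MajorityClassifier` stand-in, and the output keys are all assumptions made for illustration, and the data is synthetic.

```python
import time


class MajorityClassifier:
    """Toy stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def benchmark(models, X_train, y_train, X_test):
    """Time fit and predict for each model on the same train/test split."""
    rows = []
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_seconds = time.perf_counter() - start

        start = time.perf_counter()
        model.predict(X_test)
        predict_seconds = time.perf_counter() - start

        rows.append(
            {
                "model": name,
                "fit_seconds": fit_seconds,
                "predict_seconds": predict_seconds,
            }
        )
    return rows


# Tiny synthetic data, for illustration only (not the NEO dataset).
X_train = [[0.1], [0.4], [0.9], [0.2]]
y_train = [0, 0, 1, 0]
X_test = [[0.3], [0.8]]

results = benchmark({"majority": MajorityClassifier()}, X_train, y_train, X_test)
```

Because every model receives the same `X_train`, `y_train`, and `X_test`, differences in the recorded times reflect the models rather than the data they were given.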
├── data/
│ └── neo.csv ← dataset (included in repo)
├── docs/
│ ├── wiki/ ← technical wiki (EN + ZH)
│ ├── paper/ ← final paper (EN + ZH)
│ ├── plan/ ← implementation plan (EN + ZH)
│ └── project-proposal/ ← project proposal (EN + ZH)
├── models/
│ └── final_model.joblib ← saved artifact (generated, not tracked by git)
├── reports/
│ ├── figures/ ← PNG plots
│ └── tables/ ← CSV / JSON outputs
├── src/neo_hazard/
│ ├── config.py ← paths and shared constants
│ ├── data.py ← data loading and validation
│ ├── eda.py ← neo-eda entry point
│ ├── evaluation.py ← metrics and threshold selection
│ ├── explain.py ← neo-explain entry point
│ ├── features.py ← feature engineering
│ ├── plots.py ← figure helpers
│ └── train.py ← neo-train entry point
├── pyproject.toml
└── uv.lock
After running all three stages, the following files are produced:
| File | Description |
|---|---|
| reports/tables/dataset_summary.json | Row/column counts, class balance |
| reports/tables/numeric_summary.csv | Descriptive statistics per feature |
| reports/tables/class_distribution.csv | Class counts and ratios |
| reports/tables/correlation_matrix.csv | Pearson correlation matrix |
| reports/tables/model_metrics_validation.csv | Validation metrics for all models |
| reports/tables/hyperparameter_tuning_results.csv | Best CV score per tuned model |
| reports/tables/threshold_tuning_validation_calibrated.csv | F1/recall/precision at each threshold |
| reports/tables/final_test_metrics.csv | Test-set metrics for the chosen model |
| reports/tables/final_test_predictions.csv | Per-row predictions on the test set |
| reports/tables/permutation_importance.csv | Feature importance by PR-AUC drop |
| reports/tables/shap_local_case_contributions.csv | Top SHAP features for selected cases |
| reports/figures/target_distribution.png | Class imbalance bar chart |
| reports/figures/numeric_distributions.png | Histograms of raw features |
| reports/figures/correlation_heatmap.png | Pearson heatmap |
| reports/figures/final_precision_recall_curve.png | PR curve on test set |
| reports/figures/final_roc_curve.png | ROC curve on test set |
| reports/figures/final_calibration_curve.png | Probability calibration curve |
| reports/figures/permutation_importance.png | Bar chart of permutation importance |
| reports/figures/shap_global_bar.png | SHAP global feature importance |
| reports/figures/shap_summary_beeswarm.png | SHAP beeswarm summary plot |
| models/final_model.joblib | Serialised model artifact |
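
The threshold tuning table lists F1, recall, and precision at each candidate threshold. The sketch below shows how such a table can be computed from labels and predicted probabilities, and how one threshold might then be picked. It is an illustration under assumptions: the function names, the column names, the candidate thresholds, and the choice of maximum F1 as the selection rule are hypothetical and may differ from what evaluation.py does. The data is synthetic.

```python
def metrics_at_threshold(y_true, y_prob, threshold):
    """Precision, recall, and F1 when predicting positive at prob >= threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for t, p in pairs if p >= threshold and t == 1)
    fp = sum(1 for t, p in pairs if p >= threshold and t == 0)
    fn = sum(1 for t, p in pairs if p < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"threshold": threshold, "precision": precision, "recall": recall, "f1": f1}


def threshold_table(y_true, y_prob, thresholds):
    """One row of metrics per candidate threshold."""
    return [metrics_at_threshold(y_true, y_prob, t) for t in thresholds]


def best_by_f1(table):
    """Pick the row with the highest F1 (assumed selection rule)."""
    return max(table, key=lambda row: row["f1"])


# Synthetic labels (1 = hazardous) and probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

table = threshold_table(y_true, y_prob, [0.25, 0.50, 0.75])
best = best_by_f1(table)
```

On this toy data the lowest candidate threshold has the highest F1 because it recovers every positive case at the cost of two false alarms. For a hazard classifier, a rule that weights recall more heavily than F1 does could be a reasonable alternative.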