nourawadallah/Kepler-KOI-Classification

Kepler Exoplanet Classification — KOI Features

A machine learning pipeline to classify exoplanet candidates from NASA's Kepler Space Telescope into three categories: FALSE POSITIVE, CANDIDATE, and CONFIRMED, and rank unresolved candidates by their likelihood of being real planets.


Overview

NASA's Kepler Space Telescope identified thousands of potential planets by monitoring the brightness of stars in a fixed field of view and looking for the periodic dips caused by transits. The resulting catalogue records physical properties such as orbital period, planetary radius, and stellar effective temperature. However, a significant portion of these signals originates from eclipsing binary stars, instrument artefacts, or other astrophysical phenomena rather than planets. Distinguishing real planets from false positives is a critical step in exoplanet research.

Kepler flagged over 9,000 objects of interest during its mission. Each one had to be reviewed and labelled:

  • Confirmed — a verified exoplanet
  • False Positive — a mimicking signal, usually an eclipsing binary star or instrument artefact
  • Candidate — not yet reviewed and assigned a definitive label

Manual review is expensive and slow. This project automates the triage using machine learning, and for the ~400 unresolved candidates in the dataset, ranks them by how likely they are to be real planets, helping prioritise which ones deserve follow-up attention first.


Dataset

Kepler Exoplanet Dataset on Kaggle — originally from the NASA Exoplanet Archive

9,564 observations with 9 orbital and stellar features.

| Label | Class | Description |
|---|---|---|
| 0 | FALSE POSITIVE | Not a real exoplanet, likely an eclipsing binary or instrument artefact |
| 1 | CANDIDATE | Not yet reviewed and assigned a definitive label |
| 2 | CONFIRMED | Verified exoplanet |
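
The label encoding in the table can be reproduced with a plain dictionary mapping. This is a minimal sketch rather than the notebook's code: it assumes the disposition lives in a `koi_disposition` column, and `encode_labels` is a hypothetical helper name used only for illustration.

```python
import pandas as pd

# Mapping taken from the label table in this README.
LABEL_MAP = {"FALSE POSITIVE": 0, "CANDIDATE": 1, "CONFIRMED": 2}


def encode_labels(df: pd.DataFrame, column: str = "koi_disposition") -> pd.Series:
    """Return integer class labels; raise if an unexpected disposition appears."""
    labels = df[column].map(LABEL_MAP)
    if labels.isna().any():
        unknown = sorted(df.loc[labels.isna(), column].astype(str).unique())
        raise ValueError(f"Unmapped dispositions: {unknown}")
    return labels.astype(int)


# Tiny illustrative frame (made-up rows, not catalogue data).
demo = pd.DataFrame(
    {"koi_disposition": ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE"]}
)
encoded = encode_labels(demo)
```

Failing loudly on unmapped values is a design choice here: a silent NaN label would otherwise surface much later as a confusing training error.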

What I Did

The dataset had 9 orbital and stellar features, so I started by engineering 5 new ones to capture relationships the raw features couldn't express on their own:

  • koi_multiplicity — counts how many candidates share the same host star. Multi-planet systems carry a strong prior toward authenticity; the probability of multiple independent false signals occurring around the same star by chance is extremely low.
  • score_sq — KOI score squared, to amplify the difference between high and low confidence signals. This turned out to be the single most important feature in the model according to SHAP analysis.
  • radius_ratio — planet radius divided by stellar radius (koi_prad / koi_srad). A planet's absolute size means little without context. This ratio captures the relationship between the two and helps identify eclipsing binaries, which tend to have extreme values.
  • score_period — confidence score multiplied by orbital period, combining two independent signals: a high-confidence, long-period object is a strong planet candidate; a low-confidence, short-period object is more likely a false positive.
  • temp_ratio — equilibrium temperature divided by stellar effective temperature. Unusual ratios can indicate a misidentified signal rather than a real planet.
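
The five engineered features could be computed roughly as follows. This is a sketch under assumptions, not the notebook's exact code: it assumes the host star is identified by a `kepid` column and the equilibrium temperature by `koi_teq`, and `add_engineered_features` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the five engineered columns appended."""
    out = df.copy()
    # Number of KOIs that share the same host star (assumed key: kepid).
    out["koi_multiplicity"] = out.groupby("kepid")["kepid"].transform("count")
    # Squared confidence score.
    out["score_sq"] = out["koi_score"] ** 2
    # Planet radius over stellar radius. In the archive the two columns use
    # different units (Earth radii vs Solar radii), so this is a scaled
    # ratio rather than a dimensionless one. Zero denominators become NaN.
    out["radius_ratio"] = out["koi_prad"] / out["koi_srad"].replace(0, np.nan)
    # Confidence score multiplied by orbital period.
    out["score_period"] = out["koi_score"] * out["koi_period"]
    # Equilibrium temperature over stellar effective temperature.
    out["temp_ratio"] = out["koi_teq"] / out["koi_steff"].replace(0, np.nan)
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "kepid": [1, 1, 2],
        "koi_score": [0.9, 0.3, 1.0],
        "koi_prad": [2.0, 1.0, 30.0],
        "koi_srad": [1.0, 0.5, 1.5],
        "koi_period": [10.0, 2.0, 0.5],
        "koi_teq": [600.0, 900.0, 1500.0],
        "koi_steff": [6000.0, 6000.0, 5000.0],
    }
)
features = add_engineered_features(demo)
```

In this toy frame the first two rows share a host star, so both get a multiplicity of 2, while the third row's large radius ratio is the kind of extreme value associated with eclipsing binaries.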

From there, I handled missing values with iterative imputation, log-transformed right-skewed features, and scaled with RobustScaler (more stable than StandardScaler with outliers). All preprocessing steps are wrapped in a scikit-learn Pipeline to prevent data leakage. SMOTE oversampling was applied post-scaling to address class imbalance.
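
A minimal sketch of the imputation, log-transform, and scaling chain is shown below. It simplifies in two ways that are assumptions on my part: the log transform is applied to every column rather than only the right-skewed ones, and imputed values are clipped at zero (`min_value=0.0`) so that `log1p` stays finite. The data is synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler

# Impute -> log1p -> robust scaling, fitted on training data only.
preprocess = Pipeline(
    steps=[
        ("impute", IterativeImputer(max_iter=10, min_value=0.0, random_state=0)),
        ("log", FunctionTransformer(np.log1p)),
        ("scale", RobustScaler()),
    ]
)

# Synthetic right-skewed features with roughly 10% missing values.
rng = np.random.default_rng(0)
X_train = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 4))
X_train[rng.random(X_train.shape) < 0.1] = np.nan
X_test = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 4))

# Statistics (imputer model, medians, IQRs) come from the training split;
# the test split is only transformed, which is what prevents leakage.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```

SMOTE is left out of this sketch. With imbalanced-learn, one option is to append it as the last resampling step of an `imblearn.pipeline.Pipeline`, which applies samplers only while fitting, so validation folds are never oversampled.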

I then tuned XGBoost with RandomizedSearchCV over 40 iterations using stratified 5-fold cross-validation, and combined it with a Random Forest in a stacking ensemble with a Logistic Regression meta-learner.
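
The tuning and stacking setup can be sketched with scikit-learn alone. To keep the example dependency-free, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here; the search ranges, the `f1_macro` scoring choice, the reduced `n_iter`, and the synthetic data are all assumptions for illustration, not values from the notebook.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    train_test_split,
)

# Synthetic three-class problem with 9 features (stand-in for the KOI table).
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Randomised search over a small, hypothetical hyperparameter space.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(30, 80),
        "max_depth": randint(2, 5),
        "learning_rate": uniform(0.03, 0.27),  # samples from [0.03, 0.30]
        "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0]
    },
    n_iter=4,  # kept small for speed; the README reports 40 iterations
    scoring="f1_macro",
    cv=cv,
    random_state=0,
)
search.fit(X_train, y_train)

# Stack the tuned booster with a random forest; logistic regression
# combines their out-of-fold class probabilities.
stack = StackingClassifier(
    estimators=[
        ("boost", search.best_estimator_),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=cv,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
```

`StackingClassifier` clones and refits the base estimators internally, so the meta-learner is trained on cross-validated predictions rather than on predictions the base models made for their own training rows.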


Results

Multiclass Classification (FALSE POSITIVE / CANDIDATE / CONFIRMED)

| Model | Accuracy | Macro F1 |
|---|---|---|
| Decision Tree | 76% | 0.71 |
| XGBoost (tuned) | 80% | 0.76 |
| Stacking Ensemble | 80% | 0.76 |

The trickiest class was CANDIDATE. That is not a model failure but a property of the labels: these are objects scientists have not confirmed yet, so no algorithm can reliably classify something that astronomers themselves are still unsure about.
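
Macro F1 weights every class equally, which is why one hard class can pull it below accuracy. The toy example below uses made-up labels (not project results) in which only class 1, standing in for CANDIDATE, is frequently misclassified.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 0 = FALSE POSITIVE, 1 = CANDIDATE, 2 = CONFIRMED.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 2, 1, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
per_class_f1 = f1_score(y_true, y_pred, average=None)  # one value per class
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean
```

Here eight of ten predictions are correct, but two of the three class-1 objects are missed, so the class-1 F1 is low and drags the macro average below the accuracy.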

Binary Classification (FALSE POSITIVE vs CONFIRMED)

To test this, I ran a separate binary classifier on just FALSE POSITIVE vs CONFIRMED, dropping CANDIDATEs entirely. The results support the view that the multiclass ambiguity comes from the labels rather than the model.

| Metric | Score |
|---|---|
| Accuracy | 97% |
| ROC-AUC | 0.995 |
| Average Precision | 0.99 |
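
The binary setup (drop CANDIDATE rows, treat CONFIRMED as the positive class, score with ROC-AUC and average precision) can be sketched as follows. A random forest on synthetic data stands in for the project's XGBoost model, so the numbers it produces are unrelated to the scores reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data using the README's label encoding (0, 1, 2).
X, y = make_classification(
    n_samples=600,
    n_features=9,
    n_informative=5,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Drop CANDIDATE (label 1) and make CONFIRMED (label 2) the positive class.
mask = y != 1
X_bin = X[mask]
y_bin = (y[mask] == 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_bin, y_bin, stratify=y_bin, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Both metrics are computed from probabilities, not hard class predictions.
p_confirmed = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, p_confirmed)
avg_precision = average_precision_score(y_test, p_confirmed)
```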

Candidate Priority Ranking

Rather than forcing a classification on unresolved candidates, the pipeline assigns each one a confirmation probability and ranks them for follow-up prioritisation.

| Priority Tier | Threshold | Count |
|---|---|---|
| High priority | P(confirmed) > 0.80 | 42 |
| Likely false positive | P(confirmed) < 0.20 | 281 |
| Genuinely ambiguous | 0.40 < P(confirmed) < 0.60 | 13 |
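
The tiering rule can be expressed directly from the thresholds in the table. In this sketch `assign_priority_tiers` is a hypothetical helper, the probabilities are made up, and "Unassigned" is my own placeholder for values that fall in the gaps between the listed tiers (0.20 to 0.40 and 0.60 to 0.80).

```python
import numpy as np
import pandas as pd


def assign_priority_tiers(p_confirmed: pd.Series) -> pd.DataFrame:
    """Label each candidate with a tier and sort by P(confirmed), highest first."""
    p = p_confirmed.to_numpy()
    conditions = [p > 0.80, p < 0.20, (p > 0.40) & (p < 0.60)]
    labels = ["High priority", "Likely false positive", "Genuinely ambiguous"]
    tier = np.select(conditions, labels, default="Unassigned")
    ranked = pd.DataFrame(
        {"p_confirmed": p, "tier": tier}, index=p_confirmed.index
    )
    return ranked.sort_values("p_confirmed", ascending=False)


# Made-up probabilities for five hypothetical candidates.
probs = pd.Series([0.95, 0.05, 0.50, 0.30, 0.85], index=["K1", "K2", "K3", "K4", "K5"])
ranking = assign_priority_tiers(probs)
```

Sorting by probability gives the follow-up order, while the tier column gives a coarse label that is easier to scan than raw scores.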

Astrophysical Interpretation

SHAP analysis identifies the features driving model predictions. The results are consistent with established astrophysical reasoning.
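
For reference, SHAP attributions are based on the Shapley value. With $F$ the set of all features and $f_x(S)$ the model's expected output for instance $x$ when only the features in $S$ are known, the attribution for feature $i$ is given below. The second line is a common way to turn per-instance attributions into a global ranking (mean absolute SHAP value over $N$ instances); whether the rankings in this README were produced exactly that way is an assumption.

```latex
\phi_i(x) = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \Bigl[ f_x\bigl(S \cup \{i\}\bigr) - f_x(S) \Bigr]

I_i = \frac{1}{N} \sum_{n=1}^{N} \bigl| \phi_i\bigl(x^{(n)}\bigr) \bigr|
```

The weights in the first sum average the marginal contribution of feature $i$ over every order in which features could be added, which is what makes the attributions for one prediction sum to the difference between that prediction and the baseline output.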

Multiclass (FALSE POSITIVE / CANDIDATE / CONFIRMED)

  • score_sq — the most important feature overall. Squaring the confidence score amplifies the gap between high and low confidence objects, making the separation between confirmed planets and false positives more distinct.
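  • koi_score — the disposition score published with the KOI catalogue, the second strongest signal. It is a vetting confidence value rather than a raw instrument measurement, and because it is an input feature, the model's reliance on it shows the model building on the catalogue's own judgment rather than reaching it independently.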
  • koi_prad — planetary radius. Real planets fall within a physically constrained size range; objects outside this range are statistically more likely to be false positives.
  • radius_ratio — more discriminative than planetary radius alone. Extreme ratios are characteristic of eclipsing binary stars mimicking planetary transits.
  • koi_kepmag — the brightness of the star as observed by Kepler. Fainter stars produce noisier light curves, which increases the chance of a misidentified signal.
  • koi_multiplicity — stars with multiple candidates are far more likely to host real planets. The model picked this up without being told; it learned the multiplicity boost from the data.
  • score_period — combines confidence and orbital period into one signal. High confidence at long periods strongly favours a real planet.
  • koi_steff — stellar effective temperature. Hotter stars are more active, which can produce false signals that mimic planetary transits.
  • koi_period — short orbital periods are disproportionately associated with false positives, as grazing eclipsing binaries preferentially cluster at short periods.

Binary (FALSE POSITIVE vs CONFIRMED)

  • score_sq — remains the dominant feature, with an even larger margin than in the multiclass task. With candidates removed, the model leans even harder on the confidence signal.
  • koi_score — second strongest, consistent with the multiclass result.
  • koi_prad — planetary radius plays a larger role here than in multiclass, likely because confirmed planets cluster more tightly in size than the full three-class distribution.
  • koi_multiplicity — jumps in relative importance in the binary task. Without the ambiguous candidate class diluting the signal, the multiplicity boost becomes a cleaner discriminator.
  • radius_ratio — consistent with multiclass. Eclipsing binaries have characteristic radius ratios that the model reliably identifies.
  • score_period — slightly more important in the binary task, reinforcing that the combination of confidence and orbital period is a strong signal for confirmed planets specifically.
  • koi_steff — stellar effective temperature remains relevant, consistent with its role in the multiclass task.
  • koi_kepmag — star brightness continues to contribute, though with less weight than in multiclass.
  • koi_period — short periods remain associated with false positives in the binary task as well.
  • koi_insol — insolation flux appears in the binary top 10, reflecting that the amount of stellar energy a planet receives helps distinguish real orbital configurations from mimicking signals.

Files

/kaggle  
    exoplanets_data.csv  
/models  
    preprocessing_pipeline.pkl  
    stacking_ensemble.pkl  
    xgb_binary.pkl  
    xgb_multiclass.pkl  
README.md  
kepler_koi_classification.ipynb  
requirements.txt  

How to Run

pip install -r requirements.txt
jupyter notebook kepler_koi_classification.ipynb

Tools & Libraries

Python · XGBoost · scikit-learn · imbalanced-learn · SHAP · pandas · NumPy · Matplotlib · Seaborn
