utkarshraj11/when-labels-matter

when-labels-matter

Predicting fatal grade-crossing incidents from FRA Form 57 data, comparing classical and modern ML methods under two framings: with labels (classification) and without (anomaly detection).

The setup

The FRA Form 57 dataset records every highway-rail grade crossing incident in the US. After cleaning we get around 32,000 incidents, of which about 10.8% are fatal. The question: can we predict whether an incident will be fatal?

I tried two framings, two methods each:

Framing                         Classical          Modern
Classification (with labels)    Random Forest      TabPFN v2
Anomaly detection (no labels)   Isolation Forest   ECOD

Takeaways

Classification works really well; anomaly detection barely works at all.

Two findings worth calling out:

TabPFN matches the tuned Random Forest with zero tuning. RF took a 30-minute Bayesian hyperparameter search. TabPFN is just one cloud API call. Same numbers.

Bayesian tuning of Isolation Forest actually hurt. F1 dropped from 0.167 to 0.103 after tuning. When the data has no real anomaly structure, the search just chases fold-level noise.
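
To put those F1 values in context, it helps to work out roughly what a detector with no signal at all would score. The sketch below assumes a detector that flags a fixed fraction of incidents uniformly at random and uses the large-sample approximation that precision equals the fatality rate and recall equals the flag rate. The function name `chance_f1` and the two flag rates are illustrative and not taken from the notebook; only the 10.8% fatality rate comes from the setup above.

```python
def chance_f1(prevalence: float, flag_rate: float) -> float:
    """Approximate F1 of a detector that flags `flag_rate` of rows at random.

    With random flags, precision is about the positive-class prevalence and
    recall is about the flag rate, so F1 is their harmonic mean.
    """
    precision = prevalence
    recall = flag_rate
    return 2 * precision * recall / (precision + recall)


FATAL_RATE = 0.108  # share of fatal incidents stated in this README

# Assumption: the detector flags as many rows as there are fatal incidents.
baseline = chance_f1(FATAL_RATE, FATAL_RATE)
print(f"chance-level F1 at a matched flag rate: {baseline:.3f}")

# Assumption: the detector flags every row (recall 1.0, precision 0.108).
flag_all = chance_f1(FATAL_RATE, 1.0)
print(f"F1 when every row is flagged: {flag_all:.3f}")
```

Under these assumptions the two reference points come out at about 0.108 and 0.195, so both reported Isolation Forest scores (0.167 and 0.103) fall at or below what a signal-free detector could reach.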

The takeaway: fatal grade-crossing incidents are not statistical outliers in feature space. They are common feature patterns with rare outcomes. Unsupervised methods cannot recover the fatality signal from feature distributions alone.
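
A small synthetic experiment can illustrate this point. The sketch below does not use the FRA data: it assumes five standard-normal features and defines the positive class as one ordinary octant of the first three, so positives are common feature patterns and no more extreme than negatives. The sample size, model settings, and variable names are all illustrative choices, not the notebook's.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for incident features: five standard-normal columns.
X = rng.normal(size=(4000, 5))

# Positives are one octant of the first three features (about 12.5% of rows),
# so by symmetry they are no more isolated than any other rows.
y = ((X[:, 0] > 0) & (X[:, 1] > 0) & (X[:, 2] > 0)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Supervised: the forest sees the labels and can learn the octant boundary.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
rf_f1 = f1_score(y_test, rf.predict(X_test))

# Unsupervised: Isolation Forest sees only the features and flags the most
# isolated rows, at a rate matched to the positive share in the training set.
iso = IsolationForest(contamination=float(y_train.mean()), random_state=0)
iso.fit(X_train)
iso_flags = (iso.predict(X_test) == -1).astype(int)  # -1 marks outliers
iso_f1 = f1_score(y_test, iso_flags)

print(f"Random Forest F1:    {rf_f1:.3f}")
print(f"Isolation Forest F1: {iso_f1:.3f}")
```

In this toy setup the labels are fully determined by the features, yet the unsupervised detector has nothing to latch onto because the positive region is as dense as the rest of the space. That is the same qualitative gap described above, reproduced under much simpler assumptions.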

Run it

pip install pandas numpy scikit-learn scikit-optimize imbalanced-learn \
            pyod plotly joblib jupyter tabpfn-client

Cached .pkl files make reruns fast (~2 min). Delete them for a full fresh run (~35 min for the Bayesian search).
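
The load-or-compute pattern behind this kind of caching can be sketched as follows. This is an assumption about the general approach, not the notebook's actual code: `load_or_compute`, `slow_search`, and the file name are hypothetical, and the standard-library `pickle` is used here to keep the example self-contained (`joblib.dump` and `joblib.load` could be used the same way).

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached result if the file exists, else compute and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how many times the expensive step actually runs


def slow_search():
    # Hypothetical stand-in for an expensive step such as the Bayesian search.
    calls.append(1)
    return {"status": "done"}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "demo_results.pkl"  # hypothetical file name
    first = load_or_compute(cache_file, slow_search)   # computes, then writes
    second = load_or_compute(cache_file, slow_search)  # reads from disk
    print(f"expensive step ran {len(calls)} time(s)")
```

Deleting the cache file forces the next call to recompute, which matches the "delete them for a full fresh run" behaviour described above.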

TabPFN needs a free account on the TabPFN cloud. You'll be prompted on first use.

The raw CSV is not in the repo. Download it from FRA Safety Data.

Files

data_challenge.ipynb   the analysis notebook
report.pdf             4-page write-up
report.tex             LaTeX source
*.pkl                  cached CV results

References

  • Breiman (2001). Random forests. Machine Learning.
  • Hollmann et al. (2025). Accurate predictions on small data with a tabular foundation model. Nature.
  • Liu, Ting & Zhou (2008). Isolation forest. ICDM.
  • Li et al. (2023). ECOD: Unsupervised outlier detection using empirical cumulative distribution functions. IEEE TKDE.
  • Han et al. (2022). ADBench: Anomaly detection benchmark. NeurIPS Datasets and Benchmarks.
  • Zhao, Nasrullah & Li (2019). PyOD: A Python toolbox for scalable outlier detection. JMLR.

Note

This started as a course project for CS235 (Data Mining Techniques) at UC Riverside.
