when-labels-matter

Predicting fatal grade-crossing incidents from FRA Form 57 data with classical and modern ML methods, framed two ways: with labels (classification) and without (anomaly detection).

The setup

The FRA Form 57 dataset records every highway-rail grade crossing incident in the US. After cleaning we get around 32,000 incidents, of which about 10.8% are fatal. The question: can we predict whether an incident will be fatal?

I tried two framings, two methods each:

                                Classical          Modern
Classification (with labels)    Random Forest      TabPFN v2
Anomaly Detection (no labels)   Isolation Forest   ECOD

Takeaways

Classification works really well; anomaly detection barely works at all.

Two findings worth calling out:

TabPFN matches the tuned Random Forest with zero tuning. RF took a 30-minute Bayesian hyperparameter search. TabPFN is just one cloud API call. Same numbers.

Bayesian tuning of Isolation Forest actually hurt. F1 dropped from 0.167 to 0.103 after tuning. When the data has no real anomaly structure, the search just chases fold-level noise.
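For a sense of scale on those F1 values, here is a rough back-of-the-envelope sketch of what a detector scores when its flags carry no information about the label. It assumes flags are statistically independent of fatality and uses expected precision and recall (an approximation, not an exact expected F1). The helper name `chance_f1` is made up for this illustration and is not part of the notebook.

```python
def chance_f1(prevalence: float, flag_rate: float) -> float:
    """Approximate F1 of a detector whose flags are independent of the labels."""
    precision = prevalence  # flagged incidents are fatal at the base rate
    recall = flag_rate      # each fatal incident is flagged with probability flag_rate
    return 2 * precision * recall / (precision + recall)


p = 0.108  # fatal share reported in "The setup"

print(round(chance_f1(p, p), 3))    # flag 10.8% of incidents at random -> 0.108
print(round(chance_f1(p, 1.0), 3))  # flag every incident -> 0.195
```

Under that simplifying assumption, both Isolation Forest scores (0.167 before tuning, 0.103 after) sit inside the range that label-blind flagging can reach on its own.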

The takeaway: fatal grade-crossing incidents are not statistical outliers in feature space. They are common feature patterns with rare outcomes. Unsupervised methods cannot recover the fatality signal from feature distributions alone.
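The same effect can be reproduced on toy data. The sketch below is an illustration on synthetic features, not the Form 57 data and not the notebook's code: the rare outcome is tied to an ordinary region in the middle of the feature distribution, so a supervised model can learn it while an outlier detector has nothing to latch onto. The box boundaries, probabilities, and sample size are arbitrary assumptions chosen to give a positive rate near 11%.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in (NOT Form 57 data): five standard-normal features.
n = 6000
X = rng.normal(size=(n, 5))

# The rare outcome lives in a common region of feature space, not in the tails:
# inside the box it occurs with probability 0.8, elsewhere with probability 0.02.
in_box = (X[:, 0] > 0.3) & (X[:, 0] < 1.0) & (X[:, 1] > 0.0)
y = (rng.random(n) < np.where(in_box, 0.8, 0.02)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Supervised: the labels are used during fitting.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
rf_f1 = f1_score(y_test, rf.predict(X_test))

# Unsupervised: labels are only used to score the flags afterwards.
iso = IsolationForest(contamination=float(y_train.mean()), random_state=0)
iso.fit(X_train)
iso_flags = (iso.predict(X_test) == -1).astype(int)  # -1 marks an outlier
iso_f1 = f1_score(y_test, iso_flags)

print(f"positive rate:       {y.mean():.3f}")
print(f"Random Forest F1:    {rf_f1:.3f}")
print(f"Isolation Forest F1: {iso_f1:.3f}")
```

On this toy setup the classifier should score far above the detector, because the detector ranks points by how isolated they are and the positives are not isolated.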

Run it

pip install pandas numpy scikit-learn scikit-optimize imbalanced-learn \
            pyod plotly joblib jupyter tabpfn-client

Cached .pkl files make reruns fast (~2 min). Delete them for a full fresh run (~35 min for the Bayesian search).
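The caching pattern behind that behavior can be sketched roughly as follows. This is an assumption about how such a cache typically works, not a copy of the notebook's code: the helper `cached`, the file name `demo_cv.pkl`, and the stand-in `slow_search` function are all hypothetical.

```python
import tempfile
from pathlib import Path

import joblib


def cached(path, compute):
    """Load a pickled result if it exists; otherwise compute it and save it."""
    path = Path(path)
    if path.exists():
        return joblib.load(path)
    result = compute()
    joblib.dump(result, path)
    return result


calls = []  # records how many times the expensive step actually ran


def slow_search():
    # Stand-in for an expensive step such as a hyperparameter search.
    calls.append(1)
    return {"n_estimators": 300, "max_depth": 12}  # made-up values


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "demo_cv.pkl"  # hypothetical file name
    first = cached(cache_file, slow_search)   # computes and writes the file
    second = cached(cache_file, slow_search)  # loads from disk, no recompute

print(len(calls))
```

Deleting the file forces the next call to recompute, which is why removing the .pkl files triggers a full run.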

TabPFN needs a free account on the TabPFN cloud. You'll be prompted on first use.

The raw CSV is not in the repo. Download it from the FRA Safety Data website.

Files

data_challenge.ipynb   the analysis notebook
report.pdf             4-page write-up
report.tex             LaTeX source
*.pkl                  cached CV results

References

Breiman (2001). Random forests. Machine Learning.
Hollmann et al. (2025). Accurate predictions on small data with a tabular foundation model. Nature.
Liu, Ting & Zhou (2008). Isolation forest. ICDM.
Li et al. (2023). ECOD: Unsupervised outlier detection using empirical cumulative distribution functions. IEEE TKDE.
Han et al. (2022). ADBench: Anomaly detection benchmark. NeurIPS Datasets and Benchmarks.
Zhao, Nasrullah & Li (2019). PyOD: A Python toolbox for scalable outlier detection. JMLR.

Note

This started as a course project for CS235 (Data Mining Techniques) at UC Riverside.
