Kaggle Experiments

Tracking every trial, every failure, every lesson across Kaggle competitions.

Folder	Competition	Best Public	Status
`churn/`	Playground S6E3 — Customer Churn	0.91704 (private 0.91815)	84 trials, 15 subs, ended
`irrigation/`	Playground S6E4 — Irrigation Need	0.97833	15 trials, 14 subs, in progress
`birdclef/`	BirdCLEF+ 2026 — Bird Species	0.938	87 trials, 80 subs, in progress (ends 2026-06-03)
`ts-forecasting/`	Hedge Fund — Time Series	0.1499	4 subs, 3 scored zero
`march-mania/`	March Mania 2026 — NCAA Basketball	not submitted	missed deadline

churn (Playground S6E3)

TL;DR: Predict telecom customer churn. AUC-ROC metric. 4,142 teams. Key challenge: Top scores packed in 0.914–0.917. Broke through the GBDT ceiling by forking a RealMLP notebook on Kaggle. Final: Best public 0.91704, private 0.91815. 15 submissions, 84 trials.

Experiment flow

Trial	Why	Result	Next
001 LightGBM	Baseline	val 0.9161, public 0.9138	Target encoding + tuning
002–004 FE+tuning	ChargeGap, Optuna 50 rounds	val 0.9166 → public 0.9139	Ensemble
005–008 ensemble	LGBM + XGB + CatBoost blends	val 0.9167 → public 0.9140	Weight optimization
009–013 failures	External data, 10-fold, feature merging	All degraded	Drop external data
014–017 optimal blend	5-model OOF grid search	val 0.9168 → public 0.91404 (GBDT ceiling)	Multi-seed, NN
018–024 all-in	83 groupby vars, multi-seed (7×5=35 models)	val 0.9169 → public stuck at 0.9140	Need NN
025–059 big search	MLP, RealMLP, Ridge, DART, pseudo-labeling... 35 trials	All failed to break through locally	Fork Kaggle notebook
060 RealMLP fork	RealMLP 20-fold on Kaggle notebook	val 0.9194 → public 0.91683 — +0.003 breakthrough	Blend with XGB
061 RealMLP+XGB	RealMLP×0.85 + XGB×0.15	public 0.91686	Ridge ensemble
074 TE std enriched	TE mean+std + 120 combos + digit features	val 0.91762 — local GBDT best	Ridge input
075–084 Ridge ensemble	55 OOFs into Ridge(alpha=100). No filtering	val 0.9196 → public 0.91704 → private 0.91815	Competition ended
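
A minimal sketch of the Ridge stacking step in 075–084, run on synthetic data. It assumes each OOF prediction is one column and uses a closed-form ridge in NumPy as a stand-in for `Ridge(alpha=100)`. The names `fit_ridge_stack` and `predict_stack` and the noise levels are illustrative, not code from this repo.

```python
import numpy as np

def fit_ridge_stack(oof, y, alpha=100.0):
    """Closed-form ridge on centered columns; returns (weights, intercept)."""
    oof = np.asarray(oof, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = oof.mean(axis=0), y.mean()
    xc, yc = oof - x_mean, y - y_mean
    # Solve (X^T X + alpha * I) w = X^T y on the centered data.
    gram = xc.T @ xc + alpha * np.eye(oof.shape[1])
    weights = np.linalg.solve(gram, xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return weights, intercept

def predict_stack(preds, weights, intercept):
    """Apply the fitted meta-learner to a matrix of base-model predictions."""
    return np.asarray(preds, dtype=float) @ weights + intercept

# Synthetic stand-in: four base models, from strong to weak (noise std grows).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
oof = np.column_stack(
    [y + rng.normal(0.0, s, size=2000) for s in (0.8, 1.0, 1.5, 3.0)]
)

weights, intercept = fit_ridge_stack(oof, y, alpha=100.0)
stacked = predict_stack(oof, weights, intercept)
print(weights.round(3))  # the weak column gets a small weight instead of being dropped
```

The weakest column is down-weighted rather than removed, which is one way to read the "no filtering" result: the regularized meta-learner does the filtering softly.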

Submissions

sub	trial	public	what happened
01	001 baseline	0.91377	Raw features only
03	014 ensemble	0.91404	5-model blend. GBDT ceiling
07	060 RealMLP	0.91683	Kaggle notebook fork. NN broke GBDT wall
10	078 RealMLP+TE blend	0.91690	RealMLP×0.80 + XGB_TE_std×0.20 rank blend
12	080 Ridge 51 OOFs	0.91702	Ridge(alpha=100) 51 OOFs
15	084 Ridge 56 OOFs	0.91704	Ridge(alpha=50) all 56 OOFs. best public
14	083 Ridge 53 OOFs	0.91701	private 0.91815
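
The rank blend in sub 10 can be sketched as below. This assumes "rank blend" means converting each model's predictions to normalized ranks and then taking a weighted average. The toy predictions and the helper names `rank_normalize` and `rank_blend` are hypothetical.

```python
def rank_normalize(scores):
    """Map scores to average ranks scaled into [0, 1]; ties share a rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    denom = max(len(scores) - 1, 1)
    return [r / denom for r in ranks]

def rank_blend(pred_a, pred_b, weight_a=0.80):
    """Weighted average of two models' normalized ranks."""
    ra, rb = rank_normalize(pred_a), rank_normalize(pred_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(ra, rb)]

nn_pred = [0.91, 0.15, 0.40, 0.77]    # stand-in for RealMLP probabilities
gbdt_pred = [0.60, 0.05, 0.70, 0.58]  # stand-in for XGB_TE_std probabilities
blended = rank_blend(nn_pred, gbdt_pred, weight_a=0.80)
print(blended)
```

Ranking first puts both models on the same scale, so the 0.80/0.20 weights are not distorted by one model producing more extreme probabilities than the other.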

irrigation (Playground S6E4)

TL;DR: Classify irrigation need (Low/Medium/High) from soil/weather/crop data. Balanced accuracy metric. Key challenge: Metric was balanced_accuracy, not accuracy. High class is only 3.3%. Slow learning rate (lr=0.01) + sklearn pairwise TE on 171 combinations were the key breakthroughs.

Experiment flow

Trial	Why	Result	Next
001 LightGBM baseline	Baseline	val 0.9844(acc), public 0.9589	Fix metric to bal_acc
002 FE + ensemble	Domain features + 3-model blend	val 0.9853(acc), public 0.9609	bal_acc re-evaluation
003 balanced blend	Fixed metric + class_weight + threshold(High×2.6)	val 0.9711(bal_acc), public 0.9691	Expand pairwise TE
004–006 TE exploration	Various target encoding approaches (28–171 pairs)	val 0.9692–0.9699	Factorize pairs, TE on cats only
007 stacking	Ridge meta-learner + bias tuning	val 0.9707	Switch to sklearn TE
008b fullpair	171 pairwise factorize + cat TE(24) + threshold(High×3.7)	val 0.9738, public 0.9721	Slow LR + full pairwise TE
011 slow XGB	lr=0.01, 4000 rounds hard cap + sklearn TE on 171 pairwise (750 features)	val 0.9794, public 0.97799	Multi-seed
013 multiseed	3-seed XGB + orig append + coord descent bias tuning	val 0.9796, public 0.97833	Pseudolabeling, 5-seed
015 pseudo-label	pseudo-label (conf>0.95, 249K samples) + trial_011 arch	val 0.9796, public 0.97771 — regression	Pseudo-labeling ineffective
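
Why the metric fix in 003 mattered, as a toy sketch. It assumes `threshold(High×2.6)` means multiplying the High-class probability by 2.6 before taking the argmax. The class counts and probabilities below are made up to mimic a roughly 3% High class.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so a rare class counts as much as a common one."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def predict_scaled(probs, scale):
    """Argmax after multiplying each class probability by scale[class]."""
    out = []
    for row in probs:
        scaled = [p * s for p, s in zip(row, scale)]
        out.append(scaled.index(max(scaled)))
    return out

# Classes: 0 = Low, 1 = Medium, 2 = High (rare).
y_true = [0] * 60 + [1] * 37 + [2] * 3
probs = (
    [[0.80, 0.15, 0.05]] * 60    # clear Low
    + [[0.20, 0.70, 0.10]] * 35  # clear Medium
    + [[0.10, 0.60, 0.30]] * 2   # Medium rows that lean toward High
    + [[0.10, 0.55, 0.35]] * 3   # true High rows, under-predicted
)

plain = predict_scaled(probs, (1.0, 1.0, 1.0))
tuned = predict_scaled(probs, (1.0, 1.0, 2.6))
print(accuracy(y_true, plain), balanced_accuracy(y_true, plain))
print(accuracy(y_true, tuned), balanced_accuracy(y_true, tuned))
```

In this toy setup the unscaled model never predicts High, so accuracy looks fine while balanced accuracy loses a full third. Scaling the High probability trades a couple of Medium errors for all of the High recall.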

Submissions

sub	trial	public	what happened
01	001 baseline	0.9589	Raw features + LightGBM
03	003 balanced	0.9691	Metric fix + threshold. +0.008 jump
08b	008b fullpair	0.9721	171 pairwise + multiclass TE + threshold
11	011 slow XGB	0.97799	lr=0.01 + sklearn TE 750 features. +0.006 jump
13	013 multiseed	0.97833	3-seed + bias tuning. best
14	015 pseudo-label	0.97771	Pseudo-labeling regression
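
A hand-written sketch of pairwise keys plus out-of-fold target encoding, to illustrate the idea behind 008b and 011. The trials themselves used sklearn TE, so this is a simplified stand-in and not the repo's code. Note that 171 equals C(19, 2), which suggests 19 base columns, but that count is an inference. The column names, rows, and helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_keys(rows, columns):
    """For every column pair, build one combined categorical key per row."""
    return {
        (a, b): [(row[a], row[b]) for row in rows]
        for a, b in combinations(columns, 2)
    }

def oof_target_encode(keys, target, n_folds=5, smoothing=10.0):
    """Smoothed mean of a 0/1 target per key, computed out-of-fold (n_folds >= 2)."""
    n = len(keys)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:  # statistics come only from the other folds
                sums[keys[i]] += target[i]
                counts[keys[i]] += 1
                total += target[i]
                seen += 1
        prior = total / seen
        for i in range(fold, n, n_folds):  # rows held out in this fold
            k = keys[i]
            denom = counts[k] + smoothing
            encoded[i] = (sums[k] + smoothing * prior) / denom if denom > 0 else prior
    return encoded

rows = [
    {"soil": "clay", "crop": "rice", "season": "dry"},
    {"soil": "clay", "crop": "rice", "season": "wet"},
    {"soil": "sand", "crop": "corn", "season": "dry"},
    {"soil": "clay", "crop": "rice", "season": "dry"},
    {"soil": "sand", "crop": "corn", "season": "wet"},
    {"soil": "clay", "crop": "corn", "season": "dry"},
]
is_high = [1, 0, 0, 1, 0, 1]  # one-vs-rest target for the High class

keys = pairwise_keys(rows, ["soil", "crop", "season"])
features = {pair: oof_target_encode(k, is_high, n_folds=3) for pair, k in keys.items()}
print(len(list(combinations(range(19), 2))))  # number of pairs from 19 columns
```

For a three-class target, the same encoding would be repeated once per class, which is one way the feature count can grow well beyond the number of pairs.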

birdclef (BirdCLEF+ 2026)

TL;DR: Classify 234 bird/frog/insect species from 60s field recordings (5s segments). Macro-averaged ROC-AUC. Ends 2026-06-03. Key challenge: Code Competition — submissions only via Kaggle notebooks, CPU 90min limit. A blend of pretrained components (Perch, ProtoSSM, SED) plateaus hard — the last ~16 trials all land on the same ceiling. Best: Public 0.938 (trial_080, ProtoSSM 50% + SED 40% + Perch 10%). 87 trials, 80 submissions.

Experiment flow

Trial	Why tried	Result	Next
~001–015	Fork a 0.926 public notebook, retrain on competition data, swap to ONNX Perch (faster inference model) to beat the 90min timeout	Climbed 0.912 → 0.928. Timeout was the real enemy, not accuracy	Add a second model to the blend
~016–044	Add EffNet / ConvNeXt / multi-window components to the Perch blend	0.929–0.932. ConvNeXt 5-fold kept getting silently rejected — the hidden re-run refused its dataset mount	Drop unstable components, isolate what the auto-submit accepts
050–061	pseudo-label mix, fold-ensemble the weak components, ConvNeXt axis	0.934 ceiling. 2→4 fold helped (+0.001), but everything stuck at 0.934 for 10+ trials	Try a stronger primary model than the Perch blend
079–080	Apply a reference config (ProtoSSM as main + SED component); push SED weight 18% → 40%	0.935 → 0.938. SED weight was the one real lever (+0.003)	Find where SED saturates
082–087	Sweep all four blend axes around the 0.938 best	All 0.937–0.938 — saturated (see below)	Parameter space exhausted → model retrain

The 0.938 ceiling — four axes all saturated

After trial_080 hit 0.938, 6 trials swept every blend parameter and none broke through:

Axis	Trial	What	Result
weight	082, 083	SED 40%→50%, Perch 10%→12%	0.937 / 0.938 — SED 40% is the peak, ±2pp does nothing
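scale	084	z-score normalize SED logits to match Proto	0.938, no effect. ROC-AUC is rank-based, and a global z-score is an affine rescale of one component, so it only shifts that component's effective blend weight, which the weight sweep (082, 083) had already shown to be saturated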
component	085, 086	re-add EffNet 5% / 10%	0.938 / 0.937 — a different model family still can't shift the row-ranks; bigger weight just drowns the strong ProtoSSM
inference	087	ProtoSSM TTA 5→7 time shifts	0.938 — time-shift ensemble already saturated at 5

Lesson: once this blend plateaued, parameter tuning (weights, scaling, extra components, TTA) did not move the rank-based metric. The remaining lever is retraining the base models. Diminishing returns were apparent by trial ~084, and sweeping each axis once confirmed them.
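
A toy illustration of the rank-based behavior behind the scale axis. The scores are made up, and `roc_auc` is a plain pairwise implementation, not the competition's scorer. A monotonic transform of the final blended score leaves ROC-AUC unchanged, while rescaling one component before blending acts like a weight change and can move it.

```python
import math

def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def blend(a, b, weight_a):
    return [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]

y = [1, 0, 1, 0, 1, 0]
proto = [2.0, 1.0, 0.5, -1.0, 3.0, 0.0]  # stand-in for ProtoSSM logits
sed = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3]     # stand-in for SED scores

final = blend(proto, sed, 0.6)
squashed = [1.0 / (1.0 + math.exp(-s)) for s in final]  # monotonic transform
rescaled = blend(proto, [10.0 * s for s in sed], 0.6)   # component scaled first

print(roc_auc(y, final), roc_auc(y, squashed), roc_auc(y, rescaled))
```

The first two values match and the third differs. This is consistent with reading trial 084 as a disguised weight change on an axis that was already flat.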

Submissions (milestones)

sub	public	why this number
01	0.912	first valid submission
04	0.928	fork + retrain on competition data
09	0.928	ONNX Perch — finally beat the 90min timeout
12	0.929	Perch + EffNet blend
50–61	0.934	pseudo-mix + fold ensembles; hard ceiling for 10+ trials
72	0.935	reference config (ProtoSSM main + SED)
73	0.938	SED weight 18%→40% — the breakthrough lever
75–80	0.937–0.938	swept weight/scale/component/TTA — all saturated

ts-forecasting (Hedge Fund)

TL;DR: Predict 36,923 financial time series. 89% of test series are unseen. Weighted RMSE. Key challenge: One series (weight 13 trillion) could zero out the entire score if predicted wrong.

Three zeros

Val didn't reflect test (sub_02): Test had no lag features, val did. val 0.89 → public 0.0000
High-weight series explosion (sub_03): Predicted 6.37 for the heavily weighted series, whose true value was 0.000009 → weighted error 5.6×10¹⁴
Group mean is poison (sub_04): Predicting the training mean (-0.67) for new series (true ≈ 0) is worse than predicting zero
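
A rough sketch of why one series can dominate a weighted squared error. It does not reproduce the competition's exact scoring formula. The heavy weight is rounded to 13e12 from the text, 0.000009 is assumed to be that series' true value, and the weights and errors of the other 36,922 series are hypothetical.

```python
def weighted_sq_errors(y_true, y_pred, weights):
    """Per-series contribution w * (true - pred)^2 to a weighted squared error."""
    return [w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)]

n_other = 36_922
y_true = [0.000009] + [0.0] * n_other
y_pred = [6.37] + [1.0] * n_other    # other series deliberately predicted badly
weights = [13e12] + [1.0] * n_other  # only the first weight comes from the text

contrib = weighted_sq_errors(y_true, y_pred, weights)
share = contrib[0] / sum(contrib)
print(f"{contrib[0]:.3e}", share)
```

With the rounded weight, the single term lands at the same order of magnitude as the 5.6×10¹⁴ reported for sub_03. It also swamps every other series combined, even though those are predicted badly in this toy.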

Finding: Competition host said "public 0.5+ scores likely use future data (cheating)". Honest ceiling is 0.3–0.5.

march-mania (March Machine Learning Mania 2026)

TL;DR: Predict NCAA tournament win probabilities. Brier Score. Key challenge: Missed submission deadline. "9 days to go" meant days until tournament results, not submission cutoff.

Best local score: Men 0.161, Women 0.132 — would have been competitive.
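
For reference, the Brier score is the mean squared difference between predicted probability and the 0/1 outcome, so lower is better and always predicting 0.5 scores 0.25. A minimal sketch with made-up games:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

outcomes = [1, 0, 1, 1]
print(brier_score([0.5, 0.5, 0.5, 0.5], outcomes))  # always-0.5 baseline gives 0.25
print(brier_score([0.9, 0.2, 0.7, 0.6], outcomes))
```

Against that 0.25 baseline, local scores of 0.161 and 0.132 indicate real signal.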

Lessons learned

Don't trust val blindly: Verify that val reflects the actual test scenario (ts-forecasting: 3 zeros)
Check the prediction distribution before submitting: A score of 0 usually points to exploding or biased predictions
Fork Kaggle notebooks when hitting local limits: The RealMLP fork broke the GBDT ceiling in churn (+0.003), and BirdCLEF also started from a forked notebook
Don't filter weak models: The Ridge ensemble with all 55 OOFs beat cherry-picked subsets
Code Competition = half the work is environment: Don't build from scratch, fork and extend
Check the actual deadline first: "X days to go" might not mean the submission deadline
Never submit without OOF validation: Even post-processing needs local val first
Improving val is the right direction: In churn, public scored below val, but private beat public, so val was the more reliable indicator
Slow learning rate + hard cap > early stopping: lr=0.01 with a fixed 4000 rounds beat mlogloss-based early stopping for balanced_accuracy (irrigation trial_011 vs 012)
Multi-seed averaging helps public more than val: Variance reduction shows on unseen data (irrigation trial_013: val +0.0002, public +0.0003)
Pseudo-labeling doesn't always help: In irrigation, pseudo-labeling (conf>0.95) caused a regression despite the high confidence threshold
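
The "check the prediction distribution" lesson could be turned into a small pre-submission gate like the one below. `check_predictions`, its thresholds, and the example values are hypothetical. For a regression target, `low` and `high` would come from the training target range.

```python
import math

def check_predictions(preds, low=0.0, high=1.0, expected_mean=None, tol=0.1):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    finite = [p for p in preds if math.isfinite(p)]
    if len(finite) < len(preds):
        problems.append(f"{len(preds) - len(finite)} non-finite values")
    if finite:
        lo, hi = min(finite), max(finite)
        if lo < low or hi > high:
            problems.append(f"range [{lo}, {hi}] outside [{low}, {high}]")
        if len(finite) > 1 and lo == hi:
            problems.append("constant predictions")
        mean = sum(finite) / len(finite)
        if expected_mean is not None and abs(mean - expected_mean) > tol:
            problems.append(f"mean {mean:.4f} far from expected {expected_mean}")
    return problems

print(check_predictions([0.1, 0.5, 0.9], expected_mean=0.5))  # nothing flagged
print(check_predictions([0.1, 6.37, float("nan")]))           # flags NaN and range
```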

Details: TRIAL_GUIDE.md
