kangraemin/kaggle

Kaggle Experiments

Tracking every trial, every failure, every lesson across Kaggle competitions.


| Folder | Competition | Best Public | Status |
|---|---|---|---|
| churn/ | Playground S6E3: Customer Churn | 0.91704 (private 0.91815) | 84 trials, 15 subs, ended |
| irrigation/ | Playground S6E4: Irrigation Need | 0.97833 | 15 trials, 14 subs, in progress |
| birdclef/ | BirdCLEF+ 2026: Bird Species | 0.938 | 87 trials, 80 subs, in progress (ends 2026-06-03) |
| ts-forecasting/ | Hedge Fund: Time Series | 0.1499 | 4 subs, 3 scored zero |
| march-mania/ | March Mania 2026: NCAA Basketball | not submitted | missed deadline |

churn (Playground S6E3)

TL;DR: Predict telecom customer churn. AUC-ROC metric. 4,142 teams. Key challenge: Top scores packed in 0.914–0.917. Broke through the GBDT ceiling by forking a RealMLP notebook on Kaggle. Final: Best public 0.91704, private 0.91815. 15 submissions, 84 trials.

Experiment flow

| Trial | Why | Result | Next |
|---|---|---|---|
| 001 LightGBM | Baseline | val 0.9161, public 0.9138 | Target encoding + tuning |
| 002–004 FE+tuning | ChargeGap, Optuna 50 rounds | val 0.9166 → public 0.9139 | Ensemble |
| 005–008 ensemble | LGBM + XGB + CatBoost blends | val 0.9167 → public 0.9140 | Weight optimization |
| 009–013 failures | External data, 10-fold, feature merging | All degraded | Drop external data |
| 014–017 optimal blend | 5-model OOF grid search | val 0.9168 → public 0.91404 (GBDT ceiling) | Multi-seed, NN |
| 018–024 all-in | 83 groupby vars, multi-seed (7×5=35 models) | val 0.9169 → public stuck at 0.9140 | Need NN |
| 025–059 big search | MLP, RealMLP, Ridge, DART, pseudo-labeling... 35 trials | All failed to break through locally | Fork Kaggle notebook |
| 060 RealMLP fork | RealMLP 20-fold on Kaggle notebook | val 0.9194 → public 0.91683 (+0.003 breakthrough) | Blend with XGB |
| 061 RealMLP+XGB | RealMLP×0.85 + XGB×0.15 | public 0.91686 | Ridge ensemble |
| 074 TE std enriched | TE mean+std + 120 combos + digit features | val 0.91762 (local GBDT best) | Ridge input |
| 075–084 Ridge ensemble | 55 OOFs into Ridge(alpha=100). No filtering | val 0.9196 → public 0.91704, private 0.91815 | Competition ended |

Submissions

| sub | trial | public | what happened |
|---|---|---|---|
| 01 | 001 baseline | 0.91377 | Raw features only |
| 03 | 014 ensemble | 0.91404 | 5-model blend. GBDT ceiling |
| 07 | 060 RealMLP | 0.91683 | Kaggle notebook fork. NN broke GBDT wall |
| 10 | 078 RealMLP+TE blend | 0.91690 | RealMLP×0.80 + XGB_TE_std×0.20 rank blend |
| 12 | 080 Ridge 51 OOFs | 0.91702 | Ridge(alpha=100) 51 OOFs |
| 15 | 084 Ridge 56 OOFs | 0.91704 | Ridge(alpha=50) all 56 OOFs. best public |
| 14 | 083 Ridge 53 OOFs | 0.91701 | private 0.91815 |
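
The "rank blend" in sub 10 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the helper names `ranks` and `rank_blend` and the toy predictions are assumptions, and only the 0.80 / 0.20 weights come from the table above. Each model's scores are converted to normalized ranks before the weighted average, so a model with a compressed score range (here the hypothetical `xgb_te_std` values) still contributes according to its weight.

```python
def ranks(scores):
    """Return 0-based average ranks scaled to [0, 1]; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    out = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    denom = max(len(scores) - 1, 1)
    return [r / denom for r in out]


def rank_blend(preds_by_model, weights):
    """Weighted average of per-model normalized ranks."""
    ranked = {name: ranks(p) for name, p in preds_by_model.items()}
    n = len(next(iter(ranked.values())))
    return [sum(weights[m] * ranked[m][i] for m in ranked) for i in range(n)]


# Toy predictions (made up); weights follow sub 10 in the table.
preds = {
    "realmlp": [0.10, 0.80, 0.35, 0.60],
    "xgb_te_std": [0.02, 0.05, 0.04, 0.03],
}
blend = rank_blend(preds, {"realmlp": 0.80, "xgb_te_std": 0.20})
print(blend)
```

Because AUC only depends on ordering, blending ranks instead of raw probabilities removes any dependence on each model's calibration or output scale.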

irrigation (Playground S6E4)

TL;DR: Classify irrigation need (Low/Medium/High) from soil/weather/crop data. Balanced accuracy metric. Key challenge: Metric was balanced_accuracy, not accuracy. High class is only 3.3%. Slow learning rate (lr=0.01) + sklearn pairwise TE on 171 combinations were the key breakthroughs.
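
To see why the metric mix-up mattered, here is a small sketch comparing plain accuracy with balanced accuracy (the mean of per-class recall) on a toy label set where High is 3.3% of rows. The class counts other than the 3.3% share, the function names, and the example probability row are all made up for illustration; the ×2.6 scale is the value reported for trial 003.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def predict_with_class_scale(probas, classes, scale):
    """Multiply each class probability by a per-class factor, then take argmax."""
    out = []
    for row in probas:
        scaled = [p * scale.get(c, 1.0) for p, c in zip(row, classes)]
        out.append(classes[scaled.index(max(scaled))])
    return out


# Toy data: 1000 rows, High is 3.3%. The "model" never predicts High.
y_true = ["Low"] * 600 + ["Medium"] * 367 + ["High"] * 33
y_pred = ["Low"] * 600 + ["Medium"] * 367 + ["Medium"] * 33

print(accuracy(y_true, y_pred))           # looks excellent
print(balanced_accuracy(y_true, y_pred))  # exposes the missing class

classes = ["Low", "Medium", "High"]
row = [[0.05, 0.60, 0.35]]
print(predict_with_class_scale(row, classes, {}))
print(predict_with_class_scale(row, classes, {"High": 2.6}))
```

A model that ignores the rare class loses a full third of the balanced-accuracy score here, which is why scaling the High probability before the argmax can pay off even when it costs some plain accuracy.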

Experiment flow

| Trial | Why | Result | Next |
|---|---|---|---|
| 001 LightGBM baseline | Baseline | val 0.9844 (acc), public 0.9589 | Fix metric to bal_acc |
| 002 FE + ensemble | Domain features + 3-model blend | val 0.9853 (acc), public 0.9609 | bal_acc re-evaluation |
| 003 balanced blend | Fixed metric + class_weight + threshold (High×2.6) | val 0.9711 (bal_acc), public 0.9691 | Expand pairwise TE |
| 004–006 TE exploration | Various target encoding approaches (28–171 pairs) | val 0.9692–0.9699 | Factorize pairs, TE on cats only |
| 007 stacking | Ridge meta-learner + bias tuning | val 0.9707 | Switch to sklearn TE |
| 008b fullpair | 171 pairwise factorize + cat TE(24) + threshold (High×3.7) | val 0.9738, public 0.9721 | Slow LR + full pairwise TE |
| 011 slow XGB | lr=0.01, 4000 rounds hard cap + sklearn TE on 171 pairwise (750 features) | val 0.9794, public 0.97799 | Multi-seed |
| 013 multiseed | 3-seed XGB + orig append + coord descent bias tuning | val 0.9796, public 0.97833 | Pseudolabeling, 5-seed |
| 015 pseudo-label | pseudo-label (conf>0.95, 249K samples) + trial_011 arch | val 0.9796, public 0.97771 (regression) | Pseudo-labeling ineffective |
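
The "pairwise factorize + TE" idea in the table can be sketched as below. This is a hand-rolled stand-in, not the sklearn encoder the trials used and not the repository's code: `factorize_pair`, `oof_target_mean`, the `prior_weight` smoothing, and the toy columns are all assumptions. The target is simplified to a binary "is High" indicator, whereas the real task is three-class. As a side note, 171 equals the number of unordered pairs of 19 columns; the actual column count is not stated in this README.

```python
from collections import defaultdict
from itertools import combinations


def factorize_pair(col_a, col_b):
    """Map each distinct (a, b) value pair to an integer code, in first-seen order."""
    codes = {}
    return [codes.setdefault(pair, len(codes)) for pair in zip(col_a, col_b)]


def oof_target_mean(codes, target, fold_ids, prior_weight=1.0):
    """Out-of-fold smoothed target mean: each row is encoded using other folds only."""
    encoded = [0.0] * len(codes)
    for fold in sorted(set(fold_ids)):
        sums, counts = defaultdict(float), defaultdict(int)
        for c, y, f in zip(codes, target, fold_ids):
            if f != fold:
                sums[c] += y
                counts[c] += 1
        train_mean = sum(sums.values()) / sum(counts.values())
        for i, (c, f) in enumerate(zip(codes, fold_ids)):
            if f == fold:
                # Unseen codes fall back to the training-fold mean.
                encoded[i] = (sums[c] + prior_weight * train_mean) / (counts[c] + prior_weight)
    return encoded


# 19 columns would give 171 unordered pairs (the column count is an inference).
n_pairs = len(list(combinations(range(19), 2)))

soil = ["clay", "clay", "sand", "sand"]   # toy column
crop = ["rice", "wheat", "rice", "rice"]  # toy column
is_high = [1, 0, 0, 1]                    # toy binary target
folds = [0, 1, 0, 1]

pair_codes = factorize_pair(soil, crop)
te = oof_target_mean(pair_codes, is_high, folds)
print(n_pairs, pair_codes, te)
```

Encoding each row from the other folds only is what keeps the target from leaking into its own feature; the smoothing term pulls rare pairs toward the overall rate.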

Submissions

| sub | trial | public | what happened |
|---|---|---|---|
| 01 | 001 baseline | 0.9589 | Raw features + LightGBM |
| 03 | 003 balanced | 0.9691 | Metric fix + threshold. +0.008 jump |
| 08b | 008b fullpair | 0.9721 | 171 pairwise + multiclass TE + threshold |
| 11 | 011 slow XGB | 0.97799 | lr=0.01 + sklearn TE 750 features. +0.006 jump |
| 13 | 013 multiseed | 0.97833 | 3-seed + bias tuning. best |
| 14 | 015 pseudo-label | 0.97771 | Pseudo-labeling regression |

birdclef (BirdCLEF+ 2026)

TL;DR: Classify 234 bird/frog/insect species from 60s field recordings (5s segments). Macro-averaged ROC-AUC. Ends 2026-06-03. Key challenge: Code Competition — submissions only via Kaggle notebooks, CPU 90min limit. A blend of pretrained components (Perch, ProtoSSM, SED) plateaus hard — the last ~16 trials all land on the same ceiling. Best: Public 0.938 (trial_080, ProtoSSM 50% + SED 40% + Perch 10%). 87 trials, 80 submissions.

Experiment flow

| Trial | Why tried | Result | Next |
|---|---|---|---|
| ~001–015 | Fork a 0.926 public notebook, retrain on competition data, swap to ONNX Perch (faster inference model) to beat the 90min timeout | Climbed 0.912 → 0.928. Timeout was the real enemy, not accuracy | Add a second model to the blend |
| ~016–044 | Add EffNet / ConvNeXt / multi-window components to the Perch blend | 0.929–0.932. ConvNeXt 5-fold kept getting silently rejected: the hidden re-run refused its dataset mount | Drop unstable components, isolate what the auto-submit accepts |
| 050–061 | pseudo-label mix, fold-ensemble the weak components, ConvNeXt axis | 0.934 ceiling. 2→4 fold helped (+0.001), but everything stuck at 0.934 for 10+ trials | Try a stronger primary model than the Perch blend |
| 079–080 | Apply a reference config (ProtoSSM as main + SED component); push SED weight 18% → 40% | 0.935 → 0.938. SED weight was the one real lever (+0.003) | Find where SED saturates |
| 082–087 | Sweep all four blend axes around the 0.938 best | All 0.937–0.938, saturated (see below) | Parameter space exhausted → model retrain |

The 0.938 ceiling — four axes all saturated

After trial_080 hit 0.938, 6 trials swept every blend parameter and none broke through:

| Axis | Trial | What | Result |
|---|---|---|---|
| weight | 082, 083 | SED 40%→50%, Perch 10%→12% | 0.937 / 0.938. SED 40% is the peak; ±2pp does nothing |
| scale | 084 | z-score normalize SED logits to match Proto | 0.938, no effect. ROC-AUC is rank-based, so a monotonic transform of the final scores is a no-op; rescaling one component before blending acts like a weight change, and the weight sweep had already shown little sensitivity |
| component | 085, 086 | re-add EffNet 5% / 10% | 0.938 / 0.937. A different model family still can't shift the row ranks; a bigger weight just drowns out the strong ProtoSSM |
| inference | 087 | ProtoSSM TTA 5→7 time shifts | 0.938. Time-shift ensemble already saturated at 5 |

Lesson: once this blend plateaued, parameter tuning (weights, scaling, extra components, TTA) no longer moved the score; all six sweeps stayed within 0.937–0.938. The remaining lever is retraining the base models. Diminishing returns were obvious by trial ~084 and were confirmed by sweeping each axis once.
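
The rank-based nature of ROC-AUC can be checked with a small sketch. The pairwise `roc_auc` helper and the toy scores below are illustrative assumptions, not the competition pipeline (which uses macro-averaged AUC over 234 classes). It shows two separate facts: a strictly increasing transform of the final blended score leaves AUC unchanged, while rescaling a single component before blending changes its effective weight and therefore can change the result.

```python
import math


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y = [1, 0, 1, 0, 1, 0]
model_a = [2.0, 1.0, 0.5, 0.0, 3.0, 2.5]  # toy logits
model_b = [0.1, 0.3, 0.9, 0.2, 0.4, 0.8]  # toy probabilities

blend = [0.5 * a + 0.5 * b for a, b in zip(model_a, model_b)]

# Monotonic transform of the FINAL score: ordering is preserved, AUC is identical.
squashed = [1.0 / (1.0 + math.exp(-s)) for s in blend]
auc_blend = roc_auc(y, blend)
auc_squashed = roc_auc(y, squashed)

# Rescaling ONE component before blending shifts its effective weight.
rescaled_blend = [0.5 * (a / 10.0) + 0.5 * b for a, b in zip(model_a, model_b)]
auc_rescaled = roc_auc(y, rescaled_blend)

print(auc_blend, auc_squashed, auc_rescaled)
```

In this toy case the rescaled blend scores differently from the original, so a flat result after component normalization says the blend is insensitive to that component's weight, not that the transform was mathematically a no-op.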

Submissions (milestones)

| sub | public | why this number |
|---|---|---|
| 01 | 0.912 | first valid submission |
| 04 | 0.928 | fork + retrain on competition data |
| 09 | 0.928 | ONNX Perch: finally beat the 90min timeout |
| 12 | 0.929 | Perch + EffNet blend |
| 50–61 | 0.934 | pseudo-mix + fold ensembles; hard ceiling for 10+ trials |
| 72 | 0.935 | reference config (ProtoSSM main + SED) |
| 73 | 0.938 | SED weight 18%→40%, the breakthrough lever |
| 75–80 | 0.937–0.938 | swept weight/scale/component/TTA, all saturated |

ts-forecasting (Hedge Fund)

TL;DR: Predict 36,923 financial time series. 89% of test series are unseen. Weighted RMSE. Key challenge: One series (weight 13 trillion) could zero out the entire score if predicted wrong.

Three zeros

  1. Val didn't reflect test (sub_02): the test set had no lag features, but val did. val 0.89 → public 0.0000
  2. High-weight series explosion (sub_03): predicted 6.37 for a series whose true value was 0.000009 → error 5.6×10¹⁴
  3. Group mean is poison (sub_04): using the training mean (-0.67) on new series (true ≈ 0) is worse than predicting zero
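
The arithmetic behind the second and third zeros can be sketched as follows. The competition's exact metric formula is not given here, so this only computes the weight × squared-error term that any weighted RMSE contains. The weight of 13e12 is the rounded "13 trillion" from the TL;DR; the other series, their weights, and the function name are made up.

```python
def weighted_sq_errors(y_true, y_pred, weights):
    """Per-series weight * squared error, the building block of a weighted RMSE."""
    return [w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)]


y_true = [0.000009, 0.5, -0.2]
weights = [13e12, 1.0, 1.0]  # first weight from the text; the rest are placeholders

# sub_03 style: a large prediction on the heavy series.
exploded = weighted_sq_errors(y_true, [6.37, 0.4, -0.1], weights)
# Same series, but predicting zero for the heavy one.
zeroed = weighted_sq_errors(y_true, [0.0, 0.4, -0.1], weights)

share = exploded[0] / sum(exploded)
print(f"{exploded[0]:.3e}", f"{zeroed[0]:.3e}", share)

# sub_04 style: per unit weight, the group mean costs (0 - (-0.67))^2 on a series near 0.
group_mean_cost = (0.0 - (-0.67)) ** 2
zero_cost = (0.0 - 0.0) ** 2
print(group_mean_cost, zero_cost)
```

With the rounded weight, the heavy series contributes an error on the order of 10¹⁴ (the same magnitude as the figure quoted above) and accounts for essentially the entire sum, while predicting zero for it would cost only about 10³.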

Finding: Competition host said "public 0.5+ scores likely use future data (cheating)". Honest ceiling is 0.3–0.5.


march-mania (March Machine Learning Mania 2026)

TL;DR: Predict NCAA tournament win probabilities. Brier Score. Key challenge: Missed submission deadline. "9 days to go" meant days until tournament results, not submission cutoff.

Best local score: Men 0.161, Women 0.132 — would have been competitive.
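
For context on those numbers, the Brier score is the mean squared difference between the predicted win probability and the 0/1 outcome, so lower is better and always predicting 0.5 scores 0.25. A minimal sketch with made-up games (the function name and data are illustrative only):

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


outcomes = [1, 1, 0, 0]              # toy results: 1 means the first team won
confident = [0.9, 0.7, 0.2, 0.5]     # toy predictions
coin_flip = [0.5] * len(outcomes)

print(brier_score(confident, outcomes))
print(brier_score(coin_flip, outcomes))
```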


Lessons learned

  1. Don't trust val blindly: verify that val reflects the actual test scenario (ts-forecasting: 3 zeros)
  2. Check the prediction distribution before submitting: a score of 0 means prediction explosion or bias
  3. Fork Kaggle notebooks when hitting local limits: broke the GBDT ceiling with a RealMLP fork (+0.003); BirdCLEF also got off the ground by forking a public notebook
  4. Don't filter weak models: a Ridge ensemble with all 55 OOFs beat cherry-picked subsets
  5. Code Competition = half the work is environment: don't build from scratch, fork and extend
  6. Check the actual deadline first: "X days to go" might not mean the submission deadline
  7. Never submit without OOF validation: even post-processing needs local val first
  8. Improving val is the right direction: in churn, public trailed val (0.91704 vs 0.9196), but private (0.91815) beat public. Val is the more accurate indicator
  9. Slow learning rate + hard cap > early stopping: lr=0.01 with a fixed 4000 rounds beat mlogloss-based early stopping for balanced_accuracy (irrigation trial_011 vs 012)
  10. Multi-seed averaging helps public more than val: variance reduction shows on unseen data (irrigation trial_013: val +0.0002, public +0.0003)
  11. Pseudo-labeling doesn't always help: in irrigation, pseudo-labeling (conf>0.95) caused a regression despite the high confidence threshold
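
Lesson 2 can be turned into a small pre-submission check along these lines. This is only a sketch: the function name, the messages, and the expected range passed in are assumptions that would need to be set per competition.

```python
import math
import statistics


def prediction_warnings(preds, expected_min, expected_max):
    """Return a list of human-readable problems found in a prediction vector."""
    if any(not math.isfinite(p) for p in preds):
        return ["non-finite values present"]
    issues = []
    if min(preds) < expected_min or max(preds) > expected_max:
        issues.append("values outside expected range")
    if statistics.pstdev(preds) == 0:
        issues.append("constant predictions")
    return issues


# A 6.37 prediction on a target that normally lives near zero gets flagged.
print(prediction_warnings([0.01, -0.02, 6.37], -1.0, 1.0))
print(prediction_warnings([0.0, 0.0, 0.0], -1.0, 1.0))
print(prediction_warnings([0.01, -0.02, 0.03], -1.0, 1.0))
```

Running a check like this on every submission file is cheap compared with spending a daily submission on a file that scores zero.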

Details: TRIAL_GUIDE.md

About

Kaggle competition experiment and submission management. A Trial-Submission structure for systematically managing everything from hypothesis testing to reflection.
