Result: 11th of 23 teams · +£46,742.40 net P&L · 46.7% return on a £100k starting bankroll
95 teams registered. 23 competed on the day. A live algorithmic trading competition run by the Queen Mary Machine Learning Society across 9 rounds.
Standard regression gets you a fair-value estimate. Market making requires something harder: a calibrated spread. Submit too tight and you become the market maker, forced to accept every trade at prices that may be far from the true value. Submit too wide and you are safe but uncompetitive. The core challenge was balancing prediction accuracy, uncertainty quantification, and risk management simultaneously — under time pressure, round by round, with real money on the line.
Built a baseline modelling pipeline before the event:
- Loaded and audited all 9 train/test dataset pairs
- Benchmarked linear regression, ridge, lasso, and random forest using 5-fold CV on each stock independently
- Assigned the best-performing model per stock rather than forcing one approach across all datasets
- Tuned regularisation parameters via GridSearchCV
- Converted RMSE into quote ranges: aggressive (±0.5σ), balanced (±1.0σ), and defensive (±1.5σ) bands
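The RMSE-to-quote step can be sketched as below. This is a minimal illustration, assuming σ is taken to be the cross-validated RMSE and that quotes are symmetric around the fair-value estimate; the function name `quote_range` and the example numbers are hypothetical, not taken from the notebooks.

```python
from typing import Tuple

# Band multipliers from the list above; using CV RMSE as sigma is an assumption.
BANDS = {"aggressive": 0.5, "balanced": 1.0, "defensive": 1.5}


def quote_range(fair_value: float, rmse: float, style: str = "balanced") -> Tuple[float, float]:
    """Return a (low, high) quote centred on fair_value with half-width k * rmse."""
    k = BANDS[style]
    return fair_value - k * rmse, fair_value + k * rmse


# Illustrative inputs only: fair value 100.0, CV RMSE 4.0
for style in BANDS:
    low, high = quote_range(100.0, 4.0, style)
    print(f"{style:>10}: {low:.2f} to {high:.2f}")
```

A wider band trades competitiveness for safety, which is the trade-off the three presets are meant to expose.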
Merged the best components from three team notebooks into a single pipeline:
A. SVD-based NumPy Ridge with exact leave-one-out CV
Rather than approximating LOO error with k-fold splits, the pipeline computes it exactly using the hat-matrix shortcut, so alpha is selected on the true LOO error without refitting once per held-out row. It sweeps 60 alpha values across a log-spaced grid, fits using the SVD of the training matrix, and averages predictions across near-optimal alphas to reduce variance. For small datasets (fewer than 200 rows), it uses 300 bootstrap samples to get a more stable prediction and an uncertainty estimate.
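A minimal NumPy sketch of this idea follows. It is not the notebook code: the function names, the alpha range (`1e-3` to `1e3`), the 1% "near-optimal" tolerance, and the synthetic data are all assumptions made for illustration. The intercept is left unpenalised by centring the data, and the only loop is over alphas.

```python
import numpy as np


def loo_rmse_ridge(X, y, alphas):
    """Exact leave-one-out RMSE of ridge (unpenalised intercept) for each alpha."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)              # centring handles the intercept
    yc = y - y.mean()
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    Uty = U.T @ yc
    out = np.empty(len(alphas))
    for k, alpha in enumerate(alphas):   # loop over alphas only, never over folds
        d = s**2 / (s**2 + alpha)        # shrinkage factor per singular direction
        resid = yc - U @ (d * Uty)       # in-sample residuals
        h = 1.0 / n + (U**2) @ d         # diagonal of the hat matrix
        out[k] = np.sqrt(np.mean((resid / (1.0 - h)) ** 2))
    return out


def fit_predict_ridge(X, y, X_new, alpha):
    """Ridge prediction for new rows using the same SVD form."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    beta = Vt.T @ ((s / (s**2 + alpha)) * (U.T @ (y - y_mean)))
    return y_mean + (X_new - x_mean) @ beta


# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=40)

alphas = np.logspace(-3, 3, 60)          # 60 log-spaced values; the range is an assumption
scores = loo_rmse_ridge(X, y, alphas)
best_alpha = alphas[np.argmin(scores)]

# Average predictions over alphas whose LOO RMSE is within 1% of the best (assumed tolerance)
near = alphas[scores <= scores.min() * 1.01]
x_test = rng.normal(size=(1, 5))
pred = np.mean([fit_predict_ridge(X, y, x_test, a) for a in near], axis=0)
print(best_alpha, pred)
```

Because the SVD is computed once and reused for every alpha, the sweep costs little more than a single fit.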
B. Size-gated sklearn model zoo
Dataset sizes varied widely across the 9 stocks. Rather than applying the same model everywhere, the pipeline gates model availability by number of training rows:
| Model | Minimum rows | Rationale |
|---|---|---|
| Ridge / ElasticNet | any | always available, strong on small data |
| GradientBoosting | ≥ 200 | captures non-linearity |
| HistGradientBoosting | ≥ 500 | faster GBM for mid-size data |
| ExtraTrees | ≥ 2,000 | low-bias ensemble for large datasets |
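The gating rule in the table can be expressed as a small lookup. This is a sketch only: the thresholds come from the table, but `MODEL_GATES` and `eligible_models` are hypothetical names, and the entries are plain labels rather than instantiated scikit-learn estimators.

```python
# Thresholds copied from the table above; names are labels, not fitted models.
MODEL_GATES = [
    ("Ridge", 0),
    ("ElasticNet", 0),
    ("GradientBoosting", 200),
    ("HistGradientBoosting", 500),
    ("ExtraTrees", 2000),
]


def eligible_models(n_rows: int) -> list:
    """Return the model names whose minimum-row threshold is met."""
    return [name for name, min_rows in MODEL_GATES if n_rows >= min_rows]


# Row counts chosen to hit different tiers
for n in (29, 350, 20_000):
    print(n, eligible_models(n))
```

Keeping the thresholds in one table makes it easy to see, for any stock, which candidates are even allowed to compete in cross-validation.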
C. Ensemble and decision logic
If the runner-up model is within 5% of the best model's CV RMSE, both predictions are averaged. The final spread is then derived from the combined uncertainty estimate, with the leaderboard-aware decision engine adjusting aggressiveness based on current cash position and round number.
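The top-two averaging rule might look like the sketch below. It assumes "within 5%" means the runner-up's CV RMSE is at most 1.05 times the best, and that the average is unweighted; the function name and the example scores are hypothetical. The decision engine is not sketched because its rules are not described here.

```python
def combine_top_two(cv_rmse: dict, preds: dict, tol: float = 0.05) -> float:
    """Average the two best models if their CV RMSEs are within tol, else use the best."""
    ranked = sorted(cv_rmse, key=cv_rmse.get)   # model names, lowest RMSE first
    best = ranked[0]
    if len(ranked) > 1:
        runner_up = ranked[1]
        if cv_rmse[runner_up] <= cv_rmse[best] * (1 + tol):
            return 0.5 * (preds[best] + preds[runner_up])
    return preds[best]


# Illustrative numbers only
cv_rmse = {"ridge": 2.0, "gbm": 2.08, "extra_trees": 3.0}
preds = {"ridge": 100.0, "gbm": 104.0, "extra_trees": 90.0}
print(combine_top_two(cv_rmse, preds))
```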
Used a bespoke Excel control sheet to record fair-value predictions, submitted quotes, true values, and running P&L across all 9 rounds in real time.
Python · NumPy · Pandas · Scikit-learn · Google Colab
Models: Ridge · Lasso · ElasticNet · GradientBoosting · HistGradientBoosting · ExtraTrees
├── Market_Making_AI_Hackathon.ipynb # Pre-event baseline and tuning pipeline
├── Hackathon_Combined_Bl.ipynb # Final combined model used on the day
├── qmml_hackathon_playbook_v2.docx # Round-by-round strategy notes
├── qmml_hackathon_control_sheet.xlsx # Live P&L and quote tracking sheet
└── README.md
Why per-stock model selection? The nine datasets varied significantly in size (29 to ~20,000 rows). A single model class is unlikely to be the best choice across that range. Lasso dominated on small, sparse datasets; tree-based ensembles won on larger ones.
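Why exact LOO over k-fold? On very small datasets, k-fold estimates are noisy because they depend heavily on how the rows happen to be split. For ridge, exact LOO via the hat-matrix shortcut costs about the same as a single fit, involves no random splitting, and gives a nearly unbiased estimate of out-of-sample error, which made it the preferred choice whenever the dataset allows it.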
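For reference, the shortcut rests on a standard identity for linear smoothers. The notation here is an assumption for illustration: $X$ is the (centred) training matrix, $\alpha$ the ridge penalty, $\hat{y} = Hy$ the in-sample fit, and $\hat{y}_i^{(-i)}$ the prediction for row $i$ from a model trained without it.

$$
H = X\,(X^\top X + \alpha I)^{-1} X^\top, \qquad
y_i - \hat{y}_i^{(-i)} = \frac{y_i - \hat{y}_i}{1 - H_{ii}}, \qquad
\mathrm{RMSE}_{\mathrm{LOO}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - H_{ii}}\right)^{2}}
$$

Every held-out residual is recovered from a single fit on all $n$ rows, which is why no per-fold refitting is needed.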
Why bootstrap for small datasets? Three of the nine stocks had fewer than 200 training rows. Standard CV uncertainty estimates are unreliable at that scale. 300 bootstrap samples provide a more honest picture of prediction variance, which feeds directly into spread width.
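A rough sketch of the bootstrap step, assuming a ridge fit with a fixed alpha and row-level resampling with replacement. The function name, the alpha value, and the synthetic data are hypothetical. Note that the spread of bootstrap predictions reflects how sensitive the fit is to the sample, not the irreducible noise in the target; how the pipeline combined the two is not shown here.

```python
import numpy as np


def bootstrap_prediction(X, y, x_new, alpha=1.0, n_boot=300, seed=0):
    """Return (mean, std) of ridge predictions for x_new over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    preds = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample rows with replacement
        Xb, yb = X[idx], y[idx]
        x_mean, y_mean = Xb.mean(axis=0), yb.mean()
        Xc = Xb - x_mean                        # centre so the intercept is unpenalised
        beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ (yb - y_mean))
        preds[b] = y_mean + (x_new - x_mean) @ beta
    return preds.mean(), preds.std(ddof=1)


# Synthetic small dataset for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=30)
x_new = np.array([0.2, -0.1, 0.4])

mean, sigma = bootstrap_prediction(X, y, x_new)
print(f"prediction {mean:.3f}, bootstrap std {sigma:.3f}")
```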
Why ensemble the top two? When two models are within 5% of each other in cross-validated RMSE, neither has meaningfully won. Averaging their predictions typically reduces variance with little cost to bias, a cheap improvement before submitting.
What separated the top teams? After speaking to the top finishers, we found that their model and strategy were nearly identical to ours. The gap came down to position sizing in Round 1. Both teams saw the same high-confidence signal, but we played it safe. When a model gives you a large-edge, high-confidence prediction, that is precisely when you size up. A lesson in translating prediction quality into trading decisions, not just quotes.