A reproducible research study on short-horizon predictability, market efficiency, and transaction-cost structure in on-chain prediction markets. The headline finding is negative, and the methodology used to establish it honestly is the contribution.
Español: README.es.md
We investigate whether a profitable trading edge exists in Polymarket's
5-minute "Bitcoin Up or Down" binary markets using only publicly available
data. We build a 24/7 data-collection infrastructure (exchange trade flow,
the market order book, and the Chainlink BTC/USD data stream that resolves the
markets) and evaluate a sequence of hypotheses under a strict out-of-sample,
cost-aware, anti-overfitting protocol. We show: (i) public price/technical
features carry no directional signal at the 5-minute horizon; (ii) recorded
trade-flow features yield only a weak, regime-dependent edge after a
non-obvious data-quality correction; (iii) the market's true transaction cost
(a 0.07·p·(1−p) taker fee plus a dynamic spread) imposes an irreducible
edge barrier of ≈1.75 probability points at p=0.5, which the available signal
does not clear; and (iv) out-of-sample, the order book itself is a better
predictor of the outcome (AUC 0.837) than any predictor we construct from the
resolution oracle or from a measured 2-second exchange→oracle lead
(AUC 0.812). We conclude the market is efficient and not exploitable by a
public-data taker, and we precisely characterise the only structurally
favoured seat (market making). The value of this repository is the
evaluation discipline: it documents how a tempting hypothesis was killed
with evidence, including the detection and correction of three measurement
errors that each would have produced a false conclusion.
Figure 1. Left: the Polymarket order-book mid is a calibrated estimate of the realised outcome probability (points track y = x), i.e. the market is efficient. Right: out-of-sample, the book (AUC 0.837) outpredicts every public-data predictor we built (0.812); a measured 2-second exchange→oracle lead adds nothing.
Problem. Polymarket lists rolling binary markets "Will BTC be ≥ its price
5 minutes ago?". A market resolves Up iff price(end) ≥ price(start). The
research question: with public data only, is there a strategy with positive
expected value net of real transaction costs?
Objective. Estimate P(BTC(t+5m) ≥ BTC(t)) accurately enough to trade
profitably, or determine rigorously that no such public-data edge exists.
Contribution. This is a negative result, reported honestly. The deliverable is not a profitable system (none was found and none is claimed); it is a reproducible, cost-aware, out-of-sample evaluation framework and the documented reasoning (including three self-caught measurement errors) that turns "it looked like it worked" into "it does not, and here is why."
Understanding why the result is negative requires understanding the instrument. This section is prerequisite to the methodology.
- A new market opens every 5 minutes, 24/7 (288 windows/day per asset), with a deterministic slug `{asset}-updown-5m-{unix}` where `unix` is the 300-second-aligned window start.
- At open the strike is fixed = the Chainlink BTC/USD value at that instant. It does not move for the rest of the block.
- For the 5 minutes the CLOB order book is live: Up and Down shares trade in `[$0, $1]`, tick `$0.01`, minimum order `$5`, observed liquidity `$6k–$17k`. A position can be exited before close by selling at the live price. You are not forced to hold to resolution.
- At close, the Chainlink value at the end is compared to the strike: `end ≥ strike` → Up shares pay `$1`, Down pay `$0` (ties resolve Up, matching the target `P(BTC(t+5m) ≥ BTC(t))`).
- The payoff is binary and determined only by the two oracle readings (start, end). For a hold-to-resolution position the intermediate price path is irrelevant: being "ahead" with one minute left means nothing if the last minute reverses it.
- Near open the outcome is maximally uncertain: the mid is ≈ `0.50`.
- As time elapses and BTC moves relative to the (fixed) strike, the mid tracks the conditional probability `P(Up | current gap, time remaining)`; as `secs→0` it converges toward `0` or `1`, effectively a step function.
- Empirically the mid is well calibrated to the realised `P(Up)` across the full range (`0.30→0.263`, `0.50→0.492`, `0.70→0.752`, `0.95→0.975`), and `mid | gap>0 ≈ P(Up | gap>0)` at every time bucket: the price already embeds the observable oracle gap.
- The spread is dynamic: ≈1¢ when balanced / pre-open, widening to ≈5¢ mid-window, and up to ≈19¢ on BTC at informative moments, where market makers widen precisely when the outcome becomes contested, protecting against informed flow. Markets are tradeable ≈92% of the time.
- The live price is a fast, efficient estimate of the resolution probability. Out-of-sample it is a better predictor (AUC 0.837) than any public-data predictor we built (0.812), including one exploiting a measured 2-second exchange→oracle lead (§5).
- The market is negative-sum for takers: a taker-only fee `0.07·p·(1−p)` (an irreducible ≈1.75 pp floor at p=0.5) plus the spread. The average taker must lose; winning takers are paid by losing takers, not by "the house".
- "Enter late when it is nearly decided" fails: at ≈95% decided the book already prices Up at ≈ `$0.95` → pay `$0.95` to win `$1` at 95% ≈ break-even before cost, negative after. The number of participants is irrelevant. Efficiency is a property of the price; more participants make it more efficient, not less.
- "Buy at 0.50, sell the peak" fails: it is the same prediction problem (the price moves with BTC faster than one predicts), it doubles the cost (two fee-bearing legs, with the fee maximised at p=0.5), and it adds a second unsolved prediction (the exit timing).
- The 5-minute horizon is a pure speed/microstructure game, leaving no room for the research/judgment edge that is the only way a non-automated human wins in prediction markets generally. The common discretionary user is therefore strictly worse positioned than the automated approach (slower, unable to monitor 288 windows/day, paying the same fee) and is precisely the flow professional makers profit from. Short-run "wins" are variance, not skill; the long-run taker expectation is negative, the dynamics of a sharp bookmaker plus vig, not a beatable coin flip.
- The only structurally favoured seat is the maker (fee-exempt, spread-capturing), i.e. professional market making under adverse-selection risk, requiring capital and infrastructure, not prediction, and not accessible to the common user.
In one line: the 5-minute block is an efficient, speed-dominated, negative-sum-for-takers binary game whose live price already contains the public information and whose fee is engineered so takers subsidise makers. The wall is not a tuning problem; it is the market's design.
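The "enter late" and "buy at 0.50, sell the peak" arguments above reduce to arithmetic. The following is a minimal sketch of that arithmetic, not the repository's cost model: it assumes the fee is charged per share at the fill price on every taker leg, it ignores the spread, and the function names are hypothetical.

```python
FEE_RATE = 0.07  # coefficient of the 0.07·p·(1−p) taker fee


def taker_fee_per_share(price: float) -> float:
    """Taker fee for one share filled at `price` (a probability in [0, 1])."""
    return FEE_RATE * price * (1.0 - price)


def hold_to_resolution_ev(true_prob: float, fill_price: float) -> float:
    """Expected PnL per share: win $1 with probability `true_prob`,
    pay the fill price plus the taker fee. The spread is ignored (assumption)."""
    return true_prob - fill_price - taker_fee_per_share(fill_price)


def round_trip_fees(entry_price: float, exit_price: float) -> float:
    """Total fee per share for a taker entry followed by a taker exit."""
    return taker_fee_per_share(entry_price) + taker_fee_per_share(exit_price)


# "Enter late": the book already prices a 95%-likely outcome at $0.95.
late_entry = hold_to_resolution_ev(true_prob=0.95, fill_price=0.95)
print(f"late-entry EV per share: {late_entry:+.6f}")

# "Buy at 0.50, sell the peak": both legs pay a fee, the entry leg at the fee maximum.
print(f"round-trip fees, 0.50 -> 0.60: {round_trip_fees(0.50, 0.60):.4f}")
```

With a fair price the expected value is exactly minus the fee, so any real spread only makes the number more negative.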
- EMH (Fama, 1970): a competitive price already incorporates public information; residual predictability for a public-data participant ≈ 0, exactly what §5 confirms empirically.
- Microstructure (Easley, López de Prado & O'Hara, VPIN, 2012; Harris, Trading and Exchanges, 2003): order-flow imbalance, aggressor pressure and large-trade ("whale") activity are the candidate leading signals when lagging price features fail (tested in §5.2).
- Binary payoff: PnL is not proportional to the underlying's return; mishandling this silently invalidates a backtest (§6, Error 2).
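To make the binary-payoff point concrete, the sketch below contrasts the PnL of an Up share held to resolution with a PnL that scales linearly with the underlying's return. It is illustrative only: the BTC prices are made up, fees are excluded, and the function names are hypothetical rather than taken from `simulation/cost_model.py`.

```python
def resolves_up(strike: float, end_price: float) -> bool:
    """Resolution rule described above: Up iff end >= strike (ties resolve Up)."""
    return end_price >= strike


def binary_pnl(entry_price: float, strike: float, end_price: float) -> float:
    """PnL per Up share held to resolution, before fees."""
    payout = 1.0 if resolves_up(strike, end_price) else 0.0
    return payout - entry_price


def linear_pnl(strike: float, end_price: float) -> float:
    """PnL per $1 of a position whose value scales with the underlying's return."""
    return (end_price - strike) / strike


# Hypothetical end prices: a tiny up move, a large up move, a tiny down move.
for end in (60_006.0, 60_600.0, 59_994.0):
    print(end, binary_pnl(0.50, 60_000.0, end), round(linear_pnl(60_000.0, end), 4))
```

The binary PnL is identical for the small and the large up move and flips sign on a move of the same tiny size in the other direction, which is why a backtest built on linear returns measures a different instrument.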
| Stream | Source | Cadence | Volume collected |
|---|---|---|---|
| Spot/perp trades | Bybit WebSocket (BTC/ETH/SOL perp) | tick | ~11 days, ~8M trades/symbol |
| Resolution oracle | Chainlink BTC/USD via Polymarket public WS | 1 Hz (~1.3 s delivery latency) | 410 windows (~34 h) |
| Order book | Polymarket CLOB (`btc-updown-5m-*`) | ~1 s poll | time-aligned with oracle |
| Cost/liquidity probe | Polymarket Gamma + CLOB | 40 s | 8,915 samples (tradeable 91.8%) |
Collection runs as resilient launchd daemons (auto-restart, network-state
aware, periodic flush). Market discovery is deterministic from the
5-minute-aligned slug; the repository's original keyword-based finder was
found defective (§6, Error 2).
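Deterministic discovery follows directly from the slug pattern `{asset}-updown-5m-{unix}`. A minimal sketch, assuming only that pattern and the 300-second alignment stated earlier; the function names are hypothetical and no Polymarket API call is made.

```python
WINDOW_SECONDS = 300  # one 5-minute block


def window_start(unix_ts: int) -> int:
    """Round a Unix timestamp down to the 300-second-aligned window start."""
    return unix_ts - (unix_ts % WINDOW_SECONDS)


def market_slug(asset: str, unix_ts: int) -> str:
    """Build the slug of the window that contains `unix_ts`."""
    return f"{asset}-updown-5m-{window_start(unix_ts)}"


def next_slugs(asset: str, unix_ts: int, count: int) -> list[str]:
    """Slugs of the current window and the following `count - 1` windows."""
    start = window_start(unix_ts)
    return [f"{asset}-updown-5m-{start + i * WINDOW_SECONDS}" for i in range(count)]


print(market_slug("btc", 1_700_000_123))
print(next_slugs("btc", 1_700_000_123, 3))
```

Because the slug is a pure function of the clock, a recorder can compute upcoming markets ahead of time instead of searching for them by keyword.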
Evaluation protocol (applied to every hypothesis):
- Out-of-sample: temporal train/test split; nuisance parameters estimated on train only, evaluated on held-out test.
- Cost-aware: PnL computed with the correct binary payoff and the real fee, taker filled at the recorded ask (`simulation/cost_model.py`).
- Anti-overfitting: predictors are pre-specified and simple; results are reported per time-bucket (no "best bucket" cherry-picking); regime dependence is reported, not hidden.
- Robust core metric: out-of-sample log-loss / AUC vs. outcome, an execution-assumption-free measure. If a PnL number contradicts the robust metric, the robust metric is trusted and the PnL is treated as an artifact.
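The protocol's core pieces (temporal split, log-loss, AUC) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the repository's evaluation code: rows are assumed to be in chronological order, the AUC is the plain pairwise (Mann-Whitney) estimate, and all names are hypothetical.

```python
import math


def temporal_split(rows, train_fraction=0.5):
    """Split chronologically ordered rows; the test set is strictly later in time."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def log_loss(probs, outcomes, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def auc(scores, outcomes):
    """Probability that a random Up window scores above a random Down window."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one Up and one Down outcome")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


train, test = temporal_split(list(range(10)))
print(train, test)
print(round(log_loss([0.5, 0.5], [1, 0]), 3))   # an uninformed 0.5 forecast
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Neither metric needs a fill model, which is what makes them usable as a tie-breaker when a simulated PnL looks suspicious.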
Transaction-cost model (decoded from primary sources, Polymarket fee
documentation): crypto markets charge a taker-only fee
fee = C · 0.07 · p · (1−p) (C = shares, p = price); makers pay zero. Verified
against the published peak of $1.75 per 100 shares at p=0.5. The implied
break-even edge over the mid is ≈ spread/2 + a·0.07·(1−a), i.e. an
irreducible ≈1.75 probability-point floor at p≈0.5, before spread.
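The fee formula and the break-even approximation can be checked numerically. The sketch below is a stand-alone illustration, not the contents of `simulation/cost_model.py`; it interprets `a` as the taker's fill price (the ask), which is an assumption, and the function names are hypothetical.

```python
FEE_RATE = 0.07


def taker_fee(shares: float, price: float) -> float:
    """fee = C · 0.07 · p · (1 − p), with C = shares and p = price."""
    return shares * FEE_RATE * price * (1.0 - price)


def breakeven_edge(ask: float, spread: float) -> float:
    """Approximate edge over the mid needed to break even:
    half the spread plus the per-share fee at the ask (assumed meaning of `a`)."""
    return spread / 2.0 + FEE_RATE * ask * (1.0 - ask)


# Published peak: $1.75 per 100 shares at p = 0.5.
print(round(taker_fee(100, 0.5), 4))

# Fee-only floor at p = 0.5, then the same with a 1-cent spread (ask = mid + 0.5 cent).
print(round(breakeven_edge(0.5, 0.0), 4))
print(round(breakeven_edge(0.505, 0.01), 4))
```

The fee term is symmetric around p = 0.5 and vanishes toward 0 and 1, so the floor is highest exactly where the outcome is most uncertain.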
LGBM / TCN / latent-SDE on OHLCV + technical indicators + funding. Result: validation BCE ≈ 0.71 at k=0, worse than the random baseline (0.693). No directional signal. Architecture changes do not help.
Aggressor imbalance, buy pressure, large-trade ratio (whale proxy), VWAP deviation, 5-minute bins. Result, raw: "no signal." After correcting a data-quality defect (§6, Error 1, partial bins at recorder-gap edges): Δ AUC = +0.024 (t ≈ 3.0, 15/25 paired wins), AUC ≈ 0.534. A real but weak edge; regime-dependent (one CV fold negative); BCE only marginally below random. First positive result of the project, and far too weak to clear §4's cost floor.
The markets resolve on the Chainlink BTC/USD data stream, not on Bybit (§6, Error 3). The oracle is observable in real time (public WS, 1 Hz). With 407 recorded windows:
- Order-book calibration: book mid ≈ realised `P(Up)` across the entire range; the book is highly efficient.
- Naive oracle-vs-book taker: −0.19 PnL per $1 over 77,656 trades, negative in every time bucket.
- Lead–lag: Bybit leads the Chainlink tick by ≈2 s (corr `ret_Bybit(t)` vs `ret_Chainlink(t+2s)` = +0.71; contemporaneous +0.04), a clean structural lead.
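The lead–lag measurement is a lagged correlation between two return series. A minimal sketch, assuming both series are sampled on the same 1 Hz grid so that a lag of 2 samples equals 2 seconds; the data is synthetic (the follower is the leader delayed by two samples) and the function names are hypothetical, not those of `analyze_leadlag.py`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def lagged_corr(leader_ret, follower_ret, lag):
    """Correlation of the leader's return at t with the follower's return at t + lag."""
    if lag > 0:
        return pearson(leader_ret[:-lag], follower_ret[lag:])
    return pearson(leader_ret, follower_ret)


# Synthetic returns: the follower repeats the leader two samples later.
leader = [0.01, -0.02, 0.015, -0.005, 0.02, -0.01, 0.005, -0.015, 0.01, 0.0, -0.01, 0.02]
follower = [0.0, 0.0] + leader[:-2]

for lag in (0, 1, 2, 3):
    print(lag, round(lagged_corr(leader, follower, lag), 3))
```

On this toy data the correlation peaks at lag 2 by construction; on real data the peak location is the estimate of the lead.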
Predictors compared on a held-out second half (205 windows): A = oracle gap only; B = oracle gap corrected with the real-time Bybit lead (basis handled by differencing, not levels).
| Predictor | OOS log-loss | OOS AUC |
|---|---|---|
| Order-book mid | 0.491 | 0.837 |
| A: Chainlink only | 0.610 | 0.812 |
| B: Bybit-corrected | 0.610 | 0.812 |
The book predicts the outcome better than we do, and the 2-second lead adds zero incremental predictive power (B ≈ A to four decimals). A small positive taker PnL appeared in the simulation but contradicted the robust metric and was internally inconsistent (negative in the tightest buckets; absurd "maker" figures), so it was classified as an execution-model artifact and not reported as a result (§4.4).
Each of these, uncorrected, would have produced a false conclusion. Finding and reporting them is the substantive contribution.
- Dirty-data masking. Recorder network gaps left partial 5-minute bins whose flow features were computed over a fraction of the interval. They were not `NaN`, so they silently entered the model and reversed the trade-flow verdict (from "no signal" to "weak signal" once cleaned at source). Lesson: clean at construction, not downstream.
- Wrong instrument / broken plumbing. The simulation PnL modelled a linear return, not the binary payoff; the order-book discovery used a keyword search that returned unrelated markets. Both invalidate naive backtests until fixed.
- Wrong resolution source. All early signal work used Bybit as the label; the markets actually resolve on the Chainlink BTC/USD stream. The label was wrong until corrected (raised by domain questioning, then verified against the market's authoritative resolution rule).
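"Clean at construction" for the first error can be sketched as a coverage check applied when a bin's features are built. The rule below (flag a bin if any stretch longer than `max_gap` seconds has no trades) and its 30-second default are assumptions chosen for illustration; the document does not state how the repository detects partial bins.

```python
import math

BIN_SECONDS = 300


def is_complete_bin(timestamps, bin_start, max_gap=30.0):
    """True if no stretch of the bin longer than `max_gap` seconds lacks a trade.
    The bin edges count as boundaries, so a bin that is cut short also fails."""
    bin_end = bin_start + BIN_SECONDS
    inside = sorted(t for t in timestamps if bin_start <= t < bin_end)
    edges = [bin_start] + inside + [bin_end]
    return all(later - earlier <= max_gap for earlier, later in zip(edges, edges[1:]))


def mask_partial(value, timestamps, bin_start, max_gap=30.0):
    """Return the feature value for a complete bin, NaN for a partial one."""
    return value if is_complete_bin(timestamps, bin_start, max_gap) else math.nan


full = [float(t) for t in range(0, 300, 10)]  # trades across the whole bin
cut = [float(t) for t in range(0, 110, 10)]   # recorder dropped after 100 s

print(mask_partial(0.25, full, 0.0))
print(mask_partial(0.25, cut, 0.0))
```

Emitting `NaN` at this stage forces every downstream step to handle the missing bin explicitly. A threshold like this would also flag genuinely quiet periods, so on thin symbols it would need to be tuned or replaced by recorder heartbeat logs.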
There are two independent walls (mechanically argued in §2, empirically confirmed here):
- Predictability wall. Short-horizon BTC direction is not predictable from public data better than the market itself. Out-of-sample the order book (AUC 0.837) dominates every predictor we built (AUC 0.812), including one exploiting a measured exchange→oracle latency lead.
- Cost/structure wall. The taker-only `0.07·p·(1−p)` fee plus a dynamic 1–19¢ spread create a break-even barrier (~2–4 pp) far above the available signal. The only structurally favoured seat is the maker (fee-exempt, spread-capturing), i.e. professional market making with adverse-selection risk, a different problem from prediction, not claimed solved here.
The market is efficient because faster, better-informed participants are already extracting and thereby eliminating the edge. Efficiency is the evidence that the winning seat exists and is occupied.
- Resolution-oracle dataset is short (~34 h / 407 windows); trade-flow ~11 d.
- Single venue; down-token book approximated as `1 − up`.
- The maker seat is not rigorously tested (only a deliberately optimistic, artifact-level approximation, explicitly discarded).
- Residual-volatility scaling for the conditional probability is an empirical approximation; the robust AUC/log-loss core does not depend on it.
- No claim of profitability is made; this is a negative result by design.
With public data, as a taker, the Polymarket 5-minute BTC market is not exploitable: the order book is more informed than our best predictor and the fee/spread structure exceeds the residual edge. The result holds for the automated approach and, a fortiori, for the common discretionary user, who is strictly worse positioned (§2.3). The investigation is reported as a negative result with full reasoning. The transferable output is the evaluation framework (out-of-sample, cost-aware, anti-overfitting, and honest about self-inflicted measurement error) together with a reusable real-time market/oracle data infrastructure.
data/ collectors + raw/aggregated parquet (trades, resolution, history)
features/ technical.py, fractal.py (feature construction)
simulation/ cost_model.py (correct binary payoff + real fee + breakeven)
polymarket/ client/finder/executor/resolver (venue interface)
scripts/
  record_trades.py            Bybit trade-flow recorder (daemon)
  record_resolution.py        Chainlink oracle + Polymarket book recorder (daemon)
  polymarket_cost_probe.py    spread/executability probe (daemon)
  build_trade_features.py     raw trades -> cleaned 5m features
  test_trade_signal.py        trade-flow signal test (OOS, multi-seed)
  analyze_resolution.py       oracle-vs-book efficiency / edge analysis
  analyze_leadlag.py          Bybit->Chainlink lead-lag
  analyze_money_test.py       decisive OOS money-test (predictor A vs B vs book)
Run order: ingest.py -> build_trade_features.py -> test_trade_signal.py;
data daemons feed analyze_*.py. Self-check the cost model with
python -m simulation.cost_model (verifies the $1.75/100-share fee peak).
- E. F. Fama. Efficient Capital Markets: A Review of Theory and Empirical Work. Journal of Finance, 1970.
- D. Easley, M. López de Prado, M. O'Hara. Flow Toxicity and Liquidity in a High-Frequency World (VPIN). Review of Financial Studies, 2012.
- L. Harris. Trading and Exchanges: Market Microstructure for Practitioners. Oxford University Press, 2003.
- Polymarket. Trading Fees. https://docs.polymarket.com/trading/fees
- Chainlink. Data Streams. https://docs.chain.link/data-streams
Author's note: this study deliberately reports a negative result. In quantitative research, a rigorously established and honestly documented "this does not work, and here is precisely why" is a stronger signal of methodology than an unverified claim that something does.
