PCA-based market-neutral statistical arbitrage, implemented as a clean, tested, reproducible research framework. It recovers statistical risk factors from the cross-section of S&P 500 returns using Principal Component Analysis, projects those factors out of each stock to isolate the idiosyncratic residual, models the residual as a mean-reverting process, and trades the resulting signal in a book that is dollar- and beta-neutral by construction.
This is a faithful re-implementation and extension of Avellaneda and Lee (2010),
built entirely on free data so anyone can clone, run, and audit it end to end.
The methodology is specified in SPEC.md, which is the single source
of truth for scope; the modules and functions map one-to-one onto the equations
there, so the repository doubles as a teaching reference.
This is not investment advice and not a live trading system. It runs on free data with known survivorship bias (see Data and the survivorship caveat). Results are methodology demonstrations, not return forecasts. Published statistical-arbitrage edges have decayed substantially since the 2000s.
At each rebalance date the pipeline runs nine steps on a strictly historical window: resolve the universe, compute returns, extract PCA factors (with correlation-matrix cleaning), regress out the factors to get residuals, fit the Ornstein-Uhlenbeck model and compute s-scores, turn s-scores into signed positions, size them into a neutral and constrained book, simulate execution with costs, and mark the ledger. The design avoids look-ahead in code rather than by convention, fills trades with a one-day lag by default, and is deterministic given a config and a seed.
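The factor, residual, and s-score steps can be sketched as follows. This is a minimal illustration in the spirit of Avellaneda and Lee (2010), not the repository's code: the function names `eigenportfolio_returns` and `ou_s_score` are hypothetical, the OU parameters are estimated through a discrete AR(1) fit on the cumulated residual, and correlation-matrix cleaning is left out. SPEC.md remains the reference for the exact equations.

```python
import numpy as np


def eigenportfolio_returns(returns, n_factors):
    """Return (T, n_factors) factor returns from a (T, N) window of stock returns."""
    sigma = returns.std(axis=0, ddof=1)
    z = (returns - returns.mean(axis=0)) / sigma
    corr = z.T @ z / (returns.shape[0] - 1)      # sample correlation matrix
    _, eigvecs = np.linalg.eigh(corr)            # eigenvalues come back ascending
    top = eigvecs[:, ::-1][:, :n_factors]        # leading eigenvectors (sign is arbitrary)
    weights = top / sigma[:, None]               # volatility-scaled eigenportfolio weights
    return returns @ weights


def ou_s_score(stock_returns, factor_returns, periods_per_year=252):
    """Regress out the factors, then fit an AR(1) to the cumulated residual."""
    y = np.asarray(stock_returns, dtype=float)
    design = np.column_stack([np.ones(y.size), factor_returns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef                 # idiosyncratic return
    x = np.cumsum(residual)                      # residual "price" process

    # Discrete OU: x[k+1] = a + b * x[k] + innovation
    lagged = np.column_stack([np.ones(x.size - 1), x[:-1]])
    (a, b), *_ = np.linalg.lstsq(lagged, x[1:], rcond=None)
    if not 0.0 < b < 1.0:
        # No usable mean reversion in this window (assumption: skip the name).
        return {"b": float(b), "kappa": np.nan, "s_score": np.nan}

    innovation = x[1:] - lagged @ np.array([a, b])
    m = a / (1.0 - b)                            # equilibrium level
    sigma_eq = np.sqrt(innovation.var(ddof=2) / (1.0 - b**2))
    kappa = -np.log(b) * periods_per_year        # annualized mean-reversion speed
    s = (x[-1] - m) / sigma_eq
    return {"b": float(b), "kappa": float(kappa), "s_score": float(s)}
```

The s-score measures how far the cumulated residual sits from its estimated equilibrium, in units of the equilibrium standard deviation. The entry and exit thresholds applied to it are configuration choices and are not shown here.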
git clone https://github.com/slye-us/statarb.git
cd statarb
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,data]"
The core library needs only numpy, pandas, scipy, PyYAML, pyarrow, and
matplotlib. The data extra adds the optional free-data providers (yfinance,
pandas-datareader); the dev extra adds the test and lint toolchain.
Run the network-free synthetic backtest, which exercises the whole pipeline and writes every figure and table:
statarb run --config configs/smoke.yaml --output artifacts
Run the headline configuration on real S&P 500 data (requires network and the
data extra; first run downloads and caches prices):
statarb data --config configs/baseline.yaml --refresh-universe
statarb run --config configs/baseline.yaml --output artifacts
statarb verify --config configs/baseline.yaml
Every default in configs/baseline.yaml maps to the parameter table in
SPEC.md Section 4. Experiments are declarative: change the YAML, rerun, diff
the artifacts.
The figures below come from configs/smoke.yaml, a deterministic synthetic
market with a planted factor structure and genuinely mean-reverting residuals.
It is an illustration of the machinery, not a performance claim. On this data the
strategy recovers the planted edge (gross Sharpe well above one) and the cost
model takes a realistic bite out of it, while the book stays neutral to machine
precision.
The net market beta and net dollar exposure sit at roughly 1e-17 throughout, so
the book is market-neutral in its realized exposures and not merely by
assertion. Full numbers are in docs/example_run/summary.md.
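One way to see why exposures can sit at floating-point rounding level is an orthogonal projection onto the two equality constraints. The sketch below is an assumption about how such a step could look, not the repository's portfolio constructor: the name `neutralize` is hypothetical, and position limits and the other constraints are ignored.

```python
import numpy as np


def neutralize(raw_weights, betas, gross=1.0):
    """Project raw weights onto {sum(w) = 0, betas @ w = 0}, then scale to a gross target."""
    w = np.asarray(raw_weights, dtype=float)
    # Constraint matrix: row 0 is dollar exposure, row 1 is market beta exposure.
    # Assumes betas are not all identical, otherwise C @ C.T is singular.
    C = np.vstack([np.ones_like(w), np.asarray(betas, dtype=float)])
    w_neutral = w - C.T @ np.linalg.solve(C @ C.T, C @ w)
    scale = np.abs(w_neutral).sum()
    # Rescaling multiplies both exposures by a constant, so they stay at zero.
    return w_neutral * (gross / scale) if scale > 0 else w_neutral
```

Because the projection removes the component of the weight vector along both constraint directions, the remaining net dollar and net beta exposures are zero up to rounding error.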
statarb/
  data/        point-in-time universe, price ingestion + cache, pluggable providers
  factors/     correlation matrix, cleaning (Ledoit-Wolf / MP), selection, eigenportfolios
  signals/     factor-regression residuals, OU fit + s-score, trading rules
  portfolio/   dollar/beta-neutral construction with constraints, ex-ante risk
  backtest/    no-look-ahead event loop, cost models, ledger
  evaluation/  metrics, deflated/probabilistic Sharpe, bootstrap, report builder
  cli.py       one-command reproduction
The factor engine cleans the correlation matrix before eigendecomposition to
control noise in the T < N regime, where a 252-day window and a near-500-name
universe leave the sample matrix rank-deficient. Cleaning is configurable
(ledoit_wolf, mp_clip, or none) and is on by default.
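As an illustration of what eigenvalue clipping does, here is a minimal sketch in the style of Laloux et al. (1999). It is not the repository's `mp_clip` implementation: the function name `mp_clip_sketch` is hypothetical, and the choices of a unit-variance Marchenko-Pastur edge and of replacing noise eigenvalues by their average are assumptions.

```python
import numpy as np


def mp_clip_sketch(corr, n_obs):
    """Flatten eigenvalues below the Marchenko-Pastur upper edge, keeping the trace."""
    n_assets = corr.shape[0]
    lam_plus = (1.0 + np.sqrt(n_assets / n_obs)) ** 2   # upper edge for unit variance
    vals, vecs = np.linalg.eigh(corr)
    noise = vals < lam_plus
    cleaned = vals.copy()
    if noise.any():
        cleaned[noise] = vals[noise].mean()             # average preserves the trace
    rebuilt = (vecs * cleaned) @ vecs.T                 # V diag(cleaned) V^T
    d = np.sqrt(np.diag(rebuilt))
    return rebuilt / np.outer(d, d)                     # restore a unit diagonal
```

With fewer observations than names, the sample correlation matrix has zero eigenvalues. After clipping, those are lifted to the noise-band average, so the cleaned matrix is full rank as long as the noise band carries some variance.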
Performance is never summarized by a single Sharpe ratio. The report includes return and risk metrics, drawdowns, turnover and cost diagnostics with a breakeven-cost figure, realized neutrality over time, and overfitting controls: the probabilistic and deflated Sharpe ratios (Bailey and Lopez de Prado) and a stationary-bootstrap confidence interval on the Sharpe. A walk-forward split with frozen hyperparameters is supported in config.
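For reference, the probabilistic Sharpe ratio of Bailey and Lopez de Prado can be sketched as below. This is an illustrative version, not the repository's evaluation code: the name `probabilistic_sharpe` is hypothetical, and both the observed and the benchmark Sharpe ratios are per-period (not annualized) values.

```python
import numpy as np
from scipy import stats


def probabilistic_sharpe(returns, sr_benchmark=0.0):
    """Probability that the true per-period Sharpe ratio exceeds sr_benchmark."""
    r = np.asarray(returns, dtype=float)
    n = r.size
    sr = r.mean() / r.std(ddof=1)                  # per-period Sharpe estimate
    skew = stats.skew(r)
    kurt = stats.kurtosis(r, fisher=False)         # non-excess kurtosis (3 for a normal)
    # Standard error term that accounts for skewness and fat tails.
    denom = np.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr**2)
    return float(stats.norm.cdf((sr - sr_benchmark) * np.sqrt(n - 1) / denom))
```

The deflated Sharpe ratio uses the same formula with the benchmark raised to the Sharpe ratio expected from the best of many trials, which is why the number of configurations tried has to be recorded alongside the result.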
Free data is the project's central validity caveat, and it is treated as a first-class concern rather than a footnote. Two facts shape what the framework can honestly claim:
- Point-in-time S&P 500 membership reconstructed from public change logs is accurate for recent years and progressively incomplete further back. It is best-effort, not a clean historical index.
- Free sources generally do not serve prices for delisted, acquired, or bankrupt tickers, which are exactly the names a survivorship correction needs.
Because of this, the realistic operating mode is a current-membership
("surviving names") backtest, and its results carry an upward survivorship bias
by construction. The honest product of the data layer is the
survivorship-sensitivity analysis, not a single headline number. See SPEC.md
Section 6.2 and Section 11.
make lint # ruff
make format # black
make typecheck # mypy
make test # pytest
make cov # pytest with coverage
make run # statarb run --config configs/baseline.yaml
Continuous integration runs lint, format check, type check, and the test suite on Python 3.10 through 3.12, plus a nightly synthetic smoke backtest. New math modules require unit tests; the no-look-ahead guarantee and the neutrality constraints have dedicated tests and should stay green.
Avellaneda, M. and Lee, J. (2010). Statistical Arbitrage in the US Equities Market. Quantitative Finance, 10(7), 761-782.
Bailey, D. and Lopez de Prado, M. (2014). The Deflated Sharpe Ratio. Journal of Portfolio Management.
Laloux, L., Cizeau, P., Bouchaud, J.-P. and Potters, M. (1999). Noise Dressing of Financial Correlation Matrices. Physical Review Letters.
MIT. See LICENSE.

