The hardest part of empirical asset pricing is not the factor model — it is the data. A brilliant strategy on flawed data produces confident nonsense. This register enumerates every known way a homemade equity database can be academically devalued, and the concrete countermeasure HyperDB applies. It is the project's central quality argument.
Status legend:
Designed = specified in the pipeline architecture; Verified = checked by the benchmark suite on a built database (post-build; no claims are made until then).
T1 — Ticker reuse over time. A delisted ticker is later reassigned to a different
company (e.g., GM: old General Motors → bankruptcy 2009; new GM → IPO 2010). Naively keyed
by (exchange,ticker), two distinct firms collapse into one fictitious 35-year series.
Devalues: fabricated continuity, contaminated long-run returns, broken survivorship logic.
Countermeasure: permanent surrogate identity (entity_id) anchored on ISIN + listing
intervals + corporate-action linkage; a trading gap followed by a new ISIN ⇒ new entity. See
IDENTITY.md. Verify: the GM and MCI anchor reuse cases each resolve to ≥2 entities. Designed
T2 — Ticker change for the same firm. Same company, new ticker (FB→META 2022; GOOG→GOOGL).
Naive keying splits one firm into two truncated series.
Countermeasure: ISIN-continuity links listings across ticker changes into one entity_id.
Verify: FB/META and Google class anchors map to single entities. Designed
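The identity rule in T1 and T2 can be sketched as follows. This is a minimal illustration that keys on ISIN alone; it does not reproduce the listing-interval or corporate-action linkage described in IDENTITY.md. The `Listing` class, the `assign_entity_ids` function, and the ISIN strings are hypothetical placeholders, and the dates are approximate.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Listing:
    exchange: str
    ticker: str
    isin: str             # placeholder strings below, not real ISINs
    start: date
    end: Optional[date]   # None means still listed


def assign_entity_ids(listings):
    """Give each listing an entity_id keyed on ISIN, not on (exchange, ticker)."""
    entity_by_isin = {}
    result = {}
    for listing in sorted(listings, key=lambda l: l.start):
        if listing.isin not in entity_by_isin:
            # First time this ISIN is seen: open a new entity.
            entity_by_isin[listing.isin] = len(entity_by_isin) + 1
        result[listing] = entity_by_isin[listing.isin]
    return result


listings = [
    Listing("NYSE", "GM", "ISIN-OLD-GM", date(1990, 1, 2), date(2009, 6, 1)),
    Listing("NYSE", "GM", "ISIN-NEW-GM", date(2010, 11, 18), None),
    Listing("NASDAQ", "FB", "ISIN-META", date(2012, 5, 18), date(2022, 6, 8)),
    Listing("NASDAQ", "META", "ISIN-META", date(2022, 6, 9), None),
]
ids = assign_entity_ids(listings)
```

Under these assumptions the two GM listings receive different entity ids (ticker reuse), while FB and META share one (ticker change).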
T3 — Cross-listings / ADRs double-count. The same economic firm trades on several venues
(home line + ADR). Pooling them inflates the cross-section and double-weights the name.
Countermeasure: one entity_id, many listings; a primary_listing flag (home/most-liquid)
used for cross-sectional sorts. Verify: multi-listing entities flagged; sorts use primary only.
Designed
T4 — Share-class duplication. BRK.A/BRK.B, GOOGL/GOOG are distinct securities of
one company. Treating them as unrelated breaks company-level aggregation; merging them breaks
security-level returns. Countermeasure: separate securities, grouped by a company_id
(PERMCO-analogue). Designed
T5 — Missing/unstable ISIN (emerging markets, pre-2000). When the anchor identifier is absent, entity
resolution degrades. Countermeasure: fallback heuristics (name + domicile + listing interval)
with an explicit confidence flag; low-confidence links surfaced, never hidden. Designed
T6 — Survivorship bias. Excluding dead firms upward-biases returns (failed firms vanish).
Countermeasure: universe ingests active + delisted instruments (already in
universe.py); coverage report shows the delisted share. Verify: delisted count > 0 per
era/exchange. Designed
T7 — Missing delisting returns (Shumway 1997). Performance-related delistings (bankruptcy)
have their final, large negative return omitted far more often than neutral ones ⇒ small-cap
returns overstated. Countermeasure: classify delisting reason; impute missing
performance-delist returns (−30% NYSE/AMEX, −55% Nasdaq) per Shumway & Warther; neutral
delists unpenalized. Verify: small-cap decile return shifts with/without adjustment.
Designed
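A minimal sketch of the imputation rule in T7, using the percentages quoted there. The function names and the `"performance"` reason label are hypothetical, and treating a missing neutral delisting return as 0.0 is an assumption about what "unpenalized" means.

```python
# Imputed returns for missing performance-related delisting returns,
# taken from the figures quoted in T7.
IMPUTED_DELIST_RETURN = {"NYSE": -0.30, "AMEX": -0.30, "NASDAQ": -0.55}


def delisting_return(reported, reason, exchange):
    """Return the delisting return to use for one delisted security."""
    if reported is not None:
        return reported                      # never overwrite observed data
    if reason == "performance":
        return IMPUTED_DELIST_RETURN[exchange]
    return 0.0                               # assumption: neutral delist, no penalty


def final_month_return(last_return, delist_ret):
    """Compound the last traded return with the delisting return."""
    return (1.0 + last_return) * (1.0 + delist_ret) - 1.0
```

For example, a Nasdaq bankruptcy with no reported delisting return is assigned −55%, while a merger delisting with a missing return is left at zero.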
T8 — Retroactive split/dividend adjustment. Vendor adjusted_close is recomputed over the
whole history when a new split occurs ⇒ using it embeds future information.
Countermeasure: reconstruct a point-in-time total-return index from raw close + dividends +
splits known as of each date; keep vendor adjusted only as a cross-check. Verify:
corr(return_pit, return_vendor) > 0.999 on clean names; divergences flagged. Designed
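The point-in-time reconstruction in T8 can be illustrated with a short sketch. It assumes a convention the source does not spell out: `split` is new shares per old share on the ex-date, and both `close` and `dividend` on that date are quoted per post-split share. The field and function names are hypothetical.

```python
def pit_total_returns(bars):
    """Total returns from raw closes, dividends and splits, one date at a time.

    Each return uses only the current and previous bar, so a split that
    happens later never changes a return computed earlier.
    """
    returns = []
    for prev, cur in zip(bars, bars[1:]):
        split = cur.get("split", 1.0)        # new shares per old share
        dividend = cur.get("dividend", 0.0)  # per post-split share
        returns.append(split * (cur["close"] + dividend) / prev["close"] - 1.0)
    return returns


def total_return_index(returns, base=100.0):
    """Chain returns into an index starting at `base`."""
    index = [base]
    for r in returns:
        index.append(index[-1] * (1.0 + r))
    return index


bars = [
    {"close": 400.0},
    {"close": 100.0, "split": 4.0},      # 4:1 split, no economic change
    {"close": 101.0, "dividend": 1.0},   # +1 price, +1 dividend
]
rets = pit_total_returns(bars)
index = total_return_index(rets)
```

In this toy series the split day yields a zero return and the dividend day yields 2%, which is what the vendor-adjusted series should agree with on clean names.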
T9 — Factor release lag (the misalignment bug). Ken French / q-factors are published after
month-end; merging by calendar month without an availability rule mis-dates them by one month —
a well-documented real-world failure (a long-only β flips from ≈ −0.2 to ≈ +1.0 once
corrected). Countermeasure: availability-aware factor alignment + a hard anchor-event test
(COVID US Mkt-RF ≈ −13.35% must land on 2020-03; corr ≥ 0.95 @ lag 0 or the build fails).
Verify: anchor test in benchmarks.py. Designed
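The anchor-event check in T9 might look like the sketch below. The anchor value is the one quoted above; the tolerance, the function names, and every non-anchor number in the toy series are made up for illustration and are not taken from benchmarks.py.

```python
ANCHOR_MONTH = "2020-03"
ANCHOR_VALUE = -13.35   # percent, as quoted in T9
TOLERANCE = 0.5         # hypothetical tolerance, in percentage points


def anchor_ok(factor_by_month):
    """True if the factor series carries the anchor value on the anchor month."""
    value = factor_by_month.get(ANCHOR_MONTH)
    return value is not None and abs(value - ANCHOR_VALUE) <= TOLERANCE


def shift_one_month_late(factor_by_month):
    """Reproduce the bug: stamp each value on the following month."""
    months = sorted(factor_by_month)
    return {months[i + 1]: factor_by_month[months[i]]
            for i in range(len(months) - 1)}


# Only the 2020-03 value comes from the text; the neighbours are invented.
aligned = {"2020-02": -8.0, "2020-03": -13.35, "2020-04": 13.5}
misdated = shift_one_month_late(aligned)
```

A build gate would call `anchor_ok` on the merged factor table and abort when it returns `False`, which is what happens for the `misdated` series here.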
T10 — Market-cap weighting look-ahead. Value-weighting by contemporaneous mv[t]
over-weights within-month winners (corr(mv[t],ret[t]) ≈ 0.85). Countermeasure: all weights
use beginning-of-month mv[t-1]. Designed
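A two-stock toy example shows why the choice of weight date in T10 matters. The function name and the numbers are illustrative only.

```python
def value_weighted_return(returns, market_caps):
    """Value-weighted return over names with a positive market cap."""
    names = [n for n in returns if market_caps.get(n, 0.0) > 0.0]
    if not names:
        raise ValueError("no names with positive market cap")
    total = sum(market_caps[n] for n in names)
    return sum(market_caps[n] / total * returns[n] for n in names)


ret_t = {"A": 0.5, "B": -0.5}
mv_prev = {"A": 100.0, "B": 100.0}                          # mv[t-1]
mv_now = {n: mv_prev[n] * (1.0 + ret_t[n]) for n in ret_t}  # mv[t]: 150 and 50

correct = value_weighted_return(ret_t, mv_prev)  # beginning-of-month weights
biased = value_weighted_return(ret_t, mv_now)    # contemporaneous weights
```

With beginning-of-month weights the portfolio return is 0%; weighting by end-of-month caps reports +25% because the winner has already grown into a larger weight.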
T11 — Fundamentals reporting lag & restatements. Using a financial statement before its
filing date, or using restated (not as-reported) figures, is look-ahead.
Countermeasure: lag quarterly +3m / annual +6m (fiscal-year-end-aware where known); store
as-reported with report/filing dates; vintaged so restatements don't leak backward. Designed
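The fixed reporting lag in T11 can be sketched with the standard library alone. This version applies only the +3m/+6m rule to the period end date; using the actual filing date where it is known, and vintaging restatements, are not shown. All names are hypothetical.

```python
import calendar
from datetime import date

LAG_MONTHS = {"quarterly": 3, "annual": 6}   # lags quoted in T11


def add_months(d, months):
    """Shift a date by whole months, clamping to the last day of the month."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def available_from(period_end, frequency):
    """First date on which a statement may be used."""
    return add_months(period_end, LAG_MONTHS[frequency])


def usable(period_end, frequency, as_of):
    """True if the statement is considered public on `as_of`."""
    return available_from(period_end, frequency) <= as_of
```

Under this rule an annual statement for fiscal year-end 2020-12-31 first enters portfolios formed on or after 2021-06-30.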
T12 — Point-in-time index membership. Using today's S&P 500 list historically is
survivorship + look-ahead. Countermeasure: dim_index_membership with start/end intervals
(seeded from monthly index-constituent files); membership is always as-of-date. Designed
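An as-of membership lookup over start/end intervals, as described in T12, could be sketched like this. The tuple layout and the half-open interval convention (start inclusive, end exclusive) are assumptions; the actual schema of dim_index_membership is not given in the source.

```python
from datetime import date


def members_as_of(intervals, as_of):
    """Entities that belong to the index on `as_of`.

    intervals: iterable of (entity_id, start, end), with end=None for
    current members. Start is inclusive, end is exclusive (assumption).
    """
    return {
        entity
        for entity, start, end in intervals
        if start <= as_of and (end is None or as_of < end)
    }


intervals = [
    (101, date(2000, 1, 1), date(2010, 1, 1)),   # removed in 2010
    (102, date(2005, 1, 1), None),
    (103, date(2015, 1, 1), None),               # added in 2015
]
```

A backtest dated 2007-06-30 then sees entities 101 and 102, not today's list of 102 and 103.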
T13 — Bad ticks, reversals, stale & sentinel prices. Data-entry errors (a price and its
reversal), padded/stale prices, and EODHD's 999,999.99 placeholder create spurious extreme
returns. Countermeasure: Ince-Porter dynamic screens (reversal filter, stale-run flag, price
floor) + removal of sentinels in the analysis view, all logged with counts. Verify: screen
impact table. Designed
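A minimal sketch of two of the T13 screens: sentinel removal and a reversal filter. The 300% and 50% thresholds are the values commonly attributed to Ince and Porter (2006) for monthly returns; the source does not state which thresholds HyperDB uses, so they are parameters here and should be read as assumptions, as should the function names.

```python
SENTINEL_PRICE = 999_999.99   # vendor placeholder named in T13


def drop_sentinels(prices):
    """Replace sentinel prices with None (missing)."""
    return [None if p == SENTINEL_PRICE else p for p in prices]


def reversal_filter(returns, big=3.0, net=0.5):
    """Blank out a large return that is reversed in the adjacent period.

    If either of two consecutive returns exceeds `big` and the two compound
    to less than `net`, both are set to None. Returns (screened, n_flagged).
    """
    flagged = set()
    for t in range(1, len(returns)):
        a, b = returns[t - 1], returns[t]
        if a is None or b is None:
            continue
        if (a > big or b > big) and (1.0 + a) * (1.0 + b) - 1.0 < net:
            flagged.update((t - 1, t))
    screened = [None if i in flagged else r for i, r in enumerate(returns)]
    return screened, len(flagged)


screened, n_flagged = reversal_filter([0.01, 9.0, -0.9, 0.02])
```

Returning the count alongside the screened series is one way to feed the screen impact table mentioned under Verify.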
T14 — Corporate-action errors (splits/spinoffs/rights). Missed or mis-dated splits create
fake ±50%+ jumps; spinoffs and rights issues are notoriously mishandled.
Countermeasure: PIT reconstruction catches split mismatches vs vendor; anchor splits
validated (AAPL 4:1 2020-08, 7:1 2014-06); spinoffs flagged as lower-confidence. Designed
T15 — Brutal data gaps / thin cross-sections. Sparse history or thin months silently weaken
inference. Countermeasure: coverage metrics (per-asset trading-day completeness vs real
calendar; cross-sectional breadth per month) with thresholds and a published coverage report;
gaps represented as absent rows, never filled. Verify: coverage gates. Designed
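The two coverage metrics named in T15 are simple ratios and counts. The sketch below assumes trading days are given as ISO date strings and months as `"YYYY-MM"` strings; the function names and any threshold applied to the results are hypothetical.

```python
from collections import defaultdict


def completeness(observed_days, calendar_days):
    """Share of real exchange trading days on which the asset has a row."""
    expected = set(calendar_days)
    if not expected:
        return 0.0
    return len(expected & set(observed_days)) / len(expected)


def breadth_by_month(observations):
    """Number of distinct entities with data in each month.

    observations: iterable of (entity_id, "YYYY-MM") pairs.
    """
    seen = defaultdict(set)
    for entity_id, month in observations:
        seen[month].add(entity_id)
    return {month: len(entities) for month, entities in seen.items()}


cal = ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
obs = ["2020-01-02", "2020-01-06", "2020-01-07"]
share = completeness(obs, cal)
```

Because gaps stay as absent rows, the completeness ratio falls below 1 wherever data is missing instead of being masked by filled values.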
T16 — Backfill bias in vendor data. Vendors backfill history when adding a name; placeholder
zeros masquerade as data (e.g., vendor ESG scores backfilled with placeholder zeros for 38–53% of pre-2010 observations).
Countermeasure: detect placeholder/backfill runs; start-date / zero-contamination analysis
before trusting early history. Designed
T17 — FX timing & redenomination. Mismatched FX stamps or unhandled currency
redenominations (legacy→EUR 1999; many EM redenominations) corrupt USD/EUR returns.
Countermeasure: FX snapped to the same month-end stamp as returns; triangulation residual ≈ 0;
explicit redenomination map. Designed
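The triangulation check in T17 compares a quoted cross rate with the one implied by two other pairs stamped at the same time. The sketch uses EUR, USD, and JPY purely as an example; the quote conventions in the docstring and the sample rates are assumptions.

```python
def triangulation_residual(eur_usd, usd_jpy, eur_jpy):
    """Relative gap between the implied and the quoted EUR/JPY rate.

    eur_usd: USD per 1 EUR, usd_jpy: JPY per 1 USD, eur_jpy: JPY per 1 EUR.
    A value near zero means the three rates share a consistent time stamp.
    """
    return eur_usd * usd_jpy / eur_jpy - 1.0


consistent = triangulation_residual(1.10, 150.0, 165.0)
stale_cross = triangulation_residual(1.10, 150.0, 160.0)   # cross rate from another stamp
```

A residual of about 3%, as in the second call, would indicate that one leg was sampled at a different time or in a different denomination.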
T18 — Trading-calendar inference error. Inferring "non-trading" from absent data mis-marks
market-wide holidays. Countermeasure: real exchange calendars cross-checked against inferred
days. Designed
T19 — Non-synchronous / time-zone effects. Global markets close at different times; "same
date" returns are not contemporaneous across regions (biases global betas/correlations).
Countermeasure: document the local-date convention; apply appropriate lead/lag for
cross-region statistics. Designed
T20 — Microstructure / illiquidity. Zero-volume days, bid-ask bounce, and non-trading all bias
estimates. Countermeasure: liquidity flags (volume, Amihud); non-trading days excluded from return
computation. Designed
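The Amihud flag in T20 refers to the average ratio of absolute return to dollar volume. A minimal sketch, with hypothetical function names, that skips zero-volume and missing days:

```python
def amihud_illiquidity(returns, dollar_volumes):
    """Mean of |return| / dollar volume over days with positive volume."""
    ratios = [
        abs(r) / v
        for r, v in zip(returns, dollar_volumes)
        if r is not None and v is not None and v > 0
    ]
    return sum(ratios) / len(ratios) if ratios else None


def zero_volume_share(volumes):
    """Fraction of days with zero or missing volume."""
    if not volumes:
        return None
    return sum(1 for v in volumes if not v) / len(volumes)


illiq = amihud_illiquidity([0.01, -0.02, 0.0], [1e6, 2e6, 0.0])
```

Higher values mean a given dollar of trading moves the price more; a series with no positive-volume days returns `None` instead of a misleading zero.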
T21 — Data-snooping in the pipeline itself. Tuning screen thresholds until the results look good is itself data snooping.
Countermeasure: every threshold fixed ex ante from the literature, documented in
config; no in-sample tuning. Designed
T22 — Non-reproducibility / vendor revision. Vendor data changes; results can't be
re-derived. Countermeasure: immutable, vintaged Raw + per-response manifests + hash-based
reproducibility check; results pinned to a vintage_id. Verify: re-run yields identical
hashes. Designed
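A hash-based reproducibility check like the one in T22 needs a canonical serialization so that row order and key order do not change the digest. The sketch below uses JSON and SHA-256 from the standard library; the function name and the row layout are hypothetical, and the actual manifest format is not described in the source.

```python
import hashlib
import json


def table_hash(rows):
    """SHA-256 of a list of dict rows, independent of row and key order."""
    # Serialize each row with sorted keys, then sort the rows themselves.
    encoded = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows
    )
    canonical = "\n".join(encoded)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = [{"entity_id": 1, "ret": 0.01}, {"entity_id": 2, "ret": -0.02}]
run_b = [{"ret": -0.02, "entity_id": 2}, {"ret": 0.01, "entity_id": 1}]
run_c = [{"entity_id": 1, "ret": 0.01}, {"entity_id": 2, "ret": -0.03}]
```

Two runs pinned to the same vintage_id should produce equal digests, as `run_a` and `run_b` do here, while any changed value, as in `run_c`, yields a different one.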
T23 — Winsorization / outlier choices. Hidden trimming changes conclusions.
Countermeasure: provide flags, not forced winsorization; document any trimming and offer a
sensitivity in the audit. Designed
| # | Threat | Status |
|---|---|---|
| T1–T5 | Identity / entity resolution (PERMNO problem) | Designed |
| T6–T7 | Survivorship & delisting returns | Designed |
| T8–T12 | Look-ahead / point-in-time | Designed |
| T13–T14 | Data errors & corporate actions | Designed |
| T15–T16 | Coverage, gaps, backfill | Designed |
| T17–T20 | FX, calendar, time-zone, microstructure | Designed |
| T21–T23 | Snooping, reproducibility, winsorization | Designed |
No threat is marked Verified until the database is built and cli.py audit --benchmarks
confirms it on real data. Until then these are designed countermeasures, not claimed results.