An end-to-end event-driven analytics and ML pipeline that automatically collects, classifies, and analyses Apple Inc. events from 6 data sources, joins them to market returns, and predicts 30-day performance outcomes using a Random Forest classifier.
The architecture generalises to operations and manufacturing: replace news feeds with incident logs or shift reports, replace stock alpha with OEE or throughput deviation, and the same pipeline stages apply.
┌─────────────────────────────────────────────────────────────────┐
│ Data Sources (6) │
│ Apple Newsroom · Investor Relations · SEC EDGAR · SEC 8-K │
│ Seeking Alpha RSS · Yahoo Finance RSS · yfinance (prices) │
└────────────────────┬────────────────────────────────────────────┘
│ ~1,000+ records
▼
┌────────────────────────────────────────────────────────────────┐
│ Stage 1 — Collect HTTP client with retry + browser headers│
│ Stage 2 — Classify Rule-based: 10 categories, importance │
│ Stage 3 — NLP Enrich VADER sentiment (title + full content) │
│ Stage 4 — Store SQLite, upsert deduplication by URL │
└────────────────────┬───────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Stage 5 — Market Data AAPL + SPY daily closes (yfinance) │
│ Stage 6 — Label return_1d/5d/30d · alpha_1d/5d/30d │
│ Stage 7 — Analyse Stats by category, source, importance │
│ Stage 8 — ML Model Random Forest · Logistic Regression │
│ Stratified 5-fold CV · time-aware split│
└────────────────────┬───────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Streamlit Dashboard (6 pages) │
│ Overview · Data Explorer · Market Analysis · ML Model │
│ Signal Dashboard · Run Pipeline │
└────────────────────────────────────────────────────────────────┘
| Model | CV F1 (weighted) | Std |
|---|---|---|
| Random Forest | see output/models/results.json after running | - |
| Logistic Regression | see output/models/results.json after running | - |
Run `python -m apple_archive.cli train` to populate results. The model predicts whether AAPL will outperform the S&P 500 by more than 1% over the 30 days following an event (binary classification on alpha_30d).
Screenshots are generated after running the pipeline. Add yours to docs/screenshots/ and reference them here.
| Overview | Market Analysis |
|---|---|
| (run pipeline → screenshot) | (run pipeline → screenshot) |
| Signal Dashboard | ML Model |
|---|---|
| (run pipeline → screenshot) | (run pipeline → screenshot) |
Organisations generate continuous streams of events — product launches, regulatory filings, operational incidents — but lack a systematic way to measure their downstream impact. This pipeline answers: which event types drive the largest performance changes, and can that be predicted before the outcome is known?
Applied here to Apple stock returns, but the same architecture applies to:
- Manufacturing: link machine downtime events to OEE impact
- Operations: connect supplier incidents to delivery performance deviation
- Business analytics: track which initiative categories correlate with KPI movement
| # | Stage | What it does |
|---|---|---|
| 1 | Collect | Scrapes 6 sources; HTTP client uses Chrome headers + exponential backoff to handle 403/429 |
| 2 | Classify | Assigns event_category (10 types), importance (1–5), long_term_view, confidence using keyword rules + SEC form type map |
| 3 | NLP | VADER sentiment analysis on headlines and full article text; produces nlp_compound, nlp_pos, nlp_neg |
| 4 | Store | SQLite with WAL mode; upsert-by-URL ensures no duplicates across pipeline runs |
| 5 | Market data | Downloads AAPL and SPY daily adjusted closes via yfinance; cached to JSON for offline use |
| 6 | Label | Joins each event to 1d, 5d, 30d return windows; calculates alpha = AAPL return − SPY return to isolate company-specific impact |
| 7 | Analyse | Aggregates mean/median return, stdev, win-rate by event category, source, and importance level |
| 8 | ML Model | Trains Random Forest + Logistic Regression; evaluated with stratified 5-fold CV and a time-aware holdout split (train pre-2023, test post-2023) |
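The retry behaviour described for Stage 1 can be sketched as follows. This is a minimal illustration using only the standard library, not the project's http.py: `RetryableError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the base delay, cap, and retry count are assumed values.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for responses worth retrying (e.g. HTTP 403/429)."""


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield exponentially growing delays with full jitter, capped at `cap` seconds."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, sleeping and retrying on RetryableError up to max_retries times."""
    for delay in backoff_delays(max_retries):
        try:
            return fn()
        except RetryableError:
            sleep(delay)
    return fn()  # final attempt: any error now propagates to the caller
```

Passing `sleep` as a parameter keeps the function testable without real waiting; a caller would wrap the actual request in `fn` and raise `RetryableError` on a 403 or 429 status.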
| Feature | Source | Why it was included |
|---|---|---|
| event_category | classifier.py | Primary categorical signal: earnings events behave differently from product launches |
| source | collector | Source credibility varies; SEC filings are more reliable than RSS articles |
| record_type | collector | Filing vs. article vs. press release differ in information content |
| importance | classifier.py | Proxy for event magnitude; high-importance events (10-K, product launches) have larger reactions |
| confidence | classifier.py | Classification certainty; low-confidence records add noise |
| long_term_view_score | classifier.py | Encodes bearish/neutral/bullish as −1/0/+1 for linear models |
| nlp_title_compound | VADER | Headline sentiment captures immediate market framing |
| nlp_compound | VADER | Full-content sentiment; differs from headline when articles are nuanced |
| nlp_pos / nlp_neg | VADER | Separating positive and negative word ratios is more informative than compound alone |
| has_full_content | HTTP enricher | Records with full text are more reliably classified |
| content_length | HTTP enricher | Longer SEC filings tend to carry more substantive disclosures |
Target: alpha_30d direction — whether AAPL outperformed or underperformed the S&P 500 by more than 1% in the 30 days following the event. Raw return was rejected as a target because it conflates Apple-specific signals with general market movement.
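As a rough sketch of how the target could be derived from alpha_30d: the 1% threshold comes from the text above, but the function name `label_outperform` and the choice to fold everything at or below the threshold into class 0 are assumptions, not confirmed details of label_dataset.py or ml_model.py.

```python
from typing import Optional

ALPHA_THRESHOLD = 0.01  # 1% expressed as a fraction


def label_outperform(alpha_30d: Optional[float], threshold: float = ALPHA_THRESHOLD) -> Optional[int]:
    """Return 1 if AAPL beat SPY by more than `threshold`, 0 otherwise.

    Events without a computable alpha (no price data) stay unlabeled (None).
    Assumption: the neutral band and underperformance share class 0.
    """
    if alpha_30d is None:
        return None
    return 1 if alpha_30d > threshold else 0
```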
1. Market-adjusted alpha over raw return
Raw AAPL return is misleading: a 3% gain during a 4% market rally is underperformance. Subtracting SPY return isolates the company-specific signal and meaningfully improves model quality.
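The 3% versus 4% example can be worked through in code. The sketch below assumes prices are held as date-to-close mappings and that a window runs from the first close on or after the event date to the first close on or after the event date plus N calendar days; both conventions, and the names `window_return` and `alpha`, are illustrative assumptions rather than the project's actual labeler.

```python
from bisect import bisect_left
from datetime import date, timedelta
from typing import Optional


def _close_on_or_after(dates: list, closes: dict, day: date) -> Optional[float]:
    """Return the first available close on or after `day`, or None if out of range."""
    i = bisect_left(dates, day)
    return closes[dates[i]] if i < len(dates) else None


def window_return(closes: dict, event_day: date, days: int) -> Optional[float]:
    """Simple return over a window of `days` calendar days starting at the event."""
    dates = sorted(closes)
    start = _close_on_or_after(dates, closes, event_day)
    end = _close_on_or_after(dates, closes, event_day + timedelta(days=days))
    if start is None or end is None:
        return None  # event falls outside the price history
    return end / start - 1.0


def alpha(aapl: dict, spy: dict, event_day: date, days: int) -> Optional[float]:
    """Market-adjusted return: AAPL window return minus SPY window return."""
    r_aapl = window_return(aapl, event_day, days)
    r_spy = window_return(spy, event_day, days)
    if r_aapl is None or r_spy is None:
        return None
    return r_aapl - r_spy


# Made-up prices reproducing the example: AAPL +3%, SPY +4%.
aapl = {date(2024, 1, 2): 100.0, date(2024, 2, 1): 103.0}
spy = {date(2024, 1, 2): 400.0, date(2024, 2, 1): 416.0}
print(alpha(aapl, spy, date(2024, 1, 2), 30))  # approximately -0.01
```

The negative result shows why a positive raw return can still be a negative label once the market move is subtracted.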
2. Time-aware validation alongside random CV
Standard k-fold CV on time-ordered data lets future events leak into training folds, inflating metrics. The pipeline runs both: stratified CV for sample efficiency and a temporal split (pre/post date boundary) for realistic out-of-sample evaluation.
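A temporal split of this kind can be sketched in a few lines. The `published_at` field name is taken from the records table queried elsewhere in this README; the function name `temporal_split`, the dict-based row format, and the sample rows are assumptions for illustration.

```python
from datetime import date


def temporal_split(rows, boundary: date, date_key: str = "published_at"):
    """Split rows so that every training row predates every test row."""
    train, test = [], []
    for row in rows:
        (train if row[date_key] < boundary else test).append(row)
    return train, test


# Made-up events for illustration only.
events = [
    {"published_at": date(2022, 6, 1), "label": 1},
    {"published_at": date(2023, 3, 15), "label": 0},
    {"published_at": date(2021, 11, 9), "label": 0},
]
train, test = temporal_split(events, boundary=date(2023, 1, 1))
print(len(train), len(test))  # 2 1
```

Because the split depends only on the date, no event in the test set can influence a model fitted on the training set, which is the property random k-fold does not guarantee.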
3. Rule-based classification, not ML, for event categories
Event categories are defined by domain knowledge (e.g., SEC form type directly identifies the event). Rules are interpretable, debuggable, and do not require labeled training data. ML is reserved for the prediction task where rules cannot capture non-linear feature interactions.
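To make the idea concrete, here is a minimal sketch of a form-type map combined with keyword rules. The category names, keywords, importance values, and confidence formula are all invented for illustration and are not the contents of classifier.py.

```python
from typing import NamedTuple, Optional


class Classification(NamedTuple):
    event_category: str
    importance: int      # 1 (minor) to 5 (major)
    confidence: float    # 0.0 to 1.0


# Hypothetical rules: an SEC form type identifies the event directly.
FORM_TYPE_MAP = {
    "10-K": ("annual_report", 5),
    "10-Q": ("quarterly_report", 4),
    "8-K": ("material_event", 3),
}

# Hypothetical keyword rules, checked in order: (category, importance, keywords).
KEYWORD_RULES = [
    ("earnings", 4, ("earnings", "quarterly results", "revenue")),
    ("product_launch", 4, ("introduces", "unveils", "launch")),
]


def classify(title: str, form_type: Optional[str] = None) -> Classification:
    """Classify by SEC form type first, then by keyword hits in the title."""
    if form_type in FORM_TYPE_MAP:
        category, importance = FORM_TYPE_MAP[form_type]
        return Classification(category, importance, 1.0)
    text = title.lower()
    for category, importance, keywords in KEYWORD_RULES:
        hits = sum(kw in text for kw in keywords)
        if hits:
            # More matching keywords means higher (capped) confidence.
            return Classification(category, importance, min(1.0, 0.5 + 0.25 * hits))
    return Classification("other", 1, 0.2)
```

Every decision here can be traced to a single table entry, which is the interpretability argument made above.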
4. URL-based deduplication in SQLite
Running the pipeline multiple times would duplicate records without upsert logic. SQLite's INSERT OR REPLACE keyed on URL ensures idempotent runs across incremental updates.
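The idempotent write can be sketched with the standard-library sqlite3 module. The `records` table and `published_at` column names follow the analytical query in this README; the remaining columns and the function names `open_store` and `upsert` are assumptions, not the schema in storage.py.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create a table keyed on URL if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # takes effect for on-disk databases only
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "url TEXT PRIMARY KEY, title TEXT, published_at TEXT)"
    )
    return conn


def upsert(conn: sqlite3.Connection, records: list) -> None:
    """Insert records, replacing any existing row that has the same URL."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO records (url, title, published_at) "
            "VALUES (:url, :title, :published_at)",
            records,
        )


conn = open_store()
row = {"url": "https://example.com/a", "title": "First", "published_at": "2024-01-02"}
upsert(conn, [row])
upsert(conn, [{**row, "title": "Updated"}])  # second run, same URL
print(conn.execute("SELECT COUNT(*), MAX(title) FROM records").fetchone())  # (1, 'Updated')
```

Note that INSERT OR REPLACE deletes and re-inserts the conflicting row, so any column omitted from the insert is reset to its default rather than preserved.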
The pipeline structure is domain-agnostic. The mapping to manufacturing / operations:
| This project | Manufacturing equivalent |
|---|---|
| Apple press releases, RSS feeds | Machine alarm logs, shift handover reports |
| SEC regulatory filings | Quality audit reports, compliance submissions |
| event_category (earnings, product launch) | Incident category (mechanical failure, supplier delay) |
| importance score | Severity level |
| alpha_30d (AAPL vs SPY) | OEE deviation vs. baseline / peer line |
| ML prediction: UP / DOWN | Predict: high-impact vs. low-impact incident |
| Signal Dashboard | Maintenance decision support dashboard |
Analytical queries are in queries/analytics.sql. Examples:
-- Event frequency and sentiment trend by month
SELECT strftime('%Y-%m', published_at) AS month,
event_category,
COUNT(*) AS event_count,
ROUND(AVG(CAST(json_extract(metadata_json, '$.nlp_compound') AS REAL)), 3) AS avg_sentiment
FROM records
WHERE published_at IS NOT NULL
GROUP BY 1, 2
ORDER BY 1, event_count DESC;
Run directly against the SQLite database:
sqlite3 output/apple_archive.sqlite < queries/analytics.sql
git clone https://github.com/SepehrKalantariSol/event-impact-analytics
cd event-impact-analytics
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
Optional: set a descriptive User-Agent for SEC requests (required by SEC guidelines):
export APPLE_ARCHIVE_USER_AGENT="apple-archive-research/0.1 (contact: your@email.com)"
streamlit run app.py
Opens at http://localhost:8501
# Full pipeline — collect fresh data, label, analyse
python -m apple_archive.cli
# Full pipeline with fresh AAPL/SPY price download
python -m apple_archive.cli --refresh-market-data
# Collect more news articles
python -m apple_archive.cli --max-news-pages 50
# Individual stages only (no network calls)
python -m apple_archive.cli train # retrain ML model
python -m apple_archive.cli analyze # recompute return statistics
python -m apple_archive.cli label # re-label events with market returns
apple_long_term_ai/
├── apple_archive/
│ ├── pipeline.py # Orchestrates all 8 stages
│ ├── cli.py # CLI entry point (argparse)
│ ├── config.py # Paths, URLs, RunConfig dataclass
│ ├── models.py # ArchiveRecord dataclass
│ ├── http.py # HTTP client: Chrome headers, retry, jitter
│ ├── newsroom_rss.py # Apple Newsroom RSS + article enricher
│ ├── investor.py # Apple Investor Relations scraper
│ ├── sec_edgar.py # SEC EDGAR submissions metadata
│ ├── sec_8k.py # SEC 8-K press release full-text collector
│ ├── seeking_alpha.py # Seeking Alpha RSS collector
│ ├── yahoo_finance_news.py# Yahoo Finance RSS collector
│ ├── classifier.py # Rule-based classifier: 10 categories
│ ├── nlp.py # VADER sentiment enrichment
│ ├── storage.py # SQLite store with upsert deduplication
│ ├── market_data.py # AAPL + SPY price history agent
│ ├── label_dataset.py # Event → market return labeler
│ ├── analysis.py # Return stats aggregator
│ ├── ml_model.py # Random Forest + Logistic Regression trainer
│ └── utils.py # normalize_date, clean_text, write_jsonl
├── queries/
│ └── analytics.sql # SQL analytical queries
├── tests/
│ ├── test_utils.py
│ └── test_classifier.py
├── app.py # Streamlit GUI (6 pages)
├── requirements.txt
└── .env.example
All generated at runtime — not committed to git.
output/
├── apple_archive.sqlite # SQLite database (WAL mode)
├── normalized/
│ ├── all_records.jsonl
│ ├── newsroom.jsonl
│ ├── investor_relations.jsonl
│ ├── sec_edgar.jsonl
│ └── long_term_outlook.json # Weighted bullish/bearish outlook
├── market_data/
│ ├── aapl_prices.json # AAPL daily adjusted closes
│ └── spy_prices.json # SPY daily adjusted closes
├── dataset/
│ ├── labeled_events.jsonl # ML-ready dataset
│ ├── labeled_events.csv # Same data, CSV for inspection
│ ├── analysis.json # Return stats by category/source
│ └── skipped_no_price.jsonl # Events outside price history range
└── models/
├── random_forest.pkl # Best model (serialised pipeline)
└── results.json # CV scores, feature importances
pytest tests/ -v
Tests cover: date normalisation (5 formats including RFC 2822), text cleaning, event classification for each major category, SEC form type mapping, and long-term outlook aggregation.
- Temporal features: add day-of-week, days-to-earnings-date, and the market volatility index (VIX) as model inputs
- Embedding-based NLP: replace VADER with a fine-tuned FinBERT model for higher sentiment accuracy on financial text
- Data quality monitoring: per-run null rates, parse failure rates, and deduplication counts logged to output/pipeline_quality.json
Research tool only. Not financial advice. Do not use predictions for trading decisions.




