SepehrKalantariSol/event-impact-analytics

Apple Market Intelligence

Python scikit-learn Streamlit SQLite License

An end-to-end event-driven analytics and ML pipeline that automatically collects, classifies, and analyses Apple Inc. events from 6 data sources, joins them to market returns, and predicts 30-day performance outcomes using a Random Forest classifier.

🖥️ Application Preview

Dashboard screenshots: 📊 Overview · 🔎 Data Explorer · 📈 Market Analysis · 🤖 ML Model · ⚙️ Run Pipeline

The architecture generalises to operations and manufacturing: replace news feeds with incident logs or shift reports and stock alpha with OEE or throughput deviation, and the same eight stages apply.


Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Data Sources (6)                                               │
│  Apple Newsroom · Investor Relations · SEC EDGAR · SEC 8-K      │
│  Seeking Alpha RSS · Yahoo Finance RSS · yfinance (prices)      │
└────────────────────┬────────────────────────────────────────────┘
                     │ ~1,000+ records
                     ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 1 — Collect     HTTP client with retry + browser headers│
│  Stage 2 — Classify    Rule-based: 10 categories, importance   │
│  Stage 3 — NLP Enrich  VADER sentiment (title + full content)  │
│  Stage 4 — Store       SQLite, upsert deduplication by URL     │
└────────────────────┬───────────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 5 — Market Data  AAPL + SPY daily closes (yfinance)     │
│  Stage 6 — Label        return_1d/5d/30d · alpha_1d/5d/30d     │
│  Stage 7 — Analyse      Stats by category, source, importance  │
│  Stage 8 — ML Model     Random Forest · Logistic Regression    │
│                         Stratified 5-fold CV · time-aware split│
└────────────────────┬───────────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────────┐
│  Streamlit Dashboard (6 pages)                                 │
│  Overview · Data Explorer · Market Analysis · ML Model         │
│  Signal Dashboard · Run Pipeline                               │
└────────────────────────────────────────────────────────────────┘

Results

| Model | CV F1 (weighted) | Std |
|---|---|---|
| Random Forest | see output/models/results.json after running | |
| Logistic Regression | see output/models/results.json after running | |

Run python -m apple_archive.cli train to populate results. The model predicts whether AAPL will outperform the S&P 500 by more than 1% over 30 days following an event (binary classification on alpha_30d).


Dashboard Screenshots

Screenshots are generated after running the pipeline. Add yours to docs/screenshots/ and reference them here.

| Overview | Market Analysis |
|---|---|
| (run pipeline → screenshot) | (run pipeline → screenshot) |
| Signal Dashboard | ML Model |
| (run pipeline → screenshot) | (run pipeline → screenshot) |

What Problem This Solves

Organisations generate continuous streams of events — product launches, regulatory filings, operational incidents — but lack a systematic way to measure their downstream impact. This pipeline answers: which event types drive the largest performance changes, and can that be predicted before the outcome is known?

Applied here to Apple stock returns, but the same architecture applies to:

  • Manufacturing: link machine downtime events to OEE impact
  • Operations: connect supplier incidents to delivery performance deviation
  • Business analytics: track which initiative categories correlate with KPI movement

Pipeline Stages

| # | Stage | What it does |
|---|---|---|
| 1 | Collect | Scrapes 6 sources; HTTP client uses Chrome headers + exponential backoff to handle 403/429 |
| 2 | Classify | Assigns event_category (10 types), importance (1–5), long_term_view, confidence using keyword rules + SEC form type map |
| 3 | NLP | VADER sentiment analysis on headlines and full article text; produces nlp_compound, nlp_pos, nlp_neg |
| 4 | Store | SQLite with WAL mode; upsert-by-URL ensures no duplicates across pipeline runs |
| 5 | Market data | Downloads AAPL and SPY daily adjusted closes via yfinance; cached to JSON for offline use |
| 6 | Label | Joins each event to 1d, 5d, 30d return windows; calculates alpha = AAPL return − SPY return to isolate company-specific impact |
| 7 | Analyse | Aggregates mean/median return, stdev, win-rate by event category, source, and importance level |
| 8 | ML Model | Trains Random Forest + Logistic Regression; evaluated with stratified 5-fold CV and a time-aware holdout split (train pre-2023, test post-2023) |
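
The time-aware holdout in stage 8 can be sketched in a few lines. This is a minimal illustration, not the code in ml_model.py: the helper name `time_aware_split`, the ISO-formatted `published_at` field, and the exact 2023-01-01 cutoff are assumptions chosen to match the "train pre-2023, test post-2023" description.

```python
from datetime import date


def time_aware_split(events, cutoff=date(2023, 1, 1)):
    """Split events chronologically: train strictly before cutoff, test on or after it.

    Hypothetical helper; assumes each event carries an ISO date string
    (YYYY-MM-DD...) under the key "published_at".
    """
    train, test = [], []
    for event in events:
        published = date.fromisoformat(event["published_at"][:10])
        (train if published < cutoff else test).append(event)
    return train, test


# Illustrative records only, not real pipeline output.
events = [
    {"published_at": "2022-10-27", "event_category": "earnings"},
    {"published_at": "2023-06-05", "event_category": "product_launch"},
    {"published_at": "2022-03-08", "event_category": "product_launch"},
]
train, test = time_aware_split(events)
print(len(train), len(test))  # 2 1
```

Unlike shuffled k-fold, every training event here precedes every test event, so the holdout score reflects how a model fitted on past events might behave on later ones.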

Feature Engineering

| Feature | Source | Why it was included |
|---|---|---|
| event_category | classifier.py | Primary categorical signal; earnings events behave differently from product launches |
| source | collector | Source credibility varies; SEC filings are more reliable than RSS articles |
| record_type | collector | Filings, articles, and press releases differ in information content |
| importance | classifier.py | Proxy for event magnitude; high-importance events (10-K, product launches) tend to draw larger reactions |
| confidence | classifier.py | Classification certainty; low-confidence records add noise |
| long_term_view_score | classifier.py | Encodes bullish/neutral/bearish as +1/0/−1 for linear models |
| nlp_title_compound | VADER | Headline sentiment captures immediate market framing |
| nlp_compound | VADER | Full-content sentiment; can differ from the headline when articles are nuanced |
| nlp_pos / nlp_neg | VADER | Separate positive and negative scores can carry information that the compound score averages away |
| has_full_content | HTTP enricher | Records with full text are more reliably classified |
| content_length | HTTP enricher | Longer SEC filings tend to carry more substantive disclosures |

Target: alpha_30d direction — whether AAPL outperformed or underperformed the S&P 500 by more than 1% in the 30 days following the event. Raw return was rejected as a target because it conflates Apple-specific signals with general market movement.
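
As a rough sketch of how the target could be derived from closing prices, the snippet below computes a window return, subtracts the SPY return to get alpha, and applies the 1% threshold. The function names and the sample prices are hypothetical, and the source does not state how events inside the ±1% band are handled, so labelling them 0 here is an assumption.

```python
def window_return(start_close, end_close):
    """Simple return over a window, e.g. 0.06 for a 6% gain."""
    return end_close / start_close - 1.0


def alpha(aapl_return, spy_return):
    """Market-adjusted return: AAPL return minus SPY return."""
    return aapl_return - spy_return


def label_alpha_30d(alpha_30d, threshold=0.01):
    """Binary target: 1 if AAPL beat SPY by more than the threshold, else 0.

    Assumption: everything at or below the threshold is labelled 0.
    """
    return 1 if alpha_30d > threshold else 0


# Made-up prices: AAPL 150 -> 159 (+6%), SPY 400 -> 412 (+3%).
a30 = alpha(window_return(150.0, 159.0), window_return(400.0, 412.0))
print(round(a30, 4), label_alpha_30d(a30))  # 0.03 1

# A 3% gain during a 4% market rally is underperformance.
print(label_alpha_30d(alpha(0.03, 0.04)))  # 0
```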


Key Technical Decisions

1. Market-adjusted alpha over raw return
Raw AAPL return is misleading: a 3% gain during a 4% market rally is underperformance (an alpha of −1%). Subtracting SPY return isolates the company-specific signal, so the model is not trained on moves driven by the broad market.

2. Time-aware validation alongside random CV
Standard k-fold CV on time-ordered data lets future observations leak into training, which inflates metrics. The pipeline runs both: stratified CV for sample efficiency and a temporal split (pre/post date boundary) for realistic out-of-sample evaluation.

3. Rule-based classification, not ML, for event categories
Event categories are defined by domain knowledge (e.g., the SEC form type directly identifies the event). Rules are interpretable, debuggable, and do not require labeled training data. ML is reserved for the prediction task, where rules cannot capture non-linear feature interactions.

4. URL-based deduplication in SQLite
Without upsert logic, running the pipeline multiple times would duplicate records. SQLite's INSERT OR REPLACE keyed on URL makes runs idempotent across incremental updates.
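
The URL-keyed upsert from decision 4 can be illustrated with Python's built-in sqlite3 module. This is a sketch rather than the schema in storage.py: the `records` table name appears in the SQL examples, but the column set, the `upsert_record` helper, and the example URLs are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the uniqueness constraint on url is what makes
# INSERT OR REPLACE act as an upsert.
conn.execute(
    "CREATE TABLE records (url TEXT PRIMARY KEY, title TEXT, event_category TEXT)"
)


def upsert_record(conn, url, title, event_category):
    """Insert a record, replacing any existing row with the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO records (url, title, event_category) VALUES (?, ?, ?)",
        (url, title, event_category),
    )
    conn.commit()


# Simulate two pipeline runs that both see the same article.
upsert_record(conn, "https://example.com/a", "Old title", "other")
upsert_record(conn, "https://example.com/a", "New title", "earnings")
upsert_record(conn, "https://example.com/b", "Another item", "product_launch")

count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
title = conn.execute(
    "SELECT title FROM records WHERE url = ?", ("https://example.com/a",)
).fetchone()[0]
print(count, title)  # 2 New title
```

INSERT OR REPLACE deletes the conflicting row and inserts a new one, so any column not supplied falls back to its default. When existing values must be preserved, `INSERT ... ON CONFLICT(url) DO UPDATE` is the usual alternative.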


Generalisation to Other Domains

The pipeline structure is domain-agnostic. The mapping to manufacturing / operations:

| This project | Manufacturing equivalent |
|---|---|
| Apple press releases, RSS feeds | Machine alarm logs, shift handover reports |
| SEC regulatory filings | Quality audit reports, compliance submissions |
| event_category (earnings, product launch) | Incident category (mechanical failure, supplier delay) |
| importance score | Severity level |
| alpha_30d (AAPL vs SPY) | OEE deviation vs. baseline / peer line |
| ML prediction: UP / DOWN | Predict: high-impact vs. low-impact incident |
| Signal Dashboard | Maintenance decision support dashboard |

SQL Analytics

Analytical queries are in queries/analytics.sql. Examples:

-- Event frequency and sentiment trend by month
SELECT strftime('%Y-%m', published_at) AS month,
       event_category,
       COUNT(*) AS event_count,
       ROUND(AVG(CAST(json_extract(metadata_json, '$.nlp_compound') AS REAL)), 3) AS avg_sentiment
FROM records
WHERE published_at IS NOT NULL
GROUP BY 1, 2
ORDER BY 1, event_count DESC;

Run directly against the SQLite database:

sqlite3 output/apple_archive.sqlite < queries/analytics.sql

Setup

git clone https://github.com/SepehrKalantariSol/event-impact-analytics
cd event-impact-analytics
python3 -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

Recommended: set a descriptive User-Agent for SEC requests. SEC guidelines require automated clients to identify themselves.

export APPLE_ARCHIVE_USER_AGENT="apple-archive-research/0.1 (contact: your@email.com)"
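
For orientation, here is one way a client could pick up that variable and retry on 403/429 with exponential backoff and jitter. It is a sketch under stated assumptions, not the code in http.py: `build_headers`, `fetch_with_backoff`, the fallback User-Agent string, and the delay parameters are all hypothetical, and `fetch` stands in for whatever HTTP call the project actually makes.

```python
import os
import random
import time

RETRYABLE_STATUSES = {403, 429}


def build_headers():
    """Use APPLE_ARCHIVE_USER_AGENT when set, else an assumed generic fallback."""
    user_agent = os.environ.get("APPLE_ARCHIVE_USER_AGENT", "apple-archive-research/0.1")
    return {"User-Agent": user_agent}


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, headers) until it returns a non-retryable status.

    fetch is any callable returning (status_code, body). The wait doubles
    after each retryable response (1s, 2s, 4s, ...) plus up to 0.5s of jitter.
    """
    headers = build_headers()
    for attempt in range(max_attempts):
        status, body = fetch(url, headers)
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return status, body  # still rate-limited after the final attempt


# Fake transport: two 429 responses, then success. Delays are recorded, not slept.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays = []
status, body = fetch_with_backoff(
    lambda url, headers: next(responses),
    "https://www.sec.gov/",
    sleep=delays.append,
)
print(status, len(delays))  # 200 2
```

Passing `sleep` and `fetch` as parameters keeps the retry logic testable without network access or real waiting.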

Run

Streamlit Dashboard (recommended)

streamlit run app.py

Opens at http://localhost:8501

CLI

# Full pipeline — collect fresh data, label, analyse
python -m apple_archive.cli

# Full pipeline with fresh AAPL/SPY price download
python -m apple_archive.cli --refresh-market-data

# Collect more news articles
python -m apple_archive.cli --max-news-pages 50

# Individual stages only (no network calls)
python -m apple_archive.cli train    # retrain ML model
python -m apple_archive.cli analyze  # recompute return statistics
python -m apple_archive.cli label    # re-label events with market returns

Project Structure

apple_long_term_ai/
├── apple_archive/
│   ├── pipeline.py          # Orchestrates all 8 stages
│   ├── cli.py               # CLI entry point (argparse)
│   ├── config.py            # Paths, URLs, RunConfig dataclass
│   ├── models.py            # ArchiveRecord dataclass
│   ├── http.py              # HTTP client: Chrome headers, retry, jitter
│   ├── newsroom_rss.py      # Apple Newsroom RSS + article enricher
│   ├── investor.py          # Apple Investor Relations scraper
│   ├── sec_edgar.py         # SEC EDGAR submissions metadata
│   ├── sec_8k.py            # SEC 8-K press release full-text collector
│   ├── seeking_alpha.py     # Seeking Alpha RSS collector
│   ├── yahoo_finance_news.py # Yahoo Finance RSS collector
│   ├── classifier.py        # Rule-based classifier: 10 categories
│   ├── nlp.py               # VADER sentiment enrichment
│   ├── storage.py           # SQLite store with upsert deduplication
│   ├── market_data.py       # AAPL + SPY price history agent
│   ├── label_dataset.py     # Event → market return labeler
│   ├── analysis.py          # Return stats aggregator
│   ├── ml_model.py          # Random Forest + Logistic Regression trainer
│   └── utils.py             # normalize_date, clean_text, write_jsonl
├── queries/
│   └── analytics.sql        # SQL analytical queries
├── tests/
│   ├── test_utils.py
│   └── test_classifier.py
├── app.py                   # Streamlit GUI (6 pages)
├── requirements.txt
└── .env.example

Output Files

All generated at runtime — not committed to git.

output/
├── apple_archive.sqlite          # SQLite database (WAL mode)
├── normalized/
│   ├── all_records.jsonl
│   ├── newsroom.jsonl
│   ├── investor_relations.jsonl
│   ├── sec_edgar.jsonl
│   └── long_term_outlook.json    # Weighted bullish/bearish outlook
├── market_data/
│   ├── aapl_prices.json          # AAPL daily adjusted closes
│   └── spy_prices.json           # SPY daily adjusted closes
├── dataset/
│   ├── labeled_events.jsonl      # ML-ready dataset
│   ├── labeled_events.csv        # Same data, CSV for inspection
│   ├── analysis.json             # Return stats by category/source
│   └── skipped_no_price.jsonl    # Events outside price history range
└── models/
    ├── random_forest.pkl         # Best model (serialised pipeline)
    └── results.json              # CV scores, feature importances

Tests

pytest tests/ -v

Tests cover: date normalisation (5 formats including RFC 2822), text cleaning, event classification for each major category, SEC form type mapping, and long-term outlook aggregation.


Future Work

  • Temporal features: day-of-week, days-to-earnings-date, market volatility index (VIX) as additional features
  • Embedding-based NLP: replace VADER with a fine-tuned FinBERT model for higher sentiment accuracy on financial text
  • Data quality monitoring: per-run null rates, parse failure rates, and deduplication counts logged to output/pipeline_quality.json

Disclaimer

Research tool only. Not financial advice. Do not use predictions for trading decisions.
