Apple Market Intelligence

An end-to-end event-driven analytics and ML pipeline that automatically collects, classifies, and analyses Apple Inc. events from 6 data sources, joins them to market returns, and predicts 30-day performance outcomes using a Random Forest classifier.

🖥️ Application Preview

Pages shown: 📊 Overview · 🔎 Data Explorer · 📈 Market Analysis · 🤖 ML Model · ⚙️ Run Pipeline

The architecture generalises to operations and manufacturing: replace the news feeds with incident logs or shift reports, and stock alpha with OEE or throughput deviation, and the same pipeline stages apply.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Data Sources (6)                                               │
│  Apple Newsroom · Investor Relations · SEC EDGAR · SEC 8-K      │
│  Seeking Alpha RSS · Yahoo Finance RSS · yfinance (prices)      │
└────────────────────┬────────────────────────────────────────────┘
                     │ ~1,000+ records
                     ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 1 — Collect     HTTP client with retry + browser headers│
│  Stage 2 — Classify    Rule-based: 10 categories, importance   │
│  Stage 3 — NLP Enrich  VADER sentiment (title + full content)  │
│  Stage 4 — Store       SQLite, upsert deduplication by URL     │
└────────────────────┬───────────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 5 — Market Data  AAPL + SPY daily closes (yfinance)     │
│  Stage 6 — Label        return_1d/5d/30d · alpha_1d/5d/30d     │
│  Stage 7 — Analyse      Stats by category, source, importance  │
│  Stage 8 — ML Model     Random Forest · Logistic Regression    │
│                         Stratified 5-fold CV · time-aware split│
└────────────────────┬───────────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────────┐
│  Streamlit Dashboard (6 pages)                                 │
│  Overview · Data Explorer · Market Analysis · ML Model         │
│  Signal Dashboard · Run Pipeline                               │
└────────────────────────────────────────────────────────────────┘

Results

Model	CV F1 (weighted)	Std
Random Forest	see `output/models/results.json` after running	—
Logistic Regression	see `output/models/results.json` after running	—

Run python -m apple_archive.cli train to populate results. The model predicts whether AAPL will outperform the S&P 500 by more than 1% over 30 days following an event (binary classification on alpha_30d).

Dashboard Screenshots

Screenshots are generated after running the pipeline. Add yours to docs/screenshots/ and reference them here.

Overview	Market Analysis
(run pipeline → screenshot)	(run pipeline → screenshot)

Signal Dashboard	ML Model
(run pipeline → screenshot)	(run pipeline → screenshot)

What Problem This Solves

Organisations generate continuous streams of events — product launches, regulatory filings, operational incidents — but lack a systematic way to measure their downstream impact. This pipeline answers: which event types drive the largest performance changes, and can that be predicted before the outcome is known?

Applied here to Apple stock returns, but the same architecture applies to:

Manufacturing: link machine downtime events to OEE impact
Operations: connect supplier incidents to delivery performance deviation
Business analytics: track which initiative categories correlate with KPI movement

Pipeline Stages

#	Stage	What it does
1	Collect	Scrapes 6 sources; HTTP client uses Chrome headers + exponential backoff to handle 403/429
2	Classify	Assigns `event_category` (10 types), `importance` (1–5), `long_term_view`, `confidence` using keyword rules + SEC form type map
3	NLP	VADER sentiment analysis on headlines and full article text; produces `nlp_compound`, `nlp_pos`, `nlp_neg`
4	Store	SQLite with WAL mode; upsert-by-URL ensures no duplicates across pipeline runs
5	Market data	Downloads AAPL and SPY daily adjusted closes via yfinance; cached to JSON for offline use
6	Label	Joins each event to 1d, 5d, 30d return windows; calculates alpha = AAPL return − SPY return to isolate company-specific impact
7	Analyse	Aggregates mean/median return, stdev, win-rate by event category, source, and importance level
8	ML Model	Trains Random Forest + Logistic Regression; evaluated with stratified 5-fold CV and a time-aware holdout split (train pre-2023, test post-2023)
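The retry behaviour described for Stage 1 can be illustrated with a minimal sketch of exponential backoff with jitter. The names `backoff_delays` and `call_with_retry`, the default delays, and the `fetch` callable are assumptions made for illustration; they are not taken from `http.py`.

```python
import random
import time

RETRYABLE_STATUSES = {403, 429}  # assumption: the statuses named for the Collect stage


def backoff_delays(max_retries=4, base=1.0, cap=30.0, jitter=0.5, rng=random.random):
    """Return one delay (seconds) per retry: base * 2**attempt, capped, plus jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay + rng() * jitter)
    return delays


def call_with_retry(fetch, max_retries=4, sleep=time.sleep, **backoff_kwargs):
    """Call fetch() until it returns a non-retryable status or retries run out.

    `fetch` is a hypothetical zero-argument callable returning (status, body).
    """
    status, body = fetch()
    for delay in backoff_delays(max_retries, **backoff_kwargs):
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)  # wait longer after each failed attempt
        status, body = fetch()
    return status, body
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; the random jitter spreads retries out so repeated runs do not hit a rate-limited server at the same moments.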

Feature Engineering

Feature	Source	Why it was included
`event_category`	classifier.py	Primary categorical signal — earnings events behave differently from product launches
`source`	collector	Source credibility varies; SEC filings are more reliable than RSS articles
`record_type`	collector	Filing vs. article vs. press release differ in information content
`importance`	classifier.py	Proxy for event magnitude; high-importance events (10-K, product launches) have larger reactions
`confidence`	classifier.py	Classification certainty; low-confidence records add noise
`long_term_view_score`	classifier.py	Encodes bearish/neutral/bullish as −1/0/+1 for linear models
`nlp_title_compound`	VADER	Headline sentiment captures immediate market framing
`nlp_compound`	VADER	Full-content sentiment; differs from headline when articles are nuanced
`nlp_pos` / `nlp_neg`	VADER	Separating positive and negative word ratios is more informative than compound alone
`has_full_content`	HTTP enricher	Records with full text are more reliably classified
`content_length`	HTTP enricher	Longer SEC filings tend to carry more substantive disclosures

Target: alpha_30d direction — whether AAPL outperformed or underperformed the S&P 500 by more than 1% in the 30 days following the event. Raw return was rejected as a target because it conflates Apple-specific signals with general market movement.
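As a rough sketch, the binary target could be derived from `alpha_30d` as follows. The function name `alpha_30d_target` is hypothetical, and the sketch assumes that alpha is stored as a fraction (0.01 = 1%) and that the positive class means outperformance by more than the threshold; the labeler in the project may handle the boundary or missing values differently.

```python
def alpha_30d_target(alpha_30d, threshold=0.01):
    """Return 1 if alpha_30d exceeds the threshold, 0 otherwise, None if alpha is missing."""
    if alpha_30d is None:
        return None  # e.g. the event falls outside the price history
    return 1 if alpha_30d > threshold else 0
```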

Key Technical Decisions

1. Market-adjusted alpha over raw return
Raw AAPL return is misleading: a 3% gain during a 4% market rally is underperformance. Subtracting SPY return isolates the company-specific signal, which gives the model a cleaner target than raw return.
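The alpha calculation can be sketched in a few lines. This assumes prices are available as aligned lists of daily closes and that the horizon is counted in list positions (trading days); the source does not say whether the windows are trading or calendar days, and the function names are illustrative.

```python
def window_return(closes, start, horizon):
    """Simple return from index `start` to `start + horizon` in a list of daily closes.

    Returns None when the window runs past the end of the price history.
    """
    end = start + horizon
    if start < 0 or end >= len(closes):
        return None
    return closes[end] / closes[start] - 1.0


def alpha(aapl_closes, spy_closes, start, horizon):
    """Market-adjusted return: AAPL return minus SPY return over the same window."""
    r_aapl = window_return(aapl_closes, start, horizon)
    r_spy = window_return(spy_closes, start, horizon)
    if r_aapl is None or r_spy is None:
        return None
    return r_aapl - r_spy


# Worked example: AAPL gains 3% while SPY gains 4%, so alpha is about -1%.
example_alpha = alpha([100.0, 103.0], [400.0, 416.0], start=0, horizon=1)
```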

2. Time-aware validation alongside random CV
Standard k-fold CV leaks future data into training on time-ordered datasets, inflating metrics. The pipeline runs both: stratified CV for sample efficiency and a temporal split (pre/post date boundary) for realistic out-of-sample evaluation.
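A minimal sketch of the temporal split, assuming each record is a dict with a `published_at` date and using 1 January 2023 as the boundary. The record layout and the function name `temporal_split` are assumptions for illustration, and the stratified CV half of the evaluation is not shown.

```python
from datetime import date


def temporal_split(records, boundary):
    """Split records into (train, test) by event date; test holds dates >= boundary."""
    train = [r for r in records if r["published_at"] < boundary]
    test = [r for r in records if r["published_at"] >= boundary]
    return train, test


events = [
    {"published_at": date(2022, 6, 1), "label": 1},
    {"published_at": date(2023, 3, 15), "label": 0},
    {"published_at": date(2021, 11, 20), "label": 0},
    {"published_at": date(2024, 1, 5), "label": 1},
]
train, test = temporal_split(events, boundary=date(2023, 1, 1))
```

Because every training event precedes every test event, the model is never scored on data older than what it was trained on.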

3. Rule-based classification, not ML, for event categories
Event categories are defined by domain knowledge (e.g., SEC form type directly identifies the event). Rules are interpretable, debuggable, and do not require labeled training data. ML is reserved for the prediction task where rules cannot capture non-linear feature interactions.
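To illustrate the idea, a rule-based classifier might look roughly like the sketch below. The category names, keywords, and confidence values are hypothetical placeholders; the actual ten categories and rules are defined in `classifier.py`.

```python
# Hypothetical rule tables for illustration only.
SEC_FORM_CATEGORY = {
    "10-K": "annual_report",
    "10-Q": "quarterly_report",
    "8-K": "material_event",
}
KEYWORD_RULES = [
    ("earnings", ("earnings", "quarterly results", "revenue")),
    ("product_launch", ("introduces", "unveils", "launch")),
]


def classify(title, form_type=None):
    """Return (category, confidence); a known SEC form type wins over keyword rules."""
    if form_type in SEC_FORM_CATEGORY:
        return SEC_FORM_CATEGORY[form_type], 1.0
    text = title.lower()
    for category, keywords in KEYWORD_RULES:
        hits = sum(1 for kw in keywords if kw in text)
        if hits:
            # More keyword hits give higher confidence, capped at 1.0.
            return category, min(1.0, 0.5 + 0.25 * hits)
    return "other", 0.2
```

Each decision traces back to a single table entry, which is what makes this style of classifier easy to debug.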

4. URL-based deduplication in SQLite
Running the pipeline multiple times would duplicate records without upsert logic. SQLite's INSERT OR REPLACE keyed on URL ensures idempotent runs across incremental updates.
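The upsert pattern can be demonstrated with Python's built-in `sqlite3` module. The three-column schema below is a simplified assumption (the real table in `storage.py` has more fields), and an in-memory database stands in for `output/apple_archive.sqlite`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute(
    "CREATE TABLE records (url TEXT PRIMARY KEY, title TEXT, published_at TEXT)"
)


def upsert(conn, url, title, published_at):
    """Insert a record, replacing any existing row with the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO records (url, title, published_at) VALUES (?, ?, ?)",
        (url, title, published_at),
    )


upsert(conn, "https://example.com/a", "First version", "2024-01-02")
upsert(conn, "https://example.com/a", "Updated title", "2024-01-02")  # same URL: replaces
upsert(conn, "https://example.com/b", "Another event", "2024-01-03")
conn.commit()

rows = conn.execute("SELECT url, title FROM records ORDER BY url").fetchall()
```

After the three calls the table holds two rows, one per URL. Note that `INSERT OR REPLACE` deletes the conflicting row and inserts a new one, so any column not supplied in the statement falls back to its default; `INSERT ... ON CONFLICT(url) DO UPDATE` is an alternative when existing column values should be preserved.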

Generalisation to Other Domains

The pipeline structure is domain-agnostic. The mapping to manufacturing / operations:

This project	Manufacturing equivalent
Apple press releases, RSS feeds	Machine alarm logs, shift handover reports
SEC regulatory filings	Quality audit reports, compliance submissions
`event_category` (earnings, product launch)	Incident category (mechanical failure, supplier delay)
`importance` score	Severity level
`alpha_30d` (AAPL vs SPY)	OEE deviation vs. baseline / peer line
ML prediction: UP / DOWN	Predict: high-impact vs. low-impact incident
Signal Dashboard	Maintenance decision support dashboard

SQL Analytics

Analytical queries are in queries/analytics.sql. Examples:

-- Event frequency and sentiment trend by month
SELECT strftime('%Y-%m', published_at) AS month,
       event_category,
       COUNT(*) AS event_count,
       ROUND(AVG(CAST(json_extract(metadata_json, '$.nlp_compound') AS REAL)), 3) AS avg_sentiment
FROM records
WHERE published_at IS NOT NULL
GROUP BY 1, 2
ORDER BY 1, event_count DESC;

Run directly against the SQLite database:

sqlite3 output/apple_archive.sqlite < queries/analytics.sql

Setup

git clone https://github.com/SepehrKalantariSol/event-impact-analytics
cd event-impact-analytics
python3 -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

Recommended: set a descriptive User-Agent for SEC requests. The pipeline does not require it, but SEC guidelines ask automated clients to identify themselves:

export APPLE_ARCHIVE_USER_AGENT="apple-archive-research/0.1 (contact: your@email.com)"

Run

Streamlit Dashboard (recommended)

streamlit run app.py

Opens at http://localhost:8501

CLI

# Full pipeline — collect fresh data, label, analyse
python -m apple_archive.cli

# Full pipeline with fresh AAPL/SPY price download
python -m apple_archive.cli --refresh-market-data

# Collect more news articles
python -m apple_archive.cli --max-news-pages 50

# Individual stages only (no network calls)
python -m apple_archive.cli train    # retrain ML model
python -m apple_archive.cli analyze  # recompute return statistics
python -m apple_archive.cli label    # re-label events with market returns

Project Structure

apple_long_term_ai/
├── apple_archive/
│   ├── pipeline.py          # Orchestrates all 8 stages
│   ├── cli.py               # CLI entry point (argparse)
│   ├── config.py            # Paths, URLs, RunConfig dataclass
│   ├── models.py            # ArchiveRecord dataclass
│   ├── http.py              # HTTP client: Chrome headers, retry, jitter
│   ├── newsroom_rss.py      # Apple Newsroom RSS + article enricher
│   ├── investor.py          # Apple Investor Relations scraper
│   ├── sec_edgar.py         # SEC EDGAR submissions metadata
│   ├── sec_8k.py            # SEC 8-K press release full-text collector
│   ├── seeking_alpha.py     # Seeking Alpha RSS collector
│   ├── yahoo_finance_news.py # Yahoo Finance RSS collector
│   ├── classifier.py        # Rule-based classifier: 10 categories
│   ├── nlp.py               # VADER sentiment enrichment
│   ├── storage.py           # SQLite store with upsert deduplication
│   ├── market_data.py       # AAPL + SPY price history agent
│   ├── label_dataset.py     # Event → market return labeler
│   ├── analysis.py          # Return stats aggregator
│   ├── ml_model.py          # Random Forest + Logistic Regression trainer
│   └── utils.py             # normalize_date, clean_text, write_jsonl
├── queries/
│   └── analytics.sql        # SQL analytical queries
├── tests/
│   ├── test_utils.py
│   └── test_classifier.py
├── app.py                   # Streamlit GUI (6 pages)
├── requirements.txt
└── .env.example

Output Files

All generated at runtime — not committed to git.

output/
├── apple_archive.sqlite          # SQLite database (WAL mode)
├── normalized/
│   ├── all_records.jsonl
│   ├── newsroom.jsonl
│   ├── investor_relations.jsonl
│   ├── sec_edgar.jsonl
│   └── long_term_outlook.json    # Weighted bullish/bearish outlook
├── market_data/
│   ├── aapl_prices.json          # AAPL daily adjusted closes
│   └── spy_prices.json           # SPY daily adjusted closes
├── dataset/
│   ├── labeled_events.jsonl      # ML-ready dataset
│   ├── labeled_events.csv        # Same data, CSV for inspection
│   ├── analysis.json             # Return stats by category/source
│   └── skipped_no_price.jsonl    # Events outside price history range
└── models/
    ├── random_forest.pkl         # Best model (serialised pipeline)
    └── results.json              # CV scores, feature importances

Tests

pytest tests/ -v

Tests cover: date normalisation (5 formats including RFC 2822), text cleaning, event classification for each major category, SEC form type mapping, and long-term outlook aggregation.

Future Work

Temporal features: day-of-week, days-to-earnings-date, market volatility index (VIX) as additional features
Embedding-based NLP: replace VADER with a fine-tuned FinBERT model for higher sentiment accuracy on financial text
Data quality monitoring: per-run null rates, parse failure rates, and deduplication counts logged to output/pipeline_quality.json

Disclaimer

Research tool only. Not financial advice. Do not use predictions for trading decisions.
