Important Note: This project is still a WORK IN PROGRESS and is updated and enhanced regularly. For more details, see the ToDos below. Because it also serves as a portfolio project for my career as an Analytics Engineer, the focus is on the end-to-end data pipeline. For transparency: some parts are vibecoded, especially the HTML and CSS.
Intro: An end-to-end data pipeline project that extracts, transforms, and analyzes fitness data from Strava and Whoop APIs. This project demonstrates a complete ELT (Extract, Load, Transform) workflow using modern data engineering tools, with interactive analytics dashboards and a simple MVP for natural language query capabilities.
This repository serves as a learning and experimentation platform for data engineering, combining:
- API Integration: Automated data extraction from Strava and Whoop
- Data Transformation: Multi-stage dbt pipeline with staging, intermediate, and metrics layers
- Analytics Applications: Interactive dashboards and natural language query interfaces
- Tech Stack: DuckDB, dbt, FastAPI, and more
ToDos:
- Improve LLM with Whoop Data: Add Whoop data such as heart rate.
- Alerting and Testing: Set up more thorough testing, with alerting via Slack.
- Orchestration: Run all Python scripts and dbt runs via Airflow.
- Docker: Dockerize the whole project.
The project follows a three-stage ELT pipeline that feeds an analytics layer:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Extract   │ --> │    Load     │ --> │  Transform  │ --> │  Analytics  │
│   (APIs)    │     │  (DuckDB)   │     │    (dbt)    │     │  (FastAPI)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
Note: The pipeline uses incremental data ingestion - only new data is extracted and loaded, avoiding duplicate processing of existing records.
- Extract (1_elt/0_extract/): Python scripts fetch data from Strava and Whoop APIs
- Load (1_elt/1_load/): Raw JSON data is loaded into DuckDB source database (incremental - only new data)
- Transform (1_elt/2_transform/): dbt models transform data through staging → intermediate → metrics layers
- Analytics (2_analytics/): FastAPI applications serve interactive dashboards and queries
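The incremental loading mentioned in the note above can be illustrated with a few lines of Python. This is only a minimal sketch, not the project's actual loader: the function name `select_new_records`, the `id` key, and the in-memory set of already-loaded ids are assumptions made for the example. In the real pipeline the known ids would presumably be read from source.duckdb.

```python
from typing import Any, Iterable


def select_new_records(
    records: Iterable[dict[str, Any]],
    loaded_ids: set[int],
    id_key: str = "id",
) -> list[dict[str, Any]]:
    """Return only the records whose id has not been loaded yet."""
    new_records = []
    seen = set(loaded_ids)  # copy so the caller's set is not mutated
    for record in records:
        record_id = record[id_key]
        if record_id in seen:
            continue  # already loaded, or a duplicate within this batch
        seen.add(record_id)
        new_records.append(record)
    return new_records


# Hypothetical example: two activities are already in the database.
already_loaded = {101, 102}
api_batch = [
    {"id": 101, "name": "Morning Run"},
    {"id": 103, "name": "Evening Ride"},
    {"id": 103, "name": "Evening Ride"},  # duplicate inside the batch
]
new_rows = select_new_records(api_batch, already_loaded)
print([row["id"] for row in new_rows])  # prints [103]
```

Filtering on a stable id keeps re-runs idempotent: running the same extraction twice does not insert the same activity twice.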
The project uses three separate DuckDB databases for separation of concerns:
- source.duckdb: Raw source data and staging views
- transform.duckdb: Intermediate transformation tables
- analytics.duckdb: Final metrics and fact tables for analytics
Strava:
- Activity data (runs, rides, swims, etc.)
- Social interactions (kudos, comments)
- Performance metrics (distance, time, elevation, etc.)
Whoop:
- Sleep data (duration, quality, efficiency)
- Recovery metrics
- Workout data
- Database: DuckDB (analytical database)
- Transformation: dbt (data build tool)
- APIs: FastAPI, uvicorn
- Language: Python 3.11+
- Package Management: uv
- Code Quality: black, sqlfluff, pre-commit
- LLM Integration: Ollama (for local natural language queries)
sport_analytics/
├── 0_data/                    # Data storage
│   ├── database/              # DuckDB databases
│   └── raw/                   # Raw JSON files
│
├── 1_elt/                     # Extract, Load, Transform
│   ├── 0_extract/             # API extraction scripts
│   │   ├── strava/
│   │   └── whoop/
│   ├── 1_load/                # Data loading scripts
│   │   ├── strava/
│   │   └── whoop/
│   └── 2_transform/           # dbt transformation project
│       ├── models/
│       │   ├── staging/       # Staging models (views)
│       │   ├── intermediate/  # Intermediate models (tables)
│       │   └── metrics/       # Final metrics (facts & dimensions)
│       │       ├── dates/     # Date dimension
│       │       ├── semantic/  # Unified / semantic layer
│       │       ├── strava/    # Strava facts & dimensions
│       │       └── whoop/     # Whoop facts & dimensions
│       ├── seeds/             # CSV seed files
│       └── macros/            # Custom dbt macros
│
├── 2_analytics/               # Analytics applications
│   ├── Chat-to-Data/          # Natural language query interface
│   └── sleep_analytics/       # Sleep data dashboard
│
├── config.yml                 # Configuration (create from config_sample.yml)
├── config_sample.yml          # Sample configuration template
└── pyproject.toml             # Python dependencies
- Python 3.11 or higher
- uv package manager
- dbt (installed via uv)
- API credentials for Strava and/or Whoop
- Clone the repository:
  git clone <repository-url>
  cd sport_analytics
- Install dependencies:
  uv sync
- Set up configuration:
  cp config_sample.yml config.yml
  # Edit config.yml with your API credentials
- Set up dbt profiles (if needed):
  cd 1_elt/2_transform
  # Edit profiles.yml with your database paths
Create config.yml from config_sample.yml and add your API credentials:
strava:
  client_id: 'your_client_id'
  client_secret: 'your_client_secret'
  refresh_token: 'your_refresh_token'
whoop:
  client_id: 'your_client_id'
  client_secret: 'your_client_secret'
  access_token: 'your_access_token'
  refresh_token: 'your_refresh_token'
  redirect_url: 'http://localhost:8000/callback'
database:
  path: 0_data/database/source.duckdb
Extract data from APIs:
# Extract Strava data
python 1_elt/0_extract/strava/extract_strava_data.py
# Extract Whoop data
python 1_elt/0_extract/whoop/extract_whoop_data.py
Load raw JSON into DuckDB:
# Load Strava data
python 1_elt/1_load/strava/load_strava_data.py
# Load Whoop data
python 1_elt/1_load/whoop/load_whoop_data.py
Run dbt transformations:
cd 1_elt/2_transform
# Load seeds
dbt seed --target source
# Note: each layer lives in its own DuckDB database, so every step below must be run
# against its matching --target, which currently makes execution somewhat cumbersome.
# Run staging models
dbt run --select 'staging.*' --target source
# Run intermediate models
dbt run --select 'intermediate.*' --target transform
# Run metrics models
dbt run --select 'metrics.*' --target analytics
For detailed dbt instructions, see 1_elt/2_transform/HOW_TO_RUN.md.
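Because every layer has to be run against its own target, the sequence above can be wrapped in a small Python helper. The following is a hedged sketch rather than part of the project: `LAYERS`, `build_dbt_commands`, and `run_pipeline` are hypothetical names, the selector and target pairs are copied from the commands above, and the script is assumed to be started from 1_elt/2_transform.

```python
import subprocess

# (dbt selector, dbt target) per layer, in the order the README runs them.
LAYERS = [
    ("staging.*", "source"),
    ("intermediate.*", "transform"),
    ("metrics.*", "analytics"),
]


def build_dbt_commands() -> list[list[str]]:
    """Build the seed command plus one `dbt run` command per layer."""
    commands = [["dbt", "seed", "--target", "source"]]
    for selector, target in LAYERS:
        commands.append(["dbt", "run", "--select", selector, "--target", target])
    return commands


def run_pipeline(dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in build_dbt_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


run_pipeline(dry_run=True)
```

Passing the arguments as a list means the selectors need no shell quoting. Calling `run_pipeline(dry_run=False)` would execute the four commands in order and raise on the first failure.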
Query your Strava activities using natural language:
# Start the API (from project root)
uv run uvicorn "2_analytics.Chat-to-Data.api:app" --reload
# In another terminal, start the HTML server
cd 2_analytics/Chat-to-Data
python3 -m http.server 5500
# Open http://127.0.0.1:5500 in your browser
Prerequisites for Chat-to-Data:
- Install Ollama: https://ollama.ai/download
- Pull a model: ollama pull llama3.2
- Optional: Set OLLAMA_URL and OLLAMA_MODEL environment variables
For more details, see 2_analytics/Chat-to-Data/how_to_run_chat.txt.
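To make the role of the two optional environment variables more concrete, here is a minimal sketch of how an application could resolve them and prepare a request for Ollama's `/api/generate` endpoint. It is an illustration under assumptions, not the project's code: the helper names `ollama_settings` and `build_generate_request` are hypothetical, `OLLAMA_URL` is assumed to hold a base URL, and the defaults (Ollama's standard local port 11434 and the llama3.2 model from the prerequisites) may differ from what the project's api.py uses.

```python
import json
import os
from collections.abc import Mapping
from typing import Optional

# Assumed defaults; the project's own defaults may differ.
DEFAULT_OLLAMA_URL = "http://localhost:11434"
DEFAULT_OLLAMA_MODEL = "llama3.2"


def ollama_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Resolve the Ollama base URL and model name from the environment."""
    env = os.environ if env is None else env
    return {
        "url": env.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).rstrip("/"),
        "model": env.get("OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL),
    }


def build_generate_request(
    prompt: str, env: Optional[Mapping[str, str]] = None
) -> tuple[str, bytes]:
    """Return the endpoint URL and JSON body for a non-streaming request."""
    settings = ollama_settings(env)
    body = {"model": settings["model"], "prompt": prompt, "stream": False}
    return f"{settings['url']}/api/generate", json.dumps(body).encode("utf-8")


# Hypothetical question; no request is sent here.
endpoint, payload = build_generate_request(
    "How many runs did I log last month?",
    env={"OLLAMA_MODEL": "llama3.2"},
)
print(endpoint)  # prints http://localhost:11434/api/generate
```

The sketch only builds the request. Sending it (for example with `urllib.request`) requires a running Ollama server with the model already pulled.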
Visualize your Whoop sleep data:
# Start the API (from project root)
uv run uvicorn "2_analytics.sleep_analytics.api:app" --reload
# In another terminal, start the HTML server
cd 2_analytics/sleep_analytics
python3 -m http.server 5500
# Open http://127.0.0.1:5500 in your browser
For more details, see 2_analytics/sleep_analytics/how_to_run.txt.
Staging Layer (source.duckdb):
- strava.stg_strava_activities
- whoop.stg_whoop_sleeps
- whoop.stg_whoop_workouts
Intermediate Layer (transform.duckdb):
- strava.int_strava_activities
- whoop.int_whoop_sleep
- whoop.int_whoop_workouts
Metrics Layer (analytics.duckdb):
- Strava: fct_strava_activities, fct_strava_activities_socials, dim_strava_activity_type, dim_strava_gear
- Whoop: fct_whoop_sleeps, fct_whoop_sleep_quality, fct_whoop_workouts, dim_whoop_workouts
- Semantic: fct_activities (unified activities), data_check (data availability per date)
- Dates: dim_dates (date dimension)
Run dbt tests:
cd 1_elt/2_transform
# Test all models
dbt test
# Test specific layer
dbt test --select 'staging.*' --target source
dbt test --select 'intermediate.*' --target transform
dbt test --select 'metrics.*' --target analytics
The project uses pre-commit hooks for code quality:
- black: Python code formatting
- sqlfluff: SQL linting and formatting
Hooks run automatically on commit. To run manually:
pre-commit run --all-files
To add a new data source:
- Create extraction script in 1_elt/0_extract/<source>/
- Create loading script in 1_elt/1_load/<source>/
- Add staging models in 1_elt/2_transform/models/staging/<source>/
- Build intermediate and metrics models as needed
To add a new analytics application:
- Create new directory in 2_analytics/
- Add FastAPI application (api.py)
- Add HTML frontend (index.html)
- Document setup in how_to_run.txt
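The checklist for a new analytics application can be scaffolded with a short script. This is a sketch under assumptions, not a tool that ships with the repository: `scaffold_analytics_app` and `APP_FILES` are hypothetical names, and the created files are empty placeholders that still need real content.

```python
from pathlib import Path

# File names taken from the checklist above; contents are left empty.
APP_FILES = ("api.py", "index.html", "how_to_run.txt")


def scaffold_analytics_app(project_root: Path, app_name: str) -> list[Path]:
    """Create 2_analytics/<app_name>/ with empty placeholder files."""
    app_dir = project_root / "2_analytics" / app_name
    app_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for file_name in APP_FILES:
        path = app_dir / file_name
        if not path.exists():  # never overwrite existing work
            path.touch()
            created.append(path)
    return created
```

For example, `scaffold_analytics_app(Path("."), "recovery_analytics")` run from the project root would create the three files for a hypothetical recovery dashboard, and a second call would return an empty list because nothing is overwritten.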
This is a personal learning project, but suggestions and improvements are welcome! Feel free to:
- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Share ideas for new features or data sources
This project is for personal learning and experimentation purposes.
Note: This project is primarily for learning and experimentation with modern data engineering tools and practices.
