A complete data engineering project that builds an ETL pipeline for NYC Uber trip data, transforming raw CSV records into a star-schema dimensional model in BigQuery and exposing analytics through Looker Studio.
Raw CSV (GCS) → Mage Pipeline → BigQuery (Star Schema) → Looker Studio Dashboard
- Ingest — Raw TLC trip data uploaded to Google Cloud Storage
- Transform — Python/Mage pipeline builds dimension and fact tables
- Load — Dimensional model loaded into BigQuery
- Analyze — SQL analytics query joins all dimensions for a flat analytics table
- Visualize — Looker Studio dashboard on top of BigQuery
| Layer | Tool |
|---|---|
| Cloud Storage | Google Cloud Storage |
| Orchestration | Mage.ai |
| Compute | GCP Compute Instance |
| Data Warehouse | BigQuery |
| Visualization | Looker Studio |
| Transformation | Python, pandas |
The raw trip data is transformed into a star schema with the following dimensions:
| Table | Description |
|---|---|
fact_table |
Core trip records with all foreign keys and fare amounts |
datetime_dim |
Pickup/dropoff times with hour, day, month, weekday breakdowns |
passenger_count_dim |
Passenger count per trip |
trip_distance_dim |
Trip distance per trip |
rate_code_dim |
Rate code (Standard, JFK, Newark, Negotiated, etc.) |
pickup_location_dim |
Pickup latitude/longitude |
dropoff_location_dim |
Dropoff latitude/longitude |
payment_type_dim |
Payment method (Credit card, Cash, No charge, etc.) |
The final analytics table is built by joining all dimensions to the fact table in BigQuery:
CREATE OR REPLACE TABLE `uber_data_engineering.tbl_analytics` AS (
SELECT
f.trip_id, f.VendorID,
d.tpep_pickup_datetime, d.tpep_dropoff_datetime,
p.passenger_count, t.trip_distance,
r.rate_code_name,
pick.pickup_latitude, pick.pickup_longitude,
drop.dropoff_latitude, drop.dropoff_longitude,
pay.payment_type_name,
f.fare_amount, f.tip_amount, f.tolls_amount, f.total_amount
FROM fact_table f
JOIN datetime_dim d ON f.datetime_id = d.datetime_id
JOIN passenger_count_dim p ON f.passenger_count_id = p.passenger_count_id
JOIN trip_distance_dim t ON f.trip_distance_id = t.trip_distance_id
JOIN rate_code_dim r ON f.rate_code_id = r.rate_code_id
JOIN pickup_location_dim pick ON f.pickup_location_id = pick.pickup_location_id
JOIN dropoff_location_dim drop ON f.dropoff_location_id = drop.dropoff_location_id
JOIN payment_type_dim pay ON f.payment_type_id = pay.payment_type_id
);NYC TLC Yellow Taxi Trip Records — pickup/dropoff times and locations, trip distances, fares, rate types, payment types, and passenger counts.
Gkeri Pepelasi
