Skip to content

dracarrys/uber-ETL-pipeline-data-engineering-project

Repository files navigation

Uber Data Analytics — End-to-End GCP Data Engineering Pipeline

A complete data engineering project that builds an ETL pipeline for NYC Uber trip data, transforming raw CSV records into a star-schema dimensional model in BigQuery and exposing analytics through Looker Studio.

Architecture

Raw CSV (GCS) → Mage Pipeline → BigQuery (Star Schema) → Looker Studio Dashboard
  1. Ingest — Raw TLC trip data uploaded to Google Cloud Storage
  2. Transform — Python/Mage pipeline builds dimension and fact tables
  3. Load — Dimensional model loaded into BigQuery
  4. Analyze — SQL analytics query joins all dimensions for a flat analytics table
  5. Visualize — Looker Studio dashboard on top of BigQuery

Tech Stack

Layer Tool
Cloud Storage Google Cloud Storage
Orchestration Mage.ai
Compute GCP Compute Instance
Data Warehouse BigQuery
Visualization Looker Studio
Transformation Python, pandas

Data Model

The raw trip data is transformed into a star schema with the following dimensions:

Table Description
fact_table Core trip records with all foreign keys and fare amounts
datetime_dim Pickup/dropoff times with hour, day, month, weekday breakdowns
passenger_count_dim Passenger count per trip
trip_distance_dim Trip distance per trip
rate_code_dim Rate code (Standard, JFK, Newark, Negotiated, etc.)
pickup_location_dim Pickup latitude/longitude
dropoff_location_dim Dropoff latitude/longitude
payment_type_dim Payment method (Credit card, Cash, No charge, etc.)

Data Model

Analytics Query

The final analytics table is built by joining all dimensions to the fact table in BigQuery:

CREATE OR REPLACE TABLE `uber_data_engineering.tbl_analytics` AS (
SELECT
  f.trip_id, f.VendorID,
  d.tpep_pickup_datetime, d.tpep_dropoff_datetime,
  p.passenger_count, t.trip_distance,
  r.rate_code_name,
  pick.pickup_latitude, pick.pickup_longitude,
  drop.dropoff_latitude, drop.dropoff_longitude,
  pay.payment_type_name,
  f.fare_amount, f.tip_amount, f.tolls_amount, f.total_amount
FROM fact_table f
JOIN datetime_dim d        ON f.datetime_id = d.datetime_id
JOIN passenger_count_dim p ON f.passenger_count_id = p.passenger_count_id
JOIN trip_distance_dim t   ON f.trip_distance_id = t.trip_distance_id
JOIN rate_code_dim r       ON f.rate_code_id = r.rate_code_id
JOIN pickup_location_dim pick   ON f.pickup_location_id = pick.pickup_location_id
JOIN dropoff_location_dim drop  ON f.dropoff_location_id = drop.dropoff_location_id
JOIN payment_type_dim pay       ON f.payment_type_id = pay.payment_type_id
);

Dataset

NYC TLC Yellow Taxi Trip Records — pickup/dropoff times and locations, trip distances, fares, rate types, payment types, and passenger counts.

Author

Gkeri Pepelasi

About

End-to-end GCP data engineering pipeline — NYC Uber trip data transformed into a BigQuery star schema using Mage, with Looker Studio analytics

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors