A data warehouse solution for residential energy consumption analysis, built on HDFS, Apache Spark, and dbt, and deployed with Docker.
This project implements a complete end-to-end data pipeline for analyzing household energy consumption patterns, weather impacts, and appliance efficiency. The solution provides actionable insights for energy management, customer segmentation, and grid optimization.
- ✅ Multi-layer Data Architecture: Staging → Silver → Golden layer transformations
- ✅ Data Quality Framework: Comprehensive validation and quality checks
- ✅ Business Analytics: Pre-computed metrics for dashboards and reports
- ✅ Scalable Infrastructure: Docker-based deployment with HDFS and Hive
- ✅ Automated Exports: CSV snapshots of all data layers for external analysis
Raw Data (CSV) → HDFS (Hive) → Staging Layer → Silver Layer → Dimensions/Facts → Golden Layer → CSV Exports
| Component | Technology | Version |
|---|---|---|
| Storage | HDFS (Hadoop Distributed File System) | 3.2.1 |
| SQL Engine | Apache Spark with Thrift Server | 3.2.0 |
| Transformation | dbt (data build tool) with Spark adapter | 1.7.0 |
| Containerization | Docker & Docker Compose | Latest |
Depi-Graduation-Project/
├── Data/
│ ├── raw/ # Source CSV files
│ └── exports/ # Generated CSV snapshots (Bronze/Silver/Gold)
├── pipeline/ # Infrastructure & Pipeline Code
│ ├── docker-compose.yml # Stack definition (Hadoop, Spark, dbt)
│ ├── dbt_project/ # dbt transformation logic
│ └── scripts/ # Helper scripts (load, export, init)
├── airflow/ # Airflow DAGs (Optional Orchestration)
├── docs/ # Documentation & Submission Files
└── README.md
- Docker and Docker Compose (v2.0+)
- 4GB+ RAM available
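
The prerequisites above can be checked before the stack is started. The following is a minimal sketch and not part of the repository: it assumes a Linux host (it reads `/proc/meminfo`), a `sort` that supports `-V`, and that `docker-compose version --short` prints a bare version number. The function names `version_ge`, `mem_available_ge`, and `preflight` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check; not part of the repository.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds when at least $1 MiB of memory is available (Linux only).
# An optional second argument overrides the meminfo path.
mem_available_ge() {
  local need_mib="$1" meminfo="${2:-/proc/meminfo}" avail_kib
  avail_kib="$(awk '/^MemAvailable:/ {print $2}' "$meminfo")"
  [ -n "$avail_kib" ] && [ "$((avail_kib / 1024))" -ge "$need_mib" ]
}

# Check Docker, Docker Compose (v2.0+), and 4 GB of available RAM.
preflight() {
  local compose_version
  command -v docker >/dev/null || { echo "docker not found" >&2; return 1; }
  compose_version="$(docker-compose version --short 2>/dev/null)" \
    || { echo "docker-compose not found" >&2; return 1; }
  compose_version="${compose_version#v}"   # tolerate a leading "v"
  version_ge "$compose_version" "2.0" \
    || { echo "Docker Compose $compose_version is older than 2.0" >&2; return 1; }
  mem_available_ge 4096 \
    || { echo "less than 4 GB of RAM available" >&2; return 1; }
  echo "prerequisites look fine"
}
```

Calling `preflight` before `docker-compose up -d` would surface a missing tool or low memory early, instead of as a failing container later on.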
- Clone the repository

      git clone https://github.com/shady-2004/Depi-Graduation-Project
      cd Depi-Graduation-Project

- Start the infrastructure

      cd pipeline
      docker-compose up -d

  Wait for services to be healthy:
  - namenode: http://localhost:9870
  - spark-thrift: http://localhost:4040

- Load Raw Data

  Load the source CSVs from `data/raw` into HDFS/Hive:

      # Run from the pipeline directory
      ./scripts/load_raw_data.sh

- Run Transformations (dbt)

  Execute the full transformation pipeline:

      docker exec dbt dbt run --profiles-dir .

- Export Results

  Export the processed tables back to CSVs in `data/exports`:

      ./scripts/export_layers.sh
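
The steps above (after cloning) can be chained into a single wrapper. This is a sketch under stated assumptions, not a script shipped with the project: it assumes it is started from the project root, that `curl` is installed, and that an HTTP response from the two URLs listed above is a good enough proxy for "healthy". The names `run`, `wait_for_url`, `run_pipeline`, and the `DRY_RUN` variable are hypothetical. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the documented steps; not part of the repository.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command, and execute it unless DRY_RUN is 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Poll a URL until it answers with a successful HTTP status.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 5).
wait_for_url() {
  local url="$1" attempts="${2:-30}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf -o /dev/null "$url"; then return 0; fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# Run the documented steps in order, starting from the project root.
run_pipeline() {
  run cd pipeline
  run docker-compose up -d
  if [ "$DRY_RUN" != "1" ]; then
    wait_for_url http://localhost:9870   # namenode
    wait_for_url http://localhost:4040   # spark-thrift
  fi
  run ./scripts/load_raw_data.sh
  run docker exec dbt dbt run --profiles-dir .
  run ./scripts/export_layers.sh
}

run_pipeline
```

Saved under a name such as `run_all.sh` (hypothetical), `DRY_RUN=0 ./run_all.sh` would execute the steps and stop at the first failing command because of `set -e`.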
The project includes an optional Airflow setup for scheduling and monitoring the pipeline.
- Start Airflow

      # From the project root
      docker-compose -f airflow/docker-compose.airflow.yml up -d

- Access Airflow UI
  - URL: http://localhost:8080
  - User: `airflow`
  - Password: `airflow`

- Trigger Pipeline
  - Enable the `energy_dwh_etl_pipeline` DAG to start the daily schedule.
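
Enabling the DAG can also be scripted instead of done in the UI. The sketch below is an assumption-heavy illustration: it presumes Airflow 2.x with the stable REST API reachable at `/api/v1` and the basic-auth API backend enabled, neither of which is confirmed by this project. The helper names `dag_endpoint` and `unpause_dag` and the environment variables are hypothetical; the URL, credentials, and DAG id are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper; assumes Airflow 2.x with the basic-auth API backend enabled.
set -eu

AIRFLOW_URL="${AIRFLOW_URL:-http://localhost:8080}"
AIRFLOW_USER="${AIRFLOW_USER:-airflow}"
AIRFLOW_PASSWORD="${AIRFLOW_PASSWORD:-airflow}"
DAG_ID="${DAG_ID:-energy_dwh_etl_pipeline}"

# Build the REST endpoint for a DAG id.
dag_endpoint() {
  printf '%s/api/v1/dags/%s\n' "$AIRFLOW_URL" "$1"
}

# Unpause a DAG so that its schedule starts running.
unpause_dag() {
  curl -sf -X PATCH \
    -u "$AIRFLOW_USER:$AIRFLOW_PASSWORD" \
    -H 'Content-Type: application/json' \
    -d '{"is_paused": false}' \
    "$(dag_endpoint "$1")"
}
```

With Airflow running, `unpause_dag "$DAG_ID"` would have the same effect as switching the DAG on in the UI. If the API rejects the request, the UI toggle remains the documented path.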
- Staging Layer (Bronze): Raw data ingestion with minimal transformation.
- Silver Layer: Cleaned, validated, and enriched data with quality checks.
- Golden Layer: Pre-aggregated business metrics ready for dashboards.
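
A plain `dbt run` builds all three layers in dependency order. When one layer needs to be rebuilt or debugged on its own, dbt's `path:` selector can restrict a run to one model folder. The sketch below assumes the models live in `models/staging`, `models/silver`, and `models/golden` inside the dbt project; these folder names are guesses, as are the names `build_layer`, `build_all_layers`, `LAYERS`, and `DRY_RUN`. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical layer-by-layer build; the folder names are assumptions.
set -eu

DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands
LAYERS="${LAYERS:-staging silver golden}"   # assumed folders under models/

# Print the dbt command for one layer, and execute it unless DRY_RUN is 1.
build_layer() {
  echo "+ docker exec dbt dbt run --select path:models/$1 --profiles-dir ."
  if [ "$DRY_RUN" != "1" ]; then
    docker exec dbt dbt run --select "path:models/$1" --profiles-dir .
  fi
}

# Build the layers in the listed order, stopping at the first failure.
build_all_layers() {
  for layer in $LAYERS; do
    build_layer "$layer" || return 1
  done
}

build_all_layers
```

Running a single layer, for example `DRY_RUN=0 LAYERS=silver`, rebuilds only that folder and relies on the upstream tables already existing.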
Run the test suite to ensure data quality:

    docker exec dbt dbt test --profiles-dir .

- Analytics Insights: Business use cases and insights.
- Data Pipeline Documentation: Technical details.
- Submission Guide: Checklist for project submission.
Data Engineering Team - DEPI Graduation Project 2024