Energy Data Warehouse - Complete Data Pipeline

A data warehouse for residential energy consumption analysis, built on HDFS, Spark, and dbt, and deployed with Docker.

🎯 Project Overview

This project implements an end-to-end data pipeline for analyzing household energy consumption patterns, weather impacts, and appliance efficiency. Its outputs are intended to support energy management, customer segmentation, and grid optimization.

Key Features

✅ Multi-layer Data Architecture: Staging → Silver → Golden layer transformations
✅ Data Quality Framework: Comprehensive validation and quality checks
✅ Business Analytics: Pre-computed metrics for dashboards and reports
✅ Scalable Infrastructure: Docker-based deployment with HDFS and Hive
✅ Automated Exports: CSV snapshots of all data layers for external analysis

📊 Architecture

Raw Data (CSV) → HDFS (Hive) → Staging Layer → Silver Layer → Dimensions/Facts → Golden Layer → CSV Exports

Technology Stack

| Component | Technology | Version |
| --- | --- | --- |
| Storage | HDFS (Hadoop Distributed File System) | 3.2.1 |
| SQL Engine | Apache Spark with Thrift Server | 3.2.0 |
| Transformation | dbt (data build tool) with Spark adapter | 1.7.0 |
| Containerization | Docker & Docker Compose | Latest |

📁 Project Structure

Depi-Graduation-Project/
├── Data/
│   ├── raw/                # Source CSV files
│   └── exports/            # Generated CSV snapshots (Bronze/Silver/Gold)
├── pipeline/               # Infrastructure & Pipeline Code
│   ├── docker-compose.yml  # Stack definition (Hadoop, Spark, dbt)
│   ├── dbt_project/        # dbt transformation logic
│   └── scripts/            # Helper scripts (load, export, init)
├── airflow/                # Airflow DAGs (Optional Orchestration)
├── docs/                   # Documentation & Submission Files
└── README.md

🚀 Quick Start

Prerequisites

- Docker and Docker Compose (v2.0+)
- 4GB+ RAM available
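
Both prerequisites can be checked before starting the stack. The following is a minimal sketch, assuming the Docker daemon is running. The helper names `has_enough_memory` and `preflight` are illustrative and not part of the project, and reading "4GB+" as 4 GiB is an assumption.

```shell
# Minimum memory in bytes. Reading the "4GB+ RAM" prerequisite as 4 GiB
# is an assumption; lower the value if your host reports slightly less.
MIN_MEM_BYTES=$((4 * 1024 * 1024 * 1024))

# Succeeds when the given byte count meets the minimum.
has_enough_memory() {
  [ "$1" -ge "$MIN_MEM_BYTES" ]
}

# Checks that docker-compose is callable and that the Docker daemon
# sees enough memory. Hypothetical helper, not shipped with the repo.
preflight() {
  docker-compose version >/dev/null 2>&1 || {
    echo "docker-compose not found" >&2
    return 1
  }
  # MemTotal is reported by the daemon in bytes.
  mem="$(docker info --format '{{.MemTotal}}')" || return 1
  has_enough_memory "$mem" || {
    echo "Docker sees less than the required memory" >&2
    return 1
  }
  echo "preflight ok"
}
```

Calling `preflight` prints `preflight ok` when both checks pass and returns a non-zero status otherwise.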

Installation & Setup

Clone the repository

git clone https://github.com/shady-2004/Depi-Graduation-Project
cd Depi-Graduation-Project

Start the infrastructure
```
cd pipeline
docker-compose up -d
```
Wait for services to be healthy:
- namenode: http://localhost:9870
- spark-thrift: http://localhost:4040
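
Waiting can be scripted instead of checking the two URLs by hand. The sketch below polls a URL with `curl` until it answers. It assumes that an HTTP response from the web UI is a good enough readiness signal, which is weaker than a Docker health check. The function name `wait_for_url` and its default attempt count and delay are illustrative.

```shell
# Poll a URL until it answers or the attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f fails on HTTP errors (status 400 and above); -sS hides progress
    # output but still prints error messages.
    if curl -fsS -o /dev/null "$url"; then
      echo "ready: $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out: $url" >&2
  return 1
}

# Example, after docker-compose up -d:
#   wait_for_url http://localhost:9870 && wait_for_url http://localhost:4040
```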

Load Raw Data

Load the source CSVs from data/raw into HDFS/Hive:

# Run from the pipeline directory
./scripts/load_raw_data.sh

Run Transformations (dbt)

Execute the full transformation pipeline:
```
docker exec dbt dbt run --profiles-dir .
```
Export Results

Export the processed tables back to CSVs in data/exports:
```
./scripts/export_layers.sh
```
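
The load, transform, and export commands can also be chained so that a failure stops the run early. This sketch is not a script shipped with the repository. It assumes it is sourced from the pipeline directory with the stack already running, and it also runs the `dbt test` command from the Validation & Testing section before exporting. The `step` helper and the `DRY_RUN` switch are hypothetical additions for previewing the commands.

```shell
# Run a command, or only print it when DRY_RUN=1.
# DRY_RUN is a hypothetical switch introduced for this sketch.
step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Chain the documented steps; && stops at the first failing step.
run_pipeline() {
  step ./scripts/load_raw_data.sh &&
    step docker exec dbt dbt run --profiles-dir . &&
    step docker exec dbt dbt test --profiles-dir . &&
    step ./scripts/export_layers.sh
}
```

Setting `DRY_RUN=1` before calling `run_pipeline` prints the four commands without executing them, which is a cheap way to check the order first.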

🌪️ Orchestration (Airflow)

The project includes an optional Airflow setup for scheduling and monitoring the pipeline.

Start Airflow

# From the project root
docker-compose -f airflow/docker-compose.airflow.yml up -d

Access Airflow UI
- URL: http://localhost:8080
- User: airflow
- Password: airflow

Trigger Pipeline
- Enable the energy_dwh_etl_pipeline DAG to start the daily schedule.
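
The DAG can also be enabled from the command line instead of the UI. The sketch below assumes Airflow 2.x, whose CLI provides `airflow dags unpause` and `airflow dags trigger`. The container name `airflow-webserver` is a guess, not taken from this project; check `docker ps` for the actual name and set `AIRFLOW_CONTAINER` accordingly.

```shell
# Hypothetical container name; replace it with the one shown by `docker ps`.
AIRFLOW_CONTAINER="${AIRFLOW_CONTAINER:-airflow-webserver}"
DAG_ID="energy_dwh_etl_pipeline"

# Enable the DAG so its daily schedule starts running.
unpause_dag() {
  docker exec "$AIRFLOW_CONTAINER" airflow dags unpause "$DAG_ID"
}

# Start one run immediately, independent of the schedule.
trigger_dag() {
  docker exec "$AIRFLOW_CONTAINER" airflow dags trigger "$DAG_ID"
}
```

`unpause_dag` corresponds to the toggle in the UI, while `trigger_dag` is useful for a one-off run without waiting for the next scheduled interval.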

📊 Data Layers

- Staging Layer (Bronze): Raw data ingestion with minimal transformation.
- Silver Layer: Cleaned, validated, and enriched data with quality checks.
- Golden Layer: Pre-aggregated business metrics ready for dashboards.
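
When only one layer has changed, dbt's node selection can rebuild that layer alone. This is a sketch under a loud assumption: the folder names `models/staging`, `models/silver`, and `models/golden` are hypothetical and may not match the layout inside `dbt_project/`. The `--select` flag and the `path:` selector method are standard dbt features.

```shell
# Map a layer name to a dbt selector.
# The model folder names are hypothetical; adjust them to the real layout.
layer_selector() {
  case "$1" in
    staging) echo "path:models/staging" ;;
    silver) echo "path:models/silver" ;;
    golden) echo "path:models/golden" ;;
    *)
      echo "unknown layer: $1" >&2
      return 1
      ;;
  esac
}

# Build a single layer inside the dbt container.
run_layer() {
  selector="$(layer_selector "$1")" || return 1
  docker exec dbt dbt run --profiles-dir . --select "$selector"
}

# Example:
#   run_layer silver
```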

🧪 Validation & Testing

Run the test suite to ensure data quality:

docker exec dbt dbt test --profiles-dir .

📖 Documentation

- Analytics Insights: Business use cases and insights.
- Data Pipeline Documentation: Technical details.
- Submission Guide: Checklist for project submission.

👥 Team

Data Engineering Team - DEPI Graduation Project 2024
