Energy Data Warehouse - Complete Data Pipeline

A data warehouse for residential energy consumption analysis, built on HDFS, Apache Spark, and dbt, and deployed with Docker Compose.

🎯 Project Overview

This project implements an end-to-end data pipeline for analyzing household energy consumption patterns, weather impacts, and appliance efficiency. Its outputs support energy management, customer segmentation, and grid optimization.

Key Features

  • Multi-layer Data Architecture: Staging → Silver → Golden layer transformations
  • Data Quality Framework: Validation and quality checks, run as dbt tests
  • Business Analytics: Pre-computed metrics for dashboards and reports
  • Scalable Infrastructure: Docker-based deployment with HDFS and Hive
  • Automated Exports: CSV snapshots of all data layers for external analysis

📊 Architecture

Raw Data (CSV) → HDFS (Hive) → Staging Layer → Silver Layer → Dimensions/Facts → Golden Layer → CSV Exports

Technology Stack

| Component        | Technology                               | Version |
|------------------|------------------------------------------|---------|
| Storage          | HDFS (Hadoop Distributed File System)    | 3.2.1   |
| SQL Engine       | Apache Spark with Thrift Server          | 3.2.0   |
| Transformation   | dbt (data build tool) with Spark adapter | 1.7.0   |
| Containerization | Docker & Docker Compose                  | Latest  |

📁 Project Structure

Depi-Graduation-Project/
├── Data/
│   ├── raw/                # Source CSV files
│   └── exports/            # Generated CSV snapshots (Bronze/Silver/Gold)
├── pipeline/               # Infrastructure & Pipeline Code
│   ├── docker-compose.yml  # Stack definition (Hadoop, Spark, dbt)
│   ├── dbt_project/        # dbt transformation logic
│   └── scripts/            # Helper scripts (load, export, init)
├── airflow/                # Airflow DAGs (Optional Orchestration)
├── docs/                   # Documentation & Submission Files
└── README.md

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose (v2.0+)
  • 4GB+ RAM available

Installation & Setup

  1. Clone the repository

    git clone https://github.com/shady-2004/Depi-Graduation-Project
    cd Depi-Graduation-Project
  2. Start the infrastructure

    cd pipeline
    docker-compose up -d

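    Wait until all services are up before continuing. Running docker-compose ps from the pipeline directory shows each container's status.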

  3. Load Raw Data

    Load the source CSVs from Data/raw into HDFS/Hive:

    # Run from the pipeline directory
    ./scripts/load_raw_data.sh

  4. Run Transformations (dbt)

    Execute the full transformation pipeline:

    docker exec dbt dbt run --profiles-dir .

  5. Export Results

    Export the processed tables back to CSVs in Data/exports:

    ./scripts/export_layers.sh
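
Steps 3 to 5 can be chained into one fail-fast run once the stack is reachable. The following is a minimal sketch, not part of the repository: the function names retry, warehouse_ready, and run_pipeline are hypothetical, and it assumes the container is named dbt (as in the commands above) and that the script runs from the pipeline directory. It uses dbt debug, which checks the connection to the warehouse, as the readiness probe.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Succeeds once dbt can connect to the SQL engine.
# Assumption: the container is named "dbt", as in the steps above.
warehouse_ready() {
  docker exec dbt dbt debug --profiles-dir . >/dev/null 2>&1
}

# Load, transform, export; stop at the first step that fails.
run_pipeline() {
  retry 30 10 warehouse_ready || return 1
  ./scripts/load_raw_data.sh &&
    docker exec dbt dbt run --profiles-dir . &&
    ./scripts/export_layers.sh
}
```

Sourcing the file and calling run_pipeline would wait up to roughly five minutes (30 attempts, 10 seconds apart) for the warehouse before loading any data. The retry budget is an arbitrary choice; tune it to how long the containers take to start on your machine.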

🌪️ Orchestration (Airflow)

The project includes an optional Airflow setup for scheduling and monitoring the pipeline.

  1. Start Airflow

    # From the project root
    docker-compose -f airflow/docker-compose.airflow.yml up -d
  2. Access Airflow UI

  3. Trigger Pipeline

    • Enable the energy_dwh_etl_pipeline DAG to start the daily schedule.
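
The DAG can also be enabled or triggered from the command line instead of the UI. The sketch below assumes an Airflow 2 CLI (airflow dags unpause and airflow dags trigger) and a compose service that ships it. The service name airflow-webserver is a placeholder, not taken from the repository, so check airflow/docker-compose.airflow.yml for the real one. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions (not from the repository): the compose file defines a service
# with the Airflow 2 CLI installed. "airflow-webserver" is a placeholder name.
AIRFLOW_SERVICE="${AIRFLOW_SERVICE:-airflow-webserver}"
COMPOSE="${COMPOSE:-docker-compose}"
COMPOSE_FILE="airflow/docker-compose.airflow.yml"
DAG_ID="energy_dwh_etl_pipeline"

# Run an Airflow CLI subcommand inside the Airflow service.
# -T disables TTY allocation so the call also works from scripts.
airflow_cli() {
  $COMPOSE -f "$COMPOSE_FILE" exec -T "$AIRFLOW_SERVICE" airflow "$@"
}

# Unpausing lets the scheduler pick up the DAG's daily schedule.
enable_dag() { airflow_cli dags unpause "$DAG_ID"; }

# Triggering queues one run right away; it executes once the DAG is unpaused.
trigger_dag() { airflow_cli dags trigger "$DAG_ID"; }
```

Run from the project root, enable_dag followed by trigger_dag would start a first run without waiting for the next scheduled interval. Setting COMPOSE=echo prints the underlying command instead of executing it, which is a cheap way to check the service name first.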

📊 Data Layers

  1. Staging Layer (Bronze): Raw data ingestion with minimal transformation.
  2. Silver Layer: Cleaned, validated, and enriched data with quality checks.
  3. Golden Layer: Pre-aggregated business metrics ready for dashboards.
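
A plain dbt run already builds models in dependency order, but it can be useful to rebuild one layer at a time when debugging. The sketch below uses dbt's --select option and assumes the models live in folders named staging, silver, and golden inside the dbt project. Those folder names are a guess based on the layer names above, and run_layers is a hypothetical helper.

```shell
#!/usr/bin/env bash
# DBT is the command prefix used in the steps above; override it to dry-run.
DBT="${DBT:-docker exec dbt dbt}"
# Hypothetical model folder names; adjust to the folders in dbt_project.
LAYERS="${LAYERS:-staging silver golden}"

# Build each layer in order and stop at the first failure,
# so a broken upstream layer never feeds a downstream one.
run_layers() {
  local layer
  for layer in $LAYERS; do
    echo "== building $layer =="
    if ! $DBT run --profiles-dir . --select "$layer"; then
      echo "layer $layer failed; skipping downstream layers" >&2
      return 1
    fi
  done
}
```

To rebuild only the reporting tables after a change, LAYERS=golden run_layers would limit the loop to that one layer, assuming the upstream tables already exist.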

🧪 Validation & Testing

Run the test suite to ensure data quality:

docker exec dbt dbt test --profiles-dir .
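
Because dbt test exits with a non-zero status when a test fails, it can act as a gate in front of the export step. The following is a small sketch of that idea, assuming it runs from the pipeline directory; export_if_tests_pass is a hypothetical helper, not a script in the repository.

```shell
#!/usr/bin/env bash
DBT="${DBT:-docker exec dbt dbt}"
EXPORT="${EXPORT:-./scripts/export_layers.sh}"

# Export CSV snapshots only when every dbt test passes.
export_if_tests_pass() {
  if $DBT test --profiles-dir .; then
    $EXPORT
  else
    echo "dbt tests failed; skipping CSV export" >&2
    return 1
  fi
}
```

This keeps snapshots that failed validation out of the exports directory. Tests configured with warn severity do not fail the command, so they would not block the export.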

📖 Documentation

Project documentation and submission files are in the docs/ directory.

👥 Team

Data Engineering Team - DEPI Graduation Project 2024
