This project implements a comprehensive data engineering solution for analyzing motor vehicle collision data from three major US cities: New York, Chicago, and Austin. The solution includes data ingestion, transformation, warehousing, and business intelligence capabilities to provide actionable insights into traffic safety patterns.
- Design and implement a scalable data architecture for traffic collision analysis
- Create a unified data model across multiple city datasets
- Provide business intelligence capabilities for traffic safety insights
- Address data quality issues and standardize heterogeneous data sources
The project integrates crash data from official city data portals:
- New York City - Motor Vehicle Collisions dataset
- Chicago - Traffic Crashes dataset
- Austin - Austin Crash Report dataset
- ETL Platform: Talend Data Integration
- Databases: Azure SQL Server, MySQL, SQL Server
- Data Profiling: Alteryx, YData Profile
- Visualization: Tableau, Power BI
- Data Modeling: Star Schema Dimensional Modeling
- Version Control: Git/GitHub
- Documentation: Markdown, PDF
- Data Formats: CSV, Excel, SQL
- Cloud Platform: Azure (for database hosting)
- Operating System: Cross-platform (Windows, macOS, Linux)
- Data Quality: Built-in Talend data validation
- Data Transformation: Talend Data Mapper
- Metadata Management: Talend Metadata Manager
- Job Orchestration: Talend Job Server
- Monitoring: Talend Administration Center
MegaCollision/
├── TALEND FILES/Job Designs/DAMG7370_GROUP2/
│ ├── metadata/ # Connection configurations
│ ├── process/ # ETL job designs
│ │ ├── Stage_Tables/ # Data staging jobs
│ │ ├── Date_DIM/ # Date dimension processing
│ │ ├── Time_DIM/ # Time dimension processing
│ │ ├── Geo_DIM/ # Geographic dimension processing
│ │ ├── Accident_Fact/ # Fact table processing
│ │ └── References/ # Lookup tables and mappings
│ └── talend.project
├── DDL.sql # Database schema definitions
├── datasets.pdf # Data profiling documentation
└── DOCUMENTATION.pdf # Comprehensive project documentation
- Data Profiling: Comprehensive analysis using Alteryx and YData Profile
- Staging Environment: Azure SQL Server/MySQL staging tables
- Dimensional Model Design: Star schema with facts and dimensions
- ETL Implementation: Talend jobs for staging to integration layer
- Data Validation: Quality checks and row rejection handling
- Business Intelligence: Query development for analysis questions
- Dashboard Creation: Tableau and Power BI visualizations
- Insight Delivery: Actionable business intelligence reports
Important
To view the PowerBI dashboard, could you please request access? I’ll approve it as quickly as possible 🙂
View the live PowerBI dashboard
In the meantime, you can view the exported dashboard here:
PowerBI Dashboard (PPT Export)
The project addresses 10 key traffic safety questions:
- Accident Volume: Total accidents per city
- High-Risk Areas: Top 3 accident-prone areas per city
- Injury Patterns: Accidents resulting in injuries
- Pedestrian Safety: Pedestrian involvement rates
- Seasonal Trends: Temporal accident patterns
- Motorist Impact: Motorist injuries and fatalities
- Fatal Hotspots: Top 5 fatal accident areas
- Time Analysis: Day/time accident patterns
- Fatality Comparison: Pedestrian vs. road user fatalities
- Contributing Factors: Most common accident causes
- Fact Table:
Accident_facts- Core metrics (injuries, deaths, units involved) - Dimension Tables:
Time_DIM- Time-based analysis attributesDate_DIM- Date-specific informationGeo_DIM- Location data (lat/long, street, city, ZIP)SpeedLimit_DIM- Speed limit informationCrash_DIM- Crash-specific detailsContributingFactor_DIM- Accident contributing factorsVehicle_DIM- Vehicle type information
- Junction Tables: Many-to-many relationships for factors and vehicles
- Missing contributing factors → Specific code mapping
- Multiple contributing factor columns → Merged and normalized
- Inconsistent speed limits (0 values) → Updated to -1
- Combined date/time → Split into separate columns
- Multiple contributing factor columns → Merged and normalized
- Vehicle type inconsistencies → Standardized classifications
- Missing units involved → Computed from related columns
- Time format standardization → Added seconds component
- Vehicle type inference → Computed from available data
- Multiple contributing factor columns → Merged and normalized
- Date/time formatting → Split and standardized to yyyy-mm-dd
- Talend Data Integration Studio
- SQL Server/Azure SQL Database
- Tableau Desktop or Power BI
- Clone the repository
- Import Talend project from
TALEND FILES/Job Designs/DAMG7370_GROUP2/ - Configure database connections in metadata folder
- Execute DDL.sql to create database schema
- Run ETL jobs in sequence: Staging → Dimensions → Facts
- Multi-City Data Integration: Unified analysis across NYC, Chicago, and Austin
- Comprehensive Data Quality: Addresses real-world data inconsistencies
- Scalable Architecture: Star schema design for performance and flexibility
- Business Intelligence Ready: Pre-built analysis queries and visualizations
- Documentation: Complete project documentation and data profiling reports
- DDL.sql: Complete database schema definitions
- datasets.pdf: Data profiling and quality analysis
- DOCUMENTATION.pdf: Comprehensive project documentation
- Screenshots: ETL job designs and process flows
This project demonstrates end-to-end data engineering capabilities, from raw data ingestion through to actionable business insights, with robust data quality controls and comprehensive documentation.
