sovunia-hub/crime-big-data-analytics

Crime Analytics

Overview

Crime Analytics analyzes Los Angeles crime data from 2020 onwards. It uses statistical methods and data analysis techniques to identify trends, high-risk groups, and the effectiveness of law enforcement agencies.

Features

  • Data Pipeline Creation: Collecting and processing crime data using MariaDB, Apache Hive, and Apache Kafka.
  • Data Analysis: Examining crime trends, victim demographics, and crime locations.
  • Visualization: Generating interactive charts and graphs to illustrate crime patterns.
  • Crime Prediction: Utilizing machine learning models to predict crime hotspots.

Technologies Used

  • Programming Language: Python
  • Data Storage: MariaDB, Apache Hive, HDFS
  • Data Processing: Apache Spark, PySpark
  • Data Streaming: Apache Kafka, Apache Flume
  • Visualization: Plotly, Matplotlib, Seaborn

Data Sources

The dataset consists of crime reports from the Los Angeles Police Department (LAPD), including:

  • Crime type
  • Date and time of occurrence
  • Victim details (age, gender, descent)
  • Crime location (latitude, longitude)
  • Status of the case
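
The pipeline stores every column as text (`varchar`), so a typed record can make later analysis less error-prone. The sketch below is illustrative only: the `CrimeRecord` class, the `parse_record` function, the underscore-style key names (for example `Vict_Age`), and the date and time formats are assumptions, not part of the project.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class CrimeRecord:
    """Hypothetical typed view of one crime report."""
    crime_desc: str
    occurred_at: datetime
    vict_age: Optional[int]
    vict_sex: str
    vict_descent: str
    lat: Optional[float]
    lon: Optional[float]
    status: str


def _to_float(text: str) -> Optional[float]:
    """Return a float, or None when the field is empty."""
    text = text.strip()
    return float(text) if text else None


def parse_record(row: Dict[str, str]) -> CrimeRecord:
    """Convert a raw row of strings into a typed record."""
    # Assumed formats: Date_Occ such as "03/01/2020 12:00:00 AM" (only the
    # date part is used) and Time_Occ as 24-hour "HHMM", possibly unpadded.
    date_part = row["Date_Occ"].split()[0]
    time_part = row["Time_Occ"].strip().zfill(4)
    occurred_at = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H%M")

    # Assumption: empty or non-positive ages mean "unknown".
    age_text = row["Vict_Age"].strip()
    age = int(age_text) if age_text else None
    if age is not None and age <= 0:
        age = None

    return CrimeRecord(
        crime_desc=row["Crm_Cd_Desc"],
        occurred_at=occurred_at,
        vict_age=age,
        vict_sex=row["Vict_Sex"],
        vict_descent=row["Vict_Descent"],
        lat=_to_float(row["Lat"]),
        lon=_to_float(row["Lon"]),
        status=row["Status"],
    )
```

If the real export uses different date or time formats, only the `strptime` pattern needs to change.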

Workflow Steps

  1. Data Collection & Storage:

    • Extract crime data from LAPD reports.
    • Store raw data in HDFS for processing.
    • Create the table in MariaDB (column names use underscores, since unquoted identifiers cannot contain spaces):
      MariaDB [crimes]> create table crime_data (
         DR_No varchar(100),
         Date_Rptd varchar(100),
         Date_Occ varchar(100),
         Time_Occ varchar(100),
         Area varchar(100),
         Area_Name varchar(100),
         Rpt_Dist_No varchar(100),
         Part varchar(100),
         Crm_Cd varchar(100),
         Crm_Cd_Desc varchar(100),
         Mocodes varchar(100),
         Vict_Age varchar(100),
         Vict_Sex varchar(100),
         Vict_Descent varchar(100),
         Premis_Cd varchar(100),
         Premis_Desc varchar(100),
         Weapon_Used_Cd varchar(100),
         Weapon_Desc varchar(100),
         Status varchar(100),
         Status_Desc varchar(100),
         Crm_Cd_1 varchar(100),
         Crm_Cd_2 varchar(100),
         Crm_Cd_3 varchar(100),
         Crm_Cd_4 varchar(100),
         Location varchar(100),
         Cross_Street varchar(100),
         Lat varchar(100),
         Lon varchar(100)
      );
  2. Data Transfer with Apache Sqoop:

    • Export data from HDFS to MariaDB:
      sqoop export \
         --connect jdbc:mysql://localhost/crimes \
         --username student \
         --password student \
         --export-dir /user/student/lab_data \
         --table crime_data \
         --fields-terminated-by ';'
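
A Sqoop export of delimited text can fail on rows whose field count does not match the target table, so a quick check before exporting may save a failed job. The sketch below is a hypothetical helper, not part of the project: `find_bad_lines` and `EXPECTED_FIELDS` are illustrative names, and it assumes fields are split on a plain `;` with no quoting or escaping, as in the command above.

```python
from typing import Iterable, List, Tuple

# The crime_data table has 28 columns.
EXPECTED_FIELDS = 28


def find_bad_lines(
    lines: Iterable[str],
    delimiter: str = ";",
    expected: int = EXPECTED_FIELDS,
) -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for each line with the wrong width."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        # Plain split: assumes the delimiter never appears inside a field.
        count = len(line.rstrip("\r\n").split(delimiter))
        if count != expected:
            bad.append((line_no, count))
    return bad
```

For a local copy of the file, something like `find_bad_lines(open("crimes.csv", encoding="utf-8"))` would list the offending lines; the file name here is a placeholder.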
  3. Data Transfer to the Spool Directory:

    • Export data from MariaDB to the Flume spool directory using the Python script:
      mariadb_to_spool.py
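
The contents of `mariadb_to_spool.py` are not shown here. As a rough illustration of the idea, the sketch below writes a batch of rows to a spool directory. Everything in it is an assumption: the function name `write_batch_to_spool`, the staging directory, the file naming, and the `;` delimiter. It writes the file elsewhere first and then moves it into place, because Flume's spooling directory source expects files to be complete and unchanged once they appear.

```python
import csv
import os
import time
from pathlib import Path
from typing import Iterable, Sequence


def write_batch_to_spool(
    rows: Iterable[Sequence[str]],
    spool_dir: str,
    staging_dir: str,
    delimiter: str = ";",
) -> Path:
    """Write rows to a staging file, then move it into the spool directory."""
    spool = Path(spool_dir)
    staging = Path(staging_dir)
    spool.mkdir(parents=True, exist_ok=True)
    staging.mkdir(parents=True, exist_ok=True)

    # A nanosecond timestamp keeps file names unique across batches.
    name = f"crimes_{time.time_ns()}.csv"
    tmp_path = staging / name
    with open(tmp_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter=delimiter, lineterminator="\n")
        writer.writerows(rows)

    # os.replace is atomic when both directories are on the same filesystem,
    # so the spool directory never contains a half-written file.
    final_path = spool / name
    os.replace(tmp_path, final_path)
    return final_path
```

The rows could come from any DB-API cursor that reads the `crime_data` table; the database driver is left out so the sketch stays self-contained.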
  4. Real-Time Data Streaming with Apache Kafka & Flume:

    • Create a Kafka topic:
      kafka-topics --create \
         --bootstrap-server localhost:9092 \
         --replication-factor 1 \
         --partitions 1 \
         --topic crime_topic
    • Configure the Flume agent to stream data:
      agent1.sources = src1
      agent1.channels = ch1 ch2
      agent1.sinks = sink1 sink2
       
      agent1.sources.src1.type = spooldir
      agent1.sources.src1.spoolDir = /home/student/spool
       
      agent1.channels.ch1.type = memory
      agent1.channels.ch1.capacity = 10000
      agent1.channels.ch1.transactionCapacity = 100
       
      agent1.channels.ch2.type = memory
      agent1.channels.ch2.capacity = 10000
      agent1.channels.ch2.transactionCapacity = 100
       
      agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
      agent1.sinks.sink1.kafka.bootstrap.servers = localhost:9092
      agent1.sinks.sink1.kafka.topic = crime_topic
      agent1.sinks.sink1.kafka.flumeBatchSize = 5
      agent1.sinks.sink1.channel = ch1
       
      agent1.sinks.sink2.type = logger
      agent1.sinks.sink2.channel = ch2
       
      agent1.sources.src1.channels = ch1 ch2
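
Once Flume publishes each spooled line to `crime_topic`, a consumer has to turn the raw message back into named fields. The sketch below shows one possible way to do that. It assumes each message is a single `;`-delimited line with all 28 columns, that column names follow the `crime_data` table with underscores, and that the third-party `kafka-python` package is installed for the `consume` function. None of this is confirmed by the project; `decode_event` and `consume` are hypothetical names.

```python
from typing import Dict, Iterator

# Column order assumed to match the crime_data table.
COLUMNS = [
    "DR_No", "Date_Rptd", "Date_Occ", "Time_Occ", "Area", "Area_Name",
    "Rpt_Dist_No", "Part", "Crm_Cd", "Crm_Cd_Desc", "Mocodes", "Vict_Age",
    "Vict_Sex", "Vict_Descent", "Premis_Cd", "Premis_Desc", "Weapon_Used_Cd",
    "Weapon_Desc", "Status", "Status_Desc", "Crm_Cd_1", "Crm_Cd_2",
    "Crm_Cd_3", "Crm_Cd_4", "Location", "Cross_Street", "Lat", "Lon",
]


def decode_event(payload: bytes, delimiter: str = ";") -> Dict[str, str]:
    """Map one Kafka message body to a column-name -> value dictionary."""
    fields = payload.decode("utf-8").rstrip("\r\n").split(delimiter)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def consume(
    topic: str = "crime_topic",
    servers: str = "localhost:9092",
) -> Iterator[Dict[str, str]]:
    """Yield decoded events from the topic (needs kafka-python installed)."""
    # Imported here so decode_event stays usable without the package.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        auto_offset_reset="earliest",  # start from the oldest retained message
    )
    for message in consumer:
        yield decode_event(message.value)
```

Raising on a wrong field count is a deliberate choice for a sketch; a production consumer might route such messages to a separate log instead.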
  5. Data Transfer to Hive:

    • Import data from MariaDB to Hive:
      sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
         --connect jdbc:mysql://localhost:3306/crimes \
         --username student \
         --password student \
         --table crime_data \
         --hive-import \
         --hive-table hive_crimes
  6. Analytics:

    • The analysis and visualizations are presented in the file.
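
As a hedged illustration of the kind of aggregation such an analysis involves, the sketch below counts crimes per hour of day and ranks areas by report count. It uses plain Python rather than Spark so it can run anywhere. The function names are hypothetical, and it assumes rows are dictionaries keyed by the table's column names, with `Time_Occ` in 24-hour `HHMM` form.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def crimes_by_hour(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count reports per hour of day (0-23)."""
    counts: Counter = Counter()
    for row in rows:
        # Assumption: Time_Occ is "HHMM", possibly without leading zeros.
        time_occ = row["Time_Occ"].strip().zfill(4)
        counts[int(time_occ[:2])] += 1
    return counts


def top_areas(rows: Iterable[Dict[str, str]], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n areas with the most reports, highest first."""
    return Counter(row["Area_Name"] for row in rows).most_common(n)
```

In PySpark the same logic would typically be a `groupBy` followed by `count`; the plain-Python version only shows the grouping idea.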

Usage

  • Run the data pipeline to collect and store crime data.
  • Use Jupyter Notebook or scripts to analyze crime trends.
  • Generate visualizations to interpret findings.

Results

  • Identification of high-crime areas in Los Angeles.
  • Demographic analysis of victims.
  • Evaluation of crime resolution rates by law enforcement.
  • Time-based crime trends for improved law enforcement planning.
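
One way to approximate the resolution-rate evaluation is to compute, per area, the share of cases that are no longer open. The sketch below is a minimal illustration under stated assumptions: it treats the status code `IC` as "investigation continuing" (an assumption about the dataset's coding), and `resolution_rate_by_area` is a hypothetical name rather than the project's own function.

```python
from collections import defaultdict
from typing import Dict, Iterable


def resolution_rate_by_area(
    rows: Iterable[Dict[str, str]],
    open_status: str = "IC",
) -> Dict[str, float]:
    """Return, per area, the fraction of cases whose status is not open."""
    totals: Dict[str, int] = defaultdict(int)
    resolved: Dict[str, int] = defaultdict(int)
    for row in rows:
        area = row["Area_Name"]
        totals[area] += 1
        # Assumption: any status other than open_status counts as resolved.
        if row["Status"].strip() != open_status:
            resolved[area] += 1
    return {area: resolved[area] / totals[area] for area in totals}
```

Counting every non-open status as resolved is a simplification; a stricter definition could count only arrest-related codes.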
