Real-Time Fraud Detection with MLOps

Overview

This project implements a real-time fraud detection system built on a scalable data engineering and ML pipeline. It simulates a continuous stream of financial transactions, publishes them to Google Cloud Pub/Sub, processes them with PySpark Structured Streaming, enriches them with engineered features, and scores them with supervised and unsupervised machine learning models. Suspicious transactions are published to a Pub/Sub topic (fraud_alerts) for real-time alerting, and the enriched data is persisted to Parquet for offline analysis and model retraining.

Architecture

Data Generator – Produces synthetic transactions (user ID, amount, location, timestamp).
- Publishes messages continuously to Pub/Sub topic transactions.
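
As a rough illustration of the generator, the sketch below builds one synthetic transaction with the four fields named above and publishes it to the transactions topic. The location codes, value ranges, function names, and the publish delay are assumptions made for this example, not values taken from data-generator/main.py.

```python
import json
import random
import time

LOCATIONS = ["NY", "CA", "TX", "FL", "WA"]  # hypothetical location codes


def make_transaction(rng: random.Random) -> dict:
    """Build one synthetic transaction (user ID, amount, location, timestamp)."""
    return {
        "user_id": f"user_{rng.randint(1, 1000)}",     # assumed ID format
        "amount": round(rng.uniform(1.0, 5000.0), 2),  # assumed amount range
        "location": rng.choice(LOCATIONS),
        "timestamp": time.time(),                      # epoch seconds
    }


def encode(txn: dict) -> bytes:
    """Pub/Sub payloads are bytes, so serialize the dict as UTF-8 JSON."""
    return json.dumps(txn).encode("utf-8")


def publish_forever(project_id: str, topic_id: str = "transactions",
                    delay_s: float = 0.5) -> None:
    """Publish transactions continuously.

    Needs the google-cloud-pubsub package and application-default credentials.
    """
    # Imported lazily so the helpers above can be used without the package.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    rng = random.Random()
    while True:
        future = publisher.publish(topic_path, data=encode(make_transaction(rng)))
        future.result()  # block until Pub/Sub has accepted the message
        time.sleep(delay_s)
```

Keeping make_transaction and encode free of any Pub/Sub dependency makes it possible to check the message schema locally before pointing the generator at a real topic.
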
Google Cloud Pub/Sub – Handles real-time ingestion of transaction streams.
- Topics:
  - transactions → incoming raw transactions
  - fraud_alerts → alerts for suspicious transactions
Spark Structured Streaming – Reads from Pub/Sub.
- Parses and transforms raw transactions.
- Builds feature-rich data frames (time features, velocity features, user behavior, etc.).
- Scores transactions using trained ML models.
- Writes results to:
  - Console (for visibility)
  - Parquet (./artifacts/enriched_transactions)
  - Pub/Sub (fraud_alerts) for real-time alerting.
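
The feature-building step can be pictured with a small stand-alone sketch. It is written in plain Python rather than Spark so that it runs anywhere; the streaming job itself would more likely express the same idea with Spark window aggregations. The class name, the feature names, and the 60-second window are assumptions for illustration, and the tracker assumes each user's transactions arrive in timestamp order.

```python
import math
from collections import defaultdict, deque
from datetime import datetime, timezone


class VelocityTracker:
    """Counts each user's transactions inside a trailing time window."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._seen = defaultdict(deque)  # user_id -> timestamps still in window

    def update(self, user_id: str, ts: float) -> int:
        """Record a transaction and return the user's count in the window."""
        q = self._seen[user_id]
        q.append(ts)
        # Evict timestamps that have fallen out of the trailing window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        return len(q)


def enrich(txn: dict, tracker: VelocityTracker) -> dict:
    """Return a copy of txn with time, velocity, and log-amount features."""
    hour = datetime.fromtimestamp(txn["timestamp"], tz=timezone.utc).hour
    return {
        **txn,
        "hour_of_day": hour,                                # time feature
        "txn_count_window": tracker.update(txn["user_id"],  # velocity feature
                                           txn["timestamp"]),
        "log_amount": math.log1p(txn["amount"]),            # log-transform
    }
```

A burst of transactions from one user inside the window raises txn_count_window, which is the kind of signal a velocity feature is meant to expose to the model.
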
Model Training – Implemented in train_model.py.
- Two modes:
  - Supervised (Logistic Regression / XGBoost): requires labeled transactions.
  - Unsupervised (IsolationForest): anomaly detection without labels.
- Features extracted include velocity, rolling window stats, log-transforms, and location frequency.
- Artifacts saved:
  - model.pkl → trained model
  - preprocess_scaler.pkl → feature scaler
  - features.json → exact feature order
  - threshold.json → threshold, mode, calibration metadata
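
One way the four artifacts could be written and read back is sketched below. The function names and the JSON key names (threshold, mode, calibration) are assumptions chosen to mirror the descriptions above; the actual schema used by train_model.py may differ.

```python
import json
import pickle
from pathlib import Path


def save_artifacts(out_dir, model, scaler, feature_names, threshold, mode,
                   calibration=None):
    """Write model.pkl, preprocess_scaler.pkl, features.json, threshold.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(out / "preprocess_scaler.pkl", "wb") as f:
        pickle.dump(scaler, f)
    # A JSON list preserves the exact feature order the model was trained on.
    (out / "features.json").write_text(json.dumps(list(feature_names)))
    meta = {"threshold": threshold, "mode": mode,
            "calibration": calibration or {}}  # assumed key names
    (out / "threshold.json").write_text(json.dumps(meta))


def load_artifacts(out_dir):
    """Load the four artifacts back in the form the scoring job needs."""
    out = Path(out_dir)
    with open(out / "model.pkl", "rb") as f:
        model = pickle.load(f)
    with open(out / "preprocess_scaler.pkl", "rb") as f:
        scaler = pickle.load(f)
    features = json.loads((out / "features.json").read_text())
    meta = json.loads((out / "threshold.json").read_text())
    return model, scaler, features, meta
```

Storing the feature order next to the model matters because the scoring job has to assemble its input columns in exactly the order used at training time. Pickle files should only be loaded from a trusted location, since unpickling can execute arbitrary code.
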
Real-Time Scoring (Spark) – Loads artifacts.
- Normalizes new batch scores consistently (using training min/max).
- Compares against threshold.
- Publishes fraud alerts to Pub/Sub.
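
The normalization and threshold comparison can be sketched as follows. The key names score_min and score_max are hypothetical stand-ins for wherever the training min/max are stored, and clipping to [0, 1] is an assumption about how out-of-range scores are handled.

```python
def normalize_score(raw: float, train_min: float, train_max: float) -> float:
    """Min-max scale a raw score with the training range, clipped to [0, 1]."""
    span = train_max - train_min
    if span <= 0:
        return 0.0  # degenerate training range: no usable scale
    scaled = (raw - train_min) / span
    return min(1.0, max(0.0, scaled))


def is_fraud(raw: float, meta: dict) -> bool:
    """Flag a transaction whose normalized score reaches the threshold."""
    score = normalize_score(raw, meta["score_min"], meta["score_max"])
    return score >= meta["threshold"]
```

Using the training min/max instead of each batch's own range keeps the scale fixed: the same raw score always maps to the same normalized value, so the stored threshold means the same thing in every micro-batch.
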

Project Structure

├── data-generator/ # Pub/Sub publisher generating random transactions
├── spark_streaming/ # Spark Structured Streaming jobs
│ └── main.py # Main streaming job with alerting
├── models/
│ ├── train_model.py # Training pipeline (supervised & unsupervised)
│ ├── model.pkl # Saved trained model
│ ├── preprocess_scaler.pkl
│ ├── features.json
│ └── threshold.json
├── artifacts/ # Output (Parquet, alerts, scored results)
└── streamlit_app/ # Streamlit dashboard for visualization

Setup & Installation

1. Clone Repo

git clone <your-repo-url>
cd <your-repo-name>

2. Authenticate GCP and Set Project

gcloud auth application-default login
gcloud config set project <your-gcp-project-id>

3. Create Pub/Sub Topics

gcloud pubsub topics create transactions
gcloud pubsub topics create fraud_alerts

4. Start Data Generator

cd data-generator
python3 main.py

5. Run Spark Streaming (with Pub/Sub connector) from the repository root, in a separate terminal

python3 -m spark_streaming.main

6. Pull Alerts from Pub/Sub

gcloud pubsub subscriptions create fraud_alerts-sub --topic=fraud_alerts
gcloud pubsub subscriptions pull fraud_alerts-sub --auto-ack --limit=5

Note: a Pub/Sub subscription only receives messages published after it is created, so create fraud_alerts-sub before starting the streaming job if alerts from the beginning of the run are needed.
