Deepfakes Detection

🇰🇷 Korean Version | 🇯🇵 Japanese Version | 📈 Model Evaluation | 🔮 Try Demo

📌 Contents

💡 Install & Requirements
🛠 SetUp
📚 DeepFake Video BenchMark Datasets — Overview of Celeb-DF-v2, FF++, and KoDF datasets used for training.
⚙️ Data Preparation — Efficient face detection and landmark extraction pipeline using YOLOv8
🏗 Model Architecture — Detailed look into our hybrid CNN-ViT (MS-EffViT & MS-EffGCViT) designs.
🧬 Model Zoo — Comparison of model variants, parameter counts, and computational complexity (FLOPs).
🚀 Training - Step-by-step training scripts with Google Colab and W&B experiment tracking
📈 Model Evaluation - Benchmarking results
💻 Model Usage - How to integrate DeepGuard models into your own Python code or via timm
🔮 Predict Image & Video - Simple inference examples for detecting deepfakes in images and videos
🎨 DeepFake AI Explainability - Visualizing model focus using Grad-CAM and attention maps
📓 Tutorials - Hands-on Colab notebooks for inference and dual-branch XAI visualization
📬 Authors
📝 Reference
⚖️ License

💡 Install & Requirements

To install requirements:

pip install -r requirements.txt

🛠 SetUp

Clone the repository and move into it:

git clone https://github.com/HanMoonSub/DeepGuard.git

cd DeepGuard

📚 DeepFake Video BenchMark Datasets

To evaluate the generalization and robustness of our deepfake detection model, we utilize three large-scale, widely recognized benchmark datasets. Each dataset presents unique challenges and covers different types of forgery methods.

Dataset	Real Videos	Fake Videos	Year	Participants	Description (Paper Title)	Details
Celeb-DF-v2	890	5,639	2019	59	A Large-scale Challenging Dataset for DeepFake Forensics	🔗 Readme
FaceForensics++	1,000	6,000	2019	1,000	Learning to Detect Manipulated Facial Images	🔗 Readme
KoDF	62,166	175,776	2020	400	Large-Scale Korean Deepfake Detection Dataset	🔗 Readme

⚙️ Data Preparation

Our preprocessing pipeline is designed to efficiently extract facial features from videos and prepare them for high-accuracy deepfake detection.

Detect Original Face

To maximize preprocessing efficiency, face detection is performed only on the original (real) videos. Because the manipulated videos in these benchmark datasets keep the face in the same position as their source videos, the bounding boxes detected on each original video are reused for all of its deepfake versions.

🚀 Efficiency Optimizations

Lightweight Model: Uses yolov8n-face for high-speed inference without sacrificing accuracy.
Targeted Processing: By detecting faces only in original videos, the total detection workload is reduced by approximately 80%.
Dynamic Rescaling: To maintain consistent inference speed across different resolutions, frames are automatically resized based on their dimensions:

Frame Size (Longest Side)	Scale Factor	Action
< 300px	2.0	Upscale 2×
300px - 700px	1.0	Keep original size
700px - 1500px	0.5	Downscale to half
> 1500px	0.33	Downscale to about one-third
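The bucket lookup in the table above can be sketched as a small helper. This is an illustration, not the repository's code: the function names are hypothetical, whether a boundary value such as exactly 700 px belongs to the lower or upper bucket is an assumption, and mapping boxes back to original-resolution pixels is an assumption about how detections are stored for reuse on the deepfake videos.

```python
def scale_factor_for(width: int, height: int) -> float:
    """Return the resize factor for a frame, keyed on its longest side.

    Hypothetical helper; bucket edges follow the README table, and the
    handling of exact boundary values (e.g. 700 px) is an assumption.
    """
    longest = max(width, height)
    if longest < 300:
        return 2.0   # small frames are upscaled so faces stay detectable
    if longest <= 700:
        return 1.0   # mid-size frames are used as-is
    if longest <= 1500:
        return 0.5   # large frames are halved
    return 0.33      # very large (e.g. 4K) frames shrink to about a third


def rescaled_size(width: int, height: int) -> tuple:
    """Compute the (width, height) that would be passed to the face detector."""
    s = scale_factor_for(width, height)
    return round(width * s), round(height * s)


def box_to_original(box, scale: float) -> tuple:
    """Map an (x1, y1, x2, y2) box found on the rescaled frame back to original pixels
    (assumed storage convention)."""
    return tuple(v / scale for v in box)
```

For example, a 1920 × 1080 frame falls in the "> 1500px" bucket and would be detected at roughly 634 × 356, after which the boxes are divided by the same factor.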

Face Cropping & Landmark Extraction

This module extracts face crops from both original and deepfake videos using the bounding boxes generated in the previous step. It also performs landmark detection to enable advanced augmentations such as Landmark-based Cutout.

🛠 Key Features

Dynamic Margin with Jitter: Adds a configurable margin around the face. The margin_jitter parameter introduces random variance to the crop size, making the model more robust to different face scales.
Landmark Localization: Detects 5 primary facial landmarks (eyes, nose, mouth corners) and saves them as .npy files.
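The margin-with-jitter step can be illustrated as follows. This is a sketch under assumptions: the function name, drawing the jitter uniformly from ±margin_jitter, and applying the margin as a fraction of the box width and height are illustrative choices and may differ from what the repository's margin_jitter parameter actually does.

```python
import random


def crop_box_with_jitter(box, frame_w, frame_h, margin_ratio=0.2, margin_jitter=0.1, rng=None):
    """Expand a face box by a randomly jittered margin and clip it to the frame.

    Hypothetical helper. box is (x1, y1, x2, y2) in pixels. The effective
    margin is drawn uniformly from [margin_ratio - margin_jitter,
    margin_ratio + margin_jitter] (an assumed sampling scheme) and applied
    on every side as a fraction of the box size.
    """
    rng = rng or random.Random()
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    margin = max(0.0, margin_ratio + rng.uniform(-margin_jitter, margin_jitter))
    dx, dy = w * margin, h * margin
    # Clip so the crop never leaves the frame.
    nx1 = max(0, int(x1 - dx))
    ny1 = max(0, int(y1 - dy))
    nx2 = min(frame_w, int(x2 + dx))
    ny2 = min(frame_h, int(y2 + dy))
    return nx1, ny1, nx2, ny2
```

Because the margin changes from call to call, the same face appears at slightly different scales across training samples, which is the robustness effect described above.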

DATA_ROOT/
├── crops/
│   └── {video_id}/
│       ├── 12.png
│       └── ...
├── landmarks/
│   └── {video_id}/
│       ├── 12.npy
│       └── ...
└── train_frame_metadata.csv
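Given the directory layout above, crops and landmarks can be paired by video ID and frame index. The following is a hypothetical loader sketch, not part of DeepGuard; it assumes file stems are integer frame indices and that a crop without a matching .npy file should be skipped.

```python
from pathlib import Path


def iter_crop_landmark_pairs(data_root):
    """Yield (video_id, frame_idx, crop_path, landmark_path) for every crop
    that has a matching landmark file in DATA_ROOT/crops and DATA_ROOT/landmarks.

    Hypothetical helper; assumes file stems are integer frame indices.
    """
    root = Path(data_root)
    for crop_path in sorted((root / "crops").glob("*/*.png")):
        video_id = crop_path.parent.name
        lm_path = root / "landmarks" / video_id / (crop_path.stem + ".npy")
        if lm_path.exists():  # skip frames without landmarks (assumed policy)
            yield video_id, int(crop_path.stem), crop_path, lm_path
```

Each landmark file can then be read with numpy.load; a 5 × 2 array of (x, y) points would match the five landmarks listed above, but the exact array shape is an assumption.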

Dataset-Specific Pipelines

Click the links below to view the specific preprocessing details for each dataset:

🏗 Model Architecture

The Multi-Scale Efficient Global Context Vision Transformer (MS-EffGCViT) is an optimized multi-scale hybrid architecture. It combines the spatial inductive bias of a CNN with hierarchical attention so that it can identify both subtle local artifacts and global (macro-level) artifacts, making deepfake forensics more robust.

Explore More Details

Model Architecture: MS-EffViT - Multi Scale Efficient Vision Transformer
Advanced Architecture: MS-EffGCViT - Multi Scale Efficient Global Context Vision Transformer

We use two distinct types of self-attention to capture both short-range and long-range information across feature maps:

Local Window Attention: Efficiently captures local textures and precise spatial details by restricting self-attention to fixed-size windows, so computational complexity grows only linearly with image size.
Global Window Attention: Unlike the Swin Transformer, this module uses global queries that interact with the keys and values of each local window. This lets every local region incorporate global context, capturing long-range dependencies and giving the model an understanding of the entire spatial structure.
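To make the difference between the two attention types concrete, here is a deliberately simplified NumPy sketch. It is not the model's implementation: there are no learned Q/K/V projections or multiple heads, and the global queries are produced by plain average pooling as a stand-in for GC ViT's learned global token generator.

```python
import numpy as np


def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w * w, C).
    """
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)


def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention over the last two axes."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def local_window_attention(x, w):
    # Queries, keys and values all come from the same window,
    # so each window only "sees" its own w*w tokens.
    win = window_partition(x, w)
    return attention(win, win, win)


def global_query_attention(x, w):
    # Keys and values still come from each local window...
    H, W, C = x.shape
    win = window_partition(x, w)
    # ...but the queries are global tokens summarizing the whole map,
    # here obtained by average-pooling the map down to w x w (simplification).
    pooled = x.reshape(w, H // w, w, W // w, C).mean(axis=(1, 3))
    q_global = np.broadcast_to(pooled.reshape(1, w * w, C), win.shape)
    return attention(q_global, win, win)
```

In both functions every window attends over a fixed w × w tokens, so total cost grows with the number of windows, i.e. linearly in H × W; only the source of the queries differs.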

🧬 Model Zoo

Model	Resolution	Total Params (M)	Backbone (M)	L-ViT (M)	H-ViT (M)	FLOPs (G)	Model Config
⚡ ms_eff_gcvit_b0	224 × 224	8.7	3.6 (41.4%)	1.7 (19.5%)	3.3 (37.9%)	0.87	spec
🔥 ms_eff_gcvit_b5	384 × 384	50.3	27.3 (54.3%)	6.6 (13.1%)	16.1 (32.0%)	13.64	spec
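A per-component breakdown like the one in this table can be reproduced by grouping parameters by submodule name. The sketch below works with any object exposing named_parameters() (such as a torch.nn.Module); the prefixes "backbone", "l_vit" and "h_vit" in the usage note are hypothetical attribute names, so check the real ones with model.named_children().

```python
from collections import OrderedDict


def param_breakdown(model, prefixes):
    """Count parameters per top-level submodule.

    Works with any object exposing named_parameters() -> (name, tensor) pairs
    where tensor.numel() is available, e.g. a torch.nn.Module.
    Returns the total count and, per prefix, (millions, percent of total).
    """
    counts = OrderedDict((p, 0) for p in prefixes)
    counts["other"] = 0  # heads, norms, anything not matched by a prefix
    total = 0
    for name, param in model.named_parameters():
        n = param.numel()
        total += n
        for p in prefixes:
            if name == p or name.startswith(p + "."):
                counts[p] += n
                break
        else:
            counts["other"] += n
    return total, {
        k: (v / 1e6, 100.0 * v / total if total else 0.0) for k, v in counts.items()
    }
```

Calling param_breakdown(model, ["backbone", "l_vit", "h_vit"]) on a loaded model would return the totals; each percentage is simply component / total, e.g. 3.6 / 8.7 ≈ 41.4% for the ms_eff_gcvit_b0 backbone.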

🚀 Training

We provide training scripts for both ms_eff_vit and ms_eff_gcvit. We recommend Google Colab for free GPU access and Weights & Biases (W&B) for experiment tracking.

📊 Weights & Biases Experiments

ms_eff_vit_b0: Celeb-DF-v2 🚀 | FaceForensics++ 🚀 | KoDF 🚀
ms_eff_vit_b5: Celeb-DF-v2 🚀 | FaceForensics++ 🚀 | KoDF 🚀
ms_eff_gcvit_b0: Celeb-DF-v2 🚀 | FaceForensics++ 🚀 | KoDF 🚀
ms_eff_gcvit_b5: Celeb-DF-v2 🚀 | FaceForensics++ 🚀 | KoDF 🚀

# Use train_eff_gcvit instead of train_eff_vit for the ms_eff_gcvit_* variants.
# --model-ver:     ms_eff_vit_b0, ms_eff_vit_b5, ms_eff_gcvit_b0, ms_eff_gcvit_b5
# --dataset:       ff++, celeb_df_v2, kodf
# --seed:          fixed for reproducibility
# --wandb-api-key: replace with your own W&B API key
!python -m train_eff_vit \
    --root-dir DATA_ROOT \
    --model-ver "ms_eff_vit_b5" \
    --dataset "ff++" \
    --seed 2025 \
    --wandb-api-key "your-api-key"

📈 Model Evaluation

# --model-name:    ms_eff_vit_b0, ms_eff_vit_b5, ms_eff_gcvit_b0, ms_eff_gcvit_b5
# --model-dataset: ff++, celeb_df_v2, kodf
!python -m inference.predict_video \
    --root-dir DATA_ROOT \
    --margin-ratio 0.2 \
    --conf-thres 0.5 \
    --min-face-ratio 0.01 \
    --model-name "ms_eff_gcvit_b0" \
    --model-dataset "kodf" \
    --num-frames 20 \
    --tta-hflip 0.0 \
    --agg-mode "conf"

Celeb-DF-v2 Pretrained Models

Model Variant	Test@Acc	Test@Auc	Test@log_loss	Download	Train Config
ms_eff_gcvit_b0	0.9842	0.9965	0.0283	model	recipe
ms_eff_gcvit_b5	0.9981	0.9984	0.0089	model	recipe

FaceForensics++ Pretrained Models

Model Variant	Test@Acc	Test@Auc	Test@log_loss	Download	Train Config
ms_eff_gcvit_b0	0.9808	0.9969	0.0637	model	recipe
ms_eff_gcvit_b5	0.9850	0.9974	0.0492	model	recipe

KoDF Pretrained Models

Model Variant	Test@Acc	Test@Auc	Test@log_loss	Download	Train Config
ms_eff_gcvit_b0	0.9655	0.9792	0.1237	model	recipe
ms_eff_gcvit_b5	0.9850	0.9974	0.0492	model	recipe
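For reference, the three reported metrics (accuracy, AUC, log loss) can be computed from binary labels and predicted fake probabilities as below. This is a generic NumPy sketch, not the repository's evaluation code; the 0.5 decision threshold and the clipping epsilon are assumptions.

```python
import numpy as np


def binary_metrics(y_true, y_prob, threshold=0.5, eps=1e-15):
    """Accuracy, ROC AUC and log loss for binary labels (1 = fake).

    AUC is computed as the probability that a random fake sample scores
    higher than a random real one (ties count half), which equals the
    area under the ROC curve.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)

    acc = np.mean((p >= threshold) == (y == 1))

    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]           # every fake/real pair
    auc = np.mean((diff > 0) + 0.5 * (diff == 0))

    p = np.clip(p, eps, 1 - eps)                 # avoid log(0)
    log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(acc), float(auc), float(log_loss)
```

The pairwise AUC is O(n_fake × n_real) in memory, which is fine for a few thousand test videos; for larger sets a rank-based formula is preferable.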

💻 Model Usage

Quick Start: You can load the models directly via the DeepGuard package or through the timm interface.

Available Datasets: celeb_df_v2, ff++, kodf

Installation

pip install deepguard
# or install directly from GitHub:
# pip install -U git+https://github.com/HanMoonSub/DeepGuard.git

Option A: Direct Import (via DeepGuard)

from deepguard import ms_eff_gcvit_b0, ms_eff_gcvit_b5

model_b0 = ms_eff_gcvit_b0(pretrained=True, dataset="celeb_df_v2")  # weights trained on Celeb-DF-v2
model_b5 = ms_eff_gcvit_b5(pretrained=True, dataset="ff++")          # weights trained on FaceForensics++

Option B: Using timm Interface (via timm)

import timm
import deepguard  # noqa: F401  (imported so the DeepGuard models are registered with timm)

model_b0 = timm.create_model("ms_eff_gcvit_b0", pretrained=True, dataset="ff++")
model_b5 = timm.create_model("ms_eff_gcvit_b5", pretrained=True, dataset="kodf")

Option C: Hugging Face Hub

import torch
from huggingface_hub import hf_hub_download
from deepguard import ms_eff_gcvit_b0  # or ms_eff_gcvit_b5

REPO_ID = "KoreaPeter/ms-eff-gcvit-deepfake"

ckpt = hf_hub_download(REPO_ID, "ms_eff_gcvit_b0_kodf.bin")  # celeb_df_v2 | ff++ | kodf
model = ms_eff_gcvit_b0(pretrained=False)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()
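Once loaded, the model expects a batch of face crops in NCHW layout. The sketch below shows one way to build that batch with NumPy; the ImageNet mean/std normalization is an assumption (common for EfficientNet-style backbones, but not stated in this README), as is the requirement that crops are already resized to the Model Zoo resolution.

```python
import numpy as np

# Assumed normalization: ImageNet statistics. The repository's actual
# transform may differ; check its data pipeline before relying on this.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_model_input(face_rgb, resolution):
    """Turn an RGB uint8 face crop of shape (H, W, 3) into a (1, 3, H, W) float32 batch.

    The crop must already be resized to the model resolution
    (224 for ms_eff_gcvit_b0, 384 for ms_eff_gcvit_b5, per the Model Zoo).
    """
    if face_rgb.shape != (resolution, resolution, 3):
        raise ValueError(f"expected ({resolution}, {resolution}, 3), got {face_rgb.shape}")
    x = face_rgb.astype(np.float32) / 255.0               # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD                 # per-channel normalization
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None])  # HWC -> NCHW batch of 1
```

The result can be wrapped with torch.from_numpy and passed to the model under torch.no_grad(); how the output maps to a fake probability depends on the model head, which is not documented here.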

🔮 Predict Image & Video

Predict DeepFake Image

from inference.image_predictor import ImagePredictor

# Initialize the predictor
predictor = ImagePredictor(
    margin_ratio=0.2,           # margin ratio around the detected face crop
    conf_thres=0.5,             # confidence threshold for face detection
    min_face_ratio=0.01,        # minimum face-to-frame size ratio to process
    model_name="ms_eff_vit_b0", # or ms_eff_vit_b5, ms_eff_gcvit_b0, ms_eff_gcvit_b5
    dataset="celeb_df_v2",      # or ff++, kodf
)

# Run inference
result = predictor.predict_img(
    img_path="path/to/image.jpg",
    tta_hflip=0.0,              # horizontal flip for test-time augmentation
)

print(f"Deepfake Probability: {result:.4f}")

Predict DeepFake Video

from inference.video_predictor import VideoPredictor

# Initialize the predictor
predictor = VideoPredictor(
    margin_ratio=0.2,           # margin ratio around the detected face crop
    conf_thres=0.5,             # confidence threshold for face detection
    min_face_ratio=0.01,        # minimum face-to-frame size ratio to process
    model_name="ms_eff_vit_b0", # or ms_eff_vit_b5, ms_eff_gcvit_b0, ms_eff_gcvit_b5
    dataset="celeb_df_v2",      # or ff++, kodf
)

# Run inference
result = predictor.predict_video(
    video_path="path/to/video.mp4",
    num_frames=20,              # number of frames to sample per video
    agg_mode="conf",            # frame aggregation method: 'conf', 'mean', or 'vote'
    tta_hflip=0.0,              # horizontal flip for test-time augmentation
)

print(f"Deepfake Probability: {result:.4f}")
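The agg_mode options are not defined further here. As a hypothetical illustration of how per-frame fake probabilities might be combined into one video score, the sketch below implements a plain mean, a vote, and a "confident" rule modeled on a strategy popularized in the Deepfake Detection Challenge (DFDC). The thresholds and exact rules are assumptions and may not match DeepGuard's modes.

```python
import numpy as np


def aggregate_frames(frame_probs, mode="conf", fake_thr=0.8, real_thr=0.2, vote_thr=0.5):
    """Combine per-frame fake probabilities into one video score (hypothetical).

    mode="mean": plain average of all frames.
    mode="vote": fraction of frames whose probability exceeds vote_thr.
    mode="conf": if enough frames are confidently fake, average only those;
                 if almost all are confidently real, average those;
                 otherwise fall back to the plain average.
    """
    p = np.asarray(frame_probs, dtype=float)
    if p.size == 0:
        raise ValueError("no frames with a detected face")
    if mode == "mean":
        return float(p.mean())
    if mode == "vote":
        return float(np.mean(p > vote_thr))
    if mode == "conf":
        fakes = p > fake_thr
        reals = p < real_thr
        if fakes.sum() > p.size / 2.5:    # a sizeable share of frames look clearly fake
            return float(p[fakes].mean())
        if reals.sum() > 0.9 * p.size:    # nearly every frame looks clearly real
            return float(p[reals].mean())
        return float(p.mean())
    raise ValueError(f"unknown mode: {mode}")
```

The intuition behind a confidence-based rule is that manipulation is often visible in only some frames, so a plain mean can dilute strong fake evidence.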

🎨 DeepFake AI Explainability

Deepfake detection is only as trustworthy as its explanations. DeepGuard integrates a production-ready XAI Toolkit that visualizes where and why the model flags a face as manipulated — turning a black-box score into actionable forensic evidence.

⭐ Validated on hybrid CNN-ViT architectures, specifically MS-EffViT and MS-EffGCViT.
⭐ Dual-Branch Analysis: The dual-branch design mirrors the model's own multi-scale reasoning.

🧠 How Dual-Branch XAI Works

Branch	Feature Map	Focus	Best For
Low-Level	High Resolution	Local forgery artifacts	Skin texture, boundary blending, compression traces
High-Level	Low Resolution	Global semantic structure	Lighting inconsistency, facial geometry, shadow artifacts

📐 XAI Methods

Each method is assigned to the branch where it performs best empirically.

Branch	Method	🎯 Core Idea
`low level`	HiResCAM	Like GradCAM, but multiplies the activations element-wise with the gradients; provably faithful for certain models
`low level`	GradCAMElementWise	Like GradCAM, but multiplies the activations element-wise with the gradients, then applies a ReLU before summing
`low level`	LayerCAM	Weights the activations spatially by the positive gradients; works especially well in lower layers
`high level`	EigenGradCAM	Like EigenCAM, but class-discriminative: uses the first principal component of activations × gradients. Looks like GradCAM, but cleaner
`high level`	GradCAM++	Like GradCAM, but uses second-order gradients
`high level`	XGradCAM	Like GradCAM, but scales the gradients by the normalized activations
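The one-line descriptions in this table differ mainly in how activations and gradients are combined. The NumPy sketch below writes those combinations out for a single layer; it follows the weighting rules described for these methods in the pytorch-grad-cam library but is a simplified re-implementation, not DeepGuard's or that library's code. GradCAM++ and EigenGradCAM are omitted for brevity.

```python
import numpy as np


def _relu(a):
    return np.maximum(a, 0)


def cam_from(acts, grads, method):
    """Compute a class activation map from one layer's activations and gradients.

    acts, grads: arrays of shape (C, H, W) for a single image, where grads is
    the gradient of the target class score (e.g. "fake") w.r.t. acts.
    Returns an (H, W) map with a final ReLU applied.
    """
    if method == "gradcam":
        weights = grads.mean(axis=(1, 2))              # one weight per channel
        cam = np.tensordot(weights, acts, axes=1)
    elif method == "hirescam":
        cam = (grads * acts).sum(axis=0)               # element-wise, no spatial averaging
    elif method == "gradcam_elementwise":
        cam = _relu(grads * acts).sum(axis=0)          # ReLU before the channel sum
    elif method == "layercam":
        cam = (_relu(grads) * acts).sum(axis=0)        # keep only positive gradients
    elif method == "xgradcam":
        weights = (grads * acts).sum(axis=(1, 2)) / (acts.sum(axis=(1, 2)) + 1e-7)
        cam = np.tensordot(weights, acts, axes=1)
    else:
        raise ValueError(f"unknown method: {method}")
    return _relu(cam)


def normalize(cam):
    """Rescale a CAM to [0, 1) for display."""
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-7)
```

When the gradient is constant within each channel, GradCAM and HiResCAM coincide; HiResCAM only differs where gradients vary spatially, which is why it tends to preserve finer local detail.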

aug_smooth applies TTA (horizontal flips) before averaging CAMs → smoother, more object-aligned maps
eigen_smooth applies PCA noise reduction → retains dominant forgery pattern only
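eigen_smooth follows the idea behind EigenCAM: keep only the first principal component of the (weighted) activations. Below is a NumPy sketch of that projection, modeled on the approach used in pytorch-grad-cam; projecting activations × gradients instead of raw activations gives the EigenGradCAM variant. It is illustrative, not DeepGuard's code.

```python
import numpy as np


def first_principal_projection(acts):
    """Project (C, H, W) activations onto their first principal component.

    Each spatial position is treated as a sample with C features; after
    centering, the map is each position's score along the dominant
    direction. The sign of an SVD component is arbitrary, so callers
    typically flip or ReLU the result.
    """
    C, H, W = acts.shape
    X = acts.reshape(C, -1).T                 # (H*W, C): one row per pixel
    X = X - X.mean(axis=0)                    # center each channel
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return (X @ vt[0]).reshape(H, W)
```

Dropping every component except the first discards weaker, noisier activation patterns, which is the "retains dominant forgery pattern only" behavior described above.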

💡 DeepFake XAI Usage

Low-Level Branch — Local Artifact Detection

from explainability import HiResCAMExplainer, GradCAMElementWiseExplainer, LayerCAMExplainer

explainer = HiResCAMExplainer(
    model_name   = "ms_eff_gcvit_b0",  # or ms_eff_vit_b0, ms_eff_gcvit_b5, ms_eff_vit_b5
    dataset      = "celeb_df_v2",       # or ff++, kodf
    branch_level = "low",
)

High-Level Branch — Global Semantic Detection

from explainability import EigenGradCAMExplainer, GradCAMPlusPlusExplainer, XGradCAMExplainer

explainer = EigenGradCAMExplainer(
    model_name   = "ms_eff_gcvit_b0",
    dataset      = "celeb_df_v2",
    branch_level = "high",
)

🎨 Visualization Modes

1. Heatmap — Continuous activation distribution

result = explainer.display_heatmap_on_image(
    img_path     = "path/to/image.jpg",
    category     = 1,      # 0: Real, 1: Fake
    threshold    = 0.5,    # binarization cutoff (0.5 to 1.0), or "auto" for Otsu's method
    image_weight = 0.5,    # blend weight: 0.0 shows only the heatmap, 1.0 only the original image
    aug_smooth   = False,  # TTA smoothing (not supported on 'pro' models)
    eigen_smooth = False,  # PCA noise reduction
)
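When threshold="auto", the heatmap is binarized with Otsu's method. A self-contained NumPy version of that step might look like the following; the function names and the 256-bin histogram are illustrative assumptions, not the explainer's internals.

```python
import numpy as np


def otsu_threshold(cam, bins=256):
    """Otsu's method on a CAM normalized to [0, 1].

    Picks the cut that maximizes the between-class variance of the two
    groups (background vs. highlighted pixels).
    """
    hist, edges = np.histogram(cam.ravel(), bins=bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                       # pixels at or below each cut
    w1 = w0[-1] - w0                           # pixels above it
    s0 = np.cumsum(hist * centers)
    m0 = s0 / np.maximum(w0, 1)                # mean of the lower class
    m1 = (s0[-1] - s0) / np.maximum(w1, 1)     # mean of the upper class
    between = w0 * w1 * (m0 - m1) ** 2
    return float(edges[1:][np.argmax(between)])


def binarize(cam, threshold=0.5):
    """Return a boolean mask of suspicious pixels; threshold may be 'auto'."""
    t = otsu_threshold(cam) if threshold == "auto" else float(threshold)
    return cam > t
```

A fixed cutoff is predictable across images, while Otsu adapts to each heatmap's own intensity distribution, which helps when activation strengths vary between samples.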

2. Bounding Box — Discrete forgery region localization

result = explainer.display_bbox_on_image(
    img_path     = "path/to/image.jpg",
    category     = 1,
    threshold    = 0.5,
    thickness    = 1,
    aug_smooth   = False,
    eigen_smooth = False,
)
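The bounding-box mode turns the binarized map into discrete regions. One way to do that without OpenCV is to label connected components and take each component's extent, as sketched below. Whether DeepGuard uses 4- or 8-connectivity or filters out small regions is not stated here, so those choices (and the helper's name) are assumptions.

```python
from collections import deque

import numpy as np


def mask_to_boxes(mask, min_area=1):
    """Return (x1, y1, x2, y2) boxes, one per 4-connected region of a boolean mask.

    Coordinates are inclusive pixel indices. Regions smaller than min_area
    pixels are dropped as noise.
    """
    mask = np.asarray(mask, dtype=bool)
    H, W = mask.shape
    seen = np.zeros_like(mask)
    boxes = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        # Breadth-first flood fill of one connected region.
        queue = deque([(sy, sx)])
        seen[sy, sx] = True
        ys, xs = [], []
        while queue:
            y, x = queue.popleft()
            ys.append(y)
            xs.append(x)
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(ys) >= min_area:
            boxes.append((int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))))
    return boxes
```

The resulting boxes could then be drawn with OpenCV's cv2.rectangle, whose thickness argument controls line width in the same way the thickness parameter above appears to.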

3. Heatmap + BBox — Full overlay (recommended for reporting)

result = explainer.display_heatmap_bbox_on_image(
    img_path     = "path/to/image.jpg",
    category     = 1,
    threshold    = 0.5,
    image_weight = 0.5,
    aug_smooth   = False,
    eigen_smooth = False,
)

📊 Visual Results

MS-Eff-ViT — Low-Level Branch

Model	Branch-Level	Image	HiresCam	GradCamElementwise	LayerCam
⚡ ms-eff-vit-b0
🔥 ms-eff-vit-b5

MS-Eff-ViT — High-Level Branch

Model	Branch-Level	Image	EigenGradCam	GradCamPlusPlus	XGradCam
⚡ ms-eff-vit-b0
🔥 ms-eff-vit-b5

MS-Eff-GCViT — Low-Level Branch

Model	Branch-Level	Image	HiresCam	GradCamElementwise	LayerCam
⚡ ms-eff-gcvit-b0
🔥 ms-eff-gcvit-b5

MS-Eff-GCViT — High-Level Branch

Model	Branch-Level	Image	EigenGradCam	GradCamPlusPlus	XGradCam
⚡ ms-eff-gcvit-b0
🔥 ms-eff-gcvit-b5

📓 Tutorials

The Jupyter notebooks are located in the tutorials folder of the Git repository.

📬 Authors

This project was developed as a Senior Graduation Project by the Department of Software at Chungbuk National University (CBNU), Republic of Korea.

한문섭: Data & Backend Engineering (Data Preprocessing Pipeline, DB Schema Design) — hanmoon3054@gmail.com
이예솔: UI/UX & Frontend Engineering (UI/UX Design, User Dashboard, Model Visualization) — yesol4138@chungbuk.ac.kr
서윤제: AI Engineering (AI Model Architecture, Inference API Design, Model Serving) — seoyunje2001@gmail.com

📝 Reference

facenet-pytorch - Pretrained Face Detection(MTCNN) and Recognition(InceptionResNet) Models by Tim Esler
face-cutout - Face Cutout Library by Sowmen
Celeb-DF++ - Celeb-DF++ Dataset by OUC-VAS Group
DeeperForensics-1.0 - DeeperForensics-1.0 Dataset by Endless Sora
Deepfake Detection - Detection of Video Deepfake using ResNext and LSTM by Abhijith Jadhav
deepfake-detection-project-v4 - Multiple Deep Learning Models by Ameen Caslam
Awesome-Deepfake-Detection - A curated list of tools, papers and code by Daisy Zhang
Pytorch-Grad-Cam - Advanced Visual Explanations for PyTorch Models

⚖️ License

This project is licensed under the terms of the MIT license.
