Network Packet Normalization for Anomaly Detection

High-performance CUDA kernel for real-time network packet preprocessing in intrusion detection systems.

Problem

Network security systems process millions of packets per second. Each packet must be normalized before ML inference:

Input:  Raw packet features [batch_size × num_features]
Output: Normalized packets with mean=0, std=1 per packet

Challenge: At 10 Gbps, the pipeline sees about 1.5 million packets/sec. PyTorch's normalization is too slow for real-time processing at that rate.

Solution

A custom CUDA kernel with kernel fusion: it computes the mean, std, and normalized output in a single GPU pass instead of PyTorch's 3 separate kernels.

Result: the optimized kernel is 3.06x to 7.50x faster than PyTorch across the benchmarked sizes, and per-batch latency stays well under a millisecond.

Benchmark Results (NVIDIA GB200)

Batch Size	Features	PyTorch (ms)	CUDA (ms)	Optimized (ms)	Speedup (PyTorch / Optimized)
1,024	128	0.0636	0.0082	0.0085	7.50x
4,096	256	0.0632	0.0178	0.0103	6.11x
8,192	512	0.0658	0.0330	0.0186	3.54x
16,384	1,024	0.1872	0.0882	0.0611	3.06x

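Average speedup over PyTorch: 5.05x for the optimized kernel and 3.86x for the original CUDA kernel, so the optimized kernel is 1.31x faster than the original on average.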
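The averages can be re-derived from the timings in the table. The following is a small check in plain Python that uses only the numbers listed above; the variable names are illustrative. Per-row ratios recomputed from the rounded timings can differ from the table's Speedup column in the second decimal place.

```python
# Timings in ms copied from the benchmark table: (pytorch, cuda, optimized)
timings = {
    (1024, 128): (0.0636, 0.0082, 0.0085),
    (4096, 256): (0.0632, 0.0178, 0.0103),
    (8192, 512): (0.0658, 0.0330, 0.0186),
    (16384, 1024): (0.1872, 0.0882, 0.0611),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Average of the per-row speedups relative to the PyTorch baseline
avg_original = mean(pt / cuda for pt, cuda, _ in timings.values())
avg_optimized = mean(pt / opt for pt, _, opt in timings.values())

print(f"original CUDA vs PyTorch: {avg_original:.2f}x")
print(f"optimized vs PyTorch:     {avg_optimized:.2f}x")
print(f"optimized vs original:    {avg_optimized / avg_original:.2f}x")
```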

Roofline Analysis

[Figure: roofline plot (roofline.png), generated by profile_kernel.py]

Quick Start

# Run benchmark (auto-builds CUDA extension if needed)
./run_benchmark.sh

# Generate roofline analysis
python profile_kernel.py

Project Structure

├── src/
│   ├── normalize_pytorch.py      # PyTorch baseline
│   ├── normalize_cuda_kernel.cu  # Fused CUDA kernel
│   └── normalize_cuda.cpp        # Python binding
├── benchmark.py                   # Performance comparison
├── profile_kernel.py             # Roofline analysis
├── run_benchmark.sh              # Main entry point
└── setup.py                       # Build script

Why Custom CUDA?

PyTorch approach (slow):

1. Read data → compute mean → write mean
2. Read data → compute std → write std  
3. Read data → normalize → write output
Total: 6 memory operations

Our approach (fast):

1. Read data → compute mean+std+normalize → write output
Total: 2 memory operations (3x reduction)

For this workload, GPU performance is limited by memory bandwidth, not compute, so fewer memory operations means faster execution.

Why Not Numba?

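Numba CUDA is easier to use, but it did not work on the target hardware:

GB200 has compute capability 10.0 (Blackwell architecture)
At the time of writing, Numba supported compute capability only up to 9.x (H100 and earlier)
Running Numba kernels on GB200 resulted in segmentation faults

Custom C++/CUDA built with a recent CUDA Toolkit runs on every GPU generation that toolkit supports.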

Requirements

CUDA-capable GPU (compute capability 9.0+)
PyTorch with CUDA support
CUDA Toolkit 12.6+
Python 3.10+
matplotlib (for roofline plots)

Real-World Impact

For a 10 Gbps network processing 1.5M packets/sec in batches of 8,192 (timings from the 8,192-packet benchmark row, original CUDA kernel):

Batches per second: 1,500,000 / 8,192 ≈ 183 batches/sec

PyTorch: 183 × 0.0658 ms = 12.0 ms/sec GPU time
Custom:  183 × 0.0330 ms =  6.0 ms/sec GPU time

Savings: 6.0 ms/sec freed for ML model inference

This allows processing higher traffic rates or running more complex models on the same hardware.
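The GPU-time estimate can be reproduced with a few lines of Python. This sketch takes the per-batch latencies from the 8,192-packet row of the benchmark table and assumes real traffic has the same feature count as that row (512). The optimized-kernel figure is derived the same way and is not stated in the text above.

```python
PACKETS_PER_SEC = 1_500_000
BATCH_SIZE = 8_192

# Per-batch latencies in ms for batch size 8,192, from the benchmark table
LATENCY_MS = {"pytorch": 0.0658, "cuda": 0.0330, "optimized": 0.0186}

batches_per_sec = PACKETS_PER_SEC / BATCH_SIZE  # about 183

# GPU time spent on normalization per second of traffic, per implementation
busy_ms = {name: batches_per_sec * ms for name, ms in LATENCY_MS.items()}

for name, value in busy_ms.items():
    print(f"{name:>9}: {value:5.1f} ms of GPU time per second of traffic")

saved = busy_ms["pytorch"] - busy_ms["cuda"]
print(f"freed by the original CUDA kernel: {saved:.1f} ms/sec")
```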

Algorithm

Each packet is normalized independently:

x_norm[i,j] = (x[i,j] - mean(x[i,:])) / (std(x[i,:]) + ε)

where ε = 1e-5 prevents division by zero.
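As a reference for the formula, here is a minimal pure-Python sketch of per-packet normalization. It is not the code in src/normalize_pytorch.py. The function names are hypothetical, and it assumes the population standard deviation (divide by n), which the formula does not specify. PyTorch's torch.std applies Bessel's correction by default, so a baseline built on it would differ slightly unless that is turned off.

```python
import math

EPS = 1e-5  # same epsilon as in the formula above


def normalize_packet(features, eps=EPS):
    """Normalize one packet's feature vector to mean 0 and std close to 1."""
    n = len(features)
    mean = sum(features) / n
    # ASSUMPTION: population variance (divide by n, not n - 1)
    var = sum((x - mean) ** 2 for x in features) / n
    std = math.sqrt(var)
    return [(x - mean) / (std + eps) for x in features]


def normalize_batch(batch, eps=EPS):
    """Normalize every row independently, one row per packet."""
    return [normalize_packet(row, eps) for row in batch]


batch = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 10.0, 10.0, 10.0],  # constant row: std is 0, eps avoids 0/0
]
for row in normalize_batch(batch):
    print([round(v, 4) for v in row])
```

The constant row shows why ε is needed: without it, the division would be 0/0.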

CUDA Kernel Design

Original Kernel:

1 thread block per packet for independent processing
256 threads per block for parallel reduction
Shared memory for fast mean/std computation
Coalesced memory access for optimal bandwidth
Single kernel launch instead of 3 separate launches
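The block-level reduction described in this list can be illustrated with a sequential Python simulation. This is a sketch of the general shared-memory tree-reduction pattern, not the code in normalize_cuda_kernel.cu. The list named shared stands in for the block's shared memory, and each pass of the while loop stands in for the work between two __syncthreads() barriers.

```python
BLOCK_SIZE = 256  # threads per block, as in the kernel design above


def block_reduce_sum(row, block_size=BLOCK_SIZE):
    """Simulate one thread block summing the features of one packet."""
    # Step 1: thread t accumulates elements t, t + block_size, ...
    # Neighbouring threads read neighbouring addresses (coalesced access).
    shared = [sum(row[t::block_size]) for t in range(block_size)]

    # Step 2: tree reduction; the active half adds in the other half,
    # then the active range is halved until one value remains.
    stride = block_size // 2
    while stride > 0:
        for t in range(stride):
            shared[t] += shared[t + stride]
        stride //= 2
    return shared[0]


row = [float(i) for i in range(1024)]
total = block_reduce_sum(row)
print(total, total / len(row))  # sum and mean of one 1,024-feature packet
```

The same pattern, run once for the sum and once for the sum of squared deviations, gives the mean and std for the packet owned by the block.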

Optimized Kernel (1.31x faster):

Warp shuffle reductions - uses __shfl_down_sync() instead of shared memory (2x faster)
Welford's online algorithm - single-pass mean+variance computation (reads data once vs twice)
Vectorized memory - float4 loads/stores move 16 bytes per instruction (4x fewer memory instructions)
Reduced shared memory - 64 bytes vs 2KB (better occupancy)
Less thread divergence - only first warp does final reduction

See OPTIMIZATIONS.md for detailed analysis.
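The first two optimizations combine naturally: each lane runs Welford's update over its own slice of the packet, and the partial states are then merged in the offset pattern 16, 8, 4, 2, 1 that a __shfl_down_sync() reduction uses. The Python sketch below simulates a single warp only. It is an illustration under stated assumptions, not the kernel's code: the helper names are hypothetical, the merge step is the standard pairwise formula for combining Welford states, and the final std assumes population variance.

```python
import math

WARP_SIZE = 32


def welford(values):
    """Single-pass (count, mean, M2); M2 is the sum of squared deviations."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2


def merge(a, b):
    """Combine two partial Welford states into one."""
    na, mean_a, m2_a = a
    nb, mean_b, m2_b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2_a + m2_b + delta * delta * na * nb / n
    return n, mean, m2


def warp_reduce(lanes):
    """Mimic a shuffle-down reduction: lane i merges with lane i + offset."""
    lanes = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Only lanes below offset are simulated; they are the ones whose
        # values feed into lane 0.
        for i in range(offset):
            lanes[i] = merge(lanes[i], lanes[i + offset])
        offset //= 2
    return lanes[0]  # lane 0 holds the state for the whole row


row = [float((7 * i) % 13) for i in range(128)]
partials = [welford(row[lane::WARP_SIZE]) for lane in range(WARP_SIZE)]
count, mean, m2 = warp_reduce(partials)
std = math.sqrt(m2 / count)  # ASSUMPTION: population std
print(count, round(mean, 6), round(std, 6))
```

With 256 threads per block there are 8 warps, so a real kernel would need one more step to combine the 8 per-warp results, which is consistent with the small shared-memory footprint listed above.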

Why Memory-Bound? Can We Do Better?

Arithmetic Intensity: 0.75 FLOPs/byte (6 FLOPs per element / 8 bytes read+write)

Why we stay memory-bound:

Normalization has limited computation: (x - mean) / std
Becoming compute-bound would take roughly 13x more math per element
The algorithm fundamentally requires reading/writing data
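The memory-bound argument follows from the roofline model, sketched below in Python. The FLOP and byte counts come from the text. The peak bandwidth and peak FLOP rate are hypothetical placeholders, chosen only so that the ridge point matches the "13x" statement; they are not GB200 specifications.

```python
def arithmetic_intensity(flops_per_element, bytes_per_element):
    """FLOPs performed per byte moved to or from global memory."""
    return flops_per_element / bytes_per_element


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline model: the lower of the compute roof and the memory roof."""
    return min(peak_flops, peak_bandwidth * intensity)


# Numbers from the text: 6 FLOPs per element, 4-byte read + 4-byte write
ai_fp32 = arithmetic_intensity(6, 8)  # 0.75 FLOPs/byte
# FP16 halves the bytes per element, so the intensity doubles
ai_fp16 = arithmetic_intensity(6, 4)  # 1.5 FLOPs/byte

# "13x more math" implies a ridge point near 13 * 0.75 FLOPs/byte
implied_ridge = 13 * ai_fp32

# HYPOTHETICAL machine, chosen only so its ridge point equals implied_ridge
PEAK_BANDWIDTH = 1.0e12                      # bytes/s (placeholder)
PEAK_FLOPS = implied_ridge * PEAK_BANDWIDTH  # FLOP/s (placeholder)

for label, ai in [("FP32", ai_fp32), ("FP16", ai_fp16)]:
    roof = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    bound = "memory" if roof < PEAK_FLOPS else "compute"
    print(f"{label}: AI={ai:.2f} FLOPs/byte, {bound}-bound")
```

Both precisions sit well to the left of the implied ridge point, so under this model the kernel stays memory-bound either way.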

What we optimized:

Reduced memory passes (6 ops → 2 ops) via kernel fusion
Warp shuffles + vectorized loads for max bandwidth efficiency
Achieved 19% bandwidth utilization (good for an arithmetic intensity of 0.75)

Further gains would require one of the following:

Processing multiple batches concurrently with CUDA streams (estimated 19% → 40%+ bandwidth utilization)
Fusing with the next operation (matrix multiply, activation, etc.)
Using FP16 (half the memory traffic, at reduced precision)

Being memory-bound is expected for this workload. The speedup comes from reducing memory traffic, not from adding compute.
