hoyathalis/packet-normalization-cuda

Network Packet Normalization for Anomaly Detection

High-performance CUDA kernel for real-time network packet preprocessing in intrusion detection systems.

Problem

Network security systems process millions of packets per second. Each packet must be normalized before ML inference:

Input:  Raw packet features [batch_size × num_features]
Output: Normalized packets with mean=0, std=1 per packet

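Challenge: At 10 Gbps, the pipeline sees roughly 1.5 million packets/sec (the figure used in the Real-World Impact estimate). PyTorch's normalization launches several kernels per batch, which costs GPU time that a real-time pipeline needs for inference.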

Solution

A custom CUDA kernel fuses the mean, std, and normalization steps into a single GPU pass instead of PyTorch's 3 separate kernels.

Result: the optimized kernel is 3.1x to 7.5x faster than PyTorch across the benchmarked shapes, with sub-millisecond latency per batch.

Benchmark Results (NVIDIA GB200)

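| Batch Size | Features | PyTorch (ms) | CUDA (ms) | Optimized (ms) | Speedup (Optimized vs PyTorch) |
|-----------:|---------:|-------------:|----------:|---------------:|-------------------------------:|
| 1,024      | 128      | 0.0636       | 0.0082    | 0.0085         | 7.50x                          |
| 4,096      | 256      | 0.0632       | 0.0178    | 0.0103         | 6.11x                          |
| 8,192      | 512      | 0.0658       | 0.0330    | 0.0186         | 3.54x                          |
| 16,384     | 1,024    | 0.1872       | 0.0882    | 0.0611         | 3.06x                          |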

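Average speedup over PyTorch: 5.05x for the optimized kernel and 3.86x for the original CUDA kernel, so the optimized kernel is 1.31x faster than the original on average.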
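
The speedup column and the averages can be re-derived from the timings in the table. The sketch below is plain Python; the `rows` list simply transcribes the table, and the variable names are illustrative rather than taken from benchmark.py. Because the published timings are rounded, a recomputed per-row value can differ from the table in the last digit.

```python
# Timings transcribed from the benchmark table (milliseconds).
# Tuple layout: (batch_size, features, pytorch_ms, cuda_ms, optimized_ms)
rows = [
    (1024, 128, 0.0636, 0.0082, 0.0085),
    (4096, 256, 0.0632, 0.0178, 0.0103),
    (8192, 512, 0.0658, 0.0330, 0.0186),
    (16384, 1024, 0.1872, 0.0882, 0.0611),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Per-row speedup over the PyTorch baseline for each kernel.
original_speedups = [pt / cuda for _, _, pt, cuda, _ in rows]
optimized_speedups = [pt / opt for _, _, pt, _, opt in rows]

avg_original = mean(original_speedups)
avg_optimized = mean(optimized_speedups)

for (batch, feats, *_), s in zip(rows, optimized_speedups):
    print(f"{batch:>6} x {feats:<5} optimized speedup: {s:.2f}x")
print(f"average original:  {avg_original:.2f}x")
print(f"average optimized: {avg_optimized:.2f}x")
print(f"optimized vs original: {avg_optimized / avg_original:.2f}x")
```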

Roofline Analysis

[Figure: roofline plot]

Quick Start

# Run benchmark (auto-builds CUDA extension if needed)
./run_benchmark.sh

# Generate roofline analysis
python profile_kernel.py

Project Structure

├── src/
│   ├── normalize_pytorch.py      # PyTorch baseline
│   ├── normalize_cuda_kernel.cu  # Fused CUDA kernel
│   └── normalize_cuda.cpp        # Python binding
├── benchmark.py                   # Performance comparison
├── profile_kernel.py             # Roofline analysis
├── run_benchmark.sh              # Main entry point
└── setup.py                       # Build script

Why Custom CUDA?

PyTorch approach (slow):

1. Read data → compute mean → write mean
2. Read data → compute std → write std  
3. Read data → normalize → write output
Total: 6 memory operations

Our approach (fast):

1. Read data → compute mean+std+normalize → write output
Total: 2 memory operations (3x reduction)

For this workload, GPU performance is limited by memory bandwidth, not compute, so fewer memory operations mean faster execution.
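
To put numbers on the operation counts above, here is a back-of-the-envelope sketch. It assumes FP32 features (4 bytes each) and treats every memory operation as a transfer of the full [batch_size × num_features] matrix. That follows the counting used above but overstates the mean and std writes, which are only one value per packet. The function name and constants are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: features are stored as FP32

def traffic_bytes(batch_size, num_features, full_matrix_ops):
    """Bytes moved if each memory operation touches the whole matrix."""
    return batch_size * num_features * BYTES_PER_FLOAT32 * full_matrix_ops

batch_size, num_features = 8192, 512  # one of the benchmarked shapes

unfused = traffic_bytes(batch_size, num_features, full_matrix_ops=6)
fused = traffic_bytes(batch_size, num_features, full_matrix_ops=2)

print(f"unfused: {unfused / 2**20:.0f} MiB")
print(f"fused:   {fused / 2**20:.0f} MiB")
print(f"reduction: {unfused / fused:.0f}x")
```

Under these assumptions the fused kernel moves one third of the data per batch, which is where the 3x reduction comes from.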

Why Not Numba?

Numba CUDA is easier to use but doesn't work here:

  • GB200 has compute capability 10.0 (Blackwell architecture)
  • Numba currently supports only up to 9.x (H100 and earlier)
  • Results in segmentation faults on GB200

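Custom C++/CUDA built with nvcc targets whatever architectures the installed CUDA Toolkit supports, so it is not tied to Numba's supported range.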

Requirements

  • CUDA-capable GPU (compute capability 9.0+)
  • PyTorch with CUDA support
  • CUDA Toolkit 12.6+
  • Python 3.10+
  • matplotlib (for roofline plots)

Real-World Impact

For a 10 Gbps network processing 1.5M packets/sec in batches of 8K:

Batches per second: 1,500,000 / 8,192 = 183 batches/sec

PyTorch:       183 × 0.0658ms = 12.0 ms/sec GPU time
Original CUDA: 183 × 0.0330ms =  6.0 ms/sec GPU time

Savings: 6.0 ms/sec freed for ML model inference

This allows processing higher traffic rates or running more complex models on the same hardware.
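
The estimate above can be reproduced from the 8,192 × 512 row of the benchmark table. This is a small sketch with illustrative names; it uses the unrounded batch rate, so results may differ from hand-rounded figures in the last digit, and it also includes the optimized kernel's timing for comparison.

```python
def gpu_ms_per_second(packets_per_sec, batch_size, ms_per_batch):
    """GPU milliseconds consumed per wall-clock second of traffic."""
    batches_per_sec = packets_per_sec / batch_size
    return batches_per_sec * ms_per_batch

PACKETS_PER_SEC = 1_500_000
BATCH_SIZE = 8192

# Per-batch timings for the 8,192 x 512 row of the benchmark table (ms).
timings_ms = {"pytorch": 0.0658, "cuda": 0.0330, "optimized": 0.0186}

budget = {
    name: gpu_ms_per_second(PACKETS_PER_SEC, BATCH_SIZE, t)
    for name, t in timings_ms.items()
}

for name, ms in budget.items():
    print(f"{name:<10} {ms:5.1f} ms of GPU time per second")

freed = budget["pytorch"] - budget["cuda"]
print(f"freed by the original CUDA kernel: {freed:.1f} ms/sec")
```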

Algorithm

Each packet is normalized independently:

x_norm[i,j] = (x[i,j] - mean(x[i,:])) / (std(x[i,:]) + ε)

where ε = 1e-5 prevents division by zero.
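
As a reference for what the kernel computes, the formula can be written in a few lines of plain Python. This is a readability sketch, not the code in src/: the function names are made up here, and it assumes the population standard deviation (dividing by n), which the formula above does not specify.

```python
import math

EPS = 1e-5

def normalize_packet(features, eps=EPS):
    """Normalize one packet's feature vector to zero mean and unit std."""
    n = len(features)
    mean = sum(features) / n
    # Assumption: population variance (divide by n), not the n-1 estimator.
    variance = sum((x - mean) ** 2 for x in features) / n
    std = math.sqrt(variance)
    return [(x - mean) / (std + eps) for x in features]

def normalize_batch(batch, eps=EPS):
    """Apply the per-packet normalization to every row independently."""
    return [normalize_packet(row, eps) for row in batch]

batch = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 10.0, 10.0, 10.0],  # constant row: std is 0, eps avoids 0/0
]
normalized = normalize_batch(batch)
```

The second row shows why ε matters: with identical features the std is zero, and the output is all zeros instead of NaN.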

CUDA Kernel Design

Original Kernel:

  • 1 thread block per packet for independent processing
  • 256 threads per block for parallel reduction
  • Shared memory for fast mean/std computation
  • Coalesced memory access for optimal bandwidth
  • Single kernel launch instead of 3 separate launches
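
The block-per-packet reduction pattern can be simulated sequentially in Python to show the data flow. This is only an illustration of the general technique (strided accumulation followed by a shared-memory tree reduction), not a transcription of normalize_cuda_kernel.cu; the `shared` list stands in for `__shared__` memory and the loops stand in for threads that would run in parallel.

```python
BLOCK_SIZE = 256  # threads per block, as in the original kernel

def block_reduce_sum(row, block_size=BLOCK_SIZE):
    """Simulate one thread block summing one packet's features."""
    # Phase 1: thread t accumulates elements t, t + block_size, ...
    # Neighbouring threads touch neighbouring addresses (coalesced access).
    shared = [0.0] * block_size  # stands in for __shared__ memory
    for tid in range(block_size):
        for idx in range(tid, len(row), block_size):
            shared[tid] += row[idx]

    # Phase 2: tree reduction; each step halves the active threads.
    stride = block_size // 2
    while stride > 0:
        for tid in range(stride):  # on a GPU these run in parallel
            shared[tid] += shared[tid + stride]
        stride //= 2               # a __syncthreads() would sit here
    return shared[0]

row = [float(i) for i in range(512)]  # one packet with 512 features
total = block_reduce_sum(row)
mean = total / len(row)
```

The same reduction is repeated for the squared deviations to obtain the std, which is why the original kernel reads each row more than once.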

Optimized Kernel (1.31x faster):

  • Warp shuffle reductions - uses __shfl_down_sync() instead of shared memory (2x faster)
  • Welford's online algorithm - single-pass mean+variance computation (reads data once vs twice)
  • Vectorized memory - float4 loads/stores move 16 bytes per instruction (4x fewer memory instructions)
  • Reduced shared memory - 64 bytes vs 2KB (better occupancy)
  • Less thread divergence - only first warp does final reduction

See OPTIMIZATIONS.md for detailed analysis.
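
The combination of Welford's algorithm and a warp-level reduction can be sketched in plain Python. Each simulated lane makes one pass over its share of the row, and the partial (count, mean, M2) states are then merged pairwise in the same lane pattern a `__shfl_down_sync()` reduction uses. The merge step is the standard parallel variance update (Chan et al.); whether the kernel uses exactly this formulation is an assumption, and all function names here are illustrative.

```python
WARP_SIZE = 32

def welford_update(state, x):
    """Fold one value into a running (count, mean, M2) state."""
    count, mean, m2 = state
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return (count, mean, m2)

def welford_merge(a, b):
    """Combine two partial states (parallel variance update)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)

def warp_reduce(lanes):
    """Mimic a shuffle-down reduction: lane i merges with lane i + offset."""
    lanes = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        snapshot = list(lanes)  # a shuffle reads the pre-step values
        for lane in range(offset):
            lanes[lane] = welford_merge(snapshot[lane], snapshot[lane + offset])
        offset //= 2
    return lanes[0]

row = [float(i % 7) for i in range(128)]  # one packet, 128 features

# Each lane makes a single strided pass over its share of the row.
lanes = []
for lane in range(WARP_SIZE):
    state = (0, 0.0, 0.0)
    for idx in range(lane, len(row), WARP_SIZE):
        state = welford_update(state, row[idx])
    lanes.append(state)

count, mean, m2 = warp_reduce(lanes)
variance = m2 / count  # population variance (assumption)
```

No shared buffer is needed for the merge itself, which is consistent with the reduced shared-memory footprint listed above.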

Why Memory-Bound? Can We Do Better?

Arithmetic Intensity: 0.75 FLOPs/byte (6 FLOPs per element / 8 bytes read+write)

Why we stay memory-bound:

  • Normalization has limited computation: (x - mean) / std
  • Becoming compute-bound would require roughly 13x more math per element
  • The algorithm fundamentally requires reading/writing data
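
The roofline reasoning behind these points fits in a few lines. The arithmetic intensity comes straight from the figures above; the peak compute and bandwidth values are placeholders chosen only so that the ridge point reproduces the 13x ratio quoted above. They are assumptions, not measured or official GB200 specifications, and should be replaced with the numbers for the target GPU.

```python
def arithmetic_intensity(flops_per_element, bytes_per_element):
    return flops_per_element / bytes_per_element

def attainable_flops(ai, peak_flops, peak_bandwidth):
    """Roofline model: performance is capped by compute or by memory."""
    return min(peak_flops, ai * peak_bandwidth)

# From the text: 6 FLOPs per element, 4-byte read + 4-byte write.
ai = arithmetic_intensity(flops_per_element=6, bytes_per_element=8)

# Placeholder hardware numbers (assumptions, NOT measured GB200 specs).
PEAK_FLOPS = 78e12       # FLOP/s
PEAK_BANDWIDTH = 8e12    # bytes/s

ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs/byte where the two roofs meet
bound = "memory" if ai < ridge else "compute"
ceiling = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"AI = {ai} FLOPs/byte, ridge = {ridge} FLOPs/byte -> {bound}-bound")
print(f"extra math needed to reach the ridge: {ridge / ai:.0f}x")
print(f"attainable: {ceiling / 1e12:.1f} TFLOP/s")
```

Any kernel whose arithmetic intensity sits left of the ridge point is limited by `ai * peak_bandwidth`, so reducing bytes moved is the only lever that helps.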

What we optimized:

  • Reduced memory passes (6 ops → 2 ops) via kernel fusion
  • Warp shuffles + vectorized loads for max bandwidth efficiency
  • Achieved 19% bandwidth utilization at an arithmetic intensity (AI) of 0.75
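
One common way to estimate bandwidth utilization is to divide the useful bytes moved by the kernel time and compare the result against the GPU's peak bandwidth. The sketch below applies that to the optimized-kernel timings from the benchmark table, assuming one FP32 read and one FP32 write per element. How the 19% figure itself was measured is not stated here, so this is an illustration of the method rather than a reproduction of that number; `peak_gbs` must be supplied for the actual hardware.

```python
BYTES_PER_ELEMENT = 8  # 4-byte FP32 read + 4-byte FP32 write

def effective_bandwidth_gbs(batch_size, num_features, kernel_ms):
    """Useful bytes moved per second, in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = batch_size * num_features * BYTES_PER_ELEMENT
    return total_bytes / (kernel_ms * 1e-3) / 1e9

def utilization(achieved_gbs, peak_gbs):
    """Fraction of peak memory bandwidth actually used."""
    return achieved_gbs / peak_gbs

# Optimized-kernel timings from the benchmark table (ms).
optimized_runs = [
    (1024, 128, 0.0085),
    (4096, 256, 0.0103),
    (8192, 512, 0.0186),
    (16384, 1024, 0.0611),
]

bandwidths = [effective_bandwidth_gbs(b, f, ms) for b, f, ms in optimized_runs]
for (b, f, _), gbs in zip(optimized_runs, bandwidths):
    print(f"{b:>6} x {f:<5} {gbs:8.1f} GB/s")
```

Effective bandwidth rises with problem size in these figures, which suggests the smallest shapes are dominated by launch overhead rather than by data movement.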

Further gains would require:

  • Processing multiple batches with CUDA streams (19% → 40%+)
  • Fusing with next operation (matrix multiply, activation, etc.)
  • Using FP16 (2x less bandwidth, loses precision)

Being memory-bound is expected for this workload. The speedup comes from reducing memory traffic, not from adding compute.
