High-performance CUDA kernel for real-time network packet preprocessing in intrusion detection systems.
Network security systems process millions of packets per second. Each packet must be normalized before ML inference:
Input: Raw packet features [batch_size × num_features]
Output: Normalized packets with mean=0, std=1 per packet
Challenge: At 10 Gbps, we process 1.5 million packets/sec. PyTorch's normalization is too slow for real-time processing.
Solution: Custom CUDA kernel with kernel fusion. It computes mean, std, and normalization in a single GPU pass instead of PyTorch's 3 separate kernels.
Result: The optimized kernel is 3-7.5x faster than PyTorch and handles real-time traffic with sub-millisecond latency.
| Batch Size | Features | PyTorch (ms) | CUDA (ms) | Optimized (ms) | Speedup (Optimized vs. PyTorch) |
|---|---|---|---|---|---|
| 1,024 | 128 | 0.0636 | 0.0082 | 0.0085 | 7.50x |
| 4,096 | 256 | 0.0632 | 0.0178 | 0.0103 | 6.11x |
| 8,192 | 512 | 0.0658 | 0.0330 | 0.0186 | 3.54x |
| 16,384 | 1,024 | 0.1872 | 0.0882 | 0.0611 | 3.06x |
Average speedup over PyTorch: 5.05x with the optimized kernel, 3.86x with the original CUDA kernel (the optimized kernel is 1.31x faster than the original on average).
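As a cross-check, the averages can be recomputed from the timings in the table. This is a small standalone sketch, not part of `benchmark.py`; the `RESULTS` list and helper names are made up for illustration. Because the table timings are rounded, per-row speedups recomputed this way can differ from the table in the last digit.

```python
# Timings in milliseconds, copied from the benchmark table.
# Columns: (batch_size, features, pytorch_ms, cuda_ms, optimized_ms)
RESULTS = [
    (1024, 128, 0.0636, 0.0082, 0.0085),
    (4096, 256, 0.0632, 0.0178, 0.0103),
    (8192, 512, 0.0658, 0.0330, 0.0186),
    (16384, 1024, 0.1872, 0.0882, 0.0611),
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def average_speedups(results):
    """Return (original CUDA, optimized) average speedups over PyTorch."""
    cuda = mean(row[2] / row[3] for row in results)
    optimized = mean(row[2] / row[4] for row in results)
    return cuda, optimized


cuda_avg, optimized_avg = average_speedups(RESULTS)
print(f"original CUDA vs PyTorch: {cuda_avg:.2f}x")
print(f"optimized vs PyTorch:     {optimized_avg:.2f}x")
print(f"optimized vs original:    {optimized_avg / cuda_avg:.2f}x")
```

The 1.31x figure is the ratio of the two averages, not the average of per-row ratios.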
# Run benchmark (auto-builds CUDA extension if needed)
./run_benchmark.sh
# Generate roofline analysis
python profile_kernel.py

├── src/
│ ├── normalize_pytorch.py # PyTorch baseline
│ ├── normalize_cuda_kernel.cu # Fused CUDA kernel
│ └── normalize_cuda.cpp # Python binding
├── benchmark.py # Performance comparison
├── profile_kernel.py # Roofline analysis
├── run_benchmark.sh # Main entry point
└── setup.py # Build script
PyTorch approach (slow):
1. Read data → compute mean → write mean
2. Read data → compute std → write std
3. Read data → normalize → write output
Total: 6 memory operations
Our approach (fast):
1. Read data → compute mean+std+normalize → write output
Total: 2 memory operations (3x reduction)
For this workload, GPU performance is limited by memory bandwidth, not compute. Fewer memory operations = faster execution.
Numba CUDA is easier to use but doesn't work here:
- GB200 has compute capability 10.0 (Blackwell architecture)
- Numba currently supports only up to 9.x (H100 and earlier)
- Results in segmentation faults on GB200
Custom C++/CUDA works on any GPU generation that the installed CUDA Toolkit can target.
- CUDA-capable GPU (compute capability 9.0+)
- PyTorch with CUDA support
- CUDA Toolkit 12.6+
- Python 3.10+
- matplotlib (for roofline plots)
For a 10 Gbps network processing 1.5M packets/sec in batches of 8K:
Batches per second: 1,500,000 / 8,192 = 183 batches/sec
PyTorch: 183 × 0.0658 ms = 12.0 ms/sec GPU time
Custom (original CUDA kernel): 183 × 0.0330 ms = 6.0 ms/sec GPU time
Savings: 6.0 ms/sec freed for ML model inference
This allows processing higher traffic rates or running more complex models on the same hardware.
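The GPU-time budget above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the per-batch latencies are taken from the 8,192 × 512 row of the benchmark table (PyTorch and the original CUDA kernel), and the function name is hypothetical.

```python
def gpu_time_ms_per_second(packets_per_sec, batch_size, batch_latency_ms):
    """GPU milliseconds spent on normalization per second of traffic."""
    batches_per_sec = packets_per_sec / batch_size
    return batches_per_sec * batch_latency_ms


PACKETS_PER_SEC = 1_500_000  # 10 Gbps traffic rate used in this README
BATCH_SIZE = 8192

# Latencies from the 8,192 x 512 benchmark row (assumed representative).
pytorch_ms = gpu_time_ms_per_second(PACKETS_PER_SEC, BATCH_SIZE, 0.0658)
custom_ms = gpu_time_ms_per_second(PACKETS_PER_SEC, BATCH_SIZE, 0.0330)

print(f"PyTorch: {pytorch_ms:.1f} ms/sec")
print(f"Custom:  {custom_ms:.1f} ms/sec")
print(f"Savings: {pytorch_ms - custom_ms:.1f} ms/sec")
```

Passing the optimized kernel's latency for the same row (0.0186 ms) instead shows the budget for the optimized path.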
Each packet normalized independently:
x_norm[i,j] = (x[i,j] - mean(x[i,:])) / (std(x[i,:]) + ε)
where ε = 1e-5 prevents division by zero.
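For reference, the formula can be written as a pure-Python function that is handy for checking kernel output on small inputs. This is a minimal sketch, not the code in `normalize_pytorch.py`. It assumes the population standard deviation (dividing by N); the README does not say whether the kernel uses N or N-1, so adjust if it differs.

```python
import math


def normalize_packet(features, eps=1e-5):
    """Normalize one packet's features to mean 0 and std close to 1."""
    n = len(features)
    mean = sum(features) / n
    # Assumption: population variance (divide by n, not n - 1).
    var = sum((x - mean) ** 2 for x in features) / n
    std = math.sqrt(var)
    # eps is added to std, as in the formula, so constant rows map to zeros.
    return [(x - mean) / (std + eps) for x in features]


def normalize_batch(batch, eps=1e-5):
    """Normalize each packet (row) independently."""
    return [normalize_packet(row, eps) for row in batch]


batch = [[1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0]]
normalized = normalize_batch(batch)
```

The second row is constant, so its standard deviation is zero and ε keeps the division well defined.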
- 1 thread block per packet for independent processing
- 256 threads per block for parallel reduction
- Shared memory for fast mean/std computation
- Coalesced memory access for optimal bandwidth
- Single kernel launch instead of 3 separate launches
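The block-per-packet design can be illustrated with a pure-Python simulation. This is not the CUDA source in `normalize_cuda_kernel.cu`; it is a sketch that assumes a strided per-thread accumulation followed by a shared-memory tree reduction, which is the usual pattern for this layout. Function names are hypothetical.

```python
import math

BLOCK_SIZE = 256  # threads per block, as listed above


def block_reduce_sum(values, block_size=BLOCK_SIZE):
    """Simulate one thread block summing a row (block_size: power of two)."""
    # Thread t accumulates elements t, t + block_size, t + 2*block_size, ...
    # Adjacent threads touch adjacent addresses, i.e. coalesced access.
    shared = [sum(values[t::block_size]) for t in range(block_size)]
    # Tree reduction in "shared memory": halve the active threads each step.
    stride = block_size // 2
    while stride > 0:
        for t in range(stride):
            shared[t] += shared[t + stride]
        stride //= 2
    return shared[0]


def normalize_row_blockwise(row, eps=1e-5):
    """Mean, std, and normalization for one packet inside one 'block'."""
    n = len(row)
    mean = block_reduce_sum(row) / n
    # Assumption: population variance (divide by n).
    var = block_reduce_sum([(x - mean) ** 2 for x in row]) / n
    std = math.sqrt(var)
    return [(x - mean) / (std + eps) for x in row]


def normalize_batch_blockwise(batch, eps=1e-5):
    """One simulated block per packet, all within a single 'launch'."""
    return [normalize_row_blockwise(row, eps) for row in batch]
```

On a GPU the inner `for t in range(stride)` loop runs in parallel with a barrier between steps; the sequential loop here only mirrors the data flow.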
- Warp shuffle reductions - uses `__shfl_down_sync()` instead of shared memory (2x faster)
- Welford's online algorithm - single-pass mean+variance computation (reads data once vs twice)
- Vectorized memory - `float4` loads/stores for 4x memory throughput
- Reduced shared memory - 64 bytes vs 2KB (better occupancy)
- Less thread divergence - only first warp does final reduction
See OPTIMIZATIONS.md for detailed analysis.
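To make the Welford and warp-shuffle ideas concrete, here is a pure-Python simulation of a single 32-lane warp. It is a sketch, not the kernel code: each lane runs Welford's update over its strided elements, then lanes are merged with the pairwise combine formula in the same offset pattern (16, 8, 4, 2, 1) that a `__shfl_down_sync()` reduction uses. All function names are hypothetical.

```python
WARP_SIZE = 32


def welford_update(state, x):
    """Fold one value into a running (count, mean, M2) state."""
    count, mean, m2 = state
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return (count, mean, m2)


def welford_merge(a, b):
    """Combine two partial (count, mean, M2) states into one."""
    na, mean_a, m2_a = a
    nb, mean_b, m2_b = b
    n = na + nb
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2_a + m2_b + delta * delta * na * nb / n
    return (n, mean, m2)


def warp_reduce(lanes):
    """Simulate a shuffle-down reduction; lane 0 ends with the total."""
    lanes = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        snapshot = list(lanes)  # all lanes read before any lane writes
        for lane in range(WARP_SIZE):
            src = lane + offset
            if src < WARP_SIZE:
                lanes[lane] = welford_merge(snapshot[lane], snapshot[src])
        offset //= 2
    return lanes[0]


def row_mean_var(row):
    """Single-pass mean and population variance of one packet."""
    lanes = []
    for lane in range(WARP_SIZE):
        state = (0, 0.0, 0.0)
        for x in row[lane::WARP_SIZE]:
            state = welford_update(state, x)
        lanes.append(state)
    count, mean, m2 = warp_reduce(lanes)
    return mean, m2 / count
```

Only lane 0's result is meaningful after the reduction, which matches the "only first warp does final reduction" pattern when the same merge is applied across warps.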
Arithmetic Intensity: 0.75 FLOPs/byte (6 FLOPs per element / 8 bytes read+write)
Why we stay memory-bound:
- Normalization has limited computation: `(x - mean) / std`
- To become compute-bound, the kernel would need 13x more math per element
- The algorithm fundamentally requires reading/writing data
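The memory-bound argument follows the standard roofline model, sketched below. The 6 FLOPs and 8 bytes per element come from this README; the peak compute and bandwidth values are hypothetical placeholders chosen so that the ridge point is 13x the kernel's arithmetic intensity, and should be replaced with the figures for the actual GPU.

```python
def arithmetic_intensity(flops_per_element, bytes_per_element):
    """FLOPs performed per byte moved to or from global memory."""
    return flops_per_element / bytes_per_element


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Arithmetic intensity at which a kernel becomes compute-bound."""
    return peak_gflops / peak_bandwidth_gbs


def attainable_gflops(ai, peak_gflops, peak_bandwidth_gbs):
    """Roofline ceiling: the lower of the compute and memory limits."""
    return min(peak_gflops, ai * peak_bandwidth_gbs)


def is_memory_bound(ai, peak_gflops, peak_bandwidth_gbs):
    return ai < ridge_point(peak_gflops, peak_bandwidth_gbs)


# 6 FLOPs per element over 8 bytes (FP32 read + FP32 write).
AI = arithmetic_intensity(6, 8)

# Hypothetical hardware numbers, NOT measured values for any real GPU.
PEAK_GFLOPS = 9750.0
PEAK_BANDWIDTH_GBS = 1000.0

print(f"arithmetic intensity: {AI} FLOPs/byte")
print(f"ridge point:          {ridge_point(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)}")
print(f"memory-bound:         {is_memory_bound(AI, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)}")
```

With these placeholder numbers the ceiling for the kernel is set entirely by bandwidth, so adding FLOPs per element costs nothing until the intensity reaches the ridge point.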
What we optimized:
- Reduced memory passes (6 ops → 2 ops) via kernel fusion
- Warp shuffles + vectorized loads for max bandwidth efficiency
- Achieved 19% bandwidth utilization (good for AI=0.75)
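Bandwidth utilization can be estimated from a measured latency as bytes moved divided by time, compared against the GPU's peak bandwidth. The sketch below assumes one FP32 read and one FP32 write per element (the 8 bytes used in the arithmetic-intensity figure) and takes the peak bandwidth as a parameter, since it is hardware-specific and not stated here. The function names are hypothetical and this is not the method used by `profile_kernel.py`.

```python
BYTES_PER_ELEMENT = 8  # one FP32 read + one FP32 write


def effective_bandwidth_gbs(batch_size, features, latency_ms):
    """Effective global-memory bandwidth in GB/s for one kernel call."""
    bytes_moved = batch_size * features * BYTES_PER_ELEMENT
    seconds = latency_ms * 1e-3
    return bytes_moved / seconds / 1e9


def utilization(effective_gbs, peak_gbs):
    """Fraction of peak bandwidth achieved (peak_gbs is hardware-specific)."""
    return effective_gbs / peak_gbs


# Example: the largest benchmark configuration with the optimized kernel.
largest = effective_bandwidth_gbs(16384, 1024, 0.0611)
print(f"effective bandwidth: {largest:.0f} GB/s")
```

Small batches finish in a few microseconds, so their effective bandwidth is dominated by launch overhead rather than memory throughput.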
Cannot optimize further without:
- Processing multiple batches with CUDA streams (19% → 40%+)
- Fusing with next operation (matrix multiply, activation, etc.)
- Using FP16 (halves memory traffic, at the cost of precision)
Memory-bound is correct for this workload. Peak speedup achieved through memory reduction, not more compute.
