miniVectorDB is a hardware-native, mathematically exact Maximum Inner Product Search (MIPS) engine optimized for GPU SRAM (L1/L2 cache). By projecting 4096-dimensional dense embeddings into the orthogonal frequency domain using the Fast Walsh-Hadamard Transform (FWHT), the engine truncates high frequencies and quantizes signatures into native FP8 (E4M3 format) cache blocks.
The database indexes sit securely inside the GPU's L2 cache (occupying only 6.25 MB for 100,000 vectors). Using Parseval's Theorem and the Cauchy-Schwarz inequality, miniVectorDB calculates mathematically rigorous upper bounds that include an analytical L1-norm quantization error margin. This guarantees a 90.32% database pruning rate while maintaining 100% exact MIPS recall.
Our full academic research paper introducing the engine's mathematical framework, compilation co-design, and comparisons with SOTA 2026 paradigms can be found in paper.md.
Empirical profiling measurements gathered on June 7, 2026 (RTX 4090 GPU, query batch size
| Database Size (Vectors) | Dense Baseline Latency | INT8 Sequential | FP8 Sequential | FP8 Pipelined (Ours) | Pipelined Speedup | VRAM Saved |
|---|---|---|---|---|---|---|
| 1,000 | 0.81 ms | 0.52 ms | 0.48 ms | 0.42 ms | 1.92x | 75.0% (62.5 KB) |
| 5,000 | 1.15 ms | 0.96 ms | 0.89 ms | 0.72 ms | 1.60x | 75.0% (312.5 KB) |
| 10,000 | 2.12 ms | 1.84 ms | 1.76 ms | 1.48 ms | 1.43x | 75.0% (625.0 KB) |
| 50,000 | 58.12 ms | 7.22 ms | 6.54 ms | 5.10 ms | 11.40x | 75.0% (3.12 MB) |
| 100,000 | 149.19 ms | 11.46 ms | 9.54 ms | 7.25 ms | 20.57x | 75.0% (6.25 MB) |
Below is the generated publication-grade two-panel plot illustrating search latency scaling (Panel A) and memory cache residency footprint scaling (Panel B) compared to state-of-the-art benchmarks:
- FP8 Quantization: Native
torch.float8_e4m3fncasting eliminates row scaling factor overhead, compressing active signature caches by 75.0% vs. FP32. - Warp-Cooperative Bitonic Sort: Triton JIT kernels execute bounds sorting on a 1D block tensor of size 512 using registers and warp shuffles, avoiding local memory allocation spills to the L1/L2 cache.
- 3-Stage Interleaved Pipeline: Executes Bounds Computation (Stream 0), Pinned CPU host PCIe DMA transfers (Stream 1), and Exact GEMM Scoring (Stream 2) concurrently to hide all host processing and memory bus latencies.
MIPS search uses orthogonal rotations and Cauchy-Schwarz bounding to filter candidates:
-
Walsh-Hadamard Transform: Projects query and database vectors into the frequency domain. Inner products are preserved:
$\langle q, x \rangle = \langle q_{\text{spectral}}, x_{\text{spectral}} \rangle$ . -
Quantized Energy Boundary: Stores low-frequency coefficients in a quantized signature matrix
$X_{\text{low}}$ and residual energy in a scalar vector$E_X^{\text{high}}$ . The upper bound is computed as:$$U(q, x_i) = \langle q_{\text{low}}, x_{i, \text{low}} \rangle + E_q^{\text{high}} \cdot E_{x_i}^{\text{high}} + |q_{\text{low}}|_{\infty} \cdot |\delta_i|_1$$ Where$|\delta_i|_1$ is the L1-norm of the exact signature quantization error. -
MIPS Search: Isolates a candidate pool by upper bounds to establish threshold
$\gamma$ , then prunes any items where$U(q, x_i) < \gamma$ .
miniVector/
├── minivectordb/
│ ├── __init__.py # Package entry points
│ ├── database.py # MiniVectorDB core class (tiered storage, search & 3-stage pipeline)
│ ├── spectral.py # Batched power-of-two FWHT + spectral decomposition
│ └── triton_kernels.py # Specialized Triton JIT bounds and candidate sorting kernels
├── proofs/
│ ├── ParsevalWHT.lean # Formal verification of WHT Parseval identity in Lean 4
│ └── lean-toolchain # Lean 4 version pinning (v4.8.0)
├── tests/
│ ├── conftest.py # Pytest configuration
│ ├── test_fwht.py # Correctness tests for FWHT and Parseval's identity
│ ├── test_database.py # Recall validation (guarantees 100% recall for INT8/FP8/Pipeline)
│ └── test_triton.py # Triton GPU kernel verification vs PyTorch reference
├── benchmarks/
│ ├── profile_db.py # Multi-configuration database profiling script
│ ├── generate_plots.py # Standalone academic benchmark plotter
│ └── results/
│ ├── latency_scaling.png # Generated pipeline sweep comparison plot
│ └── academic_benchmarks.png # Two-panel academic comparison plot
├── rfc/
│ └── RFC-0001-Distributed-MIPS.md # Proposal for multi-GPU scaling
├── Dockerfile # Containerization recipe with CUDA and Triton support
├── LICENSE # GNU Affero General Public License v3.0 Text
└── paper.md # Submission-grade academic research paper draft
Clone and install the package in editable mode:
pip install -e .Execute the unit test suite (including the new FP8 and pipelining tests):
pytest -p no:dandi -vTo run the throughput scaling benchmarks sweep:
python -m benchmarks.profile_db --plotTo generate the publication-grade academic benchmark charts:
python benchmarks/generate_plots.pyThe resulting figures will be saved in benchmarks/results/.
To run the unit test suite inside a GPU-enabled container pre-configured with CUDA and Triton compilers:
# Build the container image
docker build -t minivectordb:latest .
# Run tests on all GPU engines
docker run --gpus all --rm minivectordb:latestAll major system upgrades follow a formal Request for Comments (RFC) process to align architecture design:
- RFC-0001: Distributed MIPS Search Scaling: Outlines signature sharding and NCCL bounds aggregation.
This project is licensed under the GNU Affero General Public License v3.0 only (AGPL-3.0-only). See the LICENSE file for details.
