Anay Dongre MrAnayDongre

Anay Dongre

ML Systems Engineer

I build LLM inference systems, agentic AI platforms, and GPU kernels.

🔨 Shipped

PatchQuest

Agentic coding harness for small models

12-phase deterministic pipeline that makes small and open-weight models reliable on real repository tasks. Tree-sitter symbol extraction builds a code graph for repo intelligence. SecretGuard credential scanning. 4-tier command-risk gating. Docker sandboxing with network isolation. Mock mode for keyless evaluation. Solo-built end-to-end.
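As a rough illustration of what tiered command-risk gating can look like, here is a minimal sketch. The tier names, the rule table, and the functions `classify` and `is_allowed` are hypothetical and chosen for this example only; they are not PatchQuest's actual rules or API.

```python
import shlex
from enum import IntEnum


class Risk(IntEnum):
    """Four illustrative tiers, ordered from least to most risky (assumed names)."""
    SAFE = 0      # read-only inspection
    LOW = 1       # writes confined to the workspace
    HIGH = 2      # network access, installs, history rewrites
    BLOCKED = 3   # destructive or privilege-escalating


# Hypothetical rule table: program name -> tier.
RULES = {
    "ls": Risk.SAFE, "cat": Risk.SAFE, "grep": Risk.SAFE,
    "mkdir": Risk.LOW, "touch": Risk.LOW, "pytest": Risk.LOW,
    "pip": Risk.HIGH, "curl": Risk.HIGH, "git": Risk.HIGH,
    "sudo": Risk.BLOCKED, "rm": Risk.BLOCKED,
}


def classify(command: str) -> Risk:
    """Map a shell command string to a risk tier using its first token."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unparseable input (e.g. an unterminated quote) is refused outright.
        return Risk.BLOCKED
    if not tokens:
        return Risk.SAFE
    # Unknown programs default to HIGH so they need explicit approval.
    return RULES.get(tokens[0], Risk.HIGH)


def is_allowed(command: str, max_risk: Risk) -> bool:
    """Allow a command only if its tier does not exceed the session ceiling."""
    return classify(command) <= max_risk
```

For example, `is_allowed("pytest -q", Risk.LOW)` passes the gate, while `is_allowed("pip install requests", Risk.LOW)` does not. A real gate would also need to inspect arguments, pipes, and subshells, which this sketch ignores.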

8 LLM providers including

EigenTune

CUDA kernels for SVD-based model optimization

Parameter-efficient fine-tuning via singular value decomposition (SVD). Decomposes weight matrices, freezes the U/V bases, and learns lightweight scalars. Hand-written CUDA kernels managing GPU shared memory, thread synchronization, and warp-level execution. Profiled with Nsight Compute. ONNX export for cross-architecture deployment. HuggingFace Trainer integration. Full test suite, CI/CD.
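The idea of freezing the U/V bases and training only a handful of scalars can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not EigenTune's API: the class `SVDLinear` and the helpers are hypothetical, the factors are assumed to be precomputed, and the trainable scalars are assumed to rescale the singular values one by one.

```python
import math


def rotation(theta):
    """2x2 rotation matrix, used here as a stand-in orthonormal basis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(m, x):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


class SVDLinear:
    """Computes y = U diag(sigma * scale) V^T x with U, sigma, V frozen.

    Only `scale` (one value per singular value) would be updated in training.
    """

    def __init__(self, u, sigma, v):
        self.u = u                         # frozen left basis
        self.sigma = sigma                 # frozen singular values
        self.vt = transpose(v)             # frozen right basis, transposed
        self.scale = [1.0] * len(sigma)    # trainable scalars, identity at init

    def forward(self, x):
        z = matvec(self.vt, x)                                   # project onto V
        z = [s * g * zi for s, g, zi in zip(self.sigma, self.scale, z)]
        return matvec(self.u, z)                                 # map back through U

    def num_trainable(self):
        return len(self.scale)
```

With `scale` left at ones the layer reproduces the original dense weight exactly, so fine-tuning starts from the pretrained behavior and adjusts only `len(sigma)` numbers per matrix.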

700+ installs · pip install eigentune

Nano LLAMA

221M-parameter transformer from scratch

Full decoder-only transformer built at the numerical level in PyTorch. Implements RMSNorm, Rotary Positional Embeddings (RoPE), SwiGLU gated activations, grouped multi-head attention, mixed-precision training (fp16/bf16), and gradient checkpointing. Trained on a single 4GB GPU with reproducible scripts. No cloud budget, no framework wrappers, just raw PyTorch.
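For readers unfamiliar with the building blocks named above, here is a minimal scalar-level sketch of RMSNorm, RoPE, and the SwiGLU gate in plain Python. It is not the project's PyTorch code. The function names are made up for this example, RoPE uses the interleaved-pair convention with base 10000 as an assumption, and `swiglu_gate` takes already-projected inputs rather than applying the linear layers itself.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by an angle proportional to the position."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def swiglu_gate(gate, up):
    """Element-wise SiLU(gate) * up; the surrounding projections are omitted."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))
```

The property that makes RoPE useful for attention is that `dot(rope(q, m), rope(k, n))` depends only on the offset `m - n`, so attention scores encode relative position without a separate embedding table.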

Parameter Golf

Mixed-precision quantization pipeline

Custom mixed int6/int8 quantization with per-row clip-search calibration for 24–27M parameter transformers. Per-layer numerical error analysis under strict model-size constraints. Investigated accuracy-compression tradeoffs across quantization configurations.
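A per-row clip search can be sketched as follows: for each weight row, try several clipping thresholds, quantize symmetrically, and keep the threshold with the lowest squared reconstruction error. This is a simplified illustration, not the project's pipeline. The function names, the candidate fractions, and the squared-error objective are assumptions; which rows or layers get 6 bits versus 8 bits is left to the caller.

```python
def quantize_row(row, bits, clip):
    """Symmetric quantization of one row to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = clip / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale


def reconstruction_error(row, q, scale):
    """Sum of squared differences between the row and its dequantized form."""
    return sum((v - qi * scale) ** 2 for v, qi in zip(row, q))


def clip_search(row, bits, fractions=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the clip threshold (a fraction of max |w|) with the lowest error."""
    amax = max(abs(v) for v in row)
    if amax == 0.0:
        return [0] * len(row), 1.0        # all-zero row: any scale works
    best = None
    for f in fractions:
        q, scale = quantize_row(row, bits, f * amax)
        err = reconstruction_error(row, q, scale)
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]
```

Because the unclipped threshold (fraction 1.0) is always among the candidates, the search can never do worse than plain max-abs scaling on this error measure. Clipping pays off when a row has a few large outliers that would otherwise stretch the quantization grid.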

📄 Research

📌 TransKV: Transactional KV Staging for Speculative Decoding under Paged KV Memory · TechRxiv (IEEE), 2026 · Under submission to EMNLP 2026 workshop

Speculative decoding inflates KV cache pages and fragments memory under load. TransKV isolates speculative state in staging blocks, commits only accepted tokens, and rolls back the rest. 14–41% reduction in committed-cache write traffic with formal output-equivalence proofs across Qwen2.5 model pairs.
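The stage, commit, and roll-back flow described above can be pictured with a toy data structure. This sketch only illustrates the transactional idea. It is not the paper's implementation, it ignores paging and block allocation entirely, and the class `StagedKVCache` with its method names is hypothetical.

```python
class StagedKVCache:
    """Toy KV cache that keeps unverified draft entries out of committed storage."""

    def __init__(self):
        self.committed = []         # entries visible to all later decoding steps
        self.staging = []           # entries for draft tokens awaiting verification
        self.committed_writes = 0   # how many entries ever reached committed storage

    def stage(self, kv_entries):
        """Buffer KV entries produced while the draft model speculates."""
        self.staging.extend(kv_entries)

    def commit(self, num_accepted):
        """Promote the accepted prefix and discard the rejected suffix."""
        if not 0 <= num_accepted <= len(self.staging):
            raise ValueError("num_accepted must be within the staged range")
        accepted = self.staging[:num_accepted]
        self.committed.extend(accepted)
        self.committed_writes += len(accepted)
        self.staging.clear()        # rollback: rejected entries never touch committed

    def rollback(self):
        """Drop all staged entries, e.g. when every draft token is rejected."""
        self.staging.clear()
```

If four draft tokens are staged and the verifier accepts two, `commit(2)` writes two entries to committed storage instead of four, which is the kind of write saving the staging approach targets.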

📌 Blockchain-Based E-Voting with Proof-of-Work and ML — IET Blockchain, 2023 (peer-reviewed)

📌 NeRF: A Comprehensive Survey — IJISRT, 2023

🔧 Open Source

I contribute to ML infrastructure that other engineers depend on.

vLLM · PR #44693 — Runtime memory optimization and KV-cache scheduling

CocoIndex · PR #1010 — Native Rust/PyO3 crate, throughput bottleneck fix

sglang · Structured generation, inference scheduling

nano-vllm · Lightweight LLM inference engine

✍️ Writing

3× Kaggle Master · Codeforces · AWS ML Specialty · Previously at Aerolift.AI and JPMorgan Chase
