LLM systems engineer building reproducible evaluation, post-training, retrieval, and traceable agent infrastructure.
Portfolio | Selected repositories
I work on the engineering layer around model behavior: data pipelines, benchmark harnesses, retrieval diagnostics, verifier-guided inference, trace capture, and regression tests that make LLM systems measurable instead of anecdotal.
- Triton PR #10411: merged runtime cache-group integrity fix that treats incomplete cache groups as misses.
- Triton PR #10413: merged benchmarking reliability fix for single pruned autotune configs.
- PyTorch TorchTitan PR #3456: merged LoRA freezing fix for non-linear modules.
- Apache TVM PR #19818: merged ONNX frontend correctness fix preserving BatchNormalization inference mode.
- ONNX Runtime PR #29140: merged CUDA/FMHA kernel initialization fix for large-head variants.
- PyTorch issue #188023: reported a negative-stride DLPack crash in `torch.from_dlpack`, triaged as a crash/error-checking/numpy/dlpack bug with a follow-up fix PR opened.
- Treat every model claim as an artifact-backed systems claim: data version, command, hardware, metric, and failure boundary.
- Build evaluation loops that survive refactors: golden sets, deterministic runners, CI checks, and report provenance.
- Keep agent behavior inspectable: tool calls, retrieved sources, validators, retries, and escalation paths should be first-class data.
- Optimize for reproducible learning velocity: small models, single-GPU runs, tight ablations, and clear error analysis before scale.
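As a minimal sketch of the first principle, a metric can be stored alongside the fields that produced it, with a content hash that makes later edits detectable. `ClaimRecord`, its field names, and the example values are hypothetical placeholders for illustration, not a schema or result taken from any repository listed here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClaimRecord:
    """Hypothetical record tying one reported metric to what produced it."""

    data_version: str
    command: str
    hardware: str
    metric_name: str
    metric_value: float
    failure_boundary: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so identical fields always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# All values below are placeholders, not measured results.
record = ClaimRecord(
    data_version="example-set@v1",
    command="python eval.py --seed 0",
    hardware="1x NVIDIA L20",
    metric_name="pass@1",
    metric_value=0.0,
    failure_boundary="not evaluated beyond the example set",
)
print(record.fingerprint()[:12])
```

Storing the fingerprint next to a report means a changed command, dataset version, or metric value shows up as a changed hash rather than a silent edit.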
| Surface | What I build | Representative repos |
|---|---|---|
| Post-training and verifier-guided inference | SFT/DPO-style pipelines, executable checks, reward-labeled traces, benchmark exports | L20-CodeForge, repro-llm-stack |
| Retrieval and ranking evaluation | Query planning, citation checks, recall/MRR regression tests, reranker snapshots | signal-rag, retrieval-eval, finmteb-zh-reranker-sota, coreb-retrieval-sota |
| Structured generation benchmarks | NL2SQL, ordering reliability, multi-path inference, cost and robustness reporting | nl2sql-benchmark, order-delta-bench |
| Agent trace infrastructure | Scientific workflows, semantic judges, deterministic validators, trace review | scitrace-rl, CodeGraph |
| Efficient model systems | Quantization experiments, serving benchmarks, GPU instrumentation, compiler/runtime fixes | Triton PR #10411, Triton PR #10413, TorchTitan PR #3456, ONNX Runtime PR #29140, TVM PR #19818, PyTorch issue #188023, l20-stack, llm-quant-bench, l20-edu-135m-pretrain |
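The recall/MRR regression tests named in the retrieval row can be sketched as below. This is an illustrative example under stated assumptions, not code from the listed repositories: the function names, the toy golden set, and the baseline numbers are all hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, golden, k):
    """Average both metrics over every query in the golden set."""
    queries = sorted(golden)
    recall = sum(recall_at_k(runs[q], golden[q], k) for q in queries) / len(queries)
    mrr = sum(reciprocal_rank(runs[q], golden[q]) for q in queries) / len(queries)
    return {f"recall@{k}": recall, "mrr": mrr}


def regressions(current, baseline, tolerance=1e-9):
    """Names of metrics that dropped below the stored baseline."""
    return [name for name in baseline if current[name] < baseline[name] - tolerance]


# Toy golden set and retrieval output (hypothetical document ids).
golden = {"q1": {"d1", "d2"}, "q2": {"d3"}}
runs = {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d3"]}

metrics = evaluate(runs, golden, k=2)
print(metrics)  # {'recall@2': 0.75, 'mrr': 0.75}
assert regressions(metrics, {"recall@2": 0.75, "mrr": 0.75}) == []
```

In CI, the baseline dictionary would typically be loaded from a versioned file so that a drop in either metric fails the build.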
| Project | Signal | Evidence surface |
|---|---|---|
| L20-CodeForge | Single-L20 post-training and verifier-guided inference for executable code benchmarks | Reproduction scripts, artifact hashes, result boundaries |
| l20-stack | Single-L20 infra reference stack for kernels, dispatch policy, serving, and QLoRA smoke runs | Triton RMSNorm/RoPE+KV kernels, L20 benchmark reports, CUDA telemetry |
| nl2sql-benchmark | Text-to-SQL fine-tuning and multi-path inference with Qwen2.5-Coder-7B | Spider/BIRD-style evaluation, cost curves, export paths |
| finmteb-zh-reranker-sota | FinanceMTEB Chinese reranking snapshot with Qwen3-Reranker-8B | Public report, CI checks, leaderboard snapshot context |
| signal-rag | Retrieval workbench with query planning, citation checks, and extractive fallback | Recall evaluation, source-trust tiers, benchmark examples |
| scitrace-rl | Trace, validation, and reward infrastructure for scientific agents | Adversarial cases, semantic judge, deterministic validators |
| coreb-retrieval-sota | Reproducible CoREB retrieval benchmark snapshot | CI-backed artifacts, result provenance, and upstream submission issue |
| Upstream model-system PRs and reports | Merged fixes in Triton, PyTorch TorchTitan, Apache TVM, ONNX Runtime, plus a PyTorch crash report | Runtime/cache correctness, LoRA freezing behavior, benchmarking reliability, ONNX frontend behavior, CUDA/FMHA initialization, DLPack crash triage |
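To make the trace and deterministic-validator entries above concrete, here is a small sketch of agent steps stored as plain data and checked by a rule-based validator. `TraceEvent`, `ALLOWED_KINDS`, and `validate_trace` are hypothetical names chosen for illustration; they do not describe the actual schema of scitrace-rl or any other listed project.

```python
import json
from dataclasses import asdict, dataclass, field

# Assumed event vocabulary for this sketch only.
ALLOWED_KINDS = {"tool_call", "retrieval", "validation", "retry", "escalation"}


@dataclass
class TraceEvent:
    step: int
    kind: str
    payload: dict = field(default_factory=dict)


def validate_trace(events):
    """Return a list of problems; an empty list means the trace passes."""
    problems = []
    for expected_step, event in enumerate(events):
        if event.step != expected_step:
            problems.append(f"step {event.step}: expected step {expected_step}")
        if event.kind not in ALLOWED_KINDS:
            problems.append(f"step {event.step}: unknown kind {event.kind!r}")
    return problems


def to_jsonl(events):
    """Serialize one event per line so traces can be diffed and reviewed."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


trace = [
    TraceEvent(0, "retrieval", {"source": "doc-1"}),
    TraceEvent(1, "tool_call", {"name": "calculator"}),
    TraceEvent(2, "validation", {"passed": True}),
]
assert validate_trace(trace) == []
print(to_jsonl(trace))
```

Because the validator depends only on the recorded events, the same trace always yields the same verdict, which is what lets it run as a regression check.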
I try to make serious repositories answer five questions quickly:
| Question | Expected answer |
|---|---|
| What is the exact task? | Dataset, benchmark, workflow, or user problem is named up front. |
| How do I run it? | Setup and reproduction commands are visible from the README. |
| What should happen? | Expected outputs, metrics, report paths, or screenshots are documented. |
| What is proven? | Claims are tied to artifacts rather than vague demos. |
| Where does it fail? | Known limitations and next experiments are explicit. |
Core languages: Python, TypeScript, SQL, C++, CUDA, Swift, C#
Model systems: PyTorch, Transformers, LoRA/QLoRA, vLLM, lm-eval, Triton
LLM applications: RAG, retrieval evaluation, tool use, citation verification, structured generation
Infrastructure: FastAPI, SQLite, Docker, GitHub Actions, Make, CLI tooling, PostGIS, Redis, Kafka
Research direction: post-training, process supervision, agent evaluation, scientific reproducibility, AI4S infrastructure