Yin Li Kevin-Li-2025

Yin Li

LLM systems engineer building reproducible evaluation, post-training, retrieval, and traceable agent infrastructure.

Portfolio | Selected repositories

I work on the engineering layer around model behavior: data pipelines, benchmark harnesses, retrieval diagnostics, verifier-guided inference, trace capture, and regression tests that make LLM systems measurable instead of anecdotal.

Upstream Open Source

Triton PR #10411: merged runtime cache-group integrity fix that treats incomplete cache groups as misses.
Triton PR #10413: merged benchmarking reliability fix for single pruned autotune configs.
PyTorch TorchTitan PR #3456: merged LoRA freezing fix for non-linear modules.
Apache TVM PR #19818: merged ONNX frontend correctness fix preserving BatchNormalization inference mode.
ONNX Runtime PR #29140: merged CUDA/FMHA kernel initialization fix for large-head variants.
PyTorch issue #188023: reported a negative-stride DLPack crash in torch.from_dlpack, triaged as a crash, error-checking, NumPy, and DLPack bug; a follow-up fix PR has been opened.

Operating Thesis

Treat every model claim as an artifact-backed systems claim, stated with its data version, command, hardware, metric, and failure boundary.
Build evaluation loops that survive refactors: golden sets, deterministic runners, CI checks, and report provenance.
Keep agent behavior inspectable: tool calls, retrieved sources, validators, retries, and escalation paths should be first-class data.
Optimize for reproducible learning velocity: small models, single-GPU runs, tight ablations, and clear error analysis before scale.
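
To illustrate the first point, here is a minimal sketch of what an artifact-backed claim record could look like. It is not taken from any of the repositories listed here: the ClaimRecord name, its fields, and the example values are hypothetical, chosen only to mirror the five items named above (data version, command, hardware, metric, failure boundary).

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClaimRecord:
    """One model claim plus the context needed to re-run and audit it.

    Hypothetical structure for illustration only.
    """

    data_version: str      # dataset tag or content hash
    command: str           # exact command that produced the metric
    hardware: str          # device the run used
    metric: str            # metric name
    value: float           # reported value
    failure_boundary: str  # where the claim is known to stop holding

    def fingerprint(self) -> str:
        """Return a stable SHA-256 digest of the record.

        Keys are sorted so the digest does not depend on field order.
        """
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Example values are made up; they are not results from any real run.
example = ClaimRecord(
    data_version="golden-set-v1",
    command="python eval.py --split test --seed 0",
    hardware="1x L20",
    metric="pass@1",
    value=0.5,
    failure_boundary="untested on prompts longer than 4k tokens",
)
```

Storing the fingerprint next to a report makes it cheap to detect when any part of the claim (not only the metric value) has changed between runs.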

System Map

Surface	What I build	Representative repos
Post-training and verifier-guided inference	SFT/DPO-style pipelines, executable checks, reward-labeled traces, benchmark exports	L20-CodeForge, repro-llm-stack
Retrieval and ranking evaluation	Query planning, citation checks, recall/MRR regression tests, reranker snapshots	signal-rag, retrieval-eval, finmteb-zh-reranker-sota, coreb-retrieval-sota
Structured generation benchmarks	NL2SQL, ordering reliability, multi-path inference, cost and robustness reporting	nl2sql-benchmark, order-delta-bench
Agent trace infrastructure	Scientific workflows, semantic judges, deterministic validators, trace review	scitrace-rl, CodeGraph
Efficient model systems	Quantization experiments, serving benchmarks, GPU instrumentation, compiler/runtime fixes	Triton PR #10411, Triton PR #10413, TorchTitan PR #3456, ONNX Runtime PR #29140, TVM PR #19818, PyTorch issue #188023, l20-stack, llm-quant-bench, l20-edu-135m-pretrain
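
The recall/MRR regression tests mentioned in the retrieval row can be sketched roughly as follows. This is an illustrative outline, not code from signal-rag or retrieval-eval: the function names, the golden-set layout (query id mapped to a set of relevant document ids), and the 0.01 tolerance are all assumptions.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, List[str]], golden: Dict[str, Set[str]], k: int
) -> Dict[str, float]:
    """Average recall@k and MRR over every query in the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    queries = sorted(golden)  # fixed order keeps the runner deterministic
    recalls = [recall_at_k(runs.get(q, []), golden[q], k) for q in queries]
    rrs = [reciprocal_rank(runs.get(q, []), golden[q]) for q in queries]
    return {"recall@k": sum(recalls) / len(queries), "mrr": sum(rrs) / len(queries)}


def regressions(
    current: Dict[str, float], baseline: Dict[str, float], tolerance: float = 0.01
) -> List[str]:
    """Names of metrics that dropped more than `tolerance` below the baseline."""
    return [
        name
        for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    ]
```

In a CI job, a non-empty list from regressions would fail the build, with the baseline stored alongside the golden set so that both are versioned together.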

Selected Systems

Project	Signal	Evidence surface
L20-CodeForge	Single-L20 post-training and verifier-guided inference for executable code benchmarks	Reproduction scripts, artifact hashes, result boundaries
l20-stack	Single-L20 infra reference stack for kernels, dispatch policy, serving, and QLoRA smoke runs	Triton RMSNorm/RoPE+KV kernels, L20 benchmark reports, CUDA telemetry
nl2sql-benchmark	Text-to-SQL fine-tuning and multi-path inference with Qwen2.5-Coder-7B	Spider/BIRD-style evaluation, cost curves, export paths
finmteb-zh-reranker-sota	FinanceMTEB Chinese reranking snapshot with Qwen3-Reranker-8B	Public report, CI checks, leaderboard snapshot context
signal-rag	Retrieval workbench with query planning, citation checks, and extractive fallback	Recall evaluation, source-trust tiers, benchmark examples
scitrace-rl	Trace, validation, and reward infrastructure for scientific agents	Adversarial cases, semantic judge, deterministic validators
coreb-retrieval-sota	Reproducible CoREB retrieval benchmark snapshot	CI-backed artifacts, result provenance, and upstream submission issue
Upstream model-system PRs and reports	Merged fixes in Triton, PyTorch TorchTitan, Apache TVM, ONNX Runtime, plus a PyTorch crash report	Runtime/cache correctness, LoRA freezing behavior, benchmarking reliability, ONNX frontend behavior, CUDA/FMHA initialization, DLPack crash triage

Engineering Standard

I try to make serious repositories answer five questions quickly:

Question	Expected answer
What is the exact task?	Dataset, benchmark, workflow, or user problem is named up front.
How do I run it?	Setup and reproduction commands are visible from the README.
What should happen?	Expected outputs, metrics, report paths, or screenshots are documented.
What is proven?	Claims are tied to artifacts rather than vague demos.
Where does it fail?	Known limitations and next experiments are explicit.

Technical Vector

Core languages: Python, TypeScript, SQL, C++, CUDA, Swift, C#
Model systems: PyTorch, Transformers, LoRA/QLoRA, vLLM, lm-eval, Triton
LLM applications: RAG, retrieval evaluation, tool use, citation verification, structured generation
Infrastructure: FastAPI, SQLite, Docker, GitHub Actions, Make, CLI tooling, PostGIS, Redis, Kafka
Research direction: post-training, process supervision, agent evaluation, scientific reproducibility, AI4S infrastructure

Contact

Portfolio | GitHub
