MedGemma Impact Challenge 2026 — Main Track + Edge AI Prize
- Radu Alexa — Board-Certified Uro-Oncologist. Clinical lead, domain expert, sole reviewer of 1,242 AI outputs.
- Ayush Nangia — AI Researcher. Model pipeline, web application, infrastructure, data engineering.
- Aman Gokrani — AI Researcher, ex-Microsoft Health AI. Judge system, evaluation framework, statistical analysis.
This project evaluates whether MedGemma can serve as both a therapy recommendation engine and an automated quality evaluator for German renal cell carcinoma (RCC) cases — validated against expert physician judgment.
We deploy MedGemma 27B in two roles across 69 anonymized RCC cases from German tumor boards:
- Therapy Predictor — generates guideline-concordant therapy recommendations from structured clinical data
- Structured Medical Judge — evaluates predictions across semantic match, clinical appropriateness, and reasoning quality
| Metric | MedGemma 27B |
|---|---|
| Clinically Acceptable | 75.4% |
| Exact Therapy Match | 46.4% |
| Doctor Quality Score | 6.1/9 |
| Judge Cohen's Kappa (vs Doctor) | 0.675 |
| Judge F1 Score | 0.873 |
| Self-Judging Bias | p=0.017 (significant) |
- Best predictor among 6 models tested (MedGemma 27B, Gemma 3 27B, Gemma 3 4B, OLMo 32B Instruct, OLMo 32B Think, Meditron3 7B)
- Best judge, with substantial agreement (kappa=0.675) against a board-certified uro-oncologist
- Key discovery: statistically significant self-judging bias when MedGemma evaluates its own predictions
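For readers unfamiliar with the agreement metrics in the table above, here is a minimal sketch of how Cohen's kappa and F1 can be computed for binary judge-versus-doctor labels. The label lists are toy values, not project data, and the function names are illustrative. The project's own analysis lives in scripts/evaluate_treatment_llm_judge.py and may be implemented differently; scikit-learn, which is among the listed dependencies, offers equivalent functions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_score(reference, predicted, positive=1):
    """F1 of `predicted` against `reference` for the `positive` class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels (1 = "clinically acceptable"); NOT taken from the study.
doctor = [1, 1, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1]

print(f"kappa = {cohens_kappa(doctor, judge):.3f}")  # kappa = 0.467
print(f"F1    = {f1_score(doctor, judge):.3f}")      # F1    = 0.800
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is reported alongside F1 rather than instead of it.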
- Tier 1 — LLM Prediction: 6 models generate therapy recommendations from structured clinical JSON (414 total predictions)
- Tier 2 — AI Judge: GPT-5.2 and MedGemma 27B independently score each prediction (828 judge evaluations)
- Tier 3 — Expert Validation: Board-certified uro-oncologist reviews all model predictions AND all judge evaluations (1,242 reviews, 100% coverage)
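The counts quoted for the three tiers follow directly from the study design. The short sketch below reproduces that arithmetic; the variable names are illustrative and do not come from the project code.

```python
# Study design figures as stated in this README.
N_CASES = 69
PREDICTOR_MODELS = [
    "MedGemma 27B", "Gemma 3 27B", "Gemma 3 4B",
    "OLMo 32B Instruct", "OLMo 32B Think", "Meditron3 7B",
]
JUDGE_MODELS = ["GPT-5.2", "MedGemma 27B"]

# Tier 1: every model predicts a therapy for every case.
predictions = N_CASES * len(PREDICTOR_MODELS)
# Tier 2: every judge scores every prediction.
judge_evaluations = predictions * len(JUDGE_MODELS)
# Tier 3: the expert reviews all predictions and all judge evaluations.
expert_reviews = predictions + judge_evaluations

print(predictions, judge_evaluations, expert_reviews)  # 414 828 1242
```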
- Live Demo: medical-review-site.vercel.app
- Evaluation Report: findings/evaluation_report_2026-02-18.md
Med_LLM/
├── scripts/ # Python scripts
│ ├── modal_treatment_predict.py # Modal vLLM inference (main predictor)
│ ├── modal_judge.py # Modal vLLM judge evaluation
│ ├── evaluate_treatment_llm_judge.py # Statistical analysis
│ ├── generate_evaluation_report.py # DOCX report generation
│ ├── create_summary_report.py # Summary metrics report
│ └── ... # Classification, conversion, validation
├── converted_data/ # Converted clinical data (JSON, CSV)
├── results/ # Experiment results by provider
│ ├── modal_treatment/ # Modal vLLM treatment predictions
│ ├── modal_judge/ # Modal vLLM judge evaluations
│ └── openrouter/ # OpenRouter API results
├── findings/ # Reports and analysis
│ ├── evaluation_report_*.md # Comprehensive evaluation report
│ └── SUMMARY_REPORT.md # Model comparison summary
└── documentation/ # Project documentation
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt # or: pip install python-docx openpyxl pandas requests scikit-learn matplotlib seaborn python-pptx
cp .env.example .env
# Edit .env and add your keys:
# OPENROUTER_API_KEY=sk-or-v1-... (for OpenRouter inference)
# HF_TOKEN=hf_... (for gated models on Modal)
# Treatment predictions via Modal vLLM on H100 GPUs
modal run scripts/modal_treatment_predict.py
# Judge evaluations
modal run scripts/modal_judge.py
Reproducibility is handled automatically inside the Modal scripts:
- seed=42 for deterministic sampling
- VLLM_BATCH_INVARIANT=1 for batch-size-independent outputs
- CUBLAS_WORKSPACE_CONFIG=:4096:8 for deterministic CUDA operations
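As a rough illustration of the reproducibility settings listed above, the sketch below shows one way they could be applied in Python. This is an assumption about shape, not the contents of the Modal scripts: the helper names are hypothetical, and the returned keyword arguments are only meant to mirror the `seed` and `temperature` fields that vLLM's sampling parameters accept. The temperature values are the ones stated in this README (0.3 standard, 0.6 thinking models).

```python
import os

# Environment variables named in this README.
DETERMINISM_ENV = {
    "VLLM_BATCH_INVARIANT": "1",           # batch-size-independent outputs
    "CUBLAS_WORKSPACE_CONFIG": ":4096:8",  # deterministic cuBLAS workspace
}


def apply_determinism_env(env=None):
    """Write the determinism variables into `env` (default: os.environ).

    Hypothetical helper. It should run before any CUDA or vLLM
    initialisation, since these variables are read at start-up.
    """
    target = os.environ if env is None else env
    for name, value in DETERMINISM_ENV.items():
        target[name] = value
    return target


def sampling_kwargs(thinking_model=False):
    """Seed and temperature as stated in the README (hypothetical helper)."""
    return {"seed": 42, "temperature": 0.6 if thinking_model else 0.3}


if __name__ == "__main__":
    apply_determinism_env()
    print(sampling_kwargs())                     # standard model settings
    print(sampling_kwargs(thinking_model=True))  # thinking model settings
```

Keeping the settings in one place makes it harder for the predictor and judge runs to drift apart in their sampling configuration.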
# Generate statistical analysis and report
python scripts/evaluate_treatment_llm_judge.py
python scripts/create_summary_report.py
- Inference: Modal vLLM on NVIDIA H100 GPUs
- Structured output: Pydantic schema enforcement via vLLM grammar-guided generation
- Reproducibility: Deterministic seed (42), fixed temperature (0.3 for standard models, 0.6 for thinking models), and vLLM batch invariance (VLLM_BATCH_INVARIANT=1), so outputs are identical regardless of concurrent batch size, eliminating batch-dependent GPU kernel non-determinism
- Web application: Next.js, Supabase PostgreSQL, deployed on Vercel
- Language: All prompts, cases, and evaluations in German medical terminology
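To make the structured-output point above more concrete, here is a small stand-in for schema-checked judge output. It deliberately uses only the standard library (dataclasses and json) rather than Pydantic, which the project actually uses, so it illustrates the idea of rejecting malformed output, not the project's schema. The field names follow the three judge dimensions named earlier in this README; the field types and the 1 to 3 scale are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical judge record; the real Pydantic schema is not shown here."""
    semantic_match: bool
    clinically_appropriate: bool
    reasoning_quality: int  # hypothetical 1-3 scale

    def __post_init__(self):
        for name in ("semantic_match", "clinically_appropriate"):
            if not isinstance(getattr(self, name), bool):
                raise TypeError(f"{name} must be a boolean")
        score = self.reasoning_quality
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise TypeError("reasoning_quality must be an integer")
        if not 1 <= score <= 3:
            raise ValueError("reasoning_quality must be between 1 and 3")


def parse_verdict(raw):
    """Parse a JSON string into a validated JudgeVerdict.

    Missing or unexpected keys raise TypeError from the dataclass constructor.
    """
    return JudgeVerdict(**json.loads(raw))


verdict = parse_verdict(
    '{"semantic_match": true, "clinically_appropriate": true, "reasoning_quality": 2}'
)
print(verdict)
```

With grammar-guided generation the model is constrained to emit output that already fits the schema, so a check like this acts as a second line of defence rather than the primary mechanism.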
This project contains anonymized clinical data. All patient data has been de-identified in accordance with German data protection regulations.