| title | Medical Triage OpenEnv |
|---|---|
| emoji | 🏥 |
| colorFrom | blue |
| colorTo | purple |
| sdk | docker |
| pinned | false |
An OpenEnv-compatible reinforcement learning environment for medical triage classification — where an LLM agent must assess patient severity under uncertainty, request clinical information strategically, and make time-critical decisions.
Built for the Meta × OpenEnv × Hugging Face SST Hackathon.
Medical triage is one of the most consequential sequential decision-making tasks in the real world. Every minute of delay in identifying an emergency costs lives. This environment models that pressure:
- The agent receives incomplete patient information and must decide: act now or gather more data?
- Wrong severity classification is penalized — calling a stroke "mild" matters
- Efficiency is rewarded — correct diagnosis in fewer steps scores higher
- The environment is designed to be genuinely useful for training and evaluating LLM-based clinical agents
Client → FastAPI Server → MedicalTriageEnv → LLM Agent → Action → Reward
/reset→ returns initial patient observation: symptoms, age, pain level- Agent reasons about severity — may request hidden clinical info if ambiguous
/step→ evaluates the action: reveals more info OR scores the severity prediction- Reward returned based on correctness + diagnostic efficiency
The agent combines LLM reasoning with a deterministic safety layer — neither alone is sufficient.
Structured prompt feeds symptoms, pain score, age, and any revealed clinical findings into Qwen3-30B-A3B at temperature=0. The model produces step-by-step medical reasoning before committing to a severity prediction.
Critical patterns are caught deterministically, regardless of LLM output:
| Pattern | Override | Clinical Rationale |
|---|---|---|
| Pediatric + breathing symptoms | → Emergency | Pediatric airways decompensate fast |
| Age 70+ + confusion/fatigue | → Emergency | Sepsis presentation in elderly |
| Fruity breath | → Emergency | Diabetic ketoacidosis (DKA) |
| Slurred speech / facial droop | → Emergency | Stroke — time-to-treatment critical |
| Rebound tenderness (hidden) | → Emergency | Peritonitis / surgical abdomen |
| ST elevation (hidden) | → Emergency | STEMI — minutes matter |
| Ambiguous abdominal pain, no hidden info | → request_more_info |
Rule out ectopic, appendicitis |
This two-stage design reflects real clinical practice: experienced clinicians use heuristics for high-acuity patterns and deliberate reasoning for ambiguous presentations.
A key design decision: the agent is not forced to commit on incomplete information.
When clinical presentation is ambiguous, the agent can issue request_more_info — triggering the environment to reveal hidden clinical findings (lab values, vitals, secondary symptoms). This models the real triage workflow where a nurse asks follow-up questions before routing a patient.
This makes the environment genuinely multi-step and forces the agent to balance information gain against time cost.
Graders are difficulty-tiered (Easy / Medium / Hard) and score across four dimensions:
| Component | What it measures |
|---|---|
CORRECT_SEVERITY |
Did the agent classify the right tier? |
CORRECT_SERVICE |
Appropriate care pathway (emergency vs urgent vs GP)? |
CORRECT_PATTERN |
Did the agent recognize the clinical pattern? |
TIME_BONUS_FAST |
Bonus for correct diagnosis in a single step |
Score ranges — all strictly in (0, 1):
| Outcome | Easy | Medium | Hard |
|---|---|---|---|
| Correct, 1 step | ~0.90 | ~0.85 | ~0.77 |
| Correct, 2+ steps | ~0.85 | ~0.80 | ~0.72 |
| Off by one tier | 0.30 | 0.25 | 0.20 |
| Wrong / null | 0.05 | 0.05 | 0.05 |
Scores never reach 0 or 1 — boundary values are explicitly excluded to reflect real-world uncertainty in clinical judgment.
15 tasks across three difficulty tiers, each requiring progressively more nuanced reasoning:
Easy — Clear symptom-severity mapping. Single-step diagnosis expected. Classic presentations (ankle sprain → Mild, chest pain + ST elevation → Emergency).
Medium — Ambiguous initial presentation. Hidden info reveal likely needed. Age and comorbidity context matters.
Hard — Atypical presentations, misleading initial symptoms, high penalty for missed emergencies. Requires correct identification of rare but critical patterns (DKA, sepsis in elderly, ectopic pregnancy).
| Endpoint | Method | Description |
|---|---|---|
/reset |
POST | Initialize episode, returns patient observation |
/step |
POST | Submit assign_severity or request_more_info action |
/close |
POST | End episode, finalize scoring |
- Fully Dockerized — runs anywhere
- Deployed on Hugging Face Spaces (Docker SDK)
- FastAPI server on port
7860 - Passes all OpenEnv Phase 1 + Phase 2 validation checks
git clone https://github.com/nish-debug15/openenv-agent.git
cd openenv-agent
pip install -r requirements.txt
python -m server.appModel: Qwen/Qwen3-30B-A3B (via Hugging Face router)
Temperature: 0 (deterministic, reproducible evaluation)
Max steps: 10 per episode
Fallback: Rule-based override catches all critical safety patterns
- OpenEnv Phase 1 — structural + spec compliance
- Docker build + inference execution
- Output parsing (
[START]/[STEP]/[END]format) - Task validation — 15/15 tasks scored
- LLM criteria check
- Phase 2 — Submission Validated
openenv-agent/
├── server/
│ ├── app.py # FastAPI server
│ ├── medical_triage_env_environment.py # Core RL environment
│ ├── graders.py # EasyGrader, MediumGrader, HardGrader
│ └── models.py # Typed observation/action/reward models
├── Dockerfile
├── inference.py # Agent logic + output formatting
├── tasks.py # 15 triage scenarios
├── graders.py # Internal grade() function
├── openenv.yaml # OpenEnv spec config
├── requirements.txt
└── README.md
- Nishit Patel (Lead)
- Pranav Adhikari
- Rahul Kiran
MIT License