RetinalGPT is a retinal multimodal assistant built on large vision-language models.
This repository contains the data construction pipeline used to build retinal instruction-following conversations for the paper:
- RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
- Hugging Face Model
The main workflow in this repo is:
- Build dataset-specific retinal descriptions through
Descclasses. - Construct two types of data:
instructionalignment
- Run the pipeline in one of two modes:
directgenerationbatchrequest packaging / unpacking
- Convert generated outputs into instruction-tuning JSONL / JSON files.
This repo is not the full end-to-end training codebase for the entire project. It focuses on the retinal data processing and conversation generation pipeline.
The environment follows the LLaVA base setup used for legacy v0 workflows in our project.
In practice, we use the standard LLaVA-style base environment and then install the extra packages needed by this repository:
conda create -n retinalgpt python=3.10 -y
conda activate retinalgpt
pip install --upgrade pip
pip install -r requirements.txtIf you already have a working LLaVA / llava-v0 style environment, you can usually reuse it directly. For more details on the upstream base setup, please refer to the official LLaVA repository.
RetinalGPT/
├── Instruction/
│ ├── Desc/ # Dataset-specific description builders
│ ├── configs/ # Config-driven dataset jobs
│ ├── experiments/ # Optional script-style experiment entrypoints
│ ├── sample/ # Minimal bring-your-own-data example
│ ├── tools/ # Bounding box and postprocess helpers
│ ├── pipeline_runner.py # Config-driven instruction/alignment runner
│ ├── batch_runner.py # Config-driven batch runner
│ ├── pipeline_prompts.py # Centralized instruction/alignment prompts
│ ├── batch_prompts.py # Centralized batch prompts
│ ├── instruction_gen_async.py # API-based conversation generation
│ ├── convert2json.py # Output parsing / JSON conversion
│ ├── utils.py # Shared helper functions
│ └── ...
├── figures/ # Paper assets and reference figures
├── requirements.txt
└── README.md
Each dataset is wrapped by a description class in Instruction/Desc. These classes map raw metadata into a unified text description that can be consumed by a large multimodal model.
Typical inputs include:
- image quality predictions
- fractal / vascular quantitative features
- disease labels
- lesion masks or bounding boxes
- dataset-specific metadata
The generated description is then appended with task-specific prompt instructions and sent to the API to produce a retinal conversation sample.
Instruction/Desc contains dataset-specific classes such as:
APTOSDescEyeQDescIDRIDDescMICCAIDescMessidorDescODIRDDescRFMiDDescUKDesc
All of them follow the same design goal: turn heterogeneous dataset annotations into a reusable natural-language description.
The project maintains two data tracks:
instruction: multi-turn retinal conversationsalignment: compact alignment-style supervision, usually one-turn
The project maintains two execution modes:
direct: call the API directly and write conversation outputsbatch: package local requests first, send them to the API server, then unpack returned outputs
Most users only need pipeline_runner.py, batch_runner.py, and Instruction/sample/. Instruction/experiments/ keeps the older script-style entrypoints in one place.
The main generation logic lives in:
Instruction/instruction_gen_async.py
This module supports:
- async API calls
- text-only generation
- image-conditioned generation
- compatibility with older script-style calls already present in this repo
For instruction / alignment construction, the main entrypoint is:
Instruction/pipeline_runner.py
For local batch request packaging and unpacking, the main entrypoint is:
Instruction/batch_runner.py
Both are config-driven and use dataset jobs defined in Instruction/configs/.
For the simplest single-image run:
python3 run_retinalGPT_simple.py \
--model-name ASU-GSL/RetinalGPT \
--image-file /path/to/retinal_image.png \
--question "Please describe this retinal image in detail."After downloading the RetinalGPT weights, you can run inference directly with:
python3 run_retinalGPT.py \
--model-name ASU-GSL/RetinalGPT \
--image-folder /path/to/images \
--question-file examples/inference/questions.json \
--answers-file /path/to/predictions.jsonlYou can also run batch inference with a JSON or JSONL question file:
python3 run_retinalGPT.py \
--model-name ASU-GSL/RetinalGPT \
--image-folder /path/to/images \
--question-file /path/to/questions.jsonl \
--answers-file /path/to/predictions.jsonlSupported batch input fields are:
idimageorimagesquestionquestionsmessages
For messages, the script automatically extracts user or human turns as questions.
A minimal example question file is provided at examples/inference/questions.json.
cd Instruction
python3 pipeline_runner.py UK_instruction_directcd Instruction
python3 batch_runner.py APTOScd Instruction
python3 sample/generate_instruction_conversations.py \
--metadata-csv sample/metadata_template.csv \
--image-dir /path/to/your/images \
--output-jsonl sample/generated_instruction_conversations.jsonlFor the minimal custom-data walkthrough, see Instruction/sample/README.md.
cd Instruction
python3 experiments/instruction/ins_UK.py
python3 experiments/batch/batch_file_APTOS.pyThe pipeline writes conversation samples into JSONL files with fields such as:
idimageconversations
These outputs can then be merged, cleaned, aligned, or converted into nested JSON using the helper scripts already included in Instruction/.
The intended engineering flow is now:
- Build hidden metadata with
Desc/* - Choose a dataset job from
Instruction/configs/ - Run either:
pipeline_runner.pyforinstruction/alignmentbatch_runner.pyfor batch request workflows
- Use
convert2json.py,utils.py, andInstruction/tools/for packing, unpacking, conversion, and utility workflows - Use
Instruction/sample/as the minimal bring-your-own-data example - Use
Instruction/experiments/only if you want the old script-style entrypoints
- This repository is intended for research and data construction.
- It is centered on retinal conversation generation and instruction data preparation.
Instruction/sampleis the recommended starting point for adapting the pipeline to a new dataset.Instruction/experimentskeeps the dataset-specific experiment scripts out of the main pipeline path.- Parts of the repository structure and code organization were optimized with OpenAI Codex under the authors' supervision.
If you find this project useful, please cite:
@article{zhu2025retinalgpt,
title={Retinalgpt: A retinal clinical preference conversational assistant powered by large vision-language models},
author={Zhu, Wenhui and Li, Xin and Chen, Xiwen and Qiu, Peijie and Vasa, Vamsi Krishna and Dong, Xuanzhao and Chen, Yanxi and Lepore, Natasha and Dumitrascu, Oana and Su, Yi and others},
journal={arXiv preprint arXiv:2503.03987},
year={2025}
}We thank the LLaVA and LLaVA-Med projects. Our training and evaluation code is built on top of their open-source vision-language modeling framework.

