⚠️ Important Notice for Normalization Mode
To enable SNOMED-CT normalization mode (-N flag), users must first register on the Spanish SNOMED-CT licensing platform:
https://snomed-ct.sanidad.gob.es/snomed-ct/solicitudLicencia.do
After obtaining access, download the required datasets as specified at the beginning of the script create_snomed_normalized_icd_dataset.py.
These include the Spanish and International SNOMED-CT description files and the ExtendedMapSnapshot. The filenames and exact paths are provided within the script.
A sophisticated tool for extracting and normalizing rheumatologic diagnoses from Spanish clinical evolution texts using local LLMs and SNOMED-CT mappings.
Medical Evolution Text Analyzer processes Spanish-language clinical evolution notes to extract principal rheumatologic diagnoses and corresponding ICD-10 codes. It leverages local language models via the Ollama API and integrates a SNOMED-CT-based normalization module to enhance diagnosis consistency.
- Summarizes long clinical notes to fit context windows
- Extracts principal rheumatologic diagnoses using strict prompt engineering
- Maps diagnoses to ICD-10 codes through direct or SNOMED-enhanced logic
- Applies fuzzy and keyword-based validation of results
- Evaluates multiple LLMs in parallel
- Provides performance metrics and result visualizations
- Command-line driven with multiple modes and settings
- Python 3.10+
- Ollama running locally
- SNOMED-CT and ICD datasets (included in the repository)
- Clone the repository:
git clone https://github.com/username/evolution-text-analysis.git
cd evolution-text-analysis
- Install the uv package manager (recommended; see the uv installation documentation for details):
pip install uv
- Install dependencies using uv:
uv sync
- Ensure Ollama is installed and running:
https://ollama.com/download
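Before launching an analysis, it can help to confirm that the local Ollama server actually answers. The following is a minimal sketch, not the project's own startup check: it assumes Ollama listens on its default address (http://localhost:11434) and probes the /api/tags endpoint, which lists locally installed models. The function name is_ollama_running is hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: Ollama listens on its default local address.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def is_ollama_running(base_url: str = DEFAULT_OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if a server answers HTTP 200 at <base_url>/api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not running".
        return False
```

Calling is_ollama_running() with no arguments checks the default address; pass a different base_url if Ollama is configured to listen elsewhere.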
⚠️ No need to manually prepare SNOMED-CT files — a normalized version is bundled and ready to use.
uv run main.py [options]

| Argument | Description |
|---|---|
| -f, --filename | File with evolution texts (.csv or .json) |
| -m, --mode | Test mode: 1 (all models), 2 (manual selection) |
| -b, --batches | Number of texts per processing batch |
| -n, --num-texts | Max number of texts to process |
| -W, --context-window | Max token context window size |
| -t, --test | Enable test mode for evaluation |
| -i, --installed | Only use installed models |
| -v, --verbose | Print detailed processing info |
| -N, --normalize | Use SNOMED-CT for ICD normalization |
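The options in the table can be modeled with argparse as sketched below. This is an illustration of how such a command line might be declared, not the code in utils.py: the argument types, the lack of defaults, and the function name build_parser are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the table above (types are assumed, not confirmed)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-f", "--filename", help="File with evolution texts (.csv or .json)")
    parser.add_argument("-m", "--mode", type=int, choices=(1, 2),
                        help="Test mode: 1 (all models), 2 (manual selection)")
    parser.add_argument("-b", "--batches", type=int, help="Number of texts per processing batch")
    parser.add_argument("-n", "--num-texts", type=int, help="Max number of texts to process")
    parser.add_argument("-W", "--context-window", type=int, help="Max token context window size")
    parser.add_argument("-t", "--test", action="store_true", help="Enable test mode for evaluation")
    parser.add_argument("-i", "--installed", action="store_true", help="Only use installed models")
    parser.add_argument("-v", "--verbose", action="store_true", help="Print detailed processing info")
    parser.add_argument("-N", "--normalize", action="store_true",
                        help="Use SNOMED-CT for ICD normalization")
    return parser


# Short boolean flags can be combined, and values can be attached directly.
args = build_parser().parse_args(["-tN", "-m2"])
print(args.test, args.normalize, args.mode)
```

Because the boolean options use store_true, combined forms such as -tiv or -tN parse as several independent switches, and -n50 is read as --num-texts 50.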
# Analyze with default config (optimal model)
uv run main.py
# Evaluate all installed models
uv run main.py -tiv
# Evaluate one selected model with SNOMED normalization
uv run main.py -tN -m2
# Analyze 50 records in 4 parallel batches
uv run main.py -n50 -b4

- main.py: Entry point handling CLI and mode selection
- analyzer.py: Analysis engine combining LLM, summarizer, and parser
- tester.py: Model evaluation framework with metrics and reporting
- _custom_output_parser.py: ICD mapping and SNOMED normalization logic
- _validator.py: Rule-based and fuzzy diagnosis validation
- utils.py: Argument parsing, file I/O, summarization helper
- data_models.py: Pydantic schemas for all structured objects
- Startup
  - Verify Ollama is running
  - Load config and evolution text data
- Text Preprocessing
  - Optionally summarize long records to fit the context window
- Diagnosis Extraction
  - LLM returns a single-line principal diagnosis
- ICD Code Assignment
  - Model-driven or SNOMED-CT-driven mapping logic
- Normalization & Validation (if enabled)
  - Filter, expand, and fuzzy match diagnoses
- Result Handling
  - Save per-record results as JSON
  - Compute accuracy and error metrics
  - Generate optional charts (for test mode)
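The fuzzy matching step in the workflow above can be illustrated with the standard library. The sketch below is not the logic in _validator.py; it assumes a simple approach in which accents and case are stripped and difflib.SequenceMatcher scores each candidate. The function names and the 0.8 threshold are hypothetical.

```python
import difflib
import unicodedata
from typing import Optional


def normalize_text(text: str) -> str:
    """Lowercase, trim, and strip accents so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower().strip())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fuzzy_match(predicted: str, candidates: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return the candidate most similar to `predicted`, or None below the threshold."""
    target = normalize_text(predicted)
    best, best_score = None, 0.0
    for candidate in candidates:
        score = difflib.SequenceMatcher(None, target, normalize_text(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


known = ["Artritis reumatoide", "Lupus eritematoso sistémico", "Gota"]
print(fuzzy_match("artritis reumatoidea", known))
print(fuzzy_match("fractura de cadera", known))
```

A higher threshold reduces false matches between similarly worded diagnoses at the cost of rejecting more spelling variants; the right value depends on the vocabulary being matched.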
.
├── evolution_text_analyzer/
│ ├── __init__.py
│ ├── analyzer.py
│ ├── tester.py
│ ├── _custom_output_parser.py
│ ├── _validator.py
│ ├── utils.py
│ └── data_models.py
├── main.py
├── config.json
├── create_snomed_normalized_icd_dataset.py
├── snomed_description_icd_normalized.csv
├── results/
├── testing_results/
└── pyproject.toml
CSV or JSON file with fields:
- id: Unique record ID
- evolution_text: Raw clinical text (Spanish)
- principal_diagnostic: Ground-truth diagnosis (only for test mode)
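A loader for this input format might look like the sketch below. It is an illustration only, not the project's file I/O code: it assumes a JSON input is a top-level list of objects and a CSV input has a header row, and the function name load_records is hypothetical.

```python
import csv
import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "evolution_text"}  # principal_diagnostic is only needed in test mode


def load_records(path: str) -> list[dict]:
    """Load evolution-text records from a .csv or .json file and check required fields."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    if suffix == ".json":
        # Assumption: the JSON file is a list of record objects.
        with file_path.open(encoding="utf-8") as handle:
            records = json.load(handle)
    elif suffix == ".csv":
        with file_path.open(encoding="utf-8", newline="") as handle:
            records = list(csv.DictReader(handle))
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")

    for position, record in enumerate(records):
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"Record {position} is missing fields: {sorted(missing)}")
    return records
```

Note that csv.DictReader returns every value as a string, so an id read from CSV is "1" rather than 1; convert it explicitly if numeric IDs matter downstream.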
{
"optimal_model": 0,
"models": ["model1", "model2", ...],
"prompts": {
"gen_summary_prompt": "...",
"gen_diagnostic_prompt": "...",
"gen_icd_code_prompt": "..."
}
}

- Saved in results/ (normal mode) or testing_results/ (test mode)
- Includes JSON with model outputs, performance metrics, and optional charts
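The choice of output directory described above can be sketched as follows. This is a hedged illustration rather than the project's actual writer: the function name save_results, the base_dir parameter, and the default file name are assumptions; only the results/ and testing_results/ directory names come from this README.

```python
import json
from pathlib import Path


def save_results(records: list[dict], test_mode: bool,
                 base_dir: str = ".", filename: str = "results.json") -> Path:
    """Write records as JSON into results/ or testing_results/, depending on the mode."""
    out_dir = Path(base_dir) / ("testing_results" if test_mode else "results")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    with out_path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps Spanish accented characters readable in the file.
        json.dump(records, handle, ensure_ascii=False, indent=2)
    return out_path
```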
For issues or contributions, please open a GitHub issue or PR.