Medical Evolution Text Analyzer

⚠️ Important Notice for Normalization Mode
To enable SNOMED-CT normalization mode (-N flag), users must first register on the Spanish SNOMED-CT licensing platform:
https://snomed-ct.sanidad.gob.es/snomed-ct/solicitudLicencia.do

After obtaining access, download the required datasets as specified at the beginning of the script create_snomed_normalized_icd_dataset.py.
These include the Spanish and International SNOMED-CT description files and the ExtendedMapSnapshot. The filenames and exact paths are provided within the script.

A tool for extracting and normalizing rheumatologic diagnoses from Spanish clinical evolution texts using local LLMs and SNOMED-CT mappings.

Overview

Medical Evolution Text Analyzer processes Spanish-language clinical evolution notes to extract principal rheumatologic diagnoses and corresponding ICD-10 codes. It leverages local language models via the Ollama API and integrates a SNOMED-CT-based normalization module to enhance diagnosis consistency.

Features

Summarizes long clinical notes to fit context windows
Extracts principal rheumatologic diagnoses using strict prompt engineering
Maps diagnoses to ICD-10 codes through direct or SNOMED-enhanced logic
Applies fuzzy and keyword-based validation of results
Evaluates multiple LLMs in parallel
Provides performance metrics and result visualizations
Command-line driven with multiple modes and settings

Requirements

Python 3.10+
Ollama running locally
SNOMED-CT and ICD datasets (included in the repository)

Installation

Clone the repository:

git clone https://github.com/username/evolution-text-analysis.git
cd evolution-text-analysis

Install the uv package manager (recommended). See the uv installation documentation for other install methods:

pip install uv

Install dependencies using uv:

uv sync

Ensure Ollama is installed and running. It can be downloaded from:

https://ollama.com/download
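
Whether the server is reachable can also be checked from Python. The following is a minimal sketch, assuming Ollama listens on its default address (http://localhost:11434) and using its /api/tags endpoint, which lists locally installed models. The function name is hypothetical, and this is not necessarily how main.py performs its own startup check.

```python
import http.client
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0):
    """Return the names of locally installed models, or None if unreachable.

    Hypothetical helper: the default base_url is an assumption about the
    local Ollama setup.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, malformed HTTP or malformed JSON.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]
```

A return value of None suggests that Ollama is not running (or not on the assumed address); an empty list suggests it is running but has no models installed.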

⚠️ No need to manually prepare SNOMED-CT files — a normalized version is bundled and ready to use.

Usage

uv run main.py [options]

Command Line Arguments

Argument	Description
`-f`, `--filename`	File with evolution texts (`.csv` or `.json`)
`-m`, `--mode`	Test mode: `1` (all models), `2` (manual selection)
`-b`, `--batches`	Number of texts per processing batch
`-n`, `--num-texts`	Max number of texts to process
`-W`, `--context-window`	Max token context window size
`-t`, `--test`	Enable test mode for evaluation
`-i`, `--installed`	Only use installed models
`-v`, `--verbose`	Print detailed processing info
`-N`, `--normalize`	Use SNOMED-CT for ICD normalization

Execution Examples

# Analyze with default config (optimal model)
uv run main.py

# Evaluate all installed models
uv run main.py -tiv

# Evaluate one selected model with SNOMED normalization
uv run main.py -tN -m2

# Analyze 50 records in 4 parallel batches
uv run main.py -n50 -b4
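
For readers who want to see how flags such as -tiv or -n50 are interpreted, the argument table can be approximated with argparse. This is only a sketch: the flag names and descriptions come from the table above, while the value types, the None defaults, and the build_parser name are assumptions rather than the contents of utils.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Approximate the documented CLI; types and defaults are assumptions."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-f", "--filename", default=None,
                        help="File with evolution texts (.csv or .json)")
    parser.add_argument("-m", "--mode", type=int, choices=(1, 2), default=None,
                        help="Test mode: 1 (all models), 2 (manual selection)")
    parser.add_argument("-b", "--batches", type=int, default=None,
                        help="Number of texts per processing batch")
    parser.add_argument("-n", "--num-texts", type=int, default=None,
                        help="Max number of texts to process")
    parser.add_argument("-W", "--context-window", type=int, default=None,
                        help="Max token context window size")
    parser.add_argument("-t", "--test", action="store_true",
                        help="Enable test mode for evaluation")
    parser.add_argument("-i", "--installed", action="store_true",
                        help="Only use installed models")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print detailed processing info")
    parser.add_argument("-N", "--normalize", action="store_true",
                        help="Use SNOMED-CT for ICD normalization")
    return parser


# Short boolean flags can be combined, and values can be attached directly.
args = build_parser().parse_args(["-tN", "-m2"])
print(args.test, args.normalize, args.mode)
```

With this layout, -tiv is equivalent to -t -i -v, and -n50 -b4 is equivalent to --num-texts 50 --batches 4.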

Architecture

main.py: Entry point handling CLI and mode selection
analyzer.py: Analysis engine combining LLM, summarizer, and parser
tester.py: Model evaluation framework with metrics and reporting
_custom_output_parser.py: ICD mapping and SNOMED normalization logic
_validator.py: Rule-based and fuzzy diagnosis validation
utils.py: Argument parsing, file I/O, summarization helper
data_models.py: Pydantic schemas for all structured objects
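
As an illustration of the kind of fuzzy validation attributed to _validator.py, the sketch below matches an extracted diagnosis against a list of known diagnosis names. It uses difflib from the standard library as a stand-in; the actual module may rely on a different library, different thresholds, and additional keyword rules. The function names and the 0.6 cutoff are assumptions.

```python
import difflib
import unicodedata
from typing import Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase, trim, and strip accents so 'Sistémico' matches 'sistemico'."""
    decomposed = unicodedata.normalize("NFKD", text.lower().strip())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def best_fuzzy_match(candidate: str, known: Sequence[str],
                     cutoff: float = 0.6) -> Optional[str]:
    """Return the known diagnosis most similar to candidate, or None.

    cutoff is the minimum similarity ratio (0 to 1); 0.6 is an assumed value.
    """
    lookup = {normalize(name): name for name in known}
    hits = difflib.get_close_matches(normalize(candidate), list(lookup),
                                     n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


known_diagnoses = ["artritis reumatoide", "lupus eritematoso sistémico"]
print(best_fuzzy_match("Artritis Reumatoide seropositiva", known_diagnoses))
```

Normalizing case and accents before comparison keeps spelling variants in Spanish clinical text from lowering the similarity score.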

Processing Flow

Startup
- Verify Ollama is running
- Load config and evolution text data
Text Preprocessing
- Optionally summarize long records to fit context window
Diagnosis Extraction
- LLM returns a single-line principal diagnosis
ICD Code Assignment
- Model or SNOMED-CT-driven mapping logic
Normalization & Validation (if enabled)
- Filter, expand, and fuzzy match diagnoses
Result Handling
- Save per-record results as JSON
- Compute accuracy and error metrics
- Generate optional charts (for test mode)
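
The per-record part of this flow can be sketched as a single function. Everything here is illustrative: the LLM-backed steps are passed in as plain callables, the token estimate is a crude characters-per-token heuristic, and none of the names are taken from analyzer.py.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RecordResult:
    record_id: str
    diagnosis: str
    icd_code: Optional[str]
    summarized: bool


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; a real tokenizer will differ.
    return max(1, len(text) // 4)


def process_record(
    record_id: str,
    text: str,
    *,
    context_window: int,
    summarize: Callable[[str], str],
    extract_diagnosis: Callable[[str], str],
    assign_icd: Callable[[str], Optional[str]],
    validate: Optional[Callable[[str, Optional[str]], Tuple[str, Optional[str]]]] = None,
) -> RecordResult:
    # Text preprocessing: summarize only when the note would not fit.
    summarized = estimate_tokens(text) > context_window
    if summarized:
        text = summarize(text)

    # Diagnosis extraction: keep only the first line of the model output.
    lines = extract_diagnosis(text).strip().splitlines()
    diagnosis = lines[0].strip() if lines else ""

    # ICD code assignment, then optional normalization and validation.
    icd_code = assign_icd(diagnosis)
    if validate is not None:
        diagnosis, icd_code = validate(diagnosis, icd_code)

    return RecordResult(record_id, diagnosis, icd_code, summarized)
```

Passing the steps as callables keeps the sketch independent of any particular model client, and makes each stage easy to replace with a stub when testing.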

Directory Structure

.
├── evolution_text_analyzer/
│   ├── __init__.py
│   ├── analyzer.py
│   ├── tester.py
│   ├── _custom_output_parser.py
│   ├── _validator.py
│   ├── utils.py
│   └── data_models.py
├── main.py
├── config.json
├── create_snomed_normalized_icd_dataset.py
├── snomed_description_icd_normalized.csv
├── results/
├── testing_results/
└── pyproject.toml

Data Format

Input

CSV or JSON file with fields:

id: Unique record ID
evolution_text: Raw clinical text (Spanish)
principal_diagnostic: Ground-truth diagnosis (only for test mode)
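
A minimal loader for this input format might look like the sketch below. It assumes that the JSON variant is a list of objects with the same field names as the CSV columns, and it uses a standard-library dataclass in place of the project's Pydantic schemas; EvolutionRecord and load_records are hypothetical names.

```python
import csv
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class EvolutionRecord:
    id: str
    evolution_text: str
    principal_diagnostic: Optional[str] = None  # only present in test mode


def load_records(path) -> List[EvolutionRecord]:
    """Read evolution texts from a .csv or .json file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            rows = list(csv.DictReader(fh))
    elif suffix == ".json":
        # Assumption: the file holds a list of objects, one per record.
        with path.open(encoding="utf-8") as fh:
            rows = json.load(fh)
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")

    return [
        EvolutionRecord(
            id=str(row["id"]),
            evolution_text=row["evolution_text"],
            # Treat a missing or empty ground-truth field as absent.
            principal_diagnostic=row.get("principal_diagnostic") or None,
        )
        for row in rows
    ]
```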

Config (JSON)

{
  "optimal_model": 0,
  "models": ["model1", "model2", ...],
  "prompts": {
    "gen_summary_prompt": "...",
    "gen_diagnostic_prompt": "...",
    "gen_icd_code_prompt": "..."
  }
}
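
Reading this configuration could be sketched as follows. The sketch assumes that optimal_model is a zero-based index into the models list, which the example value 0 suggests but the README does not state; the helper names are hypothetical.

```python
import json

REQUIRED_PROMPTS = (
    "gen_summary_prompt",
    "gen_diagnostic_prompt",
    "gen_icd_code_prompt",
)


def select_model(config: dict) -> str:
    """Return the model name that optimal_model points to.

    Assumption: optimal_model is a zero-based index into config["models"].
    """
    models = config["models"]
    index = config["optimal_model"]
    if not isinstance(index, int) or not 0 <= index < len(models):
        raise ValueError(f"optimal_model {index!r} is not a valid index into models")
    return models[index]


def load_config(path: str = "config.json") -> dict:
    """Load the JSON config and check that all three prompts are present."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    missing = [key for key in REQUIRED_PROMPTS if key not in config.get("prompts", {})]
    if missing:
        raise ValueError(f"Missing prompts in config: {missing}")
    return config
```

Validating the index and the prompt keys at load time surfaces configuration mistakes before any model is invoked.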

Results

Saved in results/ (normal mode) or testing_results/ (test mode)
Includes JSON with model outputs, performance metrics, and optional charts

For issues or contributions, please open a GitHub issue or PR.
