This is the DICOM validation tool for the Laboratory Catalog and Archive Service (LabCAS). It ensures that DICOM files:
- Contain little-to-no PHI/PII: Scans both DICOM headers and pixel data for protected health information (PHI) and personally identifiable information (PII)
- Adhere to EDRN requirements: Validates DICOM tags against the EDRN core validation spreadsheet
This tool was originally developed in response to EDRN/EDRN-metadata#160.
This program has features described in the following subsections.
- Header-based detection: Scans DICOM metadata tags for identifiers including:
  - Patient names, birth dates, addresses
  - Physician and operator names
  - Email addresses, phone numbers, SSNs
  - Medical record numbers (MRNs)
- Pixel-based detection: Uses OCR (via Tesseract) to detect text embedded in DICOM images
- Multiple recognizers: Choose between different PHI/PII detection algorithms:
  - `simple-scoring` (default): Pattern-based detection with configurable scoring
  - `accepting`: Accepts all files (usually used for testing only)
  - `rejecting`: Rejects all files (used for testing only)
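To give a feel for what pattern-based detection with scoring can look like, here is a minimal sketch. It is not the `simple-scoring` recognizer itself: the patterns, the weights, and the `score_header_value` function are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns and weights; the real recognizer may use different ones.
PATTERNS = {
    "email": (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), 0.9),
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 1.0),
    "phone": (re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"), 0.7),
}


def score_header_value(value: str) -> list[tuple[str, float]]:
    """Return a (pattern name, score) pair for every pattern found in the value."""
    return [
        (name, weight)
        for name, (regex, weight) in PATTERNS.items()
        if regex.search(value)
    ]


# Example: a free-text header value that contains an email address and a phone number.
hits = score_header_value("Contact jdoe@example.org or 555-867-5309")
```

Each hit carries its own score, so a caller can compare it against a threshold and decide whether to report it.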
Validates over 40 DICOM tags against EDRN requirements including:
- Study/Series/Image Identification: UIDs, instance numbers, SOP class
- Acquisition Modality and Equipment: Modality codes, manufacturer info, device details
- Temporal Data: Dates and times in proper format
- Image Data: Dimensions, pixel data, display parameters
- MR-specific: Spacing between slices validation
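As an illustration of the "Temporal Data" checks, the sketch below tests values against the DICOM date (DA, `YYYYMMDD`) and time (TM, `HHMMSS.FFFFFF` with optional trailing components) formats. The function names are hypothetical, and the actual EDRN rules live in the validation spreadsheet, so treat this as an approximation of one kind of check rather than the tool's own validators.

```python
import re
from datetime import datetime

# TM accepts HH, HHMM, HHMMSS, or HHMMSS.FFFFFF (seconds may be 60 for a leap second).
_TM_PATTERN = re.compile(
    r"(?:[01]\d|2[0-3])(?:[0-5]\d(?:(?:[0-5]\d|60)(?:\.\d{1,6})?)?)?"
)


def is_valid_da(value: str) -> bool:
    """Check that a value is a real calendar date written as YYYYMMDD."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def is_valid_tm(value: str) -> bool:
    """Check that a value matches the TM format, ignoring trailing padding spaces."""
    return _TM_PATTERN.fullmatch(value.rstrip()) is not None
```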
Generates CSV reports organized by:
- Site ID
- Event ID
- File name
- Finding type and severity score
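The following sketch shows one way such a report could be written with Python's standard `csv` module. The column names and the `write_report` function are assumptions made for this example; the tool's actual CSV layout may differ.

```python
import csv
import io

# Hypothetical column names; the real report may use different headers.
FIELDS = ["site_id", "event_id", "file_name", "kind", "score"]


def write_report(findings, stream):
    """Write findings as CSV rows ordered by site, then event, then file name."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    ordered = sorted(
        findings, key=lambda f: (f["site_id"], f["event_id"], f["file_name"])
    )
    for row in ordered:
        writer.writerow(row)


# Example with two made-up findings, written to an in-memory buffer.
buffer = io.StringIO()
write_report(
    [
        {"site_id": "S2", "event_id": "7654321", "file_name": "b.dcm", "kind": "Header", "score": 0.9},
        {"site_id": "S1", "event_id": "1234567", "file_name": "a.dcm", "kind": "Pixels", "score": 0.85},
    ],
    buffer,
)
lines = buffer.getvalue().splitlines()
```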
Details on installing this software follow in this section.
Requires Python 3.12 or higher and Tesseract OCR for pixel-based PHI/PII detection.
Tesseract provides optical character recognition features for this program and must be installed separately.
macOS:
brew install tesseract
Linux (Ubuntu/Debian):
sudo apt-get install tesseract-ocr
Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
It's best to set up a Python virtual environment and use pip to install the package into that environment:
pip install jpl.labcas.validation
Or install from source:
git clone https://github.com/EDRN/jpl.labcas.validation.git
cd jpl.labcas.validation
pip install --editable .
The following describes how to use this program.
DICOM files should be arranged in a way that mirrors the expectations of LabCAS, which arranges files into folders in a specific hierarchy, described below:
### Basic Usage
The easiest way to run this is:
validate-dicom-files <directory>/.../<collection-folder>
The `<directory>` should eventually contain the following directory hierarchy:
<directory>
    … (sub-directories)
        collection-folder (such as Prostate_MRI)
            event-ID-folder (such as 1234567)
                … (sub-folders)
                    DICOM file 1
                    DICOM file 2
                    …
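To make the hierarchy concrete, here is a small sketch that walks a collection folder and groups the files found under each event-ID folder. The `files_by_event` function is hypothetical and is not how the tool itself discovers files; the demo builds a throwaway tree in a temporary directory.

```python
import tempfile
from pathlib import Path


def files_by_event(collection: Path) -> dict[str, list[Path]]:
    """Map each event-ID folder name to every file found anywhere beneath it."""
    grouped = {}
    for event_dir in sorted(p for p in collection.iterdir() if p.is_dir()):
        grouped[event_dir.name] = sorted(
            f for f in event_dir.rglob("*") if f.is_file()
        )
    return grouped


# Demo: build a tiny tree that mirrors the layout above, then group its files.
with tempfile.TemporaryDirectory() as tmp:
    collection = Path(tmp) / "Prostate_MRI"
    series = collection / "1234567" / "series-1"
    series.mkdir(parents=True)
    (series / "image-1.dcm").touch()
    (series / "image-2.dcm").touch()
    grouped = {
        event: [f.name for f in files]
        for event, files in files_by_event(collection).items()
    }
```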
### Command-Line Options
Use `--help` for full details. In summary:
- `-s, --score <value>`: Maximum PHI/PII score threshold (0.0-1.0, default: 0.8)
- `-c, --concurrency <num>`: Number of concurrent processes (default: CPU count)
- `-r, --recognizer <name>`: PHI/PII recognizer to use:
  - `simple-scoring` (default): Pattern-based detection
  - `accepting`: Accept all files
  - `rejecting`: Reject all files
- `-o, --output <dir>`: Output directory for CSV reports (default: current directory)
- `--log-file <file>`: Write detailed logs to a file while keeping the progress bar readable
- `-d, --debug`: Debug logging
- `-q, --quiet`: Quiet logging
Validation shows a progress bar on stderr. Without `--log-file`, normal log messages are routed through tqdm so they do not garble the progress display. With `--log-file`, detailed logs go to the file and only errors and critical messages are also shown on stderr.
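The split between file and stderr logging can be approximated with the standard `logging` module, as in the sketch below. The `configure_logging` function and the logger name are hypothetical, and the sketch leaves out the tqdm routing used when no log file is given, so it only illustrates the handler levels described above.

```python
import logging
import sys
from typing import Optional


def configure_logging(log_file: Optional[str]) -> logging.Logger:
    """Send details to a file when one is given; keep stderr for errors only."""
    logger = logging.getLogger("validation-sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()

    console = logging.StreamHandler(sys.stderr)
    if log_file:
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(logging.DEBUG)  # detailed logs go to the file
        logger.addHandler(file_handler)
        console.setLevel(logging.ERROR)  # only errors and critical messages on stderr
    else:
        console.setLevel(logging.INFO)  # normal messages on stderr
    logger.addHandler(console)
    return logger
```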
### Examples
Validate a directory with default settings:
validate-dicom-files /path/to/dicom/files
Use a different PHI/PII threshold (lower = stricter, because more findings exceed it):
validate-dicom-files --score 0.5 /path/to/dicom/files
Write the CSV reports to a specific directory:
validate-dicom-files --output /path/to/reports /path/to/dicom/files
Use a specific number of workers:
validate-dicom-files --concurrency 4 /path/to/dicom/files
In general, set `--concurrency` to at least the number of available CPU cores. Some recommend using twice that number.
## Understanding the Report
The tool generates CSV reports with findings organized hierarchically:
1. **By Site ID**: Grouped by blinded site identifier
2. **By Event ID**: Grouped by 7-digit event ID
3. **By File**: Individual DICOM files within each event
4. **By Finding**: Each finding includes:
   - **Score**: Severity from 0.0 (low) to 1.0 (high)
   - **Kind**: Type of finding:
     - Header: PHI/PII found in DICOM metadata
     - Pixels: PHI/PII found in image data via OCR
     - Validation: Tag compliance issue
     - Error: File reading or processing error
   - **Details**: Specific information about the finding
Only findings with scores above the threshold are included in the report.
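The threshold rule can be sketched in a few lines. The `Finding` dataclass and the `reportable` function are hypothetical names used only for this example; the default of 0.8 matches the documented `--score` default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for one finding: a score, a kind, and details."""
    score: float
    kind: str
    details: str


def reportable(findings, threshold=0.8):
    """Keep only the findings whose score is strictly above the threshold."""
    return [f for f in findings if f.score > threshold]


findings = [
    Finding(0.95, "Header", "possible patient name"),
    Finding(0.80, "Pixels", "text found by OCR"),
    Finding(0.40, "Validation", "optional tag missing"),
]
kept_default = reportable(findings)       # threshold 0.8
kept_lower = reportable(findings, 0.5)    # lower threshold keeps more findings
```

Lowering the threshold from 0.8 to 0.5 lets the 0.80 finding through, which is why a lower `--score` value makes validation stricter.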
## Architecture
The validation framework is modular and extensible:
- **PHI/PII Recognizers**: Plug-in system for different detection algorithms
- **Validators**: Individual validators for each DICOM tag requirement
- **Findings**: Structured representation of all issues discovered
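A plug-in arrangement like this is often built around a shared interface and a name-to-class registry. The sketch below assumes that shape; the `Recognizer` protocol, the class names, and `make_recognizer` are hypothetical and are not taken from the package. Only the recognizer names `accepting` and `rejecting` come from this README.

```python
from typing import Protocol


class Recognizer(Protocol):
    """Hypothetical interface: return a PHI/PII score between 0.0 and 1.0."""

    def recognize(self, text: str) -> float: ...


class Accepting:
    """Scores everything 0.0, so every file is accepted."""

    def recognize(self, text: str) -> float:
        return 0.0


class Rejecting:
    """Scores everything 1.0, so every file is rejected."""

    def recognize(self, text: str) -> float:
        return 1.0


# Registry that maps a command-line name to a recognizer class.
RECOGNIZERS = {"accepting": Accepting, "rejecting": Rejecting}


def make_recognizer(name: str) -> Recognizer:
    """Look up a recognizer by name and create an instance of it."""
    try:
        return RECOGNIZERS[name]()
    except KeyError:
        raise ValueError(f"unknown recognizer: {name}") from None
```

With this shape, adding a new detection algorithm means writing one class with a `recognize` method and adding one registry entry.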
## Development Status
Development Status: Pre-Alpha
CT requirements may be added in the future, pending completion of the [spreadsheet's CT tab](https://docs.google.com/spreadsheets/d/1oQB0EoeajxFagSrIzF_8hOIc6hbC9MiMvhbYLfr6vPQ/edit?pli=1&gid=1779958583#gid=1779958583).
## License
Apache 2.0 - See LICENSE.md for details
## Contributing
Issues and pull requests welcome on GitHub: https://github.com/EDRN/jpl.labcas.validation/issues. See also the EDRN [Code of Conduct](https://github.com/EDRN/.github/blob/main/CODE_OF_CONDUCT.md) and [Contributors' Guide](https://github.com/EDRN/.github/blob/main/CONTRIBUTING.md).
## Authors
- Sean Kelly `@nutjob4life`
## Copyright
Copyright © 2025 California Institute of Technology. U.S. Government sponsorship acknowledged.