1 change: 1 addition & 0 deletions packages/markitdown-dicom/.gitignore
@@ -0,0 +1 @@
tests/test-dicom-files/
119 changes: 119 additions & 0 deletions packages/markitdown-dicom/README.md
@@ -0,0 +1,119 @@
# MarkItDown DICOM Plugin (`markitdown-dicom`)

This is a plugin for [MarkItDown](https://github.com/microsoft/markitdown) that adds support for converting DICOM (`.dcm`) files into LLM-friendly Markdown metadata representations.

The plugin is designed to be memory-efficient (it defers loading of pixel data) and token-efficient (it omits raw pixel arrays and extracts only clinically relevant metadata).

## Features

- **Efficient Stream Peeking**: Fast detection of `.dcm` files by checking for the `DICM` magic bytes at offset 128, immediately after the 128-byte file preamble.
- **Memory Safety**: Uses `pydicom` with deferred value loading (`defer_size="1 KB"`) to parse headers of large multi-frame DICOM files without loading gigabytes of pixel data.
- **PII-Aware by Default**: Automatically redacts Patient Name, Patient ID, and Patient Birth Date.
- **Formatted Metadata**: Standardizes dates to `YYYY-MM-DD` and times to `HH:MM:SS` for downstream RAG and vector database ingestion.
- **Custom Tag Support**: Automatically extracts additional standard metadata fields. Private/vendor tags can optionally be included and are filtered to avoid binary, sequence, and other high-volume data types.
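
The stream-peeking check described above can be sketched as follows. This is a minimal illustration rather than the plugin's actual code: the function name `looks_like_dicom` and the constants are hypothetical, and the sketch assumes the stream is seekable and positioned at the start of the file.

```python
import io
from typing import BinaryIO

# Assumed constants for illustration: a DICOM Part 10 file starts with a
# 128-byte preamble, followed by the 4-byte marker "DICM".
DICOM_PREAMBLE_LENGTH = 128
DICOM_MAGIC = b"DICM"


def looks_like_dicom(stream: BinaryIO) -> bool:
    """Return True if the DICM marker follows the 128-byte preamble.

    Hypothetical helper: assumes a seekable stream positioned at the start
    of the file. The stream position is restored before returning.
    """
    start = stream.tell()
    try:
        stream.seek(start + DICOM_PREAMBLE_LENGTH)
        return stream.read(len(DICOM_MAGIC)) == DICOM_MAGIC
    finally:
        # Leave the stream where we found it so a later parser can reuse it.
        stream.seek(start)


if __name__ == "__main__":
    sample = io.BytesIO(b"\x00" * 128 + b"DICM" + b"remaining data")
    print(looks_like_dicom(sample))
```

Only 4 bytes are read, so the check costs the same regardless of file size, and restoring the position keeps the stream usable for the full parse that follows.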

## Installation

Install the plugin along with MarkItDown:

```bash
pip install markitdown-dicom
```

## Usage

### Command Line Interface

Use the `-p` (or `--use-plugins`) option to enable third-party plugins:

```bash
markitdown --use-plugins patient_scan.dcm -o patient_scan.md
```

### Python API

```python
from markitdown import MarkItDown

# Initialize MarkItDown with plugins enabled
md = MarkItDown(enable_plugins=True)

# Convert a DICOM file
result = md.convert("patient_scan.dcm")
print(result.text_content)
```

### Disabling PII Redaction

If you are working with fully de-identified data or in a secure clinical environment and want to retain Patient Name, Patient ID, and Patient Birth Date, you can disable redaction:

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True, redact_pii=False)
result = md.convert("patient_scan.dcm")
```
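
Conceptually, the redaction step replaces the values of the three patient fields with a placeholder before the Markdown is rendered. The sketch below shows that idea on a plain dictionary. It is an assumption about the behavior, not the plugin's implementation: `redact_metadata` and `REDACTED_KEYWORDS` are hypothetical names, and the keys are the standard DICOM keywords for those fields.

```python
# Hypothetical set of DICOM keywords treated as PII in this sketch.
REDACTED_KEYWORDS = frozenset({"PatientName", "PatientID", "PatientBirthDate"})
PLACEHOLDER = "[REDACTED]"


def redact_metadata(metadata: dict[str, str], redact_pii: bool = True) -> dict[str, str]:
    """Return a copy of the metadata with PII values replaced by a placeholder.

    When redact_pii is False, the values are copied through unchanged.
    """
    if not redact_pii:
        return dict(metadata)
    return {
        key: PLACEHOLDER if key in REDACTED_KEYWORDS else value
        for key, value in metadata.items()
    }


if __name__ == "__main__":
    record = {"PatientName": "DOE^JOHN", "PatientSex": "M"}
    print(redact_metadata(record))
```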

## Example Output

```markdown
# DICOM File

## Patient Information

* **Patient Name**: [REDACTED]
* **Patient ID**: [REDACTED]
* **Patient Birth Date**: [REDACTED]
* **Patient Sex**: M
* **Patient Age**: 045Y

## Study Information

* **Study Instance UID**: 1.2.840.113619.2.134.1.20230612.98765432
* **Study ID**: STUDY-1
* **Study Date**: 2023-06-12
* **Study Time**: 11:44:27
* **Study Description**: Chest X-Ray
* **Accession Number**: ACC-98765

## Series Information

* **Series Instance UID**: 1.2.840.113619.2.134.2.20230612.98765432
* **Series Number**: 1
* **Series Description**: PA View
* **Series Date**: 2023-06-12
* **Series Time**: 11:45:00

## Acquisition Parameters

* **Modality**: DX
* **Protocol Name**: Chest PA
* **Exposure**: 2
* **Exposure Time**: 10
* **KVP**: 120
* **Acquisition Date**: 2023-06-12
* **Acquisition Time**: 11:45:00

## Equipment

* **Manufacturer**: GE Medical Systems
* **Manufacturer Model Name**: Discovery
* **Device Serial Number**: SN-12345
* **Software Versions**: v1.2.3

## Image Properties

* **Rows**: 2048
* **Columns**: 1500
* **Samples Per Pixel**: 1
* **Bits Allocated**: 16
* **Bits Stored**: 12
* **High Bit**: 11
* **Pixel Representation**: 0
* **Photometric Interpretation**: MONOCHROME2
* **Frame Count**: 1
* **Instance Number**: 42
* **SOP Class UID**: 1.2.840.10008.5.1.4.1.1.1.1
* **SOP Instance UID**: 1.2.840.113619.2.134.2.20230612.98765432.1
* **Pixel Data Present**: Yes
```
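
The dates and times in the output above are normalized from DICOM's raw `DA` (`YYYYMMDD`) and `TM` (`HHMMSS.FFFFFF`) encodings. A minimal sketch of that normalization is shown here. The function names are hypothetical, and the sketch assumes that values not matching the full pattern (for example, a truncated `TM` such as `1144`) are passed through unchanged, which may differ from what the plugin does.

```python
import re


def format_dicom_date(value: str) -> str:
    """Convert a DICOM DA value (YYYYMMDD) to YYYY-MM-DD.

    Values that do not match the pattern are returned unchanged.
    """
    match = re.fullmatch(r"(\d{4})(\d{2})(\d{2})", value.strip())
    return "-".join(match.groups()) if match else value


def format_dicom_time(value: str) -> str:
    """Convert a DICOM TM value (HHMMSS with optional fraction) to HH:MM:SS.

    Fractional seconds are dropped. Shorter forms such as HH or HHMM are
    returned unchanged in this sketch.
    """
    match = re.fullmatch(r"(\d{2})(\d{2})(\d{2})(?:\.\d{1,6})?", value.strip())
    return ":".join(match.groups()) if match else value


if __name__ == "__main__":
    print(format_dicom_date("20230612"))
    print(format_dicom_time("114427.500"))
```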
68 changes: 68 additions & 0 deletions packages/markitdown-dicom/pyproject.toml
@@ -0,0 +1,68 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "markitdown-dicom"
dynamic = ["version"]
description = 'DICOM converter plugin for MarkItDown - Extracts metadata from .dcm files'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
keywords = ["markitdown", "dicom", "metadata", "pydicom"]
authors = [
{ name = "Aryan Kaushik", email = "aryankaushik251@gmail.com" },
]
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: Implementation :: CPython",
]
dependencies = [
"markitdown>=0.1.0a1",
"pydicom>=2.4.0",
]

[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
Issues = "https://github.com/microsoft/markitdown/issues"
Source = "https://github.com/microsoft/markitdown"

[tool.hatch.version]
path = "src/markitdown_dicom/__about__.py"

[project.entry-points."markitdown.plugin"]
dicom = "markitdown_dicom"

[tool.hatch.envs.types]
extra-dependencies = [
"mypy>=1.0.0",
]
[tool.hatch.envs.types.scripts]
check = "mypy --install-types --non-interactive {args:src/markitdown_dicom tests}"

[tool.coverage.run]
source_pkgs = ["markitdown_dicom", "tests"]
branch = true
parallel = true
omit = [
"src/markitdown_dicom/__about__.py",
]

[tool.coverage.paths]
markitdown-dicom = ["src/markitdown_dicom", "*/markitdown-dicom/src/markitdown_dicom"]
tests = ["tests", "*/markitdown-dicom/tests"]

[tool.coverage.report]
exclude_lines = [
"no cov",
"if __name__ == .__main__.:",
"if TYPE_CHECKING:",
]

[tool.hatch.build.targets.sdist]
only-include = ["src/markitdown_dicom"]
4 changes: 4 additions & 0 deletions packages/markitdown-dicom/src/markitdown_dicom/__about__.py
@@ -0,0 +1,4 @@
# SPDX-FileCopyrightText: 2026-present Aryan Kaushik <aryankaushik251@gmail.com>
#
# SPDX-License-Identifier: MIT
__version__ = "0.1.0a1"
14 changes: 14 additions & 0 deletions packages/markitdown-dicom/src/markitdown_dicom/__init__.py
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: 2026-present Aryan Kaushik <aryankaushik251@gmail.com>
#
# SPDX-License-Identifier: MIT

from ._plugin import __plugin_interface_version__, register_converters
from ._dicom_converter import DicomConverter
from .__about__ import __version__

__all__ = [
"__version__",
"__plugin_interface_version__",
"register_converters",
"DicomConverter",
]