Python reference encoder for Coleman Dimensional Encoding (CDE) of healthcare records. Compact and lossy by design; HIPAA compliance is a goal, not something this repo proves or certifies (see Claim status).
Maps an EHR record to a fixed 20-byte vector carrying 154 bits of dimension data across six dimensions. The 48-bit temporal dimension is encrypted with AES-GCM using an HKDF-derived monthly key; the other dimensions are quantized or hashed (see Encoding & privacy). It is a lossy, compact reference encoder for population-health-style analytics, not a certified de-identification or HIPAA-compliance product.
This is a single-file Python reference encoder for Coleman Dimensional Encoding (CDE) framed as a minimally sufficient statistic (MSS) for healthcare data. It takes an EHR record dictionary and emits a fixed-size byte vector intended for cohort-level analytics, where individual-record fidelity is intentionally traded away for compactness.
The implementation lives in cde_encoder/ (an encoder and a vector dataclass). There is no
multi-cloud deployment, no access-control engine, no audit subsystem, and no formal
proof in this repository: only the encoder, its vector format, and tests.
What is true here (labels in the MSS spirit):
- Guarantee (measured/structural): every record encodes to exactly 20 bytes / 154 bits regardless of input size, verified by the 73 passing tests and by CDEVector.to_bytes.
- Guarantee (measured): ~11.1x smaller than the synthetic record's JSON (222 → 20 bytes); see PERFORMANCE.md.
- Assumption (not proven in-repo): that the encoding constitutes a minimally sufficient statistic for population-health analytics, and that it provides "information-theoretic de-identification." There is no proof, re-identification-risk measurement, or formal analysis in this repository. Treat these as design intent, not established results.
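The size figures above follow from simple arithmetic, which the short sketch below reproduces. It uses the per-dimension bit widths this README lists for the encoder and the 222-byte JSON size quoted for the synthetic record; the name `DIMENSION_BITS` is illustrative and not part of the package.

```python
import math

# Bit widths per dimension, as listed in this README (illustrative constant).
DIMENSION_BITS = {
    "temporal": 48,
    "demographic": 16,
    "clinical": 32,
    "geographic": 24,
    "treatment": 32,
    "status": 2,
}

total_bits = sum(DIMENSION_BITS.values())        # bits of dimension data
vector_bytes = math.ceil(total_bits / 8)         # whole bytes needed to hold them
padding_bits = vector_bytes * 8 - total_bits     # bits left over in the last byte
ratio = 222 / vector_bytes                       # JSON size / vector size

print(total_bits, vector_bytes, padding_bits, round(ratio, 1))
```

The 154 bits of dimension data do not fill a whole number of bytes, so a 20-byte vector necessarily leaves 6 bits unused; where those bits sit is determined by CDEVector and is not asserted here.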
The encoder produces six dimensions (cde_encoder/encoder.py), packed by CDEVector:
| Dimension | Bits | What it stores | Privacy / fidelity |
|---|---|---|---|
| temporal | 48 | service date quantized to a month offset, then encrypted | AES-GCM + HKDF monthly key when cryptography is installed; otherwise a SHA-256 digest with no cryptographic security (see Crypto dependency) |
| demographic | 16 | age in 5-year bands + optional sex (1 bit) / race (3 bits) | lossy: exact age not retained |
| clinical | 32 | diagnosis-set fingerprint | lossy, non-invertible: 32-bit truncated SHA-256 of the sorted ICD codes; collisions possible |
| geographic | 24 | ZIP-prefix (12 bits) + region id (4 bits) | lossy: street/house-level precision removed |
| treatment | 32 | procedure + medication fingerprint | lossy, non-invertible: 32-bit truncated SHA-256 of sorted CPT/RxNorm codes; collisions possible |
| status | 2 | binary outcome + low bit of severity | lossy: 4 possible states |
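To make the fixed-width layout concrete, here is a minimal sketch of packing six integer fields with the widths from the table into 20 bytes. The field order (table order), big-endian byte order, and trailing zero padding are assumptions made for illustration; the real layout is whatever CDEVector.to_bytes implements, and `pack_dimensions` is a hypothetical helper, not a function from the package.

```python
# Bit widths from the table above, in table order (ordering is an assumption).
WIDTHS = {
    "temporal": 48,
    "demographic": 16,
    "clinical": 32,
    "geographic": 24,
    "treatment": 32,
    "status": 2,
}


def pack_dimensions(values: dict) -> bytes:
    """Pack one integer per dimension into a fixed-size byte string."""
    acc = 0
    for name, width in WIDTHS.items():
        value = values[name]
        # Reject values that do not fit in the field's bit width.
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        acc = (acc << width) | value
    total_bits = sum(WIDTHS.values())
    n_bytes = (total_bits + 7) // 8
    pad = n_bytes * 8 - total_bits
    # Left-align the fields and leave the padding bits as trailing zeros.
    return (acc << pad).to_bytes(n_bytes, "big")


example = pack_dimensions({name: 0 for name in WIDTHS} | {"status": 3})
print(len(example), example.hex())
```

Because every field has a fixed width, the output length depends only on the widths and never on the size of the input record, which is the structural basis for the 20-byte guarantee.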
- Lossiness (important): the clinical and treatment dimensions are 32-bit truncated SHA-256 fingerprints of the code sets. They are deliberately non-invertible and can collide; you cannot recover the original ICD/CPT/RxNorm codes from the vector. The demographic, geographic, and status dimensions are also quantized. Consequently the encoder does not preserve all sufficient statistics — for example, exact diagnoses, exact age, and exact location are not recoverable. Claims of "preserving all sufficient statistics" are inaccurate for this implementation and have been removed.
- Encryption scope: only the 48-bit temporal field is encrypted (AES-GCM with an HKDF-derived monthly key). The whole vector is not encrypted at rest by this library.
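The lossiness described above can be illustrated with a small sketch of the two techniques involved: a 32-bit truncated SHA-256 fingerprint over a sorted code set, and 5-year age banding. The separator character, the choice of the first four digest bytes, and the function names `fingerprint32` and `age_band` are assumptions for illustration; cde_encoder/encoder.py may differ in these details.

```python
import hashlib


def fingerprint32(codes):
    """32-bit fingerprint of a code set: order-independent and non-invertible."""
    # Sorting makes the result independent of input order.
    canonical = "|".join(sorted(codes))  # "|" separator is an assumption
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Keep only the first 4 bytes (32 bits); the rest of the digest is discarded.
    return int.from_bytes(digest[:4], "big")


def age_band(age: int) -> int:
    """Quantize an exact age into a 5-year band index."""
    return age // 5


print(fingerprint32(["E11.9", "I10"]) == fingerprint32(["I10", "E11.9"]))
print(age_band(45) == age_band(49))
```

With only 2^32 possible fingerprints, distinct code sets can map to the same value, and ages 45 through 49 become indistinguishable after banding. Both effects are why the original codes and exact age cannot be recovered from the vector.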
encode_temporal uses AES-GCM + HKDF only if the cryptography package is
importable. If it is not, the encoder silently falls back to a SHA-256 digest of
the month offset and master key, which the source explicitly labels as providing
NO cryptographic security (cde_encoder/encoder.py, the else branch of
encode_temporal). The fallback is for development/testing only. Install the pinned
dependency (requirements.txt) for any non-toy use, and verify
cde_encoder.encoder.CRYPTOGRAPHY_AVAILABLE is True at startup.
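One way to act on the startup check recommended above is a small guard that fails fast instead of running on the fallback. This is a sketch only: `require_real_crypto` and `cryptography_importable` are hypothetical helpers, not part of cde_encoder; the only name taken from the package is the CRYPTOGRAPHY_AVAILABLE flag mentioned above.

```python
import importlib.util


def cryptography_importable() -> bool:
    """Return True if the `cryptography` package can be found by this interpreter."""
    return importlib.util.find_spec("cryptography") is not None


def require_real_crypto(crypto_available: bool) -> None:
    """Hypothetical startup guard: refuse to run on the insecure fallback."""
    if not crypto_available:
        raise RuntimeError(
            "cryptography is not available; the temporal field would use the "
            "SHA-256 fallback, which provides no cryptographic security"
        )


# In a real service the flag would come from the encoder module itself:
#   from cde_encoder.encoder import CRYPTOGRAPHY_AVAILABLE
#   require_real_crypto(CRYPTOGRAPHY_AVAILABLE)
```

Raising at startup turns a silent downgrade into a visible deployment error, which is usually preferable to discovering later that vectors were produced without encryption.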
This README labels claims as Definition, Guarantee (measured or derivable), Assumption (design intent, not proven here), or Unknown (not measured), per the MSS honesty convention. Where this repo previously stated an Assumption as a Guarantee (e.g. "information-theoretic de-identification," "preserves all sufficient statistics," "HIPAA compliance through mathematical guarantees"), it has been relabeled or removed.
git clone https://github.com/chronomancy-io/chronohipaa.git
cd chronohipaa
pip install -r requirements-dev.txt
pip install -e .
Note: pre-commit is listed in requirements-dev.txt, but this repo currently has no .pre-commit-config.yaml, so there are no hooks to install. Linting is run via make lint (ruff + mypy) and in CI.
from datetime import datetime
from cde_encoder import CDEEncoder
encoder = CDEEncoder(master_key=b"\x00" * 32)  # 32-byte key required
vector = encoder.encode({
    "service_date": datetime(2025, 1, 15),
    "age": 47,
    "sex": "F",
    "race": "white",
    "zip_code": "02115",
    "diagnoses": ["E11.9", "I10"],
    "procedures": ["99213"],
    "medications": ["1191"],
    "outcome": "stable",
    "severity": 1,
})
assert len(vector) == 20  # always 20 bytes
See ARCHITECTURE.md for the actual code structure.
- ARCHITECTURE.md - System design and components
- PERFORMANCE.md - Benchmarks and characteristics
See CONTRIBUTING.md for development standards.
Apache-2.0 © 2026 Jacob Coleman — See LICENSE for details.