ChronoHIPAA

Python reference encoder for Coleman Dimensional Encoding (CDE) of healthcare records. Compact and lossy by design; HIPAA compliance is a goal, not something this repo proves or certifies (see Claim status).

standard-readme compliant License WASP v1.0.0 CDE v1.0.0 MSS v1.0.0

Maps an EHR record to a fixed 20-byte vector that carries 154 bits across six dimensions (20 bytes hold 160 bits, leaving 6 bits not assigned to any dimension). The 48-bit temporal dimension is encrypted with AES-GCM using an HKDF-derived monthly key; the other dimensions are quantized or hashed (see Encoding & privacy). It is a lossy, compact reference encoder for population-health-style analytics, not a certified de-identification or HIPAA-compliance product.

Background

This is a single-file Python reference encoder for Coleman Dimensional Encoding (CDE) framed as a minimally sufficient statistic (MSS) for healthcare data. It takes an EHR record dictionary and emits a fixed-size byte vector intended for cohort-level analytics, where individual-record fidelity is intentionally traded away for compactness.

The implementation lives in cde_encoder/ (an encoder and a vector dataclass). This repository contains no multi-cloud deployment, access-control engine, audit subsystem, or formal proof; it contains only the encoder, its vector format, and tests.

What is true here (labels in the MSS spirit):

  • Guarantee (measured/structural): every record encodes to exactly 20 bytes / 154 bits regardless of input size — verified by the 73 passing tests and by CDEVector.to_bytes.
  • Guarantee (measured): ~11.1x smaller than the synthetic record's JSON (222 → 20 bytes); see PERFORMANCE.md.
  • Assumption (not proven in-repo): that the encoding constitutes a minimally sufficient statistic for population-health analytics, and that it provides "information-theoretic de-identification." There is no proof, re-identification-risk measurement, or formal analysis in this repository. Treat these as design intent, not established results.
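The size guarantee above follows from the field widths in the Dimensions table: they sum to 154 bits, which rounds up to 20 bytes. The sketch below checks that arithmetic and packs six integer fields into a fixed-size byte string. The field order, the MSB-first layout, the trailing zero bits, and the name `pack_fields` are assumptions made for illustration; `CDEVector.to_bytes` may lay the bits out differently.

```python
# Field widths in bits, as listed in the Dimensions table.
WIDTHS = {
    "temporal": 48,
    "demographic": 16,
    "clinical": 32,
    "geographic": 24,
    "treatment": 32,
    "status": 2,
}


def pack_fields(values: dict[str, int]) -> bytes:
    """Pack the six fields MSB-first into a fixed-size byte string.

    Hypothetical layout: fields in table order, unused low bits set to zero.
    """
    acc = 0
    for name, width in WIDTHS.items():
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        acc = (acc << width) | value

    total_bits = sum(WIDTHS.values())        # 154
    total_bytes = (total_bits + 7) // 8      # 20
    spare_bits = total_bytes * 8 - total_bits  # 6
    return (acc << spare_bits).to_bytes(total_bytes, "big")


packed = pack_fields({name: 0 for name in WIDTHS})
print(len(packed))  # 20, independent of the field values
```

Because every field has a fixed width, the output length depends only on `WIDTHS`, never on the size of the input record.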

Dimensions

The encoder produces six dimensions (cde_encoder/encoder.py), packed by CDEVector:

| Dimension | Bits | What it stores | Privacy / fidelity |
| --- | --- | --- | --- |
| temporal | 48 | service date quantized to a month offset, then encrypted | AES-GCM + HKDF monthly key when cryptography is installed; otherwise a SHA-256 digest with no cryptographic security (see Crypto dependency) |
| demographic | 16 | age in 5-year bands + optional sex (1 bit) / race (3 bits) | lossy: exact age not retained |
| clinical | 32 | diagnosis-set fingerprint | lossy, non-invertible: 32-bit truncated SHA-256 of the sorted ICD codes; collisions possible |
| geographic | 24 | ZIP-prefix (12 bits) + region id (4 bits) | lossy: street/house-level precision removed |
| treatment | 32 | procedure + medication fingerprint | lossy, non-invertible: 32-bit truncated SHA-256 of sorted CPT/RxNorm codes; collisions possible |
| status | 2 | binary outcome + low bit of severity | lossy: 4 possible states |
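The clinical and treatment rows describe a 32-bit truncated SHA-256 over sorted codes. A minimal sketch of that idea follows, using only the standard library. The `"|"` separator, the de-duplication step, the choice of the first four digest bytes, and the name `fingerprint32` are illustrative assumptions, not the encoder's confirmed behavior.

```python
import hashlib


def fingerprint32(codes: list[str]) -> int:
    """Return a 32-bit fingerprint of a code set (hypothetical canonical form)."""
    # Sorting makes the result independent of the input order.
    canonical = "|".join(sorted(set(codes))).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    # Keep only the first 4 bytes; the other 28 are discarded, so the
    # original codes cannot be recovered from the result.
    return int.from_bytes(digest[:4], "big")


a = fingerprint32(["E11.9", "I10"])
b = fingerprint32(["I10", "E11.9"])
print(a == b)  # True: same set, different order
```

With only 32 bits of output, collisions are expected at scale: under a uniform-hash assumption, roughly 77,000 distinct code sets are enough for about a 50% chance of at least one collision (the birthday bound).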

Encoding & privacy

  • Lossiness (important): the clinical and treatment dimensions are 32-bit truncated SHA-256 fingerprints of the code sets. They are deliberately non-invertible and can collide; you cannot recover the original ICD/CPT/RxNorm codes from the vector. The demographic, geographic, and status dimensions are also quantized. Consequently the encoder does not preserve all sufficient statistics — for example, exact diagnoses, exact age, and exact location are not recoverable. Claims of "preserving all sufficient statistics" are inaccurate for this implementation and have been removed.
  • Encryption scope: only the 48-bit temporal field is encrypted (AES-GCM with an HKDF-derived monthly key). The whole vector is not encrypted at rest by this library.
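To make "HKDF-derived monthly key" concrete, here is a sketch of deriving one 32-byte key per calendar month from a master key. It implements HKDF with HMAC-SHA-256 (RFC 5869) from the standard library rather than the cryptography package the encoder uses, and it omits the AES-GCM step. The `info` label format, the empty salt, and the names `hkdf_sha256` and `monthly_key` are hypothetical.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with HMAC-SHA-256: extract, then expand."""
    if length > 255 * 32:
        raise ValueError("requested output is too long for HKDF-SHA-256")
    # Extract: an empty salt defaults to a string of HashLen zero bytes.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def monthly_key(master_key: bytes, year: int, month: int) -> bytes:
    """Derive a per-month key; the info label below is an assumed format."""
    info = f"cde-temporal/{year:04d}-{month:02d}".encode("ascii")
    return hkdf_sha256(master_key, salt=b"", info=info)


jan = monthly_key(b"\x00" * 32, 2025, 1)
feb = monthly_key(b"\x00" * 32, 2025, 2)
print(len(jan), jan != feb)  # 32 True
```

Binding the month into `info` means records from different months are protected under different keys, while any holder of the master key can re-derive a given month's key on demand.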

Crypto dependency

encode_temporal uses AES-GCM + HKDF only if the cryptography package is importable. If it is not, the encoder silently falls back to a SHA-256 digest of the month offset and master key, which the source explicitly labels as providing NO cryptographic security (cde_encoder/encoder.py, the else branch of encode_temporal). The fallback is for development/testing only. Install the pinned dependency (requirements.txt) for any non-toy use, and verify cde_encoder.encoder.CRYPTOGRAPHY_AVAILABLE is True at startup.
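One way to act on the startup check recommended above is to fail fast when the flag is not set. The helper name `require_real_crypto` is hypothetical, and the stand-in namespaces exist only so the sketch runs without cde_encoder installed; in an application you would pass the real `cde_encoder.encoder` module.

```python
import types


def require_real_crypto(encoder_module) -> None:
    """Raise unless the module reports that AES-GCM + HKDF are in use."""
    if getattr(encoder_module, "CRYPTOGRAPHY_AVAILABLE", False) is not True:
        raise RuntimeError(
            "cryptography is not importable: the temporal field would use the "
            "SHA-256 fallback, which provides no cryptographic security"
        )


# In an application (assumed usage):
#   import cde_encoder.encoder
#   require_real_crypto(cde_encoder.encoder)

# Stand-ins that mimic the module-level flag for demonstration.
fake_ok = types.SimpleNamespace(CRYPTOGRAPHY_AVAILABLE=True)
fake_fallback = types.SimpleNamespace(CRYPTOGRAPHY_AVAILABLE=False)

require_real_crypto(fake_ok)  # passes silently
try:
    require_real_crypto(fake_fallback)
except RuntimeError as exc:
    print("refused to start:", exc)
```

Running the check once at process start turns the silent fallback into a visible failure before any record is encoded.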

Claim status

This README labels claims as Definition, Guarantee (measured or derivable), Assumption (design intent, not proven here), or Unknown (not measured), per the MSS honesty convention. Where this repo previously stated an Assumption as a Guarantee (e.g. "information-theoretic de-identification," "preserves all sufficient statistics," "HIPAA compliance through mathematical guarantees"), it has been relabeled or removed.

Install

git clone https://github.com/chronomancy-io/chronohipaa.git
cd chronohipaa

pip install -r requirements-dev.txt
pip install -e .

Note: pre-commit is listed in requirements-dev.txt, but this repo currently has no .pre-commit-config.yaml, so there are no hooks to install. Linting is run via make lint (ruff + mypy) and in CI.

Usage

from datetime import datetime
from cde_encoder import CDEEncoder

encoder = CDEEncoder(master_key=b"\x00" * 32)  # 32-byte key required; this all-zero key is for demonstration only
vector = encoder.encode({
    "service_date": datetime(2025, 1, 15),
    "age": 47,
    "sex": "F",
    "race": "white",
    "zip_code": "02115",
    "diagnoses": ["E11.9", "I10"],
    "procedures": ["99213"],
    "medications": ["1191"],
    "outcome": "stable",
    "severity": 1,
})
assert len(vector) == 20  # always 20 bytes

See ARCHITECTURE.md for the actual code structure.

Documentation

Contributing

See CONTRIBUTING.md for development standards.

License

Apache-2.0 © 2026 Jacob Coleman — See LICENSE for details.
