
Cocoa: a configurable collator

☕️ Chicago's second favorite bean


About

This repo provides a configurable way to collate data from multiple sources into a single denormalized dataframe and create tokenized timelines from the results. It benefits from previous experience collating data to train foundation models on tokenized electronic health records [1, 2, 3, 4].

Installation

You can download and install this package as follows:

git clone git@github.com:bbj-lab/cocoa.git
cd cocoa
python -m venv .venv
. .venv/bin/activate
pip install -e .

(1) Collation

The collator pulls from raw data tables (parquet or csv) and combines them into a single denormalized dataframe in a MEDS-like format. Each row in the output represents an event with a subject_id, time, code (all mandatory), and optional numeric_value / text_value columns.

Collation is driven by a YAML config (the package ships a default; see ./src/cocoa/config/collation.yaml) that specifies:

  • A reference table with a primary key (subject_id), start/end times, and optional augmentation joins (e.g. joining a patient demographics table).
  • A list of entries, each mapping a source table (or the reference frame itself via table: REFERENCE) to the output schema. Each entry declares which column provides the code and time, and optionally the numeric_value and text_value. Codes can be given a prefix via the prefix field. Some preprocessing can be done with the optional fields filter_expr, with_col_expr, and agg_expr. These take the form of Polars expressions that are evaluated and applied to the dataframe during loading. Only mild checks are performed when evaluating these expressions, so in general a YAML config is just as powerful as Python code. Review all YAML files before use.
  • Subject splits (train_frac / tuning_frac) that partition subjects chronologically into train, tuning, and held-out sets.

A collation config has three top-level sections: identifiers, subject splits, and the reference + entries that define which events to extract.

Identifiers and splits

subject_id: hospitalization_id # the atomic unit of interest
group_id: patient_id # multiple subjects can belong to a group

subject_splits:
  train_frac: 0.7
  tuning_frac: 0.1
  # the remainder is held out

subject_id is the column that uniquely identifies each subject (e.g. a hospitalization). group_id is an optional higher-level grouping column. Subjects are sorted chronologically and split into train / tuning / held-out sets according to the specified fractions.
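The chronological split can be illustrated with a minimal sketch in plain Python. The function name `assign_splits`, the input shape (a dict of subject id to start time), and the rounding rule are assumptions made for illustration; the package itself may sort and round differently.

```python
from datetime import datetime


def assign_splits(start_times, train_frac=0.7, tuning_frac=0.1):
    """Map each subject id to 'train', 'tuning', or 'held_out'.

    start_times: dict of subject id -> datetime (hypothetical input shape).
    The earliest subjects go to train, the next block to tuning,
    and the remainder is held out.
    """
    ordered = sorted(start_times, key=lambda sid: start_times[sid])
    n = len(ordered)
    n_train = int(n * train_frac)    # truncation is an assumption
    n_tuning = int(n * tuning_frac)
    splits = {}
    for i, sid in enumerate(ordered):
        if i < n_train:
            splits[sid] = "train"
        elif i < n_train + n_tuning:
            splits[sid] = "tuning"
        else:
            splits[sid] = "held_out"
    return splits


# Ten made-up subjects admitted on consecutive days.
admissions = {f"s{i}": datetime(2110, 1, 1 + i) for i in range(10)}
splits = assign_splits(admissions)
```

Splitting by time rather than at random keeps later subjects out of the training set, which is why the sort comes before the cut.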

Reference table

The reference table is the primary static table to which other static information can be joined:

reference:
  table: clif_hospitalization
  start_time: admission_dttm
  end_time: discharge_dttm

  augmentation_tables:
    - table: clif_patient
      key: patient_id
      validation: "m:1"
      with_col_expr: pl.lit("AGE").alias("AGE")
  • table: the name of the parquet (or csv) file in --raw-data-home (without the extension).
  • start_time / end_time: columns that define the subject's time window; used to filter events from other tables when reference_key is set (see below).
  • augmentation_tables: optional list of tables to join onto the reference frame. Each needs a key to join on and a validation mode (e.g. "m:1"). You can also add computed columns via with_col_expr.
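To make the "m:1" validation mode concrete, here is a rough sketch of a many-to-one join over lists of dicts. The helper `join_many_to_one`, the left-join behavior, and the `sex_category` column are assumptions for illustration only; the collator itself works on Polars dataframes.

```python
def join_many_to_one(reference_rows, augmentation_rows, key):
    """Join augmentation columns onto reference rows, enforcing "m:1"."""
    by_key = {}
    for row in augmentation_rows:
        if row[key] in by_key:
            # A duplicated key on the right-hand side violates many-to-one.
            raise ValueError(f"duplicate {key}={row[key]!r} breaks m:1 validation")
        by_key[row[key]] = row
    # Left join is an assumption: unmatched reference rows are kept unchanged.
    return [{**ref, **by_key.get(ref[key], {})} for ref in reference_rows]


hospitalizations = [
    {"hospitalization_id": "h1", "patient_id": "p1"},
    {"hospitalization_id": "h2", "patient_id": "p1"},
]
patients = [{"patient_id": "p1", "sex_category": "female"}]  # made-up column
augmented = join_many_to_one(hospitalizations, patients, "patient_id")
```

Many hospitalizations may share one patient, but each patient_id may appear only once in the augmentation table; otherwise the join would silently duplicate subjects.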

Entries

The entries list defines the events to extract. Every entry produces rows with the columns subject_id, time, code, numeric_value, and text_value. The entry's fields tell the collator which source columns map to these outputs.

Required fields:

Field    Description
table    Source table name, or REFERENCE to pull from the reference frame.
code     Column whose values become the event code.
time     Column whose values become the event timestamp.

Optional fields:

Field            Description
prefix           String prepended to the code (separated by //), e.g. LAB-RES.
numeric_value    Column to use as the numeric value for the event.
text_value       Column to use as the text value for the event.
filter_expr      A Polars expression (or list of expressions) to filter rows before extraction.
with_col_expr    A Polars expression (or list) to add computed columns before extraction.
reference_key    Join the source table to the reference frame on this key and keep only rows within the subject's start_time to end_time window.
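The effect of reference_key can be sketched as a join followed by a time-window filter. This is an illustrative approximation in plain Python: the helper name `restrict_to_window`, the inclusive bounds, and the `start_dttm` event column are assumptions, not the package's implementation.

```python
from datetime import datetime


def restrict_to_window(events, reference_rows, key, time_col, start_col, end_col):
    """Join events to reference rows on `key`; keep events inside each window."""
    refs_by_key = {}
    for ref in reference_rows:
        refs_by_key.setdefault(ref[key], []).append(ref)
    kept = []
    for event in events:
        for ref in refs_by_key.get(event[key], []):
            # Inclusive bounds are an assumption of this sketch.
            if ref[start_col] <= event[time_col] <= ref[end_col]:
                kept.append({**event, "hospitalization_id": ref["hospitalization_id"]})
    return kept


reference = [{
    "hospitalization_id": "h1",
    "patient_id": "p1",
    "admission_dttm": datetime(2110, 1, 1),
    "discharge_dttm": datetime(2110, 1, 5),
}]
code_status = [  # made-up rows; start_dttm is a hypothetical column name
    {"patient_id": "p1", "code_status_category": "full", "start_dttm": datetime(2110, 1, 2)},
    {"patient_id": "p1", "code_status_category": "dnr", "start_dttm": datetime(2110, 2, 1)},
]
kept = restrict_to_window(
    code_status, reference, "patient_id", "start_dttm", "admission_dttm", "discharge_dttm"
)
```

Only the first row survives, because the second falls after discharge. This is how a patient-level table can be attributed to the correct hospitalization.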

Examples:

  • A simple categorical event from the reference frame:

    - table: REFERENCE
      prefix: DSCG
      code: discharge_category
      time: discharge_dttm

    creates codes such as DSCG//assisted_living, DSCG//home, DSCG//hospice with time discharge_dttm.

  • A numeric event from an external table:

    - table: clif_labs
      prefix: LAB-RES
      code: lab_category
      numeric_value: lab_value_numeric
      time: lab_result_dttm

    creates codes such as LAB-RES//alt and LAB-RES//ast with numeric_value lab_value_numeric at time lab_result_dttm.

  • Tables can be filtered prior to extraction with filter_expr:

    - table: clif_position
      prefix: POSN
      filter_expr: pl.col("position_category") == "prone"
      code: position_category
      time: recorded_dttm

    keeps only the rows where position_category equals "prone".

  • Multiple filters can be applied as a list:

    - table: clif_medication_admin_intermittent_converted
      prefix: MED-INT
      filter_expr:
        - pl.col("mar_action_category") == "given"
        - pl.col("_convert_status") == "success"
      code: med_category
      numeric_value: med_dose_converted
      time: admin_dttm
  • Creating a computed column with with_col_expr to use as the code:

    - table: clif_respiratory_support_processed
      prefix: RESP
      with_col_expr: pl.lit("fio2_set").alias("code")
      filter_expr: pl.col("fio2_set").is_finite()
      code: code
      numeric_value: fio2_set
      time: recorded_dttm
  • The reference_key can be used to restrict events to a subject's time window:

    - table: clif_code_status
      prefix: CODE
      code: code_status_category
      time: admission_dttm
      reference_key: patient_id
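As a rough illustration of how an entry maps source columns to the output schema, the sketch below applies the clif_labs entry from the examples to a list of row dicts. The function `extract_events` and the sample row are hypothetical; the collator itself operates on Polars dataframes and also handles filter_expr, with_col_expr, and reference_key, which are omitted here.

```python
from datetime import datetime


def extract_events(rows, entry, subject_id="hospitalization_id"):
    """Turn source rows into events with the five output columns."""
    numeric_col = entry.get("numeric_value")
    text_col = entry.get("text_value")
    events = []
    for row in rows:
        code = str(row[entry["code"]])
        if entry.get("prefix"):
            # The prefix is separated from the code by "//".
            code = f"{entry['prefix']}//{code}"
        events.append({
            "subject_id": row[subject_id],
            "time": row[entry["time"]],
            "code": code,
            "numeric_value": row[numeric_col] if numeric_col else None,
            "text_value": row[text_col] if text_col else None,
        })
    return events


labs_entry = {
    "table": "clif_labs",
    "prefix": "LAB-RES",
    "code": "lab_category",
    "numeric_value": "lab_value_numeric",
    "time": "lab_result_dttm",
}
lab_rows = [{  # made-up source row
    "hospitalization_id": "21343412",
    "lab_category": "albumin",
    "lab_value_numeric": 3.3,
    "lab_result_dttm": datetime(2112, 1, 11, 6, 31),
}]
events = extract_events(lab_rows, labs_entry)
```

The resulting event carries the code LAB-RES//albumin with numeric_value 3.3 and an empty text_value.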

Outputs

  • meds.parquet gives a table of the collated events:

    ┌────────────┬─────────────────────┬──────────────────────────────┬───────────────┬────────────┐
    │ subject_id ┆ time                ┆ code                         ┆ numeric_value ┆ text_value │
    │ ---        ┆ ---                 ┆ ---                          ┆ ---           ┆ ---        │
    │ str        ┆ datetime[μs]        ┆ str                          ┆ f32           ┆ str        │
    ╞════════════╪═════════════════════╪══════════════════════════════╪═══════════════╪════════════╡
    │ 24591817   ┆ 2111-09-26 18:15:00 ┆ MED-CTS//sodium_chloride     ┆ 0.0           ┆ null       │
    │ 21343412   ┆ 2112-01-11 06:31:00 ┆ LAB-RES//albumin             ┆ 3.3           ┆ null       │
    │ 24894995   ┆ 2113-01-14 14:25:00 ┆ LAB-ORD//creatinine          ┆ null          ┆ null       │
    │ 20947416   ┆ 2110-12-12 18:41:00 ┆ LAB-RES//hemoglobin          ┆ 8.4           ┆ null       │
    │ 25082363   ┆ 2110-06-17 17:00:00 ┆ VTL//respiratory_rate        ┆ 30.0          ┆ null       │
    │ …          ┆ …                   ┆ …                            ┆ …             ┆ …          │
    │ 22074503   ┆ 2110-07-13 03:53:00 ┆ LAB-ORD//chloride            ┆ null          ┆ null       │
    │ 24524153   ┆ 2110-10-08 03:20:00 ┆ LAB-RES//glucose_serum       ┆ 179.0         ┆ null       │
    │ 28104308   ┆ 2112-03-22 14:31:00 ┆ LAB-RES//sodium              ┆ 137.0         ┆ null       │
    │ 23859742   ┆ 2110-08-21 21:35:00 ┆ LAB-RES//ptt                 ┆ 26.299999     ┆ null       │
    │ 25805890   ┆ 2110-10-03 11:00:00 ┆ LAB-ORD//eosinophils_percent ┆ null          ┆ null       │
    └────────────┴─────────────────────┴──────────────────────────────┴───────────────┴────────────┘
    
  • subject_splits.parquet gives a table of all subject_id's and their corresponding split assignment:

    ┌────────────┬──────────┐
    │ subject_id ┆ split    │
    │ ---        ┆ ---      │
    │ str        ┆ str      │
    ╞════════════╪══════════╡
    │ 21081215   ┆ train    │
    │ 20302177   ┆ train    │
    │ …          ┆ …        │
    │ 27116134   ┆ tuning   │
    │ 29134959   ┆ tuning   │
    │ …          ┆ …        │
    │ 28150003   ┆ held_out │
    │ 22151813   ┆ held_out │
    └────────────┴──────────┘
    

(2) Tokenization

The tokenizer consumes the collated parquet output and converts events into integer token sequences suitable for sequence models. It:

  1. Adds BOS / EOS (beginning/end-of-sequence) tokens to each subject's timeline.
  2. Optionally inserts configurable clock tokens to mark the passage of time.
  3. Optionally inserts configurable time spacing tokens between events.
  4. Computes quantile-based bins for numeric values (from training data only).
  5. Maps codes (and optionally their binned values) to integer tokens via a vocabulary that is formed during training and is frozen for tuning/held-out data.
  6. Orders each subject's tokens by time, then by a configurable sort order, and aggregates them into per-subject sequences.
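Steps 4 and 5 can be sketched with the standard library: cutpoints are fit on training values only, and the vocabulary stops growing once it is frozen. The names `fit_cutpoints` and `Vocabulary` are hypothetical, and `statistics.quantiles` stands in for whatever quantile routine the package actually uses. Mapping unseen strings to UNK follows from the UNK: 0 entry shown in tokenizer.yaml, but is an assumption about behavior.

```python
import statistics


def fit_cutpoints(train_values, n_bins):
    """Return n_bins - 1 quantile cutpoints computed from training data only."""
    return statistics.quantiles(train_values, n=n_bins)


class Vocabulary:
    """String-to-integer lookup that can be frozen after training."""

    def __init__(self):
        self.lookup = {"UNK": 0}
        self.frozen = False

    def token(self, string):
        if string not in self.lookup:
            if self.frozen:
                # Unseen strings in tuning/held-out data map to UNK.
                return self.lookup["UNK"]
            self.lookup[string] = len(self.lookup)
        return self.lookup[string]


cutpoints = fit_cutpoints([float(v) for v in range(1, 101)], n_bins=10)

vocab = Vocabulary()
first = vocab.token("ADMN//direct")   # added during training
second = vocab.token("ADMN//ed")      # added during training
vocab.frozen = True
unseen = vocab.token("ADMN//unseen")  # not added once frozen
```

Fitting bins and vocabulary on the training split alone keeps information from the tuning and held-out subjects out of the tokenizer.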

Tokenization is driven by its own YAML config (the package ships a default; see ./src/cocoa/config/tokenization.yaml) that specifies:

  • n_bins: number of quantile bins for numeric values.
  • fused: whether to fuse the code, binned value, and text value into a single token (true) or keep them as separate tokens (false).
  • include_numeric_values: whether to include raw numeric values alongside tokens in the output (false by default).
  • insert_spacers: whether to insert time spacing tokens between events.
  • insert_clocks: whether to insert clock tokens at specified times.
  • ordering: the priority order of code prefixes when sorting events within the same timestamp.
  • spacers: mapping of time intervals (e.g., 5m-15m, 1h-2h) to their lower bounds in minutes, used for time spacing tokens.
  • clocks: list of hour strings (e.g., 00, 04, ...) at which to insert clock tokens.
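One plausible reading of the spacers mapping is that a gap between consecutive events selects the interval with the largest lower bound not exceeding the gap. The sketch below follows that reading; `spacer_label` is a hypothetical helper, the 15m-1h label is invented to fill out the example, and the sketch ignores upper bounds.

```python
def spacer_label(gap_minutes, spacers):
    """Return the interval label whose lower bound is the largest one <= gap.

    Returns None when the gap is shorter than every lower bound,
    meaning no spacer token would be inserted under this reading.
    """
    best = None
    for label, lower_bound in spacers.items():
        if lower_bound <= gap_minutes:
            if best is None or lower_bound > spacers[best]:
                best = label
    return best


# Labels mapped to lower bounds in minutes; "15m-1h" is a made-up entry.
spacers = {"5m-15m": 5, "15m-1h": 15, "1h-2h": 60}

short_gap = spacer_label(3, spacers)
medium_gap = spacer_label(20, spacers)
long_gap = spacer_label(90, spacers)
```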

Outputs

  • tokens_times.parquet gives one row per subject with three columns:

    • subject_id
    • tokens: the integer token sequence for the subject's timeline.
    • times: a parallel list of timestamps, one per token, indicating when each event occurred.

    The table will look something like this:

    ┌────────────────────┬─────────────────┬─────────────────────────────────┐
    │ subject_id         ┆ tokens          ┆ times                           │
    │ ---                ┆ ---             ┆ ---                             │
    │ str                ┆ list[u32]       ┆ list[datetime[μs]]              │
    ╞════════════════════╪═════════════════╪═════════════════════════════════╡
    │ 20002103           ┆ [20, 350, … 21] ┆ [2116-05-08 02:45:00, 2116-05-… │
    │ 20008372           ┆ [20, 350, … 21] ┆ [2110-10-30 13:03:00, 2110-10-… │
    │ …                  ┆ …               ┆ …                               │
    │ 29994865           ┆ [20, 364, … 21] ┆ [2111-01-28 21:49:00, 2111-01-… │
    └────────────────────┴─────────────────┴─────────────────────────────────┘
    

    In this example, token 20 corresponds to the beginning-of-sequence token (BOS), token 21 to the end-of-sequence token (EOS), and the tokens in between correspond to the subject's clinical events in chronological order (with ties broken by the configured ordering). In fused mode each event is a single token; in unfused mode an event with a numeric value becomes two tokens (code + quantile bin).
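The tie-breaking and the fused/unfused distinction can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the ordering list is made up, the fused suffix `_Q<bin>` mirrors the AGE//age_Q0 entry in the tokenizer lookup, and the standalone bin token name `Q<bin>` used in unfused mode is invented. Text values are left out.

```python
from datetime import datetime


def prefix_rank(code, ordering):
    """Position of the code's prefix in `ordering`; unknown prefixes sort last."""
    prefix = code.split("//", 1)[0]
    return ordering.index(prefix) if prefix in ordering else len(ordering)


def sort_events(events, ordering):
    """Sort by time first, then by the configured prefix priority."""
    return sorted(events, key=lambda e: (e["time"], prefix_rank(e["code"], ordering)))


def event_strings(code, bin_index, fused):
    """Token strings for one event, before the integer lookup."""
    if bin_index is None:
        return [code]
    if fused:
        return [f"{code}_Q{bin_index}"]   # one token per event
    return [code, f"Q{bin_index}"]        # code token + bin token (name assumed)


ordering = ["ADMN", "LAB-ORD", "LAB-RES"]  # hypothetical priority list
t = datetime(2110, 1, 1, 8, 0)
events = [
    {"time": t, "code": "LAB-RES//albumin"},
    {"time": t, "code": "ADMN//ed"},
]
ordered_codes = [e["code"] for e in sort_events(events, ordering)]
fused_tokens = event_strings("LAB-RES//albumin", 3, fused=True)
unfused_tokens = event_strings("LAB-RES//albumin", 3, fused=False)
```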

  • tokenizer.yaml is a plain yaml file that contains information about the configuration, learned vocabulary, and bins. This file is sufficient to reconstitute the tokenizer object. Currently, there's an entry for the lookup that maps strings to tokens:

    lookup:
      UNK: 0
      ADMN//direct: 1
      ADMN//ed: 2
      ADMN//elective: 3
      AGE//age_Q0: 4
      …

    and an entry for bin cutpoints:

    bins:
      VTL//heart_rate:
        - 65.0
        - 70.0
        - 75.0
        - 80.0
        - 84.0
        - 89.0
        - 94.0
        - 100.0
        - 108.0
      LAB-RES//platelet_count:
        - 62.0
        - 114.0
        - 147.0
        - 175.0
        - 203.0
        - 233.0
        - 267.0
        - 314.0
        - 390.0
      …

    The lists following each key correspond to the cutpoints for the associated category.
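Given the cutpoints above, a value's bin can be found by bisection. The sketch below uses the VTL//heart_rate cutpoints from the example; the helper `quantile_bin` is hypothetical, and whether a value equal to a cutpoint falls in the lower or upper bin is an assumption.

```python
import bisect

# Cutpoints copied from the tokenizer.yaml example above.
bins = {
    "VTL//heart_rate": [65.0, 70.0, 75.0, 80.0, 84.0, 89.0, 94.0, 100.0, 108.0],
}


def quantile_bin(code, value, bins):
    """Index of the bin containing `value`: 0 .. len(cutpoints)."""
    # bisect_right sends a value equal to a cutpoint to the upper bin (assumed).
    return bisect.bisect_right(bins[code], value)


low = quantile_bin("VTL//heart_rate", 60.0, bins)
mid = quantile_bin("VTL//heart_rate", 72.0, bins)
high = quantile_bin("VTL//heart_rate", 120.0, bins)
```

Nine cutpoints define ten bins, so the indices run from 0 to 9.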

Tip

To train a generative event model on this data, check out our configurable trainer: 🦜 cotorra

(3) Winnowing

The winnower prepares held-out timelines for evaluation by filtering and flagging subjects based on outcome criteria. It:

  1. Loads held-out data from the tokenized timelines and associated timestamps.
  2. Splits each subject's timeline at a configurable time horizon or at the first occurrence of a specified token, separating events into "past" (before the horizon) and "future" (after the horizon).
  3. Checks for the presence of outcome tokens in both the past and future periods.
  4. Filters out subjects whose timelines don't exceed the horizon duration, ensuring subjects have sufficient observation time.
  5. Outputs a winnowed dataset suitable for inference and evaluation tasks.

Winnowing is driven by a YAML config (the package ships a default; see ./src/cocoa/config/winnowing.yaml) that specifies:

  • outcome_tokens: list of event codes to track as outcomes (e.g., XFR-IN//icu, DSCG//expired). The winnower creates binary flags for each outcome indicating whether that token appears in the past or future period.
  • threshold: defines how the threshold is set. Currently supported options are as follows:
    • duration_s (integer) thresholds after a given duration (in seconds)
    • first_occurrence (token string) thresholds after the first occurrence of the provided token
    • uniform_random (boolean) thresholds at a point in time chosen uniformly at random from the total duration of the timeline
  • horizon_after_threshold_s: optional prediction window (in seconds) that starts when the threshold is triggered.

Example configuration:

outcome_tokens:
  - XFR-IN//icu
  - RESP//imv
  - DSCG//expired
  - DSCG//hospice
threshold:
  # choose one and only one of the following
  # duration_s: !!int 86400 # 24h
  first_occurrence: XFR-IN//icu

horizon_after_threshold_s: !!int 2592000 # 30d outcome window after prediction threshold
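The past/future split behind these settings can be sketched in plain Python. This is an illustration only: `threshold_time` and `winnow` are hypothetical helpers, tokens are shown as strings rather than integers for readability, uniform_random is omitted, and the boundary handling (an event exactly at the threshold counts as past) is an assumption.

```python
from datetime import datetime, timedelta


def threshold_time(tokens, times, threshold):
    """Time at which the timeline is cut, or None if it cannot be cut."""
    if "duration_s" in threshold:
        return times[0] + timedelta(seconds=threshold["duration_s"])
    for token, time in zip(tokens, times):
        if token == threshold["first_occurrence"]:
            return time
    return None


def winnow(tokens, times, outcome_tokens, threshold, horizon_after_threshold_s=None):
    """Return past/future outcome flags, or None if the subject is filtered out."""
    cut = threshold_time(tokens, times, threshold)
    if cut is None or times[-1] <= cut:
        return None  # timeline does not extend past the threshold
    end = None
    if horizon_after_threshold_s is not None:
        end = cut + timedelta(seconds=horizon_after_threshold_s)
    flags = {}
    for outcome in outcome_tokens:
        flags[f"{outcome}_past"] = any(
            tok == outcome and t <= cut for tok, t in zip(tokens, times)
        )
        flags[f"{outcome}_future"] = any(
            tok == outcome and t > cut and (end is None or t <= end)
            for tok, t in zip(tokens, times)
        )
    return flags


t0 = datetime(2110, 1, 1)
tokens = ["BOS", "XFR-IN//icu", "RESP//imv", "DSCG//expired", "EOS"]
times = [t0 + timedelta(hours=h) for h in (0, 2, 30, 72, 72)]
outcomes = ["XFR-IN//icu", "RESP//imv", "DSCG//expired"]

flags_24h = winnow(tokens, times, outcomes, {"duration_s": 86400}, 2592000)
flags_icu = winnow(tokens, times, outcomes, {"first_occurrence": "XFR-IN//icu"})
too_short = winnow(["BOS", "EOS"], [t0, t0 + timedelta(hours=1)], outcomes, {"duration_s": 86400})
```

With a 24-hour threshold, the ICU transfer at hour 2 is flagged as past, while mechanical ventilation at hour 30 and death at hour 72 are flagged as future.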

Outputs

  • held_out_for_inference.parquet has columns for each outcome token (e.g., XFR-IN//icu_past, XFR-IN//icu_future) indicating whether that outcome occurred in the respective time period.

Usage

We provide a CLI that should be sufficient for most use cases:

 Usage: cocoa [OPTIONS] COMMAND [ARGS]...

 Configurable collation and tokenization (vXX.X.X)

╭─ Options ───────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.     │
│ --show-completion             Show completion for the current shell, to     │
│                               copy it or customize the installation.        │
│ --help                        Show this message and exit.                   │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────╮
│ collate           Collate raw data into a denormalized format.              │
│ tokenize          Tokenize collated data into integer sequences.            │
│ winnow            Winnow held-out data for evaluation.                      │
│ pipeline          Run the full pipeline: collate, tokenize, & winnow.       │
│ combine-datasets  Combine multiple processed datasets into one.             │
╰─────────────────────────────────────────────────────────────────────────────╯

with commands:

  • cocoa collate

    Usage: cocoa collate [OPTIONS]
    
    Collate raw data into a denormalized format.
    
    Reads collation configuration and produces a MEDS-like parquet file
    with collated events.
    
    ╭─ Options ───────────────────────────────────────────────────────────────────╮
    │    --collation-config     -c      PATH  Collation configuration file        │
    │                                         (overrides default)                 │
    │ *  --raw-data-home        -r      TEXT  Raw data directory [required]       │
    │ *  --processed-data-home  -p      TEXT  Processed data directory [required] │
    │    --verbose              -v            Verbose logging for collate; this   │
    │                                         may cause memory issues with large  │
    │                                         datasets                            │
    │    --help                               Show this message and exit.         │
    ╰─────────────────────────────────────────────────────────────────────────────╯
    
  • cocoa tokenize

    Usage: cocoa tokenize [OPTIONS]
    
    Tokenize collated data into integer sequences.
    
    Reads collated parquet files and produces tokenized timelines with
    vocabulary and bin information.
    
    ╭─ Options ───────────────────────────────────────────────────────────────────╮
    │    --tokenization-config  -c      PATH  Tokenization configuration file     │
    │                                         (overrides config)                  │
    │ *  --processed-data-home  -p      TEXT  Processed data directory [required] │
    │    --tokenizer-home       -t      TEXT  Use a pretrained tokenizer at this  │
    │                                         path (overrides config)             │
    │    --verbose              -v            Verbose logging for collate; this   │
    │                                         may cause memory issues with large  │
    │                                         datasets                            │
    │    --help                               Show this message and exit.         │
    ╰─────────────────────────────────────────────────────────────────────────────╯
    
  • cocoa winnow

    Usage: cocoa winnow [OPTIONS]
    
    Winnow held-out data for evaluation.
    
    Filters held-out timelines and assigns flags to disqualify certain subjects
    from evaluation based on the configured criteria.
    
    ╭─ Options ───────────────────────────────────────────────────────────────────╮
    │    --winnowing-config     -c      PATH  Winnowing configuration file        │
    │                                         (overrides config)                  │
    │ *  --processed-data-home  -p      TEXT  Processed data directory [required] │
    │    --verbose              -v            Verbose logging for winnow; prints  │
    │                                         summary statistics                  │
    │    --help                               Show this message and exit.         │
    ╰─────────────────────────────────────────────────────────────────────────────╯
    
  • cocoa pipeline

    Usage: cocoa pipeline [OPTIONS]
    
    Run the full pipeline: collate, tokenize, & winnow.
    
    ╭─ Options ───────────────────────────────────────────────────────────────────╮
    │    --collation-config             PATH  Collation configuration file        │
    │                                         (overrides config)                  │
    │    --tokenization-config          PATH  Tokenization configuration file     │
    │                                         (overrides config)                  │
    │    --winnowing-config             PATH  Winnowing configuration file        │
    │                                         (overrides config)                  │
    │ *  --raw-data-home        -r      TEXT  Raw data directory [required]       │
    │ *  --processed-data-home  -p      TEXT  Processed data directory [required] │
    │    --verbose              -v            Verbose logging for pipeline steps  │
    │    --help                               Show this message and exit.         │
    ╰─────────────────────────────────────────────────────────────────────────────╯
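For scripted runs, the pipeline command can be assembled from Python. The sketch below only builds and prints the argument list, using flags from the help text above; the helper `pipeline_command` and the paths are placeholders.

```python
import shlex


def pipeline_command(raw_data_home, processed_data_home,
                     collation_config=None, verbose=False):
    """Build the argv list for `cocoa pipeline` from the documented options."""
    argv = [
        "cocoa", "pipeline",
        "--raw-data-home", raw_data_home,
        "--processed-data-home", processed_data_home,
    ]
    if collation_config is not None:
        argv += ["--collation-config", collation_config]
    if verbose:
        argv.append("--verbose")
    return argv


# Placeholder paths; substitute your own directories and config file.
argv = pipeline_command("data/raw", "data/processed",
                        collation_config="my_collation.yaml")
print(shlex.join(argv))
```

With the package installed, the list could be passed to `subprocess.run(argv, check=True)` to execute the pipeline.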
    

Tip

For common use cases, check out the recipes section!

Footnotes

  1. M. Burkhart, B. Ramadan, Z. Liao, K. Chhikara, J. Rojas, W. Parker, & B. Beaulieu-Jones, Foundation models for electronic health records: representation dynamics and transferability, arXiv:2504.10422

  2. M. Burkhart, B. Ramadan, L. Solo, W. Parker, & B. Beaulieu-Jones, Quantifying surprise in clinical care: Detecting highly informative events in electronic health records with foundation models, Pacific Symposium on Biocomputing 31 (2026), 173–188

  3. L. Solo, M. McDermott, W. Parker, B. Ramadan, M. Burkhart, & B. Beaulieu-Jones, Efficient generative prediction for EHR foundation models: the SCOPE and REACH estimators, arXiv:2602.03730

  4. I. Lee, L. Solo, M. Burkhart, B. Ramadan, W. Parker, & B. Beaulieu-Jones, Representation before training: a fixed-budget benchmark for generative medical event models, arXiv:2604.16775
