Cotorra: a configurable trainer

🦜 the wild parakeet of Chicago's south side

About

This repo provides a configurable trainer for generative event models on tokenized timelines. Cotorra is a Spanish term for a small-to-medium sized parrot, particularly the Monk parakeet. Monk parakeets were introduced to the south side of Chicago, where they have flourished. ¹ It benefits from previous experience training foundation models on tokenized electronic health records. ² ³ ⁴ ⁵

Installation

You can download and install this package as follows:

git clone git@github.com:bbj-lab/cotorra.git
cd cotorra
python -m venv .venv
. .venv/bin/activate
pip install -e ".[gen]" \
  --index-url https://download.pytorch.org/whl/cu128 \
  --extra-index-url https://pypi.org/simple

Context

Suppose you have a dataset of tokenized timelines tokens_times.parquet as a parquet table with columns:

subject_id
tokens — the integer token sequence for the subject's timeline.
times — a parallel list of timestamps, one per token, indicating when each event occurred.

The table will look something like this:

┌────────────────────┬─────────────────┬─────────────────────────────────┐
│ subject_id         ┆ tokens          ┆ times                           │
│ ---                ┆ ---             ┆ ---                             │
│ str                ┆ list[u32]       ┆ list[datetime[μs]]              │
╞════════════════════╪═════════════════╪═════════════════════════════════╡
│ 20002103           ┆ [20, 350, … 21] ┆ [2116-05-08 02:45:00, 2116-05-… │
│ 20008372           ┆ [20, 350, … 21] ┆ [2110-10-30 13:03:00, 2110-10-… │
│ …                  ┆ …               ┆ …                               │
│ 29994865           ┆ [20, 364, … 21] ┆ [2111-01-28 21:49:00, 2111-01-… │
└────────────────────┴─────────────────┴─────────────────────────────────┘

You also have a tokenizer.yaml, a plain yaml file that contains information about the configuration, learned vocabulary, and bins. This file is sufficient to reconstitute the tokenizer object. We only need this file to contain a lookup table:

lookup:
  UNK: 0
  ADMN//direct: 1
  ADMN//ed: 2
  ADMN//elective: 3
  AGE//age_Q0: 4
  ...

Finally, we need subject_splits.parquet which is a table listing out all subject_id's and their corresponding split assignment (with splits: train, tuning, and held_out):

┌────────────┬──────────┐
│ subject_id ┆ split    │
│ ---        ┆ ---      │
│ str        ┆ str      │
╞════════════╪══════════╡
│ 21081215   ┆ train    │
│ 20302177   ┆ train    │
│ …          ┆ …        │
│ 28150003   ┆ held_out │
│ 22151813   ┆ held_out │
└────────────┴──────────┘

For extraction and scoring workflows, we also need split-specific inference tables in the same processed_data_home directory:

train_for_inference.parquet
tuning_for_inference.parquet
held_out_for_inference.parquet

These tables are expected to include at least:

tokens_past (the model context used for extraction/scoring)
s_elapsed_past (if using time_based_rope)
token-specific label columns such as <TOKEN>_past and <TOKEN>_future used by generative and representation-based scoring.

The cocoa winnow command provides these.

Tip

For getting your data to this point, check out our configurable collator / tokenizer: ☕️ cocoa

Given these things, we want to train a model to predict the next token in a subject's timeline given their complete history or context up to this point. This package is designed to do that in a configurable way.

Configuration

This library can be extensively customized through yaml configuration files. Each command has its own default config under src/cotorra/config/, which you can override by passing a config file via the appropriate CLI flag. Any value can also be overridden programmatically via **kwargs which are merged on top of the YAML config via OmegaConf.

Training configuration (example)

Used by cotorra train and cotorra tune.

model_name: Name or path of the HuggingFace model (e.g., meta-llama/Llama-3.2-1B).
model_args: Model architecture parameters passed directly to HuggingFace's AutoConfig.
max_seq_len: Maximum sequence length for model input.
n_epochs: Number of epochs (handled in the dataloader, not the trainer).
run_name: Name for the current run (referenced by wandb and training_args).
tokens_of_interest: List of special tokens to upweight during training (referenced by loss config).
wandb:
- project: Weights & Biases project name for experiment tracking.
- run_name: Name for the current run.
custom_loss: Boolean flag to enable custom loss functions (default: false).
quantile_token_loss (optional): Upweights loss on quantile boundary tokens.
- qt_weight: Weight multiplier for quantile tokens.
label_weighted_loss (optional): Upweights loss on specific tokens of clinical interest.
- tokens_of_interest: List of token labels to upweight.
- toi_weight: Weight multiplier applied to those tokens.
time_based_rope (optional): Enables time-aware rotary position embeddings.
- sec_per_pos_id: Number of seconds represented by one position id increment.
training_args: Arguments passed to HuggingFace's TrainingArguments.
tuning_args: Arguments passed to HuggingFace's hyperparameter_search when cotorra tune is called.

Extraction configuration (example)

Used by cotorra extract.

max_seq_len: Maximum sequence length.
time_based_rope (optional): Enables time-aware position ids during extraction (must match the setting used at training time).
- sec_per_pos_id: Number of seconds represented by one position id increment.
extract:
- max_len: Maximum input length (tokens) during extraction.
- batch_size: Batch size for inference.
- shard_size (optional): Number of samples per output parquet shard. Omit to write a single file per split.

Scoring configuration (example)

Used by cotorra generative-score and cotorra rep-based-score.

run_name: Name for the current run, used to label output files.
tokens_of_interest: List of token-based outcomes of interest.
score:
- max_len: Maximum input length (tokens) during scoring.
- n_samp: Number of Monte Carlo samples per input per trajectory type.
- target_tokens: Token-based outcomes of interest to score.
- end_tokens: Tokens that naturally terminate a generated sequence (e.g. EOS).
- suppressed_tokens: Tokens to suppress via logit bias during generation (e.g. PAD).
- trunc_id: Token id forced after the time horizon is exceeded.
- max_time: Maximum time horizon in minutes.
- batch_size: Batch size for inference.

Usage

We provide a CLI:

 Usage: cotorra [OPTIONS] COMMAND [ARGS]...

 Configurable training for generative event models (vXX.X.X)

╭─ Options ───────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.     │
│ --show-completion             Show completion for the current shell, to     │
│                               copy it or customize the installation.        │
│ --help                        Show this message and exit.                   │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────╮
│ train             Train a model on tokenized data. For tokenization,        │
│                   consult the cocoa package.                                │
│ tune              Run hyperparameter tuning while training a model.         │
│ extract           Extract representations from a trained model.             │
│ generative-score  Generate SCORE/REACH metrics from a trained model and     │
│                   save them to parquet.                                     │
│ rep-based-score   Generate rep-based scores for the token-based outcomes of │
│                   interest.                                                 │
│                   Note: this requires that features have already been       │
│                   extracted and saved                                       │
╰─────────────────────────────────────────────────────────────────────────────╯

with commands:

cotorra train

Usage: cotorra train [OPTIONS]

Train a model on tokenized data. For tokenization, consult the cocoa package.

╭─ Options ───────────────────────────────────────────────────────────────────╮
│    --training-config      -t      PATH  Training configuration file         │
│                                         (overrides default)                 │
│ *  --processed-data-home  -p      TEXT  Processed data directory (overrides │
│                                         config)                             │
│                                         [required]                          │
│ *  --output-home          -o      TEXT  Output directory for trained models │
│                                         [required]                          │
│    --verbose              -v            Verbose logging for collate         │
│    --help                               Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────╯

cotorra tune

Usage: cotorra tune [OPTIONS]

Run hyperparameter tuning while training a model.

╭─ Options ───────────────────────────────────────────────────────────────────╮
│    --training-config      -t      PATH  Training configuration file         │
│                                         (overrides default)                 │
│ *  --processed-data-home  -p      TEXT  Processed data directory (overrides │
│                                         config)                             │
│                                         [required]                          │
│ *  --output-home          -o      TEXT  Output directory for trained models │
│                                         [required]                          │
│    --verbose              -v            Verbose logging for collate         │
│    --help                               Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────╯

cotorra generative-score

Usage: cotorra generative-score [OPTIONS]

Generate SCORE/REACH metrics from a trained model and save them to parquet.

╭─ Options ───────────────────────────────────────────────────────────────────╮
│    --scoring-config       -s      PATH  Scoring configuration file          │
│                                         (overrides default)                 │
│ *  --processed-data-home  -p      TEXT  Processed data directory [required] │
│ *  --model-home           -m      TEXT  Directory of the trained model to   │
│                                         score with                          │
│                                         [required]                          │
│    --output-home          -o      TEXT  Output directory for scores,        │
│                                         defaults to processed-data-home     │
│    --verbose              -v            Verbose logging for collate         │
│    --help                               Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────╯

cotorra extract

Usage: cotorra extract [OPTIONS]

Extract representations from a trained model.

╭─ Options ───────────────────────────────────────────────────────────────────╮
│    --extraction-config    -e      PATH  Extraction configuration file       │
│                                         (overrides default)                 │
│ *  --processed-data-home  -p      TEXT  Processed data directory [required] │
│ *  --model-home           -m      TEXT  Directory of the trained model to   │
│                                         extract from                        │
│                                         [required]                          │
│    --output-home          -o      TEXT  Output directory for extracted      │
│                                         features, defaults to               │
│                                         processed-data-home                 │
│    --all-times            -a            Extract features for all time steps │
│                                         (instead of just the final one)?    │
│    --help                               Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────╯

cotorra rep-based-score (note: you need to run extract first)

Usage: cotorra rep-based-score [OPTIONS]

Generate rep-based scores for the token-based outcomes of interest. Note:
this requires that features have already been extracted and saved

╭─ Options ───────────────────────────────────────────────────────────────────╮
│    --scoring-config       -s      PATH  Scoring configuration file          │
│                                         (overrides default)                 │
│ *  --processed-data-home  -p      TEXT  Processed data directory [required] │
│ *  --model-home           -m      TEXT  Directory of the trained model to   │
│                                         score with                          │
│                                         [required]                          │
│    --output-home          -o      TEXT  Output directory for scores,        │
│                                         defaults to processed-data-home     │
│    --verbose              -v            Verbose logging for collate         │
│    --help                               Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────╯

L. Gersony, "The Quiet Victory of Chicago’s Monk Parakeets," The Chicago Maroon, 23 January 2022, https://chicagomaroon.com/28830/grey-city/quiet-protest-chicagos-monk-parakeets/ ↩
M. Burkhart, B. Ramadan, Z. Liao, K. Chhikara, J. Rojas, W. Parker, & B. Beaulieu-Jones, Foundation models for electronic health records: representation dynamics and transferability, arXiv:2504.10422 ↩
M. Burkhart, B. Ramadan, L. Solo, W. Parker, & B. Beaulieu-Jones, Quantifying surprise in clinical care: Detecting highly informative events in electronic health records with foundation models, Pacific Symposium on Biocomputing 31 (2026), 173–188 ↩
L. Solo, M. McDermott, W. Parker, B. Ramadan, M. Burkhart, & B. Beaulieu-Jones, Efficient generative prediction for EHR foundation models: the SCOPE and REACH estimators, arXiv:2602.03730 ↩
I. Lee, L. Solo, M. Burkhart, B. Ramadan, W. Parker, & B. Beaulieu-Jones, Representation before training: a fixed-budget benchmark for generative medical event models, arXiv:2604.16775 ↩

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
.vscode		.vscode
config		config
img		img
recipes		recipes
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
.prettierignore		.prettierignore
.prettierrc.toml		.prettierrc.toml
.taplo.toml		.taplo.toml
LICENSE.md		LICENSE.md
README.md		README.md
__init__.py		__init__.py
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cotorra: a configurable trainer

About

Installation

Context

Configuration

Training configuration (example)

Extraction configuration (example)

Scoring configuration (example)

Usage

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cotorra: a configurable trainer

About

Installation

Context

Configuration

Training configuration (example)

Extraction configuration (example)

Scoring configuration (example)

Usage

Footnotes

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages