Stationarity Bias · Scenarios A / B / C · Slurm + Optuna + PyPOTS
This document serves as the official, paper-faithful specification of the experimental protocol described in the accompanying publication. It is intended for reviewers, collaborators, and future maintainers who need to understand, reproduce, or extend the reported experiments.
Every configuration, parameter, and pipeline step documented here maps directly to the paper — there are no undocumented behaviors, implicit assumptions, or deviations from the published methodology.
All models are implemented using PyPOTS, a unified library for time-series imputation, and hyperparameter optimization is conducted with Optuna.
The paper formalizes three findings with broad implications beyond CGM:
- The Stationarity Bias. Uniform random masking on signals with a dominant stationary regime systematically inflates the apparent performance of simple imputation methods.
- The RMSE Mirage. Low pointwise error (RMSE) masks the destruction of signal morphology during critical transients. DTW reveals divergence up to 3× larger than RMSE suggests.
- Regime-Conditional Model Selection. Linear interpolation is optimal during stationary intervals; deep learning (SAITS) is essential during transients for morphological fidelity, clinical accuracy, and safety-relevant bias.
The paper defines two non-overlapping regimes that serve different purposes and must never be mixed.
| Regime | Purpose | Used When |
|---|---|---|
| Mixed | Robust representation learning | Training only |
| Scenario A/B/C | Stress-test evaluation | Validation and testing only |
Mixed is a training-only regime. It exposes models to ecologically valid missingness patterns derived from clinical trial data (DCLP3, DCLP5) and optimizes a real-missingness loss.
Note: Mixed is never used for evaluation.
Configuration:
type: Mixed
missing_enabled: True
miss_config:
is_real_mask: True| Aspect | Detail |
|---|---|
| Usage | Training only |
| Missingness | Real (trial-derived) |
| Loss | Real-missingness loss (paper) |
| Scenario A/B/C | Not applied |
Evaluation is performed under explicit stress-test scenarios that isolate distinct dynamical regimes. Each scenario targets a specific clinical context where imputation quality has direct safety implications.
| Scenario | Regime | What Is Masked | Masking Strategy |
|---|---|---|---|
| A | Stationary | Homeostatic gaps (euglycemic stability) | 10%, 20%, 30% of sequence length |
| B | Transient | Post-prandial peaks | 3.5–4 hr windows centered on 1, 2, or 3 peaks |
| C | Safety-critical | Hypoglycemic events during TCR activation | 1 hr window centered on glucose < 70 mg/dL |
Configuration:
missing_enabled: True
miss_config:
type: 'A' | 'B' | 'C'
is_real_mask: FalseEvaluates reconstruction fidelity during stable intervals. Candidate 30-minute windows must satisfy all five criteria simultaneously:
- Euglycemia: All glucose values in 70–140 mg/dL.
- Gradient stability: ≥ 85% of time points satisfy |∇gₜ| < 0.6 mg/dL/min.
- No exogenous inputs: No meal or bolus events within the window (Σcₜ = 0).
- Washout period: No meal or bolus events in the 60 minutes preceding the window.
- Low variability: Glucose range (gₘₐₓ − gₘᵢₙ) < 25 mg/dL.
Masking is allocated into full 30-minute segments, with any residual applied as a partial segment. Experiments are run at masking ratios of 10%, 20%, and 30% of total sequence length.
Evaluates reconstruction of critical glycemic excursions:
- Meals separated by < 1 hour are aggregated into single events.
- A 4-hour post-prandial window is scanned to identify each peak.
- A masking window of random duration 3.5–4 hours is centered on the identified peak.
- Masking windows must not cover pre-meal time steps or exceed total daily duration.
Experiments vary the number of masked peaks: 1, 2, or 3.
Evaluates imputation during insulin pump adjustments precipitated by impending hypoglycemia:
- Uses the TCR-Simulation dataset exclusively (no models are trained on this data).
- A 1-hour window is masked, centered on the hypoglycemic event (glucose < 70 mg/dL) observed during TCR activation.
- TCR activation: 2.5 hours post-meal, basal rate reduced to 5% of nominal for 4 hours.
All datasets use CGM measurements, timestamps, meal, basal, and bolus insulin at 5-minute intervals. Data is split at the patient level (70% train / 10% validation / 20% test) to assess generalization on unseen subjects.
The DCLP3, DCLP5, and PEDAP clinical datasets can be downloaded from the Jaeb Center for Health Research: https://public.jaeb.org/datasets/diabetes
| Dataset | Processing Pipeline |
|---|---|
| DCLP3, DCLP5 | Processed using the DIAX repository |
| PEDAP | Processed using the process_pedap.ipynb notebook included in this repository |
The UVA/Padova Simulation and TCR-Simulation datasets will be available for download soon.
All optimized hyperparameters, trained model checkpoints, UVA/Padova Simulation, and TCR-Simulation datasets are available at:
https://myuva-my.sharepoint.com/:f:/g/personal/aww9gh_virginia_edu/IgCoABI3SmJzQrGXDZKyl0G1AZuLXYe2nO05kiXvYIguzUQ?e=8s5S5o
After downloading, simply set saving_path in your configuration to the downloaded directory and run the evaluation pipeline directly.
102 children aged 2–6 years, 13-week clinical trial.
| Variant | Episode Split | Missing CGM Handling | Usage |
|---|---|---|---|
| Raw PEDAP | Gaps > 240 min | Retained as missing | Training only |
| Processed PEDAP | Gaps > 30 min | Linearly interpolated | Testing |
Auxiliary variables (meal, bolus, basal) are zero-filled in both variants. Models trained on Raw PEDAP are tested on Processed PEDAP.
100 adults, 3 meals/day (Breakfast 06:00–11:00, Lunch 11:00–13:00, Dinner 18:00–20:00). Meal times have random variability (σ = 20 min), meal sizes range 0.84–1.44 g/kg with 15% size variability. No exercise. Hypoglycemia treatment triggered below 70 mg/dL.
Follows the UVA/Padova protocol with temporal variability removed. Induces hypoglycemia via meal size overestimation (+20% to +30%). One meal per day triggers TCR activation 2.5 hours post-meal for 4 hours (basal rate × 0.05). Used exclusively for Protocol C evaluation — 20% of patients designated as the test set.
| Model | Category | Reference |
|---|---|---|
| SAITS | Dual-stage Transformer | Du et al., 2023 |
| GPT4TS | Frozen GPT-2 adaptation | Zhou et al., 2023 |
| FreTS | Frequency-domain MLP | Yi et al., 2023 |
| TSLANet | Spectral + convolution blocks | Eldele et al., 2024 |
| TimeMixer | Multi-scale decomposable mixing | Wang et al., 2024 |
| SCINet | Recursive downsampling | Liu et al., 2022 |
| TEFN | Evidence theory + fuzzy masses | Zhan et al., 2025 |
| TOTEM | VQ-VAE tokenization | Talukder et al., 2024 |
Mean, Median, LOCF (Last Observation Carried Forward), and Linear Interpolation (Lerp).
| Metric | What It Measures | Why It Matters |
|---|---|---|
| RMSE | Pointwise reconstruction error | Standard benchmark; susceptible to the RMSE Mirage |
| MARD | Mean absolute relative difference (%) | Clinical gold standard for CGM accuracy |
| DTW | Temporal/morphological similarity | Captures shape destruction invisible to RMSE |
| Bias | Systematic over-/under-estimation | Directly affects closed-loop controller calibration |
| EmpSE | Empirical standard error | Variability of residuals |
Mixed training relies on trial-derived missingness statistics estimated from DCLP3 and DCLP5. The stochastic model has three components:
- Hourly onset probability P_start(h): fraction of valid monitoring days on which a gap begins at hour h.
- Single-point dropout classification: probability π_short that a gap is a transient 5-minute dropout.
- Sustained gap duration model: mixture of Exponential + Gaussian (centered near 120-min sensor warm-up) + uniform component, estimated separately for Day (h ∈ [6, 24)) and Night (h ∈ [0, 6)) regimes. Maximum duration capped at 240 minutes.
miss_config:
csv_list:
- RawData/dclp3_cgm_plus_features.csv
- RawData/dclp5_cgm_plus_features.csv
valid_threshold: 0.5| Parameter | Meaning |
|---|---|
csv_list |
Clinical-trial feature files used to estimate real-missingness loss |
valid_threshold |
Minimum observed/expected ratio (O_obs / 288) for valid days |
miss_config:
protocol_mask_ratio: 0.3
num_meal_hide: 3
peak_prominence_mgdl: 40
peak_distance_minutes: 120| Parameter | Scenario | Meaning |
|---|---|---|
protocol_mask_ratio |
A | Fraction of sequence masked (0.1, 0.2, 0.3) |
num_meal_hide |
B | Number of post-prandial peaks masked (1–3) |
peak_prominence_mgdl |
B | Minimum peak prominence (mg/dL) |
peak_distance_minutes |
B | Minimum separation between peaks (minutes) |
All models are implemented using the PyPOTS library. The following settings are fixed across all experiments and must not be changed.
seq_len: 288 # 24 hours at 5-min intervals
stride_train: 57
stride_test: 128
batch_size: 32
max_day_length: 288
early_stopping_patience: 10
max_epochs: 100
random_seed: 7Optimization is conducted using Optuna over 40 trials per model. The best-performing configuration is selected for final evaluation.
--ParamRangeDir param_range.jsonDefines model-specific hyperparameter ranges and ensures Optuna samples only valid, paper-approved configurations.
Note: The number of trials is defined in Slurm, not in
config.yml.
Three SLURM scripts cover the full pipeline. All scripts activate the shared virtual environment
and write logs to new_logs/ or test_logs/.
Runs Optuna HPO on a single model, then immediately evaluates the best configuration under the
active scenario defined in config.yml.
Computing resources: 1 GPU · 10 CPUs · 100 GB RAM · 20 min
# Set MODEL_NAMES and --array in slurm.sh, then submit:
sbatch slurm.shThe script executes:
python hyperparameter_engine.py \
--model_name "$model" \
--ParamRangeDir param_range.json \
--config-path config.yml \
--NTrials 40 \
--is_evaluateTo train the best-found configuration instead of evaluating, swap --is_evaluate for --train_best.
Supported model names (uncomment the desired list in slurm.sh):
SAITS FreTS SCINet TimeMixer TimeMixerPP TSLANet TEFN TOTEM GPT4TS CSDI
Lerp LOCF Median Mean
Sweeps every combination of scenario parameters across three datasets and four models in a single SLURM array job (one task per model).
Computing resources: 1 GPU · 10 CPUs · 64 GB RAM · 72 h · --array=0-3
sbatch ablation_slurm.shModels: SAITS FreTS SCINet Lerp
Datasets:
| Label | Description | Flag |
|---|---|---|
simulate |
UVA/Padova simulation data | (none) |
pedap |
PEDAP clinical dataset | --is_pedap |
none_pedap |
PEDAP (non-pedap evaluation split) | --is_pedap |
Parameter sweep per scenario:
| Scenario | Parameter | Values swept |
|---|---|---|
| A | protocol_mask_ratio |
0.1 0.2 0.3 0.5 |
| A | window_scenario_A_length (min) |
10 30 60 |
| B | num_meal_hide |
1 2 3 |
| B | [min_length_B, max_length_B] (hr) |
[1, 2] [3.5, 4] [5, 6] |
| C | hypo_length (min) |
30 60 120 |
Scenario C runs exclusively on the TCR dataset (CGM_Imputation/tcr_data/).
The script calls hyperparameter_engine.py with --is_evaluate --is_ablation for every
combination. Results are stored under ./DAY_NIGHT/.
Runs final held-out test evaluation for seven models in parallel using pre-tuned hyperparameters
from best_hyperparameter.json.
Computing resources: 1 GPU · 10 CPUs · 100 GB RAM · 2 h · --array=0-6
sbatch test_slurm.shModels (array index → model):
| Index | Model |
|---|---|
| 0 | SAITS |
| 1 | FreTS |
| 2 | Transformer |
| 3 | CSDI |
| 4 | MRNN |
| 5 | GRUD |
| 6 | SCINet |
The script executes:
python hyperparameter_engine.py \
--model_name "$model" \
--ParamRangeDir best_hyperparameter.json \
--config-path config.yml \
--NTrials 40 \
--is_test| Script | Purpose | Array | Time | Memory |
|---|---|---|---|---|
slurm.sh |
HPO + scenario evaluation | 0 |
20 min | 100 GB |
ablation_slurm.sh |
Full parameter sweep (A / B / C) | 0-3 |
72 h | 64 GB |
test_slurm.sh |
Held-out test with best params | 0-6 |
2 h | 100 GB |
Key differences between scripts:
slurm.shusesparam_range.json;test_slurm.shusesbest_hyperparameter.json.ablation_slurm.shloops over all scenario-parameter combinations internally — no manual re-submission needed.- All scripts set
PYTHONHASHSEED=7for reproducibility.
Logs are written to:
new_logs/train-out_<jobid>_<taskid>.log— training / ablation jobstest_logs/test-out_<jobid>_<taskid>.log— test jobs
data_path: /path/to/your/data/
saving_path: /path/to/output/root/| Field | Meaning |
|---|---|
data_path |
Dataset used for training or evaluation |
saving_path |
Root directory for all outputs |
is_pedap: True | FalseWhen True, the pipeline uses the Raw PEDAP and Processed PEDAP datasets. When False, it uses the Simulation datasets (UVA/Padova, TCR-Simulation).
All outputs — training artifacts, Optuna logs, and scenario evaluation results — are stored under a single directory:
Mixed/
The following are held constant across all experiments:
- Fixed temporal context (seq_len = 288, stride = 128 for test)
- Fixed missingness definitions per scenario
- Fixed Optuna budget (40 trials per model)
- Scenario isolation (no cross-contamination between A/B/C)
- Identical compute resources
- Fixed random seed (7)
If you have any questions or issues running the code, don't hesitate to reach out: aww9gh@virginia.edu