Problem
The vessel_satellite_radiance_delayed_qc dataset config (and likewise vessel_satellite_radiance_derived_product) combines data from two vessels that operate simultaneously into a single Zarr store, using TIME as the only append dimension:
paths:
s3://imos-data/IMOS/SRS/OC/radiometer/VMQ9273_Solander # RV Solander
s3://imos-data/IMOS/SRS/OC/radiometer/VLHJ_Southern-Surveyor # RV Southern Surveyor
When both vessels are at sea at the same time, their TIME values overlap. The current code path in _write_ds detects the overlap and calls _handle_duplicate_regions, which overwrites the existing data at those TIME positions with the new batch's data. As a result, whichever vessel was processed last replaces the other vessel's observations, and the data loss is silent.
The _find_duplicated_values method detects this post-write and logs a WARNING, but:
- The warning fires once per batch, producing many repeated lines that differ only in their UUIDs
- A misleading TODO comment claims this may be acceptable:
# Not necessarily an issue. For example, some SOOP dataset, same TIME, 2 different NetCDF files, 2 different vessel and location.
- The warning doesn't name the specific conflicting source files
The config already tracks vessel identity via platform_code (global attribute → per-TIME variable, e.g. VMQ9273 / VLHJ), so the information exists to detect and handle this.
Impact
Any TIME step where both vessels were active retains data only from whichever vessel's batch was processed last. If a re-run processes the batches in a different order, the surviving vessel may flip, making results non-deterministic.
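The effect can be reproduced with a toy model. The sketch below is not the actual _write_ds code: it assumes a store that behaves like a dict keyed by TIME, uses a hypothetical write_batch helper, and the timestamps and values are made up for illustration.

```python
def write_batch(store, batch):
    """Mimic overwrite-on-duplicate: rows in the new batch replace rows at the same TIME."""
    for time, row in batch.items():
        store[time] = row
    return store


# Made-up sample rows; both vessels report at 00:01.
solander = {
    "2011-03-01T00:00": {"platform_code": "VMQ9273", "value": 1.0},
    "2011-03-01T00:01": {"platform_code": "VMQ9273", "value": 1.1},
}
surveyor = {
    "2011-03-01T00:01": {"platform_code": "VLHJ", "value": 9.9},
    "2011-03-01T00:02": {"platform_code": "VLHJ", "value": 9.8},
}

# Same inputs, two processing orders.
run_a = write_batch(write_batch({}, solander), surveyor)
run_b = write_batch(write_batch({}, surveyor), solander)

overlap = "2011-03-01T00:01"
print(run_a[overlap]["platform_code"])  # VLHJ
print(run_b[overlap]["platform_code"])  # VMQ9273
print(len(run_a))  # 3 rows kept out of 4 observations
```

In both runs one of the four observations is dropped, and which one depends only on processing order.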
Options
Option 1 — Split into two separate dataset configs
Create:
vessel_satellite_radiance_VMQ9273_Solander_delayed_qc
vessel_satellite_radiance_VLHJ_Southern-Surveyor_delayed_qc
Each config points to one vessel's S3 path, producing a separate Zarr store. No TIME collisions possible.
Pros: Cleanest, minimal code changes, fully correct.
Cons: Two configs to maintain; downstream consumers must query both stores.
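As a rough sketch of the split, the per-vessel config names can be derived from the existing S3 paths. The split_configs helper and the dataset_name key are hypothetical; the real config schema may use different keys.

```python
PATHS = [
    "s3://imos-data/IMOS/SRS/OC/radiometer/VMQ9273_Solander",
    "s3://imos-data/IMOS/SRS/OC/radiometer/VLHJ_Southern-Surveyor",
]


def split_configs(paths, prefix="vessel_satellite_radiance", suffix="delayed_qc"):
    """Build one config stub per vessel path (hypothetical config layout)."""
    configs = {}
    for path in paths:
        # The last path component identifies the vessel, e.g. "VMQ9273_Solander".
        vessel = path.rstrip("/").rsplit("/", 1)[-1]
        name = f"{prefix}_{vessel}_{suffix}"
        configs[name] = {"dataset_name": name, "paths": [path]}
    return configs


configs = split_configs(PATHS)
for name in configs:
    print(name)
```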
Option 2 — Add platform as a fixed second dimension (recommended)
Restructure the Zarr to have dims (TIME, platform) (or (platform, TIME)):
- TIME remains the append dimension, so _handle_duplicate_regions works unchanged
- platform is a fixed 2-value dim (["VMQ9273", "VLHJ"]), so no append logic is needed
Implementation in preprocessing:
- Determine the vessel's platform_code from the file's global attributes
- Expand the 1D (TIME,) variables to (platform, TIME) with NaN fill for the inactive platform
- Set the platform coordinate accordingly
The Zarr region-write logic (region={TIME: slice(start, end)}) still works because platform is a fixed dimension — _write_ds and _handle_duplicate_regions only iterate over the TIME dimension.
Pros: Single store, all vessels co-located, no data loss, platform is a queryable coordinate.
Cons: Requires preprocessing changes to expand 1D files to 2D; schema update; migration of existing stores.
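The expansion step can be sketched in plain Python, with nested lists standing in for the (platform, TIME) arrays. expand_to_platform and merge_rows are hypothetical helpers, not part of the existing preprocessing code, and the sample values are made up.

```python
import math

PLATFORMS = ["VMQ9273", "VLHJ"]


def expand_to_platform(values, platform_code, platforms=PLATFORMS):
    """Return rows indexed [platform][time]; only the active platform's row holds data."""
    if platform_code not in platforms:
        raise ValueError(f"unknown platform_code: {platform_code!r}")
    nan = float("nan")
    return [
        list(values) if code == platform_code else [nan] * len(values)
        for code in platforms
    ]


def merge_rows(existing, incoming):
    """Combine two blocks covering the same TIME slice, keeping incoming non-NaN values."""
    return [
        [new if not math.isnan(new) else old for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(existing, incoming)
    ]


a = expand_to_platform([1.0, 1.1], "VMQ9273")  # VLHJ row is all NaN
b = expand_to_platform([9.9, 9.8], "VLHJ")     # VMQ9273 row is all NaN
merged = merge_rows(a, b)
print(merged)  # [[1.0, 1.1], [9.9, 9.8]]
```

One point worth checking under this layout: if a region write covers both platform rows, the NaN fill for the inactive platform could overwrite the other vessel's values at overlapping TIME positions. The write may need to be restricted to the active platform's index, or merged as in merge_rows above.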
Option 3 — Protective guard in _write_ds (short-term mitigation)
Detect cross-file TIME collisions at write time using the filename / platform_code variables already in the store. If common_append_dim_values > 0 AND the platform_code at those positions in the store ≠ the incoming batch's platform_code → log a clear ERROR and skip (don't overwrite) instead of calling _handle_duplicate_regions.
This prevents silent data loss while a proper fix is designed.
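A minimal sketch of such a guard, assuming the store can be queried for the platform_code and filename at each TIME. The stored mapping, find_platform_conflicts, guarded_write and the file names are all hypothetical stand-ins for the real store access in _write_ds.

```python
import logging

logger = logging.getLogger(__name__)


def find_platform_conflicts(stored, batch_times, batch_platform):
    """Return TIME values already in the store under a different platform_code.

    `stored` maps TIME -> (platform_code, filename) for rows already written.
    """
    return [
        t for t in batch_times
        if t in stored and stored[t][0] != batch_platform
    ]


def guarded_write(stored, batch_times, batch_platform, batch_filename):
    """Write the batch unless it would overwrite another vessel's rows."""
    conflicts = find_platform_conflicts(stored, batch_times, batch_platform)
    if conflicts:
        existing_files = sorted({stored[t][1] for t in conflicts})
        logger.error(
            "Skipping %s (%s): %d TIME value(s) already held by %s",
            batch_filename, batch_platform, len(conflicts), existing_files,
        )
        return False
    for t in batch_times:
        # Overlap with the same platform is still overwritten, as today.
        stored[t] = (batch_platform, batch_filename)
    return True


stored = {}
first = guarded_write(stored, [1, 2], "VMQ9273", "solander.nc")
second = guarded_write(stored, [2, 3], "VLHJ", "surveyor.nc")
print(first, second)  # True False
```

Note that skipping means the second vessel's batch is not stored at all, so this only makes the loss visible and deterministic; it does not recover the data.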
Recommendation
- Immediately (separate PR): implement Option 3 as a protective guard to stop silent overwriting and log the conflicting files clearly
- Preferred fix: implement Option 2 (add platform fixed dim); stays as a single dataset, correct semantics
- Alternative: Option 1 (split configs) if Option 2 proves too complex
cc @lbesnard