
vessel_satellite_radiance: duplicate TIME collision when two vessels operate simultaneously #289

Description

@lbesnard

Problem

The vessel_satellite_radiance_delayed_qc and vessel_satellite_radiance_derived_product dataset configs each combine data from two simultaneously operating vessels into a single Zarr store, with TIME as the only append dimension:

paths:
  s3://imos-data/IMOS/SRS/OC/radiometer/VMQ9273_Solander      # RV Solander
  s3://imos-data/IMOS/SRS/OC/radiometer/VLHJ_Southern-Surveyor  # RV Southern Surveyor

When both vessels are at sea at the same time, their TIME values overlap. The current code path in _write_ds detects the overlap and calls _handle_duplicate_regions, which overwrites the existing data at those TIME positions with the new batch's data. As a result, whichever vessel is processed last replaces the other vessel's observations without raising an error: silent data loss.

The _find_duplicated_values method detects this post-write and logs a WARNING, but:

  1. The warning fires once per batch (many repeated lines with different UUIDs, same content)
  2. A misleading TODO comment claims this may be acceptable: # Not necessarily an issue. For example, some SOOP dataset, same TIME, 2 different NetCDF files, 2 different vessel and location.
  3. It doesn't name the specific conflicting source files

The config already tracks vessel identity via platform_code (global attribute → per-TIME variable, e.g. VMQ9273 / VLHJ), so the information exists to detect and handle this.
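As a rough illustration of the kind of report that would address points 1 and 3, the sketch below groups records by TIME and keeps only the TIME values that come from more than one platform, together with the source files involved. It uses plain Python structures rather than the real store; the function name, the record layout and the file names are all hypothetical.

```python
from collections import defaultdict


def find_cross_platform_duplicates(records):
    """Return {time: sorted [(platform_code, filename), ...]} for conflicting TIMEs.

    `records` is an iterable of (time, platform_code, filename) tuples.
    A TIME value is a conflict only if more than one platform_code reports it.
    """
    by_time = defaultdict(set)
    for time, platform_code, filename in records:
        by_time[time].add((platform_code, filename))
    return {
        time: sorted(sources)
        for time, sources in by_time.items()
        if len({platform for platform, _ in sources}) > 1
    }


# Made-up records: TIME is a plain integer, file names are hypothetical.
records = [
    (0, "VMQ9273", "solander_a.nc"),
    (1, "VMQ9273", "solander_a.nc"),
    (1, "VLHJ", "surveyor_b.nc"),
    (2, "VLHJ", "surveyor_b.nc"),
]

conflicts = find_cross_platform_duplicates(records)
for time, sources in conflicts.items():
    print(f"TIME={time} written by: {sources}")
```

Reporting one aggregated mapping per run, instead of one WARNING per batch, would keep the log short while still naming the conflicting files.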

Impact

Any TIME step where both vessels were active retains data only from whichever vessel's batch was processed last. On a re-run the processing order may differ, so the surviving vessel can flip and the output is non-deterministic.
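The order dependence can be reproduced with a toy model. This is only a sketch: a dict keyed by TIME stands in for the Zarr store, `write_batch` is a hypothetical stand-in for the overwrite behaviour described above (it is not the real _write_ds), and all values are made up.

```python
def write_batch(store, batch):
    """Toy TIME-keyed write: rows already present at the same TIME are replaced."""
    overwritten = sorted(t for t in batch if t in store)
    store.update(batch)
    return overwritten


def run(batches):
    """Write batches in order; return the final store and the overwritten TIMEs."""
    store, lost = {}, []
    for batch in batches:
        lost += write_batch(store, batch)
    return store, lost


# TIME values are plain integers; both vessels report at TIME=1.
solander = {
    0: {"platform_code": "VMQ9273", "value": 1.0},
    1: {"platform_code": "VMQ9273", "value": 1.1},
}
southern_surveyor = {
    1: {"platform_code": "VLHJ", "value": 9.1},
    2: {"platform_code": "VLHJ", "value": 9.2},
}

store_ab, lost_ab = run([solander, southern_surveyor])
store_ba, lost_ba = run([southern_surveyor, solander])

# The vessel that survives at TIME=1 depends only on processing order.
print(store_ab[1]["platform_code"], store_ba[1]["platform_code"])
```

In both runs one observation at TIME=1 is lost; only the identity of the lost vessel changes.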

Options

Option 1 — Split into two separate dataset configs

Create:

  • vessel_satellite_radiance_VMQ9273_Solander_delayed_qc
  • vessel_satellite_radiance_VLHJ_Southern-Surveyor_delayed_qc

Each config points to one vessel's S3 path, producing a separate Zarr store. No TIME collisions possible.

Pros: Cleanest, minimal code changes, fully correct.
Cons: Two configs to maintain; downstream consumers query both stores.

Option 2 — Add platform as a fixed second dimension (recommended)

Restructure the Zarr to have dims (TIME, platform) (or (platform, TIME)):

  • TIME remains the append dimension; _handle_duplicate_regions works unchanged
  • platform is a fixed 2-value dim (["VMQ9273", "VLHJ"]) — no append logic needed

Implementation in preprocessing:

  1. Determine the vessel's platform_code from the file's global attributes
  2. Expand the 1D (TIME,) variables to (platform, TIME) with NaN fill for the inactive platform
  3. Set the platform coordinate accordingly

The Zarr region-write logic (region={TIME: slice(start, end)}) still works because platform is a fixed dimension — _write_ds and _handle_duplicate_regions only iterate over the TIME dimension.
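The expansion in steps 1 to 3 can be sketched with plain Python lists. This is an illustration of the reshaping logic only, not the preprocessing code itself: `expand_to_platform_dim` is a hypothetical helper, nested lists stand in for the arrays, and the sample values are made up.

```python
import math

# Fixed platform dimension, using the two platform codes named in this issue.
PLATFORMS = ["VMQ9273", "VLHJ"]


def expand_to_platform_dim(values, platform_code, platforms=PLATFORMS):
    """Expand a 1D (TIME,) sequence to a (platform, TIME) nested list.

    The row for `platform_code` holds the observations; every other row is NaN.
    """
    if platform_code not in platforms:
        raise ValueError(f"unknown platform_code: {platform_code!r}")
    return [
        list(values) if platform == platform_code else [math.nan] * len(values)
        for platform in platforms
    ]


# A file from VLHJ with three TIME steps fills only the second row.
expanded = expand_to_platform_dim([0.5, 0.6, 0.7], "VLHJ")
for platform, row in zip(PLATFORMS, expanded):
    print(platform, row)
```

One thing to check when applying this to the real store: a region write that spans the whole platform axis would also write the NaN row for the inactive platform, so the write may need to be limited to the active platform's index or merged with the values already stored.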

Pros: Single store, all vessels co-located, no data loss, platform is a queryable coordinate.
Cons: Requires preprocessing changes to expand 1D files to 2D; schema update; migration of existing stores.

Option 3 — Protective guard in _write_ds (short-term mitigation)

Detect cross-file TIME collisions at write time using the filename / platform_code variables already in the store. If common_append_dim_values > 0 and the platform_code stored at those positions differs from the incoming batch's platform_code, log a clear ERROR and skip the write (do not overwrite) instead of calling _handle_duplicate_regions.

This prevents silent data loss while a proper fix is designed.
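A minimal sketch of such a guard is shown below, assuming the platform_code for each stored TIME value can be looked up before writing. The function name `should_overwrite`, the logger name and the dict-based lookup are hypothetical; the real check would live in _write_ds and read from the Zarr store.

```python
import logging

logger = logging.getLogger("write_guard")  # hypothetical logger name


def should_overwrite(store_platforms, incoming_times, incoming_platform):
    """Decide whether an incoming batch may overwrite overlapping TIME rows.

    store_platforms: {time: platform_code} for rows already in the store.
    Returns True when there is no overlap, or when every overlapping row
    belongs to the same platform (e.g. a reprocessed file). Returns False,
    after logging an ERROR, when another platform owns an overlapping row.
    """
    conflicts = {
        t: store_platforms[t]
        for t in incoming_times
        if t in store_platforms and store_platforms[t] != incoming_platform
    }
    if conflicts:
        logger.error(
            "Skipping write: %d TIME value(s) from %s collide with data from %s",
            len(conflicts),
            incoming_platform,
            sorted(set(conflicts.values())),
        )
        return False
    return True


store_platforms = {0: "VMQ9273", 1: "VMQ9273"}
same_vessel_rerun = should_overwrite(store_platforms, [0, 1], "VMQ9273")
other_vessel = should_overwrite(store_platforms, [1, 2], "VLHJ")
print(same_vessel_rerun, other_vessel)
```

Keeping same-platform overlaps writable preserves the existing reprocessing behaviour, while cross-platform overlaps are refused and logged.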

Recommendation

  1. Immediately (separate PR): implement Option 3 as a protective guard to stop silent overwriting and log the conflicting files clearly
  2. Preferred fix: implement Option 2 (add platform fixed dim) — stays as a single dataset, correct semantics
  3. Alternative: Option 1 (split configs) if Option 2 proves too complex

cc @lbesnard
