Fix Metview regrid memory leak #550

Draft
j9sh264 wants to merge 3 commits into main from regrid-memory-cleanup

j9sh264 (Collaborator) commented Jun 23, 2026

Problem

While running the Dataflow regrid pipeline, we observed memory accumulating steadily on the worker nodes until they failed with MemoryLimitExceeded errors. We isolated the leak to the fs.base_date() method in the metview-python package. When the pipeline extracts the year to build year-wise directories, fs.base_date() allocates C-level memory structures (via the underlying Metview/ecCodes bindings) that are neither released nor tracked by Python's garbage collector.
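One way to confirm a leak of this kind is to call the suspect function repeatedly and compare process memory before and after. The sketch below is only an illustration of that idea: rss_growth_mb, fake_leaky_call, and fake_measure_mb are hypothetical names that do not appear in this PR, and the "leak" is simulated with a counter so the example runs without Metview.

```python
import gc
import typing as t


def rss_growth_mb(
    fn: t.Callable[[], t.Any],
    measure_mb: t.Callable[[], float],
    iterations: int = 100,
) -> float:
    """Return how much the measured memory grew after calling fn repeatedly.

    Hypothetical helper: measure_mb is any callable returning memory in MB.
    """
    gc.collect()  # drop collectable Python garbage before the baseline
    before = measure_mb()
    for _ in range(iterations):
        fn()
    gc.collect()  # anything still held after this is not reclaimed by the GC
    return measure_mb() - before


# Simulated demonstration (no Metview involved): each call "leaks" 0.5 MB.
leaked = {"mb": 0.0}


def fake_leaky_call() -> None:
    leaked["mb"] += 0.5


def fake_measure_mb() -> float:
    return 100.0 + leaked["mb"]


growth = rss_growth_mb(fake_leaky_call, fake_measure_mb, iterations=10)
```

In a real check, fn would wrap the call under suspicion (for example a lambda calling fs.base_date()) and measure_mb would read the process's resident memory. Growth that scales with the iteration count even after gc.collect() points at memory held outside the garbage collector's reach.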

Solution

This PR bypasses the leaky base_date() implementation by reading primitive values directly from the GRIB keys.

Key Changes:

  • Added a get_safe_base_date() helper that replaces the native fs.base_date() call. It calls fs.grib_get(["dataDate", "dataTime"]) to fetch the raw date and time values directly from ecCodes.
  • grib_get returns standard Python primitives (strings and lists), so Python's garbage collector can track them and free them once they go out of scope.
  • The helper then parses these primitives into Python datetime objects using Metview's built-in utils.date_from_ecc_keys(d, t).
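The steps above can be sketched roughly as follows. This is not the PR's implementation: FakeFieldset is a hypothetical stand-in so the example runs without Metview, the shape assumed for the grib_get result (one [dataDate, dataTime] pair of strings per field) is an assumption, and parse_ecc_date is a standard-library substitute for utils.date_from_ecc_keys.

```python
import datetime
import typing as t


class FakeFieldset:
    """Hypothetical stand-in for a Metview Fieldset, for illustration only."""

    def __init__(self, pairs: t.Sequence[t.Tuple[str, str]]) -> None:
        self._pairs = list(pairs)

    def __len__(self) -> int:
        return len(self._pairs)

    def grib_get(self, keys: t.List[str]) -> t.List[t.List[str]]:
        # Assumed result shape: one [dataDate, dataTime] list per field.
        assert keys == ["dataDate", "dataTime"]
        return [list(pair) for pair in self._pairs]


def parse_ecc_date(d: t.Any, tm: t.Any) -> datetime.datetime:
    """Stdlib substitute for utils.date_from_ecc_keys.

    Assumes dataDate is YYYYMMDD and dataTime is HHMM, possibly without
    leading zeros (e.g. "600" for 06:00).
    """
    return datetime.datetime.strptime(f"{d}{int(tm):04d}", "%Y%m%d%H%M")


def get_safe_base_date_sketch(
    fs: t.Any,
    parse: t.Callable[[t.Any, t.Any], datetime.datetime] = parse_ecc_date,
) -> t.Union[datetime.datetime, t.List[datetime.datetime], None]:
    """Build base dates from plain GRIB key values only."""
    if len(fs) == 0:
        return None  # assumption: an empty fieldset yields None
    result = [parse(d, tm) for d, tm in fs.grib_get(["dataDate", "dataTime"])]
    # Same unwrapping rule as the return statement shown in the diff.
    return result[0] if len(result) == 1 else result


single = get_safe_base_date_sketch(FakeFieldset([("20200101", "1200")]))
several = get_safe_base_date_sketch(
    FakeFieldset([("20200101", "0"), ("20200102", "600")])
)
```

Keeping the parser as a parameter means the Metview helper could be passed in where it is available, while the rest of the function only ever touches Python strings, lists, and datetime objects.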

Fieldset = t.Any


def get_safe_base_date(fs: Fieldset) -> t.Union[datetime.datetime, t.List[datetime.datetime], None]:

j9sh264 (Collaborator, Author) commented:

The logic is taken from here

return result[0] if len(result) == 1 else result


def memory_usage_mb():

j9sh264 (Collaborator, Author) commented:

Added for debugging purposes for now. Will remove it.
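The body of memory_usage_mb() is not shown in this excerpt. A minimal sketch of such a debugging helper, assuming a Unix worker and using only the standard library, might look like the following. The name memory_usage_mb_sketch is hypothetical, and note that ru_maxrss reports peak rather than current resident memory.

```python
import resource
import sys


def memory_usage_mb_sketch() -> float:
    """Return this process's peak resident set size in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


current_peak_mb = memory_usage_mb_sketch()
```

Because the value is a high-water mark, it never decreases. That is still enough for spotting a leak, since a leaking worker shows the peak climbing from one processed file to the next.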

        return len(matches[0].metadata_list) > 0

    def apply(self, uri: str) -> None:
        print(f"Initial Memory: {memory_usage_mb():.2f} MB")

j9sh264 (Collaborator, Author) commented:

Added for debugging purposes for now. Will remove it.

        except Exception as e:
            logger.error(f'Regrid failed for {uri!r}. Error: {str(e)}')

        print(f"Final Memory: {memory_usage_mb():.2f} MB")

j9sh264 (Collaborator, Author) commented:

Added for debugging purposes for now. Will remove it.

j9sh264 self-assigned this on Jun 25, 2026
j9sh264 changed the title from "Fix memory leak by explicitly releasing Metview objects." to "Fix Metview regrid memory leak" on Jun 25, 2026