Add Checkpointing / Resume Functionality for Experiments #3

Description

@AWbosman

Background

Currently, if an experiment run is interrupted (e.g., by manual cancellation, a node going down, or another unexpected failure), it cannot be resumed: the whole experiment has to be restarted from scratch.
This wastes time and compute, especially for large-scale experiments.

Proposal

Implement a checkpointing / resume system that allows VERONA to continue experiments from where they left off.

Key ideas:

  • Track which experiment instances have already been completed (e.g., store this in a results CSV or a lightweight database).
  • On restart, check which instances are missing or incomplete.
  • Continue execution only for the incomplete instances.
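
To make the key ideas concrete, here is a minimal sketch of the CSV-based variant. It is not VERONA's actual API: `load_completed`, `run_with_resume`, the `instance_id` / `status` / `result` columns, and the `run_instance` callback are all hypothetical names chosen for illustration. It assumes every experiment instance has a stable string ID and that a row is only written once the instance has finished.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical column layout for the results file (an assumption, not VERONA's format).
FIELDNAMES = ["instance_id", "status", "result"]


def load_completed(results_path: Path) -> set[str]:
    """Return the IDs of all instances recorded as done in the results CSV."""
    if not results_path.exists():
        return set()
    with results_path.open(newline="") as f:
        return {
            row["instance_id"]
            for row in csv.DictReader(f)
            if row.get("status") == "done"
        }


def run_with_resume(
    instance_ids: Iterable[str],
    run_instance: Callable[[str], object],
    results_path: Path,
) -> list[str]:
    """Run only the instances that have no 'done' row yet; return the IDs that ran."""
    completed = load_completed(results_path)
    pending = [i for i in instance_ids if i not in completed]

    # Write the header only when starting a fresh (missing or empty) results file.
    write_header = not results_path.exists() or results_path.stat().st_size == 0
    with results_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        for instance_id in pending:
            result = run_instance(instance_id)
            # The row is appended only after the instance finished, so an
            # interrupted instance leaves no row and is picked up on the next run.
            writer.writerow(
                {"instance_id": instance_id, "status": "done", "result": result}
            )
            f.flush()  # push each row out of Python's buffer right away
    return pending
```

Calling `run_with_resume(ids, run_instance, Path("results.csv"))` a second time after an interruption would skip every ID that already has a `done` row. The sketch does not handle a row that is truncated in the middle of a write, or several processes appending to the same file; a lightweight database, as mentioned above, may be a better fit for those cases.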

Benefits

  • Saves time by avoiding re-running completed instances.
  • Makes VERONA more robust to interruptions and cluster instability.
  • Improves user experience for long-running experiments.

Metadata

Labels: enhancement (New feature or request)