
feat(lifecycle): auto-archive and auto-delete stopped sandboxes with configurable idle thresholds #1096

Draft
zhangjaycee wants to merge 12 commits into alibaba:master from zhangjaycee:feature/archive_autotransition


zhangjaycee commented Jun 11, 2026

Collaborator

close #1085

| Mechanism | Config | Default | Behavior |
| --- | --- | --- | --- |
| Auto archive | auto_archive_after_sec | 0 (disabled) | _auto_archive_stopped scans STOPPED sandboxes, auto-archives after idle threshold |
| Auto delete | auto_delete_after_sec | 0 (disabled) | _auto_delete_stopped scans STOPPED sandboxes, auto-deletes after idle threshold |

Execution order

_auto_transition runs auto-delete first, then auto-archive. Sandboxes deleted in the first pass are excluded from the archive scan to avoid double-processing.
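The ordering described above can be illustrated with a minimal sketch. It is not the code from this PR: the function name `auto_transition` and the two predicate parameters are hypothetical stand-ins for the real threshold checks, and the sketch only shows how IDs removed in the delete pass are kept out of the archive pass.

```python
from typing import Callable, Iterable, List, Tuple


def auto_transition(
    stopped_ids: Iterable[str],
    should_delete: Callable[[str], bool],
    should_archive: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Return (deleted, archived) sandbox IDs, running the delete pass first.

    Hypothetical helper: the predicates stand in for the idle-threshold checks.
    """
    stopped = list(stopped_ids)

    # Pass 1: auto-delete.
    deleted = [sid for sid in stopped if should_delete(sid)]
    deleted_set = set(deleted)

    # Pass 2: auto-archive, skipping anything already deleted in pass 1.
    archived = [
        sid for sid in stopped
        if sid not in deleted_set and should_archive(sid)
    ]
    return deleted, archived
```

With this shape, a sandbox that satisfies both thresholds is only deleted, never archived and then deleted.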

Idle time source

Both auto-archive and auto-delete measure idle time from stop_time in sandbox_info. Sandboxes without a stop_time are skipped.
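A rough sketch of the idle check follows. It assumes `stop_time` is stored as a Unix timestamp in seconds and that `sandbox_info` behaves like a dict; neither detail is confirmed by this description, and whether the real comparison is strict or inclusive is also an assumption. The helper name `is_past_idle_threshold` is hypothetical.

```python
import time
from typing import Optional


def is_past_idle_threshold(
    sandbox_info: dict,
    threshold_sec: int,
    now: Optional[float] = None,
) -> bool:
    """Hypothetical helper: True if the sandbox has been stopped long enough."""
    if threshold_sec <= 0:
        return False  # a threshold of 0 means the mechanism is disabled

    stop_time = sandbox_info.get("stop_time")
    if stop_time is None:
        return False  # sandboxes without a stop_time are skipped

    current = time.time() if now is None else now
    # Assumption: stop_time is a Unix timestamp in seconds.
    return current - float(stop_time) >= threshold_sec
```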


Configuration (added by this change)

lifecycle:
  auto_archive_after_sec: 3600       # stop-idle threshold before auto-archive
  auto_delete_after_sec: 86400       # stop-idle threshold before auto-delete

Introduce SandboxLifecycleConfig as the unified config entry point for
sandbox lifecycle parameters (timeouts, auto-archive, auto-delete).
ArchiveConfig (with OssConfig, AcrConfig) is nested under
lifecycle.archive in YAML. RockConfig.from_env() parses the lifecycle
section and __post_init__ coerces dicts to dataclasses.
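The dict-to-dataclass coercion mentioned in this commit can be sketched roughly as below. The field set is deliberately reduced and partly assumed: only the class names and the config keys named in this PR are taken from the source, and the nested OssConfig and AcrConfig are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchiveConfig:
    # Reduced, illustrative field set; the real class also nests OssConfig/AcrConfig.
    max_image_push_size: str = "16g"
    max_dir_upload_size: str = "16g"


@dataclass
class SandboxLifecycleConfig:
    auto_archive_after_sec: int = 0  # 0 disables auto-archive
    auto_delete_after_sec: int = 0   # 0 disables auto-delete
    archive: Optional[ArchiveConfig] = None

    def __post_init__(self) -> None:
        # YAML parsing yields plain dicts; coerce the nested section to a dataclass.
        if isinstance(self.archive, dict):
            self.archive = ArchiveConfig(**self.archive)
```

For example, `SandboxLifecycleConfig(**{"auto_archive_after_sec": 3600, "archive": {"max_image_push_size": "8g"}})` would end up with `archive` as an `ArchiveConfig` instance rather than a dict.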
Add abstract interfaces (AbstractDirStorage, AbstractImageStorage) and
concrete implementations: OssDirStorage, S3DirStorage for directory
archives, DockerRegistryV2ImageStorage for container image snapshots
with optional Bearer token authentication.
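To illustrate the interface/implementation split, here is a minimal sketch of a directory storage abstraction. Only the name AbstractDirStorage comes from this PR; the method names (`upload`, `download`, `delete`) and the `LocalDirStorage` class are assumptions made for illustration, with a local filesystem standing in for OSS or S3.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class AbstractDirStorage(ABC):
    """Assumed interface shape; the real method names may differ."""

    @abstractmethod
    def upload(self, local_dir: str, key: str) -> None: ...

    @abstractmethod
    def download(self, key: str, local_dir: str) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class LocalDirStorage(AbstractDirStorage):
    """Hypothetical backend that stores archives under a local root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def upload(self, local_dir: str, key: str) -> None:
        dest = self.root / key
        if dest.exists():
            shutil.rmtree(dest)  # overwrite any previous archive for this key
        shutil.copytree(local_dir, dest)

    def download(self, key: str, local_dir: str) -> None:
        shutil.copytree(self.root / key, local_dir, dirs_exist_ok=True)

    def delete(self, key: str) -> None:
        shutil.rmtree(self.root / key, ignore_errors=True)
```

Keeping callers dependent only on the abstract class is what would let an OSS or S3 backend be swapped in without touching the archive logic.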
…, operator, and reconciler

- Add ARCHIVING/ARCHIVED states and archive/restore transitions to the
  state machine with on_* callbacks for metadata persistence.
- Extend AbstractOperator with start_archive/start_restore; implement in
  RayOperator with low-resource actor override for archive tasks.
- Add SandboxActor.archive() and restore_and_start() methods for
  commit+push and pull+download+start workflows.
- Add archive_sandbox() and restart_from_archived() to SandboxManager
  with archive cleanup on delete.
- Add _reconcile (30s) with _reconcile_pending (restore timeout + alive
  advancement) and _reconcile_archiving (completion check + retry).
- Extract _try_advance_pending helper from get_status for reuse.
- Wire up /archive API endpoint, SDK client, and admin storage injection.
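The archive/restore transitions listed above might look roughly like the table-driven sketch below. The ARCHIVING, ARCHIVED, and STOPPED state names come from this PR; the event names, the RUNNING and PENDING states as used here, and the exact set of allowed transitions are assumptions for illustration only.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    ARCHIVING = "archiving"
    ARCHIVED = "archived"
    PENDING = "pending"


# Assumed transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.STOPPED, "archive"): State.ARCHIVING,
    (State.ARCHIVING, "archive_done"): State.ARCHIVED,
    (State.ARCHIVING, "archive_failed"): State.STOPPED,
    (State.ARCHIVED, "restore"): State.PENDING,
    (State.PENDING, "alive"): State.RUNNING,
}


def transition(state: State, event: str) -> State:
    """Return the next state, or raise if the event is not allowed here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None
```

In a layout like this, a reconciler would only need to emit events such as `archive_done` or `alive`; the table decides whether they are legal.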
…load

Add max_image_push_size and max_dir_upload_size to ArchiveConfig (default
16g each). Enforced in SandboxActor.archive() before push/upload.
Supports Nacos dynamic override via RockConfig.update().
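A size limit such as 16g has to be converted to bytes before it can be enforced. The sketch below shows one way to do that; it assumes binary units (1g = 1024**3 bytes) and an optional trailing "b", neither of which is stated in this PR. The helper names `parse_size` and `check_within_limit` are hypothetical.

```python
import re

# Assumption: binary multipliers; the PR only shows the "16g" form.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(text: str) -> int:
    """Convert strings like '16g', '512m', or '1024' to a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmgt]?)b?\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def check_within_limit(actual_bytes: int, limit: str) -> None:
    """Raise before push/upload if the payload exceeds the configured limit."""
    if actual_bytes > parse_size(limit):
        raise ValueError(f"size {actual_bytes} bytes exceeds limit {limit}")
```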
--table: generate DDL for specific tables instead of all.
--alter-from: compare current ORM against an old DDL file or git ref
(commit/tag) and output ALTER TABLE ADD COLUMN / CREATE INDEX statements.
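The idea behind --alter-from can be shown with a simplified sketch that diffs two column maps and emits ADD COLUMN statements. This is not the tool's implementation: the real option compares the ORM against a DDL file or git ref, while this hypothetical `alter_statements` helper takes plain dicts and ignores indexes, and the table and column names in the usage line are made up.

```python
from typing import Dict, List


def alter_statements(
    table: str,
    old_columns: Dict[str, str],
    new_columns: Dict[str, str],
) -> List[str]:
    """Emit ALTER TABLE ... ADD COLUMN for columns present only in new_columns."""
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {col_type};"
        for name, col_type in new_columns.items()
        if name not in old_columns
    ]
```

For instance, `alter_statements("sandbox", {"id": "VARCHAR(64)"}, {"id": "VARCHAR(64)", "stop_time": "BIGINT"})` yields a single statement adding the hypothetical `stop_time` column.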
Unit tests cover storage/snapshot interfaces, state machine transitions,
SandboxActor archive/restore, reconcile_archiving progress checks, and
delete-clears-archive behavior. Integration tests cover Registry V2
push/pull/delete, S3 storage round-trip, and full archive E2E with
MinIO + local registry fixtures.

Clarify intent: this long-interval scanner handles automatic state
transitions (expired → STOPPED), not generic background checks.

… time

Add _auto_archive_stopped to _auto_transition: STOPPED sandboxes whose
stop_time exceeds auto_archive_after_sec (default 3600s) are
automatically archived. Set auto_archive_after_sec=0 to disable.
Rename _check_job_background → _auto_transition to reflect expanded
scope.

…o-clear default

- Add auto_delete_after_sec / auto_clear_default_sec to SandboxLifecycleConfig
- Implement _auto_delete_stopped(): delete STOPPED sandboxes past threshold
  (runs before auto-archive; deleted IDs excluded from archive scan)
- Add _apply_auto_clear_default(): use lifecycle.auto_clear_default_sec when
  SDK does not explicitly set auto_clear_time_minutes
- Support Nacos override for lifecycle config section
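The fallback described for _apply_auto_clear_default can be sketched as below. It assumes an unset SDK value is represented as None and that the effective value is expressed in seconds; the PR does not confirm either detail, and the function name and signature here are hypothetical.

```python
from typing import Optional


def apply_auto_clear_default(
    auto_clear_time_minutes: Optional[int],
    auto_clear_default_sec: int,
) -> int:
    """Return the effective auto-clear time in seconds.

    Assumption: None means the SDK caller did not set a value explicitly.
    """
    if auto_clear_time_minutes is not None:
        # An explicit SDK value wins; convert minutes to seconds.
        return auto_clear_time_minutes * 60
    # Otherwise fall back to lifecycle.auto_clear_default_sec.
    return auto_clear_default_sec
```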

Development

Successfully merging this pull request may close these issues.

[Feature] Support Sandbox Archive/Restore Lifecycle
