feat(archive): integrate archive lifecycle into sandbox state machine, operator, and reconciler #1095

Open
zhangjaycee wants to merge 20 commits into alibaba:master from zhangjaycee:feature/archive_lifecycle

@zhangjaycee zhangjaycee commented Jun 11, 2026

Collaborator

close #1085

1. State Machine Overview

[image: state machine diagram]

6 states: pending / running / stopped / archiving / archived / deleted (final).


2. Timestamp Fields

| Field | Meaning | Set by | Cleared by |
| --- | --- | --- | --- |
| create_time | Sandbox creation time | SandboxManager._build_sandbox_info_metadata | Never |
| start_time | First time container is alive | on_alive (once, idempotent) | Never |
| stop_time | Container stop time | on_stop | Popped by on_restart / on_restore |
| state_enter_time | Entry into an intermediate state (ARCHIVING or restore-PENDING) | on_archive / on_restore | Popped by on_archive_done / on_archive_failed / on_restore_failed |
| archive_time | Archive completion time | on_archive_done | Popped by on_archive_failed |
| delete_time | Soft-delete time | on_delete | Never (final state) |

Key points

  • state_enter_time is the timeout anchor for intermediate states. _reconcile_archiving and _reconcile_pending compute elapsed time from this field. When it exceeds archive_timeout_sec / restore_timeout_sec, the reconciler fires archive_failed / restore_failed to roll back.
  • archive_time is the completion time, not the initiation time. It is written in on_archive_done (ARCHIVING → ARCHIVED), not in on_archive (STOPPED → ARCHIVING).
  • archive_time also serves as a restore-in-progress discriminator. _reconcile_pending uses info.get("archive_time") to distinguish a fresh-start PENDING sandbox from one that is restoring from ARCHIVED — the presence of archive_time indicates the latter, which requires restore timeout detection.
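The timeout anchor and the restore discriminator described in these points can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `is_timed_out` and `classify_pending`, the ISO 8601 timestamp format, and the plain-dict shape of `info` are assumptions. Only the field names come from the description.

```python
from datetime import datetime, timezone
from typing import Optional


def is_timed_out(info: dict, timeout_sec: int, now: Optional[datetime] = None) -> bool:
    """Return True when the sandbox has sat in an intermediate state too long.

    Assumes info["state_enter_time"] is an ISO 8601 string with a UTC offset
    (the real storage format is not shown in this PR).
    """
    entered = info.get("state_enter_time")
    if entered is None:
        # No anchor recorded, so there is nothing to measure against.
        return False
    now = now or datetime.now(timezone.utc)
    elapsed = (now - datetime.fromisoformat(entered)).total_seconds()
    return elapsed > timeout_sec


def classify_pending(info: dict) -> str:
    """Distinguish a fresh-start PENDING sandbox from a restoring one."""
    # A present archive_time means the sandbox was archived before,
    # so this PENDING state is a restore in progress.
    return "restoring" if info.get("archive_time") else "fresh_start"
```

Under these assumptions, a reconciler loop would call `is_timed_out` with `archive_timeout_sec` for ARCHIVING sandboxes, and with `restore_timeout_sec` only for PENDING sandboxes that `classify_pending` reports as restoring.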

3. Constraints and Design Choices

Single archive per sandbox

Each sandbox has at most one archive — multi-version snapshots are not supported. Archive artifacts use deterministic names based solely on sandbox_id:

| Storage | Format | Example |
| --- | --- | --- |
| OSS / S3 key | {prefix}{sandbox_id}.tar.gz | rock-archives/sbx-abc123.tar.gz |
| ACR image ref | {registry}/{namespace}/sandbox_archived:{sandbox_id} | rock-acr.cr.aliyuncs.com/sandbox_archive/sandbox_archived:sbx-abc123 |

This means any component can compute the storage location from sandbox_id + config alone, without reading extra fields from the database.
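As an illustration of the deterministic naming, the two formats could be produced like this. The class name `ArchiveKeys` and the `dir_archive_key(sandbox_id, prefix)` signature appear later in this thread; the method `image_ref` and its parameters are hypothetical and exist only for this sketch.

```python
class ArchiveKeys:
    """Derives archive locations from sandbox_id plus static config."""

    @staticmethod
    def dir_archive_key(sandbox_id: str, prefix: str) -> str:
        # OSS / S3 object key: {prefix}{sandbox_id}.tar.gz
        return f"{prefix}{sandbox_id}.tar.gz"

    @staticmethod
    def image_ref(sandbox_id: str, registry: str, namespace: str) -> str:
        # Hypothetical helper name; the format is the documented one:
        # {registry}/{namespace}/sandbox_archived:{sandbox_id}
        return f"{registry}/{namespace}/sandbox_archived:{sandbox_id}"
```

Because neither function reads any per-sandbox state, the admin service, the actor, and the cleanup path can all compute the same location independently.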

Re-archive: explicit delete before write

A sandbox may go through multiple archive/restore cycles:

stopped → archive → archived → restore → pending → running → stopped → archive again

On the second archive, the previous key/ref still exist in remote storage (restore downloads but does not delete the remote copy). We handle this by explicitly deleting the old artifacts before starting a new archive — we do not rely on OSS/ACR implicit overwrite behavior.

Rationale:

  • Different S3-compatible implementations vary in PUT-overwrite semantics (versioning, ACL inheritance)
  • ACR tag overwrite behavior differs across registry implementations (some require manifest deletion first)
  • Explicit delete-then-write is predictable and easy to debug

The deletion is triggered in on_archive when sandbox_info already contains archive_time (indicating a prior successful archive). Both this and the on_delete cleanup are best-effort — failures are logged as warnings but do not block the main flow.

| When | Trigger | What gets cleaned |
| --- | --- | --- |
| on_archive (re-archive) | archive_time already present in sandbox_info | Old key + old ref |
| on_delete | archive_time present and storage clients available | key + ref |
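A best-effort delete-before-write step might look roughly like the sketch below. The function name `cleanup_previous_archive`, its parameters, and the `dir_storage.delete` call are assumptions made for illustration; only `image_storage.delete` and the `archive_time` check are named in this description.

```python
import logging

logger = logging.getLogger(__name__)


def cleanup_previous_archive(sandbox_info: dict, dir_storage, image_storage,
                             key: str, ref: str) -> bool:
    """Best-effort removal of old artifacts before a new archive starts.

    Returns True if cleanup was attempted, False if there was no prior archive.
    """
    if not sandbox_info.get("archive_time"):
        return False
    for name, action in (("dir", lambda: dir_storage.delete(key)),
                         ("image", lambda: image_storage.delete(ref))):
        try:
            action()
        except Exception as exc:  # best-effort: never block the main flow
            logger.warning("failed to delete old %s archive: %s", name, exc)
    return True
```

Each deletion is wrapped separately so that a failure on the directory archive does not prevent the image ref from being cleaned up.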

Size limits

Two configurable limits prevent archiving excessively large sandboxes:

| Config | Default | Check point |
| --- | --- | --- |
| max_image_push_size | "16g" | After docker commit, before push; checked via docker image inspect |
| max_dir_upload_size | "16g" | Before upload_dir; checked via du -sb |

When a limit is exceeded, the actor raises RuntimeError and is killed. The _reconcile_archiving scanner detects the ARCHIVING timeout and fires archive_failed, rolling the sandbox back to STOPPED.

Note: the dir size check happens after the image has already been pushed. If the dir exceeds the limit, the image is rolled back (image_storage.delete) before the error propagates.
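The directory check and image rollback can be sketched as below. The helper names `parse_size` and `enforce_dir_limit` are hypothetical, and treating "16g" as binary units (1g = 1024³ bytes) is an assumption, since the description does not say how the suffix is interpreted.

```python
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size such as "16g" to bytes (binary units assumed)."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)


def enforce_dir_limit(dir_bytes: int, limit: str, image_storage, ref: str) -> None:
    """Run after the image push; roll the image back if the dir is too large."""
    if dir_bytes > parse_size(limit):
        # Undo the already-pushed image before the error propagates.
        image_storage.delete(ref)
        raise RuntimeError(f"dir size {dir_bytes} exceeds limit {limit}")
```

In this sketch `dir_bytes` would come from the `du -sb` measurement, and the raised `RuntimeError` is what ends the actor so the reconciler can later roll the sandbox back to STOPPED.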


4. Configuration (added by this change)

lifecycle:
  archive_timeout_sec: 1800          # ARCHIVING state timeout → rollback to STOPPED
  restore_timeout_sec: 1800          # restore-PENDING timeout → rollback to ARCHIVED
  archive:
    max_image_push_size: "16g"       # image size limit before push
    max_dir_upload_size: "16g"       # log dir size limit before upload

@zhangjaycee force-pushed the feature/archive_lifecycle branch from d2c9e3b to d71185a on June 11, 2026 12:04
Comment thread rock/admin/core/schema.py Outdated
memory = Column(String(64), nullable=True)
create_user_gray_flag = Column(Boolean, nullable=True)
archive_time = Column(String(64), nullable=True)
state_enter_time = Column(String(64), nullable=True)

Collaborator

What does this mean?

Collaborator Author

state_enter_time is set when a restart operation enters the pending state or an archive operation enters archiving, so that the background reconcile task can tell whether the restart/archive has timed out. For clarity, I renamed state_enter_time to intermediate_state_started_at.

Collaborator

This field is rather odd. Could we use the conditions we discussed earlier instead?

Collaborator Author

Sure. Once the conditions work is finished, I'll rebase this onto it.

Comment thread rock/admin/core/schema.py Outdated
cpus = Column(Float, nullable=True)
memory = Column(String(64), nullable=True)
create_user_gray_flag = Column(Boolean, nullable=True)
archive_time = Column(String(64), nullable=True)

Collaborator

Keep the time-related fields together.

Collaborator Author

Reordered.

Comment thread rock/admin/entrypoints/sandbox_api.py Outdated
return RockResponse(result=f"{sandbox_id} deleted")


@sandbox_router.post("/archive")

Collaborator

Make this endpoint RESTful.

Collaborator Author

Changed it to a RESTful form.

Comment thread rock/admin/main.py Outdated
archive_cfg = rock_config.lifecycle.archive
if archive_cfg.enabled:
    acr = archive_cfg.acr
    ds = archive_cfg.dir_storage

Collaborator

Don't use non-standard abbreviations.

Collaborator Author

Replaced the abbreviations.

Comment thread rock/admin/main.py Outdated
)
set_sandbox_manager(sandbox_manager)

archive_cfg = rock_config.lifecycle.archive

Collaborator

Extract this block into a function.

Collaborator Author

Moved the initialization into sandbox_manager.init().

Comment thread rock/admin/main.py Outdated
ds_ready = ds.endpoint and ds.access_key_id and ds.access_key_secret
if not (acr_ready and ds_ready):
    raise RuntimeError("archive.enabled=true but ACR or dir_storage credentials are missing")
if ds.type == "s3":

Collaborator

Use a factory here.

Collaborator Author

Put the factory in admin/sandbox/archive/factory.py.

Comment thread rock/sandbox/archive/_constants.py Outdated
@@ -0,0 +1,6 @@
def dir_archive_key(sandbox_id: str, prefix: str) -> str:

Collaborator

Wrap this in a class. Stop writing free-floating functions like this.

Collaborator Author

Wrapped it in class ArchiveKeys.

@zhangjaycee force-pushed the feature/archive_lifecycle branch from d71185a to 3a100c1 on June 17, 2026 07:48
@@ -61,7 +62,6 @@ def _setup_metrics_scheduler(self):
logger.info("APScheduler started for metrics collection")

def _setup_job_check_scheduler(self):

Collaborator

Remember to refactor this part later. How can a parent class call a subclass's function?

Collaborator Author

Added abstract method declarations for _reconcile and _setup_job_check_scheduler to the parent class; subclasses must implement them.
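For readers unfamiliar with the pattern, the fix amounts to something like the sketch below. `BaseManager` is named in a later commit message, but the `start` method and the `DemoManager` subclass are hypothetical and exist only to make the example runnable.

```python
from abc import ABC, abstractmethod


class BaseManager(ABC):
    """Parent class declares the hooks it calls; subclasses must supply them."""

    def start(self) -> None:
        # The parent may call this safely because the contract is explicit.
        self._setup_job_check_scheduler()

    @abstractmethod
    def _setup_job_check_scheduler(self) -> None: ...

    @abstractmethod
    def _reconcile(self) -> None: ...


class DemoManager(BaseManager):
    def __init__(self) -> None:
        self.scheduled = False

    def _setup_job_check_scheduler(self) -> None:
        self.scheduled = True

    def _reconcile(self) -> None:
        pass
```

With the declarations in place, instantiating `BaseManager` directly raises `TypeError`, which is why test fixtures need a concrete subclass.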

Comment thread rock/config.py Outdated
@dataclass
class SandboxLifecycleConfig:
    reconcile_interval_sec: int = 30
    archive_timeout_sec: int = 1800

Collaborator

Keep the naming style consistent.

Collaborator Author

Changed the suffix to seconds.

@zhangjaycee marked this pull request as ready for review on June 17, 2026 12:40
@zhangjaycee force-pushed the feature/archive_lifecycle branch from 46dbec1 to 9142d34 on June 17, 2026 12:42
Comment thread rock/sandbox/archive/factory.py Outdated
from rock.sandbox.archive.s3_storage import S3DirStorage


def make_dir_storage_from_config(dir_storage_cfg: ArchiveDirStorageConfig):

Collaborator

What kind of factory is this? I just said no free-floating functions, and here is another one.

Collaborator Author

Moved the two factory functions into the AbstractDirStorage and AbstractImageStorage classes, respectively.
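One way to read this change is a classmethod factory on the abstract base, sketched below. The class names and the `type` / `endpoint` config fields appear in this thread; the method name `from_config`, the constructor signature, and the `upload_dir` parameters are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ArchiveDirStorageConfig:
    type: str
    endpoint: str = ""


class AbstractDirStorage(ABC):
    @abstractmethod
    def upload_dir(self, local_path: str, key: str) -> None: ...

    @classmethod
    def from_config(cls, cfg: ArchiveDirStorageConfig) -> "AbstractDirStorage":
        # Hypothetical method name; dispatches on cfg.type.
        if cfg.type == "s3":
            return S3DirStorage(cfg.endpoint)
        raise ValueError(f"unsupported dir storage type: {cfg.type}")


class S3DirStorage(AbstractDirStorage):
    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint

    def upload_dir(self, local_path: str, key: str) -> None:
        pass  # real upload omitted in this sketch
```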

Add time-series state transition records to sandbox status for audit
and debugging. Each transition captures from_state, to_state, event,
and timestamp via a before_transition callback in SandboxStateMachine.

- Add state_history field to SandboxInfo (persisted in Redis + DB)
- Add before_transition callback to record transitions (skip self-loops)
- Add GET /sandboxes/{sandbox_id}/state_history API endpoint
- Add StateTransitionRecord and StateHistoryResponse models
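A rough sketch of the recording step follows. `StateTransitionRecord`, `SandboxInfo`, `state_history`, and `before_transition` are named in this commit message; the field types, the callback signature, and the `apply_transition` driver are assumptions added to make the example self-contained.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class StateTransitionRecord:
    from_state: str
    to_state: str
    event: str
    timestamp: str


@dataclass
class SandboxInfo:
    state: str = "pending"
    state_history: List[StateTransitionRecord] = field(default_factory=list)


def before_transition(info: SandboxInfo, event: str, to_state: str) -> None:
    """Append a record for every real state change; self-loops are skipped."""
    if info.state == to_state:
        return
    info.state_history.append(
        StateTransitionRecord(info.state, to_state, event,
                              datetime.now(timezone.utc).isoformat())
    )


def apply_transition(info: SandboxInfo, event: str, to_state: str) -> None:
    # Hypothetical driver: run the callback first, then change state.
    before_transition(info, event, to_state)
    info.state = to_state
```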
…er state

Restart reuses existing container without image pull, but the phases
data from the previous run lingered in meta_store causing image_pull
to show an incorrect status. Mark image_pull as SUCCESS immediately
in DockerDeployment.restart() and strip stale phases from the
meta_store entry in on_restart().
…, operator, and reconciler

- Add ARCHIVING/ARCHIVED states and archive/restore transitions to the
  state machine with on_* callbacks for metadata persistence.
- Extend AbstractOperator with start_archive/start_restore; implement in
  RayOperator with low-resource actor override for archive tasks.
- Add SandboxActor.archive() and restore_and_start() methods for
  commit+push and pull+download+start workflows.
- Add archive_sandbox() and restart_from_archived() to SandboxManager
  with archive cleanup on delete.
- Add _reconcile (30s) with _reconcile_pending (restore timeout + alive
  advancement) and _reconcile_archiving (completion check + retry).
- Extract _try_advance_pending helper from get_status for reuse.
- Wire up /archive API endpoint, SDK client, and admin storage injection.
…load

Add max_image_push_size and max_dir_upload_size to ArchiveConfig (default
16g each). Enforced in SandboxActor.archive() before push/upload.
Supports Nacos dynamic override via RockConfig.update().

--table: generate DDL for specific tables instead of all.
--alter-from: compare current ORM against an old DDL file or git ref
(commit/tag) and output ALTER TABLE ADD COLUMN / CREATE INDEX statements.

Unit tests cover storage/snapshot interfaces, state machine transitions,
SandboxActor archive/restore, reconcile_archiving progress checks, and
delete-clears-archive behavior. Integration tests cover Registry V2
push/pull/delete, S3 storage round-trip, and full archive E2E with
MinIO + local registry fixtures.
…ivate attrs

- Import from constants.ArchiveKeys instead of deleted _constants standalone fns
- Set _dir_storage/_image_storage on MagicMock(spec=...) so the storage guard
  in restart_from_archived is bypassed and the state check is reached

Align SandboxLifecycleConfig field names with the project convention
(ray_reconnect_interval_seconds, interval_seconds, etc.):
  reconcile_interval_sec → reconcile_interval_seconds
  archive_timeout_sec    → archive_timeout_seconds
  restore_timeout_sec    → restore_timeout_seconds
….__init__

- Add rock/sandbox/archive/factory.py with make_dir_storage_from_config()
  and make_image_storage_from_config(); _init_archive_storage delegates all
  construction to factory, keeping only credential validation logic itself
- SandboxManager._init_archive_storage() called from __init__ — storage is
  fully configured at construction time, no more two-phase init from main.py
- Remove archive storage helpers and unused imports from main.py
- Fix test_collect_sandbox_meta fixture: use concrete BaseManager subclass
  since BaseManager is now abstract

RESTful-style archive endpoint following the same pattern as
POST /sandboxes/{sandbox_id}/restart (0632c8b).
…istory lookup

Use the existing state_history list (recorded by before_transition callback)
to derive when a sandbox entered an intermediate state, eliminating the need
for a dedicated DB column. Add get_last_entered_at() helper for reconcile
timeout detection.
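The lookup could be as simple as the sketch below. The helper name `get_last_entered_at` comes from this commit message; its signature and the dict-shaped history records are assumptions.

```python
from typing import List, Optional


def get_last_entered_at(state_history: List[dict], state: str) -> Optional[str]:
    """Return the timestamp of the most recent transition into `state`."""
    # Walk backwards so the latest entry into the state wins.
    for record in reversed(state_history):
        if record["to_state"] == state:
            return record["timestamp"]
    return None
```

Scanning from the end matters because a sandbox can enter the same state several times across archive/restore cycles, and only the latest entry is a valid timeout anchor.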
@zhangjaycee force-pushed the feature/archive_lifecycle branch from fcab080 to db036dc on June 25, 2026 11:57
Development

Successfully merging this pull request may close these issues.

[Feature] Support Sandbox Archive/Restore Lifecycle

2 participants