feat(archive): integrate archive lifecycle into sandbox state machine, operator, and reconciler#1095
zhangjaycee wants to merge 20 commits into
d2c9e3b to
d71185a
| memory = Column(String(64), nullable=True) | ||
| create_user_gray_flag = Column(Boolean, nullable=True) | ||
| archive_time = Column(String(64), nullable=True) | ||
| state_enter_time = Column(String(64), nullable=True) |
state_enter_time is set when a restart operation enters the pending state or an archive operation enters archiving, so that the background reconcile task can tell whether the restart/archive has timed out. For clarity, state_enter_time has been renamed to intermediate_state_started_at.
This field is rather odd. Could we use the conditions mechanism we discussed earlier instead?
Sure. Once the conditions work is finished, I will rebase this change onto it.
| cpus = Column(Float, nullable=True) | ||
| memory = Column(String(64), nullable=True) | ||
| create_user_gray_flag = Column(Boolean, nullable=True) | ||
| archive_time = Column(String(64), nullable=True) |
| return RockResponse(result=f"{sandbox_id} deleted") | ||
| @sandbox_router.post("/archive") |
Changed to a RESTful style.
| archive_cfg = rock_config.lifecycle.archive | ||
| if archive_cfg.enabled: | ||
| acr = archive_cfg.acr | ||
| ds = archive_cfg.dir_storage |
| ) | ||
| set_sandbox_manager(sandbox_manager) | ||
| archive_cfg = rock_config.lifecycle.archive |
Initialization has been moved into sandbox_manager.init().
| ds_ready = ds.endpoint and ds.access_key_id and ds.access_key_secret | ||
| if not (acr_ready and ds_ready): | ||
| raise RuntimeError("archive.enabled=true but ACR or dir_storage credentials are missing") | ||
| if ds.type == "s3": |
The factory has been moved to admin/sandbox/archive/factory.py.
| @@ -0,0 +1,6 @@ | |||
| def dir_archive_key(sandbox_id: str, prefix: str) -> str: | |||
Wrapped into a class, ArchiveKeys.
d71185a to
3a100c1
| @@ -61,7 +62,6 @@ def _setup_metrics_scheduler(self): | |||
| logger.info("APScheduler started for metrics collection") | |||
| def _setup_job_check_scheduler(self): | |||
Remember to refactor this part later. How can a parent class call a subclass's function?
The parent class now declares _reconcile and _setup_job_check_scheduler as abstract methods, so subclasses must implement them.
| @dataclass | ||
| class SandboxLifecycleConfig: | ||
| reconcile_interval_sec: int = 30 | ||
| archive_timeout_sec: int = 1800 |
Changed the suffix to seconds.
46dbec1 to
9142d34
| from rock.sandbox.archive.s3_storage import S3DirStorage | ||
| def make_dir_storage_from_config(dir_storage_cfg: ArchiveDirStorageConfig): |
What kind of factory is this? I just said not to write free-standing functions, and here is another one.
The two factory functions have been moved into the AbstractDirStorage and AbstractImageStorage classes respectively.
Add time-series state transition records to sandbox status for audit
and debugging. Each transition captures from_state, to_state, event,
and timestamp via a before_transition callback in SandboxStateMachine.
- Add state_history field to SandboxInfo (persisted in Redis + DB)
- Add before_transition callback to record transitions (skip self-loops)
- Add GET /sandboxes/{sandbox_id}/state_history API endpoint
- Add StateTransitionRecord and StateHistoryResponse models
…er state
Restart reuses existing container without image pull, but the phases data from the previous run lingered in meta_store causing image_pull to show an incorrect status. Mark image_pull as SUCCESS immediately in DockerDeployment.restart() and strip stale phases from the meta_store entry in on_restart().
…, operator, and reconciler
- Add ARCHIVING/ARCHIVED states and archive/restore transitions to the state machine with on_* callbacks for metadata persistence.
- Extend AbstractOperator with start_archive/start_restore; implement in RayOperator with low-resource actor override for archive tasks.
- Add SandboxActor.archive() and restore_and_start() methods for commit+push and pull+download+start workflows.
- Add archive_sandbox() and restart_from_archived() to SandboxManager with archive cleanup on delete.
- Add _reconcile (30s) with _reconcile_pending (restore timeout + alive advancement) and _reconcile_archiving (completion check + retry).
- Extract _try_advance_pending helper from get_status for reuse.
- Wire up /archive API endpoint, SDK client, and admin storage injection.
…load
Add max_image_push_size and max_dir_upload_size to ArchiveConfig (default 16g each). Enforced in SandboxActor.archive() before push/upload. Supports Nacos dynamic override via RockConfig.update().
- --table: generate DDL for specific tables instead of all.
- --alter-from: compare current ORM against an old DDL file or git ref (commit/tag) and output ALTER TABLE ADD COLUMN / CREATE INDEX statements.
Unit tests cover storage/snapshot interfaces, state machine transitions, SandboxActor archive/restore, reconcile_archiving progress checks, and delete-clears-archive behavior. Integration tests cover Registry V2 push/pull/delete, S3 storage round-trip, and full archive E2E with MinIO + local registry fixtures.
…ackground and _reconcile
…name _constants to constants
…ivate attrs
- Import from constants.ArchiveKeys instead of deleted _constants standalone fns
- Set _dir_storage/_image_storage on MagicMock(spec=...) so the storage guard in restart_from_archived is bypassed and the state check is reached
Align SandboxLifecycleConfig field names with the project convention (ray_reconnect_interval_seconds, interval_seconds, etc.):
- reconcile_interval_sec → reconcile_interval_seconds
- archive_timeout_sec → archive_timeout_seconds
- restore_timeout_sec → restore_timeout_seconds
….__init__
- Add rock/sandbox/archive/factory.py with make_dir_storage_from_config() and make_image_storage_from_config(); _init_archive_storage delegates all construction to factory, keeping only credential validation logic itself
- SandboxManager._init_archive_storage() called from __init__: storage is fully configured at construction time, no more two-phase init from main.py
- Remove archive storage helpers and unused imports from main.py
- Fix test_collect_sandbox_meta fixture: use concrete BaseManager subclass since BaseManager is now abstract
RESTful-style archive endpoint following the same pattern as
POST /sandboxes/{sandbox_id}/restart (0632c8b).
…istory lookup
Use the existing state_history list (recorded by before_transition callback) to derive when a sandbox entered an intermediate state, eliminating the need for a dedicated DB column. Add get_last_entered_at() helper for reconcile timeout detection.
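A minimal sketch of how a lookup like get_last_entered_at() could derive the timeout anchor from state_history. The record keys (from_state, to_state, event, timestamp) follow the fields named in the state-history commit above; the ISO 8601 timestamp format, the function signatures, and the is_timed_out helper are assumptions made for illustration, not the PR's actual code.

```python
from datetime import datetime, timezone
from typing import Optional


def get_last_entered_at(state_history: list, state: str) -> Optional[datetime]:
    """Return when the sandbox most recently entered `state`, or None."""
    # Walk the history backwards so the latest matching transition wins.
    for record in reversed(state_history):
        if record["to_state"] == state:
            # Assumption: timestamps are stored as ISO 8601 strings with an offset.
            return datetime.fromisoformat(record["timestamp"])
    return None


def is_timed_out(state_history: list, state: str, timeout_seconds: int,
                 now: Optional[datetime] = None) -> bool:
    """Hypothetical helper: has the sandbox stayed in `state` past the timeout?"""
    entered_at = get_last_entered_at(state_history, state)
    if entered_at is None:
        return False  # the sandbox never entered this state
    now = now or datetime.now(timezone.utc)
    return (now - entered_at).total_seconds() > timeout_seconds
```

A reconciler could call is_timed_out(history, "archiving", archive_timeout_seconds) and fire archive_failed when it returns True. Because the anchor is read from the history list, no dedicated column has to be written or cleared on each transition.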
fcab080 to
db036dc
close #1085
1. State Machine Overview
6 states: pending/running/stopped/archiving/archived/deleted (final).
2. Timestamp Fields
| Field | Set in | Cleared in |
|---|---|---|
| create_time | SandboxManager._build_sandbox_info_metadata | |
| start_time | on_alive (once, idempotent) | |
| stop_time | on_stop | on_restart / on_restore |
| state_enter_time | on_archive / on_restore | on_archive_done / on_archive_failed / on_restore_failed |
| archive_time | on_archive_done | on_archive_failed |
| delete_time | on_delete | |
Key points
- state_enter_time is the timeout anchor for intermediate states. _reconcile_archiving and _reconcile_pending compute elapsed time from this field. When it exceeds archive_timeout_sec / restore_timeout_sec, the reconciler fires archive_failed / restore_failed to roll back.
- archive_time is the completion time, not the initiation time. It is written in on_archive_done (ARCHIVING → ARCHIVED), not in on_archive (STOPPED → ARCHIVING).
- archive_time also serves as a restore-in-progress discriminator. _reconcile_pending uses info.get("archive_time") to distinguish a fresh-start PENDING sandbox from one that is restoring from ARCHIVED: the presence of archive_time indicates the latter, which requires restore timeout detection.
3. Constraints and Design Choices
Single archive per sandbox
Each sandbox has at most one archive; multi-version snapshots are not supported. Archive artifacts use deterministic names based solely on sandbox_id:
| Naming pattern | Example |
|---|---|
| {prefix}{sandbox_id}.tar.gz | rock-archives/sbx-abc123.tar.gz |
| {registry}/{namespace}/sandbox_archived:{sandbox_id} | rock-acr.cr.aliyuncs.com/sandbox_archive/sandbox_archived:sbx-abc123 |
This means any component can compute the storage location from sandbox_id + config alone, without reading extra fields from the database.
Re-archive: explicit delete before write
A sandbox may go through multiple archive/restore cycles:
On the second archive, the previous key/ref still exist in remote storage (restore downloads but does not delete the remote copy). We handle this by explicitly deleting the old artifacts before starting a new archive — we do not rely on OSS/ACR implicit overwrite behavior.
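The naming scheme and the delete-before-write step can be sketched as follows. dir_archive_key mirrors the signature shown in the diff; image_archive_ref, InMemoryStore, and delete_previous_archive are hypothetical names introduced only for this illustration, and the in-memory store stands in for the real OSS/ACR clients.

```python
import logging

logger = logging.getLogger(__name__)


def dir_archive_key(sandbox_id: str, prefix: str) -> str:
    """Object key of the directory archive: {prefix}{sandbox_id}.tar.gz."""
    return f"{prefix}{sandbox_id}.tar.gz"


def image_archive_ref(sandbox_id: str, registry: str, namespace: str) -> str:
    """Hypothetical helper: {registry}/{namespace}/sandbox_archived:{sandbox_id}."""
    return f"{registry}/{namespace}/sandbox_archived:{sandbox_id}"


class InMemoryStore:
    """Hypothetical stand-in for a dir or image storage client."""

    def __init__(self):
        self.objects = set()

    def delete(self, name: str) -> None:
        self.objects.discard(name)


def delete_previous_archive(sandbox_info: dict, dir_store, image_store,
                            prefix: str, registry: str, namespace: str) -> None:
    """Delete old artifacts before a re-archive instead of relying on overwrite."""
    if not sandbox_info.get("archive_time"):
        return  # no prior successful archive, nothing to delete
    sandbox_id = sandbox_info["sandbox_id"]
    targets = [
        (dir_store, dir_archive_key(sandbox_id, prefix)),
        (image_store, image_archive_ref(sandbox_id, registry, namespace)),
    ]
    for store, name in targets:
        try:
            store.delete(name)
        except Exception as exc:
            # Best-effort: log a warning and continue with the main flow.
            logger.warning("failed to delete old archive %s: %s", name, exc)
```

Because both names are pure functions of sandbox_id and config, the cleanup needs nothing from the database beyond the archive_time marker.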
Rationale:
The deletion is triggered in on_archive when sandbox_info already contains archive_time (indicating a prior successful archive). Both this and the on_delete cleanup are best-effort: failures are logged as warnings but do not block the main flow.
| Trigger | Condition |
|---|---|
| on_archive (re-archive) | archive_time already present in sandbox_info |
| on_delete | archive_time present and storage clients available |
Size limits
Two configurable limits prevent archiving excessively large sandboxes:
| Config | Default | Check point |
|---|---|---|
| max_image_push_size | "16g" | After docker commit, before push; checked via docker image inspect |
| max_dir_upload_size | "16g" | Before upload_dir; checked via du -sb |
When a limit is exceeded, the actor raises RuntimeError and is killed. The _reconcile_archiving scanner detects the ARCHIVING timeout and fires archive_failed, rolling the sandbox back to STOPPED.
Note: the dir size check happens after the image has already been pushed. If the dir exceeds the limit, the image is rolled back (image_storage.delete) before the error propagates.
4. Configuration (added by this change)