diff --git a/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/version-1.8.x/User Guides/sandbox_create_benchmark.md b/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/version-1.8.x/User Guides/sandbox_create_benchmark.md
new file mode 100644
index 0000000000..a5bbb0f361
--- /dev/null
+++ b/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/version-1.8.x/User Guides/sandbox_create_benchmark.md	
@@ -0,0 +1,43 @@
+# ROCK Sandbox 大规模并发创建压测报告
+
+## 1. 测试背景
+
+在 Agentic 强化学习训练以及大规模 Agent Rollout 场景中，**单个训练 step 往往需要同时拉起成千上万个 Sandbox**，因此 Sandbox 的并发创建吞吐与延迟直接决定了整体训练效率。
+
+本报告分别在 1000 / 2000 / 4000 / 8000 / 16000 并发规模下完成 Sandbox 批量创建压测，验证 ROCK 在大规模并发下的稳定性与延迟表现。
+
+## 2. 测试范围与口径
+
+- **被测对象**：ROCK Sandbox 的创建链路。
+- **统计指标**：单个 Sandbox 从发起创建请求到 Sandbox 处于存活（可用）状态的端到端耗时（秒）。
+- **并发规模**：1000、2000、4000、8000、16000 个 Sandbox 同时发起创建。
+- **并发模型**：采用 **多机分布式 + 纯多进程** 的方式驱动并发——由多台机器同时发压，每台机器内部以独立 OS 进程承载每个 Sandbox 创建任务，避免单进程内 GIL、事件循环、TLS 握手等成为客户端侧瓶颈，使数据真实反映服务端的处理能力。
+
+## 3. 测试结果
+
+### 3.1 Sandbox 创建耗时统计
+
+| 并发规模 | 样本数 | 成功率 | 最小值 | 最大值 | 平均值 | P50 | P95 | P99 |
+|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| **1000** | 1000 | 100% | 0.47s | 9.84s | 4.72s | 4.93s | 6.89s | 8.69s |
+| **2000** | 2000 | 100% | 0.64s | 11.20s | 4.90s | 4.92s | 9.76s | 10.53s |
+| **4000** | 4000 | 100% | 0.93s | 15.51s | 7.16s | 7.05s | 11.26s | 13.34s |
+| **8000** | 8000 | 100% | 3.85s | 33.99s | 18.06s | 17.86s | 30.09s | 32.03s |
+| **16000** | 16000 | 100% | 2.66s | 63.84s | 37.17s | 39.98s | 56.68s | 60.11s |
+
+> 时间单位均为秒（s）。所有规模下成功率均为 **100%**（0 失败）。
+>
+> 说明：随着并发规模的增大，为保护服务端稳定性，ROCK 会在控制面侧进行限流，因此耗时随并发上升而有所增加。
+
+![Sandbox 创建耗时分布](../../../../../static/img/sandbox_create_latency_distribution.png)
+
+## 4. 结论
+
+ROCK 在 1000 至 16000 并发规模下完成的 Sandbox 批量创建压测取得了以下结果：
+
+1. **100% 成功率**，验证了 ROCK 在大规模并发下的可靠性。
+2. **小规模并发（≤2000）几乎无排队**，P50 稳定在 5s 以内，可以满足绝大多数实时性敏感的训练 / 评估场景。
+3. **大规模并发（≥4000）延迟随并发数近似线性增长**，行为可预测，便于按 Sandbox 池规模做容量规划。
+4. **16000 并发场景下 P99 仍可控制在 60s 量级**，足以支撑超大规模 Agent Rollout 与并行 RL Step 等任务。
+
+ROCK 已经具备承载 **万级并发 Sandbox 创建** 的能力，为大规模 Agentic RL 训练提供了坚实的环境基础设施。
diff --git a/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/version-1.8.x/User Guides/scheduler.md b/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/version-1.8.x/User Guides/scheduler.md
new file mode 100644
index 0000000000..88662edd51
--- /dev/null
+++ b/docs/i18n/zh-Hans/docusaurus-plugin-content-docs/version-1.8.x/User Guides/scheduler.md	
@@ -0,0 +1,283 @@
+---
+sidebar_position: 5
+---
+
+# 任务调度器(Scheduler)
+
+ROCK 调度器是内嵌于 `admin` 服务中的周期性任务框架。它会按可配置的时间间隔,把后台维护任务(镜像清理、文件清理、容器清理、镜像预拉取、自定义任务……)分发到所有存活的 Ray worker 上,从而在无人工干预的情况下保持 worker 节点健康。
+
+本文介绍如何启用调度器、配置内置任务、编写自定义任务,以及如何观测运行状态。
+
+## 1. 工作原理
+
+- 调度器以独立的守护线程(`SchedulerThread`)运行在 `admin` 进程内,使用自己的 `asyncio` 事件循环。
+- 任务通过 [APScheduler](https://apscheduler.readthedocs.io/) 以固定间隔(`interval_seconds`)触发。
+- 每次触发时,调度器会先获取存活 Ray worker 列表(由 `worker_cache_ttl` 秒级缓存),然后并发地把任务下发到每个 worker(默认并发度:50)。
+- **下发动作通过 worker 上的 rocklet 服务以 HTTP 方式完成**:admin 端构造 `RemoteSandboxRuntime(host=worker_ip, port=Port.PROXY)`(参见 `rock.deployments.constants.Port`),并调用 `runtime.execute / read_file / write_file`。**因此每个 worker 都必须运行 `rocklet` 服务并在 `Port.PROXY` 上可达**,否则调度器无法下发命令、也无法在 worker 上读写状态文件。
+- 每个任务都继承自 `rock.admin.scheduler.task_base.BaseTask`,并必须实现 `run_action(runtime: RemoteSandboxRuntime)` —— 单个 worker 上的实际执行逻辑。
+- 每个 worker 的执行状态会被持久化到 `ROCK_SCHEDULER_STATUS_DIR`(默认 `/data/scheduler_status`)目录下的 JSON 文件中;每次执行结束后还会写入聚合报告 `<status_dir>/<task_type>_run_report.json`。
+- 当配置了 Nacos 配置源时,调度器会订阅配置变更,并按 diff 应用:仅 hash 发生变化的任务被重新安装,被删除的任务会同步从所有 worker 上清理。
+
+### 前置条件
+
+启用调度器之前,请确认每个 Ray worker 满足以下条件:
+
+| 条件 | 原因 |
+|------|------|
+| worker 上正在运行 `rocklet` 进程 | 调度器通过 rocklet HTTP 接口下发每一个任务;若 rocklet 不存在,`runtime.execute` 调用会超时。 |
+| admin 可访问 rocklet 监听端口 | 调度器固定使用 `Port.PROXY`(定义于 `rock.deployments.constants.Port`)作为下发目标,请确认防火墙 / 安全组未阻断该端口。 |
+| `ROCK_SCHEDULER_STATUS_DIR` 在 worker 内可写 | 任务会在该目录下读写 `<task>_status.json`,用于幂等控制和 PID 跟踪。 |
+| 任务依赖的工具在 worker 上可用 | 例如清理 / 拉取类任务需要 `docker`;`ImageCleanupTask` 首次运行会通过 `curl` 联网安装 `docuum`。 |
+
+rocklet 服务由 worker 标准启动脚本(`docker_run.sh`、`docker_run_with_uv.sh`、`docker_run_with_pip.sh`)自动拉起,通常等价于 `rocklet --port <Port.PROXY>`。如果你使用了自定义 entrypoint 来启动 worker,请确保等价命令被执行。具体的运行时类型与 rocklet 启动方式可参考 [Configuration](./configuration.md)。
+
+### 幂等性
+
+每个任务都需要声明自己的幂等模式,该模式直接影响重复触发时的行为:
+
+| 模式 | 行为 |
+|------|------|
+| `IDEMPOTENT` | 每次 tick 都会执行,可安全重复(例如 `docker pull`、`find -exec rm`)。 |
+| `NON_IDEMPOTENT` | 任务会拉起一个后台守护进程(例如 `docuum`)。调度器会读取上一次的状态文件,检查记录的 PID 是否仍存活,若仍在运行则跳过本次启动。当任务从配置中移除时,调度器会通过 `pkill` 杀掉该进程。 |
+
+## 2. 启用调度器
+
+调度器配置位于 ROCK admin YAML 顶层 `scheduler:` 字段下(例如 `rock-conf/rock-local.yml`、`rock-conf/rock-dev.yml`)。
+
+```yaml
+scheduler:
+  enabled: true              # 总开关
+  worker_cache_ttl: 43200    # Worker IP 缓存 TTL(秒)
+  tasks:
+    # ... 任务列表,详见下文
+```
+
+| 字段 | 类型 | 默认值 | 说明 |
+|------|------|--------|------|
+| `enabled` | bool | `false` | 总开关。设为 `false` 时所有任务会被卸载,且不再触发。 |
+| `worker_cache_ttl` | int | `3600` | 存活 worker IP 列表缓存时长(秒);超过该时长后会从 `ray.nodes()` 重新获取。 |
+| `tasks` | list | `[]` | `TaskConfig` 列表,详见 [第 4 节](#4-任务配置schema)。 |
+
+### 相关环境变量
+
+| 变量 | 默认值 | 用途 |
+|------|--------|------|
+| `ROCK_SCHEDULER_STATUS_DIR` | `/data/scheduler_status` | worker 上写入任务状态 JSON 与执行报告的目录。 |
+| `ROCK_LOGGING_PATH` | (未设置) | 设置后,调度器拉起的守护进程(docuum、container_cleanup、image_pull)会将 stdout/stderr 重定向到 `<ROCK_LOGGING_PATH>/<task>.log`。 |
+| `ROCK_DOCUUM_INSTALL_URL` | `https://raw.githubusercontent.com/stepchowfun/docuum/main/install.sh` | `ImageCleanupTask` 按需拉取 `docuum` 安装脚本的 URL。 |
+
+## 3. 内置任务
+
+ROCK 在 `rock.admin.scheduler.tasks` 下提供了 4 个内置任务,通过把 `task_class` 设置为对应的全限定类路径即可注册。
+
+### 3.1 ImageCleanupTask
+
+在每个 worker 上运行 [`docuum`](https://github.com/stepchowfun/docuum),当磁盘占用超过阈值时按 LRU 策略淘汰镜像。**非幂等** —— `docuum` 是常驻守护进程;调度器会跟踪其 PID,只要进程仍存活就跳过重复拉起。
+
+```yaml
+- task_class: rock.admin.scheduler.tasks.image_cleanup_task.ImageCleanupTask
+  enabled: true
+  interval_seconds: 43200       # 每 12 小时检查一次守护进程
+  params:
+    disk_threshold: "70%"       # 磁盘占用超过 70% 时触发淘汰
+    image_whitelist:             # 匹配 repository:tag 的 glob 模式,白名单内的镜像不会被淘汰
+      - "python:3.11"
+      - "my-registry.example.com/base/*"
+```
+
+| 参数 | 类型 | 默认值 | 说明 |
+|------|------|--------|------|
+| `disk_threshold` | str | `"1T"` | 传给 `docuum --threshold` 的磁盘阈值,支持容量(`100G`、`1T`)或百分比(`70%`)。 |
+| `image_whitelist` | list[str] | `[]` | 透传给 `docuum --keep` 的 glob 模式列表。 |
+
+### 3.2 FileCleanupTask
+
+遍历配置的目录,删除超过 `max_age_mins` 或大于 `max_file_size` 的文件,然后清理留下的空目录。**幂等**。
+
+```yaml
+- task_class: rock.admin.scheduler.tasks.file_cleanup_task.FileCleanupTask
+  enabled: true
+  interval_seconds: 86400       # 每天执行一次
+  params:
+    target_dirs:
+      # 字符串形式 —— 不配置排除项
+      - "/data/service_status"
+      # 对象形式 —— 配置该目录独有的排除项
+      - path: "/data/logs"
+        exclude_files:           # 支持纯文件名 / 相对路径 / 绝对路径
+          - "docuum.log"
+          - "./rocklet.log"
+          - "./access.log"
+        exclude_dirs:
+          - ".cache"
+    max_age_mins: 10080          # 7 天,超出此时间的文件会被删除
+    max_file_size: "1G"          # 大于此大小的文件会被删除(支持 K/M/G/T)
+```
+
+| 参数 | 类型 | 默认值 | 说明 |
+|------|------|--------|------|
+| `target_dirs` | list | `[]` | 每个条目是一个字符串(只填路径)或 `{path, exclude_files, exclude_dirs}`。 |
+| `max_age_mins` | int | `10080` | mtime 早于该分钟数的文件会被删除。 |
+| `max_file_size` | str | `"1G"` | 大于该阈值的文件会被删除,支持 `K/M/G/T` 单位。 |
+
+删除条件为 `(-mmin +max_age_mins) OR (-size +max_file_size)`。文件清理后,会再用 `find -depth -type d -empty -delete` 清理留下的空目录(同样遵循 `exclude_dirs` 配置)。
+
+### 3.3 ContainerCleanupTask
+
+删除停止时间超过指定时长的 Docker 容器,避免 worker 上的容器列表无限增长。**幂等**。
+
+```yaml
+- task_class: rock.admin.scheduler.tasks.container_cleanup_task.ContainerCleanupTask
+  enabled: true
+  interval_seconds: 86400
+  params:
+    max_age_hours: 72            # 删除超过 72 小时的 exited 容器
+```
+
+| 参数 | 类型 | 默认值 | 说明 |
+|------|------|--------|------|
+| `max_age_hours` | int | `24` | 已退出容器的最大保留时间(以 `FinishedAt` 起算的小时数);超出后被 `docker rm`。 |
+
+每次执行还会顺带清理处于 `created` 状态(从未启动)的容器。
+
+### 3.4 ImagePullTask
+
+在每个 worker 上预拉取一组 Docker 镜像,并可选地先登录私有仓库,以降低沙箱冷启动延迟。**幂等**(若镜像已是最新,`docker pull` 等同于空操作)。
+
+```yaml
+- task_class: rock.admin.scheduler.tasks.image_pull_task.ImagePullTask
+  enabled: true
+  interval_seconds: 21600       # 每 6 小时刷新一次
+  params:
+    images:
+      # 字符串形式 —— 公开镜像,无需鉴权
+      - "python:3.11"
+      # 对象形式 —— 私有镜像,需要登录
+      - image: "my-registry.example.com/chatos/python:313"
+        registry_username: "myuser"
+        registry_password: "bXlwYXNzd29yZA=="   # base64 编码
+```
+
+| 参数 | 类型 | 默认值 | 说明 |
+|------|------|--------|------|
+| `images` | list | `[]` | 每个条目是一个镜像字符串或 `{image, registry_username, registry_password}`。 |
+
+`registry_password` 必须是 base64 编码,worker 端会先解码再通过 `docker login --password-stdin` 登录。仓库地址会从镜像名称中解析,因此每个镜像可以指向不同的仓库。
+
+## 4. 任务配置 Schema
+
+`scheduler.tasks` 下的每一项都会被解析为 `rock.config.TaskConfig`:
+
+| 字段 | 类型 | 默认值 | 说明 |
+|------|------|--------|------|
+| `task_class` | str | `""` | Python 类的全限定路径,**必填**。 |
+| `enabled` | bool | `true` | 设为 `false` 的任务在加载阶段会被跳过,在 reload 阶段会被卸载。 |
+| `interval_seconds` | int | `3600` | APScheduler `interval` 间隔(秒)。 |
+| `params` | dict | `{}` | 任务自定义参数,会在 `from_config()` 中被消费。 |
+
+只要某个任务条目中任何字段发生变化,调度器就会卸载旧任务(非幂等任务还会清理 worker 上的守护进程与状态文件)再安装新任务,**整个过程不需要重启 admin 进程**。
+
+## 5. 编写自定义任务
+
+任何位于 Python 路径下、继承自 `BaseTask` 的类都可以注册为调度任务。最小契约示例如下:
+
+```python
+# my_pkg/my_tasks/disk_report_task.py
+from rock.admin.proto.request import SandboxCommand as Command
+from rock.admin.scheduler.task_base import BaseTask, IdempotencyType, TaskStatusEnum
+from rock.sandbox.remote_sandbox import RemoteSandboxRuntime
+
+
+class DiskReportTask(BaseTask):
+    """记录每个 worker 的 `df -h` 输出。"""
+
+    def __init__(self, interval_seconds: int = 3600, mount_point: str = "/"):
+        super().__init__(
+            type="disk_report",                   # 同时作为 APScheduler job id 与状态文件名前缀
+            interval_seconds=interval_seconds,
+            idempotency=IdempotencyType.IDEMPOTENT,
+        )
+        self.mount_point = mount_point
+
+    @classmethod
+    def from_config(cls, task_config) -> "DiskReportTask":
+        return cls(
+            interval_seconds=task_config.interval_seconds,
+            mount_point=task_config.params.get("mount_point", "/"),
+        )
+
+    async def run_action(self, runtime: RemoteSandboxRuntime) -> dict:
+        result = await runtime.execute(
+            Command(command=f"df -h {self.mount_point}", shell=True),
+        )
+        return {
+            "status": TaskStatusEnum.SUCCESS,
+            "exit_code": result.exit_code,
+            "stdout": result.stdout,
+        }
+```
+
+随后在 YAML 中注册:
+
+```yaml
+scheduler:
+  enabled: true
+  tasks:
+    - task_class: my_pkg.my_tasks.disk_report_task.DiskReportTask
+      enabled: true
+      interval_seconds: 600
+      params:
+        mount_point: "/data"
+```
+
+### 自定义任务编写要点
+
+- **`super().__init__()` 中的 `type` 必须全局唯一**:它会同时被用作 APScheduler job id、状态文件名(`<type>_status.json`)与执行报告文件名(`<type>_run_report.json`)。两个任务不能共用同一个 `type`。
+- **正确选择 `IdempotencyType`**:
+  - 当 `run_action` 同步执行完毕、可安全重入时,使用 `IDEMPOTENT`。
+  - 当任务通过 `nohup` 拉起常驻守护进程并返回 PID 时,使用 `NON_IDEMPOTENT`;调度器会跟踪 PID,在其存活期间跳过重复启动,在卸载时通过 `pkill` 杀掉。
+- **`run_action` 必须返回 dict**,推荐字段:
+  - `status` —— `TaskStatusEnum` 值,会写入状态文件。
+  - `pid` —— `NON_IDEMPOTENT` 守护型任务必须返回(可使用 `rock.utils.system.extract_nohup_pid` 从 `nohup ... & echo PID_PREFIX${{!}}PID_SUFFIX` 输出中提取)。
+  - 其他诊断字段会落到状态文件的 `extra` 块里。
+- **重写 `from_config(cls, task_config)`** 用于把 `task_config.params` 翻译成 `__init__` 的入参。
+- **使用 `runtime.execute / read_file / write_file`** 与 worker 通信,**不要**在本地直接执行 shell —— 调度器是把任务下发到远端 worker 的 `RemoteSandboxRuntime`。
+
+## 6. 可观测性
+
+每个任务会在每个 worker 上产生两类产物:
+
+| 路径 | 写入方 | 内容 |
+|------|--------|------|
+| `<ROCK_SCHEDULER_STATUS_DIR>/<task_type>_status.json` | `BaseTask.save_task_status` | 单 worker 最新状态:`task_name`、`worker_ip`、`pid`、`status`(`pending`/`running`/`success`/`failed`)、`last_run`、`error`,以及任务自定义的 `extra` 字段。 |
+| `<ROCK_SCHEDULER_STATUS_DIR>/<task_type>_run_report.json` | `BaseTask.run`(由 admin 端在本轮 tick 结束后写入) | 聚合报告:总数 / 成功数 / 失败数、`success_ips` 列表、`failed_details`(`ip` + 错误堆栈)。 |
+
+调度器内部日志会输出到 ROCK admin 标准日志路径下,logger 名称包括 `name="scheduler"`、`name="task_base"`、`name="image_clean"` 等。当设置了 `ROCK_LOGGING_PATH` 时,被调度器拉起的守护进程会把自身日志写入 `<ROCK_LOGGING_PATH>/<task>.log`(例如 `docuum.log`、`container_cleanup.log`、`image_pull.log`)。
+
+## 7. 通过 Nacos 动态热更(可选)
+
+当 admin 服务启用了 Nacos 配置源时,调度器会注册一个 YAML 监听器,并对配置推送做出响应:
+
+- 仅检查 `scheduler:` 段,其他段被忽略。
+- 新配置段会被计算 hash 并与上一次的 hash 比对 —— 重复推送会被自动跳过。
+- 通过对新旧任务列表 diff,决定哪些任务需要安装、卸载或重新安装(`params` / `interval_seconds` / `enabled` 任意改动都会触发)。
+- 被删除或重新安装的非幂等任务会先做清理:杀掉守护进程 PID 并删除状态文件。
+
+由此,任务间隔调整、参数微调、增删任务等都可以在 admin 进程不重启的前提下生效。
+
+## 8. 常见问题排查
+
+| 现象 | 可能原因 | 检查项 |
+|------|----------|--------|
+| admin 日志输出 `Scheduler disabled, all tasks removed` | `scheduler.enabled` 为 `false` | 把 YAML 中的 `enabled` 改为 `true`。 |
+| `No alive workers found for task '<type>'` | Ray 集群没有存活的 worker | 确认 `ray.nodes()` 返回的 CPU worker 处于 alive 状态;新加 worker 时可调小 `worker_cache_ttl`。 |
+| 任务到点触发,但每个 worker 都进入 `failed_details` 且报连接错误 | worker 上 rocklet 未运行,或 `Port.PROXY` 被防火墙阻断 | 在 worker 上访问 rocklet 存活探针 `GET /is_alive`(例如 `curl http://<worker_ip>:<Port.PROXY>/is_alive`);若无响应,使用 `rocklet --port <Port.PROXY>` 重启或在防火墙放行该端口。 |
+| 非幂等任务在某个 worker 上始终不再触发 | 状态文件中记录的 PID 仍存活 | 查看 `<status_dir>/<task>_status.json`,如 `status: running` 且 PID 仍在,则 `should_run` 会返回 `False`,属预期行为。 |
+| 日志报 `Failed to create task '<class>'` | `task_class` 导入失败 | 确认对应模块在 admin 进程中可被导入(同一个 venv、`PYTHONPATH` 中可见)。 |
+| `ImagePullTask` 中 `docker login` 失败 | `registry_password` 未做 base64 编码,或镜像名称解析出的仓库地址有误 | 用 `echo -n '<pwd>' \| base64` 重新编码;确认镜像名中的仓库 host 正确。 |
+
+## 相关文档
+
+- [Configuration](./configuration.md) —— 环境变量与运行时部署说明
+- [API Documentation](../References/api.md) —— admin HTTP API
+- [Python SDK Documentation](../References/Python%20SDK%20References/python_sdk.md) —— SDK 编程式使用
diff --git a/docs/static/img/sandbox_create_latency_distribution.png b/docs/static/img/sandbox_create_latency_distribution.png
new file mode 100644
index 0000000000..2f2307dd3c
Binary files /dev/null and b/docs/static/img/sandbox_create_latency_distribution.png differ
diff --git a/docs/versioned_docs/version-1.8.x/User Guides/sandbox_create_benchmark.md b/docs/versioned_docs/version-1.8.x/User Guides/sandbox_create_benchmark.md
new file mode 100644
index 0000000000..fac924534e
--- /dev/null
+++ b/docs/versioned_docs/version-1.8.x/User Guides/sandbox_create_benchmark.md	
@@ -0,0 +1,43 @@
+# ROCK Sandbox Large-Scale Concurrent Creation Benchmark Report
+
+## 1. Background
+
+In Agentic Reinforcement Learning training and large-scale agent rollout scenarios, **a single training step often needs to spin up thousands of sandboxes concurrently**, so the throughput and latency of sandbox creation directly determine end-to-end training efficiency.
+
+This report benchmarks ROCK Sandbox batch creation at concurrency levels of **1000 / 2000 / 4000 / 8000 / 16000**, to validate ROCK's stability and latency profile under heavy concurrent load.
+
+## 2. Scope & Methodology
+
+- **System under test**: the ROCK Sandbox creation path.
+- **Metric**: per-sandbox wall-clock time from issuing the create request until the sandbox reaches the alive (usable) state (seconds).
+- **Concurrency levels**: 1000, 2000, 4000, 8000, and 16000 sandboxes created in parallel.
+- **Concurrency model**: a **distributed, multi-machine + pure multi-process** driver is used — load is generated from multiple machines simultaneously, and each machine runs every sandbox-creation task in an isolated OS process. This avoids GIL contention, event-loop overhead, and TLS-handshake serialization becoming client-side bottlenecks, so the numbers truly reflect the server-side capacity.
+
+## 3. Results
+
+### 3.1 Sandbox Creation Latency Statistics
+
+| Concurrency | Samples | Success Rate | Min | Max | Avg | P50 | P95 | P99 |
+|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| **1000** | 1000 | 100% | 0.47s | 9.84s | 4.72s | 4.93s | 6.89s | 8.69s |
+| **2000** | 2000 | 100% | 0.64s | 11.20s | 4.90s | 4.92s | 9.76s | 10.53s |
+| **4000** | 4000 | 100% | 0.93s | 15.51s | 7.16s | 7.05s | 11.26s | 13.34s |
+| **8000** | 8000 | 100% | 3.85s | 33.99s | 18.06s | 17.86s | 30.09s | 32.03s |
+| **16000** | 16000 | 100% | 2.66s | 63.84s | 37.17s | 39.98s | 56.68s | 60.11s |
+
+> All values are in seconds. **Success rate is 100% (zero failures)** at every concurrency level.
+>
+> Note: as concurrency grows, ROCK applies rate limiting on the control plane to protect server-side stability, so the observed latency increases accordingly with higher concurrency.
+
+![Sandbox Create Latency Distribution](../../../static/img/sandbox_create_latency_distribution.png)
+
+## 4. Conclusion
+
+The benchmark from 1000 up to 16000 concurrent sandbox creations on ROCK shows:
+
+1. **100% success rate**, validating ROCK's reliability under heavy concurrent load.
+2. **Small-scale concurrency (≤ 2000) is essentially queue-free**, with P50 below 5s — comfortably suitable for latency-sensitive training and evaluation scenarios.
+3. **At large scale (≥ 4000), latency grows roughly linearly with concurrency**, giving predictable, easy-to-capacity-plan behaviour for sandbox pools.
+4. **Even at 16000 concurrency, P99 stays under ~60s**, sufficient to support very large agent rollouts and parallel RL steps.
+
+ROCK is therefore proven capable of supporting **tens of thousands of concurrent sandbox creations**, providing solid environment infrastructure for large-scale agentic RL training.
diff --git a/docs/versioned_docs/version-1.8.x/User Guides/scheduler.md b/docs/versioned_docs/version-1.8.x/User Guides/scheduler.md
new file mode 100644
index 0000000000..68a63cb427
--- /dev/null
+++ b/docs/versioned_docs/version-1.8.x/User Guides/scheduler.md	
@@ -0,0 +1,283 @@
+---
+sidebar_position: 5
+---
+
+# Scheduler
+
+The ROCK scheduler is a periodic task framework embedded in the `admin` service. It dispatches background maintenance tasks (image cleanup, file cleanup, container cleanup, image pre-pull, custom tasks, ...) to every alive Ray worker on a configurable interval, so worker nodes stay healthy without manual intervention.
+
+This guide covers how to enable the scheduler, configure built-in tasks, write your own task, and inspect runtime status.
+
+## 1. How It Works
+
+- The scheduler runs inside the `admin` process as a dedicated daemon thread (`SchedulerThread`) with its own `asyncio` event loop.
+- Tasks are scheduled by [APScheduler](https://apscheduler.readthedocs.io/) using fixed intervals (`interval_seconds`).
+- For each tick, the scheduler resolves the list of alive Ray workers (cached via `worker_cache_ttl` seconds) and dispatches the task to every worker concurrently (default concurrency: 50).
+- Dispatch is done over HTTP through the worker's **rocklet** service: the admin builds a `RemoteSandboxRuntime(host=worker_ip, port=Port.PROXY)` (see `rock.deployments.constants.Port`) and calls `runtime.execute / read_file / write_file` against it. **Every worker must therefore have the `rocklet` server running and reachable on `Port.PROXY`** — otherwise the scheduler cannot push commands or read/write status files on that worker.
+- Each task subclasses `rock.admin.scheduler.task_base.BaseTask` and must implement `run_action(runtime: RemoteSandboxRuntime)` — the work performed on a single worker.
+- Per-worker execution status is persisted to the worker filesystem under `ROCK_SCHEDULER_STATUS_DIR` (default `/data/scheduler_status`), and an aggregated execution report is written to `<status_dir>/<task_type>_run_report.json` after every run.
+- If a Nacos config provider is enabled, the scheduler subscribes to config changes and applies a diff: only tasks whose hash changed are re-installed; removed tasks are cleaned up from all workers.
+
+### Prerequisites
+
+Before enabling the scheduler, make sure each Ray worker meets the following requirements:
+
+| Requirement | Why |
+|-------------|-----|
+| `rocklet` process is running on the worker | The scheduler dispatches every task through the rocklet HTTP API; without it, `runtime.execute` calls time out. |
+| Rocklet's listening port is reachable from the admin | The scheduler uses `Port.PROXY` (defined in `rock.deployments.constants.Port`) as the dispatch target. Make sure no firewall / security group blocks it. |
+| `ROCK_SCHEDULER_STATUS_DIR` is writable inside the worker | Tasks read and write `<task>_status.json` here for idempotency / PID tracking. |
+| Tools required by the task are available on the worker | e.g. `docker` for the cleanup / pull tasks, `curl` and outbound network for `ImageCleanupTask` to install `docuum` on first run. |
+
+The rocklet server is started automatically by the standard worker bootstrap scripts (`docker_run.sh`, `docker_run_with_uv.sh`, `docker_run_with_pip.sh`) — typically `rocklet --port <Port.PROXY>`. If you bring up workers with a custom entrypoint, ensure the equivalent command is invoked. See the [Configuration](./configuration.md) guide for the runtime-environment options that govern how rocklet is started.
+
+### Idempotency
+
+Each task declares an idempotency mode that affects how it is re-run:
+
+| Mode | Behavior |
+|------|----------|
+| `IDEMPOTENT` | Always run on every tick. Safe to repeat (e.g. `docker pull`, `find -exec rm`). |
+| `NON_IDEMPOTENT` | Spawns a background daemon (e.g. `docuum`). The scheduler reads the previous status file, checks whether the recorded PID is still alive, and skips re-launch if the daemon is still running. On task removal the daemon is killed via `pkill`. |
+
+## 2. Enabling the Scheduler
+
+The scheduler is configured under the top-level `scheduler:` key of the ROCK admin YAML (e.g. `rock-conf/rock-local.yml`, `rock-conf/rock-dev.yml`).
+
+```yaml
+scheduler:
+  enabled: true              # Master switch
+  worker_cache_ttl: 43200    # Worker IP cache TTL in seconds
+  tasks:
+    # ... task list, see below
+```
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `enabled` | bool | `false` | Master switch. When `false`, all tasks are removed and no new ticks fire. |
+| `worker_cache_ttl` | int | `3600` | Seconds the alive-worker IP list is cached before refreshing from `ray.nodes()`. |
+| `tasks` | list | `[]` | List of `TaskConfig` entries (see [Section 4](#4-task-config-schema)). |
+
+### Related Environment Variables
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `ROCK_SCHEDULER_STATUS_DIR` | `/data/scheduler_status` | Directory on workers where per-task status JSON and run reports are written. |
+| `ROCK_LOGGING_PATH` | (unset) | When set, scheduler-spawned daemons (docuum, container_cleanup, image_pull) redirect their stdout/stderr to `<ROCK_LOGGING_PATH>/<task>.log`. |
+| `ROCK_DOCUUM_INSTALL_URL` | `https://raw.githubusercontent.com/stepchowfun/docuum/main/install.sh` | Install script URL for `docuum`, fetched on demand by `ImageCleanupTask`. |
+
+## 3. Built-in Tasks
+
+ROCK ships with four built-in tasks under `rock.admin.scheduler.tasks`. Each task is registered by setting `task_class` to its fully qualified class path.
+
+### 3.1 ImageCleanupTask
+
+Runs [`docuum`](https://github.com/stepchowfun/docuum) on every worker to evict the least-recently-used Docker images once disk usage crosses a threshold. **Non-idempotent** — `docuum` runs as a long-lived daemon; the scheduler tracks its PID and skips re-launch while the daemon is alive.
+
+```yaml
+- task_class: rock.admin.scheduler.tasks.image_cleanup_task.ImageCleanupTask
+  enabled: true
+  interval_seconds: 43200       # Re-check daemon every 12 hours
+  params:
+    disk_threshold: "70%"       # Trigger eviction when disk usage exceeds 70%
+    image_whitelist:            # Glob patterns matching repository:tag — never evicted
+      - "python:3.11"
+      - "my-registry.example.com/base/*"
+```
+
+| Param | Type | Default | Description |
+|-------|------|---------|-------------|
+| `disk_threshold` | str | `"1T"` | Disk usage threshold passed to `docuum --threshold`. Accepts size (`100G`, `1T`) or percentage (`70%`). |
+| `image_whitelist` | list[str] | `[]` | Glob patterns forwarded to `docuum --keep`. |
+
+### 3.2 FileCleanupTask
+
+Walks each configured directory and removes files that are either older than `max_age_mins` or larger than `max_file_size`, then prunes empty subdirectories. **Idempotent**.
+
+```yaml
+- task_class: rock.admin.scheduler.tasks.file_cleanup_task.FileCleanupTask
+  enabled: true
+  interval_seconds: 86400       # Run daily
+  params:
+    target_dirs:
+      # Plain string form — no exclusions
+      - "/data/service_status"
+      # Object form — per-directory exclusions
+      - path: "/data/logs"
+        exclude_files:           # Plain name | relative path | absolute path
+          - "docuum.log"
+          - "./rocklet.log"
+          - "./access.log"
+        exclude_dirs:
+          - ".cache"
+    max_age_mins: 10080          # 7 days; older files are removed
+    max_file_size: "1G"          # Files larger than this are removed (supports K/M/G/T)
+```
+
+| Param | Type | Default | Description |
+|-------|------|---------|-------------|
+| `target_dirs` | list | `[]` | Each entry is either a string (path only) or `{path, exclude_files, exclude_dirs}`. |
+| `max_age_mins` | int | `10080` | Files whose mtime is older than this many minutes are deleted. |
+| `max_file_size` | str | `"1G"` | Files larger than this are deleted. Suffixes `K/M/G/T` accepted. |
+
+The deletion condition is `(-mmin +max_age_mins) OR (-size +max_file_size)`. After file removal, a second `find -depth -type d -empty -delete` pass removes empty directories left behind (also honoring `exclude_dirs`).
+
+### 3.3 ContainerCleanupTask
+
+Removes stopped Docker containers older than a configurable age. Helps prevent the worker's container list from growing unbounded between sandbox runs. **Idempotent**.
+
+```yaml
+- task_class: rock.admin.scheduler.tasks.container_cleanup_task.ContainerCleanupTask
+  enabled: true
+  interval_seconds: 86400
+  params:
+    max_age_hours: 72            # Remove exited containers older than 72 hours
+```
+
+| Param | Type | Default | Description |
+|-------|------|---------|-------------|
+| `max_age_hours` | int | `24` | Maximum age (hours since `FinishedAt`) for kept exited containers. Older ones are `docker rm`'d. |
+
+The task also removes any container in the `created` state (never started) on every run.
+
+### 3.4 ImagePullTask
+
+Pre-pulls a list of Docker images on every worker, optionally logging in to private registries first. Reduces sandbox cold-start latency. **Idempotent** (`docker pull` is a no-op when the image is already up-to-date).
+
+```yaml
+- task_class: rock.admin.scheduler.tasks.image_pull_task.ImagePullTask
+  enabled: true
+  interval_seconds: 21600       # Refresh every 6 hours
+  params:
+    images:
+      # Plain string form — public image, no auth
+      - "python:3.11"
+      # Object form — private image with registry login
+      - image: "my-registry.example.com/chatos/python:313"
+        registry_username: "myuser"
+        registry_password: "bXlwYXNzd29yZA=="   # base64-encoded
+```
+
+| Param | Type | Default | Description |
+|-------|------|---------|-------------|
+| `images` | list | `[]` | Each entry is either an image string or `{image, registry_username, registry_password}`. |
+
+`registry_password` must be base64-encoded; the worker decodes it and pipes it to `docker login --password-stdin`. The registry host is parsed from the image name, so each image can target a different registry.
+
+## 4. Task Config Schema
+
+Every entry under `scheduler.tasks` is loaded as a `rock.config.TaskConfig`:
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `task_class` | str | `""` | Fully qualified Python class path. Required. |
+| `enabled` | bool | `true` | Disabled tasks are skipped at install time and torn down on reload. |
+| `interval_seconds` | int | `3600` | APScheduler `interval` in seconds. |
+| `params` | dict | `{}` | Task-specific kwargs forwarded to `from_config()`. |
+
+A change in any field of an existing task entry causes the scheduler to uninstall the old task (cleaning up its worker-side state when non-idempotent) and install the new one — without restarting the admin process.
+
+## 5. Writing a Custom Task
+
+Any class under your Python path that subclasses `BaseTask` can be registered. The minimum contract is:
+
+```python
+# my_pkg/my_tasks/disk_report_task.py
+from rock.admin.proto.request import SandboxCommand as Command
+from rock.admin.scheduler.task_base import BaseTask, IdempotencyType, TaskStatusEnum
+from rock.sandbox.remote_sandbox import RemoteSandboxRuntime
+
+
+class DiskReportTask(BaseTask):
+    """Log `df -h` output from every worker."""
+
+    def __init__(self, interval_seconds: int = 3600, mount_point: str = "/"):
+        super().__init__(
+            type="disk_report",                   # Used as job id and status filename prefix
+            interval_seconds=interval_seconds,
+            idempotency=IdempotencyType.IDEMPOTENT,
+        )
+        self.mount_point = mount_point
+
+    @classmethod
+    def from_config(cls, task_config) -> "DiskReportTask":
+        return cls(
+            interval_seconds=task_config.interval_seconds,
+            mount_point=task_config.params.get("mount_point", "/"),
+        )
+
+    async def run_action(self, runtime: RemoteSandboxRuntime) -> dict:
+        result = await runtime.execute(
+            Command(command=f"df -h {self.mount_point}", shell=True),
+        )
+        return {
+            "status": TaskStatusEnum.SUCCESS,
+            "exit_code": result.exit_code,
+            "stdout": result.stdout,
+        }
+```
+
+Then register it from YAML:
+
+```yaml
+scheduler:
+  enabled: true
+  tasks:
+    - task_class: my_pkg.my_tasks.disk_report_task.DiskReportTask
+      enabled: true
+      interval_seconds: 600
+      params:
+        mount_point: "/data"
+```
+
+### Authoring Checklist
+
+- **Always set a unique `type` string** in `super().__init__()`. It is used as the APScheduler job id, the status filename (`<type>_status.json`), and the run-report filename (`<type>_run_report.json`). Two tasks must not share a `type`.
+- **Pick the right `IdempotencyType`**:
+  - Use `IDEMPOTENT` when `run_action` finishes synchronously and is safe to re-execute.
+  - Use `NON_IDEMPOTENT` when you `nohup` a long-running daemon and return its PID. The scheduler will then track the PID, skip re-launch while it is alive, and `pkill` it on uninstall.
+- **Return a dict from `run_action`**. Recommended keys:
+  - `status` — a `TaskStatusEnum` value, persisted into the status file.
+  - `pid` — required for `NON_IDEMPOTENT` daemons (use `rock.utils.system.extract_nohup_pid` on the `nohup ... & echo PID_PREFIX${{!}}PID_SUFFIX` output).
+  - Any other diagnostic fields are written to the status file's `extra` section.
+- **Override `from_config(cls, task_config)`** to translate `task_config.params` into your `__init__` kwargs.
+- **Use `runtime.execute / read_file / write_file`** rather than running shell commands locally — the scheduler is dispatching to remote workers via `RemoteSandboxRuntime`.
+
+## 6. Observability
+
+For each task, the scheduler writes two artifacts on every worker:
+
+| Path | Written By | Contents |
+|------|------------|----------|
+| `<ROCK_SCHEDULER_STATUS_DIR>/<task_type>_status.json` | `BaseTask.save_task_status` | Latest per-worker status: `task_name`, `worker_ip`, `pid`, `status` (`pending`/`running`/`success`/`failed`), `last_run`, `error`, plus task-specific `extra` fields. |
+| `<ROCK_SCHEDULER_STATUS_DIR>/<task_type>_run_report.json` | `BaseTask.run` (admin side, after the tick completes) | Aggregated report: total/success/failed counts, list of `success_ips`, and `failed_details` (`ip` + traceback). |
+
+Scheduler-internal logs are written under the standard ROCK admin log path, with `name="scheduler"`, `name="task_base"`, `name="image_clean"`, etc. Scheduler-spawned daemons additionally write their own logs to `<ROCK_LOGGING_PATH>/<task>.log` when `ROCK_LOGGING_PATH` is set (e.g. `docuum.log`, `container_cleanup.log`, `image_pull.log`).
+
+## 7. Dynamic Reload via Nacos (Optional)
+
+When the admin service is configured with a Nacos provider, the scheduler installs a YAML listener and reacts to config pushes:
+
+- Only the `scheduler:` section is inspected; other sections are ignored.
+- The new section is hashed and compared against the previous one — duplicate notifications are skipped.
+- A diff between old and new task lists determines which tasks to install, uninstall, or reinstall (changed `params` / `interval_seconds` / `enabled`).
+- Non-idempotent tasks that are removed or re-installed are first cleaned up: their daemon PID is killed and the status file is removed.
+
+This means task interval changes, parameter tweaks, and adding/removing tasks can be applied without restarting the admin process.
+
+## 8. Troubleshooting
+
+| Symptom | Likely Cause | What to Check |
+|---------|--------------|----------------|
+| `Scheduler disabled, all tasks removed` in admin log | `scheduler.enabled` is `false` | Set `enabled: true` in YAML. |
+| `No alive workers found for task '<type>'` | Ray cluster has no live worker nodes | Verify `ray.nodes()` reports alive CPU workers; consider lowering `worker_cache_ttl` if workers were just added. |
+| Task ticks fire but every worker shows up in `failed_details` with connection errors | `rocklet` is not running on the workers, or `Port.PROXY` is blocked | On the worker host, hit the rocklet liveness endpoint `GET /is_alive` on `Port.PROXY` (e.g. `curl http://<worker_ip>:<Port.PROXY>/is_alive`); if it does not respond, restart rocklet (`rocklet --port <Port.PROXY>`) or open the port in the firewall. |
+| Task runs but never repeats on a non-idempotent worker | Recorded PID still alive | Inspect `<status_dir>/<task>_status.json`; if `status: running` and the PID is alive, `should_run` returns `False`. |
+| `Failed to create task '<class>'` | `task_class` import failed | Ensure the module is importable inside the admin process (installed in the same venv, on `PYTHONPATH`). |
+| `docker login` failing in `ImagePullTask` | `registry_password` not base64-encoded, or wrong registry parsed from image | Re-encode the password with `echo -n '<pwd>' \| base64`; double-check the image's registry host. |
+
+## Related Documents
+
+- [Configuration](./configuration.md) — Environment variables and runtime layout
+- [API Documentation](../References/api.md) — Admin HTTP API
+- [Python SDK Documentation](../References/Python%20SDK%20References/python_sdk.md) — Programmatic sandbox usage