feat(sandbox): add user-facing disk quota with Ray scheduling and metrics #977

Open
zhangjaycee wants to merge 5 commits into alibaba:master from zhangjaycee:feat/disk_quota

@zhangjaycee zhangjaycee commented May 18, 2026

Collaborator

close #976

Summary

  • User-facing disk field on SDK / API / status response, with cluster cap validation
  • Ray custom resource "disk" scheduling — workers need --resources='{"disk": <bytes>}'
  • Disk total/available metrics (resource.disk.total, resource.disk.available)
  • Internal rename disk_limit_rootfs → disk (Nacos keys unchanged)
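
For the worker-side flag in the second bullet, a minimal sketch of how the `--resources` argument could be built. The helper name `disk_resources_flag` and the 500 GiB figure are illustrative assumptions, not part of this PR; only the `{"disk": <bytes>}` shape comes from the summary above.

```python
import json
import shlex


def disk_resources_flag(disk_bytes: int) -> str:
    """Build a --resources argument advertising `disk_bytes` as the Ray resource "disk".

    Hypothetical helper for illustration; the PR only states the flag's shape.
    """
    if disk_bytes <= 0:
        raise ValueError("disk_bytes must be positive")
    payload = json.dumps({"disk": disk_bytes})
    # shlex.quote wraps the JSON in single quotes so the shell passes it through intact.
    return f"--resources={shlex.quote(payload)}"


# Assumed example: a worker that sets aside 500 GiB for sandboxes.
flag = disk_resources_flag(500 * 1024**3)
print("ray start --address=<head-address>", flag)
```

The value is a plain byte count, so whatever unit convention the cluster uses for sizes such as `256g` has to be applied before this point.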

Design notes

  • DB column disk is new — no migration needed
  • Nacos key sandbox_disk_limit_rootfs intentionally kept to avoid breaking live config
  • effective_disk may be None when worker lacks --storage-opt support (graceful degradation)

@zhangjaycee zhangjaycee changed the title from "feat(sandbox): add user-facing disk quota field with cluster cap validation" to "feat(sandbox): add user-facing disk quota with Ray scheduling and metrics" Jun 11, 2026
@zhangjaycee zhangjaycee force-pushed the feat/disk_quota branch 4 times, most recently from 616cb36 to 23192a2 Compare June 17, 2026 06:20
Unify the internal field name with the user-facing API field.
Nacos keys (SANDBOX_DISK_LIMIT_ROOTFS_KEY, RuntimeConfig.sandbox_disk_limit_rootfs)
are intentionally kept unchanged to avoid breaking live production config.
…dation

- add `disk` field to SandboxStartRequest, SandboxConfig, and StandardSpec for user-specified disk quota
- update _apply_disk_limits priority: user request > Nacos > RuntimeConfig > None
- set max_allowed_spec.disk default to 256g in RuntimeConfig
- validate disk_limit_rootfs against max_allowed_spec.disk in SandboxManager and raise BadRequestRockError on excess
- propagate `disk` field from SDK SandboxConfig through to admin start request via HTTP
- add compatibility handling in from_request to exclude `disk` field from request dump
- ensure backward compatibility: Optional field defaults to None, existing requests unaffected
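
The priority order and cap check described in this commit can be sketched as follows. This is an illustration under assumptions, not the code in the PR: `parse_size`, `resolve_disk`, `validate_disk`, and `DiskQuotaExceeded` are hypothetical names (the last stands in for `BadRequestRockError`), sizes are assumed to use binary units, and whether the cap applies to the resolved value or only to the user-supplied one is not stated in the commit.

```python
from typing import Optional

# Assumption: suffixes are binary multiples (1g = 1024**3 bytes).
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


class DiskQuotaExceeded(ValueError):
    """Hypothetical stand-in for the BadRequestRockError named in the commit."""


def parse_size(value: str) -> int:
    """Convert a size string such as '20g' (or a bare byte count) to bytes."""
    text = value.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def resolve_disk(
    user: Optional[str], nacos: Optional[str], runtime: Optional[str]
) -> Optional[str]:
    """Pick the first configured value: user request > Nacos > RuntimeConfig > None."""
    for candidate in (user, nacos, runtime):
        if candidate is not None:
            return candidate
    return None


def validate_disk(disk: Optional[str], max_allowed: str = "256g") -> Optional[str]:
    """Reject a quota above the cluster cap; None (no quota) always passes."""
    if disk is not None and parse_size(disk) > parse_size(max_allowed):
        raise DiskQuotaExceeded(f"disk {disk} exceeds cluster cap {max_allowed}")
    return disk


print(validate_disk(resolve_disk("20g", "50g", "100g")))
```

Because every source is optional, a request without `disk` resolves exactly as it did before the field existed, which is the backward-compatibility property the last bullet describes.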
Propagate disk_limit_rootfs as a Ray custom resource named "disk"
(in bytes) so Ray's scheduler accounts for disk quota when placing
sandbox actors on worker nodes.

- RayOperator._generate_actor_options: merge disk into custom_resources
  dict alongside existing node-pinning resource
- RayDeployment._generate_actor_options: add disk resource when
  disk_limit_rootfs is set
- Fix test_disk_user_field tests broken by rebase (image now required)
- Add unit tests for both _generate_actor_options paths
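
A rough sketch of the merge described for `_generate_actor_options`. The function below is a simplified, hypothetical stand-in: its signature, the `node:<ip>` pinning key, and the `0.001` pinning amount are assumptions. Only the `num_cpus`, `memory`, and `resources` option keys are standard Ray actor options.

```python
from typing import Any, Dict, Optional


def generate_actor_options(
    num_cpus: float,
    memory_bytes: int,
    disk_bytes: Optional[int] = None,
    node_resource: Optional[str] = None,
) -> Dict[str, Any]:
    """Build Ray actor options, merging disk into the custom resources dict."""
    options: Dict[str, Any] = {"num_cpus": num_cpus, "memory": memory_bytes}
    custom: Dict[str, float] = {}
    if node_resource is not None:
        # Assumed node-pinning convention: request a sliver of a per-node resource.
        custom[node_resource] = 0.001
    if disk_bytes is not None:
        # The disk quota travels as the custom resource "disk", in bytes.
        custom["disk"] = disk_bytes
    if custom:
        options["resources"] = custom
    return options


opts = generate_actor_options(
    2, 4 * 1024**3, disk_bytes=10 * 1024**3, node_resource="node:10.0.0.1"
)
print(opts["resources"])
```

Omitting `resources` entirely when nothing is set keeps the options identical to the pre-change shape for sandboxes that have no disk quota.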
Add disk resource gauges (resource.disk.total, resource.disk.available)
alongside existing CPU and memory metrics, reading from Ray's custom
resource "disk" declared by workers at startup.
Add configurable disk overcommit ratio (Nacos > RuntimeConfig YAML)
so Ray scheduling requests disk/ratio resources while Docker uses the
full disk value. This enables scheduling more containers per node when
not all are expected to use their full disk allocation simultaneously.
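
To make the overcommit arithmetic concrete, here is a small sketch. The function name, the requirement that the ratio be at least 1, and the rounding toward zero are assumptions; the commit only states that Ray requests disk/ratio while Docker receives the full value.

```python
from typing import Tuple


def split_disk_request(disk_bytes: int, overcommit_ratio: float = 1.0) -> Tuple[int, int]:
    """Return (ray_request_bytes, docker_limit_bytes) for one sandbox."""
    if overcommit_ratio < 1.0:
        # Assumption: a ratio below 1 would under-commit, so it is rejected here.
        raise ValueError("overcommit_ratio must be >= 1.0")
    ray_request = int(disk_bytes / overcommit_ratio)
    return ray_request, disk_bytes


GIB = 1024**3
node_disk = 100 * GIB  # disk advertised by one hypothetical worker

ray_request, docker_limit = split_disk_request(20 * GIB, overcommit_ratio=2.0)
print(node_disk // docker_limit)  # sandboxes per node without overcommit: 5
print(node_disk // ray_request)   # sandboxes per node with ratio 2.0: 10
```

Each container can still grow to its full 20 GiB limit, so a ratio above 1 trades a scheduling guarantee for density: if every sandbox fills its quota at once, the node runs out of physical disk.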

Development

Successfully merging this pull request may close these issues.

[Feature] Support user-defined disk quota in sandbox start request