feat(sandbox): add user-facing disk quota with Ray scheduling and metrics #977
Open
zhangjaycee wants to merge 5 commits into
Force-pushed from c3f30b7 to ebb6b5b
Force-pushed from 616cb36 to 23192a2
Unify the internal field name with the user-facing API field. Nacos keys (SANDBOX_DISK_LIMIT_ROOTFS_KEY, RuntimeConfig.sandbox_disk_limit_rootfs) are intentionally kept unchanged to avoid breaking live production config.
…dation
- add `disk` field to SandboxStartRequest, SandboxConfig, and StandardSpec for user-specified disk quota
- update _apply_disk_limits priority: user request > Nacos > RuntimeConfig > None
- set max_allowed_spec.disk default to 256g in RuntimeConfig
- validate disk_limit_rootfs against max_allowed_spec.disk in SandboxManager and raise BadRequestRockError on excess
- propagate `disk` field from SDK SandboxConfig through to admin start request via HTTP
- add compatibility handling in from_request to exclude `disk` field from request dump
- ensure backward compatibility: Optional field defaults to None, existing requests unaffected
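The priority chain and cap validation listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `parse_size` and `resolve_disk` are hypothetical helpers, `ValueError` stands in for `BadRequestRockError`, and the sketch assumes that sizes are Docker-style strings such as `20g` and that the cap applies to whichever value wins the priority chain.

```python
from typing import Optional

# Binary multipliers for Docker-style size suffixes (assumed format, e.g. "20g").
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(value: str) -> int:
    """Convert a size string such as '20g' or '512m' to bytes (hypothetical helper)."""
    text = value.strip().lower()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)  # a bare number is treated as bytes


def resolve_disk(
    user: Optional[str],
    nacos: Optional[str],
    runtime_config: Optional[str],
    max_allowed: str = "256g",
) -> Optional[str]:
    """Pick the first non-None source: user request > Nacos > RuntimeConfig > None."""
    chosen = next((v for v in (user, nacos, runtime_config) if v is not None), None)
    if chosen is None:
        return None  # no limit configured anywhere
    if parse_size(chosen) > parse_size(max_allowed):
        # The PR raises BadRequestRockError here; ValueError is a stand-in.
        raise ValueError(f"disk {chosen} exceeds max allowed {max_allowed}")
    return chosen
```

With these assumptions, a request that omits `disk` falls through to the Nacos or RuntimeConfig value, which matches the backward-compatibility note above.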
Propagate disk_limit_rootfs as a Ray custom resource named "disk" (in bytes) so Ray's scheduler accounts for disk quota when placing sandbox actors on worker nodes.
- RayOperator._generate_actor_options: merge disk into custom_resources dict alongside existing node-pinning resource
- RayDeployment._generate_actor_options: add disk resource when disk_limit_rootfs is set
- Fix test_disk_user_field tests broken by rebase (image now required)
- Add unit tests for both _generate_actor_options paths
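As a rough sketch of the merge described above, the options dict passed to a Ray actor could be built like this. The function name, its parameters, and the `0.001` node-pinning amount are assumptions for illustration; only the `"disk"` resource name and the byte unit come from the commit message.

```python
from typing import Any, Dict, Optional


def generate_actor_options(
    num_cpus: float,
    memory_bytes: int,
    disk_bytes: Optional[int] = None,
    node_resource: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments suitable for a Ray actor's .options(...) call."""
    options: Dict[str, Any] = {"num_cpus": num_cpus, "memory": memory_bytes}
    resources: Dict[str, float] = {}
    if node_resource is not None:
        # Assumed node-pinning convention: request a tiny slice of a per-node resource.
        resources[node_resource] = 0.001
    if disk_bytes is not None:
        # Custom resource "disk", expressed in bytes, so the scheduler tracks quota.
        resources["disk"] = disk_bytes
    if resources:
        options["resources"] = resources
    return options
```

The result would be applied with something like `SomeActor.options(**opts).remote()`, where `SomeActor` is a hypothetical `@ray.remote` class. An actor requesting `"disk"` can only be placed on nodes that declared that resource at startup.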
Add disk resource gauges (resource.disk.total, resource.disk.available) alongside existing CPU and memory metrics, reading from Ray's custom resource "disk" declared by workers at startup.
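One way to derive the two gauge values is sketched below. The helper name and the zero default for clusters without a `"disk"` resource are assumptions; in a live cluster the two input maps would typically come from `ray.cluster_resources()` and `ray.available_resources()`.

```python
from typing import Dict, Mapping


def disk_gauges(
    total: Mapping[str, float], available: Mapping[str, float]
) -> Dict[str, float]:
    """Map Ray resource dicts to the disk gauge names used in this PR."""
    return {
        # Workers that never declared "disk" contribute nothing, hence 0.0 (assumed default).
        "resource.disk.total": float(total.get("disk", 0.0)),
        "resource.disk.available": float(available.get("disk", 0.0)),
    }
```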
Add a configurable disk overcommit ratio (Nacos > RuntimeConfig YAML) so that Ray scheduling requests only disk / ratio of the "disk" resource while Docker still enforces the full disk value. This allows more containers to be scheduled per node when not all of them are expected to use their full disk allocation simultaneously.
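The arithmetic behind the overcommit ratio can be illustrated as follows. The function names, the default ratio of `1.0`, and rounding up to whole bytes are assumptions made for this sketch, not details confirmed by the PR.

```python
import math
from typing import Optional


def resolve_ratio(nacos: Optional[float], yaml_value: Optional[float]) -> float:
    """Nacos takes precedence over the RuntimeConfig YAML value; default is no overcommit."""
    if nacos is not None:
        return nacos
    if yaml_value is not None:
        return yaml_value
    return 1.0  # assumed default


def scheduling_disk_bytes(disk_bytes: int, overcommit_ratio: float = 1.0) -> int:
    """Amount of the "disk" resource requested from Ray; Docker still gets disk_bytes."""
    if overcommit_ratio <= 0:
        raise ValueError("overcommit ratio must be positive")
    return math.ceil(disk_bytes / overcommit_ratio)
```

For example, with a ratio of 2.0 a sandbox with a 20 GiB quota would reserve 10 GiB of the `"disk"` resource, so a node declaring 100 GiB could host ten such sandboxes instead of five.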
Force-pushed from 23192a2 to 7d91704
close #976
Summary
- `disk` field on SDK / API / status response, with cluster cap validation
- Ray custom resource `"disk"` scheduling; workers need `--resources='{"disk": <bytes>}'`
- Disk metrics gauges (`resource.disk.total`, `resource.disk.available`)
- Rename `disk_limit_rootfs` → `disk` (Nacos keys unchanged)
Design notes
- `disk` is new; no migration needed
- Config key `sandbox_disk_limit_rootfs` intentionally kept to avoid breaking live config
- `effective_disk` may be `None` when worker lacks `--storage-opt` support (graceful degradation)