Replies: 1 comment 1 reply
Thanks for the RFC! We’ll definitely need Phase 1 and Phase 2 for Pie to be practical, specifically namespace isolation during KV page export, the key-value store, and inter-inferlet communication, as you mentioned. For Phase 3, we can also take advantage of Wasmtime’s features, such as fuel. However, since security is another large concern that is orthogonal to Pie, I think we can deprioritize it relative to the other tasks. Phase 4 is complex enough that we could write a paper about it. Could we continue that discussion during the design of the multi-GPU batch scheduler?
RFC: Multi-tenant Support
Summary
Enable mutually untrusted users to run inferlets concurrently within a single Pie instance. This feature aims to improve hardware utilization by allowing more users to share GPU resources efficiently. Transitioning Pie into a multi-tenant engine will occur in four distinct phases.
Motivation
GPU efficiency is maximized with larger inference batch sizes, but single-user scenarios often underutilize the hardware. Running multiple inferlets in parallel increases efficiency, and with more users, it becomes easier to reach the optimal utilization point.
Multi-tenancy is also especially valuable for academic research groups. High-performance GPUs are expensive, and most research teams rely on shared clusters rather than providing each member with a dedicated machine. A multi-tenant Pie instance will enable group members to share high-end GPUs effectively for their inference research.
Supporting mutually untrusted users requires Pie to provide robust isolation between tenants. Currently, Pie assumes a single user who acts as both administrator and end user. This RFC outlines a phased approach to introducing multi-tenancy, with each phase addressing specific challenges. The progression between phases is designed to be incremental to avoid significant disruptions for early Pie adopters.
Phase 1: Implement Client-Server Model
Goal: Split the current multi-purpose CLI into a user-facing CLI (pie) and an administrator-facing CLI (pied).
pied manages the long-running Pie engine. It is responsible for starting the engine, attaching the selected backend, and listening for inferlet-related requests from users. It also maintains the state of all user sessions and submitted inferlets.
pie is the end-user tool for interacting with a running Pie engine. Before any operation, it must authenticate with the engine. It should support at minimum the following functions:
For authentication, pied should preferably use asymmetric key pairs, ideally leveraging existing SSH keys. Public keys are stored as authorized user identities within the Pie engine, making it straightforward to add or revoke access. This approach follows a widely adopted security model.
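The authorized-identity idea above can be sketched as a small registry that maps public keys to users. This is only an illustration under stated assumptions: `AuthorizedKeys` and its methods are hypothetical names, not part of Pie, keys are treated as opaque strings, and the signature check that would prove possession of the private key is deliberately left out.

```rust
use std::collections::HashMap;

/// Hypothetical registry of authorized user identities, keyed by public key.
/// Keys are opaque strings here (for example, one line of an SSH public key);
/// verifying a signature made with the matching private key is out of scope.
#[derive(Default)]
pub struct AuthorizedKeys {
    // public key -> user name
    by_key: HashMap<String, String>,
}

impl AuthorizedKeys {
    pub fn new() -> Self {
        Self::default()
    }

    /// Authorize `public_key` for `user`.
    /// Returns the previous owner if the key was already registered.
    pub fn add(&mut self, user: &str, public_key: &str) -> Option<String> {
        self.by_key.insert(public_key.to_string(), user.to_string())
    }

    /// Revoke a single key. Returns true if the key was registered.
    pub fn revoke_key(&mut self, public_key: &str) -> bool {
        self.by_key.remove(public_key).is_some()
    }

    /// Revoke every key that belongs to `user`. Returns how many were removed.
    pub fn revoke_user(&mut self, user: &str) -> usize {
        let before = self.by_key.len();
        self.by_key.retain(|_, owner| owner.as_str() != user);
        before - self.by_key.len()
    }

    /// Look up which user a presented public key belongs to, if any.
    pub fn identify(&self, public_key: &str) -> Option<&str> {
        self.by_key.get(public_key).map(String::as_str)
    }
}
```

Keeping the registry keyed by public key means revoking access is a single removal, and a user with several machines can simply register several keys.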
Tracking Issues and PRs:
pie submit that execute an inferlet on a long running engine #116
pie and system admin facing pied. #121
Phase 2: Control Visibility
Goal: Ensure that each user operates within an isolated namespace, preventing unauthorized access to other users’ resources.
By default, the following identifiers and statistics should remain private to the user who owns them, unless explicit sharing is granted:
Visibility enforcement should be implemented using language mechanisms wherever possible. In particular, we can leverage the Rust type system to define and control capabilities, drawing inspiration from Rust-based operating systems such as Theseus and Tock. Unsafe Rust should be limited to the minimal core components where it is strictly necessary, while the remainder of the engine should rely on safe Rust exclusively. This approach enables a type-driven capability model, providing compile-time guarantees for isolation and reducing the risk of accidental information leaks.
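One way to picture the type-driven capability model is a session token that only the engine can construct, so every access to a tenant-scoped resource has to present it. The sketch below is an assumption-laden illustration, not Pie's design: the `engine` module, `TenantId`, `Session`, `Engine`, and the string-valued store are all hypothetical, and `login` stands in for real authentication.

```rust
mod engine {
    use std::collections::HashMap;

    /// Tenant identifier. The field is private, so code outside this
    /// module cannot forge one.
    #[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
    pub struct TenantId(u32);

    /// Capability proving the holder was authenticated as a tenant.
    /// It has a private field and is not `Clone`, so it can only come
    /// from `Engine::login`.
    pub struct Session {
        tenant: TenantId,
    }

    impl Session {
        pub fn tenant(&self) -> TenantId {
            self.tenant
        }
    }

    /// Hypothetical engine holding a per-tenant key-value store.
    #[derive(Default)]
    pub struct Engine {
        next_tenant: u32,
        // (owner, key) -> value; the owner is part of every lookup key.
        store: HashMap<(TenantId, String), String>,
    }

    impl Engine {
        pub fn new() -> Self {
            Self::default()
        }

        /// Stand-in for authentication; the only place a `Session` is built.
        pub fn login(&mut self) -> Session {
            let tenant = TenantId(self.next_tenant);
            self.next_tenant += 1;
            Session { tenant }
        }

        /// Writes always land in the namespace of the presented session.
        pub fn put(&mut self, session: &Session, key: &str, value: &str) {
            self.store
                .insert((session.tenant, key.to_string()), value.to_string());
        }

        /// Reads can only see entries owned by the presented session.
        pub fn get(&self, session: &Session, key: &str) -> Option<&str> {
            self.store
                .get(&(session.tenant, key.to_string()))
                .map(String::as_str)
        }
    }
}
```

Because there is no API that takes a bare tenant number, a caller that never received a `Session` for another tenant has no way to name that tenant's entries, and the compiler rejects attempts to build a `Session` or `TenantId` outside the module.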
Tracking Issues and PRs: TBD
Phase 3: Enforce Spatial Resource Limits
Goal: Ensure that each user operates within defined resource constraints to prevent any single tenant from exhausting system resources.
To protect the overall stability of the Pie engine in a multi-tenant environment, the following resource categories must have enforceable limits:
Enforcement should build on the capability-based isolation mechanisms introduced in Phase 2. Where possible, limits should be applied in a manner that allows for graceful performance degradation rather than abrupt termination. For example, if the number of KV cache pages exceeds a defined threshold, the engine can transparently swap less active pages between GPU memory and CPU DRAM. This approach preserves ongoing inferlet execution while reducing performance instead of immediately terminating inferlets upon failed allocation.
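The graceful-degradation idea for KV cache pages can be sketched as a per-tenant accountant with a soft GPU limit and a hard total limit. Everything here is hypothetical (`TenantKvPages`, the two limits, and the least-recently-used eviction choice are assumptions for illustration), and the sketch only tracks where pages live; it does not move any real memory.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Residency {
    Gpu,
    Cpu,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AllocError {
    QuotaExceeded,
}

/// Hypothetical per-tenant bookkeeping for KV cache pages.
pub struct TenantKvPages {
    gpu_limit: usize,   // soft limit: pages allowed to stay on the GPU
    total_limit: usize, // hard limit: GPU pages plus swapped-out pages
    gpu: VecDeque<u64>, // front = least recently used
    cpu: Vec<u64>,      // pages swapped out to CPU DRAM
    next_id: u64,
}

impl TenantKvPages {
    pub fn new(gpu_limit: usize, total_limit: usize) -> Self {
        assert!(gpu_limit >= 1 && total_limit >= gpu_limit);
        Self { gpu_limit, total_limit, gpu: VecDeque::new(), cpu: Vec::new(), next_id: 0 }
    }

    /// If the GPU share is full, swap the least recently used page to CPU.
    fn make_room(&mut self) {
        if self.gpu.len() >= self.gpu_limit {
            if let Some(victim) = self.gpu.pop_front() {
                self.cpu.push(victim);
            }
        }
    }

    /// Allocate a new page. Exceeding the soft limit swaps a page out
    /// instead of failing; only the hard limit produces an error.
    pub fn allocate(&mut self) -> Result<u64, AllocError> {
        if self.gpu.len() + self.cpu.len() >= self.total_limit {
            return Err(AllocError::QuotaExceeded);
        }
        self.make_room();
        let id = self.next_id;
        self.next_id += 1;
        self.gpu.push_back(id);
        Ok(id)
    }

    /// Mark a page as used, swapping it back in if needed.
    /// Returns false if the page does not belong to this tenant.
    pub fn touch(&mut self, page: u64) -> bool {
        if let Some(pos) = self.gpu.iter().position(|&p| p == page) {
            let _ = self.gpu.remove(pos);
        } else if let Some(pos) = self.cpu.iter().position(|&p| p == page) {
            self.cpu.swap_remove(pos);
            self.make_room();
        } else {
            return false;
        }
        self.gpu.push_back(page);
        true
    }

    pub fn residency(&self, page: u64) -> Option<Residency> {
        if self.gpu.contains(&page) {
            Some(Residency::Gpu)
        } else if self.cpu.contains(&page) {
            Some(Residency::Cpu)
        } else {
            None
        }
    }
}
```

With a GPU limit of 2 and a total limit of 3, the third allocation succeeds by swapping the oldest page out, and only the fourth allocation is refused.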
Tracking Issues and PRs: TBD
Phase 4: Implement Performance Isolation
Goal: Ensure fair and efficient allocation of computing resources across all tenants.
Each tenant should receive a fair-share allocation of CPU resources, along with an upper limit on maximum CPU usage. When a single tenant is active, they may utilize up to the maximum CPU limit. When inferlets from multiple users are running concurrently, CPU scheduling should proportionally reflect each tenant’s fair-share allocation.
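The CPU policy described above resembles weighted max-min fairness with per-tenant caps. The following sketch uses that technique as an assumption; the RFC does not prescribe an algorithm. `Tenant`, `weight`, `max_cpu`, and `allocate_cpu` are hypothetical names, and CPU is measured in abstract units such as cores.

```rust
/// Hypothetical description of a tenant for CPU allocation purposes.
#[derive(Debug, Clone)]
pub struct Tenant {
    pub weight: f64,  // fair-share weight
    pub max_cpu: f64, // upper limit on CPU usage
    pub active: bool, // whether the tenant currently has runnable inferlets
}

/// Split `capacity` among active tenants in proportion to their weights,
/// never exceeding a tenant's `max_cpu`. Capacity left over by capped
/// tenants is redistributed to the others, again by weight.
/// Returns one allocation per tenant, in the same order as the input.
pub fn allocate_cpu(tenants: &[Tenant], capacity: f64) -> Vec<f64> {
    let mut alloc = vec![0.0; tenants.len()];
    // Active tenants that have not reached their cap yet.
    let mut open: Vec<usize> = (0..tenants.len())
        .filter(|&i| tenants[i].active && tenants[i].weight > 0.0)
        .collect();
    let mut remaining = capacity;

    while !open.is_empty() && remaining > 1e-9 {
        let total_weight: f64 = open.iter().map(|&i| tenants[i].weight).sum();
        let mut still_open = Vec::new();
        let mut handed_out = 0.0;

        for &i in &open {
            let offer = remaining * tenants[i].weight / total_weight;
            let room = tenants[i].max_cpu - alloc[i];
            if offer >= room {
                // Tenant hits its cap; the surplus is shared in the next round.
                alloc[i] += room;
                handed_out += room;
            } else {
                alloc[i] += offer;
                handed_out += offer;
                still_open.push(i);
            }
        }

        remaining -= handed_out;
        if still_open.len() == open.len() {
            break; // nobody was capped, so everything has been handed out
        }
        open = still_open;
    }
    alloc
}
```

A single active tenant ends up with its maximum (or the whole machine, if that is smaller), while several uncapped tenants split the capacity in proportion to their weights.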
GPU resource isolation is a greater challenge because cross-tenant batching, which is essential for achieving optimal system throughput, combines requests from different tenants into the same GPU batch. We will need to explore the scheduling and batching strategies used in recent work on shared GPU serving systems.
Tracking Issues and PRs: TBD
Potential Drawbacks