RFC: Multi-tenant Support #119

zyma98 · 2025-10-16T22:10:54Z

zyma98
Oct 16, 2025
Maintainer

RFC: Multi-tenant Support

Feature name: multi-tenant
Start date: 2025-10-16
Tracking issues and PRs: See phases below for details

Summary

Enable mutually untrusted users to run inferlets concurrently within a single Pie instance. This feature aims to improve hardware utilization by allowing more users to share GPU resources efficiently. Transitioning Pie into a multi-tenant engine will occur in four distinct phases.

Motivation

GPU efficiency is maximized with larger inference batch sizes, but single-user scenarios often underutilize the hardware. Running multiple inferlets in parallel increases efficiency, and with more users, it becomes easier to reach the optimal utilization point.

Multi-tenancy is also especially valuable for academic research groups. High-performance GPUs are expensive, and most research teams rely on shared clusters rather than providing each member with a dedicated machine. A multi-tenant Pie instance will enable group members to share high-end GPUs effectively for their inference research.

Supporting mutually untrusted users requires Pie to provide robust isolation between tenants. Currently, Pie assumes a single user who acts as both administrator and end user. This RFC outlines a phased approach to introducing multi-tenancy, with each phase addressing specific challenges. The progression between phases is designed to be incremental to avoid significant disruptions for early Pie adopters.

Phase 1: Implement Client-Server Model

Goal: Split the current multi-purpose CLI into a user-facing CLI (pie) and an administrator-facing CLI (pied).

pied manages the long-running Pie engine. It is responsible for starting the engine, attaching the selected backend, and listening for inferlet-related requests from users. It also maintains the state of all user sessions and submitted inferlets.

pie is the end-user tool for interacting with a running Pie engine. Before any operation, it must authenticate with the engine. It should support at minimum the following functions:

Submit an inferlet for execution
Query the status of a submitted inferlet
Stream the output of a running or completed inferlet
Wait for a submitted inferlet to finish
Terminate a submitted inferlet
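
The five functions above could map onto a small request/response protocol between pie and pied. The following is a minimal in-memory sketch, not the actual Pie API: every name in it (`Request`, `Response`, `Status`, `InferletId`, `Engine::handle`) is hypothetical, and the wire format, authentication, and real execution are all left out.

```rust
use std::collections::HashMap;

/// Hypothetical identifier returned to the client on submission.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct InferletId(pub u64);

/// Hypothetical lifecycle states of a submitted inferlet.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Status {
    Running,
    Finished,
    Terminated,
}

/// One variant per client function listed in the RFC.
#[derive(Debug)]
pub enum Request {
    Submit { binary: Vec<u8> },
    Query(InferletId),
    Stream(InferletId),
    Wait(InferletId),
    Terminate(InferletId),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Response {
    Submitted(InferletId),
    Status(Status),
    Output(String),
    NotFound,
}

struct Entry {
    status: Status,
    output: String,
}

/// Stand-in for the state that pied keeps about submitted inferlets.
#[derive(Default)]
pub struct Engine {
    next_id: u64,
    inferlets: HashMap<InferletId, Entry>,
}

impl Engine {
    pub fn handle(&mut self, req: Request) -> Response {
        match req {
            Request::Submit { binary } => {
                let id = InferletId(self.next_id);
                self.next_id += 1;
                // A real engine would instantiate and run the binary;
                // this sketch only records its size as fake output.
                let output = format!("loaded {} bytes", binary.len());
                self.inferlets.insert(id, Entry { status: Status::Running, output });
                Response::Submitted(id)
            }
            Request::Query(id) => match self.inferlets.get(&id) {
                Some(e) => Response::Status(e.status.clone()),
                None => Response::NotFound,
            },
            Request::Stream(id) => match self.inferlets.get(&id) {
                Some(e) => Response::Output(e.output.clone()),
                None => Response::NotFound,
            },
            Request::Wait(id) => match self.inferlets.get_mut(&id) {
                Some(e) => {
                    // A real server would block until the inferlet stops running.
                    // This single-threaded sketch simulates completion instead.
                    if e.status == Status::Running {
                        e.status = Status::Finished;
                    }
                    Response::Status(e.status.clone())
                }
                None => Response::NotFound,
            },
            Request::Terminate(id) => match self.inferlets.get_mut(&id) {
                Some(e) => {
                    // Only a running inferlet can be terminated.
                    if e.status == Status::Running {
                        e.status = Status::Terminated;
                    }
                    Response::Status(e.status.clone())
                }
                None => Response::NotFound,
            },
        }
    }
}
```

Modeling the operations as one enum keeps the client and server in agreement about the supported functions: adding a sixth operation forces every `match` on `Request` to be updated.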

For authentication, pied should preferably use asymmetric key pairs, ideally leveraging existing SSH keys. Public keys are stored as authorized user identities within the Pie engine, making it straightforward to add or revoke access. This approach follows a widely adopted security model.
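
To illustrate why storing public keys makes access easy to add or revoke, here is a rough sketch of an authorized-key registry. It is an assumption-laden outline rather than the planned design: `AuthorizedKeys` and the `Verifier` trait are hypothetical names, and the signature check is deliberately left abstract because a real build would delegate it to an SSH or Ed25519 library.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over signature verification.
/// A real implementation would call into a cryptography library.
pub trait Verifier {
    fn verify(&self, public_key: &[u8], challenge: &[u8], signature: &[u8]) -> bool;
}

/// Hypothetical registry of authorized identities: public key bytes -> user name.
pub struct AuthorizedKeys {
    keys: HashMap<Vec<u8>, String>,
}

impl AuthorizedKeys {
    pub fn new() -> Self {
        AuthorizedKeys { keys: HashMap::new() }
    }

    /// Granting access is a single insertion.
    pub fn add(&mut self, user: &str, public_key: &[u8]) {
        self.keys.insert(public_key.to_vec(), user.to_string());
    }

    /// Revoking access is a single removal; returns whether the key was present.
    pub fn revoke(&mut self, public_key: &[u8]) -> bool {
        self.keys.remove(public_key).is_some()
    }

    /// Challenge-response check: the key must be registered AND the client
    /// must present a valid signature over the server-chosen challenge.
    pub fn authenticate<V: Verifier>(
        &self,
        verifier: &V,
        public_key: &[u8],
        challenge: &[u8],
        signature: &[u8],
    ) -> Option<&str> {
        self.keys
            .get(public_key)
            .filter(|_| verifier.verify(public_key, challenge, signature))
            .map(|user| user.as_str())
    }
}
```

The server never stores a secret in this scheme, so leaking the registry does not let an attacker impersonate a user.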

Tracking Issues and PRs:

Phase 2: Control Visibility

Goal: Ensure that each user operates within an isolated namespace, preventing unauthorized access to other users’ resources.

By default, the following identifiers and statistics should remain private to the user who owns them, unless explicit sharing is granted:

Handles for submitted inferlets
Handles for exported KV cache pages
Endpoints of publish/subscribe channels
Buffers containing any intermediate or final output
Resource usage statistics

Visibility enforcement should be implemented using language mechanisms wherever possible. In particular, we can leverage the Rust type system to define and control capabilities, drawing inspiration from Rust-based operating systems such as Theseus and Tock. Unsafe Rust should be limited to the minimal core components where it is strictly necessary, while the remainder of the engine should rely on safe Rust exclusively. This approach enables a type-driven capability model, providing compile-time guarantees for isolation and reducing the risk of accidental information leaks.
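
One way to picture the type-driven capability model is with handles whose fields are private to a module, so code outside that module cannot forge them. The sketch below is only an illustration under that assumption. `Session`, `TenantId`, `KvPageHandle`, and `Engine` are hypothetical names, and `login` stands in for the Phase 1 authentication step.

```rust
pub mod capability {
    use std::collections::HashMap;

    /// Tenant identity. The field is private, so it cannot be forged outside this module.
    #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
    pub struct TenantId(u32);

    /// Proof that the holder authenticated as a tenant.
    pub struct Session {
        tenant: TenantId,
    }

    impl Session {
        pub fn tenant(&self) -> TenantId {
            self.tenant
        }
    }

    /// Unforgeable capability for one exported KV cache page.
    /// Not `Clone`: duplication happens only through `Engine::share`.
    #[derive(Debug)]
    pub struct KvPageHandle {
        owner: TenantId,
        page: u64,
    }

    pub struct Engine {
        next_page: u64,
        pages: HashMap<u64, Vec<u8>>,
    }

    impl Engine {
        pub fn new() -> Self {
            Engine { next_page: 0, pages: HashMap::new() }
        }

        /// Placeholder for real authentication; mints a session for `raw_id`.
        pub fn login(&self, raw_id: u32) -> Session {
            Session { tenant: TenantId(raw_id) }
        }

        /// Exports a page and returns a handle bound to the caller's tenant.
        pub fn export_page(&mut self, session: &Session, data: Vec<u8>) -> KvPageHandle {
            let page = self.next_page;
            self.next_page += 1;
            self.pages.insert(page, data);
            KvPageHandle { owner: session.tenant, page }
        }

        /// Reads succeed only when the handle was issued to the caller's tenant.
        pub fn read_page(&self, session: &Session, handle: &KvPageHandle) -> Option<&[u8]> {
            if handle.owner != session.tenant {
                return None;
            }
            self.pages.get(&handle.page).map(|data| data.as_slice())
        }

        /// Explicit sharing: the owner derives a new handle for another tenant.
        pub fn share(
            &self,
            session: &Session,
            handle: &KvPageHandle,
            grantee: TenantId,
        ) -> Option<KvPageHandle> {
            if handle.owner != session.tenant {
                return None;
            }
            Some(KvPageHandle { owner: grantee, page: handle.page })
        }
    }
}
```

Because `KvPageHandle` has no public constructor, the compiler rejects any attempt outside the module to build a handle for someone else's page. The remaining run-time check only compares the handle's owner with the session.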

Tracking Issues and PRs: TBD

Phase 3: Enforce Spatial Resource Limits

Goal: Ensure that each user operates within defined resource constraints to prevent any single tenant from exhausting system resources.

To protect the overall stability of the Pie engine in a multi-tenant environment, the following resource categories must have enforceable limits:

Number of unreaped (pending) inferlets
Number of KV cache pages
Number of publish/subscribe channel endpoints
Number of created network endpoints
Total runtime memory allocated by inferlets
Total size of buffered output
Total size of allocated inferlet binaries

Enforcement should build on the capability-based isolation mechanisms introduced in Phase 2. Where possible, limits should be applied in a manner that allows for graceful performance degradation rather than abrupt termination. For example, if the number of KV cache pages exceeds a defined threshold, the engine can transparently swap less active pages out of GPU memory into CPU DRAM and bring them back on demand. The affected inferlets keep running at reduced performance instead of being terminated as soon as an allocation fails.
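
The difference between degrading gracefully and terminating can be sketched as a quota with a soft and a hard limit. This is a simplified illustration with hypothetical names (`KvPageQuota`, `Outcome`, `QuotaError`). It tracks only page counts and leaves open how the engine would pick which less active page to swap out.

```rust
/// Result of a successful page allocation (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The new page fits within the tenant's GPU budget.
    Resident,
    /// The GPU budget was full, so one less active page was moved to CPU DRAM.
    SwappedOutOne,
}

#[derive(Debug, PartialEq, Eq)]
pub enum QuotaError {
    HardLimitExceeded,
}

/// Per-tenant KV page accounting with a soft (GPU) and a hard (total) limit.
pub struct KvPageQuota {
    gpu_limit: usize,
    total_limit: usize,
    on_gpu: usize,
    on_cpu: usize,
}

impl KvPageQuota {
    pub fn new(gpu_limit: usize, total_limit: usize) -> Self {
        KvPageQuota { gpu_limit, total_limit, on_gpu: 0, on_cpu: 0 }
    }

    pub fn allocate(&mut self) -> Result<Outcome, QuotaError> {
        // Hard limit: refuse the allocation outright.
        if self.on_gpu + self.on_cpu >= self.total_limit {
            return Err(QuotaError::HardLimitExceeded);
        }
        if self.on_gpu < self.gpu_limit {
            self.on_gpu += 1;
            Ok(Outcome::Resident)
        } else {
            // Soft limit: the new page takes the GPU slot of a swapped-out page,
            // so the GPU count is unchanged and the CPU count grows by one.
            self.on_cpu += 1;
            Ok(Outcome::SwappedOutOne)
        }
    }

    pub fn on_gpu(&self) -> usize {
        self.on_gpu
    }

    pub fn on_cpu(&self) -> usize {
        self.on_cpu
    }
}
```

With `KvPageQuota::new(2, 3)`, the first two allocations stay resident, the third triggers a swap, and only the fourth fails.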

Tracking Issues and PRs: TBD

Phase 4: Implement Performance Isolation

Goal: Ensure fair and efficient allocation of computing resources across all tenants.

Each tenant should receive a fair-share allocation of CPU resources, along with an upper limit on maximum CPU usage. When a single tenant is active, they may utilize up to the maximum CPU limit. When inferlets from multiple users are running concurrently, CPU scheduling should proportionally reflect each tenant’s fair-share allocation.
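
As a rough illustration of the CPU policy described above, the sketch below computes each tenant's share of CPU time from a weight and a cap. The `Tenant` struct and `cpu_shares` function are hypothetical, and the sketch makes one simplifying assumption: capacity left unused because a tenant hit its cap is not redistributed to the others.

```rust
/// Hypothetical per-tenant scheduling parameters.
pub struct Tenant {
    /// Relative fair-share weight.
    pub weight: f64,
    /// Upper limit on the fraction of CPU this tenant may use (0.0 to 1.0).
    pub max_share: f64,
    /// Whether the tenant currently has runnable inferlets.
    pub active: bool,
}

/// Returns the CPU fraction assigned to each tenant, in input order.
/// Active tenants split the CPU in proportion to their weights, and each
/// result is clamped to the tenant's cap. Inactive tenants get 0.0.
pub fn cpu_shares(tenants: &[Tenant]) -> Vec<f64> {
    let total_weight: f64 = tenants
        .iter()
        .filter(|t| t.active)
        .map(|t| t.weight)
        .sum();

    tenants
        .iter()
        .map(|t| {
            if t.active && total_weight > 0.0 {
                (t.weight / total_weight).min(t.max_share)
            } else {
                0.0
            }
        })
        .collect()
}
```

A single active tenant with a cap of 0.8 receives 0.8 of the CPU. Two active tenants with weights 1 and 3 and the same cap receive 0.25 and 0.75.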

GPU resource isolation presents a greater challenge because cross-tenant batching, which is essential for achieving optimal system throughput, mixes work from different tenants in the same batch. We will need to explore the scheduling and batching strategies used in recent work on shared GPU serving systems.

Tracking Issues and PRs: TBD

Potential Drawbacks

Increased complexity in scheduling to account for quotas of each tenant
Increased performance overhead for single-user scenarios
Multiple breaking changes for early Pie adopters due to phased implementation

ingim · 2025-10-25T03:46:23Z

ingim
Oct 25, 2025
Maintainer

Thanks for the RFC! We’ll definitely need Phase 1 and Phase 2 for Pie to be practical. Specifically, we need namespace isolation for KV page export, the key-value store, and inter-inferlet communication, as you mentioned.

For Phase 3, we can also take advantage of Wasmtime’s features, such as fuel. However, since security is another large concern that is orthogonal to Pie, I think we can deprioritize it relative to the other tasks.

For Phase 4, I think it’s complex enough that we could write a paper about it. We can continue the discussion during the design of the multi-GPU batch scheduler?

1 reply

zyma98 Oct 27, 2025
Maintainer Author

For Phase 3, I'd like to note that it's not for security. It's just to prevent a buggy (though potentially malicious) user from submitting too many inferlets or using so much memory that other users' inferlets crash.

And I am indeed planning to submit some followup papers based on these. Let's definitely chat.
