I feel this RFC reveals a more fundamental issue in Pie: the lack of inferlet composability and polymorphism. Having thought about the problem for a while, I propose a slightly different approach.
Nested Inferlet Calls
Motivation
I propose we implement nested inferlet calls. This allows inferlets to compose other inferlets directly, similar to library imports. With the recent addition of
Design Sketch: Rust
#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    await import("std/text-completion").run(
        prompt="Hello, my name is ",
        max_tokens=20,
        drafter="std/cacheback@latest", // IMPORTANT: Inferlet polymorphism via function composition. See below.
    )
    await import("std/python-3.12").run(
        code="...",
    )
    await import("std/javascript").run(
        code="...",
    )
}
// Standard inferlets are assumed to be always available, and therefore have a pre-instantiated instance.
use inferlet.std.text_completion as text_completion
await text_completion.run(
    prompt="Hello, my name is ",
    max_tokens=20,
    drafter="std/cacheback@latest",
)
Design Sketch: Python SDK
# note that "std" dependencies can be omitted.
@inferlet(deps=["std/text-completion","std/cacheback@latest"])
async def my_agent(prompt: str, max_tokens: int) -> str:
    # numpy is prepackaged in the std/python-3.12 inferlet environment
    import numpy as np
    response = await use("std/text-completion").run( # since "import" is a reserved keyword in python, "use" instead.
        prompt=prompt,
        max_tokens=max_tokens,
        drafter="std/cacheback@latest",
    )
    return response
# if the user wants to use custom packages, they must bring their own interpreter inferlet.
@inferlet(
    interpreter="ingim/my-custom-python-env",
    deps=[
        "ingim/my-custom-python-env",
    ]
)
async def my_agent_with_custom_env(prompt: str, max_tokens: int) -> str:
    import custom_package
    return response
# using inferlets is as simple as calling a function:
with PieClient("localhost:8080"):
    response = await my_agent(
        prompt="Hello, my name is ",
        max_tokens=20,
    )
Manifest Changes (Pie.toml)
Current:
[package]
name = "std/text-completion"
version = "0.1.0"
description = "Simple text completion inferlet"
repository = "https://github.com/pie-project/pie"
[engine]
min_version = "^1.0.0"
[interface]
inputs = [
{ name = "prompt", type = "string", description = "The user message to complete" },
{ name = "system", type = "string", optional = true, description = "System prompt to set assistant behavior" },
{ name = "max_tokens", type = "int", optional = true, description = "Maximum number of tokens to generate (default: 256)" },
{ name = "temperature", type = "float", optional = true, description = "Sampling temperature (default: 0.6)" },
{ name = "top_p", type = "float", optional = true, description = "Top-p nucleus sampling threshold (default: 0.95)" },
]
outputs = [
{ name = "completion", type = "string", description = "The generated text completion" }
]
Updated Proposal:
[package]
name = "std/text-completion"
version = "0.1.0"
description = "Simple text completion inferlet"
repository = "https://github.com/pie-project/pie"
[engine]
min_version = "^1.0.0"
[interface]
inputs = [
{ name = "prompt", type = "string", description = "The user message to complete" },
{ name = "system", type = "string", optional = true, description = "System prompt to set assistant behavior" },
{ name = "max_tokens", type = "int", optional = true, description = "Maximum number of tokens to generate (default: 256)" },
{ name = "temperature", type = "float", optional = true, description = "Sampling temperature (default: 0.6)" },
{ name = "top_p", type = "float", optional = true, description = "Top-p nucleus sampling threshold (default: 0.95)" },
]
outputs = [
{ name = "completion", type = "string", description = "The generated text completion" }
]
[dependencies]
"std/text-completion" = "latest"
"std/cacheback" = "latest"
Inferlet Interfaces & Polymorphism
Any inferlet that matches the signature can be swapped in. This is critical for higher-order functions, such as passing a "drafter" inferlet into a text completion inferlet.
[interface]
inputs = [
{ name = "prompt", type = "string", description = "The user message to complete" },
{ name = "drafter", type = "inferlet(context:string)->string", optional = true, description = "Drafter inferlet module" },
]
outputs = [
{ name = "completion", type = "string", description = "The generated text completion" }
]
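To make the polymorphism idea concrete, here is a minimal Python sketch. It assumes the manifest type `inferlet(context:string)->string` maps onto an async callable that takes a string and returns a string. The names `Drafter`, `echo_drafter`, `empty_drafter`, `text_completion`, and `run` are hypothetical and not part of any Pie SDK.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical alias mirroring the manifest type `inferlet(context:string)->string`.
Drafter = Callable[[str], Awaitable[str]]


async def echo_drafter(context: str) -> str:
    """Toy drafter: proposes the last word of the context again."""
    words = context.split()
    return words[-1] if words else ""


async def empty_drafter(context: str) -> str:
    """Toy drafter that never proposes anything."""
    return ""


async def text_completion(prompt: str, drafter: Optional[Drafter] = None) -> str:
    """Stand-in for a completion inferlet; it depends only on the drafter's signature."""
    draft = await drafter(prompt) if drafter is not None else ""
    return f"{prompt}[draft={draft!r}]"


def run(prompt: str, drafter: Optional[Drafter] = None) -> str:
    """Synchronous helper so the sketch can be exercised without an event loop."""
    return asyncio.run(text_completion(prompt, drafter))
```

Any callable with the same signature can be passed as `drafter`, which is the swap-in property described above.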
@zyma98 what do you think? My proposal is far from complete. Your input is appreciated.
-
|
I just realized I missed explaining how this mechanism resolves the "fat interpreter" problem. So basically, the server launches If the user wants to import a custom library, then they must build a new Python interpreter inferlet (with the compatible input signature as the
-
|
Hey @ingim, thank you for suggesting the new approach. This is something I hadn't thought about. Now I start to see a strong duality between inferlets and conventional programs regarding composition.
Language Specific Solution
It's the same in inferlet programming as in conventional programming. Composable parts are written in the same language and interact through function calls. One example is our current monolithic Rust
Language Agnostic Solution - Static Linking
In inferlet programming, this approach is to compile each composable unit into WASM. The interface between them is defined by WIT. One example is my experimental
Language Agnostic Solution - Dynamic Linking
This is my proposal in this RFC. Composable units are still compiled into WASM, but linking happens at runtime to allow applications to share a single library instance. In conventional programming, the counterpart is dynamic linking.
Language Agnostic Solution - Multiprocessing
This is your new proposal. Composable units are executable WASM. I feel like this is the multiprocessing approach in conventional programming. I have a few concerns about this approach:
With all that said, my suggestion is not to kill any of these approaches at the moment. In conventional programming, all four approaches thrive. I'd recommend that we implement all four and compare their pros and cons with quantitative performance statistics. Approaches 1 and 2 are already running. I have proof-of-concept code for approach 3 running. Approach 4 also does not sound hard.
-
|
@zyma98 Thank you for the comments! That makes a lot of sense. After reflecting on your comment, I realized my motivations were actually:
I am convinced that dynamic linking is the right mechanism to achieve this, because it is fully compatible with the design sketch I proposed earlier. For example, we could express this as an explicit
#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    // Dynamically link to the interface at runtime and execute its main entry point
    await link("std/text-completion").main(
        prompt="Hello, my name is ",
        max_tokens=20,
        drafter="std/cacheback@latest",
    )
    await link("std/python-3.12").main(
        code="...",
    )
}
And the "std/text-completion" manifest (Pie.toml) can define the interface (ie main
[[interface]]
[main]
inputs = ...
outputs = ...
I am currently cautious about "library inferlets" (inferlets that expose multiple public functions other than
Does this make sense to you?
-
|
Thank you for your comments @ingim! I gained some clarity about our proposed design. I'm summarizing it below to ensure that we are on the same page.
Application vs Library Inferlets
Mechanically, there is no difference at the interface level. Both application and library inferlets export interfaces. Application inferlets export the
The proposed
Load-time vs Runtime Symbol Resolution
My proposed approach will use load-time symbol resolution. The dependencies are recorded in the compiled WASM binary, mirroring the imports and exports in the WIT files used during compilation. The engine verifies that the dependencies can be satisfied at load time, i.e., WASM instantiation time. The counterpart in conventional programming is linking against a dynamic library via the
Your proposed approach will use runtime symbol resolution. The WASM binary does not record the dependencies. The counterpart in conventional programming is the POSIX
I do have some concerns about using runtime symbol resolution plus exporting all functionalities through
Runtime symbol resolution definitely allows more flexibility in applications. However, function calls across WASM components will need to go through more indirection and suffer larger overheads. For now, I don't have an estimate of how large the additional overhead may be.
Exporting all functionalities of an inferlet through the
The code may also become less ergonomic. For example, a drafter may need to keep some state (e.g., Cacheback's cache) across decoding iterations. If the
I totally agree that we should make the component interface simple and reusable. In my envisioned design, we will just use the WIT file as the canonical interface definition. As long as we keep each library focused on a small purpose, I believe it's manageable.
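The contrast between the two resolution strategies can be illustrated with a small Python model. This is only a sketch of the idea, not how wasmtime or the Pie engine behaves; `Registry`, `LoadTimeComponent`, and `RuntimeComponent` are hypothetical names introduced here.

```python
class Registry:
    """Exports of already loaded libraries: interface name -> {function name -> callable}."""

    def __init__(self):
        self.exports = {}

    def load(self, interface, functions):
        self.exports[interface] = dict(functions)


class LoadTimeComponent:
    """Declares its imports up front; they are verified once, at instantiation."""

    def __init__(self, registry, imports):
        missing = [(i, f) for i, f in imports if f not in registry.exports.get(i, {})]
        if missing:
            raise ImportError(f"unsatisfied imports: {missing}")
        # Bind each import once so later calls skip the name lookup.
        self.bound = {(i, f): registry.exports[i][f] for i, f in imports}

    def call(self, interface, function, *args):
        return self.bound[(interface, function)](*args)


class RuntimeComponent:
    """Declares nothing; every call is resolved by name and may fail mid-execution."""

    def __init__(self, registry):
        self.registry = registry

    def call(self, interface, function, *args):
        try:
            target = self.registry.exports[interface][function]
        except KeyError:
            raise LookupError(f"{interface}.{function} is not loaded") from None
        return target(*args)
```

In this model a missing dependency surfaces at instantiation in the load-time variant, and only at the first failing call in the runtime variant.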
-
|
Would it be possible to (1) share some example inferlet code using the proposed approach, both in Rust and inline Python, that calls into other inferlets (or inferlet libraries), and (2) imagine how it would fit into a package manager ecosystem? (e.g., can pie load be replaced by a [dependencies] section in the manifest? Should WIT be included as metadata in the bakery registry, or should it be generated on the fly from the Pie.toml manifest?) As long as the end user experience is pleasant, I’m happy with your approach.
-
|
I'm not familiar with how the inline Python inferlet currently works. For Rust, it will look very similar to how we currently program. Here is the text completion application written with the inferlet libraries. The
We can definitely automate dependency loading with the package management system. We can specify the dependency library in a
I prefer including the WIT as part of the metadata in the bakery registry. It will make it easier to verify that the specified dependencies have matching WIT interfaces.
-
|
If we go with the load-time resolution approach, it seems like this effectively moves us to the WASM component model standard. In this picture, Pie becomes just a specialized WASM component runtime, and "dynamic linking" is essentially solving the composition and "big binary" problem. I agree this is a good technical direction; piggybacking on existing tech is usually better than reinventing the wheel. However, I am worried about the user experience. Most of our users (AI engineers, Python devs) won't want to learn WIT or manage
Since Pie is an LLM serving system, we should balance flexibility with simplicity.
Option 1: Hide the Component Model with a "magic tool"
It keeps the manifest simple:
[package]
name = "my-agent"
# Auto-generates the WIT export for the main entry point
[interface]
inputs = [
{ name = "prompt", type = "string" },
{ name = "temperature", type = "float", optional = true }
]
outputs = [
{ name = "completion", type = "string" }
]
# used to generate WIT imports automatically
[dependencies]
"zyma98/my-inferlet" = "0.1.0"
Option 2: Runtime Symbol Resolution
What do you think?
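As a rough illustration of the "magic tool" in Option 1, the Python sketch below turns the manifest shown above into a WIT world. Everything about the mapping is an assumption made for this example: the type table (`int` to `s64`, `float` to `f64`), the `pie` namespace, the `main` interface and function names, and the helper `generate_wit` itself. It handles a single output only.

```python
# Assumed mapping from manifest types to WIT types (not confirmed by the thread).
WIT_TYPES = {"string": "string", "int": "s64", "float": "f64"}


def wit_type(entry):
    """Translate one manifest field into a WIT type, wrapping optionals in option<...>."""
    base = WIT_TYPES[entry["type"]]
    return f"option<{base}>" if entry.get("optional", False) else base


def kebab(name):
    """WIT identifiers are kebab-case, so max_tokens becomes max-tokens."""
    return name.replace("_", "-")


def generate_wit(manifest):
    """Build a WIT world with one export and one import per dependency (hypothetical layout)."""
    name = manifest["package"]["name"]
    iface = manifest["interface"]
    params = ", ".join(f"{kebab(e['name'])}: {wit_type(e)}" for e in iface["inputs"])
    result = wit_type(iface["outputs"][0])  # single-output manifests only
    lines = [f"package pie:{name};", "", f"world {name} {{"]
    for dep, version in manifest.get("dependencies", {}).items():
        namespace, package = dep.split("/")
        lines.append(f"    import {namespace}:{package}/main@{version};")
    lines.append(f"    export main: func({params}) -> {result};")
    lines.append("}")
    return "\n".join(lines)


# The Option 1 manifest, written as a dict to keep the sketch dependency-free.
MANIFEST = {
    "package": {"name": "my-agent"},
    "interface": {
        "inputs": [
            {"name": "prompt", "type": "string"},
            {"name": "temperature", "type": "float", "optional": True},
        ],
        "outputs": [{"name": "completion", "type": "string"}],
    },
    "dependencies": {"zyma98/my-inferlet": "0.1.0"},
}
```

Calling `generate_wit(MANIFEST)` yields a world that imports `zyma98:my-inferlet/main@0.1.0` and exports a `main` function, so the user never writes WIT by hand.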
-
|
Thanks for the clarification! I personally prefer Option 1. If we later figure out that only a subset of the Wasm component interface model is usually used in Pie, we can leverage it to simplify our in-house interface definition and how bakery handles them.
-
|
Since this will shape the future of the inferlet ecosystem, I think we need to prioritize its development. I see three remaining tasks:
(1). redesign "standard inferlets" (inferlets in the "std" namespace) and the library inferlets that glue them together based on the Component Model.
(2). update
(3). implement dynamic linking
We could do (1) together.
-
|
Summarizing our discussion on the next steps to incorporate dynamic linking with Bakery and form a dev plan. Please let me know how it looks. @ingim
Package Search Path
We'll make bakery the canonical source for downloading both inferlets and libraries. I'll use the term package to refer to either an inferlet or a library. We'll refactor the client
On the engine side, we'll implement a two-level search. The engine will first search through the packages uploaded to the engine. If the specified package is not found, it'll proceed to search the registry.
Dependency Specification
Each package's dependencies will be specified in its manifest file. During inferlet instantiation or library load time, the engine will recursively load the dependencies if they haven't already been loaded. Since we'll refactor the engine to accept package uploads, the engine will always have the dependency information, either from the uploaded package or from the registry; therefore the client need not specify the dependencies again when instantiating an inferlet.
Library Ergonomics
We'll provide a smooth experience for inferlet developers who want to use libraries from the registry. The Wasm binary uploaded to the registry already contains the interface information, which can be examined with tools like
[package]
name = "my-inferlet"
version = "0.1.0"
edition = "2024"
[dependencies]
inferlet-std = { version = "1.0", registry = "pie-registry" }
Dev Plan
PRs
Related Discussions
|
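The two-level search and recursive dependency loading described in this comment can be sketched in Python as follows. This is a toy model under stated assumptions: a package is reduced to its list of dependency names, `Engine`, `find`, `load`, and `PackageNotFound` are hypothetical names, and the cycle check is an addition of this sketch that the thread does not discuss.

```python
class PackageNotFound(Exception):
    pass


class Engine:
    def __init__(self, uploaded, registry):
        # Both map package name -> list of dependency names (a stand-in for the manifest).
        self.uploaded = uploaded
        self.registry = registry
        self.loaded = []  # load order; dependencies appear before their dependents

    def find(self, name):
        """Two-level search: packages uploaded to the engine first, then the registry."""
        if name in self.uploaded:
            return self.uploaded[name]
        if name in self.registry:
            return self.registry[name]
        raise PackageNotFound(name)

    def load(self, name, _in_progress=None):
        """Recursively load dependencies that have not been loaded yet."""
        in_progress = _in_progress if _in_progress is not None else set()
        if name in self.loaded:
            return
        if name in in_progress:
            # Not covered in the discussion; added so the sketch cannot recurse forever.
            raise ValueError(f"dependency cycle through {name}")
        in_progress.add(name)
        for dep in self.find(name):
            self.load(dep, in_progress)
        in_progress.discard(name)
        self.loaded.append(name)
```

Because the engine reads dependencies from whichever source supplied the package, the caller only names the top-level package, which matches the point that the client need not restate dependencies.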
-
|
This design is a bit scary to me because it effectively requires us to become package registry maintainers. Yet it sounds fantastic for the developer experience. Can we do a quick sanity check to see if hosting our own Crates/NPM/PyPI endpoints is not too complicated?
-
RFC: Dynamic Linking Support
dynamic-linking
Summary
Enable the Pie engine to load inferlet libraries in WASM binary form and resolve dependencies at runtime. This feature aims to improve memory efficiency and inferlet spawning latency.
Motivation
Pie has started to support inferlets written in JavaScript and Python. Inferlets written in these interpreted languages are significantly larger because, at the current stage, they must contain the interpreter and language runtime. This not only bloats the inferlet binary up to tens of MBs but also increases the inferlet launch time up to a few seconds.
One possible solution to the problem is to implement dynamic linking, where we separate out the interpreter and language runtime into a library that is loaded ahead of the application binary. There will be only one instantiated copy of the library in the engine, so both memory usage and storage usage will be efficient. Also, with the interpreter and language runtime split out, the application binary will be small, so launch time will be significantly reduced.
Mechanism
On the client side, two new commands will be added to the pie-cli program:
- load command, which uploads a library to the Pie engine. The library is allowed to have interface dependencies on host-provided functions or exported interfaces of already loaded libraries.
- purge command, which removes all loaded libraries from the Pie engine. This command is allowed to run only when the engine is quiescent, meaning that no application inferlet is running.
Upon a load command, the engine will receive a library inferlet. To make the exported interfaces from this library available to subsequently loaded libraries or launched applications, the engine will create a shim as host-provided interfaces which forwards calls to the library.
The shim sounds complicated, but it is necessary because an inferlet, as a WASM component, cannot directly call into another WASM component. (Unless these components are composed together into a single binary and become a single component, but then it becomes static linking and defeats our purpose.) The two feasible operations are calling from an inferlet into host-provided functions and calling from the host into an inferlet's exported functions.
Therefore, to glue dynamic libraries and applications together, the engine defines host-provided functions that have exactly the same signatures as the library's exported functions; this is the aforementioned shim. As such, applications will call the host-provided shim, transferring control flow from the application WASM to the host, and then the host will call the library, transferring control back to the WASM world.
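The forwarding role of the shim can be modeled in a few lines of Python. This is a conceptual sketch only: plain Python callables stand in for WASM exports, and `Library`, `Host`, `application`, and the `tokenize` export are hypothetical names chosen for illustration.

```python
class Library:
    """Stand-in for an instantiated library component with exported functions."""

    def __init__(self, name, exports):
        self.name = name
        self.exports = exports


class Host:
    def __init__(self):
        self.host_functions = {}  # the only functions guest code may call
        self.trace = []           # records each boundary crossing

    def load(self, library):
        """Register one host function per library export, with the same call signature."""
        for fn_name, fn in library.exports.items():
            self.host_functions[fn_name] = self._make_shim(library.name, fn_name, fn)

    def _make_shim(self, lib_name, fn_name, target):
        def shim(*args):
            # Application -> host, then host -> library: two crossings per call.
            self.trace.append(f"app -> host: {fn_name}")
            self.trace.append(f"host -> {lib_name}: {fn_name}")
            return target(*args)
        return shim


def application(host, prompt):
    """The application never touches the library directly, only the host shim."""
    return host.host_functions["tokenize"](prompt)
```

The `trace` list makes the extra hop visible: every library call costs one crossing into the host and one back into WASM, which is the mediation overhead raised under Potential Drawbacks.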
The reason to have a purge command that removes all loaded libraries, rather than an unload command that removes a specified library, is a limitation of wasmtime::Linker. The linker can only add definitions but cannot remove one, so removing one library at a time is not possible. The purge command will be supported by dropping the old Linker and Store and creating new ones, effectively clearing everything.
Potential Drawbacks
Apart from increased complexity in the engine, one major drawback of the dynamic linking approach is the need for host mediation when function calls cross WASM boundaries. We need to first implement dynamic linking to study its performance impact.
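The purge semantics from the Mechanism section can be sketched as below, assuming an add-only table of definitions. `AddOnlyLinker` is a plain Python model of the constraint described there, not the wasmtime API, and `Engine`, `running_inferlets`, and the method names are hypothetical.

```python
class AddOnlyLinker:
    """Model of the stated constraint: definitions can be added but never removed."""

    def __init__(self):
        self._definitions = {}

    def define(self, name, function):
        self._definitions[name] = function

    def names(self):
        return sorted(self._definitions)


class Engine:
    def __init__(self):
        self.linker = AddOnlyLinker()
        self.running_inferlets = 0  # purge is only legal when this is zero

    def load(self, name, function):
        self.linker.define(name, function)

    def purge(self):
        """Drop the whole linker and start fresh, since single definitions cannot be removed."""
        if self.running_inferlets > 0:
            raise RuntimeError("purge requires a quiescent engine")
        self.linker = AddOnlyLinker()
```

Replacing the linker wholesale is what makes an all-or-nothing purge the natural operation here, while a per-library unload has no equivalent in this model.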