Ensemble/BLS models where individual steps are hosted on different machines (or clusters)? #7166
Replies: 1 comment 1 reply
Hi @vadimkantorov, thanks for reaching out! This is not natively supported through standard Triton ensemble/BLS conventions today. You could theoretically do many things with BLS, since a BLS model is arbitrary Python code. Similarly, another example is that you can use Triton's in-process Python API to embed Triton within a Ray Serve deployment and let Ray manage the multi-node logic: https://docs.ray.io/en/latest/serve/tutorials/triton-server-integration.html#start-the-triton-server-inside-a-ray-serve-application. Can you elaborate on your goal or use case? Is this a max-throughput scenario where you don't care as much about minimum latency? Do you have any constraints on when you could afford the cost of a round trip to another node? CC @nnshah1 for visibility
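To make the BLS idea concrete: since the ensemble/BLS config has no per-step endpoint field, the cross-node hop has to live in your own Python code. Below is a minimal sketch of one way to do it; the step names, hostnames, and routing-table shape are all made up for illustration, and the actual remote call (commented out) would need the `tritonclient` package and a reachable remote Triton server.

```python
# Hypothetical BLS-style routing sketch: pick a remote Triton endpoint per
# pipeline step, then call that endpoint over HTTP from inside the BLS
# model's Python code. Triton itself does not read this table; it is plain
# application logic.

# Routing table mapping step (model) names to remote Triton endpoints.
# Hostnames here are invented for the example.
STEP_ENDPOINTS = {
    "preprocess": "cpu-cluster.internal:8000",  # big CPU machine/cluster
    "infer": "gpu-node-0.internal:8000",        # GPU machine
}

def endpoint_for(model_name: str, default: str = "localhost:8000") -> str:
    """Return the remote Triton endpoint for a given pipeline step,
    falling back to the local server for unlisted steps."""
    return STEP_ENDPOINTS.get(model_name, default)

# Inside a BLS model's execute(), one could then do something like:
#
#   import tritonclient.http as httpclient
#   client = httpclient.InferenceServerClient(url=endpoint_for("preprocess"))
#   result = client.infer("preprocess", inputs)
#
# i.e. the cross-node dispatch is ordinary client code, not a Triton
# scheduling feature.
```

The trade-off is that each remote step pays a full network round trip plus serialization, which is why the question about latency tolerance matters.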
Do Ensemble models or BLS models support a scheme where individual pipeline components/models are scheduled on different machines?
E.g. it might make sense to host/scale a certain preprocessing component/instance group on a big CPU machine (or even some cluster) and run another step on a GPU machine.
I guess for this we'd need to specify a URL endpoint for individual steps in the BLS/Ensemble config.
Is something like this supported natively by Triton Server?
Thanks!