what is the strategy of triton for running models in parallel, multi-thread or multi-process? #6253
Answered by dyastremsky, Sep 1, 2023
This differs based on the backend and model configuration. For example, the Python backend runs models in their own processes, while TensorRT uses CUDA streams. It also depends on your model configuration: if you specify multiple model instances, they will run on the same device, so many backends use multi-threading to enable parallel inference across those instances.
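
For illustration, here is a minimal `config.pbtxt` sketch requesting two instances of a model on one GPU (the model name, backend, and batch size are placeholders). With a setting like this, most backends serve the two instances with multiple threads inside the Triton process, while the Python backend instead launches a separate stub process per instance:

```protobuf
# config.pbtxt (illustrative; "my_model" and the backend are placeholders)
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8

# Ask Triton for two copies of the model on GPU 0. Requests can then be
# dispatched to either instance in parallel; how the parallelism is realized
# (threads, processes, CUDA streams) is up to the backend.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```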