Performance issue - High queue times in perf_analyzer #7986
Hi @asaff1, the queue time is likely so high compared to compute times because the model config has defined a max batch size of 16, but is being hit by PA with a concurrency of 1000, leaving many requests to be queued while at most 16 requests at a time are executed. Can you build an engine that supports a greater max batch size?
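For illustration, raising the batching limits in config.pbtxt might look like the sketch below (values are assumptions, not a recommendation, and the TensorRT engine itself must also be rebuilt to support the larger batch):

```
# Illustrative values only -- with max_batch_size 16 and concurrency 1000,
# roughly 984 requests sit queued at any moment, so queue time dominates.
max_batch_size: 128

dynamic_batching {
  # Allow the scheduler a short window to form larger batches
  preferred_batch_size: [ 64, 128 ]
  max_queue_delay_microseconds: 100
}
```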
Can you elaborate on this? What instance group configurations have you tried, and how did they affect the results?
@rmccorm4 I understand, thanks. The system has one RTX 4090. My goal is to reduce latency to below 10 ms and to handle 1000 requests per second. I've tried using PA with
And then I see fewer infer/sec than when using
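(The command snippets above weren't preserved in the thread; for illustration, a comparison like the one described might use perf_analyzer roughly as follows, with the model name as a placeholder:)

```sh
# Hypothetical model name; -i/-u/--concurrency-range are standard perf_analyzer flags
perf_analyzer -m my_model -i grpc -u localhost:8001 --concurrency-range 1000

# versus a run sized to the engine's max batch size
perf_analyzer -m my_model -i grpc -u localhost:8001 --concurrency-range 16
```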
I've tried increasing max_batch_size to 128 (and disabling preferred_batch_sizes), and latency is still high.
I tried increasing instance_group to 8 with max_batch_size = 64 and did see some latency improvement, but still not my goal:
Then I tried max_batch_size = 64 and instance_group = 16, which I assume should be able to handle 1024 concurrent requests (64 × 16 = 1024), yet the queue time is still high:
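A sketch of that last configuration in config.pbtxt terms (Triton's instance_group syntax, with counts mirroring the numbers above):

```
max_batch_size: 64

instance_group [
  {
    # 16 model instances sharing the GPU
    count: 16
    kind: KIND_GPU
  }
]
```

Note that the 16 instances still contend for the same physical GPU, so effective concurrency is usually well below what the count × max_batch_size arithmetic suggests.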
What performance impact does that have? In my application, requests arrive one at a time.
Hi @asaff1, have you tried Model Analyzer for finding an optimal model config (instance count, batching settings, etc.)?
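A typical invocation might look like this (repository path and model name are placeholders):

```sh
# Hypothetical repository path and model name
model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models my_model
```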
I've used trtexec to improve the model's performance.
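For illustration, building an engine that supports a larger dynamic batch might look like this (the ONNX file, input name, and shapes are assumptions):

```sh
# Hypothetical file and input names; --minShapes/--optShapes/--maxShapes
# set the dynamic batch range baked into the engine
trtexec --onnx=model.onnx \
    --saveEngine=model.plan \
    --minShapes=input:1x3x224x224 \
    --optShapes=input:64x3x224x224 \
    --maxShapes=input:128x3x224x224
```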
perf_analyzer shows that the infer compute time is very low (a few milliseconds), yet the queue and wait times are high (300 ms). What is the reason for requests spending so long in the queue? A detailed explanation would be appreciated.
Ideally I want the request time to match the inference time. Any ideas?
I've tried playing with instance_groups, without success.
config.pbtxt:
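(The attached config was not preserved; the following is a hypothetical reconstruction consistent with the thread, using the max_batch_size of 16 mentioned in the first reply:)

```
# Hypothetical reconstruction -- the original attachment is missing
platform: "tensorrt_plan"
max_batch_size: 16

dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}

instance_group [
  { count: 1, kind: KIND_GPU }
]
```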