Performance issue - High queue times in perf_analyzer #7986

Open
asaff1 opened this issue Feb 4, 2025 · 3 comments
Labels: performance (A possible performance tune-up), question (Further information is requested)

Comments


asaff1 commented Feb 4, 2025

I've used trtexec to improve a model's performance.
perf_analyzer shows that the infer compute time is very low (a few milliseconds), yet the queue and wait time are high (around 300 ms). What is the reason for requests spending such a long time in the queue? A detailed explanation would be appreciated.
Ideally I want the request time to match the inference time. Any ideas?
I've tried playing with instance_groups with no success.

root@5d6049652465:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 72262
    Throughput: 4005.66 infer/sec
    Avg latency: 247602 usec (standard deviation 9319 usec)
    p50 latency: 245439 usec
    p90 latency: 255399 usec
    p95 latency: 282710 usec
    p99 latency: 345187 usec
    Avg gRPC time: 247578 usec ((un)marshal request/response 11 usec + response wait 247567 usec)
  Server:
    Inference count: 72262
    Execution count: 4518
    Successful request count: 72262
    Avg request latency: 247496 usec (overhead 844 usec + queue 242704 usec + compute input 2241 usec + compute infer 1613 usec + compute output 93 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 4005.66 infer/sec, latency 247602 usec

config.pbtxt:

platform: "tensorrt_plan"
max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 100
}
optimization {
  cuda { graphs: true }
}

model_warmup {
  batch_size: 1
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 2
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 3
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 4
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}

rmccorm4 (Contributor) commented Feb 5, 2025

Hi @asaff1, the queue time is likely so high compared to the compute times because the model config defines a max batch size of 16, but the model is being hit by perf_analyzer with a concurrency of 1000, leaving most requests queued while at most 16 requests at a time are executed.
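
As a rough sanity check (my own arithmetic, using only the numbers from the first run above): by Little's law, average latency ≈ requests in flight / throughput = 1000 / 4005.66 infer/sec ≈ 250 ms, which matches the observed ~247 ms average latency, so nearly all of that time is queueing rather than compute.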

Can you build an engine that supports a greater max batch size?
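
A minimal sketch of what rebuilding the engine could look like with trtexec, assuming an ONNX source and a target max batch of 128 (the file names and the batch value are placeholders, not taken from this issue):

trtexec --onnx=model.onnx \
        --saveEngine=model_trt_fp16.plan \
        --fp16 \
        --minShapes=input:1x3x256x256 \
        --optShapes=input:64x3x256x256 \
        --maxShapes=input:128x3x256x256

Note that max_batch_size in config.pbtxt has to stay at or below the batch dimension the engine was built with.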

I've tried playing with instance_groups with no success.

Can you elaborate on this? What instance group configurations have you tried, and how did they affect the results?
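
For example, an instance_group block in config.pbtxt along these lines (the count of 4 is purely illustrative):

instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]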

asaff1 (Author) commented Feb 6, 2025

@rmccorm4 I understand. Thanks.

The system has one RTX 4090. My goal is to reduce latency to below 10 ms while handling 1000 requests per second. I've tried running PA with --request-rate-range 1000, but I get the warning

[WARNING] Perf Analyzer was not able to keep up with the desired request rate. 99.96% of the requests were delayed.

And then I see fewer infer/sec than with --concurrency 1000.

  • What is better in this case, increasing model instances, or increasing batch size?

I've tried increasing to max_batch_size = 128 (and disabled preferred_batch_size), and latency is still high:

root@c13a7cbe8571:/opt/shlomo/benchmark_models# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 90978
    Throughput: 5044.3 infer/sec
    Avg latency: 197140 usec (standard deviation 11611 usec)
    p50 latency: 196110 usec
    p90 latency: 217432 usec
    p95 latency: 227722 usec
    p99 latency: 245582 usec
    Avg gRPC time: 197135 usec ((un)marshal request/response 5 usec + response wait 197130 usec)
  Server:
    Inference count: 90888
    Execution count: 712
    Successful request count: 90888
    Avg request latency: 197695 usec (overhead 3773 usec + queue 168707 usec + compute input 13714 usec + compute infer 11225 usec + compute output 275 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 5044.3 infer/sec, latency 197140 usec

I tried increasing to an instance_group count of 8 with max_batch_size = 64 and did see some latency improvement, but it is still not at my goal:

root@5f8b48be57d2:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 152714
    Throughput: 8454.02 infer/sec
    Avg latency: 117871 usec (standard deviation 6402 usec)
    p50 latency: 117517 usec
    p90 latency: 133338 usec
    p95 latency: 139219 usec
    p99 latency: 149502 usec
    Avg gRPC time: 117860 usec ((un)marshal request/response 5 usec + response wait 117855 usec)
  Server:
    Inference count: 152714
    Execution count: 2388
    Successful request count: 152714
    Avg request latency: 117684 usec (overhead 2016 usec + queue 94008 usec + compute input 8234 usec + compute infer 12207 usec + compute output 1218 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 8454.02 infer/sec, latency 117871 usec

Then I tried max_batch_size = 64 and an instance_group count of 16, which I assume should be able to handle 1024 requests concurrently (64 x 16 = 1024), yet the queue time is still high:

root@5f8b48be57d2:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000

  Client:
    Request count: 155884
    Throughput: 8631.19 infer/sec
    Avg latency: 115285 usec (standard deviation 8749 usec)
    p50 latency: 115097 usec
    p90 latency: 132436 usec
    p95 latency: 138639 usec
    p99 latency: 151940 usec
    Avg gRPC time: 115276 usec ((un)marshal request/response 5 usec + response wait 115271 usec)
  Server:
    Inference count: 155881
    Execution count: 2438
    Successful request count: 155881
    Avg request latency: 114746 usec (overhead 2320 usec + queue 91500 usec + compute input 8227 usec + compute infer 11599 usec + compute output 1099 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 8631.19 infer/sec, latency 115285 usec
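
For reference, a sketch of the config this last run corresponds to, reconstructed from the values described above rather than copied from the actual file (the queue delay is carried over from the original config):

platform: "tensorrt_plan"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 16
    kind: KIND_GPU
  }
]
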
  • How should I proceed?

  • Also, when dynamic_batching is enabled, what are the best values to give trtexec for --optShapes? Today I use:

--minShapes=input:1x3x256x256 --optShapes=input:16x3x256x256 --maxShapes=input:256x3x256x256

What performance impact does it have? In my application, requests come in one by one.
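
As a side note, one way to see how the engine behaves at specific client-side batch sizes (rather than only batch 1 at very high concurrency) would be something like the following; the batch size and concurrency range here are illustrative values, not recommendations:

perf_analyzer -i grpc -m model_trt_fp16 -b 16 --concurrency-range 1:8 --shared-memory system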

rmccorm4 (Contributor) commented Feb 7, 2025

Hi @asaff1,

Have you tried Model Analyzer for finding an optimal model config (instance count, batching settings, etc.)?
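
A minimal Model Analyzer sweep might look something like this (the repository paths are placeholders, and the available flags can vary by version, so check the Model Analyzer documentation):

model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models model_trt_fp16 \
    --output-model-repository-path /path/to/output_models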
