Performance issue - High queue times in perf_analyzer #7986

Open
asaff1 opened this issue Feb 4, 2025 · 3 comments
Labels: performance (A possible performance tune-up), question (Further information is requested)

Comments


asaff1 commented Feb 4, 2025

I've used trtexec to improve a model's performance.
perf_analyzer shows that the infer compute time is very low (a few milliseconds), yet the queue and wait time are high (around 300 ms). What is the reason for requests spending such a long time in the queue? A detailed explanation would be appreciated.
Ideally I want the request time to match the inference time. Any ideas?
I've tried playing with instance_groups with no success.

root@5d6049652465:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 72262
    Throughput: 4005.66 infer/sec
    Avg latency: 247602 usec (standard deviation 9319 usec)
    p50 latency: 245439 usec
    p90 latency: 255399 usec
    p95 latency: 282710 usec
    p99 latency: 345187 usec
    Avg gRPC time: 247578 usec ((un)marshal request/response 11 usec + response wait 247567 usec)
  Server:
    Inference count: 72262
    Execution count: 4518
    Successful request count: 72262
    Avg request latency: 247496 usec (overhead 844 usec + queue 242704 usec + compute input 2241 usec + compute infer 1613 usec + compute output 93 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 4005.66 infer/sec, latency 247602 usec

config.pbtxt:

platform: "tensorrt_plan"
max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 100
}
optimization {
  cuda { graphs: true }
}

model_warmup {
  batch_size: 1
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 2
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 3
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 4
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}

rmccorm4 (Contributor) commented Feb 5, 2025

Hi @asaff1, the queue time is likely so high compared to the compute times because the model config defines a max batch size of 16, but the model is being hit by perf_analyzer with a concurrency of 1000, leaving most requests queued while at most 16 requests at a time are executed.
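
As a rough sanity check (my own arithmetic, using only the numbers from the first run above): by Little's law, average latency ≈ requests in flight / throughput = 1000 / 4005.66 infer/sec ≈ 250 ms, which matches the observed ~247 ms average latency, so nearly all of that time is queueing rather than compute.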

Can you build an engine that supports a greater max batch size?
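
A minimal sketch of what rebuilding the engine could look like with trtexec, assuming an ONNX source and a target max batch of 128 (the file names and the batch value are placeholders, not taken from this issue):

trtexec --onnx=model.onnx \
        --saveEngine=model_trt_fp16.plan \
        --fp16 \
        --minShapes=input:1x3x256x256 \
        --optShapes=input:64x3x256x256 \
        --maxShapes=input:128x3x256x256

Note that max_batch_size in config.pbtxt has to stay at or below the batch dimension the engine was built with.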

I've tried playing with instance_groups with no success.

Can you elaborate on this? What instance group configurations have you tried, and how did they affect the results?
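
For example, an instance_group block in config.pbtxt along these lines (the count of 4 is purely illustrative):

instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]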

asaff1 (Author) commented Feb 6, 2025

@rmccorm4 I understand. Thanks.

The system has one RTX 4090. My goal is to reduce latency to below 10 ms while handling 1000 requests per second. I've tried running PA with --request-rate-range 1000, but I get the warning

[WARNING] Perf Analyzer was not able to keep up with the desired request rate. 99.96% of the requests were delayed.

And then I see fewer infer/sec than with --concurrency 1000.

  • What is better in this case, increasing model instances, or increasing batch size?

I've tried increasing to max_batch_size = 128 (and disabled preferred_batch_size), and latency is still high:

root@c13a7cbe8571:/opt/shlomo/benchmark_models# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 90978
    Throughput: 5044.3 infer/sec
    Avg latency: 197140 usec (standard deviation 11611 usec)
    p50 latency: 196110 usec
    p90 latency: 217432 usec
    p95 latency: 227722 usec
    p99 latency: 245582 usec
    Avg gRPC time: 197135 usec ((un)marshal request/response 5 usec + response wait 197130 usec)
  Server:
    Inference count: 90888
    Execution count: 712
    Successful request count: 90888
    Avg request latency: 197695 usec (overhead 3773 usec + queue 168707 usec + compute input 13714 usec + compute infer 11225 usec + compute output 275 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 5044.3 infer/sec, latency 197140 usec

I tried increasing to an instance_group count of 8 with max_batch_size = 64 and did see some latency improvement, but it is still not at my goal:

root@5f8b48be57d2:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 152714
    Throughput: 8454.02 infer/sec
    Avg latency: 117871 usec (standard deviation 6402 usec)
    p50 latency: 117517 usec
    p90 latency: 133338 usec
    p95 latency: 139219 usec
    p99 latency: 149502 usec
    Avg gRPC time: 117860 usec ((un)marshal request/response 5 usec + response wait 117855 usec)
  Server:
    Inference count: 152714
    Execution count: 2388
    Successful request count: 152714
    Avg request latency: 117684 usec (overhead 2016 usec + queue 94008 usec + compute input 8234 usec + compute infer 12207 usec + compute output 1218 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 8454.02 infer/sec, latency 117871 usec

Then I tried max_batch_size = 64 and an instance_group count of 16, which I assume should be able to handle 1024 requests concurrently (64 x 16 = 1024), yet the queue time is still high:

root@5f8b48be57d2:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000

  Client:
    Request count: 155884
    Throughput: 8631.19 infer/sec
    Avg latency: 115285 usec (standard deviation 8749 usec)
    p50 latency: 115097 usec
    p90 latency: 132436 usec
    p95 latency: 138639 usec
    p99 latency: 151940 usec
    Avg gRPC time: 115276 usec ((un)marshal request/response 5 usec + response wait 115271 usec)
  Server:
    Inference count: 155881
    Execution count: 2438
    Successful request count: 155881
    Avg request latency: 114746 usec (overhead 2320 usec + queue 91500 usec + compute input 8227 usec + compute infer 11599 usec + compute output 1099 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 8631.19 infer/sec, latency 115285 usec
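
For reference, a sketch of the config this last run corresponds to, reconstructed from the values described above rather than copied from the actual file (the queue delay is carried over from the original config):

platform: "tensorrt_plan"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 16
    kind: KIND_GPU
  }
]
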
  • How should I proceed?

  • Also, when dynamic_batching is enabled, what are the best values to give trtexec for --optShapes? Today I use:

--minShapes=input:1x3x256x256 --optShapes=input:16x3x256x256 --maxShapes=input:256x3x256x256

What performance impact does it have? In my application, requests come in one by one.
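
As a side note, one way to see how the engine behaves at specific client-side batch sizes (rather than only batch 1 at very high concurrency) would be something like the following; the batch size and concurrency range here are illustrative values, not recommendations:

perf_analyzer -i grpc -m model_trt_fp16 -b 16 --concurrency-range 1:8 --shared-memory system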

rmccorm4 (Contributor) commented Feb 7, 2025

Hi @asaff1,

Have you tried Model Analyzer for finding an optimal model config (instance count, batching settings, etc.)?
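
A minimal Model Analyzer sweep might look something like this (the repository paths are placeholders, and the available flags can vary by version, so check the Model Analyzer documentation):

model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models model_trt_fp16 \
    --output-model-repository-path /path/to/output_models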
