I'm trying to use DALI to preprocess images before sending them to YOLO. I have a two-GPU system that segfaults and dies after a few minutes of running DALI. The DALI step runs in a Triton ensemble model that sends data to a YOLO model compiled to TensorRT. The stack trace seems to indicate the problem is within DALI.
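Roughly, the preprocessing pipeline has this shape (a simplified sketch rather than the exact code; the input name, resize target, and normalization values are placeholders):

```python
# Sketch of a DALI preprocessing pipeline for a YOLO-style detector, served
# through the Triton DALI backend: decode on the GPU, resize to the network
# input size, and normalize to CHW float. Placeholder names and sizes.
import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types


@dali.pipeline_def(batch_size=1, num_threads=4, device_id=0)
def preprocessing_pipeline():
    images = fn.external_source(device="cpu", name="IMAGE")  # encoded bytes from Triton
    images = fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=640, resize_y=640)   # assumed YOLO input size
    return fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        std=[255.0, 255.0, 255.0],                            # scale 0-255 -> 0-1
    )
```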
About 100 fps of images (batch size 1 with 32 concurrent requests) are being sent to Triton, and after several rounds of processing it throws the error below. The error reproduces after a few minutes:
Signal (11) received.
0# 0x00005572DF1FBBD9 in tritonserver
1# 0x00007F6BFF552210 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# 0x00007F6BF5602D40 in /usr/local/cuda/compat/lib.real/libcuda.so.1
3# 0x00007F6BF57383E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
4# 0x00007F6BF585EB02 in /usr/local/cuda/compat/lib.real/libcuda.so.1
5# 0x00007F6BF55BF2E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
6# 0x00007F6BF55BFAC4 in /usr/local/cuda/compat/lib.real/libcuda.so.1
7# 0x00007F6BF55C1BD5 in /usr/local/cuda/compat/lib.real/libcuda.so.1
8# 0x00007F6BF562FAAE in /usr/local/cuda/compat/lib.real/libcuda.so.1
9# 0x00007F6BC80CF0D9 in /opt/tritonserver/backends/dali/libtriton_dali.so
10# 0x00007F6BC809EFED in /opt/tritonserver/backends/dali/libtriton_dali.so
11# 0x00007F6BC80F3B65 in /opt/tritonserver/backends/dali/libtriton_dali.so
12# 0x00007F6BC8098130 in /opt/tritonserver/backends/dali/libtriton_dali.so
13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
14# 0x00007F6ABDAE526F in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
16# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
Also, interestingly enough, running this with instance groups causes Signal (11) to show up faster (within seconds rather than minutes) when ensembling across the two GPUs.
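For reference, a config.pbtxt fragment of the kind I mean (a sketch with a placeholder model name, not my actual config), placing the DALI model instances on both GPUs:

```
# Sketch: instance_group placing the DALI preprocessing model on both GPUs.
name: "dali_preprocessing"
backend: "dali"
max_batch_size: 1
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```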
Other things I've tried
I've tried multiple changes, such as:
- calling .gpu() and setting the external source to CPU, which makes Signal (11) show up later (a sketch of this variant follows the list);
- running a single GPU instance, which makes DALI play nicely and not crash.
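The .gpu() variant I mean looks roughly like this (again a sketch with placeholder names; decoding moved to the CPU and the host-to-device copy made explicit):

```python
import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types


# Sketch of the variant: external source and decode on the CPU, then an explicit
# .gpu() copy before the GPU operators. Placeholder names and sizes.
@dali.pipeline_def(batch_size=1, num_threads=4, device_id=0)
def preprocessing_cpu_source():
    images = fn.external_source(device="cpu", name="IMAGE")
    images = fn.decoders.image(images, device="cpu", output_type=types.RGB)
    images = fn.resize(images.gpu(), resize_x=640, resize_y=640)  # explicit copy to GPU
    return fn.crop_mirror_normalize(images, dtype=types.FLOAT, output_layout="CHW")
```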
Theories
Perhaps there is some kind of weirdness with the DALI backend confusing GPU tensors across GPUs? Since the issue takes a few minutes to occur without instance groups, it seems likely that scheduling across devices (preprocessing on gpu:0 and then TensorRT on gpu:1) is what causes the segfaults.
Versions
NVIDIA Release 21.12 (build 30441439)
Any thoughts? Thanks in advance.
Hi @natel9178!
Thanks for the thorough description of the problem. If I understand correctly, you are running 32 parallel models here. Additionally, I see these frames in the stack trace:
13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
This would suggest a problem with the CPU threads used. Could you try setting num_threads=1 in the DALI pipeline definition and see if this helps? Generally, CPU operators in DALI use a thread-per-sample mapping, so with batch_size=1 there won't be any difference; only one thread will be used anyway.
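A minimal sketch of the change I mean (assuming a pipeline_def-style definition; your actual pipeline arguments may differ):

```python
import nvidia.dali as dali

# num_threads=1 limits the DALI pipeline to a single CPU worker thread
# (pipeline body elided).
@dali.pipeline_def(batch_size=1, num_threads=1, device_id=0)
def preprocessing_pipeline():
    ...
```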
If this does not help, could you provide us with a core dump or a repro that we can run on our side?