
Multi-GPU configuration with DALI: Signal (11)s on Triton 21.12 during ensemble processing #116

Open
natel9178 opened this issue Jan 26, 2022 · 1 comment

natel9178 commented Jan 26, 2022

Hi!

I'm trying to use DALI to preprocess images before sending them to YOLO. I have a two-GPU system that, after a few minutes of running DALI, segfaults and dies. The DALI step runs in a Triton ensemble model that feeds a YOLO model compiled with TensorRT. The stack trace seems to indicate the problem is within DALI.

Repro:

The preprocessing code is the following:

import nvidia.dali as dali


@dali.pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipe():
    # Batches are fed by the Triton DALI backend through this external source
    images = dali.fn.external_source(
        device="gpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    # BGR -> RGB
    images = dali.fn.color_space_conversion(
        images, image_type=dali.types.BGR, output_type=dali.types.RGB, device='gpu')
    # Aspect-preserving resize to fit within 640x384, then pad out to exactly 640x384
    images = dali.fn.resize(images, mode="not_larger",
                            resize_x=640, resize_y=384, device='gpu')
    images = dali.fn.crop(images, crop_w=640, crop_h=384, crop_pos_x=0, crop_pos_y=0,
                          fill_values=114, out_of_bounds_policy="pad", device='gpu')
    # HWC -> CHW and cast to float for the TensorRT YOLO model
    images = dali.fn.transpose(images, perm=[2, 0, 1], device='gpu')
    images = dali.fn.cast(images, dtype=dali.types.FLOAT, device='gpu')
    return images


pipe().serialize(filename="1/model.dali")

and config.pbtxt

name: "preprocessbgr"
backend: "dali"
max_batch_size: 64 
input [
{
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
}
]
 
output [
{
    name: "OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 3, 384, 640 ]
}
]
dynamic_batching { }

Images are being sent to Triton at roughly 100 fps (batch size 1, 32 concurrent requests), which causes it to throw this error after several rounds of processing. The error reproduces after a few minutes:

Signal (11) received.
 0# 0x00005572DF1FBBD9 in tritonserver
 1# 0x00007F6BFF552210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F6BF5602D40 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 3# 0x00007F6BF57383E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 4# 0x00007F6BF585EB02 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 5# 0x00007F6BF55BF2E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 6# 0x00007F6BF55BFAC4 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 7# 0x00007F6BF55C1BD5 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 8# 0x00007F6BF562FAAE in /usr/local/cuda/compat/lib.real/libcuda.so.1
 9# 0x00007F6BC80CF0D9 in /opt/tritonserver/backends/dali/libtriton_dali.so
10# 0x00007F6BC809EFED in /opt/tritonserver/backends/dali/libtriton_dali.so
11# 0x00007F6BC80F3B65 in /opt/tritonserver/backends/dali/libtriton_dali.so
12# 0x00007F6BC8098130 in /opt/tritonserver/backends/dali/libtriton_dali.so
13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
14# 0x00007F6ABDAE526F in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
16# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
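
For reference, each request to the preprocessing model looks roughly like the sketch below (not my exact client code; the endpoint URL, frame size, and random data are placeholders that only illustrate the input shape and dtype being sent):

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton HTTP endpoint (placeholder URL).
client = httpclient.InferenceServerClient(url="localhost:8000")

# One HWC uint8 BGR frame with an explicit batch dimension of 1.
frame = np.random.randint(0, 256, size=(1, 720, 1280, 3), dtype=np.uint8)

inp = httpclient.InferInput("DALI_INPUT_0", list(frame.shape), "UINT8")
inp.set_data_from_numpy(frame)
out = httpclient.InferRequestedOutput("OUTPUT_0")

# 32 of these calls run concurrently, totalling roughly 100 fps.
result = client.infer(model_name="preprocessbgr", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT_0").shape)  # expected: (1, 3, 384, 640)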

Also, interestingly enough, running this with

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
    }
]

causes the Signal (11) to show up faster, within seconds rather than minutes, when the ensemble runs across the two GPUs.

Other things I've tried

I've tried a few other changes:

  1. Setting the external source to CPU and calling .gpu() on it (see the sketch after this list), which only makes the Signal (11) show up later.
  2. Running a single GPU instance, which makes DALI play nicely and does not crash.
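
The first variant looked roughly like this (a sketch; only the input stage differs from the original pipeline, and the function name is just illustrative):

import nvidia.dali as dali


@dali.pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipe_cpu_input():
    # External source on the CPU, then an explicit copy to the GPU.
    images = dali.fn.external_source(
        device="cpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    images = images.gpu()
    images = dali.fn.color_space_conversion(
        images, image_type=dali.types.BGR, output_type=dali.types.RGB, device='gpu')
    # ... the remaining operators are identical to the original pipeline ...
    return images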

Theories

  1. Is there some weirdness in the DALI backend that confuses GPU tensors across GPUs?
  2. Since the issue takes a few minutes to occur without instance groups, cross-device scheduling (preprocessing on gpu:0 and then TensorRT on gpu:1) is likely what causes the segfault.

Versions

NVIDIA Release 21.12 (build 30441439)

Any thoughts? I appreciate any help in advance.

@szalpal szalpal self-assigned this Jan 26, 2022

szalpal commented Jan 26, 2022

Hi @natel9178!
Thanks for the thorough description of the problem. If I understand correctly, you are running 32 requests in parallel here. Additionally, I see these frames in the stack trace:

13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0

This would suggest a problem with the CPU threads being used. Could you try reducing num_threads to 1 in the DALI pipeline definition and see if this helps? Generally, CPU operators in DALI use a thread-per-sample mapping, so with batch_size=1 there won't be any difference either way: only one thread will be used.
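
That is, something along these lines (only the decorator argument changes; the rest of the pipeline stays exactly as in your snippet):

import nvidia.dali as dali


@dali.pipeline_def(batch_size=64, num_threads=1, device_id=0)  # was num_threads=4
def pipe():
    images = dali.fn.external_source(
        device="gpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    # ... rest of the pipeline exactly as in the original snippet ...
    return images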

If this does not help, could you provide us with a core dump or a repro that we can run on our side?

@JanuszL JanuszL added bug Something isn't working lack_of_repro labels Jan 26, 2022