
Multi-GPU configuration with DALI: Signal (11)s on Triton 21.12 during ensemble processing #116

Open
natel9178 opened this issue Jan 26, 2022 · 1 comment

natel9178 commented Jan 26, 2022

Hi!

I'm trying to use DALI to preprocess images before sending them to YOLO. I have a two-GPU system that, after a few minutes of running DALI, segfaults and dies. The DALI step runs in a Triton ensemble model that feeds a YOLO model compiled with TensorRT. The stack trace seems to indicate the problem is within DALI.

Repro:

The preprocessing code is the following:

import nvidia.dali as dali


@dali.pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipe():
    # Batches are fed by the Triton DALI backend through this external source
    images = dali.fn.external_source(
        device="gpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    # BGR -> RGB
    images = dali.fn.color_space_conversion(
        images, image_type=dali.types.BGR, output_type=dali.types.RGB, device='gpu')
    # Aspect-preserving resize to fit within 640x384, then pad out to exactly 640x384
    images = dali.fn.resize(images, mode="not_larger",
                            resize_x=640, resize_y=384, device='gpu')
    images = dali.fn.crop(images, crop_w=640, crop_h=384, crop_pos_x=0, crop_pos_y=0,
                          fill_values=114, out_of_bounds_policy="pad", device='gpu')
    # HWC -> CHW and cast to float for the TensorRT YOLO model
    images = dali.fn.transpose(images, perm=[2, 0, 1], device='gpu')
    images = dali.fn.cast(images, dtype=dali.types.FLOAT, device='gpu')
    return images


pipe().serialize(filename="1/model.dali")

and config.pbtxt

name: "preprocessbgr"
backend: "dali"
max_batch_size: 64 
input [
{
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
}
]
 
output [
{
    name: "OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 3, 384, 640 ]
}
]
dynamic_batching { }

Images are being sent to Triton at roughly 100 fps (batch size 1, 32 concurrent requests), which causes it to throw this error after several rounds of processing. The error reproduces after a few minutes:

Signal (11) received.
 0# 0x00005572DF1FBBD9 in tritonserver
 1# 0x00007F6BFF552210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F6BF5602D40 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 3# 0x00007F6BF57383E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 4# 0x00007F6BF585EB02 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 5# 0x00007F6BF55BF2E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 6# 0x00007F6BF55BFAC4 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 7# 0x00007F6BF55C1BD5 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 8# 0x00007F6BF562FAAE in /usr/local/cuda/compat/lib.real/libcuda.so.1
 9# 0x00007F6BC80CF0D9 in /opt/tritonserver/backends/dali/libtriton_dali.so
10# 0x00007F6BC809EFED in /opt/tritonserver/backends/dali/libtriton_dali.so
11# 0x00007F6BC80F3B65 in /opt/tritonserver/backends/dali/libtriton_dali.so
12# 0x00007F6BC8098130 in /opt/tritonserver/backends/dali/libtriton_dali.so
13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
14# 0x00007F6ABDAE526F in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
16# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
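
For reference, each request to the preprocessing model looks roughly like the sketch below (not my exact client code; the endpoint URL, frame size, and random data are placeholders that only illustrate the input shape and dtype being sent):

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton HTTP endpoint (placeholder URL).
client = httpclient.InferenceServerClient(url="localhost:8000")

# One HWC uint8 BGR frame with an explicit batch dimension of 1.
frame = np.random.randint(0, 256, size=(1, 720, 1280, 3), dtype=np.uint8)

inp = httpclient.InferInput("DALI_INPUT_0", list(frame.shape), "UINT8")
inp.set_data_from_numpy(frame)
out = httpclient.InferRequestedOutput("OUTPUT_0")

# 32 of these calls run concurrently, totalling roughly 100 fps.
result = client.infer(model_name="preprocessbgr", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT_0").shape)  # expected: (1, 3, 384, 640)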

Also, interestingly enough, running this with

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
    }
]

causes the Signal (11) to show up faster, within seconds rather than minutes, when the ensemble runs across the two GPUs.

Other things I've tried

I've tried a few other changes:

  1. Setting the external source to CPU and calling .gpu() on it (see the sketch after this list), which only makes the Signal (11) show up later.
  2. Running a single GPU instance, which makes DALI play nicely and does not crash.
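
The first variant looked roughly like this (a sketch; only the input stage differs from the original pipeline, and the function name is just illustrative):

import nvidia.dali as dali


@dali.pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipe_cpu_input():
    # External source on the CPU, then an explicit copy to the GPU.
    images = dali.fn.external_source(
        device="cpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    images = images.gpu()
    images = dali.fn.color_space_conversion(
        images, image_type=dali.types.BGR, output_type=dali.types.RGB, device='gpu')
    # ... the remaining operators are identical to the original pipeline ...
    return images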

Theories

  1. Is there some weirdness in the DALI backend that confuses GPU tensors across GPUs?
  2. Since the issue takes a few minutes to occur without instance groups, cross-device scheduling (preprocessing on gpu:0 and then TensorRT on gpu:1) is likely what causes the segfault.

Versions

NVIDIA Release 21.12 (build 30441439)

Any thoughts? I appreciate any help in advance.

@szalpal szalpal self-assigned this Jan 26, 2022

szalpal commented Jan 26, 2022

Hi @natel9178!
Thanks for the thorough description of the problem. If I understand correctly, you are running 32 requests in parallel here. Additionally, I see these frames in the stack trace:

13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0

This would suggest a problem with the CPU threads being used. Could you try reducing num_threads to 1 in the DALI pipeline definition and see if this helps? Generally, CPU operators in DALI use a thread-per-sample mapping, so with batch_size=1 there won't be any difference either way: only one thread will be used.
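
That is, something along these lines (only the decorator argument changes; the rest of the pipeline stays exactly as in your snippet):

import nvidia.dali as dali


@dali.pipeline_def(batch_size=64, num_threads=1, device_id=0)  # was num_threads=4
def pipe():
    images = dali.fn.external_source(
        device="gpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    # ... rest of the pipeline exactly as in the original snippet ...
    return images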

If this does not help, could you provide us with a core dump or a repro that we can run on our side?

@JanuszL JanuszL added bug Something isn't working lack_of_repro labels Jan 26, 2022