
PTX CUDA API: Segmentation fault, stuck in loop #302

Open
rohanjuneja opened this issue Jul 23, 2024 · 2 comments

Comments


rohanjuneja commented Jul 23, 2024

While executing GPGPU-Sim, I am getting a segmentation fault. Looking at the log file, the PTX CUDA API appears to be stuck in a loop.
(screenshot of the log output)

Executing the CUTLASS GEMM implementation also throws a segmentation fault.
(screenshot of the error output)


GCC314 commented Aug 30, 2024

tl;dr: this is likely UB; using an older gcc/g++ may help.

I ran into the same problem when using g++-9 as my compiler. Using gdb, I found my program repeatedly looping in the function __cudaPushCallConfiguration() at libcuda/cuda_runtime_api.cc:3592:

unsigned CUDARTAPI __cudaPushCallConfiguration(dim3 gridDim, dim3 blockDim,
                                               size_t sharedMem = 0,
                                               struct CUstream_st *stream = 0) {
  if (g_debug_execution >= 3) {
    announce_call(__my_func__);
  }
  cudaConfigureCallInternal(gridDim, blockDim, sharedMem, stream);
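  // <-- no return statement: the function is declared to return unsigned,
  // so control flowing off the end here is undefined behavior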
}

I checked that g_debug_execution is always 0, so the if branch should never be executed.
However, __cudaPushCallConfiguration() is declared to return an unsigned value but has no return statement, so control flows off the end of a non-void function, which is undefined behavior.
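
For reference, here is a minimal sketch of the likely fix: simply forward the result of cudaConfigureCallInternal(), assuming that function returns a cudaError_t as the decompiled Docker build below suggests. This is an untested sketch, not an official patch:

unsigned CUDARTAPI __cudaPushCallConfiguration(dim3 gridDim, dim3 blockDim,
                                               size_t sharedMem = 0,
                                               struct CUstream_st *stream = 0) {
  if (g_debug_execution >= 3) {
    announce_call(__my_func__);
  }
  // Forward the configuration result so the function no longer
  // falls off the end of a non-void function.
  return cudaConfigureCallInternal(gridDim, blockDim, sharedMem, stream);
}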

Furthermore, I wanted to know whether this is a mistake or an intentional trick, so I used Ghidra to decompile cuda_runtime_api.o from my own build and from this Docker image of GPGPU-Sim, and compared their versions of __cudaPushCallConfiguration().

My version, which turned out to be a nonsensical loop:

uint __cudaPushCallConfiguration(dim3 gridDim,dim3 blockDim,size_t sharedMem,CUstream_st *stream)

{
  size_t *psVar1;
  size_t *psVar2;
  dim3 blockDim_00;
  gpgpu_context *apgStack_50 [3];
  undefined auStack_38 [16];
  dim3 blockDim_local;
  dim3 gridDim_local;
  
  psVar2 = (size_t *)auStack_38;
  blockDim_local.x = (uint)blockDim._0_8_;
  blockDim_local.y = SUB84(blockDim._0_8_,4);
  blockDim_local.z = blockDim.z;
  psVar1 = (size_t *)auStack_38;
  if (2 < _g_debug_execution) goto LAB_00106308;
  do {
    *(undefined8 *)((long)psVar2 + -0x10) = 0;
    *(undefined8 *)((long)psVar2 + -0x18) = 0x106303;
    blockDim_00.z = *(uint *)((long)psVar2 + 0x18);
    blockDim_00._0_8_ = *(undefined8 *)((long)psVar2 + 0x10);
    cudaConfigureCallInternal
              (*(dim3 *)((long)psVar2 + 0x20),blockDim_00,sharedMem,stream,
               *(gpgpu_context **)((long)psVar2 + -0x10));
    psVar1 = (size_t *)((long)psVar2 + -0x10);
LAB_00106308:
    psVar2 = psVar1;
    psVar2[1] = (size_t)stream;
    *psVar2 = sharedMem;
    psVar2[-1] = 0x10631d;
    announce_call("unsigned int __cudaPushCallConfiguration(dim3, dim3, size_t, CUstream_st*)");
    stream = (CUstream_st *)psVar2[1];
    sharedMem = *psVar2;
  } while( true );
}

And the Docker image's version, which matches libcuda/cuda_runtime_api.cc and passes the return value of cudaConfigureCallInternal() through as its own return value:

uint __cudaPushCallConfiguration(dim3 gridDim,dim3 blockDim,size_t sharedMem,CUstream_st *stream)

{
  cudaError_t cVar1;
  dim3 blockDim_local;
  dim3 gridDim_local;
  
  if (2 < _g_debug_execution) {
    announce_call("unsigned int __cudaPushCallConfiguration(dim3, dim3, size_t, CUstream_st*)");
  }
  cVar1 = cudaConfigureCallInternal(gridDim,blockDim,sharedMem,stream,(gpgpu_context *)0x0);
  return cVar1;
}

In summary, it seems there is a missing return statement, which triggers undefined behavior and leads to unexpected results. There may be other instances of UB in the code as well, so using an older compiler might help, since it reproduces the compiler behavior the project was originally built with.
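
As a side note, this class of bug is easy to catch at build time: both gcc and clang diagnose it via -Wreturn-type (enabled by -Wall). A standalone repro with hypothetical names (push_config stands in for the real function):

// repro.cc -- compile with: g++ -O2 -Wall repro.cc
// gcc warns: "control reaches end of non-void function [-Wreturn-type]"
unsigned push_config(int debug_level) {  // hypothetical stand-in
  if (debug_level >= 3) {
    // announce_call(...);
  }
  // No return here: flowing off the end of a non-void function is UB.
  // The optimizer may then omit the function epilogue entirely, letting
  // control fall through into adjacent code, which is one way the
  // "infinite loop" seen in the decompiled output above can arise.
}

int main() { return static_cast<int>(push_config(0)); }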

I hope this helps.


jidle123 commented Oct 11, 2024

Has anyone dealt with this? My gcc/g++ version is 5.5.
Please give me some tips... I am trapped in the loop:

GPGPU-Sim PTX: __cudaRegisterFunction __nv_static_51__38_cuda_device_runtime_compute_80_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_ : hostFun 0x0x55b69c0674b0, fat_cubin_handle = 3
Warning: cannot find deviceFun __nv_static_51__38_cuda_device_runtime_compute_80_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_
GPGPU-Sim PTX: __cudaRegisterVar: hostVar = 0x55b69c33d680; deviceAddress = __nv_static_51__38_cuda_device_runtime_compute_80_cpp1_ii_8b1a5d37_set_kernel32; deviceName = __nv_static_51__38_cuda_device_runtime_compute_80_cpp1_ii_8b1a5d37_set_kernel32
GPGPU-Sim PTX: __cudaRegisterVar: Registering const memory space of 64 bytes

It eventually exhausted all my const memory and caused a core dump.
