
PTX CUDA API: Segmentation fault, stuck in loop #302

Open
rohanjuneja opened this issue Jul 23, 2024 · 2 comments

Comments


rohanjuneja commented Jul 23, 2024

While executing GPGPU-Sim, I am getting a segmentation fault. Looking at the log file, the PTX CUDA API appears to be stuck in a loop.
(screenshot of the log output)

Executing the CUTLASS GEMM implementation also throws a segmentation fault.
(screenshot of the error output)


GCC314 commented Aug 30, 2024

tl;dr: this is likely UB; using an older gcc/g++ may help.

I ran into the same problem when using g++-9 as my compiler. Using gdb, I found my program repeatedly looping in the function __cudaPushCallConfiguration() at libcuda/cuda_runtime_api.cc:3592:

unsigned CUDARTAPI __cudaPushCallConfiguration(dim3 gridDim, dim3 blockDim,
                                               size_t sharedMem = 0,
                                               struct CUstream_st *stream = 0) {
  if (g_debug_execution >= 3) {
    announce_call(__my_func__);
  }
  cudaConfigureCallInternal(gridDim, blockDim, sharedMem, stream);
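  // <-- no return statement: the function is declared to return unsigned,
  // so control flowing off the end here is undefined behavior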
}

I checked that g_debug_execution is always 0, so the if branch should never be executed.
However, __cudaPushCallConfiguration() is declared to return an unsigned value but has no return statement, so control flows off the end of a non-void function, which is undefined behavior.
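
For reference, here is a minimal sketch of the likely fix: simply forward the result of cudaConfigureCallInternal(), assuming that function returns a cudaError_t as the decompiled Docker build below suggests. This is an untested sketch, not an official patch:

unsigned CUDARTAPI __cudaPushCallConfiguration(dim3 gridDim, dim3 blockDim,
                                               size_t sharedMem = 0,
                                               struct CUstream_st *stream = 0) {
  if (g_debug_execution >= 3) {
    announce_call(__my_func__);
  }
  // Forward the configuration result so the function no longer
  // falls off the end of a non-void function.
  return cudaConfigureCallInternal(gridDim, blockDim, sharedMem, stream);
}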

Furthermore, I wanted to know whether this is a mistake or an intentional trick, so I used Ghidra to decompile cuda_runtime_api.o from my own build and from this Docker image of GPGPU-Sim, and compared their versions of __cudaPushCallConfiguration().

My version, which turned out to be a nonsensical loop:

uint __cudaPushCallConfiguration(dim3 gridDim,dim3 blockDim,size_t sharedMem,CUstream_st *stream)

{
  size_t *psVar1;
  size_t *psVar2;
  dim3 blockDim_00;
  gpgpu_context *apgStack_50 [3];
  undefined auStack_38 [16];
  dim3 blockDim_local;
  dim3 gridDim_local;
  
  psVar2 = (size_t *)auStack_38;
  blockDim_local.x = (uint)blockDim._0_8_;
  blockDim_local.y = SUB84(blockDim._0_8_,4);
  blockDim_local.z = blockDim.z;
  psVar1 = (size_t *)auStack_38;
  if (2 < _g_debug_execution) goto LAB_00106308;
  do {
    *(undefined8 *)((long)psVar2 + -0x10) = 0;
    *(undefined8 *)((long)psVar2 + -0x18) = 0x106303;
    blockDim_00.z = *(uint *)((long)psVar2 + 0x18);
    blockDim_00._0_8_ = *(undefined8 *)((long)psVar2 + 0x10);
    cudaConfigureCallInternal
              (*(dim3 *)((long)psVar2 + 0x20),blockDim_00,sharedMem,stream,
               *(gpgpu_context **)((long)psVar2 + -0x10));
    psVar1 = (size_t *)((long)psVar2 + -0x10);
LAB_00106308:
    psVar2 = psVar1;
    psVar2[1] = (size_t)stream;
    *psVar2 = sharedMem;
    psVar2[-1] = 0x10631d;
    announce_call("unsigned int __cudaPushCallConfiguration(dim3, dim3, size_t, CUstream_st*)");
    stream = (CUstream_st *)psVar2[1];
    sharedMem = *psVar2;
  } while( true );
}

And the Docker image's version, which matches libcuda/cuda_runtime_api.cc and passes the return value of cudaConfigureCallInternal() through as its own return value:

uint __cudaPushCallConfiguration(dim3 gridDim,dim3 blockDim,size_t sharedMem,CUstream_st *stream)

{
  cudaError_t cVar1;
  dim3 blockDim_local;
  dim3 gridDim_local;
  
  if (2 < _g_debug_execution) {
    announce_call("unsigned int __cudaPushCallConfiguration(dim3, dim3, size_t, CUstream_st*)");
  }
  cVar1 = cudaConfigureCallInternal(gridDim,blockDim,sharedMem,stream,(gpgpu_context *)0x0);
  return cVar1;
}

In summary, it seems there is a missing return statement, which triggers undefined behavior and leads to unexpected results. There may be other instances of UB in the code as well, so using an older compiler might help, since it reproduces the compiler behavior the project was originally built with.
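
As a side note, this class of bug is easy to catch at build time: both gcc and clang diagnose it via -Wreturn-type (enabled by -Wall). A standalone repro with hypothetical names (push_config stands in for the real function):

// repro.cc -- compile with: g++ -O2 -Wall repro.cc
// gcc warns: "control reaches end of non-void function [-Wreturn-type]"
unsigned push_config(int debug_level) {  // hypothetical stand-in
  if (debug_level >= 3) {
    // announce_call(...);
  }
  // No return here: flowing off the end of a non-void function is UB.
  // The optimizer may then omit the function epilogue entirely, letting
  // control fall through into adjacent code, which is one way the
  // "infinite loop" seen in the decompiled output above can arise.
}

int main() { return static_cast<int>(push_config(0)); }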

I hope this helps.


jidle123 commented Oct 11, 2024

Has anyone dealt with this? My gcc/g++ version is 5.5.
Please give me some tips... I am trapped in the loop:

GPGPU-Sim PTX: __cudaRegisterFunction __nv_static_51__38_cuda_device_runtime_compute_80_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_ : hostFun 0x0x55b69c0674b0, fat_cubin_handle = 3
Warning: cannot find deviceFun __nv_static_51__38_cuda_device_runtime_compute_80_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_
GPGPU-Sim PTX: __cudaRegisterVar: hostVar = 0x55b69c33d680; deviceAddress = __nv_static_51__38_cuda_device_runtime_compute_80_cpp1_ii_8b1a5d37_set_kernel32; deviceName = __nv_static_51__38_cuda_device_runtime_compute_80_cpp1_ii_8b1a5d37_set_kernel32
GPGPU-Sim PTX: __cudaRegisterVar: Registering const memory space of 64 bytes

It eventually exhausted all my const memory and caused a core dump.
