
ci: test with mpirun -np 4, 1 GPU per rank #48

Merged 1 commit into icl-utk-edu:master on Jun 27, 2024

Conversation

@mgates3 (Collaborator) commented on May 9, 2023

The gpu_bind.sh script avoids oversubscribing GPUs, which can be detrimental; e.g., on a DGX, 4 ranks would otherwise each use all 8 GPUs. However, it no longer tests the multi-GPU-per-MPI-rank code. A rough sketch of the binding logic is shown below.
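The binding logic can be inferred from the wrapper's own log lines further down (local rank taken from OMPI_COMM_WORLD_LOCAL_RANK or MPI_LOCALRANKID, a list of usable GPUs, dev = local_rank mod ndev, exported via CUDA_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES). A minimal sketch along those lines, not the actual gpu_bind.sh, might look like the following; the GPU_BIND_GPUS variable and the nvidia-smi presence check are illustrative assumptions.

#!/bin/sh
# Illustrative sketch (not the actual gpu_bind.sh): bind each local MPI rank
# to one GPU so ranks on the same node do not oversubscribe devices,
# then exec the wrapped command.

# Local rank, from Open MPI or MPICH-style environment variables.
if [ -n "${OMPI_COMM_WORLD_LOCAL_RANK}" ]; then
    rank_var=OMPI_COMM_WORLD_LOCAL_RANK
    local_rank=${OMPI_COMM_WORLD_LOCAL_RANK}
elif [ -n "${MPI_LOCALRANKID}" ]; then
    rank_var=MPI_LOCALRANKID
    local_rank=${MPI_LOCALRANKID}
else
    rank_var=none
    local_rank=0
fi

# GPU kind, based on which vendor tool is present.
if command -v nvidia-smi > /dev/null 2>&1; then
    gpu_kind=cuda
else
    gpu_kind=rocm
fi

# List of GPUs this job may use. GPU_BIND_GPUS is an assumed variable;
# the real script apparently detects the list itself (see "gpus 2 3 4 5 6 7"
# in the log output below).
gpus="${GPU_BIND_GPUS:-0 1 2 3 4 5 6 7}"
ndev=$( echo ${gpus} | wc -w )

# Round-robin: local rank i gets the (i mod ndev)-th GPU in the list.
dev=$(( local_rank % ndev ))
visible=$( echo ${gpus} | cut -d ' ' -f $(( dev + 1 )) )

if [ "${gpu_kind}" = cuda ]; then
    export CUDA_VISIBLE_DEVICES=${visible}
else
    export ROCR_VISIBLE_DEVICES=${visible}
fi

echo "local_rank ${local_rank}, gpu_kind ${gpu_kind}, gpus ${gpus}, ndev ${ndev}, dev ${dev}, visible_devices ${visible}, rank_var ${rank_var}"

# Run the wrapped command, e.g., ./tester with its arguments.
exec "$@"

With a wrapper like this, mpirun -np 4 ./gpu_bind.sh ./tester ... gives each of the 4 local ranks exactly one visible device, which is why the multi-GPU-per-rank code paths are no longer exercised.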

@mgates3 (Collaborator, Author) commented on Jun 29, 2023

Commit 3c4cdcc has 3 failures

CUDA has 2 failures, one in Cholesky QR, one in SVD:

mpirun -np 4 ./gpu_bind.sh ./tester  --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100x50 cholqr
rank 1, gpu_kind cuda, gpus 2 3 4 5 6 7, ndev 6, dev 1, visible_devices 3, rank_var OMPI_COMM_WORLD_LOCAL_RANK
rank 2, gpu_kind cuda, gpus 2 3 4 5 6 7, ndev 6, dev 2, visible_devices 4, rank_var OMPI_COMM_WORLD_LOCAL_RANK
rank 3, gpu_kind cuda, gpus 2 3 4 5 6 7, ndev 6, dev 3, visible_devices 5, rank_var OMPI_COMM_WORLD_LOCAL_RANK
rank 0, gpu_kind cuda, gpus 2 3 4 5 6 7, ndev 6, dev 0, visible_devices 2, rank_var OMPI_COMM_WORLD_LOCAL_RANK
% SLATE version 2022.07.00, id 637beca
% input: ./tester --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100x50 cholqr
% 2023-06-29 13:05:33, 4 MPI ranks, CPU-only MPI, 8 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                                      
type  origin  target  cholQR   A       m       n    nb  ib    p    q  la  pt      error   time (s)       gflop/s  ref time (s)   ref gflop/s  status  
   s     dev     dev    auto   1     100     100     8  32    2    2   1   4        nan      0.205       0.00661            NA            NA  FAILED  
   s     dev     dev    auto   1     100      50     8  32    2    2   1   4   4.85e-10     0.0228        0.0186            NA            NA  pass    
mpirun -np 4 ./gpu_bind.sh ./tester  --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100x50 gesvd
rank 0, gpu_kind cuda, gpus 2 3 4 5 6 7, ndev 6, dev 0, visible_devices 2, rank_var OMPI_COMM_WORLD_LOCAL_RANK
rank 3, gpu_kind cuda, gpus 2 3 4 5 6 7, ndev 6, dev 3, visible_devices 5, rank_var OMPI_COMM_WORLD_LOCAL_RANK
rank 1, gpu_kind cuda, gpus 2 3 4 5 6 7, ndev 6, dev 1, visible_devices 3, rank_var OMPI_COMM_WORLD_LOCAL_RANK
rank 2, gpu_kind cuda, gpus 2 3 4 5 6 7, ndev 6, dev 2, visible_devices 4, rank_var OMPI_COMM_WORLD_LOCAL_RANK
% SLATE version 2022.07.00, id 637beca
% input: ./tester --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100x50 gesvd
% 2023-06-29 13:07:02, 4 MPI ranks, CPU-only MPI, 8 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                        
type  origin  target   A       jobu      jobvt       m       n    nb  ib    p    q  la  pt      error   time (s)  ref time (s)  status  
   s     dev     dev   1      novec      novec     100     100     8  32    2    2   1   4   1.84e-07      0.170        0.0248  pass    
[b056946f5bbe:2957951] *** Process received signal ***
[b056946f5bbe:2957951] Signal: Segmentation fault (11)
[b056946f5bbe:2957951] Signal code: Address not mapped (1)
[b056946f5bbe:2957951] Failing at address: 0xffffffffd4002328
[b056946f5bbe:2957951] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f67710bdcf0]
[b056946f5bbe:2957951] [ 1] /spack/opt/spack/linux-rocky8-x86_64/gcc-9.5.0/openmpi-4.1.5-mq4fge7ljjlzhbkj45nbsy7cwb677als/lib/libmpi.so.40(MPI_Bcast+0x55)[0x7f675886bc35]
[b056946f5bbe:2957951] [ 2] /spack/opt/spack/linux-rocky8-x86_64/gcc-9.5.0/intel-oneapi-mkl-2023.1.0-4wn3k3zcaxj2ie7ecz4o5irqkhhovxg2/mkl/2023.1.0/lib/intel64/libmkl_blacs_openmpi_lp64.so.2(MKLMPI_Bcast+0xdd)[0x7f677a2b9e6d]
[b056946f5bbe:2957951] [ 3] /spack/opt/spack/linux-rocky8-x86_64/gcc-9.5.0/intel-oneapi-mkl-2023.1.0-4wn3k3zcaxj2ie7ecz4o5irqkhhovxg2/mkl/2023.1.0/lib/intel64/libmkl_scalapack_lp64.so.2(PB_CpgemmMPI+0xcef)[0x7f6778b3c71f]
[b056946f5bbe:2957951] [ 4] /spack/opt/spack/linux-rocky8-x86_64/gcc-9.5.0/intel-oneapi-mkl-2023.1.0-4wn3k3zcaxj2ie7ecz4o5irqkhhovxg2/mkl/2023.1.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgemm_+0xdc0)[0x7f6778b90840]
[b056946f5bbe:2957951] [ 5] /spack/opt/spack/linux-rocky8-x86_64/gcc-9.5.0/intel-oneapi-mkl-2023.1.0-4wn3k3zcaxj2ie7ecz4o5irqkhhovxg2/mkl/2023.1.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgebrd_+0x863)[0x7f6778663623]
[b056946f5bbe:2957951] [ 6] /spack/opt/spack/linux-rocky8-x86_64/gcc-9.5.0/intel-oneapi-mkl-2023.1.0-4wn3k3zcaxj2ie7ecz4o5irqkhhovxg2/mkl/2023.1.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgesvd_+0x596)[0x7f67786721d6]
[b056946f5bbe:2957951] [ 7] ./tester[0x589861]
[b056946f5bbe:2957951] [ 8] ./tester[0x58acd8]
[b056946f5bbe:2957951] [ 9] ./tester[0x4a68b2]
[b056946f5bbe:2957951] [10] ./tester[0x4a7066]
[b056946f5bbe:2957951] [11] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f6757892d85]
[b056946f5bbe:2957951] [12] ./tester[0x4449ee]
[b056946f5bbe:2957951] *** End of error message ***
./gpu_bind.sh: line 59: 2957951 Segmentation fault      (core dumped) $@
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[22574,1],2]
  Exit code:    139
--------------------------------------------------------------------------
FAILED: gesvd, exit code 139
--------------------------------------------------------------------------------

AMD has 1 failure, only in the CMake build:

mpirun -np 4 ./gpu_bind.sh ./tester  --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --uplo l,u --dev-dist  c,r potrf
rank 1, gpu_kind rocm, gpus 0 1, ndev 2, dev 1, visible_devices 1, rank_var MPI_LOCALRANKID
rank 0, gpu_kind rocm, gpus 0 1, ndev 2, dev 0, visible_devices 0, rank_var MPI_LOCALRANKID
rank 2, gpu_kind rocm, gpus 0 1, ndev 2, dev 0, visible_devices 0, rank_var MPI_LOCALRANKID
rank 3, gpu_kind rocm, gpus 0 1, ndev 2, dev 1, visible_devices 1, rank_var MPI_LOCALRANKID
% SLATE version 2022.07.00, id 637beca
% input: ./tester --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --uplo l,u --dev-dist c,r potrf
% 2023-06-29 12:44:33, 4 MPI ranks, CPU-only MPI, 8 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                                                                                          
Timeout (limit=1200.0): potrf, exit code -9
--------------------------------------------------------------------------------
1 routines FAILED: potrf

@mgates3 (Collaborator, Author) commented on Jun 29, 2023

I can't reproduce the AMD error on histamine inside Docker. Bewildering. Unfortunately, I can't run Docker on dopamine right now.

@mgates3 (Collaborator, Author) commented on Jul 17, 2023

2 routines FAILED: hegv, gesvd
Both have segfaults. SVD may be fixed by the SVD PR.

@mgates3 (Collaborator, Author) commented on Sep 28, 2023

On CUDA, hegv and svd have segfaults, which appear to be inside Open MPI, called from the ScaLAPACK BLACS layer:

mpirun -np 4 ./gpu_bind.sh ./tester  --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --jobz n,v --itype 1,2,3 --uplo l,u hegv
local_rank 1, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 1, visible_devices 1, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 2, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 2, visible_devices 2, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 0, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 0, visible_devices 0, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 3, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 3, visible_devices 3, rank_var OMPI_COMM_WORLD_LOCAL_RANK
% SLATE version 2023.08.25, id 4ee7fd2
% input: ./tester --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --jobz n,v --itype 1,2,3 --uplo l,u hegv
% 2023-09-28 03:01:04, 4 MPI ranks, CPU-only MPI, 8 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                                   
type  origin  target   A   B   C   jobz    uplo       n  itype    nb  ib    p    q  la  pt      error     error2   time (s)  ref time (s)  status  
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x300014e8
[ 0] /lib64/libc.so.6(+0x54df0)[0x7fbdc4186df0]
[ 1] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-x3joar7jwecznna2oshzb3yhf74gthie/lib/libmpi.so.40(PMPI_Comm_size+0x37)[0x7fbdc4747407]
[ 2] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_blacs_openmpi_lp64.so.2(MKLMPI_Comm_size+0x2a)[0x7fbde3c7451a]
[ 3] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(PB_CpgemmMPI+0x15c)[0x7fbde41ddb5c]
[ 4] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgemm_+0xdc0)[0x7fbde4232810]
[ 5] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(pssyngst_+0xcfc)[0x7fbde3d7111c]
[ 6] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(pssygvx_+0xcad)[0x7fbde3d6fa9d]
[ 7] ./tester[0x67f3f4]
[ 8] ./tester[0x4c3e6a]
[ 9] ./tester[0x4c4633]
[10] /lib64/libc.so.6(+0x3feb0)[0x7fbdc4171eb0]
[11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7fbdc4171f60]
[12] ./tester[0x4579c5]
*** End of error message ***


mpirun -np 4 ./gpu_bind.sh ./tester  --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100 --dim 100x50 --dim 50x100 --dim 25x50x75 --jobu n --jobvt n svd
local_rank 0, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 0, visible_devices 0, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 1, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 1, visible_devices 1, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 2, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 2, visible_devices 2, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 3, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 3, visible_devices 3, rank_var OMPI_COMM_WORLD_LOCAL_RANK
% SLATE version 2023.08.25, id 4ee7fd2
% input: ./tester --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100 --dim 100x50 --dim 50x100 --dim 25x50x75 --jobu n --jobvt n svd
% 2023-09-28 03:01:12, 4 MPI ranks, CPU-only MPI, 8 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                                                         
type  origin  target   A       jobu      jobvt       m       n    nb  ib    p    q  la  pt   S - Sref   Backward    U orth.    V orth.   time (s)  ref time (s)  status  
   s     dev     dev   1      novec      novec     100     100     8  32    2    2   1   4   2.00e-07         NA         NA         NA     0.0912        0.0191  pass    
   s     dev     dev   1      novec      novec     100     100     8  32    2    2   1   4   1.86e-07         NA         NA         NA     0.0685        0.0185  pass    
   s     dev     dev   1      novec      novec     100      50     8  32    2    2   1   4   1.12e-07         NA         NA         NA     0.0818       0.00997  pass    
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0xfffffffff814dc18
[ 0] /lib64/libc.so.6(+0x54df0)[0x7f0585fc9df0]
[ 1] /lib64/libc.so.6(+0x54df0)[0x7f4a71de4df0]
[ 1] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-x3joar7jwecznna2oshzb3yhf74gthie/lib/libmpi.so.40(PMPI_Comm_size+0x37)[0x7f058658a407]
[ 2] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_blacs_openmpi_lp64.so.2(MKLMPI_Comm_size+0x2a)[0x7f05a5ab751a]
[ 3] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-x3joar7jwecznna2oshzb3yhf74gthie/lib/libmpi.so.40(MPI_Bcast+0x55)[0x7f4a7239ebe5]
[ 2] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_blacs_openmpi_lp64.so.2(MKLMPI_Bcast+0xdd)[0x7f4a918d1dcd]
[ 3] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(PB_CpgemmMPI+0x15c)[0x7f05a6020b5c]
[ 4] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(PB_CpgemmMPI+0xcef)[0x7f4a91e3c6ef]
[ 4] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgemm_+0xdc0)[0x7f05a6075810]
[ 5] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgemm_+0xdc0)[0x7f4a91e90810]
[ 5] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgebrd_+0x863)[0x7f05a5b48613]
[ 6] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgebrd_+0x863)[0x7f4a91963613]
[ 6] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgesvd_+0x596)[0x7f05a5b571c6]
[ 7] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgesvd_+0x596)[0x7f4a919721c6]
[ 7] ./tester[0x79376a]
[ 8] ./tester[0x4c3e6a]
[ 9] ./tester[0x79376a]
[ 8] ./tester[0x4c4633]
[10] ./tester[0x4c3e6a]
[ 9] ./tester[0x4c4633]
[10] /lib64/libc.so.6(+0x3feb0)[0x7f0585fb4eb0]
[11] /lib64/libc.so.6(+0x3feb0)[0x7f4a71dcfeb0]
[11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f0585fb4f60]
[12] ./tester[0x4579c5]
*** End of error message ***


mpirun -np 4 ./gpu_bind.sh ./tester  --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100 --dim 100x50 --dim 50x100 --dim 25x50x75 --jobu v --jobvt v svd
local_rank 0, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 0, visible_devices 0, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 1, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 1, visible_devices 1, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 2, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 2, visible_devices 2, rank_var OMPI_COMM_WORLD_LOCAL_RANK
local_rank 3, gpu_kind cuda, available_gpus 0 1 2 3 4 5 6 7, ndev 8, dev 3, visible_devices 3, rank_var OMPI_COMM_WORLD_LOCAL_RANK
% SLATE version 2023.08.25, id 4ee7fd2
% input: ./tester --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100 --dim 100x50 --dim 50x100 --dim 25x50x75 --jobu v --jobvt v svd
% 2023-09-28 03:01:15, 4 MPI ranks, CPU-only MPI, 8 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                                                         
type  origin  target   A       jobu      jobvt       m       n    nb  ib    p    q  la  pt   S - Sref   Backward    U orth.    V orth.   time (s)  ref time (s)  status  
   s     dev     dev   1        vec        vec     100     100     8  32    2    2   1   4   3.82e-07   1.46e-08   1.36e-07   1.44e-07      0.211        0.0191  pass    
   s     dev     dev   1        vec        vec     100     100     8  32    2    2   1   4   4.17e-07   1.10e-08   1.38e-07   1.46e-07      0.187        0.0186  pass    
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0xffffffffb876d258
[ 0] /lib64/libc.so.6(+0x54df0)[0x7ff641c16df0]
[ 1] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-x3joar7jwecznna2oshzb3yhf74gthie/lib/libmpi.so.40(PMPI_Comm_size+0x37)[0x7ff6421d7407]
[ 2] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_blacs_openmpi_lp64.so.2(MKLMPI_Comm_size+0x2a)[0x7ff66170451a]
[ 3] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(PB_CpgemmMPI+0x15c)[0x7ff661c6db5c]
[ 4] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgemm_+0xdc0)[0x7ff661cc2810]
[ 5] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgebrd_+0x863)[0x7ff661795613]
[ 6] /spack/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/intel-oneapi-mkl-2023.2.0-lfqiyi7r5utbch4skfb7a7sp4c7xuien/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2(psgesvd_+0x596)[0x7ff6617a41c6]
[ 7] ./tester[0x79376a]
[ 8] ./tester[0x4c3e6a]
[ 9] ./tester[0x4c4633]
[10] /lib64/libc.so.6(+0x3feb0)[0x7ff641c01eb0]
[11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7ff641c01f60]
[12] ./tester[0x4579c5]
*** End of error message ***

On ROCm, geqrf has one accuracy issue:

mpirun -np 4 ./gpu_bind.sh ./tester  --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100x50 --dim 50x100 geqrf
local_rank 0, gpu_kind rocm, available_gpus 0 1, ndev 2, dev 0, visible_devices 0, rank_var MPI_LOCALRANKID
local_rank 1, gpu_kind rocm, available_gpus 0 1, ndev 2, dev 1, visible_devices 1, rank_var MPI_LOCALRANKID
local_rank 2, gpu_kind rocm, available_gpus 0 1, ndev 2, dev 0, visible_devices 0, rank_var MPI_LOCALRANKID
local_rank 3, gpu_kind rocm, available_gpus 0 1, ndev 2, dev 1, visible_devices 1, rank_var MPI_LOCALRANKID
% SLATE version 2023.08.25, id 4ee7fd2
% input: ./tester --origin d --target d --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --dim 100x50 --dim 50x100 geqrf
% 2023-09-28 03:07:40, 4 MPI ranks, CPU-only MPI, 8 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                                      
type  origin  target  cholQR   A       m       n    nb  ib    p    q  la  pt      error   time (s)       gflop/s  ref time (s)   ref gflop/s  status  
   s     dev     dev    auto   1     100     100     8  32    2    2   1   4   5.03e-09      1.393      0.000972            NA            NA  pass    
   s     dev     dev    auto   1     100      50     8  32    2    2   1   4   4.45e-09     0.0440       0.00965            NA            NA  pass    
   s     dev     dev    auto   1      50     100     8  32    2    2   1   4   2.45e-02     0.0661       0.00649            NA            NA  FAILED  

@mgates3 mentioned this pull request on Oct 25, 2023
@mgates3 force-pushed the ci_mpi branch 2 times, most recently from b4cc33f to 22dec52 on February 9, 2024
@mgates3 (Collaborator, Author) commented on Feb 9, 2024

Current failures:

** Tests **
> mpirun -np 4 ./gpu_bind.sh ./tester  --origin s --target h --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 potri
terminate called after throwing an instance of 'std::out_of_range'

> mpirun -np 4 ./gpu_bind.sh ./tester  --origin s --target h --ref n --nb 8 --type s,d,c,z --lookahead 1 --dim 100 --uplo l,u --diag n,u trtri
terminate called after throwing an instance of 'std::out_of_range'

2 routines FAILED: potri, trtri


** Examples **
> mpirun -np 4 ./ex08_linear_system_indefinite s d c z
terminate called after throwing an instance of 'std::out_of_range'

> mpirun -np 4 ./ex12_generalized_hermitian_eig s d c z
terminate called after throwing an instance of 'std::out_of_range'

> mpirun -np 4 c_api/ex06_linear_system_lu s d c z
mpi_size 4, grid_p 2, grid_q 2
rank 0: test_lu_r32
rank 0: test_lu_inverse_r32

rank 0: test_lu_r64
rank 0: test_lu_inverse_r64

rank 0: test_lu_c32
rank 0: test_lu_inverse_c32

rank 0: test_lu_c64
rank 0: test_lu_inverse_c64
terminate called after throwing an instance of 'std::out_of_range'

3 routines FAILED: ./ex08_linear_system_indefinite, ./ex12_generalized_hermitian_eig, c_api/ex06_linear_system_lu

@mgates3 merged commit b89a59d into icl-utk-edu:master on Jun 27, 2024
8 checks passed
@mgates3 mentioned this pull request on Jun 30, 2024