-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding CI scripts #45
Adding CI scripts #45
Conversation
Thanks for setting this up, let me know if you run into any issues. |
One question so far: the intel-mkl package in spack has been deprecated in favor of the oneapi version (intel-oneapi-mkl). We were previously using intel-mkl for the MKL heffte backend. Is it sufficient to use just the ONEAPI backend and deprecate the MKL backend, or do they serve different purposes? |
The MKL backend is CPU only and doesn't require the Intel SYCL compiler. This is needed for many platforms regardless of the spack deprecation. However, we don't need a separate test for all possible backend combinations (simply not practical). It is Ok to just test MKL + OneAPI in spack, especially if the stand-alone MKL package has been deperecated. We can even deprecate the standalone |
Maybe the +mkl variant in the spack package should be changed to use intel-oneapi-mkl instead of intel-mkl? |
Another possibility, haven't tested it but it should work. |
I have most of the checks passing now. Do you think there is a good chance that we can get the two failing checks to pass, or would you be OK with implementing the CI the way it is now? |
Generally speaking, so long as most of the tests are working, then the CI is good. The Intel test seems a problem with GPU, it is complaining about missing fp64 capabilities. Is it detecting the correct device? The ROCm is also puzzling, it is failing to launch a basic scaling kernel, which is just that, scales a vector by a fixed number. Both of those could be some issue with the environment selecting the correct device and/or permissions. But this is just a guess, I'm not familiar with the environment. |
For the intel gpu, I think it is actually missing the fp64 capability. The GPU is an Arc770 (consumer card). We had to disable the double precision tests in slate for this. Is that a possibility here? |
Unfortunately, heFFTe doesn't have that option at the moment. I could add it in time, but it is a significant intrusion into the code. It's probably best to either disable the test for now, or don't use the GPU, just run SYCL on the CPU (fallback mode). That still tests the GPU kernels and it will work with all precision. |
Just FYI: I'm keeping this PR as a draft until we finalize the slurm queue policies at our site. The new CI system runs the jobs through our queues, and I need to make sure the CI jobs don't interfere with normal users. |
No worries, I won't merge anything until you say it's ready (at which point you can merge it yourself). |
The PR is @G-Ragghianti's work as it is connected to the machines at IC. Given the lack of post-ECP funding, I don't know what the status is here or if this is even feasible at the moment. I am doing now is I have a series of automated scripts that run docker, same as a CI would except it is not automatically triggered from github. That's always been a challenge due to security and authentication. The free github runners are a bit deficient, no GPU support of any kind and heFFTe needs at least 12 MPI ranks to have a non-symmetric data layout of boxes. I can try to setup one or two free github tests, would that be sufficient? Or do I need to find some alternatives. |
We intend to run these on our site-hosted runners. They are currently running as dedicated runners, but this PR was created specifically to use an on-demand runner system that we have yet to fully deploy. I can convert this PR to use the dedicated runners that we currently have. We intend to support this despite the end of ECP funding. |
@G-Ragghianti it would be of great help to support this, even if the tests don't span the entire range of supported GPUs and build configurations. I don't know if @ax3l has a time-frame in mind. |
I've updated the PR to use the currently available self-hosted runners that we have. There is still a small problem with FFTW module not being compatible with our module for openmpi. I will fix this, and then it can be merged. |
450fd01
to
17894e8
Compare
Yay! All green :) |
No description provided.