Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When working #157, I ran into a few errors when using a column-major device distribution with the LU routines.
The first issue is when using GPU->GPU copies, we weren't ensuring the correct synchronization. Specifically, we were just assumed that any previous copies into
src_device
were complete. We had ensured this when the previous copy was GPU->GPU, but not if the previous copy was Host->GPU.fc0a5c
fixes this by added some extra queue synchronizations. On Saturn's 16 gtx1060 GPUs w/ single precision, this caused about a 5% performance hit, but is needed for correctness. (The issue and fix only affects multiple GPUs/proc, so it doesn't affect Frontier.) Using CUDA streams et al., we might be able to improve that by preventing the host synchronizations. But, I didn't think I'd have enough time to do that this week with the other things I need to finish up.The second issue is that
tileLayoutConvert(set<ij_tuple> tiles, int device, ...)
uses different locking fromtileGet
, so doing two calls totileGet(set<ij_tuple> tiles, int device, ...)
simultaneously can result in the layout of a tile instance being converted while another tile is trying to copy it.26ef41
fixes this by replacing the batch transpose with individual, per tile transposes withintileCopyDataLayout
. On Frontier, this doesn't seem to affect the performance ofgemm
orgesv
. (Interestingly, it seems to give a few percent speedup on Saturn's 16 gtx1060 GPUs w/ single precision.)