-
Notifications
You must be signed in to change notification settings - Fork 638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[hip] Added hip_device_group_device to the runtime. #18790
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A doozy! It'll take me a bit to go through all of this but I've sprinkled a few comments in to start.
How much of this code would change if you weren't trying to keep the old hip_device around? Are there any simplifications you could do? If so, do you think you'll remember them or should you tag them with TODO(#XXX)
and track that in an issue? Given the complexity here I want to ensure we don't end up with both copies or shadows of the old copy living forever. It doesn't feel like much here would or should change if you have 1 device or N devices and if you're just not sure about your code yet it's ok to let this sit in a branch for a bit while you get it to a point of stability. It's better that then it getting context switched out of your head after it lands and then we end up with lingering design decisions that were made for short term staging.
The major thing I'm concerned about is the several extra vtables as they imply a level of decoupling that brings about a lot of complexity in the code. Updating any signature for any call now requires traversing several layers of indirection in several files (including shared utils and given that it's in utils/ across other backends) and reading the code becomes more difficult. Given that I'm hell-bent on deleting HIP it feels like additional baggage for something that is unlikely to be reused. I know there's a hope of sharing this with CUDA but since that's not currently in the plans and CUDA would be the only mid-term/long-term user of it (maybe) the added cost feels hard to swallow for the project as a whole. Avoiding the vtables and keeping things simple is going to add the least overhead to the project followed second by moving this out of utils/ and keeping it local to the hip target would be best. Shared utils dirs should be for durable things that we want to ossify and be heavily reused both in-tree and out-of-tree - we may need this now but we don't want this forever :)
A good way to reason about HAL code is that is should be optimized for deletion/rewrites/refactorings: what we have will be deleted and rewritten several more times, the API will change as new devices/device types/features are introduced, and it's almost always better to have some duplication than it is to have things tightly coupled across the deletion/rewrite boundaries. A bulk of what's happening here in particular is plumbing, and plumbing pays the highest cost of spaghettification and the lowest cost of duplication (as find/replace can solve the duplication but can't solve the spaghetti).
Happy to chat more about this - I think we can simplify things and keep the scope small to unblock the work requiring this without adding too much extra complexity to the rest of the system. Anything that adds complexity just to HIP is fine and it's just the stuff that bleeds out of hip/ that is my concern.
// iree_hal_hip_device_group_device_t | ||
//===----------------------------------------------------------------------===// | ||
|
||
typedef enum iree_hip_device_group_device_commandbuffer_type_e { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
command_buffer
cd7443a
to
aaa52bc
Compare
3fe45ae
to
f3019a6
Compare
3675196
to
e1d13e4
Compare
IREE_ASSERT_ARGUMENT(base_driver); | ||
IREE_ASSERT_ARGUMENT(out_device); | ||
|
||
uint64_t multi_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not iree_host_size_t
|
||
#include "iree/hal/drivers/hip/util/queue.h" | ||
|
||
#include "iree/base/api.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to include something in a .c already included in the header
#include "iree/base/api.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we are (at least loosely) following the google style guide, but just want to make sure we are intentionally ignoring it here.
Anywhere else we are ignoring it that I should know about?
https://google.github.io/styleguide/cppguide.html#Include_What_You_Use
d974379
to
c0494aa
Compare
2b0b0da
to
496d56a
Compare
fd21354
to
a043143
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two minor fixes and then lgtm if you're happy with it and have confirmed the key use cases work - I'd maybe try shopping around to the folks working on sharded things (e.g. @aviator19941 as seen in #19428) and letting them sanity check the branch once you rebase (I don't think any of those things are covered in a CI today and if things don't work for it there may be a risk of rollback given the size of this PR - best to get ahead of it :)
|
||
iree_hal_hip_memory_pools_t memory_pools; | ||
|
||
// Used in any place we need an event that is already signaled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this comment for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh hah, removed the event that I had there, but forgot the comment that goes with it.
// iree_hal_command_buffer_t deferred record/replay wrapper | ||
//===----------------------------------------------------------------------===// | ||
|
||
// Records a command buffer that records into multiple command buffers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Records a command buffer that records into multiple command buffers | |
// Creates a command buffer that records into multiple command buffers |
751ff0c
to
7d69b5b
Compare
Instead of rebasing each of the individual 30+ changes, rebase the entire thing, because there were a number of conflicts against main. Signed-off-by: Andrew Woloszyn <[email protected]>
It was submitting command buffers with an empty affinity. Changed to IREE_HAL_QUEUE_AFFINITY_ANY instead. Signed-off-by: Andrew Woloszyn <[email protected]>
This allows us to allocate/deallocate async so long as we are using the default hip allocator. Based on iree-org#19074 --------- Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
This prevents the long calls to iree_hal_deferred_command_buffer_apply and hipGraphLaunch from blocking the main thread and makes using using muliple devices/streams from a single thread more reasonable. Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
They don't actually work very well as the underlying allocator has an unbounded amount of slack in assigning allocations, so depending on the allocation patterns you can end up using significantly more memory than necessary. Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
We returned the event after we added it to the semaphore, but if another thread ended up waiting on the event before we recorded we would incorrectly read "hipSuccess" thinking the event was complete, and advancing the semaphore prematurely. Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
These are disabled before this change, continue to leave them disabled. Signed-off-by: Andrew Woloszyn <[email protected]>
Signed-off-by: Andrew Woloszyn <[email protected]>
43cf8ac
to
b6fcb58
Compare
@AWoloszyn @benvanik @ScottTodd hey we noticed this PR carried a llvm-project submodule change. Was this intended? It changed quite a lot of the history. |
That was not intended thanks for bringing that up! I will revert the submodule change. |
…18790 Signed-off-by: Andrew Woloszyn <[email protected]>
#19476 There is the revert |
…19476) Signed-off-by: Andrew Woloszyn <[email protected]>
This gives us an interface for creating a logical device from a set of physical hip devices. In a future PR I plan on removing the normal hip_device ut for now, until the device_group_device is completed and hardened, I am keeping the original around. There are also some optimizations to do for when we have a single device in our device group.
This implementation currently passes CTS (as well as the new CTS tests added for device groups), but there is some work to complete.
iree_hal_hip_device_queue_flush
which should no longer try and use the work queue.