[hip] Added hip_device_group_device to the runtime. #18790

AWoloszyn · 2024-10-16T14:36:31Z

This gives us an interface for creating a logical device from a set of physical hip devices. In a future PR I plan on removing the normal hip_device, but for now, until the device_group_device is completed and hardened, I am keeping the original around. There are also some optimizations to do for when we have a single device in our device group.
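As a rough illustration of the idea only (this is not the PR's implementation), a logical device built from a set of physical devices could be modeled as a small struct holding the physical device ordinals. All names below (`device_group_t`, `device_group_create`, `MAX_GROUP_DEVICES`) are hypothetical, and no HIP calls are made:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical upper bound on the number of physical devices in one group.
#define MAX_GROUP_DEVICES 8

// Hypothetical logical device: a list of physical device ordinals.
typedef struct device_group_t {
  size_t device_count;
  int device_ordinals[MAX_GROUP_DEVICES];
} device_group_t;

// Initializes |out_group| from |count| physical device ordinals.
// Returns false for an empty set, too many devices, or duplicate ordinals.
// On failure the contents of |out_group| are unspecified.
static bool device_group_create(const int* ordinals, size_t count,
                                device_group_t* out_group) {
  assert(out_group != NULL);
  if (ordinals == NULL || count == 0 || count > MAX_GROUP_DEVICES) return false;
  for (size_t i = 0; i < count; ++i) {
    for (size_t j = 0; j < i; ++j) {
      if (ordinals[i] == ordinals[j]) return false;  // Duplicate device.
    }
    out_group->device_ordinals[i] = ordinals[i];
  }
  out_group->device_count = count;
  return true;
}

// A group with exactly one device is the case the description calls out as
// worth optimizing; this helper only identifies that case.
static bool device_group_is_single_device(const device_group_t* group) {
  return group->device_count == 1;
}
```

A caller would pass the ordinals of the physical devices it wants grouped and check the return value before using the group.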

This implementation currently passes CTS (as well as the new CTS tests added for device groups), but there is some work to complete.

- Fix memory pooling (will be a follow-up PR).
- Make sure that collectives work as expected (follow-up PR).
- Optimize our synchronization.
  - Currently synchronization across physical GPUs goes through the host; we should be able to avoid that, but it will take some additional work.
- Rework the CTS tests a bit so that they are just normal CTS tests that get ignored if needed.
- Move any cuda-specific bits back out into cuda.
- Fix iree_hal_hip_device_queue_flush, which should no longer try to use the work queue.
- Audit all new functions and make sure static is used where necessary.

benvanik

A doozy! It'll take me a bit to go through all of this but I've sprinkled a few comments in to start.

How much of this code would change if you weren't trying to keep the old hip_device around? Are there any simplifications you could do? If so, do you think you'll remember them or should you tag them with TODO(#XXX) and track that in an issue? Given the complexity here I want to ensure we don't end up with both copies or shadows of the old copy living forever. It doesn't feel like much here would or should change if you have 1 device or N devices, and if you're just not sure about your code yet it's ok to let this sit in a branch for a bit while you get it to a point of stability. That's better than it getting context switched out of your head after it lands and us ending up with lingering design decisions that were made for short term staging.

The major thing I'm concerned about is the several extra vtables as they imply a level of decoupling that brings about a lot of complexity in the code. Updating any signature for any call now requires traversing several layers of indirection in several files (including shared utils and, given that it's in utils/, across other backends) and reading the code becomes more difficult. Given that I'm hell-bent on deleting HIP it feels like additional baggage for something that is unlikely to be reused. I know there's a hope of sharing this with CUDA but since that's not currently in the plans and CUDA would be the only mid-term/long-term user of it (maybe) the added cost feels hard to swallow for the project as a whole. Avoiding the vtables and keeping things simple is going to add the least overhead to the project, followed second by moving this out of utils/ and keeping it local to the hip target. Shared utils dirs should be for durable things that we want to ossify and be heavily reused both in-tree and out-of-tree - we may need this now but we don't want this forever :)

A good way to reason about HAL code is that it should be optimized for deletion/rewrites/refactorings: what we have will be deleted and rewritten several more times, the API will change as new devices/device types/features are introduced, and it's almost always better to have some duplication than it is to have things tightly coupled across the deletion/rewrite boundaries. A bulk of what's happening here in particular is plumbing, and plumbing pays the highest cost of spaghettification and the lowest cost of duplication (as find/replace can solve the duplication but can't solve the spaghetti).

Happy to chat more about this - I think we can simplify things and keep the scope small to unblock the work requiring this without adding too much extra complexity to the rest of the system. Anything that adds complexity just to HIP is fine and it's just the stuff that bleeds out of hip/ that is my concern.

benvanik · 2024-10-16T14:44:25Z

runtime/src/iree/hal/drivers/hip/hip_device_group_device.c

+// iree_hal_hip_device_group_device_t
+//===----------------------------------------------------------------------===//
+
+typedef enum iree_hip_device_group_device_commandbuffer_type_e {


command_buffer


benvanik · 2024-11-18T23:00:49Z

runtime/src/iree/hal/drivers/hip/hip_driver.c

+  IREE_ASSERT_ARGUMENT(base_driver);
+  IREE_ASSERT_ARGUMENT(out_device);
+
+  uint64_t multi_count = 0;


still not iree_host_size_t
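A minimal sketch of why a host-sized type is the natural fit for such a count, assuming `iree_host_size_t` is an alias of `size_t` (the `host_size_t` typedef and `checked_array_size` helper below are hypothetical stand-ins, not IREE APIs): counts that feed host allocations and array indexing can be overflow-checked against `SIZE_MAX` directly, with no narrowing conversion from `uint64_t` on 32-bit hosts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Stand-in for IREE's iree_host_size_t; assumed here to be an alias of size_t.
typedef size_t host_size_t;

// Computes count * element_size for a host allocation.
// Returns false (leaving |out_size| untouched) if the product would overflow.
static bool checked_array_size(host_size_t count, host_size_t element_size,
                               host_size_t* out_size) {
  assert(out_size != NULL);
  if (element_size != 0 && count > SIZE_MAX / element_size) return false;
  *out_size = count * element_size;
  return true;
}
```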

benvanik · 2024-11-18T23:02:28Z

runtime/src/iree/hal/drivers/hip/util/queue.c

+
+#include "iree/hal/drivers/hip/util/queue.h"
+
+#include "iree/base/api.h"


no need to include something in a .c already included in the header

Suggested change

#include "iree/base/api.h"

I know we are (at least loosely) following the google style guide, but just want to make sure we are intentionally ignoring it here.

Anywhere else we are ignoring it that I should know about?

https://google.github.io/styleguide/cppguide.html#Include_What_You_Use


benvanik

two minor fixes and then lgtm if you're happy with it and have confirmed the key use cases work - I'd maybe try shopping around to the folks working on sharded things (e.g. @aviator19941 as seen in #19428) and letting them sanity check the branch once you rebase (I don't think any of those things are covered in a CI today and if things don't work for it there may be a risk of rollback given the size of this PR - best to get ahead of it :)

benvanik · 2024-12-10T17:13:57Z

runtime/src/iree/hal/drivers/hip/per_device_information.h

+
+  iree_hal_hip_memory_pools_t memory_pools;
+
+  // Used in any place we need an event that is already signaled.


what is this comment for?

Oh hah, removed the event that I had there, but forgot the comment that goes with it.

benvanik · 2024-12-10T17:16:16Z

runtime/src/iree/hal/drivers/hip/hip_multi_queue_command_buffer.h

+// iree_hal_command_buffer_t deferred record/replay wrapper
+//===----------------------------------------------------------------------===//
+
+// Records a command buffer that records into multiple command buffers


Suggested change

// Records a command buffer that records into multiple command buffers

// Creates a command buffer that records into multiple command buffers


This prevents the long calls to iree_hal_deferred_command_buffer_apply and hipGraphLaunch from blocking the main thread and makes using multiple devices/streams from a single thread more reasonable. Signed-off-by: Andrew Woloszyn <[email protected]>

They don't actually work very well as the underlying allocator has an unbounded amount of slack in assigning allocations, so depending on the allocation patterns you can end up using significantly more memory than necessary. Signed-off-by: Andrew Woloszyn <[email protected]>

We returned the event after we added it to the semaphore, but if another thread ended up waiting on the event before we recorded it, we would incorrectly read "hipSuccess", think the event was complete, and advance the semaphore prematurely. Signed-off-by: Andrew Woloszyn <[email protected]>
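The race described in this commit message can be modeled without HIP. The sketch below is a single-threaded simulation with hypothetical names (`sim_event_t`, `sim_semaphore_t`, and friends); it only assumes what the message states, namely that querying an event that has not been recorded yet reports success. Recording before publishing is shown as one possible ordering that avoids the premature advance, not necessarily the fix the commit applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Simulated event lifecycle (hypothetical; no HIP calls are made).
typedef enum { EVENT_CREATED, EVENT_RECORDED, EVENT_COMPLETE } event_state_t;
typedef struct { event_state_t state; } sim_event_t;

// Mimics the behavior described above: an event that was never recorded
// reports "done", just like one whose work has finished.
static bool sim_event_query_done(const sim_event_t* e) {
  return e->state != EVENT_RECORDED;
}
static void sim_event_record(sim_event_t* e) { e->state = EVENT_RECORDED; }
static void sim_event_finish(sim_event_t* e) { e->state = EVENT_COMPLETE; }

// Simulated semaphore that advances when its pending event reports done.
typedef struct {
  sim_event_t* pending;
  unsigned long long value;
} sim_semaphore_t;

// Makes |e| visible to waiters.
static void sim_semaphore_publish(sim_semaphore_t* s, sim_event_t* e) {
  assert(s != NULL && e != NULL);
  s->pending = e;
}

// What a waiting thread does: advance if the pending event reports done.
static bool sim_semaphore_poll(sim_semaphore_t* s) {
  if (s->pending != NULL && sim_event_query_done(s->pending)) {
    s->value++;
    s->pending = NULL;
    return true;
  }
  return false;
}
```

Publishing an event in the `EVENT_CREATED` state and then polling advances the semaphore immediately, which is the bug; recording first keeps the poll from succeeding until `sim_event_finish` runs.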


chrsmcgrr · 2024-12-12T16:24:19Z

@AWoloszyn @benvanik @ScottTodd

Hey, we noticed this PR carried an llvm-project submodule change. Was this intended? It changed quite a lot of the history.

AWoloszyn · 2024-12-12T16:27:15Z

That was not intended, thanks for bringing that up! I will revert the submodule change.

AWoloszyn · 2024-12-12T16:36:44Z

#19476 There is the revert

AWoloszyn requested review from antiagainst, ScottTodd, nithinsubbiah and benvanik as code owners October 16, 2024 14:36

benvanik requested changes Oct 16, 2024

ScottTodd added the hal/hip (Runtime HIP HAL backend) label Oct 17, 2024

AWoloszyn force-pushed the multidevice branch from cd7443a to aaa52bc Compare October 19, 2024 00:39

AWoloszyn force-pushed the multidevice branch 3 times, most recently from 3fe45ae to f3019a6 Compare November 4, 2024 20:01

benvanik requested changes Nov 5, 2024

benvanik requested changes Nov 6, 2024

AWoloszyn force-pushed the multidevice branch from 3675196 to e1d13e4 Compare November 18, 2024 14:14

benvanik self-requested a review November 18, 2024 22:58

benvanik requested changes Nov 19, 2024

AWoloszyn force-pushed the multidevice branch from d974379 to c0494aa Compare November 19, 2024 21:45

benvanik requested changes Dec 2, 2024

AWoloszyn force-pushed the multidevice branch from 2b0b0da to 496d56a Compare December 6, 2024 14:29

AWoloszyn mentioned this pull request Dec 7, 2024

Enable rocm and vulkan build in CI workflow for PJRT plugin #19279

Draft

benvanik reviewed Dec 9, 2024

benvanik requested changes Dec 9, 2024

aviator19941 mentioned this pull request Dec 9, 2024

Llama 3.1 8b f16 sharded TP8 compiles but fails to run #19428

Closed

AWoloszyn force-pushed the multidevice branch from fd21354 to a043143 Compare December 10, 2024 17:05

benvanik self-requested a review December 10, 2024 17:19

benvanik approved these changes Dec 10, 2024

AWoloszyn force-pushed the multidevice branch from 751ff0c to 7d69b5b Compare December 10, 2024 19:23

AWoloszyn added 3 commits December 11, 2024 10:05

[hip] Multidevice rebase ontop of main.

3be9535

Instead of rebasing each of the individual 30+ changes, rebase the entire thing, because there were a number of conflicts against main. Signed-off-by: Andrew Woloszyn <[email protected]>

Fixed bug in SemaphoreSubmissionTest.

b0b753f

It was submitting command buffers with an empty affinity. Changed to IREE_HAL_QUEUE_AFFINITY_ANY instead. Signed-off-by: Andrew Woloszyn <[email protected]>
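To see why an empty affinity is a problem, it helps to treat queue affinity as a bitmask. The sketch below assumes that model, with hypothetical stand-ins (`queue_affinity_t`, `QUEUE_AFFINITY_ANY`, `select_queue`) rather than the real IREE definitions: an empty mask permits no queue at all, while an all-ones "any" mask permits every queue.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical stand-ins for a 64-bit queue affinity mask and its "any" value.
typedef uint64_t queue_affinity_t;
#define QUEUE_AFFINITY_ANY ((queue_affinity_t)-1)

// Returns the index of the first queue permitted by |affinity| among
// |queue_count| queues, or -1 if the mask permits none of them.
static int select_queue(queue_affinity_t affinity, int queue_count) {
  assert(queue_count >= 0);
  for (int i = 0; i < queue_count && i < 64; ++i) {
    if (affinity & ((queue_affinity_t)1 << i)) return i;
  }
  return -1;
}
```

With four queues, `select_queue(0, 4)` finds nothing to run on, whereas `select_queue(QUEUE_AFFINITY_ANY, 4)` picks queue 0.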

[hip] Reimplement async allocation/deallocation.

440ee21

This allows us to allocate/deallocate async so long as we are using the default hip allocator. Based on iree-org#19074 --------- Signed-off-by: Andrew Woloszyn <[email protected]>

AWoloszyn added 23 commits December 11, 2024 10:07

Updated the ordering in some files.

2ca1f3e

Signed-off-by: Andrew Woloszyn <[email protected]>

[hip] Updated the control flow in event_semhpore.

356e988

Signed-off-by: Andrew Woloszyn <[email protected]>

Removed IREE_UNLIKELY where not needed.

f9817f5

Signed-off-by: Andrew Woloszyn <[email protected]>

Some minor naming and whitespace fixes.

5bb2519

Signed-off-by: Andrew Woloszyn <[email protected]>

More minor fixups.

2d2b521

Signed-off-by: Andrew Woloszyn <[email protected]>

Rework the control flow in hip_device.c

f5c0a99

Signed-off-by: Andrew Woloszyn <[email protected]>

Fix build with tracing on.

1f1c7c3

Signed-off-by: Andrew Woloszyn <[email protected]>

Fix bad memcpy in queue.c

59a3cf5

Signed-off-by: Andrew Woloszyn <[email protected]>

Retain the semaphores while they are waiting for completion.

5e53fbc

Signed-off-by: Andrew Woloszyn <[email protected]>

Fix a missing condition that snuck in during a refactor.

15e1809

Signed-off-by: Andrew Woloszyn <[email protected]>

Some fixes to error propagation.

105a6dc

Signed-off-by: Andrew Woloszyn <[email protected]>

re-added memory pools

e625ca2

Signed-off-by: Andrew Woloszyn <[email protected]>

Responded to some PR comments

d15f23d

Signed-off-by: Andrew Woloszyn <[email protected]>

Update for hip_memory_pools_merge from upstream

79a1f78

Signed-off-by: Andrew Woloszyn <[email protected]>

Address some comments:

4c9dc06

Signed-off-by: Andrew Woloszyn <[email protected]>

Updated based on PR comments.

ed0aacf

Signed-off-by: Andrew Woloszyn <[email protected]>

Fixed some outdated comments.

1d4fdc5

Signed-off-by: Andrew Woloszyn <[email protected]>

Small fix after rebase.

0bc922e

Signed-off-by: Andrew Woloszyn <[email protected]>

Re-disable some of the hip tests.

89a9358

These are disabled before this change, continue to leave them disabled. Signed-off-by: Andrew Woloszyn <[email protected]>

Fixed clang-format that only showed up on the bots.

b6fcb58

Signed-off-by: Andrew Woloszyn <[email protected]>

AWoloszyn force-pushed the multidevice branch from 43cf8ac to b6fcb58 Compare December 11, 2024 15:14

AWoloszyn merged commit 0e71e72 into iree-org:main Dec 11, 2024
37 of 39 checks passed

AWoloszyn added a commit to AWoloszyn/iree that referenced this pull request Dec 12, 2024

Revert llvm submodule change that was accidentally added in iree-org#…

ea0eab3

…18790 Signed-off-by: Andrew Woloszyn <[email protected]>

Groverkss pushed a commit that referenced this pull request Dec 12, 2024

Revert llvm submodule change that was accidentally added in #18790 (#…

9b8595d

…19476) Signed-off-by: Andrew Woloszyn <[email protected]>
