Host and Device Memory Pool with initial CUDA support + port of existing code. #201
Conversation
Leaving some general comments as requested
```diff
 use v4l::{v4l2, Device};

 // A specialized V4L stream that uses Copper Buffers for memory management.
 pub struct CuV4LStream {
     v4l_handle: Arc<Handle>,
     v4l_buf_type: Type,
-    memory_pool: Rc<CuHostMemoryPool>,
+    memory_pool: CuHostMemoryPool<Vec<u8>>,
```
Should streams own pools, or borrow them from the application? Maybe creating a single big pool and splitting it out per device would be helpful. Granted, that's what the OS is doing when you ask to create a pool in the first place, but if we have all these sizes fixed at compile time, it's a nice way to measure memory footprint.
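One way to read the "borrow" option: the application owns the pool behind an `Arc` and each stream keeps a clone, so the total footprint falls out of sizes declared in one place. A minimal sketch with stand-in types (`HostPool` and `Stream` are hypothetical, not the crate's API):

```rust
use std::sync::Arc;

// Stand-in for CuHostMemoryPool, just to illustrate ownership.
struct HostPool {
    buffer_size: usize,
    buffer_count: usize,
}

// The stream borrows the pool from the application instead of owning it.
struct Stream {
    pool: Arc<HostPool>,
}

fn main() {
    // The application creates the pool once; with the sizes known up front,
    // the total footprint is a simple product.
    let pool = Arc::new(HostPool { buffer_size: 4 << 20, buffer_count: 8 });
    let _camera = Stream { pool: Arc::clone(&pool) };
    let _encoder = Stream { pool: Arc::clone(&pool) };
    println!("footprint: {} bytes", pool.buffer_size * pool.buffer_count);
}
```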
```rust
pool: Weak<CuHostMemoryPool>,

/// A shareable handle to an Array coming from a pool (either host or device).
#[derive(Clone, Debug)]
pub struct CuHandle<T: ArrayLike>(Arc<Mutex<CuHandleInner<T>>>);
```
I imagine that handles are going to have low contention, right? They're shared just between the pool and the user?
Yes. Maybe the exception is if you have some driver trying to do an async operation on it.
So my jam here is: if you have a 4 MB matrix or image, a lock of a few microseconds for a multi-millisecond action on it will be negligible. If you need totally lock-free operation, then you use the copperlist (but then you might need to copy). There is no free lunch :P
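To put rough numbers on that tradeoff, here is a small sketch with the same `Arc<Mutex<...>>` shape as `CuHandle`: acquiring the lock costs microseconds next to the multi-millisecond work it guards (the sleep stands in for real processing).

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // A 4 MB "image" behind the same Arc<Mutex<...>> shape as CuHandle.
    let handle = Arc::new(Mutex::new(vec![0u8; 4 << 20]));

    let worker = {
        let handle = Arc::clone(&handle);
        thread::spawn(move || {
            // Acquiring the lock is microseconds; the processing it guards
            // runs for milliseconds, so the lock overhead is in the noise.
            let mut image = handle.lock().unwrap();
            thread::sleep(Duration::from_millis(5)); // stand-in for real work
            image[0] = 42;
        })
    };

    worker.join().unwrap();
    assert_eq!(handle.lock().unwrap()[0], 42);
}
```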
```rust
}

fn copy_from<O: ArrayLike<Element = T::Element>>(&self, from: &mut CuHandle<O>) -> CuHandle<T> {
    let to_handle = self.acquire().expect("No available buffers in the pool");
```
`copy_from` should probably return a `Result`; I imagine that running out of buffers isn't hard to do. A try and a non-try function would make it accessible. The non-try could be implemented as a default on top of the try, since it'd just be doing an `expect` on the error.
good call, I need to do a pass on the error handling
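A minimal sketch of the try/non-try split being suggested, with stand-in types rather than the crate's real API; the panicking convenience is just a default method over the fallible one:

```rust
use std::cell::Cell;

#[derive(Debug)]
struct PoolExhausted;

trait Pool {
    type Handle;

    /// Fallible variant: surfaces pool exhaustion to the caller.
    fn try_acquire(&self) -> Result<Self::Handle, PoolExhausted>;

    /// Panicking convenience, implemented as a default on top of the try.
    fn acquire(&self) -> Self::Handle {
        self.try_acquire().expect("No available buffers in the pool")
    }
}

struct FixedPool {
    available: Cell<usize>,
}

impl Pool for FixedPool {
    type Handle = usize;

    fn try_acquire(&self) -> Result<usize, PoolExhausted> {
        match self.available.get() {
            0 => Err(PoolExhausted),
            n => {
                self.available.set(n - 1);
                Ok(n)
            }
        }
    }
}

fn main() {
    let pool = FixedPool { available: Cell::new(2) };
    let _first = pool.acquire(); // panics only if the pool is empty
    assert!(pool.try_acquire().is_ok());
    assert!(pool.try_acquire().is_err()); // exhaustion is an Err, not a panic
}
```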
```rust
{
    type Target = [E];

    fn deref(&self) -> &Self::Target {
```
Haven't really thought this one through, but: it'd be nice to have some way to avoid implementing `Deref` for this. Maybe we can have some other operation that takes a host memory pool as an argument, so it could be implemented for the CUDA wrapper; for a host pool, it'd just refer to itself.
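A rough sketch of that alternative to `Deref`: an explicit operation that takes host scratch space, so a device-backed handle could stage through it while a host-backed one just hands out its own slice. Every name here is hypothetical:

```rust
trait WithHostSlice<E> {
    fn with_host_slice<R>(&self, staging: &mut Vec<E>, f: impl FnOnce(&[E]) -> R) -> R;
}

struct HostBuffer<E>(Vec<E>);

impl<E> WithHostSlice<E> for HostBuffer<E> {
    fn with_host_slice<R>(&self, _staging: &mut Vec<E>, f: impl FnOnce(&[E]) -> R) -> R {
        // A host pool just refers to itself; a CUDA impl would first copy the
        // device data into `staging` and pass that slice instead.
        f(&self.0)
    }
}

fn main() {
    let buf = HostBuffer(vec![1u8, 2, 3]);
    let mut staging = Vec::new();
    let sum: u32 = buf.with_host_slice(&mut staging, |s| s.iter().map(|&b| u32::from(b)).sum());
    assert_eq!(sum, 6);
}
```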
```rust
CuHandleInner::Pooled(ref mut destination) => {
    self.device
        .dtoh_sync_copy_into(source.as_cuda_slice(), destination)
        .expect("Failed to copy data to device");
```
We're copying to the host in this function, but the `expect` message says "to device".
```rust
/// Takes a handle to a device buffer and copies it into a host buffer pool.
/// It returns a new handle from the host pool with the data from the device handle given.
fn copy_into(
```
If this was named `copy_to_host`, it'd avoid the need to write `/// Copy from device to host` for the implementations on it. Ditto for the opposite route. I'd read `copy_into` as copying into the device pool, as opposed to copying into a host pool.
Ha yes, initially I put it in the generic pool, but it might move to a "DevicePool" at some point; renaming it for CUDA is good enough for now.
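For illustration, the directional naming could look like this (stand-in types, not the crate's API):

```rust
struct DeviceBuf(Vec<u8>);
struct HostBuf(Vec<u8>);

impl DeviceBuf {
    /// Device -> host: the name alone documents the direction, so the impl
    /// no longer needs a `/// Copy from device to host` doc line.
    fn copy_to_host(&self) -> HostBuf {
        HostBuf(self.0.clone())
    }
}

impl HostBuf {
    /// Host -> device, the opposite route.
    fn copy_to_device(&self) -> DeviceBuf {
        DeviceBuf(self.0.clone())
    }
}

fn main() {
    let device = DeviceBuf(vec![1, 2, 3]);
    let host = device.copy_to_host();
    let _back = host.copy_to_device();
}
```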
```rust
.into_boxed_slice();

#[derive(Debug)]
/// A buffer that is aligned to a specific size with the Element of type E.
pub struct AlignedBuffer<E: ElementType> {
```
Do we plan on aligning to sizes other than the size of E? In https://users.rust-lang.org/t/how-can-i-allocate-aligned-memory-in-rust/33293/2, it's claimed that Vec already aligns to the size of the instantiated type, and the vec can be realigned with align_to (https://doc.rust-lang.org/nightly/std/vec/struct.Vec.html#method.align_to) if the type size isn't known until later.
I'm wondering if we can get rid of AlignedBuffer and just use Vec.
Yes; for example, if we start to have shared-memory buffers, GPU alignments are big. Also, if we want to memory-map them, it has to be 16 KB on ARM with large pages, etc.
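To make the constraint concrete: `Vec<E>` only guarantees `align_of::<E>()`, so larger alignments like a 16 KB page need an explicit `Layout`. A minimal sketch using `std::alloc` directly:

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    let align = 16 * 1024; // 16 KB, e.g. for large-page mmap on ARM
    let size = 4 << 20; // a 4 MB buffer
    let layout = Layout::from_size_align(size, align).unwrap();
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null());
        // The allocator honors the layout, over-aligning well beyond
        // the align_of::<u8>() that a Vec<u8> would guarantee.
        assert_eq!(ptr as usize % align, 0);
        dealloc(ptr, layout);
    }
}
```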
```rust
let mut handle = CuHostMemoryPool::allocate(&pool).unwrap();
handle.as_slice_mut()[0] = 10 - i;
handles.push(handle);

#[ignore] // Can only be executed if a real CUDA device is present
```
Is the `#[ignore]` needed for these tests? I imagine that the `cfg` macro is already limiting them to CUDA-only builds.
Also, we could configure this only for CUDA-enabled builds and only use CUDA on Linux; it's certainly possible to use it on Windows if anyone feels like fleshing that out.
This is because we want to test that the CUDA features at least build on CI, but there is no guarantee that there is an actual Nvidia GPU on those machines.
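A sketch of that CI setup (assuming a `cuda` cargo feature): the `cfg` gate keeps the test compiling on every CUDA-enabled build, while `#[ignore]` stops it from running on machines without an NVIDIA GPU; it can still be run locally with `cargo test -- --ignored`. The test body is a placeholder.

```rust
#[cfg(all(test, feature = "cuda"))]
mod cuda_tests {
    #[test]
    #[ignore] // Can only be executed if a real CUDA device is present
    fn device_roundtrip() {
        // ... allocate from the device pool, copy host -> device -> host,
        // and compare the round-tripped contents ...
    }
}
```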
```diff
@@ -67,7 +66,7 @@ mod linux_impl {
 }

 impl<'cl> CuSrcTask<'cl> for V4l {
-    type Output = output_msg!('cl, CuImage);
+    type Output = output_msg!('cl, CuImage<Vec<u8>>);
```
It'd be interesting to have a memory pool as a task itself, where requests for memory come in and handles come out. The lack of synchronicity would be a bit awkward, but the messaging system would handle concurrent buffer requests coming in at the same time.
Hmm, it would be very awkward, but I agree we need a kind of "factory", also for monitoring purposes. I am implementing that now.
Thanks!
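For what it's worth, one plausible shape for such a factory, purely as a guess at the direction (names and structure hypothetical): a registry that creates pools and keeps weak references so a monitor can enumerate the live ones.

```rust
use std::sync::{Arc, Mutex, Weak};

struct Pool {
    name: &'static str,
    capacity: usize,
}

#[derive(Default)]
struct PoolRegistry {
    pools: Mutex<Vec<Weak<Pool>>>,
}

impl PoolRegistry {
    // Every pool is created through the factory, which keeps a weak reference
    // so a monitor can walk live pools without keeping them alive.
    fn create(&self, name: &'static str, capacity: usize) -> Arc<Pool> {
        let pool = Arc::new(Pool { name, capacity });
        self.pools.lock().unwrap().push(Arc::downgrade(&pool));
        pool
    }

    fn report(&self) {
        for pool in self.pools.lock().unwrap().iter().filter_map(Weak::upgrade) {
            println!("{}: {} buffers", pool.name, pool.capacity);
        }
    }
}

fn main() {
    let registry = PoolRegistry::default();
    let _camera_pool = registry.create("v4l", 8);
    registry.report();
}
```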
This is a new API to enable heterogeneous memory pool support.