Host and Device Memory Pool with initial CUDA support + port of existing code. #201

Merged
gbin merged 29 commits into master from gbin/pool_abstraction on Jan 17, 2025

Conversation

Collaborator

@gbin gbin commented Jan 5, 2025

This is a new API to enable heterogeneous memory pool support.

@gbin gbin force-pushed the gbin/pool_abstraction branch from 2925041 to c0bfb83 on January 13, 2025 at 14:12
@gbin gbin changed the title WIP Pool abstraction Host and Device Memory Pool with CUDA support + port of existing code. Jan 16, 2025
@gbin gbin changed the title Host and Device Memory Pool with CUDA support + port of existing code. Host and Device Memory Pool with initial CUDA support + port of existing code. Jan 16, 2025
@gbin gbin merged commit 0c34cce into master Jan 17, 2025
7 checks passed
@gbin gbin deleted the gbin/pool_abstraction branch January 17, 2025 23:07
Collaborator

@mik90 mik90 left a comment

Leaving some general comments as requested

use v4l::{v4l2, Device};

// A specialized V4L stream that uses Copper Buffers for memory management.
pub struct CuV4LStream {
    v4l_handle: Arc<Handle>,
    v4l_buf_type: Type,
-   memory_pool: Rc<CuHostMemoryPool>,
+   memory_pool: CuHostMemoryPool<Vec<u8>>,
Collaborator

Should streams own pools or borrow them from the application? Maybe creating a single big pool and splitting it out per device would be helpful. Granted, that's what the OS is doing when you ask to create a pool in the first place, but if we have all these sizes determined at compile time, it's a nice way to measure memory footprint.
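A rough sketch of the "application owns one big set of pools, streams borrow from it" shape (field names and layout here are illustrative, not the actual cu29 API; CuHostMemoryPool is the type introduced in this PR):

use std::sync::Arc;

// Hypothetical: the application builds its pools up front, so the total
// memory footprint is known at startup, and streams only hold a shared
// reference instead of owning a pool themselves.
struct AppMemory {
    camera_pool: Arc<CuHostMemoryPool<Vec<u8>>>, // N frame-sized buffers
    lidar_pool: Arc<CuHostMemoryPool<Vec<u8>>>,  // M scan-sized buffers
}

struct BorrowingStream {
    memory_pool: Arc<CuHostMemoryPool<Vec<u8>>>, // borrowed from AppMemory
}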

pool: Weak<CuHostMemoryPool>,
/// A shareable handle to an Array coming from a pool (either host or device).
#[derive(Clone, Debug)]
pub struct CuHandle<T: ArrayLike>(Arc<Mutex<CuHandleInner<T>>>);
Collaborator

I imagine that handles are going to have low contention, right?

They're shared just between the pool and the user?

Collaborator Author

Yes. Maybe the exception is if you have some driver trying to do an async operation on it.
So my take here is: if you have a 4 MB matrix or image, a lock of a few microseconds for a multi-millisecond operation on it will be negligible. If you need a totally lock-free path, then you use the copperlist (but then you might need to copy). There is no free lunch :P
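For illustration, a std-only sketch of the tradeoff being described (this is not the cu29 API, just the Arc<Mutex<...>> pattern the handle wraps):

use std::sync::{Arc, Mutex};

// With a single producer and a single consumer the mutex is effectively
// uncontended, so the few microseconds spent taking the lock are negligible
// next to a multi-millisecond operation on a ~4 MB frame.
fn process(frame: &Arc<Mutex<Vec<u8>>>) {
    let guard = frame.lock().expect("lock poisoned"); // brief, uncontended
    expensive_multi_ms_work(&guard);                  // this dominates the cost
}

fn expensive_multi_ms_work(_data: &[u8]) {
    // e.g. image processing on the pooled buffer
}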

}

fn copy_from<O: ArrayLike<Element = T::Element>>(&self, from: &mut CuHandle<O>) -> CuHandle<T> {
    let to_handle = self.acquire().expect("No available buffers in the pool");
Collaborator

copy_from should probably return a Result; I imagine that running out of buffers isn't hard to do. A try and non-try function would make it accessible. The non-try could be implemented as a default on top of the try, since it'd just be doing an expect on the error.
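A standalone sketch of the try/non-try split (simplified, hypothetical types; PoolError and SimplePool are not cu29 types):

// Illustration of the pattern only.
#[derive(Debug)]
enum PoolError {
    Exhausted,
}

struct SimplePool {
    free: Vec<Vec<u8>>,
}

impl SimplePool {
    // Fallible variant: surfaces pool exhaustion to the caller.
    fn try_copy_from(&mut self, from: &[u8]) -> Result<Vec<u8>, PoolError> {
        let mut buf = self.free.pop().ok_or(PoolError::Exhausted)?;
        buf.clear();
        buf.extend_from_slice(from);
        Ok(buf)
    }

    // Infallible variant, implemented as a default on top of the try version.
    fn copy_from(&mut self, from: &[u8]) -> Vec<u8> {
        self.try_copy_from(from)
            .expect("No available buffers in the pool")
    }
}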

Collaborator Author

Good call, I need to do a pass on the error handling.

{
    type Target = [E];

    fn deref(&self) -> &Self::Target {
Collaborator

Haven't really thought this one through, but: it'd be nice to have some way to avoid implementing Deref for this. Maybe we can have some other operation that takes a host memory pool as an argument, so it could be implemented for the CUDA wrapper; for a host pool, it'd just refer to itself.
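A possible shape for that idea (all names here are invented for illustration and are not part of cu29):

struct HostPool; // stand-in for a host memory pool

trait HostView<E> {
    // Instead of Deref, an explicit accessor that may stage the data through
    // a host pool. A host-backed buffer ignores the pool and hands out its
    // own slice; a CUDA-backed buffer would copy device-to-host first.
    fn with_host_view<R>(&self, host_pool: &HostPool, f: impl FnOnce(&[E]) -> R) -> R;
}

struct HostBuffer<E>(Vec<E>);

impl<E> HostView<E> for HostBuffer<E> {
    fn with_host_view<R>(&self, _host_pool: &HostPool, f: impl FnOnce(&[E]) -> R) -> R {
        f(&self.0) // already on the host: just refer to itself
    }
}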

core/cu29_runtime/src/pool.rs (review thread resolved)
CuHandleInner::Pooled(ref mut destination) => {
    self.device
        .dtoh_sync_copy_into(source.as_cuda_slice(), destination)
        .expect("Failed to copy data to device");
Collaborator

We're copying to the host in this function, so the "Failed to copy data to device" message is misleading.


/// Takes a handle to a device buffer and copies it into a host buffer pool.
/// It returns a new handle from the host pool with the data from the device handle given.
fn copy_into(
Collaborator

If this were named copy_to_host, it'd avoid the need to write /// Copy from device to host on each implementation. Ditto for the opposite direction. I'd read copy_into as copying into the device pool, as opposed to copying into a host pool.
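For example, something along these lines (signatures simplified and the buffer types invented for illustration, not the cu29 signatures); with the direction in the name, the per-impl doc comments become unnecessary:

struct HostBuf(Vec<u8>);
struct DeviceBuf(Vec<u8>); // pretend this one lives on the device

trait DeviceTransfers {
    /// Copies a device buffer into a buffer acquired from a host pool.
    fn copy_to_host(&self, src: &DeviceBuf) -> HostBuf;
    /// Copies a host buffer into a buffer acquired from this device pool.
    fn copy_to_device(&self, src: &HostBuf) -> DeviceBuf;
}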

Collaborator Author

Ha, yes. Initially I put it in the generic pool, but it might move to a "DevicePool" at some point; renaming it for CUDA is good enough for now.

.into_boxed_slice();
#[derive(Debug)]
/// A buffer that is aligned to a specific size with the Element of type E.
pub struct AlignedBuffer<E: ElementType> {
Collaborator

Do we plan on aligning to sizes other than the size of E? In https://users.rust-lang.org/t/how-can-i-allocate-aligned-memory-in-rust/33293/2, it's claimed that Vec already aligns to the size of the instantiated type. The vec can be realigned with align_to (https://doc.rust-lang.org/nightly/std/vec/struct.Vec.html#method.align_to) if the type size isn't known until later.

I'm wondering if we can get rid of AlignedBuffer and just use Vec.

Collaborator Author

@gbin gbin Jan 20, 2025

Yes: for example, if we start to have shared-memory buffers, GPU alignment requirements are large.
Also, if we want to memory-map them, the alignment has to be 16 KB on ARM with large pages, etc.
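As a std-only illustration of the point (not the AlignedBuffer implementation): Vec only guarantees the element's natural alignment, so picking a larger alignment such as a 16 KiB page boundary means going through Layout explicitly.

use std::alloc::{alloc, dealloc, Layout};

// Allocate `len` bytes with an explicit, possibly much larger, alignment.
fn alloc_aligned(len: usize, align: usize) -> (*mut u8, Layout) {
    let layout = Layout::from_size_align(len, align).expect("bad layout");
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

fn main() {
    // e.g. a 1 MiB buffer aligned to a 16 KiB page boundary
    let (ptr, layout) = alloc_aligned(1 << 20, 16 * 1024);
    assert_eq!(ptr as usize % (16 * 1024), 0);
    unsafe { dealloc(ptr, layout) };
}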

let mut handle = CuHostMemoryPool::allocate(&pool).unwrap();
handle.as_slice_mut()[0] = 10 - i;
handles.push(handle);
#[ignore] // Can only be executed if a real CUDA device is present
Collaborator

Is the #[ignore] needed for these tests? I imagine that the cfg macro is already limiting them to CUDA-only builds.

Also, we could just configure this only for CUDA-enabled builds and only use CUDA on Linux. It's certainly possible to use it on Windows if anyone feels like fleshing that out.

Collaborator Author

This is because we want to check that the CUDA features at least build on CI, but there is no guarantee that there is an actual Nvidia GPU on those machines.
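In other words, something like this (feature name and test name are assumptions for illustration): the cfg gate keeps the CUDA path compiling on CI, while #[ignore] keeps the test from running unless a GPU machine opts in with `cargo test -- --ignored`.

#[cfg(feature = "cuda")]
#[test]
#[ignore] // needs an actual NVIDIA GPU at runtime
fn cuda_smoke_test() {
    // device allocation and host <-> device copies would go here
}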

@@ -67,7 +66,7 @@ mod linux_impl {
}

impl<'cl> CuSrcTask<'cl> for V4l {
-    type Output = output_msg!('cl, CuImage);
+    type Output = output_msg!('cl, CuImage<Vec<u8>>);
Collaborator

It'd be interesting to have a memory pool as a task itself, where requests for memory come in and handles come out. The lack of synchronicity would be a bit awkward, but the messaging system would handle concurrent buffer requests coming in at the same time.

Collaborator Author

Hmm, it would be very awkward, but I agree we need a kind of "factory", also for monitoring purposes. I am implementing that now.

Thanks!
