Host and Device Memory Pool with initial CUDA support + port of existing code. #201
Conversation
Leaving some general comments as requested
```diff
 use v4l::{v4l2, Device};

 // A specialized V4L stream that uses Copper Buffers for memory management.
 pub struct CuV4LStream {
     v4l_handle: Arc<Handle>,
     v4l_buf_type: Type,
-    memory_pool: Rc<CuHostMemoryPool>,
+    memory_pool: CuHostMemoryPool<Vec<u8>>,
```
Should streams own pools, or borrow them from the application? Maybe creating a single big pool and splitting it out per device would be helpful. Granted, that's what the OS is doing when you ask to create a pool in the first place, but if we have all these sizes fixed at compile time, it's a nice way to measure memory footprint.
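One way to read the "borrow" option: the application owns the pool behind an `Arc` and each stream keeps a clone, so the total footprint falls out of sizes declared in one place. A minimal sketch with stand-in types (`HostPool` and `Stream` are hypothetical, not the crate's API):

```rust
use std::sync::Arc;

// Stand-in for CuHostMemoryPool, just to illustrate ownership.
struct HostPool {
    buffer_size: usize,
    buffer_count: usize,
}

// The stream borrows the pool from the application instead of owning it.
struct Stream {
    pool: Arc<HostPool>,
}

fn main() {
    // The application creates the pool once; with the sizes known up front,
    // the total footprint is a simple product.
    let pool = Arc::new(HostPool { buffer_size: 4 << 20, buffer_count: 8 });
    let _camera = Stream { pool: Arc::clone(&pool) };
    let _encoder = Stream { pool: Arc::clone(&pool) };
    println!("footprint: {} bytes", pool.buffer_size * pool.buffer_count);
}
```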
```rust
pool: Weak<CuHostMemoryPool>,

/// A shareable handle to an Array coming from a pool (either host or device).
#[derive(Clone, Debug)]
pub struct CuHandle<T: ArrayLike>(Arc<Mutex<CuHandleInner<T>>>);
```
I imagine that handles are going to have low contention, right? They're shared just between the pool and the user?
Yes. Maybe the exception is if you have some driver trying to do an async operation on it.
So my jam here is: if you have a 4 MB matrix or image, a lock of a few microseconds for a multi-millisecond action on it will be negligible. If you need totally lock-free operation, then you use the copperlist (but then you might need to copy). There is no free lunch :P
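To put rough numbers on that tradeoff, here is a small sketch with the same `Arc<Mutex<...>>` shape as `CuHandle`: acquiring the lock costs microseconds next to the multi-millisecond work it guards (the sleep stands in for real processing).

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // A 4 MB "image" behind the same Arc<Mutex<...>> shape as CuHandle.
    let handle = Arc::new(Mutex::new(vec![0u8; 4 << 20]));

    let worker = {
        let handle = Arc::clone(&handle);
        thread::spawn(move || {
            // Acquiring the lock is microseconds; the processing it guards
            // runs for milliseconds, so the lock overhead is in the noise.
            let mut image = handle.lock().unwrap();
            thread::sleep(Duration::from_millis(5)); // stand-in for real work
            image[0] = 42;
        })
    };

    worker.join().unwrap();
    assert_eq!(handle.lock().unwrap()[0], 42);
}
```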
```rust
}

fn copy_from<O: ArrayLike<Element = T::Element>>(&self, from: &mut CuHandle<O>) -> CuHandle<T> {
    let to_handle = self.acquire().expect("No available buffers in the pool");
```
`copy_from` should probably return a `Result`; I imagine that running out of buffers isn't hard to do. A try and a non-try function would make it accessible. The non-try could be implemented as a default on top of the try, since it'd just be doing an `expect` on the error.
good call, I need to do a pass on the error handling
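A minimal sketch of the try/non-try split being suggested, with stand-in types rather than the crate's real API; the panicking convenience is just a default method over the fallible one:

```rust
use std::cell::Cell;

#[derive(Debug)]
struct PoolExhausted;

trait Pool {
    type Handle;

    /// Fallible variant: surfaces pool exhaustion to the caller.
    fn try_acquire(&self) -> Result<Self::Handle, PoolExhausted>;

    /// Panicking convenience, implemented as a default on top of the try.
    fn acquire(&self) -> Self::Handle {
        self.try_acquire().expect("No available buffers in the pool")
    }
}

struct FixedPool {
    available: Cell<usize>,
}

impl Pool for FixedPool {
    type Handle = usize;

    fn try_acquire(&self) -> Result<usize, PoolExhausted> {
        match self.available.get() {
            0 => Err(PoolExhausted),
            n => {
                self.available.set(n - 1);
                Ok(n)
            }
        }
    }
}

fn main() {
    let pool = FixedPool { available: Cell::new(2) };
    let _first = pool.acquire(); // panics only if the pool is empty
    assert!(pool.try_acquire().is_ok());
    assert!(pool.try_acquire().is_err()); // exhaustion is an Err, not a panic
}
```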
```rust
{
    type Target = [E];

    fn deref(&self) -> &Self::Target {
```
Haven't really thought this one through, but: it'd be nice to have some way to avoid implementing `Deref` for this. Maybe we can have some other operation that takes a host memory pool as an argument, so it could be implemented for the CUDA wrapper; for a host pool, it'd just refer to itself.
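A rough sketch of that alternative to `Deref`: an explicit operation that takes host scratch space, so a device-backed handle could stage through it while a host-backed one just hands out its own slice. Every name here is hypothetical:

```rust
trait WithHostSlice<E> {
    fn with_host_slice<R>(&self, staging: &mut Vec<E>, f: impl FnOnce(&[E]) -> R) -> R;
}

struct HostBuffer<E>(Vec<E>);

impl<E> WithHostSlice<E> for HostBuffer<E> {
    fn with_host_slice<R>(&self, _staging: &mut Vec<E>, f: impl FnOnce(&[E]) -> R) -> R {
        // A host pool just refers to itself; a CUDA impl would first copy the
        // device data into `staging` and pass that slice instead.
        f(&self.0)
    }
}

fn main() {
    let buf = HostBuffer(vec![1u8, 2, 3]);
    let mut staging = Vec::new();
    let sum: u32 = buf.with_host_slice(&mut staging, |s| s.iter().map(|&b| u32::from(b)).sum());
    assert_eq!(sum, 6);
}
```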
```rust
CuHandleInner::Pooled(ref mut destination) => {
    self.device
        .dtoh_sync_copy_into(source.as_cuda_slice(), destination)
        .expect("Failed to copy data to device");
```
We're copying to the host in this function, but the `expect` message says "to device".
```rust
/// Takes a handle to a device buffer and copies it into a host buffer pool.
/// It returns a new handle from the host pool with the data from the device handle given.
fn copy_into(
```
If this was named `copy_to_host`, it'd avoid the need to write `/// Copy from device to host` for the implementations on it. Ditto for the opposite route. I'd read `copy_into` as copying into the device pool, as opposed to copying into a host pool.
Ha yes, initially I put it in the generic pool, but it might move to a "DevicePool" at some point; renaming it for CUDA is good enough for now.
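For illustration, the directional naming could look like this (stand-in types, not the crate's API):

```rust
struct DeviceBuf(Vec<u8>);
struct HostBuf(Vec<u8>);

impl DeviceBuf {
    /// Device -> host: the name alone documents the direction, so the impl
    /// no longer needs a `/// Copy from device to host` doc line.
    fn copy_to_host(&self) -> HostBuf {
        HostBuf(self.0.clone())
    }
}

impl HostBuf {
    /// Host -> device, the opposite route.
    fn copy_to_device(&self) -> DeviceBuf {
        DeviceBuf(self.0.clone())
    }
}

fn main() {
    let device = DeviceBuf(vec![1, 2, 3]);
    let host = device.copy_to_host();
    let _back = host.copy_to_device();
}
```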
```rust
.into_boxed_slice();

#[derive(Debug)]
/// A buffer that is aligned to a specific size with the Element of type E.
pub struct AlignedBuffer<E: ElementType> {
```
Do we plan on aligning to sizes other than the size of E? In https://users.rust-lang.org/t/how-can-i-allocate-aligned-memory-in-rust/33293/2, it's claimed that Vec already aligns to the size of the instantiated type, and the vec can be realigned with align_to (https://doc.rust-lang.org/nightly/std/vec/struct.Vec.html#method.align_to) if the type size isn't known until later.
I'm wondering if we can get rid of AlignedBuffer and just use Vec.
Yes; for example, if we start to have shared-memory buffers, GPU alignments are big. Also, if we want to memory-map them, it has to be 16 KB on ARM with large pages, etc.
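To make the constraint concrete: `Vec<E>` only guarantees `align_of::<E>()`, so larger alignments like a 16 KB page need an explicit `Layout`. A minimal sketch using `std::alloc` directly:

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    let align = 16 * 1024; // 16 KB, e.g. for large-page mmap on ARM
    let size = 4 << 20; // a 4 MB buffer
    let layout = Layout::from_size_align(size, align).unwrap();
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null());
        // The allocator honors the layout, over-aligning well beyond
        // the align_of::<u8>() that a Vec<u8> would guarantee.
        assert_eq!(ptr as usize % align, 0);
        dealloc(ptr, layout);
    }
}
```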
```rust
let mut handle = CuHostMemoryPool::allocate(&pool).unwrap();
handle.as_slice_mut()[0] = 10 - i;
handles.push(handle);

#[ignore] // Can only be executed if a real CUDA device is present
```
Is the `#[ignore]` needed for these tests? I imagine that the `cfg` macro is already limiting them to CUDA-only builds.
Also, we could configure this only for CUDA-enabled builds and only use CUDA on Linux; it's certainly possible to use it on Windows if anyone feels like fleshing that out.
This is because we want to test that the CUDA features at least build on CI, but there is no guarantee that there is an actual Nvidia GPU on those machines.
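A sketch of that CI setup (assuming a `cuda` cargo feature): the `cfg` gate keeps the test compiling on every CUDA-enabled build, while `#[ignore]` stops it from running on machines without an NVIDIA GPU; it can still be run locally with `cargo test -- --ignored`. The test body is a placeholder.

```rust
#[cfg(all(test, feature = "cuda"))]
mod cuda_tests {
    #[test]
    #[ignore] // Can only be executed if a real CUDA device is present
    fn device_roundtrip() {
        // ... allocate from the device pool, copy host -> device -> host,
        // and compare the round-tripped contents ...
    }
}
```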
```diff
@@ -67,7 +66,7 @@ mod linux_impl {
 }

 impl<'cl> CuSrcTask<'cl> for V4l {
-    type Output = output_msg!('cl, CuImage);
+    type Output = output_msg!('cl, CuImage<Vec<u8>>);
```
It'd be interesting to have a memory pool as a task itself, where requests for memory come in and handles come out. The lack of synchronicity would be a bit awkward, but the messaging system would handle concurrent buffer requests coming in at the same time.
Hmm, it would be very awkward, but I agree we need a kind of "factory", also for monitoring purposes. I am implementing that now.
Thanks!
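For what it's worth, one plausible shape for such a factory, purely as a guess at the direction (names and structure hypothetical): a registry that creates pools and keeps weak references so a monitor can enumerate the live ones.

```rust
use std::sync::{Arc, Mutex, Weak};

struct Pool {
    name: &'static str,
    capacity: usize,
}

#[derive(Default)]
struct PoolRegistry {
    pools: Mutex<Vec<Weak<Pool>>>,
}

impl PoolRegistry {
    // Every pool is created through the factory, which keeps a weak reference
    // so a monitor can walk live pools without keeping them alive.
    fn create(&self, name: &'static str, capacity: usize) -> Arc<Pool> {
        let pool = Arc::new(Pool { name, capacity });
        self.pools.lock().unwrap().push(Arc::downgrade(&pool));
        pool
    }

    fn report(&self) {
        for pool in self.pools.lock().unwrap().iter().filter_map(Weak::upgrade) {
            println!("{}: {} buffers", pool.name, pool.capacity);
        }
    }
}

fn main() {
    let registry = PoolRegistry::default();
    let _camera_pool = registry.create("v4l", 8);
    registry.report();
}
```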
This is a new API to enable heterogeneous memory pool support.