
Memory leak in remap #134

Open · liamkinne opened this issue Sep 16, 2024 · 3 comments

Labels: bug (Something isn't working)

Comments

@liamkinne (Contributor) commented Sep 16, 2024

There is some amount of memory being allocated during the remap function that isn't getting cleaned up.

This is with the latest commit on main, so I imagine it's something to do with rayon or the new tensor implementation.

This simple program will slowly grow in memory usage:

use kornia::image::{Image, ImageSize};
use kornia::imgproc::interpolation::{remap, InterpolationMode};

fn main() {
    let size = ImageSize {
        width: 1440,
        height: 1440,
    };

    // `correction_map` builds the (map_x, map_y) pair; its definition is
    // given in a comment further down the thread.
    let correction_map = correction_map(&size);

    loop {
        // Input and output images are reallocated on every iteration.
        let input = Image::<u8, 3>::from_size_val(size, 0).unwrap();
        let mut output = Image::<f32, 3>::from_size_val(size, 0.0).unwrap();

        remap(
            &input.cast().unwrap(),
            &mut output,
            &correction_map.0,
            &correction_map.1,
            InterpolationMode::Bilinear,
        )
        .unwrap();
    }
}
@edgarriba (Member)

That’s weird. However, the idea behind the new function signature allowing mutable outputs is that you allocate once, outside the loop. If you check the benches, that’s how I’m doing it; see the sketch below.
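
A minimal sketch of that allocate-once pattern against the snippet above (illustrative only, not copied from the benches; it reuses the `size` and `correction_map` values defined there):

// Allocate the buffers once, before the loop.
let input = Image::<u8, 3>::from_size_val(size, 0).unwrap();
let input_f32: Image<f32, 3> = input.cast().unwrap();
let mut output = Image::<f32, 3>::from_size_val(size, 0.0).unwrap();

loop {
    // `output` is reused on every iteration instead of being reallocated.
    remap(
        &input_f32,
        &mut output,
        &correction_map.0,
        &correction_map.1,
        InterpolationMode::Bilinear,
    )
    .unwrap();
}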

If you’re up for it, it would be great to provide a bench for remap, also using [flamegraph](https://github.com/flamegraph-rs/flamegraph) or similar, to better understand the memory usage pattern.

BTW, remap is not the fastest implementation for now; it needs some optimisations. For example, the interpolation function currently recomputes, on every call, the offset indices of the tensor to interpolate from; this could be done as a pre-step by iterating row by row with step 1. A second optimisation would be using SIMD to compute the weighted interpolation.
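
To illustrate the first idea, a hypothetical sketch (not kornia's actual internals): for a row-major H×W×C tensor, the flat base offset of each row can be computed once per row instead of once per pixel.

/// Hypothetical helper: precompute the flat base offset of every row of a
/// row-major `height` x `width` x `channels` tensor, so the per-pixel
/// offset lookup becomes a single addition instead of two multiplications.
fn row_offsets(height: usize, width: usize, channels: usize) -> Vec<usize> {
    let row_stride = width * channels;
    (0..height).map(|y| y * row_stride).collect()
}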

@emilmgeorge (Contributor)

I couldn't reproduce the growing memory usage issue while running for about 5 minutes. However, when testing with valgrind, it found an unrelated leak in the correction_map I used (included below); there were no confirmed leaks within the original snippet.

This is the correction_map that I used:

use kornia::image::{Image, ImageSize};
use kornia::core::Tensor2;
use kornia::imgproc::interpolation::{remap, InterpolationMode, grid::meshgrid_from_fn};

fn correction_map(size: &ImageSize) -> (Tensor2<f32>, Tensor2<f32>) {
    // Identity mapping
    let (map_x, map_y) = meshgrid_from_fn(size.width, size.height, |x, y| {
        Ok((x as f32, y as f32))
    }).unwrap();
    (map_x, map_y)
}
Valgrind log
==419002== Memcheck, a memory error detector
==419002== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==419002== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==419002== Command: target/debug/mem_test
==419002== Parent PID: 418734
==419002== 
==419002== 
==419002== HEAP SUMMARY:
==419002==     in use at exit: 16,653,640 bytes in 129 blocks
==419002==   total heap usage: 230 allocs, 101 frees, 229,673,704 bytes allocated
==419002== 
==419002== 2,432 bytes in 8 blocks are possibly lost in loss record 16 of 23
==419002==    at 0x484BC13: calloc (vg_replace_malloc.c:1675)
==419002==    by 0x4011003: calloc (rtld-malloc.h:44)
==419002==    by 0x4011003: allocate_dtv (dl-tls.c:395)
==419002==    by 0x4011AE1: _dl_allocate_tls (dl-tls.c:664)
==419002==    by 0x4A44F14: allocate_stack (allocatestack.c:429)
==419002==    by 0x4A44F14: pthread_create@@GLIBC_2.34 (pthread_create.c:655)
==419002==    by 0x18AA61: std::sys::pal::unix::thread::Thread::new (thread.rs:87)
==419002==    by 0x14F581: std::thread::Builder::spawn_unchecked_ (mod.rs:577)
==419002==    by 0x14EA5E: std::thread::Builder::spawn_unchecked (mod.rs:456)
==419002==    by 0x15013D: std::thread::Builder::spawn (mod.rs:388)
==419002==    by 0x13D92E: <rayon_core::registry::DefaultSpawn as rayon_core::registry::ThreadSpawn>::spawn (registry.rs:98)
==419002==    by 0x13EDFD: rayon_core::registry::Registry::new (registry.rs:304)
==419002==    by 0x13DD5B: rayon_core::registry::default_global_registry (registry.rs:201)
==419002==    by 0x13B9E0: core::ops::function::FnOnce::call_once (function.rs:250)
==419002== 
==419002== 8,294,400 bytes in 1 blocks are definitely lost in loss record 22 of 23
==419002==    at 0x48447A8: malloc (vg_replace_malloc.c:446)
==419002==    by 0x162139: std::sys::pal::unix::alloc::<impl core::alloc::global::GlobalAlloc for std::alloc::System>::alloc (alloc.rs:14)
==419002==    by 0x161F3E: <kornia_core::allocator::CpuAllocator as kornia_core::allocator::TensorAllocator>::alloc (allocator.rs:58)
==419002==    by 0x12FA86: kornia_core::storage::TensorStorage<T,A>::new (storage.rs:55)
==419002==    by 0x12F817: kornia_core::tensor::Tensor<T,_,A>::new_uninitialized (tensor.rs:112)
==419002==    by 0x121EE3: kornia_imgproc::interpolation::grid::meshgrid_from_fn (grid.rs:43)
==419002==    by 0x11A3FF: mem_test::correction_map (main.rs:7)
==419002==    by 0x11A567: mem_test::main (main.rs:19)
==419002==    by 0x11E1FA: core::ops::function::FnOnce::call_once (function.rs:250)
==419002==    by 0x1241ED: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:152)
==419002==    by 0x1253D0: std::rt::lang_start::{{closure}} (rt.rs:162)
==419002==    by 0x18093F: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284)
==419002==    by 0x18093F: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:557)
==419002==    by 0x18093F: try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:521)
==419002==    by 0x18093F: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:350)
==419002==    by 0x18093F: {closure#2} (rt.rs:141)
==419002==    by 0x18093F: do_call<std::rt::lang_start_internal::{closure_env#2}, isize> (panicking.rs:557)
==419002==    by 0x18093F: try<isize, std::rt::lang_start_internal::{closure_env#2}> (panicking.rs:521)
==419002==    by 0x18093F: catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> (panic.rs:350)
==419002==    by 0x18093F: std::rt::lang_start_internal (rt.rs:141)
==419002== 
==419002== 8,294,400 bytes in 1 blocks are definitely lost in loss record 23 of 23
==419002==    at 0x48447A8: malloc (vg_replace_malloc.c:446)
==419002==    by 0x162139: std::sys::pal::unix::alloc::<impl core::alloc::global::GlobalAlloc for std::alloc::System>::alloc (alloc.rs:14)
==419002==    by 0x161F3E: <kornia_core::allocator::CpuAllocator as kornia_core::allocator::TensorAllocator>::alloc (allocator.rs:58)
==419002==    by 0x12FA86: kornia_core::storage::TensorStorage<T,A>::new (storage.rs:55)
==419002==    by 0x12F817: kornia_core::tensor::Tensor<T,_,A>::new_uninitialized (tensor.rs:112)
==419002==    by 0x121FEE: kornia_imgproc::interpolation::grid::meshgrid_from_fn (grid.rs:44)
==419002==    by 0x11A3FF: mem_test::correction_map (main.rs:7)
==419002==    by 0x11A567: mem_test::main (main.rs:19)
==419002==    by 0x11E1FA: core::ops::function::FnOnce::call_once (function.rs:250)
==419002==    by 0x1241ED: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:152)
==419002==    by 0x1253D0: std::rt::lang_start::{{closure}} (rt.rs:162)
==419002==    by 0x18093F: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284)
==419002==    by 0x18093F: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:557)
==419002==    by 0x18093F: try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:521)
==419002==    by 0x18093F: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:350)
==419002==    by 0x18093F: {closure#2} (rt.rs:141)
==419002==    by 0x18093F: do_call<std::rt::lang_start_internal::{closure_env#2}, isize> (panicking.rs:557)
==419002==    by 0x18093F: try<isize, std::rt::lang_start_internal::{closure_env#2}> (panicking.rs:521)
==419002==    by 0x18093F: catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> (panic.rs:350)
==419002==    by 0x18093F: std::rt::lang_start_internal (rt.rs:141)
==419002== 
==419002== LEAK SUMMARY:
==419002==    definitely lost: 16,588,800 bytes in 2 blocks
==419002==    indirectly lost: 0 bytes in 0 blocks
==419002==      possibly lost: 2,432 bytes in 8 blocks
==419002==    still reachable: 62,408 bytes in 119 blocks
==419002==         suppressed: 0 bytes in 0 blocks
==419002== Reachable blocks (those to which a pointer was found) are not shown.
==419002== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==419002== 
==419002== For lists of detected and suppressed errors, rerun with: -s
==419002== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

The leak detected by valgrind is here:

Buffer::from_custom_allocation(
    NonNull::new_unchecked(ptr),
    len * std::mem::size_of::<T>(),
    Arc::new(Vec::<T>::with_capacity(len)),
)

The issue is that the memory allocated at ptr won't be freed when the Arc::new(Vec::<T>::with_capacity(len)) owner is dropped: the Vec is a separate, empty allocation, so dropping it releases only the Vec's own capacity, never the buffer behind ptr.
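
One possible direction, as an illustrative sketch only (not the actual patch), is to make the Arc'd owner genuinely own the allocation, e.g. by rebuilding a Vec from the raw pointer so that dropping the owner frees ptr. This assumes ptr came from the global allocator with capacity for len values of T:

// Illustrative sketch: rebuild a Vec that owns the original allocation.
// With length 0, dropping the Vec frees the backing capacity without
// running destructors for uninitialized elements.
let owner = unsafe { Vec::from_raw_parts(ptr as *mut T, 0, len) };
Buffer::from_custom_allocation(
    NonNull::new_unchecked(ptr),
    len * std::mem::size_of::<T>(),
    Arc::new(owner),
)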

@edgarriba added the bug label on Sep 20, 2024
@edgarriba (Member)

@emilmgeorge, thanks for your investigation. It seems there’s definitely an issue with Buffer::from_custom_allocation, or possibly with the way I implemented it initially. I've opened a ticket to address this: kornia-rs/issues/127.

It might be worthwhile to revisit the Tensor constructor API and the use of the Allocator.

Here are my thoughts on the design:

  • Make the Tensor struct extensible via Buffer::Allocation, allowing it to support data from various backends, such as CUDA, wgpu, etc.
  • For instance, one could implement a CudaAllocator and initialize a Tensor by allocating CUDA memory, or by using a pre-allocated CUDA pointer (see the sketch after this list). This could be particularly useful for implementing image operations directly on CUDA, like a remap function.
  • Ideally, the kornia-core crate and the Tensor struct should be lightweight, acting primarily as a managed data container. This would keep them decoupled from operations and allow interchangeable use with other libraries' containers at zero copy cost.
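
A minimal sketch of that allocator abstraction, with an assumed trait shape (the actual kornia_core::allocator::TensorAllocator signature may differ):

use std::alloc::{alloc, dealloc, Layout};

// Assumed trait shape for illustration only; not the real kornia_core trait.
trait TensorAllocator {
    fn alloc(&self, layout: Layout) -> *mut u8;
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout);
}

// CPU backend: delegate to the global allocator.
struct CpuAllocator;

impl TensorAllocator for CpuAllocator {
    fn alloc(&self, layout: Layout) -> *mut u8 {
        // Caller is expected to pass a non-zero-sized layout.
        unsafe { alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { dealloc(ptr, layout) }
    }
}

// A hypothetical CudaAllocator would implement the same trait on top of
// cudaMalloc/cudaFree, so a Tensor could hold device memory behind the
// same interface without kornia-core knowing about CUDA.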
