Making memray third-party allocator-aware #577
Comments
Thanks @pitrou for bringing this to us. This is a very interesting problem indeed. The key to either method is that we need:
If we have these two things, we could offer a way either to override automatically by relying on constant symbol names, or to offer some kind of dynamic naming via configuration. I suppose the next step is for us to investigate how some of the applications/libraries out there interact with these allocators. Do you think you can give us some example with |
Here is a quick REPL example:
>>> import pyarrow as pa
# mimalloc
>>> pool = pa.mimalloc_memory_pool()
>>> a = pa.array([0]*1_000_000, memory_pool=pool)
>>> pool.bytes_allocated()
8000000
# jemalloc
>>> pool = pa.jemalloc_memory_pool()
>>> a = pa.array([0]*1_000_000, memory_pool=pool)
>>> pool.bytes_allocated()
8000000
Note that You'll find the corresponding C++ code here:
Note that jemalloc symbols are mangled to avoid polluting the standard libc namespace (
Ah, interesting. So it must appear in For example:
$ nm libarrow.so.1500 | rg -w mi_malloc
0000000001d3c210 t mi_malloc
$ nm --dynamic libarrow.so.1500 | rg -w mi_malloc
$
$ nm libarrow.so.1500 | rg "je_arrow_" | head -n 4
0000000001cc7bb0 t je_arrow_aligned_alloc
0000000001cc8180 t je_arrow_calloc
0000000001ccd070 t je_arrow_dallocx
0000000001cc9a30 t je_arrow_free
$ nm --dynamic libarrow.so.1500 | rg "je_arrow_"
$ |
That is a sufficient condition but not necessary. The other option is that it should have a symbol called |
Hmm. How would you do that using gcc or clang? Is there a function attribute (preferably) or perhaps compiler/linker flag? Also, yes, we are statically compiling mimalloc and jemalloc. |
I think you can do it with |
Hmm, actually, a function attribute wouldn't work, because we would have to patch the mimalloc source code for that... (also, we use |
An alternative view of this problem is that code with
That deactivates PLT entries for calls within the shared library. This means that if the definition of the symbol is inside the executable or shared library, there won't be a PLT entry. That is faster and may allow inlining, but it means the symbol cannot be interposed. |
It looks like if you statically compile the allocator and use I am afraid this is the classic compromise between performance and observability. |
I agree. We could definitely make an exception for mimalloc and jemalloc calls; it's just that I don't know how to do that without affecting other symbols. Also, a radical solution might be to first try |
I think trying to use a |
A quick check you can do when trying things out is to load a library with the same definition via |
I thought so, but I realized it required patching the mimalloc or jemalloc source, something we'd like to avoid if possible (also, it could be pre-compiled and we would be linking against an existing That said, the |
Some interesting info: Apparently the way Qt does this is to use
|
I think that won't work for profilers that attach, or that don't use LD_PRELOAD, because the interposition would happen at arbitrarily late points (after the initial relocation has been made). |
Maybe you can wrap the allocator in some call that's exported and use that internally and mark that wrapper as |
I might be misunderstanding how relocation works, but do these profilers patch all call sites at runtime? |
No, they patch the Global Offset Table at runtime. All call sites point to a PLT entry. For calls that have a PLT/GOT pair, the code normally trampolines through a small piece of assembly that grabs an address from the Global Offset Table and calls it. Call sites point to the trampoline, and the trampoline grabs the address on every call. At first, the address in the GOT points to the dynamic linker's resolution routine; once the linker finds the real address (lazy binding), the GOT is updated. Profilers like memray and heaptrack work by locating the GOT and overwriting that address with the address of their own functions. This can be done at runtime, so it allows attaching and activating/deactivating. LD_PRELOAD works the same way, except that it interposes the symbol when the linker resolves it, so the replacement ends up in the first GOT update. It has several disadvantages, though (it cannot be deactivated, and attaching won't work). The mechanism needs your function to have a PLT/GOT pair. |
With this explanation you can see the cost: PLT trampolines require an extra read from the GOT and an extra jump, which makes every call slightly slower. |
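To make the mechanism described above a bit more concrete, here is a toy model in C. It is not real GOT patching: a plain function pointer stands in for one GOT slot, and every name in it (`got_malloc_slot`, `call_site_alloc`, `profiler_attach`, `profiler_detach`, `tracked_total`) is made up for illustration. It only shows why a call that goes through an indirection slot can be redirected and restored at runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: one "GOT slot" holding the address that call sites load. */
typedef void *(*alloc_fn)(size_t);

static alloc_fn got_malloc_slot = malloc; /* what the dynamic linker would fill in */
static alloc_fn saved_target = NULL;      /* original target, kept by the "profiler" */
static size_t tracked_bytes = 0;

/* Hook installed by the hypothetical profiler: record the size, then forward. */
static void *tracking_malloc(size_t size) {
    tracked_bytes += size;
    return saved_target(size);
}

/* A call site: it reloads the slot on every call, like a PLT trampoline. */
void *call_site_alloc(size_t size) {
    return got_malloc_slot(size);
}

/* "Attach": remember the real target and overwrite the slot with the hook. */
void profiler_attach(void) {
    assert(saved_target == NULL); /* attach at most once in this toy model */
    saved_target = got_malloc_slot;
    got_malloc_slot = tracking_malloc;
}

/* "Detach": put the original address back into the slot. */
void profiler_detach(void) {
    assert(saved_target != NULL);
    got_malloc_slot = saved_target;
    saved_target = NULL;
}

size_t tracked_total(void) {
    return tracked_bytes;
}
```

A call that the compiler or linker binds directly to its target has no such slot to overwrite, which is the situation of the statically linked allocator symbols discussed in this thread.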
|
Ok, so
I think this might work, though it would be worse performance-wise:
|
You may need to mark it as |
Ok, I've got a PR which creates such interposable wrappers in Arrow. I've checked that they can be interposed using |
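For readers following along, this is a minimal sketch of what an exported, interposable wrapper could look like with GCC or Clang. It is not the code from the PR: `demo_pool_malloc` and `demo_pool_free` are hypothetical names, and the `static` functions are stand-ins (backed by libc `malloc`/`free`) for a statically linked allocator entry point such as `mi_malloc`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for a statically linked allocator. Being `static`, they are local
   symbols, like the `t` entries in the nm output, and are not exported. */
static void *internal_alloc(size_t size) { return malloc(size); }
static void internal_free(void *ptr) { free(ptr); }

/* Hypothetical exported wrappers. Default visibility places them in the
   dynamic symbol table; noinline keeps a real call at each call site. */
__attribute__((visibility("default"), noinline))
void *demo_pool_malloc(size_t size) {
    return internal_alloc(size);
}

__attribute__((visibility("default"), noinline))
void demo_pool_free(void *ptr) {
    internal_free(ptr);
}
```

Whether calls to such wrappers from inside the same library actually go through the PLT still depends on how the library is compiled and linked, so checking with `nm --dynamic` as done earlier in the thread remains a useful sanity test.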
Ok, I will discuss with @godlygeek what's the best way to support something like this soon. |
Also note you can download prebuilt wheels from the aforementioned PR using these links. Click on one of the green "Crossbow" badges, then click on the "Summary" link on the GitHub Actions page, then download the artifact at the bottom of the summary page. |
Is there an existing proposal for this?
Is your feature request related to a problem?
It seems that memray currently reports the different "kinds" of allocations based on which libc function was called (malloc, mmap, ...). (*)
However, third-party allocators such as mimalloc and jemalloc are growing in use because of their desirable performance characteristics. When those are used instead of the system allocator, allocations which are logically malloc-like are reported as mmap calls with very large allocation sizes.
There is an example in this issue report where a bunch of 64 MiB blocks are reported by memray as allocated (one per thread, roughly), resulting in a large reported footprint of more than 1 GiB, while those are the page reservations by mimalloc and the corresponding allocations on the application side are tiny (1 KiB each).
This is a problem that is bound to produce many user reports of memory leaks or overconsumption, while the program is actually operating normally.
(*) I may be wrong in this interpretation of mine, in which case please do correct me.
Describe the solution you'd like
Ideally, memray would also detect calls to third-party allocator routines and report a mi_malloc(1024) as allocating 1024 bytes, not 64 MiB :-)
Several technical solutions can be considered and I'm not an expert in the field. Here are two that come to mind:
Hard-code support for the most popular 3rd-party allocators, by looking at their respective API names. This seems conceptually easy but will have limited benefits, because those allocators are often privately vendored and sometimes their symbols are mangled to avoid symbol clashes. Also, this means that less popular allocators will not get any coverage.
Devise some sort of runtime protocol where the allocators themselves may tag API functions (how? I have no idea :-)) as being malloc-like, realloc-like, etc. This is obviously more complex technically and requires cooperation to come up with a suitable protocol, but would work better in the long term.
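Purely as an illustration of the second idea, and not an existing memray, mimalloc or jemalloc interface, a tagging protocol could be as small as a table that maps symbol names to an allocation "kind". Everything below (`alloc_kind`, `alloc_tag`, `register_alloc_tag`, `find_alloc_tag`, the fixed table size) is a hypothetical sketch:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "kinds" a profiler might care about. */
enum alloc_kind { ALLOC_MALLOC_LIKE, ALLOC_REALLOC_LIKE, ALLOC_FREE_LIKE };

struct alloc_tag {
    const char *symbol;   /* possibly mangled name, e.g. "je_arrow_free" */
    enum alloc_kind kind; /* how a profiler should interpret calls to it */
};

#define MAX_TAGS 16
static struct alloc_tag registry[MAX_TAGS];
static size_t registry_len = 0;

/* Called by the allocator (or the library vendoring it) to tag a function.
   Returns 0 on success, -1 if the table is full. */
int register_alloc_tag(const char *symbol, enum alloc_kind kind) {
    if (registry_len == MAX_TAGS) {
        return -1;
    }
    registry[registry_len].symbol = symbol;
    registry[registry_len].kind = kind;
    registry_len++;
    return 0;
}

/* Called by the profiler to look a symbol up; NULL means "not tagged". */
const struct alloc_tag *find_alloc_tag(const char *symbol) {
    for (size_t i = 0; i < registry_len; i++) {
        if (strcmp(registry[i].symbol, symbol) == 0) {
            return &registry[i];
        }
    }
    return NULL;
}
```

A table like this would let mangled or privately vendored names be tagged, but it only tells a profiler what to look for. The tagged functions would still have to be reachable for hooking.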
Alternatives you considered
No response