[Python] Only convert in parallel for the ConsolidatedBlockCreator class for large data #40301
I’ve taken a look at the code.
~My first solution/idea: add another option. Easy, but it's an extra option that users have to remember to set.~ EDIT 1: I had a better idea. Alternative: add more parameters to the parallel-for over columns that modify the loop that submits tasks to the thread pool: start with a low minimum number of submitted tasks (e.g. 2; each column is a task).
EDIT 2: the better solution is to allow the same thread to continue converting more columns if available, with a loop in the main thread increasing the number of threads used if necessary.
@felipecrv I prefer the work-stealing approach because, ideally, we wouldn't require the user to know about the existence of an option to set. Folks might not know that the memory usage has to do with the spawning of individual threads. They might not even know why.
### Rationale for this change

Mimalloc and jemalloc can allocate a [relatively large amount of memory for the ScalarMemoTable](#40301). For this reason, the ScalarMemoTable should only be allocated when it is used (when `options.deduplicate_objects=True`). I tested this change, and for small tables it does improve memory allocation.

`options.deduplicate_objects=False`

After this change:

```
📦 Total memory allocated: 174.422MB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 6.000B   : 3064 ▇▇
< 36.000B  : 7533 ▇▇▇▇
< 222.000B : 9974 ▇▇▇▇▇
< 1.319KB  : 53264 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 7.999KB  : 5188 ▇▇▇
< 48.503KB : 742 ▇
< 294.066KB: 102 ▇
< 1.741MB  : 22 ▇
< 10.556MB : 1 ▇
<=64.000MB : 1 ▇
--------------------------------------------
max: 64.000MB
```

Before this change:

```
📦 Total memory allocated: 1.295GB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 6.000B   : 3064 ▇▇
< 36.000B  : 7543 ▇▇▇▇
< 222.000B : 10009 ▇▇▇▇▇
< 1.319KB  : 53269 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 7.999KB  : 5192 ▇▇▇
< 48.503KB : 761 ▇
< 294.066KB: 102 ▇
< 1.741MB  : 22 ▇
< 10.556MB : 1 ▇
<=64.000MB : 19 ▇
--------------------------------------------
max: 64.000MB
```

### What changes are included in this PR?

The allocation of `memo_table` and `unique_values` has been moved underneath an `if (options.deduplicate_objects)` block. Since they are used within a lambda, they have been changed to shared pointers, so that their values exist for the lifetime needed.

### Are these changes tested?

`deduplicate_objects` has extensive existing tests:

https://github.com/apache/arrow/blob/b235f83ed10bcad174b267113479a24ca045def5/python/pyarrow/tests/test_pandas.py#L3211

and

https://github.com/apache/arrow/blob/b235f83ed10bcad174b267113479a24ca045def5/python/benchmarks/convert_pandas.py#L71

### Are there any user-facing changes?

Nope.

* GitHub Issue: #40316

Lead-authored-by: anjakefala <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
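For context, `deduplicate_objects` is a regular `Table.to_pandas()` keyword on the Python side, so the code path this PR touches can be exercised directly. A minimal usage sketch (not part of the PR itself):

```python
import pyarrow as pa

table = pa.table({"col": ["apple", "banana", "apple", "banana"]})

# Default behaviour: repeated string values are deduplicated into shared
# Python objects, which is the path that needs the memo table.
df_dedup = table.to_pandas(deduplicate_objects=True)

# With deduplication disabled, each value becomes its own Python object;
# after this change, the memo table is no longer allocated on this path.
df_no_dedup = table.to_pandas(deduplicate_objects=False)
```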
Sharing some progress here in the open: @anjakefala will test a version of this.
I'm honestly not sure what this has to do with work stealing. If I read correctly:
What I would like to know is:
I think the original reported case (converting a tiny table of a few kilobytes to pandas can give a spike of several hundred MBs in memory usage) is something people can certainly run into. And although it will often not be concerning (typically when working with smaller data, memory usage is not an issue, and when actually working with larger tables and memory usage becomes relevant, this overhead will disappear), it is definitely surprising and can lead to confusion. So I think it is worth "fixing".

But the potential fix I was thinking of could also be something much simpler, like deciding with some heuristic to just not do the conversion in parallel for smaller data:

arrow/python/pyarrow/pandas_compat.py, lines 573 to 581 (at a6e577d)
Yes, I agree this could just be fixed with a heuristic, especially as parallelizing will not be performant on very small data.
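A rough sketch of what such a size-based heuristic could look like, written against the public pyarrow API rather than the internal `pandas_compat.py` code; the threshold name and value are made up and would need benchmarking:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa

# Hypothetical cutoff: below this many cells, thread-pool overhead (and the
# per-thread allocator behaviour discussed above) outweighs any speedup.
SMALL_TABLE_CELLS = 10_000


def convert_columns(table: pa.Table, convert_one, use_threads: bool):
    """Convert each column index with convert_one, in parallel only when
    the table is large enough for parallelism to pay off."""
    big_enough = table.num_rows * table.num_columns >= SMALL_TABLE_CELLS
    if use_threads and big_enough:
        with ThreadPoolExecutor(max_workers=pa.cpu_count()) as pool:
            return list(pool.map(convert_one, range(table.num_columns)))
    return [convert_one(i) for i in range(table.num_columns)]
```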
It's the inspiration for this, but if it seems to make the understanding harder, just ignore it.
The practical concern is that
And that is exactly what I'm doing, but instead of looking at the number of rows I look at how long it takes to convert the columns on average, while doing the conversions. Never blocking [1] or sleeping. It's an adaptive solution: you pass a duration, and if tasks take more than that duration on average, a few more threads are allowed to start.

```cpp
// If the average wall-clock time per task is greater than the
// small_task_duration_secs, grow the number of workers geometrically.
if (static_cast<double>(elapsed_time.count() * steady_clock::period::num) >
    (small_task_duration_secs * steady_clock::period::den) * num_tasks_completed) {
  num_workers = std::min(
      ((num_workers * GrowthRatio::num - 1) / GrowthRatio::den) + 1, max_workers);
}
```

And instead of having these issues for every kind of parallelization we do in the future, this function can be used on any problem where the size of the tasks can range from trivial to very expensive.

This is what the fix looks like at the callsite:

```diff
- return OptionalParallelFor(options_.use_threads, num_columns_, WriteColumn);
+ if (options_.use_threads) {
+   const double kSmallTaskDurationSecs = 0.008;  // 8ms
+   return ::arrow::internal::ParallelForWithBackOff(num_columns_, WriteColumn,
+                                                     kSmallTaskDurationSecs);
+ } else {
+   for (int i = 0; i < num_columns_; ++i) {
+     RETURN_NOT_OK(WriteColumn(i));
+   }
+   return Status::OK();
+ }
```

[1] except for the lock-free fetch/add of the
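To make the geometric-growth expression above concrete, here is the same ceiling-division formula in a few lines of Python, with an illustrative GrowthRatio of 3/2 and 16 max workers (both values are assumptions, not necessarily the ones in the actual patch):

```python
num, den, max_workers = 3, 2, 16  # hypothetical GrowthRatio and pool size
workers, ramp = 1, [1]
while workers < max_workers:
    # Same as the C++ expression: ceil(workers * num / den), capped.
    workers = min((workers * num - 1) // den + 1, max_workers)
    ramp.append(workers)
print(ramp)  # [1, 2, 3, 5, 8, 12, 16]
```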
For the concrete example here, it is still surprising that allocating 26 memo tables, all with only one (!) element, would result in 200+MB of memory allocations on jemalloc, and 1+GB on mimalloc. Our hash table is presized for 32 elements, which probably shouldn't result in so much memory being allocated. (Caveat: I don't know what memray is accounting for exactly here.) So before trying to add heuristics for parallelization, perhaps we should investigate this?
Edit: this is wrong, I was testing with 15.0.2 but #40316 is only in git main. |
Ok, so this is really the initial pre-sizing of the memo tables. However, the problem here is wrongly interpreting the observations. The 1+GB is not so much "allocated" as pre-reserved by the allocator (mimalloc or jemalloc). Those empty pages are most probably not backed by physical memory until they are actually used by the application (Arrow). The only thing allocated is virtual address space, and virtual address space is virtually infinite (up to 2**48 bytes per process on x86-64). The actual amount of allocated physical memory is the amount necessary for each 32-element memo table, and that is bound to be very small.
Users tend to worry about such figures because they can give a hint that something is wrong (a leak? a performance issue?) while everything is actually operating as normal. Ideally, profilers such as memray would be able to distinguish between the different kinds of "allocations", but that's a non-trivial problem, especially when private copies of e.g. jemalloc or mimalloc are involved.
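The reserved-versus-committed distinction can be seen outside of Arrow as well; a small Linux-only illustration using `mmap` directly (nothing Arrow-specific):

```python
import mmap
import resource


def rss_kib() -> int:
    # ru_maxrss is reported in KiB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


# Reserve 1 GiB of anonymous virtual address space. RSS barely moves,
# because none of the pages are backed by physical memory yet.
buf = mmap.mmap(-1, 1 << 30)
print("RSS after reserving 1 GiB:", rss_kib(), "KiB")

# Touch a handful of pages: only those pages get committed.
for offset in range(0, 10 * 4096, 4096):
    buf[offset] = 1
print("RSS after touching 10 pages:", rss_kib(), "KiB")
```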
I've opened a feature request issue (probably a bit pie-in-the-sky) on the memray side: bloomberg/memray#577 |
Also a couple more observations:
We can therefore confirm that the amount of memory committed (not just reserved) is reasonably small:

```
$ ARROW_DEFAULT_MEMORY_POOL=mimalloc MIMALLOC_SHOW_STATS=1 python -c "import pyarrow as pa; table = pa.table({'A': 'a', 'B': 'b', 'C': 'c', 'D': 'd', 'E': 'e', 'F': 'f', 'G': 'g', 'H':'h', 'I': 'i', 'J': 'j', 'K':'k', 'L':'l', 'M':'m', 'N':'n', 'O':'o', 'P':'p', 'Q':'q', 'R':'r', 'S':'s', 'T':'t', 'U':'u', 'V':'v', 'W':'w', 'X':'x','Y':'y','Z':'z'}); table.to_pandas()"
heap stats:     peak       total       freed     current     unit    count
  reserved:   1.5 GiB     1.5 GiB   128.5 KiB     1.5 GiB             not all freed!
 committed: 112.1 MiB   115.1 MiB   115.1 MiB    64.2 KiB             not all freed!
     reset:         0           0           0           0             ok
   touched:   3.1 MiB     3.1 MiB     3.2 MiB  -100.3 KiB             ok
  segments:        25          25          25           0             ok
-abandoned:         0           0           0           0             ok
   -cached:         0           0           0           0             ok
     pages:         0           0          25         -25             ok
-abandoned:         0           0           0           0             ok
 -extended:         0
 -noretire:         0
     mmaps:         0
   commits:        48
   threads:        24          24          48         -24             ok
  searches:       0.0 avg
numa nodes:         1
   elapsed:     0.255 s
   process: user: 0.232 s, system: 0.038 s, faults: 2, rss: 90.6 MiB, commit: 112.1 MiB
```
We further discussed this in the aforementioned memray issue, and I created a PR that exposes dynamic symbols for memray to interpose: #41128. We'll see whether the memray devs decide they want to support them, since this is Arrow-specific. Ideally, there would be a "standard" way for libraries and applications to signal that certain dynamic symbols are really malloc-like calls.
Different people take different approaches to problems. I saw this problem and decided to change how the conversion work is scheduled. You saw this problem and decided to fix how the profiler measures memory consumption. Which is a valid strategy, but not the only one, and certainly an even bigger rabbit hole for me personally if I were to take it.
I think there's still value to Felipe's approach, and I'd like to support testing it. I did misinterpret the results, but it does seem better to ramp up allocation with need, instead of, by default, spinning up a thread per column.
Note that it's not really spinning up new threads, unless it's the first time the thread pool is being used :-)
I tried to understand the behavior of the thread pool. Is that the case?
It's a bit more complicated. Off the top of my head:
Seems fair to assume that a very eager loop like that will reach capacity threads very quickly, unless completing a task takes less time than the cost of making one.
Yes, indeed. I was answering the more general question.
We ran benchmarks on @felipecrv's change. I'm going to go ahead and close this issue for now! Thanks everyone for helping me look into this: @jorisvandenbossche, @felipecrv and @pitrou ❤️
### Describe the enhancement requested
The Consolidated Block Creator runs the column conversion in parallel, creating a Scalar Memo Table for each column, up until `pa.cpu_count()`. For performance reasons, jemalloc and mimalloc maintain allocations on a per-memory-segment level to reduce contention between threads.

What this means is that if a user calls `table.to_pandas(split_blocks=False)` for a small table, a disproportionately large amount of memory gets allocated to build the Scalar Memo Table. Both `jemalloc` and `mimalloc` will essentially allocate a chunk of memory per column.

Here is some code:
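The code block that originally followed did not survive here; a minimal reproduction along the same lines as the command quoted earlier in the thread (26 tiny single-value string columns), to be run under `memray run`, would look something like this:

```python
import string

import pyarrow as pa

# 26 tiny single-value string columns: to_pandas() creates one conversion
# task (and one memo table) per column when it runs in parallel.
table = pa.table({c: [c.lower()] for c in string.ascii_uppercase})

# Converting this kilobyte-sized table is what triggers the surprisingly
# large allocator reservations summarised below.
df = table.to_pandas(split_blocks=False)
print(df.shape)
```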
Here are the resulting memory allocations summarised with memray:
- jemalloc with 7 columns
- jemalloc with 26 columns
- mimalloc with 7 columns
- mimalloc with 26 columns
You can see how dramatically the memory increases even for a very small table.
My proposal is that we only do the conversion in parallel when it might make a substantial performance difference for a table of a certain size. I'm not quite sure which size, but once the code has been refactored, we can run experiments to come to a data-informed decision.
### Component(s)
C++