Performance Improvement for Writing Large Datasets #6
Hi Michal,

Thank you so much for the positive feedback! An option was recently added to `Exceed.stream!/2` to disable buffering (`buffer: false`). I honestly don't remember why buffering was added in the first place, but maybe something changed further down the stack in Stream or in zip that we are running into by chunking the output.

If that doesn't solve the problem for you, let me know. If it does, I'll release a new version with buffering defaulting to false, so that people can have a better default experience.
Actually, before you retest: I have a fix that seems to run even faster than disabling the buffering. I'm on my phone at the moment, but should be able to push it later today when I get my laptop on an internet connection.
Hey @sax, I retested with `buffer: false`. With fewer columns (2) it was great, but when I increased to 10 or 17 columns it slows down again. Benchmark code:
I finally got to a place where I could push my changes. Please try with the latest commit on main. I combined some of your benchmark code with what I was using on my flight yesterday into this:

```elixir
defmodule Test do
  require Logger

  def buffered(column_count \\ 10, row_count \\ 100_000) do
    file = File.stream!("/tmp/workbook.xlsx")
    headers = headers(column_count)
    stream = stream(column_count, row_count)

    benchmark(column_count, row_count, fn ->
      Exceed.Workbook.new("Creator Name")
      |> Exceed.Workbook.add_worksheet(Exceed.Worksheet.new("Sheet Name", headers, stream))
      |> Exceed.stream!()
      |> Stream.into(file)
      |> Stream.run()
    end)
  end

  def unbuffered(column_count \\ 10, row_count \\ 100_000) do
    file = File.stream!("/tmp/workbook.xlsx")
    headers = headers(column_count)
    stream = stream(column_count, row_count)

    benchmark(column_count, row_count, fn ->
      Exceed.Workbook.new("Creator Name")
      |> Exceed.Workbook.add_worksheet(Exceed.Worksheet.new("Sheet Name", headers, stream))
      |> Exceed.stream!(buffer: false)
      |> Stream.into(file)
      |> Stream.run()
    end)
  end

  def stream(column_count, row_count) do
    Stream.iterate(1, &(&1 + 1))
    |> Stream.chunk_every(column_count)
    |> Stream.take(row_count)
  end

  defp headers(column_count) do
    Stream.iterate(1, &(&1 + 1))
    |> Stream.map(&"Header #{&1}")
    |> Enum.take(column_count)
  end

  defp benchmark(column_count, batch_size, fun) do
    {duration, _} = :timer.tc(fun, :millisecond)
    rate_per_row = Float.round(batch_size / (duration / 1_000), 2)
    Logger.info("Batch size #{column_count}*#{batch_size} completed in #{duration}ms, rate: #{rate_per_row} rows/sec")
  end
end
```

I'm finding that with unbuffered output on my latest commit:
With buffered output:
So at the very least it's linear over row count. However, when I increase the column count, it degrades:
I think because of the XML generation there will be some expected performance cost to increasing the column count, but it's surprising to me that there is such a dramatic impact. I'll see if I can get better tracing into where the slowdown is happening, whether it's in the XML generation or in the zipping.
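One rough way to get that attribution (a hypothetical helper, not part of Exceed): wrap a single stage's per-element work and accumulate its elapsed time in an ETS counter, then compare the totals for the XML-generation stage versus the zip stage.

```elixir
defmodule StageTimer do
  # Hypothetical instrumentation helper: accumulates per-stage wall-clock time in ETS.
  def setup, do: :ets.new(:stage_times, [:named_table, :public])

  # Wraps `fun`, the per-element transformation of one stream stage, recording
  # its cost under `label` while passing each result through unchanged.
  def wrap(stream, label, fun) do
    Stream.map(stream, fn element ->
      started = System.monotonic_time(:microsecond)
      result = fun.(element)
      elapsed = System.monotonic_time(:microsecond) - started
      :ets.update_counter(:stage_times, label, elapsed, {label, 0})
      result
    end)
  end

  def report do
    for {label, micros} <- :ets.tab2list(:stage_times) do
      IO.puts("#{label}: #{Float.round(micros / 1_000, 1)} ms")
    end
  end
end
```

Wrapping the row-to-XML mapping and the zip chunking separately (wherever those transformations are applied in the pipeline) would show which one grows with the column count.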
I think that this might be hitting the limit of what I can improve without rewriting the XML generation, for which I'm currently relying on the XmlStream library. I'd be very happy to be proven wrong if you can find ways to rewrite the chunking. For profiling: `mix profile.tprof -r benchmark/bench.exs -e "Benchmark.buffered(10)"`
I just released v0.7.0 with my updates for buffered output. I also shipped some benchmark code in `benchmark/bench.exs`.

I'll leave this issue open until you have a chance to test, and am also happy to leave it open as a place for discussion if you can think of other ways to make things faster for spreadsheets with more columns.
Hi @sax,

I'm impressed by your exceptional engagement and outstanding work on Exceed. Your ability to solve complex problems and find innovative solutions is remarkable, and I'm grateful for the opportunity to learn from and work with you. I hope we can work together to make Exceed an even better tool, and I look forward to your help and support in this endeavor.

I've run some tests to compare the performance of Exceed and Elixlsx, and I wanted to share the results with you:

- Elixlsx version (reference): total time 3942ms
- Exceed version: total time 15129ms

I've also run additional tests using 1,000,000 database records. Here are the results:

- Elixlsx version (reference): total time 102579ms
- Exceed version: total time 105207ms

Thank you for your time and support. I look forward to hearing from you and working together to find a solution.
If there is a way to measure memory allocation as well, that's going to be the true measure of why you'd want to use a tool like Exceed on large data sets. With Elixlsx you need to load the entire dataset and generate the xml files in memory. I'm pretty sure benchee will measure memory; I just need some time and a better workspace to knock it out... I'm mostly on my phone this week, with short periods of internet connection on my laptop.

Elixlsx writes most strings as shared strings, using an ets table to track them if memory serves. Exceed does not, in the interest of less complex code and the fact that the shared strings file needs to precede all worksheets in the zipped xlsx file. That will account for some if not all of the file size difference.

I am also surprised at how much slower it is than Elixlsx, though. My suspicion is that it's the sheer number of stream transformations, between iterating over rows, transforming to xml, then transforming to zip. It would be helpful if I could find a way to measure each part of it; tprof and the like just give function names, and a stream is a stream is a stream.
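For the memory question, a sketch of what that measurement could look like with Benchee (its `memory_time` option enables memory tracking). The data and file paths here are made up, and the Exceed calls follow the benchmark code above; an Elixlsx scenario could be added alongside using that library's own API.

```elixir
headers = Enum.map(1..10, &"Header #{&1}")
rows = Stream.map(1..100_000, fn i -> Enum.map(1..10, &(i * &1)) end)

# Build a fresh workbook per run so each scenario measures the full pipeline.
workbook = fn ->
  Exceed.Workbook.new("Creator Name")
  |> Exceed.Workbook.add_worksheet(Exceed.Worksheet.new("Sheet Name", headers, rows))
end

Benchee.run(
  %{
    "buffered" => fn ->
      workbook.()
      |> Exceed.stream!()
      |> Stream.into(File.stream!("/tmp/buffered.xlsx"))
      |> Stream.run()
    end,
    "unbuffered" => fn ->
      workbook.()
      |> Exceed.stream!(buffer: false)
      |> Stream.into(File.stream!("/tmp/unbuffered.xlsx"))
      |> Stream.run()
    end
  },
  time: 5,
  memory_time: 2
)
```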
Hey @sax,
Another thing I was thinking was to try to move the zipping to another process. I have no clue what scope of changes would be required for that.
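One general shape that could take (a sketch of the hand-off pattern only, not Exceed API; `AsyncSink` and its functions are hypothetical): split the pipeline at a process boundary so one process keeps producing while another consumes. Here the consumer just writes chunks to a file, but the same pattern could sit between the XML stage and the zip stage.

```elixir
defmodule AsyncSink do
  # Hypothetical helper: consumes `stream` in the calling process and hands each
  # chunk to a separate writer process. The writer acks each chunk, so at most one
  # chunk is in flight and memory use stays bounded.
  def write(stream, path) do
    parent = self()

    writer =
      spawn_link(fn ->
        {:ok, device} = File.open(path, [:write, :binary])
        writer_loop(device, parent)
      end)

    stream
    |> Stream.each(fn chunk ->
      send(writer, {:chunk, chunk})

      receive do
        :ack -> :ok
      end
    end)
    |> Stream.run()

    send(writer, :done)

    receive do
      :closed -> :ok
    end
  end

  defp writer_loop(device, parent) do
    receive do
      {:chunk, chunk} ->
        # Ack before writing so the producer can build the next chunk while this
        # one is written to disk.
        send(parent, :ack)
        IO.binwrite(device, chunk)
        writer_loop(device, parent)

      :done ->
        File.close(device)
        send(parent, :closed)
    end
  end
end

# Usage sketch: replace `|> Stream.into(file) |> Stream.run()` with
# `|> AsyncSink.write("/tmp/workbook.xlsx")` at the end of the Exceed pipeline.
```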
Hi synchronal,
First of all, I want to express my sincere appreciation for the work you have done on Exceed. The library is incredibly powerful and offers a unique set of features that make it stand out in the field. It has been a valuable tool for our projects, and we are truly grateful for your efforts.
However, we have encountered a performance issue when writing large datasets. While Exceed excels in many areas, it becomes quite slow when processing large amounts of data. In comparison, Elixlsx handles the same task much more efficiently, which has led us to consider using it as an alternative in scenarios where performance is critical.
We understand that optimizing performance for large datasets can be a complex task, but we believe that this improvement would significantly enhance the usability of Exceed in a broader range of applications.
Steps to Reproduce:

1. Create a dataset with 100,000 records, each with 10 columns.
2. Write this dataset to an Excel file using Exceed.
3. Observe the time it takes to complete the operation.

(A minimal script along these lines is sketched after this section.)

Expected Behavior: The operation should complete in a reasonable amount of time, comparable to Elixlsx.

Actual Behavior: The operation takes significantly longer than expected, making it impractical for real-world use cases with large datasets.
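A minimal reproduction sketch matching the steps above (the data is synthetic, and the Exceed calls follow the usage shown elsewhere in this thread):

```elixir
headers = Enum.map(1..10, &"Column #{&1}")
rows = Stream.map(1..100_000, fn i -> Enum.map(1..10, fn j -> "row #{i} col #{j}" end) end)

{micros, _} =
  :timer.tc(fn ->
    Exceed.Workbook.new("Creator Name")
    |> Exceed.Workbook.add_worksheet(Exceed.Worksheet.new("Sheet Name", headers, rows))
    |> Exceed.stream!()
    |> Stream.into(File.stream!("/tmp/large_dataset.xlsx"))
    |> Stream.run()
  end)

IO.puts("Wrote 100,000 rows in #{div(micros, 1_000)}ms")
```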
LOG Exceed:

```
[info] Batch size 1 completed in 14ms, rate: 71.43 rows/sec
[info] Benchmarking with batch size: 10
[info] Batch size 10 completed in 8ms, rate: 1250.0 rows/sec
[info] Benchmarking with batch size: 100
[info] Batch size 100 completed in 64ms, rate: 1562.5 rows/sec
[info] Benchmarking with batch size: 1000
[info] Batch size 1000 completed in 652ms, rate: 1533.74 rows/sec
[info] Benchmarking with batch size: 10000
[info] Batch size 10000 completed in 6401ms, rate: 1562.26 rows/sec
[info] Benchmarking with batch size: 20000
[info] Batch size 20000 completed in 12949ms, rate: 1544.52 rows/sec
```

```elixir
[
  {1, 14, 71.43},
  {10, 8, 1250.0},
  {100, 64, 1562.5},
  {1000, 652, 1533.74},
  {10000, 6401, 1562.26},
  {20000, 12949, 1544.52}
]
```
LOG ELIXLSX:

```
[info] Processed 684771 rows at 5939.81 rows/sec
```
Best regards,
Michal Forster