feat: Add support for LZ4 compression #1181
Conversation
```rust
}
CompressionCodec::Zstd(level) => {
    let encoder = zstd::Encoder::new(output, *level)?;
    let mut arrow_writer = StreamWriter::try_new(encoder, &batch.schema())?;
```
I am not really familiar with the code, but shouldn't the StreamWriter and encoder be created only once per stream instead of once per batch?
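A minimal std-only sketch of why this matters. The names here (`SCHEMA`, `write_per_batch`, `write_per_stream`) are illustrative stand-ins, not Comet's or Arrow's actual API: a writer created per batch re-emits the serialized schema with every batch, while a per-stream writer pays that cost exactly once.

```rust
use std::io::Write;

// Placeholder for the serialized Arrow schema that a StreamWriter
// emits when it is created (illustrative only).
const SCHEMA: &[u8] = b"<schema>";

// Simulates creating a fresh writer for every batch: the schema
// bytes are written once per batch.
fn write_per_batch(batches: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for batch in batches {
        out.write_all(SCHEMA).unwrap();
        out.write_all(batch).unwrap();
    }
    out
}

// Simulates one writer for the whole stream: the schema bytes are
// written exactly once, up front.
fn write_per_stream(batches: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.write_all(SCHEMA).unwrap();
    for batch in batches {
        out.write_all(batch).unwrap();
    }
    out
}

fn main() {
    let batches: Vec<&[u8]> = vec![b"batch0", b"batch1", b"batch2"];
    let per_batch = write_per_batch(&batches);
    let per_stream = write_per_stream(&batches);
    // The per-batch approach pays the schema cost N times instead of once.
    assert_eq!(per_batch.len() - per_stream.len(), 2 * SCHEMA.len());
    println!(
        "per-batch: {} bytes, per-stream: {} bytes",
        per_batch.len(),
        per_stream.len()
    );
}
```

With a compressing encoder wrapped around the output, a per-batch writer also restarts the compressor each time, which is part of the overhead discussed below.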
Yes, that is something I have been thinking about as well. We have the cost of writing the schema for each batch currently, and the schema is guaranteed to be the same for each batch.
To add some more context here: we buffer rows per partition until we reach the desired batch size, and then need to serialize that batch to bytes that can be read as one block by CometBlockStoreShuffleReader.
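The buffering scheme described above can be sketched as follows. This is a hypothetical std-only illustration (`PartitionBuffer`, `push`, `flush`, and the `Vec<String>` "block" are placeholders, not Comet's actual types): rows accumulate per partition, and once the configured batch size is reached the batch is sealed into one block that a reader could consume independently.

```rust
// Illustrative stand-in for per-partition shuffle buffering.
struct PartitionBuffer {
    rows: Vec<String>,
    batch_size: usize,
    // Each entry represents one serialized block; in Comet this is
    // where the batch would be encoded and compressed.
    blocks: Vec<Vec<String>>,
}

impl PartitionBuffer {
    fn new(batch_size: usize) -> Self {
        Self { rows: Vec::new(), batch_size, blocks: Vec::new() }
    }

    // Buffer a row; seal the current batch once it reaches batch_size.
    fn push(&mut self, row: String) {
        self.rows.push(row);
        if self.rows.len() >= self.batch_size {
            self.flush();
        }
    }

    // Seal any remaining rows into a final (possibly partial) block.
    fn flush(&mut self) {
        if !self.rows.is_empty() {
            self.blocks.push(std::mem::take(&mut self.rows));
        }
    }
}

fn main() {
    let mut buf = PartitionBuffer::new(3);
    for i in 0..7 {
        buf.push(format!("row{i}"));
    }
    buf.flush(); // final partial batch of 1 row
    assert_eq!(buf.blocks.len(), 3); // batches of 3, 3, and 1 rows
    println!("{} blocks", buf.blocks.len());
}
```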
Besides writing the schema each time, I'd guess it also adds the overhead of flushing per batch (less efficient buffering), lower compressibility, and so on?
I filed #1186 for re-using the writer across many batches. It looks like a big perf win.
Here are my findings from hacking on this today. LZ4 provides two compression formats. Spark uses the Java library https://github.com/lz4/lz4-java and specifically uses
edit: Apache Commons provides
switched to lz4_flex crate:
The lz4_flex crate should get even faster once PSeitz/lz4_flex#175 is merged.
This reverts commit 76e0d71.
LZ4 support is now part of #1192
Which issue does this PR close?
Closes #1178
Builds on #1192, so we need to merge that PR first.
Rationale for this change
LZ4 may provide faster compression than ZSTD, which could help with shuffle performance.
What changes are included in this PR?
How are these changes tested?