feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support #1192

andygrove · 2024-12-21T15:52:55Z

Which issue does this PR close?

Part of #1123

Rationale for this change

We currently perform encoding + compression in native code and decoding + decompression in JVM code. There are some downsides to this approach:

We need compatible and efficient JVM and Rust compression libraries (this seems challenging for LZ4)
We need compatible and efficient JVM and Rust encoding libraries (we use Arrow IPC currently)
We cannot have unit tests for roundtrip encoding and compression; only integration tests (results in slow dev cycles)
Makes it difficult to experiment with different encoding and compression libraries and techniques
We are missing out on performance, potentially

What changes are included in this PR?

Call native code for decompression + decoding
Add metrics for decode + decompress
Add support for LZ4
Implement unit tests for round-trip

ZSTD

LZ4

Microbenchmarks

shuffle_writer/shuffle_writer: encode (no compression)
                        time:   [10.044 µs 10.371 µs 10.698 µs]
shuffle_writer/shuffle_writer: encode and compress (lz4)
                        time:   [132.60 µs 133.01 µs 133.51 µs]
shuffle_writer/shuffle_writer: encode and compress (zstd level 1)
                        time:   [217.81 µs 218.01 µs 218.25 µs]

TPC-H

Edit: results updated on 1/1/2025 after making compression configurable for columnar shuffle as well as native shuffle.

How are these changes tested?

Existing tests

spark/src/test/scala/org/apache/comet/exec/CometExecSuite.scala

codecov-commenter · 2024-12-21T23:09:24Z

Codecov Report

Attention: Patch coverage is 77.67857% with 25 lines in your changes missing coverage. Please review.

Project coverage is 34.71%. Comparing base (103f82f) to head (7dd2ff6).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
...execution/shuffle/NativeBatchDecoderIterator.scala	71.60%	12 Missing and 11 partials ⚠️
...t/execution/shuffle/CometShuffleExchangeExec.scala	66.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1192      +/-   ##
============================================
+ Coverage     34.06%   34.71%   +0.64%     
- Complexity      925      958      +33     
============================================
  Files           115      115              
  Lines         43569    43614      +45     
  Branches       9528     9534       +6     
============================================
+ Hits          14843    15141     +298     
+ Misses        25777    25507     -270     
- Partials       2949     2966      +17

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

andygrove · 2024-12-22T20:02:53Z

The status is that this is working, but I am running into some executor OOMs when trying to run complete benchmarks. I will pick this up again after the holidays.

andygrove · 2024-12-23T17:43:35Z

@Dandandan you may be interested in the benchmark results

Dandandan · 2024-12-23T18:23:00Z

native/core/benches/shuffle_writer.rs

+        b.iter(|| {
+            buffer.clear();
+            let mut cursor = Cursor::new(&mut buffer);
+            write_ipc_compressed(&batch, &mut cursor, &CompressionCodec::Zstd(1), &ipc_time)


I think zstd also supports faster negative levels (-4 or -5 might come close). It would be interesting to see how they compare, though I'm not sure whether they are available in the Rust bindings.

andygrove · 2024-12-27T20:10:29Z

Added Snappy.

shuffle_writer/shuffle_writer: encode and compress (snappy)
                        time:   [87.556 µs 88.328 µs 89.092 µs]
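
To put the microbenchmark numbers in this thread side by side, here is a small sketch that expresses each codec's encode time as a multiple of the uncompressed baseline. It uses the median values quoted above; `relative_cost` and `summarize` are hypothetical helper names, and the ratios only describe this one encode benchmark, not end-to-end shuffle performance.

```rust
/// Median Criterion times in microseconds, copied from the benchmark output in this thread.
const MEDIANS_US: [(&str, f64); 4] = [
    ("none", 10.371),
    ("snappy", 88.328),
    ("lz4", 133.01),
    ("zstd level 1", 218.01),
];

/// Hypothetical helper: how many times slower `time_us` is than `baseline_us`.
fn relative_cost(time_us: f64, baseline_us: f64) -> f64 {
    time_us / baseline_us
}

/// Builds one summary line per codec, relative to the uncompressed baseline.
fn summarize() -> Vec<String> {
    let baseline = MEDIANS_US[0].1;
    MEDIANS_US
        .iter()
        .map(|(name, t)| {
            format!("{}: {} µs ({:.1}x baseline)", name, t, relative_cost(*t, baseline))
        })
        .collect()
}
```

With these medians, Snappy comes out at roughly 8.5x the uncompressed encode time, LZ4 at roughly 12.8x, and ZSTD level 1 at roughly 21x.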

andygrove · 2024-12-28T22:58:40Z

I enabled the unsafe version of lz4_flex, and the time reduced from 340s to 336s.

andygrove · 2025-01-02T13:22:03Z

@viirya @kazuyukitanimura @parthchandra @mbutrovich This is ready for review now

mbutrovich · 2025-01-02T16:51:36Z

native/core/Cargo.toml

@@ -52,6 +52,8 @@ serde = { version = "1", features = ["derive"] }
 lazy_static = "1.4.0"
 prost = "0.12.1"
 jni = "0.21"
+snap = "1.1"
+lz4_flex = { version = "0.11.3", default-features = false }


This is a somewhat confusingly named feature flag. Can we leave a comment here that we're enabling unsafe encode and decode for performance?

mbutrovich · 2025-01-02T20:46:19Z

native/core/src/execution/shuffle/shuffle_writer.rs

        let batch = create_batch(8192);
        let mut output = vec![];
        let mut cursor = Cursor::new(&mut output);
-        write_ipc_compressed(
+        let length = write_ipc_compressed(
            &batch,
            &mut cursor,
            &CompressionCodec::Zstd(1),


Could we modify this test so instead of being named round_trip_zstd it iterates through all the compression schemes?

Good idea. I have implemented this in 1fc3d49 and also fixed the bug that it exposed 🙂
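
As an illustration of the suggestion above, here is a minimal std-only sketch of a round-trip test that loops over every codec variant. It is not the PR's test: `ToyCodec`, `write_block`, `read_block`, and `round_trip_all` are hypothetical names, and the run-length codec is a stand-in for the real zstd, lz4_flex, and snap crates. The block layout (8-byte length, 4-byte codec tag, payload) is an assumption modeled on the `ipc_length` placeholder and the `LZ4_` tag visible in the diffs.

```rust
use std::io::{self, Cursor, Seek, SeekFrom, Write};

/// Hypothetical stand-in for the PR's `CompressionCodec` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ToyCodec {
    None,
    Rle, // run-length encoding, standing in for a real compressor
}

impl ToyCodec {
    const ALL: [ToyCodec; 2] = [ToyCodec::None, ToyCodec::Rle];

    fn tag(self) -> &'static [u8; 4] {
        match self {
            ToyCodec::None => b"NONE",
            ToyCodec::Rle => b"RLE_",
        }
    }
}

/// Encodes `data` as (run length, byte) pairs with runs capped at 255.
fn rle_encode(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < data.len() {
        let byte = data[i];
        let mut run = 1usize;
        while i + run < data.len() && data[i + run] == byte && run < 255 {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

fn rle_decode(data: &[u8]) -> Vec<u8> {
    data.chunks_exact(2)
        .flat_map(|pair| std::iter::repeat(pair[1]).take(pair[0] as usize))
        .collect()
}

/// Writes one block: 8-byte little-endian body length, 4-byte codec tag, payload.
/// Returns the total number of bytes written.
fn write_block<W: Write + Seek>(data: &[u8], out: &mut W, codec: ToyCodec) -> io::Result<u64> {
    let start = out.stream_position()?;
    out.write_all(&[0u8; 8])?; // length placeholder, back-filled below
    out.write_all(codec.tag())?;
    match codec {
        ToyCodec::None => out.write_all(data)?,
        ToyCodec::Rle => out.write_all(&rle_encode(data))?,
    }
    let end = out.stream_position()?;
    out.seek(SeekFrom::Start(start))?;
    out.write_all(&(end - start - 8).to_le_bytes())?;
    out.seek(SeekFrom::Start(end))?;
    Ok(end - start)
}

/// Reads one block, dispatching on the codec tag stored in the block itself.
fn read_block(block: &[u8]) -> io::Result<Vec<u8>> {
    let invalid = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_string());
    if block.len() < 12 {
        return Err(invalid("block too short"));
    }
    let mut len = [0u8; 8];
    len.copy_from_slice(&block[..8]);
    let body_len = u64::from_le_bytes(len) as usize;
    let body = block[8..]
        .get(..body_len)
        .ok_or_else(|| invalid("length exceeds block"))?;
    if body.len() < 4 {
        return Err(invalid("missing codec tag"));
    }
    let (tag, payload) = body.split_at(4);
    match tag {
        b"NONE" => Ok(payload.to_vec()),
        b"RLE_" => Ok(rle_decode(payload)),
        _ => Err(invalid("unknown codec tag")),
    }
}

/// Round-trips `data` through every codec variant, panicking on the first mismatch.
fn round_trip_all(data: &[u8]) -> io::Result<()> {
    for &codec in ToyCodec::ALL.iter() {
        let mut buf = Vec::new();
        let written = write_block(data, &mut Cursor::new(&mut buf), codec)?;
        assert_eq!(written as usize, buf.len(), "length mismatch for {:?}", codec);
        assert_eq!(read_block(&buf)?.as_slice(), data, "round trip failed for {:?}", codec);
    }
    Ok(())
}
```

Driving the test from a single list of variants means that a newly added codec is covered as soon as it is added to that list.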

mbutrovich · 2025-01-02T21:33:56Z

native/core/src/execution/shuffle/shuffle_writer.rs

@@ -1570,13 +1588,41 @@ pub fn write_ipc_compressed<W: Write + Seek>(
    // write ipc_length placeholder
    output.write_all(&[0u8; 8])?;


I know this isn't your code, but it jumped out at me: can we replace this 8-byte write with a seek?

    // seek past ipc_length placeholder
    output.seek_relative(8)?;

comphead · 2025-01-02T22:15:20Z

native/core/Cargo.toml

@@ -52,6 +52,9 @@ serde = { version = "1", features = ["derive"] }
 lazy_static = "1.4.0"
 prost = "0.12.1"
 jni = "0.21"
+snap = "1.1"


Awesome, it's really challenging to find a well-maintained Snappy library for Rust.

comphead · 2025-01-02T22:30:38Z

native/core/src/execution/shuffle/shuffle_writer.rs

+        CompressionCodec::Lz4Frame => {
+            output.write_all(b"LZ4_")?;
+            let mut wtr = lz4_flex::frame::FrameEncoder::new(output);
+            let mut arrow_writer = StreamWriter::try_new(&mut wtr, &batch.schema())?;


Probably a dumb question: it looks like a writer gets created for every input.

In a real-life workload it will be recreated for every batch. Could the writer be reused until the data stream has completed?

Spark's approach is to write shuffled output data in blocks where each block is independent (contains schema + data for that batch), so it doesn't allow for a streaming approach. We accumulate data in the shuffle writer until we reach the configured batch size and then write that batch out to bytes. There is also no guarantee of the order in which the shuffle reader will read these blocks, so that is another reason why we cannot keep a writer open for multiple batches.
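
The independence property described above can be sketched with a toy layout in which each block carries its own length prefix, so any block can be decoded from its offset alone and in any order. This is an illustration only: `append_block` and `read_block_at` are hypothetical names, and the real blocks also embed the schema and compressed Arrow data, which the sketch omits.

```rust
/// Appends one self-contained block (8-byte little-endian length + payload) to `out`
/// and returns the offset at which the block starts.
fn append_block(out: &mut Vec<u8>, payload: &[u8]) -> usize {
    let offset = out.len();
    out.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    out.extend_from_slice(payload);
    offset
}

/// Reads the block that starts at `offset` without consulting any other block.
/// Returns `None` if the offset or the recorded length falls outside `data`.
fn read_block_at(data: &[u8], offset: usize) -> Option<&[u8]> {
    let header = data.get(offset..offset.checked_add(8)?)?;
    let mut len = [0u8; 8];
    len.copy_from_slice(header);
    let len = u64::from_le_bytes(len) as usize;
    let body_start = offset + 8;
    data.get(body_start..body_start.checked_add(len)?)
}
```

Because no block depends on state left behind by a previous one, a reader can fetch the blocks in whatever order they arrive, which is why a single long-lived streaming writer does not fit this layout.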

Do you think it makes sense for me to experiment with reusable writers, to check whether they give any benefit across multiple batches? I think we could keep a map of writers where the key is schema hash + compression algorithm. During a real job I would not expect more than 50-100 writers.

Feel free to experiment, and let's see what else we can learn. I'm not sure exactly what you are planning on trying but it seems that we could just have one writer per PartitionBuffer and write multiple batches to it. It may complicate the spill logic but you can just disable that for testing the idea. I wonder how this would differ from just increasing the shuffle batch size so that we write larger batches?

Yeah, increasing batch_size is another way to make sure the compression writer is created less often.
The idea of the test was to find out how expensive writer creation is, to see whether it could be an issue.

Perhaps I can just create a benchmark and quantify it, so that it gives us some ideas.

What I found is:

Time elapsed in writer_create is: 107.125µs
Time elapsed in total is: 1.699042ms

On average, writer creation is 7% of the overall time. However, making the writer a singleton is not a trivial task...
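
For reference, a sketch of how such a measurement can be taken and turned into a percentage using only `std::time`. The helper names `timed` and `fraction_of` are hypothetical, and this is not the code that produced the numbers above.

```rust
use std::time::{Duration, Instant};

/// Fraction of `total` spent in `part`, e.g. 0.07 for 7%.
/// `total` must be non-zero for the result to be meaningful.
fn fraction_of(part: Duration, total: Duration) -> f64 {
    part.as_secs_f64() / total.as_secs_f64()
}

/// Runs a closure and returns its result together with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}
```

Applied to the single sample quoted above, `fraction_of(Duration::from_nanos(107_125), Duration::from_nanos(1_699_042))` is about 0.063, which is in line with the reported average of roughly 7%.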

In the PR that follows this one (#1190) I propose replacing ArrowWriter with a faster proprietary version.

andygrove · 2025-01-06T18:38:42Z

@viirya @kazuyukitanimura @mbutrovich @comphead Thanks for the reviews so far. I believe I have addressed all feedback now.

kazuyukitanimura

Thanks @andygrove

kazuyukitanimura · 2025-01-06T21:24:34Z

common/src/main/scala/org/apache/comet/CometConf.scala

+          "spark.shuffle.compress=false.")
+      .stringConf
+      .checkValues(Set("zstd", "lz4", "snappy"))
+      .createWithDefault("lz4")

  val COMET_EXEC_SHUFFLE_COMPRESSION_LEVEL: ConfigEntry[Int] =


nit: since the config name now includes zstd, the constant name should ideally reflect that, but this is optional.
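
To show how the validated setting might map onto a codec choice on the native side, here is a hedged sketch. `ShuffleCodec` and `parse_codec` are hypothetical names rather than the PR's API; the accepted values and the `lz4` default mirror the `checkValues` and `createWithDefault` calls in the snippet above, and the level is passed only to zstd, following the note that the level config name refers to zstd.

```rust
/// Hypothetical native-side mirror of the shuffle compression setting.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ShuffleCodec {
    Zstd(i32), // carries the configured zstd compression level
    Lz4,
    Snappy,
}

/// Maps the configured codec name to a codec, falling back to "lz4" when unset
/// and rejecting anything outside the allowed set.
fn parse_codec(name: Option<&str>, zstd_level: i32) -> Result<ShuffleCodec, String> {
    match name.unwrap_or("lz4") {
        "zstd" => Ok(ShuffleCodec::Zstd(zstd_level)),
        "lz4" => Ok(ShuffleCodec::Lz4),
        "snappy" => Ok(ShuffleCodec::Snappy),
        other => Err(format!("unsupported shuffle compression codec: {}", other)),
    }
}
```

Rejecting unknown names up front keeps a misconfigured codec from surfacing later as a decode failure.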

andygrove mentioned this pull request Dec 21, 2024

feat: Add support for LZ4 compression #1181

Closed

andygrove commented Dec 21, 2024

andygrove marked this pull request as ready for review December 21, 2024 21:57

andygrove marked this pull request as draft December 21, 2024 22:52

This was referenced Dec 21, 2024

minor: move shuffle classes from common to spark #1193

Merged

minor: refactor prepare_output so that it does not require an ExecutionContext #1194

Merged

minor: refactor to move decodeBatches to broadcast exchange code as private function #1195

Merged

Implement native decoding and decompression

8ce9bb5

andygrove force-pushed the native-decode branch from 577880a to 8ce9bb5 Compare December 22, 2024 18:44

andygrove added 3 commits December 22, 2024 11:46

revert some variable renaming for smaller diff

a9a0593

fix oom issues?

11320a5

upmerge

e2f28f9

andygrove added 4 commits December 23, 2024 08:01

make NativeBatchDecoderIterator more consistent with ArrowReaderIterator

c97eb58

fix oom and prep for review

4ffe47d

format

68d2331

Add LZ4 support

a3fb105

andygrove changed the title ~~feat: Move shuffle block decompression and decoding to native code~~ feat: Move shuffle block decompression and decoding to native code and add LZ4 support Dec 23, 2024

clippy, new benchmark

b593e80

andygrove marked this pull request as ready for review December 23, 2024 17:43

Dandandan reviewed Dec 23, 2024

andygrove added 3 commits December 27, 2024 11:15

rename metrics, clean up lz4 code

4078551

update test

f286309

Add support for snappy

fbc2124

andygrove changed the title ~~feat: Move shuffle block decompression and decoding to native code and add LZ4 support~~ feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support Dec 27, 2024

andygrove added 2 commits December 27, 2024 13:18

format

bed543a

change default back to lz4

e13d72f

andygrove added 2 commits December 28, 2024 15:19

upmerge

b56089b

use faster unsafe version of lz4_flex

f66bced

andygrove added 4 commits January 1, 2025 12:41

Make compression codec configurable for columnar shuffle

0b2f0e9

clippy

ad1adc1

fix bench

3e15b12

fmt

1c08a4b

mbutrovich reviewed Jan 2, 2025

address feedback

7dd2ff6

mbutrovich reviewed Jan 2, 2025

address feedback

1fc3d49

mbutrovich reviewed Jan 2, 2025

andygrove added 2 commits January 2, 2025 14:59

address feedback

5a3fb2e

minor code simplification

2b88bbd

comphead reviewed Jan 2, 2025

andygrove added 2 commits January 3, 2025 12:57

upmerge

b4b6aff

cargo fmt

78340a1

kazuyukitanimura reviewed Jan 3, 2025

andygrove added 2 commits January 3, 2025 16:44

overflow check

69b54d9

rename compression level config

f41180d

kazuyukitanimura reviewed Jan 6, 2025

andygrove added 2 commits January 6, 2025 12:54

address feedback

9f176c9

address feedback

dd4f259

kazuyukitanimura approved these changes Jan 6, 2025

rename constant

7e4ddc9

andygrove merged commit 74a6a8d into apache:main Jan 7, 2025
74 checks passed

andygrove deleted the native-decode branch January 7, 2025 00:47
