feat: Support to specify partition key order in TableWrite operator #12355

JkSelf · 2025-02-17T08:59:41Z

The HiveDataSink generates partition directories based on the partition index derived from the input RowVector. Currently, the partition index is built by traversing the RowVector's columns and checking whether each one is a partition column, so the directory order always follows the column order of the RowVector. This leads to a mismatch when the specified partition key order differs from the column order. For instance, if the input RowVector is (a, b, c) and the partition key is set as (b, a), the existing logic makes Velox create directories as (a={}/b={}), whereas Spark's partition directory format would be (b={}/a={}). To address this, this PR introduces a partitionKey parameter in the HiveInsertTableHandle. The partition index is then generated in the specified partitionKey order, so the directories match the user's intended partition key sequence.
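To make the mismatch concrete, here is a minimal standalone sketch of how a Hive-style partition directory name depends on the order in which the keys are visited. The helper `makePartitionDirectory` is hypothetical and is not part of Velox or of this PR; it only illustrates the naming scheme described above.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not a Velox API: joins "name=value" segments with '/'
// in exactly the order the (name, value) pairs are given.
std::string makePartitionDirectory(
    const std::vector<std::pair<std::string, std::string>>& keyValues) {
  std::string path;
  for (const auto& keyValue : keyValues) {
    if (!path.empty()) {
      path += '/';
    }
    path += keyValue.first + "=" + keyValue.second;
  }
  return path;
}
```

For a row with a=1 and b=2, visiting the keys in RowVector column order (a, b) yields `a=1/b=2`, while visiting them in the declared partition key order (b, a) yields `b=2/a=1`, which is the layout Spark expects.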

netlify · 2025-02-17T09:00:08Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`fd74b97`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/67b6a7f895c1340008ad642e

JkSelf · 2025-02-17T09:00:59Z

@majetideepak Can you help to review this PR? Thanks for your help.

Yuhta · 2025-02-17T16:37:26Z

velox/connectors/hive/HiveDataSink.h

@@ -300,6 +306,7 @@ class HiveInsertTableHandle : public ConnectorInsertTableHandle {
  const std::unordered_map<std::string, std::string> serdeParameters_;
  const std::shared_ptr<dwio::common::WriterOptions> writerOptions_;
  const bool ensureFiles_;
+  const std::vector<std::string> partitionKeys_;


Where is this used? I don't think Velox needs to be aware of these, it's not Velox who is creating the directories.

@Yuhta Thanks for your review.

The HiveDataSink generates partition directories based on the partition index derived from the input RowVector. Currently, the partition index is constructed by traversing the RowVector to determine if a column is partitioned. This approach can lead to a mismatch in partition key order if it differs from the order in the RowVector. For instance, if the input RowVector is (a, b, c) and the partition key is set as (b, a), Velox will create directories as (a={}/b={}) based on the existing logic. This does not align with Spark's partition directory format, which would be (b={}/a={}). To address this, we have introduced a partitionKey parameter in the HiveInsertTableHandle. This allows us to generate the partition index according to the specified partitionKey order, ensuring alignment with the user's desired partition key sequence.

I see, can we rename this method to partitionKeyOrder()? And comment that if specified, we use this order, otherwise we use the column order

@Yuhta Yes. Updated the method name to partitionKeyOrder(). And also added the related comments.
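The rule agreed on in this thread (use the specified order if there is one, otherwise fall back to the column order) can be sketched as below. This is an illustration under assumptions, not the code in the PR: `Column` and `partitionChannels` are hypothetical names, and the real HiveDataSink works on Velox column handles rather than this simplified struct.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an input column; not a Velox type.
struct Column {
  std::string name;
  bool isPartitionKey;
};

// Returns the indices of the partition columns in the order they should
// appear in the partition directory.
std::vector<std::size_t> partitionChannels(
    const std::vector<Column>& inputColumns,
    const std::vector<std::string>& partitionKeyOrder) {
  std::vector<std::size_t> channels;
  if (partitionKeyOrder.empty()) {
    // No explicit order: keep the order of the input columns.
    for (std::size_t i = 0; i < inputColumns.size(); ++i) {
      if (inputColumns[i].isPartitionKey) {
        channels.push_back(i);
      }
    }
    return channels;
  }
  // Explicit order: look up each key among the partition columns.
  for (const std::string& key : partitionKeyOrder) {
    for (std::size_t i = 0; i < inputColumns.size(); ++i) {
      if (inputColumns[i].isPartitionKey && inputColumns[i].name == key) {
        channels.push_back(i);
        break;
      }
    }
  }
  return channels;
}
```

With input columns (a, b, c), where a and b are partition keys, an empty order gives channels (0, 1) and the order (b, a) gives channels (1, 0). The sketch silently skips unknown keys; validating the order is a separate concern.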

Yuhta · 2025-02-18T15:53:01Z

I see, can we rename this method to partitionKeyOrder()? And comment that if specified, we use this order, otherwise we use the column order

JkSelf · 2025-02-19T02:09:55Z

@Yuhta I have resolved all your comments. Can you help to review again? Thanks.

majetideepak · 2025-02-19T14:27:29Z

@JkSelf There are build failures. Can you take a look?

Yuhta · 2025-02-19T16:40:45Z

velox/connectors/hive/HiveDataSink.h

+    if (partitionKeyOrder_.size() > 0) {
+      // Ensure the partitionKeyOrder contains all the partition keys in
+      // inputColumns_.
+      std::string partitionKeyNames;


Better to also check there is no repetition. Something like

    folly::F14FastSet<std::string> partitionKeyNames(
        partitionKeyOrder_.begin(), partitionKeyOrder_.end());
    // ...
    VELOX_CHECK(partitionKeyNames.erase(inputColumn->name()) == 1);
    // ...
    VELOX_CHECK(partitionKeyNames.empty());

Also, in Presto we have a convention of putting all partitioning columns at the beginning of the RowVector in the right order. Is there anything in Spark preventing us from doing so?

@Yuhta There is no such convention in Spark to put the partition columns at the beginning of the input columns.
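The set-based check suggested in this thread can be sketched in standalone form as follows. Several things here are assumptions made for illustration: `std::unordered_set` stands in for `folly::F14FastSet`, a `bool` return value stands in for `VELOX_CHECK`, and the function name `isValidPartitionKeyOrder` is hypothetical. The explicit size comparison is an addition of this sketch, so that a name repeated in the order itself is rejected even though the set would collapse it.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Returns true only if partitionKeyOrder lists every partition column exactly
// once and contains no other names.
bool isValidPartitionKeyOrder(
    const std::vector<std::string>& partitionColumns,
    const std::vector<std::string>& partitionKeyOrder) {
  std::unordered_set<std::string> remaining(
      partitionKeyOrder.begin(), partitionKeyOrder.end());
  // A repeated name collapses inside the set, so the sizes would differ.
  if (remaining.size() != partitionKeyOrder.size()) {
    return false;
  }
  // Every partition column must be present in the order.
  for (const std::string& name : partitionColumns) {
    if (remaining.erase(name) != 1) {
      return false;
    }
  }
  // Anything left over is a name that is not a partition column.
  return remaining.empty();
}
```

For partition columns (a, b), the order (b, a) is accepted, while (b, a, b), (b) and (a, b, c) are all rejected.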

JkSelf · 2025-02-20T06:14:11Z

@Yuhta @majetideepak Can you help to review again? Thanks.

JkSelf requested a review from majetideepak as a code owner February 17, 2025 08:59

facebook-github-bot added the CLA Signed label Feb 17, 2025

JkSelf mentioned this pull request Feb 17, 2025

Partitioned writes with multiple columns creates wrong directory structure if child output columns is not in same order apache/incubator-gluten#8663

Open

Specify partition key order in TableWrite operator

bb9c90b

Yuhta reviewed Feb 17, 2025

Yuhta reviewed Feb 18, 2025

JkSelf force-pushed the partitionKey branch from 9d0cf7d to 9e25d80 Compare February 19, 2025 02:08

Resolve comments

9e25d80

Yuhta reviewed Feb 19, 2025

Resolve comments

fd74b97
