
feat: Add config to throw exception for duplicate keys in Spark map_concat function #12379

Open · wants to merge 2 commits into base: main

Conversation

rui-mo (Collaborator) commented Feb 18, 2025

Spark has two policies, EXCEPTION and LAST_WIN, for handling duplicate keys
in map functions such as CreateMap, MapFromArrays, MapFromEntries, StringToMap,
MapConcat, and TransformKeys.

EXCEPTION behaviour: throws an exception when a duplicate key is found in the map.
LAST_WIN behaviour: the result value comes from the last inserted element.

By default, Velox treats duplicate keys with the LAST_WIN policy. This PR adds the
config spark.throw_exception_on_duplicate_map_keys to enable/disable throwing an
exception when duplicate keys are found in a map. Currently, this flag is used in
the MapConcat function; it will be reused in other map functions as well.

Based on #9562 from @Surbhi-Vijay.
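To illustrate the two policies described above, here is a minimal standalone C++ sketch (not Velox code; the `concatMaps` helper and `DuplicateKeyPolicy` enum are hypothetical names for this example) showing how concatenating maps behaves under LAST_WIN versus EXCEPTION:

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical illustration of Spark's duplicate-key policies; not Velox code.
enum class DuplicateKeyPolicy { kException, kLastWin };

std::map<std::string, int> concatMaps(
    const std::vector<std::map<std::string, int>>& inputs,
    DuplicateKeyPolicy policy) {
  std::map<std::string, int> result;
  for (const auto& input : inputs) {
    for (const auto& [key, value] : input) {
      auto [it, inserted] = result.emplace(key, value);
      if (!inserted) {
        if (policy == DuplicateKeyPolicy::kException) {
          // EXCEPTION: fail as soon as a duplicate key is found.
          throw std::invalid_argument("Duplicate map key: " + key);
        }
        // LAST_WIN: the value from the last inserted element wins.
        it->second = value;
      }
    }
  }
  return result;
}
```

With LAST_WIN, concatenating `{"a": 1}` and `{"a": 2, "b": 3}` yields `{"a": 2, "b": 3}`; with EXCEPTION, the same input throws on the duplicate key "a".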

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 18, 2025
netlify bot commented Feb 18, 2025

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 7da1426
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67b737d13e9310000891ae90

@@ -831,6 +837,10 @@ class QueryConfig {
return get<bool>(kSparkLegacyDateFormatter, false);
}

bool sparkThrowExceptionOnDuplicateMapKeys() const {
  return get<bool>(kSparkThrowExceptionOnDuplicateMapKeys, false);
}

rui-mo (Collaborator, Author) replied:

Yes, Spark uses the EXCEPTION policy by default, while keeping LAST_WIN as the default in Velox is compatible with Presto. In Gluten, we can always set this configuration according to Spark's config.

Contributor replied:

Makes sense.

Contributor commented:

The name indicates it is only used for Spark; is it also used for Presto? If not, I think aligning with Spark's default value might be more reasonable. Not a big issue, just mentioning it.

Contributor replied:

Or remove "spark." from the config name.

queryCtx_->testingOverrideConfigUnsafe({
{core::QueryConfig::kSparkThrowExceptionOnDuplicateMapKeys, "true"},
});
VELOX_ASSERT_THROW(
Contributor commented:

Suggestion: use VELOX_ASSERT_USER_THROW here.
