feat: Add config to throw exception for duplicate keys in Spark map_concat function #12379
base: main
Conversation
@@ -831,6 +837,10 @@ class QueryConfig {
    return get<bool>(kSparkLegacyDateFormatter, false);
  }

  bool sparkThrowExceptionOnDuplicateMapKeys() const {
    return get<bool>(kSparkThrowExceptionOnDuplicateMapKeys, false);
I think the default value here should be true.
https://github.com/apache/spark/blob/26febf763c1bc386a719c819c0bdfbee0abd948b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L4498C42-L4498C51
Yes, Spark uses the EXCEPTION policy by default, while keeping LAST_WIN as the default in Velox is compatible with Presto. In Gluten, we can always set this configuration according to Spark's config.
Makes sense.
The name indicates it is only used for Spark; is it also used for Presto? If not, I think aligning with Spark's default value might be more reasonable.
Not a big issue, just mentioning it.
Or remove "spark." from the config name.
queryCtx_->testingOverrideConfigUnsafe({
    {core::QueryConfig::kSparkThrowExceptionOnDuplicateMapKeys, "true"},
});
VELOX_ASSERT_THROW(
VELOX_ASSERT_USER_THROW
Spark has two policies, EXCEPTION and LAST_WIN, to deal with duplicate keys in map functions like CreateMap, MapFromArrays, MapFromEntries, StringToMap, MapConcat and TransformKeys.
EXCEPTION behaviour: throws an exception when a duplicate key is found in a map.
LAST_WIN behaviour: the result value comes from the last inserted element.
Velox by default treats duplicate keys with the LAST_WIN policy. This PR adds the config spark.throw_exception_on_duplicate_map_keys to enable/disable the behaviour of throwing an exception when duplicate keys are found in a map.
Currently, this flag is utilised in the MapConcat function. It will be reused in other map functions as well.
Based on #9562 from @Surbhi-Vijay.
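To illustrate the two policies, here is a minimal standalone C++ sketch (not the actual Velox implementation; `concatMaps` and `throwOnDuplicate` are hypothetical names) of how a map-concat operation can resolve duplicate keys under LAST_WIN versus EXCEPTION:

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of duplicate-key resolution in a map-concat operation.
// throwOnDuplicate=false -> LAST_WIN: the last inserted value for a key wins.
// throwOnDuplicate=true  -> EXCEPTION: a duplicate key raises an error,
// mirroring what the PR's config flag enables.
std::map<std::string, int> concatMaps(
    const std::vector<std::vector<std::pair<std::string, int>>>& maps,
    bool throwOnDuplicate) {
  std::map<std::string, int> result;
  for (const auto& m : maps) {
    for (const auto& [key, value] : m) {
      auto it = result.find(key);
      if (it != result.end()) {
        if (throwOnDuplicate) {
          // EXCEPTION policy: reject the duplicate key outright.
          throw std::invalid_argument("Duplicate map key found: " + key);
        }
        // LAST_WIN policy: later value overwrites the earlier one.
        it->second = value;
      } else {
        result.emplace(key, value);
      }
    }
  }
  return result;
}
```

With `{"a": 1}` concatenated with `{"a": 2, "b": 3}`, LAST_WIN yields `{"a": 2, "b": 3}`, while the EXCEPTION policy throws on key `"a"`.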