-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28675: Maximize the removal of redundant columns from GROUP BY clauses #5586
Conversation
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRelFieldTrimmer.java
Outdated
Show resolved
Hide resolved
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
LGTM +1. Just left a minor comment for my understanding, please feel free to merge if this question is irrelevant. :) |
EXPLAIN CBO SELECT passport, COUNT(1) FROM passenger GROUP BY id, fname, lname, passport; | ||
EXPLAIN CBO SELECT fname, COUNT(1) FROM passenger GROUP BY id, fname, lname, passport; | ||
EXPLAIN CBO SELECT lname, COUNT(1) FROM passenger GROUP BY id, fname, lname, passport; | ||
EXPLAIN CBO SELECT fname, lname, COUNT(1) FROM passenger GROUP BY id, fname, lname, passport; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, is having group by fname,lname always a better plan? -- even if there is a different aggregate function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we change the aggregate function the query changes as well.
If we assume that the set of redundant columns remains the same after changing the aggregate function (e.g., MAX(fname)
) then it's always a good idea to drop unnecessary processing so grouping on fname, lname
is the best option.
If the set of redundant columns changes (e.g., MAX(passport)
) then this optimization will not trigger in the same way (passport
column cannot be dropped).
What changes were proposed in this pull request?
Enhance
HiveRelFieldTrimmer
to remove the maximum number of redundant columns from theGROUP BY
clause.Why are the changes needed?
For more see HIVE-28675.
Does this PR introduce any user-facing change?
More efficient query plans.
Is the change a dependency upgrade?
No
How was this patch tested?