-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PHOENIX-7489 Add all partition ids internally to optimize full CDC Index scan queries #2070
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,10 +18,14 @@ | |
|
||
package org.apache.phoenix.util; | ||
|
||
import java.sql.Connection; | ||
import java.sql.ResultSet; | ||
import java.sql.SQLException; | ||
import java.sql.Types; | ||
import java.util.ArrayList; | ||
import java.util.Base64; | ||
import java.util.HashSet; | ||
import java.util.List; | ||
import java.util.Map; | ||
import java.util.NavigableSet; | ||
import java.util.Set; | ||
|
@@ -30,9 +34,12 @@ | |
import org.apache.hadoop.hbase.client.Scan; | ||
import org.apache.hadoop.hbase.util.Bytes; | ||
import org.apache.hadoop.util.StringUtils; | ||
import org.apache.hbase.thirdparty.org.apache.commons.collections4.CollectionUtils; | ||
import org.apache.phoenix.exception.SQLExceptionCode; | ||
import org.apache.phoenix.exception.SQLExceptionInfo; | ||
import org.apache.phoenix.execute.DescVarLengthFastByteComparisons; | ||
import org.apache.phoenix.parse.ParseNode; | ||
import org.apache.phoenix.parse.PartitionIdParseNode; | ||
import org.apache.phoenix.schema.PTable; | ||
import org.apache.phoenix.schema.types.PDataType; | ||
import org.apache.phoenix.schema.types.PVarchar; | ||
|
@@ -155,6 +162,73 @@ public static boolean isBinaryType(PDataType dataType) { | |
|| dataType.getSqlType() == PDataType.VARBINARY_ENCODED_TYPE); | ||
} | ||
|
||
/** | ||
* Return true if the parseNode or any of its children contains PARTITION_ID() function. | ||
* | ||
* @param parseNode The parseNode from Where clause. | ||
* @return True if the parseNode or any of its children contains PARTITION_ID() | ||
* function. False otherwise. | ||
*/ | ||
public static boolean isPartitionIdIncludedInTree(ParseNode parseNode) { | ||
if (parseNode instanceof PartitionIdParseNode) { | ||
return true; | ||
} | ||
if (parseNode == null || CollectionUtils.isEmpty(parseNode.getChildren())) { | ||
return false; | ||
} | ||
return parseNode.getChildren().stream() | ||
.anyMatch(CDCUtil::isPartitionIdIncludedInTree); | ||
} | ||
|
||
/** | ||
* Add IN Operator for PARTITION_ID() so that the full table scan CDC query can be | ||
* optimized to be range scan. | ||
* | ||
* @param conn The Connection. | ||
* @param cdcName CDC Object name. | ||
* @param query SQL Query Statement. | ||
* @return Updated query including PartitionId with IN operator. | ||
* @throws SQLException If the distinct partition ids retrieval fails. | ||
*/ | ||
public static String addPartitionInList(final Connection conn, final String cdcName, | ||
final String query) throws SQLException { | ||
ResultSet rs = conn.createStatement().executeQuery("SELECT DISTINCT PARTITION_ID() FROM " | ||
+ cdcName); | ||
List<String> partitionIds = new ArrayList<>(); | ||
while (rs.next()) { | ||
partitionIds.add(rs.getString(1)); | ||
} | ||
if (partitionIds.isEmpty()) { | ||
return query; | ||
} | ||
StringBuilder builder; | ||
boolean queryHasWhere = query.contains(" WHERE "); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As @haridsv mentioned this where check is also fragile as you cannot assume that there will always be fixed whitespace around the WHERE keyword so you will have to do again a regex match if we do the string re-writing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The thing I don't like about doing this change right at the top level in executeQuery() is the special handling you are doing of CDC use case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can have several Statement implementations but we can't have several executeQuery() implementations right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But we already have several queryplan implementations right ? This is what you are essentially doing. Building a query plan for a CDC query. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes it is building the query plan for sure but the reason why I think it should be done here is because during query compilation phase, we don't execute queries. Combining those two would make it look worse. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do you think executing some queries during query compilation phase is a concern ? Maybe you could elaborate on that. The goal is to generate the plan. Now if we need too execute some query to generate the plan what is the issue there ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, why |
||
if (queryHasWhere) { | ||
builder = new StringBuilder(query); | ||
builder.append(" AND PARTITION_ID() IN ("); | ||
} else { | ||
builder = new StringBuilder(query.split(cdcName)[0]); | ||
builder.append(cdcName); | ||
builder.append(" WHERE PARTITION_ID() IN ("); | ||
} | ||
boolean initialized = false; | ||
for (String partitionId : partitionIds) { | ||
if (!initialized) { | ||
builder.append("'"); | ||
initialized = true; | ||
} else { | ||
builder.append(",'"); | ||
} | ||
builder.append(partitionId); | ||
builder.append("'"); | ||
} | ||
builder.append(")"); | ||
if (!queryHasWhere) { | ||
builder.append(query.split(cdcName)[1]); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is very risky, the CDC name can in theory appear anywhere in the query, say a column name or a where clause. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The first appearance should be "FROM " right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In a CDC query, one can select specific PK columns, so if CDC name happens to appear in one of the column names, the string will be split in the wrong place. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You mean the regex string |
||
} | ||
return builder.toString(); | ||
} | ||
|
||
public enum CdcStreamStatus { | ||
ENABLED("ENABLED"), | ||
ENABLING("ENABLING"), | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, a tab or a newline can also be there around
WHERE
so you need to check for whitespace, not just space.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible approach is to manipulate the parse tree and run the partition id query as a subquery something like
WHERE PARTITION_ID IN (SELECT PARTITION_ID() FROM ....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to have subquery but PARTITION_ID() function does not work with subquery, resolving it turns out to be more complicated without significant changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I recommended even with the current query syntax, see my overall comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I came across it but
WHERE PARTITION_ID() IN (SELECT DISTINCT PARTITION_ID() FROM ....)
does not work, it needs more involved change, it does not matter whether we add it to the parse tree or re-compile from statement.The reason why I liked recompiling statement within executeQuery() is because it allows one to execute sub-query, get the result and then recompile original statement. It would not be prudent to execute query as part of ParseNode creations.
On the other hand, if things could work without us having to execute a query and use result for original query (e.g. having sub-query or using ORDER BY or additional WHERE clause expressions that use static ParseNode values), it makes more sense to make changes while generating nodes in QueryCompiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is what I was suggesting. Get the distinct values and generate the IN clause with the static values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's doable but then we need QueryCompiler to execute query which would seem odd for CDC use case.
On the other hand, executeQuery already can execute any query, it's the appropriate stage for query execution whereas query compilation should only compile and generate ParseNode rather than execute anything additional.
Query manipulation with something like ORDER BY or adding sub-query can all be done in compilation phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WHERE surrounded by \s+ should be good.
Again, all of this happens only if user has not provided PARTITION_ID() IN clause and this is very rare use case.
ParseNode check is already done before the execution flow comes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. How about executing the query here, but passing the partition IDs via a field on the statement object? We could make this a generic "postProcessPlan" sort of a concept for which the base statement class does nothing, but a specialized statement class meant for CDC will override the method and modify the plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, before calling into the second compilePlan, this new statement object of custom class will be constructed which is basically a clone of the original, except for this extra list of partition IDs and that is what goes into the second call.