-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PHOENIX-7489 Add all partition ids internally to optimize full CDC Index scan queries #2070
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string manipulation code was adequate for the ITs, but not robust enough for general purpose. I would suggest to manipulate the parse tree for accuracy. I played around with this approach for inserting projections and ORDER BY
clause that you can refer.
return query; | ||
} | ||
StringBuilder builder; | ||
boolean queryHasWhere = query.contains(" WHERE "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, a tab or a newline can also be there around WHERE
so you need to check for whitespace, not just space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible approach is to manipulate the parse tree and run the partition id query as a subquery something like WHERE PARTITION_ID IN (SELECT PARTITION_ID() FROM ....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to have subquery but PARTITION_ID() function does not work with subquery, resolving it turns out to be more complicated without significant changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
manipulate the parse tree
This is what I recommended even with the current query syntax, see my overall comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I came across it but WHERE PARTITION_ID() IN (SELECT DISTINCT PARTITION_ID() FROM ....)
does not work, it needs more involved change, it does not matter whether we add it to the parse tree or re-compile from statement.
The reason why I liked recompiling statement within executeQuery() is because it allows one to execute sub-query, get the result and then recompile original statement. It would not be prudent to execute query as part of ParseNode creations.
On the other hand, if things could work without us having to execute a query and use result for original query (e.g. having sub-query or using ORDER BY or additional WHERE clause expressions that use static ParseNode values), it makes more sense to make changes while generating nodes in QueryCompiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
additional WHERE clause expressions that use static ParseNode values
Yes, this is what I was suggesting. Get the distinct values and generate the IN clause with the static values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's doable but then we need QueryCompiler to execute query which would seem odd for CDC use case.
On the other hand, executeQuery already can execute any query, it's the appropriate stage for query execution whereas query compilation should only compile and generate ParseNode rather than execute anything additional.
Query manipulation with something like ORDER BY or adding sub-query can all be done in compilation phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WHERE surrounded by \s+ should be good.
Again, all of this happens only if user has not provided PARTITION_ID() IN clause and this is very rare use case.
ParseNode check is already done before the execution flow comes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need QueryCompiler to execute query which would seem odd for CDC use case
I agree. How about executing the query here, but passing the partition IDs via a field on the statement object? We could make this a generic "postProcessPlan" sort of a concept for which the base statement class does nothing, but a specialized statement class meant for CDC will override the method and modify the plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, before calling into the second compilePlan, this new statement object of custom class will be constructed which is basically a clone of the original, except for this extra list of partition IDs and that is what goes into the second call.
} | ||
builder.append(")"); | ||
if (!queryHasWhere) { | ||
builder.append(query.split(cdcName)[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very risky, the CDC name can in theory appear anywhere in the query, say a column name or a where clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first appearance should be "FROM " right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a CDC query, one can select specific PK columns, so if CDC name happens to appear in one of the column names, the string will be split in the wrong place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using "FROM//s+<cdc-name>"
should be good, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the regex string "FROM\\s+<cdc-name>"
? I guess it is good enough if we are still going with the string manipulation only.
return query; | ||
} | ||
StringBuilder builder; | ||
boolean queryHasWhere = query.contains(" WHERE "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @haridsv mentioned this where check is also fragile as you cannot assume that there will always be fixed whitespace around the WHERE keyword so you will have to do again a regex match if we do the string re-writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing I don't like about doing this change right at the top level in executeQuery() is the special handling you are doing of CDC use case. QueryPlan
is an interface. There are 10's of implementations of this interface. All of them are behind the abstraction. There is no special handling of any particular usage. Any special handling should happen under the covers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have several Statement implementations but we can't have several executeQuery() implementations right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we already have several queryplan implementations right ? This is what you are essentially doing. Building a query plan for a CDC query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is building the query plan for sure but the reason why I think it should be done here is because during query compilation phase, we don't execute queries. Combining those two would make it look worse.
If PARTITION_ID() in (SELECT DISTINCT PARTITION_ID() FROM <table>)
works, then I agree to using query compilation to manipulate parse nodes in the tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you think executing some queries during query compilation phase is a concern ? Maybe you could elaborate on that. The goal is to generate the plan. Now if we need too execute some query to generate the plan what is the issue there ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why PARTITION_ID()
doesn't work in subquery ? What is the problem there and shouldn't that be fixed ?
Jira: PHOENIX-7489
Since PHOENIX-7425 introduced partitioned CDC Index to eliminate salting, it is important to include PARTITION_ID() in addition to PHOENIX_ROW_TIMESTAMP() with the WHERE clause of the CDC query. Before PHOENIX-7425, providing only PHOENIX_ROW_TIMESTAMP() was sufficient as it was the rowkey prefix of the CDC Index table. However, that is not the case anymore.
If the user only provides PHOENIX_ROW_TIMESTAMP() with the WHERE clause, it would result into the full table scan over the CDC Index. By providing both PARTITION_ID() and PHOENIX_ROW_TIMESTAMP(), it results into the range scan.
Not all the clients might be aware of all unique partition ids present in the CDC Index. Hence, even if a client only provides the timestamp range with the CDC query, the list of partition ids should be internally retrieved and used alongside the timestamp range for the efficient range scan performance.