Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHOENIX-7489 Add all partition ids internally to optimize full CDC Index scan queries #2070

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

virajjasani
Copy link
Contributor

Jira: PHOENIX-7489

Since PHOENIX-7425 introduced partitioned CDC Index to eliminate salting, it is important to include PARTITION_ID() in addition to PHOENIX_ROW_TIMESTAMP() with the WHERE clause of the CDC query. Before PHOENIX-7425, providing only PHOENIX_ROW_TIMESTAMP() was sufficient as it was the rowkey prefix of the CDC Index table. However, that is not the case anymore.

If the user only provides PHOENIX_ROW_TIMESTAMP() with the WHERE clause, it would result into the full table scan over the CDC Index. By providing both PARTITION_ID() and PHOENIX_ROW_TIMESTAMP(), it results into the range scan.

Not all the clients might be aware of all unique partition ids present in the CDC Index. Hence, even if a client only provides the timestamp range with the CDC query, the list of partition ids should be internally retrieved and used alongside the timestamp range for the efficient range scan performance.

Copy link
Contributor

@haridsv haridsv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string manipulation code was adequate for the ITs, but not robust enough for general purpose. I would suggest to manipulate the parse tree for accuracy. I played around with this approach for inserting projections and ORDER BY clause that you can refer.

return query;
}
StringBuilder builder;
boolean queryHasWhere = query.contains(" WHERE ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, a tab or a newline can also be there around WHERE so you need to check for whitespace, not just space.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible approach is to manipulate the parse tree and run the partition id query as a subquery something like WHERE PARTITION_ID IN (SELECT PARTITION_ID() FROM ....)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to have subquery but PARTITION_ID() function does not work with subquery, resolving it turns out to be more complicated without significant changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manipulate the parse tree

This is what I recommended even with the current query syntax, see my overall comment.

Copy link
Contributor Author

@virajjasani virajjasani Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I came across it but WHERE PARTITION_ID() IN (SELECT DISTINCT PARTITION_ID() FROM ....) does not work, it needs more involved change, it does not matter whether we add it to the parse tree or re-compile from statement.

The reason why I liked recompiling statement within executeQuery() is because it allows one to execute sub-query, get the result and then recompile original statement. It would not be prudent to execute query as part of ParseNode creations.

On the other hand, if things could work without us having to execute a query and use result for original query (e.g. having sub-query or using ORDER BY or additional WHERE clause expressions that use static ParseNode values), it makes more sense to make changes while generating nodes in QueryCompiler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additional WHERE clause expressions that use static ParseNode values

Yes, this is what I was suggesting. Get the distinct values and generate the IN clause with the static values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's doable but then we need QueryCompiler to execute query which would seem odd for CDC use case.
On the other hand, executeQuery already can execute any query, it's the appropriate stage for query execution whereas query compilation should only compile and generate ParseNode rather than execute anything additional.
Query manipulation with something like ORDER BY or adding sub-query can all be done in compilation phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHERE surrounded by \s+ should be good.
Again, all of this happens only if user has not provided PARTITION_ID() IN clause and this is very rare use case.

ParseNode check is already done before the execution flow comes here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need QueryCompiler to execute query which would seem odd for CDC use case

I agree. How about executing the query here, but passing the partition IDs via a field on the statement object? We could make this a generic "postProcessPlan" sort of a concept for which the base statement class does nothing, but a specialized statement class meant for CDC will override the method and modify the plan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, before calling into the second compilePlan, this new statement object of custom class will be constructed which is basically a clone of the original, except for this extra list of partition IDs and that is what goes into the second call.

}
builder.append(")");
if (!queryHasWhere) {
builder.append(query.split(cdcName)[1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very risky, the CDC name can in theory appear anywhere in the query, say a column name or a where clause.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first appearance should be "FROM " right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a CDC query, one can select specific PK columns, so if CDC name happens to appear in one of the column names, the string will be split in the wrong place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using "FROM//s+<cdc-name>" should be good, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the regex string "FROM\\s+<cdc-name>"? I guess it is good enough if we are still going with the string manipulation only.

return query;
}
StringBuilder builder;
boolean queryHasWhere = query.contains(" WHERE ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @haridsv mentioned this where check is also fragile as you cannot assume that there will always be fixed whitespace around the WHERE keyword so you will have to do again a regex match if we do the string re-writing.

Copy link
Contributor

@tkhurana tkhurana Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing I don't like about doing this change right at the top level in executeQuery() is the special handling you are doing of CDC use case. QueryPlan is an interface. There are 10's of implementations of this interface. All of them are behind the abstraction. There is no special handling of any particular usage. Any special handling should happen under the covers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have several Statement implementations but we can't have several executeQuery() implementations right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we already have several queryplan implementations right ? This is what you are essentially doing. Building a query plan for a CDC query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is building the query plan for sure but the reason why I think it should be done here is because during query compilation phase, we don't execute queries. Combining those two would make it look worse.
If PARTITION_ID() in (SELECT DISTINCT PARTITION_ID() FROM <table>) works, then I agree to using query compilation to manipulate parse nodes in the tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think executing some queries during query compilation phase is a concern ? Maybe you could elaborate on that. The goal is to generate the plan. Now if we need too execute some query to generate the plan what is the issue there ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why PARTITION_ID() doesn't work in subquery ? What is the problem there and shouldn't that be fixed ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants