Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TableScanBuilder in PlanBuilder #7391

Closed
majetideepak opened this issue Nov 2, 2023 · 10 comments
Closed

Add TableScanBuilder in PlanBuilder #7391

majetideepak opened this issue Nov 2, 2023 · 10 comments
Labels
enhancement New feature or request

Comments

@majetideepak
Copy link
Collaborator

majetideepak commented Nov 2, 2023

Description

We currently have 4 APIs for TableScan in PlanBuilder
This was likely to enable optional arguments. The connectorId is still hard-coded.

We should instead have a TableScanBuilder on the lines of HiveConnectorSplitBuilder
The API will look like
tableScanBuilder().outputType(rowTypePtr).subfieldFilters({"c1 = true"}).remainingFilter{"c0 < 100"}.build();
This is longer but less confusing and simplifies the API. It is also more descriptive compared to tableScan(rowTypePtr, {"c1 = true"}, "c0 < 100").

@majetideepak majetideepak added the enhancement New feature or request label Nov 2, 2023
@majetideepak majetideepak changed the title Add TableScanBuilder Add TableScanBuilder in PlanBuilder Nov 2, 2023
@majetideepak
Copy link
Collaborator Author

@mbasmanova any thoughts?

@mbasmanova
Copy link
Contributor

At a high level, I like this idea, but can you show a working example? How will you add nodes on top of TableScan node built using tableScanBuilder?

@majetideepak
Copy link
Collaborator Author

majetideepak commented Nov 2, 2023

It will look like

auto tableScan = tableScanBuilder()
          .name("hive-table")
          .outputType(rowTypePtr)
          .subfieldFilters({"c1 = true"})
          .remainingFilter{"c0 < 100"}
          .build();
plan = PlanBuilder().add(tableScan).planNode();

@mbasmanova
Copy link
Contributor

I see, what about plans with multiple table scans (joins)? Would something like below be possible / makes sense?

plan = PlanBuilder(planNodeIdGenerator)
   .startTableScan()
          .table("hive-table")
          .outputType(rowTypePtr)
          .subfieldFilters({"c1 = true"})
          .remainingFilter("c0 < 100")
   .endTableScan()
   .planNode();

@majetideepak
Copy link
Collaborator Author

majetideepak commented Nov 3, 2023

tableScanBuilder() will return a PlanNodePtr of type TableScanNode. add() API in PlanBuilder will simply assign that to planNode_ field and return the reference. tableScanBuilder() is an independent class separate from PlanBuilder().
We can inline the builder. The problem with your proposal is that the API such as table(), outputType() will be part of the PlanBuilder class and we will have to add checks to ensure table() is invoked in the right context. That will complicate the API further.

Multi scans/joins will look like below. The following is an extract from TpchQueryBuilder

  auto customers = PlanBuilder(planNodeIdGenerator, pool_.get())
                       .add(TableScanBuilder()
                           .table(kCustomer)
                           .outputType(customerSelectedRowType)
                           .columnAliases(customerFileColumns)
                           .subfieldFilters({customerFilter})
                           .build())
                       .capturePlanNodeId(customerPlanNodeId)
                       .planNode();

  auto custkeyJoinNode =
      PlanBuilder(planNodeIdGenerator, pool_.get())
           .add(TableScanBuilder()
              .table(kOrders)
              .outputType(ordersSelectedRowType)
              .columnAliases(ordersFileColumns)
              .subfieldFilters({orderDateFilter})
             .build())
          .capturePlanNodeId(ordersPlanNodeId)
          .hashJoin(
              {"o_custkey"},
              {"c_custkey"},
              customers,
              "",
              {"o_orderdate", "o_shippriority", "o_orderkey"})
          .planNode(); 

@mbasmanova
Copy link
Contributor

@majetideepak Deepak, you need to make sure different TableScan nodes have different node IDs (in the same query plan).

The problem with your proposal is that the API such as table(), outputType() will be part of the PlanBuilder class

Not necessarily.

  • PlanBuilder::startTableScan() returns TableScanBuilder&.
  • TableScanBuilder::endTableScan() returns PlanBuilder&.

@majetideepak
Copy link
Collaborator Author

you need to make sure different TableScan nodes have different node IDs (in the same query plan)

I see we have nextPlanNodeId() as an argument. I guess it is convenient to have it sequential and not a random Id across plan nodes.
Thanks for clarifying your API. That makes sense!
I will work on that.

@majetideepak
Copy link
Collaborator Author

majetideepak commented Nov 3, 2023

We could also add a new API setId() to PlanNode and make it work inside add(). But startTableScan() and endTableScan() works.

@mbasmanova
Copy link
Contributor

add a new API setId()

It would be preferable to keep PlanNode classes immutable.

@majetideepak
Copy link
Collaborator Author

Completed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants