Add TableScanBuilder in PlanBuilder #7391

majetideepak · 2023-11-02T19:46:25Z

Description

PlanBuilder currently has 4 APIs for TableScan.
This was likely done to enable optional arguments. The connectorId is still hard-coded.

We should instead have a TableScanBuilder along the lines of HiveConnectorSplitBuilder.
The API will look like
tableScanBuilder().outputType(rowTypePtr).subfieldFilters({"c1 = true"}).remainingFilter("c0 < 100").build();
This is longer but less confusing, and it simplifies the API. It is also more descriptive than tableScan(rowTypePtr, {"c1 = true"}, "c0 < 100").
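
To make the proposed shape concrete, here is a minimal, self-contained sketch of a fluent builder with optional arguments. It is not the Velox implementation: `ScanSpec`, the plain string column list standing in for `rowTypePtr`, the `connectorId()` setter, and the `"default-connector"` default are all assumptions made for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type; a simplified stand-in for a TableScan plan node.
struct ScanSpec {
  std::string connectorId = "default-connector"; // assumed default, overridable
  std::vector<std::string> outputColumns;        // stands in for rowTypePtr
  std::vector<std::string> subfieldFilters;
  std::string remainingFilter;
};

class TableScanBuilder {
 public:
  // Each setter stores one optional argument and returns *this for chaining.
  TableScanBuilder& connectorId(std::string id) {
    spec_.connectorId = std::move(id);
    return *this;
  }
  TableScanBuilder& outputType(std::vector<std::string> columns) {
    spec_.outputColumns = std::move(columns);
    return *this;
  }
  TableScanBuilder& subfieldFilters(std::vector<std::string> filters) {
    spec_.subfieldFilters = std::move(filters);
    return *this;
  }
  TableScanBuilder& remainingFilter(std::string filter) {
    spec_.remainingFilter = std::move(filter);
    return *this;
  }

  // Validates the one required argument and returns a copy of the spec.
  ScanSpec build() const {
    if (spec_.outputColumns.empty()) {
      throw std::invalid_argument("outputType() is required");
    }
    return spec_;
  }

 private:
  ScanSpec spec_;
};

// Usage mirroring the call chain proposed in the description.
inline ScanSpec exampleScan() {
  return TableScanBuilder()
      .outputType({"c0", "c1"})
      .subfieldFilters({"c1 = true"})
      .remainingFilter("c0 < 100")
      .build();
}
```

With this shape, a caller names only the arguments it needs, and a new optional argument becomes one more setter rather than another overload.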

majetideepak · 2023-11-02T19:46:56Z

@mbasmanova any thoughts?

mbasmanova · 2023-11-02T20:11:22Z

At a high level, I like this idea, but can you show a working example? How will you add nodes on top of a TableScan node built using tableScanBuilder?

majetideepak · 2023-11-02T23:58:08Z

It will look like

auto tableScan = tableScanBuilder()
                     .name("hive-table")
                     .outputType(rowTypePtr)
                     .subfieldFilters({"c1 = true"})
                     .remainingFilter("c0 < 100")
                     .build();
plan = PlanBuilder().add(tableScan).planNode();

mbasmanova · 2023-11-03T13:08:26Z

I see. What about plans with multiple table scans (joins)? Would something like the following be possible, and would it make sense?

plan = PlanBuilder(planNodeIdGenerator)
   .startTableScan()
          .table("hive-table")
          .outputType(rowTypePtr)
          .subfieldFilters({"c1 = true"})
          .remainingFilter("c0 < 100")
   .endTableScan()
   .planNode();

majetideepak · 2023-11-03T14:48:43Z

tableScanBuilder()...build() will return a PlanNodePtr that points to a TableScanNode. The add() API in PlanBuilder will simply assign it to the planNode_ field and return a reference to the builder. TableScanBuilder is an independent class, separate from PlanBuilder.
We can inline the builder. The problem with your proposal is that APIs such as table() and outputType() will be part of the PlanBuilder class, and we will have to add checks to ensure that table() is invoked in the right context. That will complicate the API further.
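
A rough sketch of how such an add() could behave, under simplifying assumptions: the `PlanNode` struct, the `filter()` operator, and the `makeScan()` helper below are hypothetical stand-ins, not Velox types. The point is only that add() adopts an externally built node as the current root, so later calls stack on top of it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical plan node; a simplified stand-in, not Velox's PlanNode.
struct PlanNode {
  std::string id;
  std::string name;
  std::vector<std::shared_ptr<const PlanNode>> sources;
};
using PlanNodePtr = std::shared_ptr<const PlanNode>;

class PlanBuilder {
 public:
  // Adopts a node built elsewhere as the current root of the plan.
  PlanBuilder& add(PlanNodePtr node) {
    assert(node != nullptr);
    planNode_ = std::move(node);
    return *this;
  }

  // Hypothetical operator: stacks a filter node on top of the current root.
  PlanBuilder& filter(const std::string& id, const std::string& predicate) {
    assert(planNode_ != nullptr);
    auto node = std::make_shared<PlanNode>();
    node->id = id;
    node->name = "Filter(" + predicate + ")";
    node->sources.push_back(planNode_);
    planNode_ = std::move(node);
    return *this;
  }

  const PlanNodePtr& planNode() const { return planNode_; }

 private:
  PlanNodePtr planNode_;
};

// Hypothetical helper standing in for TableScanBuilder()...build().
// The caller picks the node ID here, outside of any shared ID generator.
inline PlanNodePtr makeScan(const std::string& id, const std::string& table) {
  auto node = std::make_shared<PlanNode>();
  node->id = id;
  node->name = "TableScan(" + table + ")";
  return node;
}
```

For example, `PlanBuilder().add(makeScan("0", "hive-table")).filter("1", "c0 < 100").planNode()` yields a filter node whose single source is the scan. Because the scan's ID is chosen outside the builder in this sketch, nothing prevents two scans in one plan from receiving the same ID.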

Plans with multiple scans and joins will look like the following, which is an extract from TpchQueryBuilder:

  auto customers = PlanBuilder(planNodeIdGenerator, pool_.get())
                       .add(TableScanBuilder()
                                .table(kCustomer)
                                .outputType(customerSelectedRowType)
                                .columnAliases(customerFileColumns)
                                .subfieldFilters({customerFilter})
                                .build())
                       .capturePlanNodeId(customerPlanNodeId)
                       .planNode();

  auto custkeyJoinNode =
      PlanBuilder(planNodeIdGenerator, pool_.get())
          .add(TableScanBuilder()
                   .table(kOrders)
                   .outputType(ordersSelectedRowType)
                   .columnAliases(ordersFileColumns)
                   .subfieldFilters({orderDateFilter})
                   .build())
          .capturePlanNodeId(ordersPlanNodeId)
          .hashJoin(
              {"o_custkey"},
              {"c_custkey"},
              customers,
              "",
              {"o_orderdate", "o_shippriority", "o_orderkey"})
          .planNode();

mbasmanova · 2023-11-03T14:53:05Z

@majetideepak Deepak, you need to make sure different TableScan nodes have different node IDs (in the same query plan).

> The problem with your proposal is that the API such as table(), outputType() will be part of the PlanBuilder class

Not necessarily.

PlanBuilder::startTableScan() returns TableScanBuilder&.
TableScanBuilder::endTableScan() returns PlanBuilder&.
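
The two return types above can be sketched as a nested builder. This is an illustration under assumptions, not the eventual Velox code: `IdGenerator`, `ScanNode`, and the member names are hypothetical. The scan builder holds a reference to its parent, so table() and friends stay off PlanBuilder, and the node ID is drawn from the parent's shared generator when the node is constructed.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sequential ID source shared by all builders of one query plan.
class IdGenerator {
 public:
  std::string next() { return std::to_string(nextId_++); }

 private:
  int nextId_ = 0;
};

// Hypothetical scan node; const members keep it immutable once constructed.
struct ScanNode {
  const std::string id;
  const std::string table;
  const std::vector<std::string> filters;
};

class PlanBuilder {
 public:
  // Scan-specific setters live here, not on PlanBuilder itself.
  class TableScanBuilder {
   public:
    explicit TableScanBuilder(PlanBuilder& parent) : parent_(parent) {}

    TableScanBuilder& table(std::string name) {
      table_ = std::move(name);
      return *this;
    }
    TableScanBuilder& subfieldFilters(std::vector<std::string> filters) {
      filters_ = std::move(filters);
      return *this;
    }

    // Builds the node with an ID from the parent's generator, installs it as
    // the parent's current node, and hands control back to the parent.
    PlanBuilder& endTableScan() {
      assert(!table_.empty() && "table() must be called before endTableScan()");
      parent_.node_ = std::make_shared<ScanNode>(
          ScanNode{parent_.ids_->next(), std::move(table_), std::move(filters_)});
      return parent_;
    }

   private:
    PlanBuilder& parent_;
    std::string table_;
    std::vector<std::string> filters_;
  };

  explicit PlanBuilder(std::shared_ptr<IdGenerator> ids) : ids_(std::move(ids)) {}

  // Switches the chain into scan-building mode.
  TableScanBuilder& startTableScan() {
    scan_ = std::make_unique<TableScanBuilder>(*this);
    return *scan_;
  }

  std::shared_ptr<const ScanNode> planNode() const { return node_; }

 private:
  std::shared_ptr<IdGenerator> ids_;
  std::unique_ptr<TableScanBuilder> scan_;
  std::shared_ptr<const ScanNode> node_;
};
```

Two builders constructed from the same `IdGenerator` then produce scans with distinct IDs, for example `PlanBuilder(ids).startTableScan().table("customer").endTableScan().planNode()` followed by the same chain for `"orders"`. Calling table() directly on PlanBuilder does not compile in this sketch, so no runtime context checks are needed.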

majetideepak · 2023-11-03T14:57:07Z

> you need to make sure different TableScan nodes have different node IDs (in the same query plan)

I see we have nextPlanNodeId() as an argument. I guess it is convenient to have sequential IDs rather than random IDs across plan nodes.
Thanks for clarifying your API. That makes sense!
I will work on that.

majetideepak · 2023-11-03T14:59:41Z

We could also add a new API, setId(), to PlanNode and make it work inside add(). But startTableScan() and endTableScan() work as well.

mbasmanova · 2023-11-03T15:04:33Z

> add a new API setId()

It would be preferable to keep PlanNode classes immutable.

majetideepak · 2025-02-18T18:52:00Z

Completed!

majetideepak added the enhancement New feature or request label Nov 2, 2023

majetideepak changed the title ~~Add TableScanBuilder~~ Add TableScanBuilder in PlanBuilder Nov 2, 2023

majetideepak mentioned this issue Nov 8, 2023

Add TableScanBuilder #7463

Closed

majetideepak closed this as completed Feb 18, 2025
