
feat(python): TPC-H dbgen #12381

Closed
wants to merge 1 commit into from

Conversation

pedroerp
Contributor

Summary:
Support generation of TPC-H datasets using the Python API. The idea is
to use this as a pre-step that generates datasets in the target file format,
then use the generated files for benchmarks.

Differential Revision: D69809891
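The two-phase workflow described in the summary (generate once to storage, then benchmark reads) can be sketched in plain Python. This is an illustrative stand-in only: the real PR drives Velox's dbgen through the Python API, whereas `make_nation_rows`, the CSV format, and all function names below are hypothetical placeholders, not the actual pyvelox interface.

```python
# Minimal sketch of the "generate first, benchmark later" pattern,
# assuming a fake row generator in place of dbgen and CSV in place of
# Parquet/ORC. None of these names come from the Velox Python API.
import csv
import os
import tempfile
import time


def make_nation_rows(scale_factor: int):
    """Fake generator standing in for dbgen's `nation` table."""
    for i in range(25 * scale_factor):
        yield {"n_nationkey": i, "n_name": f"NATION_{i}", "n_regionkey": i % 5}


def generate_dataset(path: str, scale_factor: int = 1) -> int:
    """Phase 1: materialize generated rows to a storage file, up front."""
    fields = ["n_nationkey", "n_name", "n_regionkey"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        count = 0
        for row in make_nation_rows(scale_factor):
            writer.writerow(row)
            count += 1
    return count


def benchmark_scan(path: str) -> tuple[int, float]:
    """Phase 2: time only the scan, not the data generation."""
    start = time.perf_counter()
    with open(path, newline="") as f:
        count = sum(1 for _ in csv.DictReader(f))
    return count, time.perf_counter() - start


fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
written = generate_dataset(path, scale_factor=2)
scanned, elapsed = benchmark_scan(path)
```

Separating the phases keeps the (slow) on-the-fly generation out of the measured benchmark loop, which is the point the summary and the later doc-comment discussion both make.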

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 18, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D69809891


netlify bot commented Feb 18, 2025

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 171438c
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67b659ef6e08d30008b241c0


@kostasxmeta kostasxmeta left a comment


LGTM, just a few small comments.

std::vector<std::shared_ptr<connector::ConnectorSplit>> splits;
if (inputFiles.has_value()) {
for (const auto& inputFile : *inputFiles) {
splits.push_back(std::make_shared<connector::hive::HiveConnectorSplit>(


Should we be checking what kind of connector::ConnectorSplit type the splits are before going here? It seems like we're always assuming HiveConnectorSplit for now? Shouldn't this be based on the connectorId in some capacity?

Contributor Author


Good question. This may need to be better decoupled in the future, but for now we are making the same assumption that PlanBuilder does, that if you are using the tableScan() node, it implies you are using HiveConnector.

In this PR I didn't change the assumption, just moved the code where the split was generated.

scaleFactor,
connectorId);

// Generate one splits per part.


Suggested change
// Generate one splits per part.
// Generate one split per part.

Comment on lines 200 to 203
/// Generates TPC-H data using dbgen. Note that generating data on the
/// fly is not terribly efficient, so one should generate TPC-H data,
/// write them to storage files, (Parquet, ORC, or similar), then
/// benchmark a query plan that reads those files.


It's not entirely clear from this whether this operator generates data on the fly or not (mainly because it also talks about generating files). Let's make it clearer.

Contributor Author


Rephrased it.

Comment on lines 188 to 191
Generates TPC-H data using dbgen. Note that generating data on the
fly is not terribly efficient, so one should generate TPC-H data,
write them to storage files, (Parquet, ORC, or similar), then
benchmark a query plan that reads those files.


Based on the comment above we should probably update this as well.

addFileSplit(inputFile, scanId, scanPair.first);
for (auto& [scanId, splits] : *scanFiles_) {
for (auto& split : splits) {
cursor_->task()->addSplit(scanId, exec::Split(std::move((split))));


Suggested change
cursor_->task()->addSplit(scanId, exec::Split(std::move((split))));
cursor_->task()->addSplit(scanId, exec::Split(std::move(split)));

Contributor Author

@pedroerp pedroerp left a comment


@kostasxmeta thanks for the review!


Summary:

Support generation of TPC-H datasets using the Python API. The idea is
to use this as a pre-step that generates datasets in the target file format,
then use the generated files for benchmarks.

Reviewed By: kostasxmeta, kgpai

Differential Revision: D69809891

@facebook-github-bot
Contributor

This pull request has been merged in 5f0c48d.
