
feat(python): TPC-H dbgen #12381

Closed
wants to merge 1 commit into from

Conversation

pedroerp
Contributor

Summary:
Support generation of TPC-H datasets using the Python API. The idea is
to use this as a pre-step that generates datasets in the target file format,
then use the generated files for benchmarks.

Differential Revision: D69809891
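The two-phase workflow described in the summary (generate once to storage, then benchmark reads) can be sketched in plain Python. This is an illustrative stand-in only: the real PR drives Velox's dbgen through the Python API, whereas `make_nation_rows`, the CSV format, and all function names below are hypothetical placeholders, not the actual pyvelox interface.

```python
# Minimal sketch of the "generate first, benchmark later" pattern,
# assuming a fake row generator in place of dbgen and CSV in place of
# Parquet/ORC. None of these names come from the Velox Python API.
import csv
import os
import tempfile
import time


def make_nation_rows(scale_factor: int):
    """Fake generator standing in for dbgen's `nation` table."""
    for i in range(25 * scale_factor):
        yield {"n_nationkey": i, "n_name": f"NATION_{i}", "n_regionkey": i % 5}


def generate_dataset(path: str, scale_factor: int = 1) -> int:
    """Phase 1: materialize generated rows to a storage file, up front."""
    fields = ["n_nationkey", "n_name", "n_regionkey"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        count = 0
        for row in make_nation_rows(scale_factor):
            writer.writerow(row)
            count += 1
    return count


def benchmark_scan(path: str) -> tuple[int, float]:
    """Phase 2: time only the scan, not the data generation."""
    start = time.perf_counter()
    with open(path, newline="") as f:
        count = sum(1 for _ in csv.DictReader(f))
    return count, time.perf_counter() - start


fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
written = generate_dataset(path, scale_factor=2)
scanned, elapsed = benchmark_scan(path)
```

Separating the phases keeps the (slow) on-the-fly generation out of the measured benchmark loop, which is the point the summary and the later doc-comment discussion both make.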

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 18, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D69809891


netlify bot commented Feb 18, 2025

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 171438c
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67b659ef6e08d30008b241c0


@kostasxmeta kostasxmeta left a comment


LGTM, just a few small comments.

std::vector<std::shared_ptr<connector::ConnectorSplit>> splits;
if (inputFiles.has_value()) {
for (const auto& inputFile : *inputFiles) {
splits.push_back(std::make_shared<connector::hive::HiveConnectorSplit>(


Should we be checking what kind of connector::ConnectorSplit type the splits are before going here? It seems like we're always assuming HiveConnectorSplit for now? Shouldn't this be based on the connectorId in some capacity?

Contributor Author


Good question. This may need to be better decoupled in the future, but for now we are making the same assumption that PlanBuilder does, that if you are using the tableScan() node, it implies you are using HiveConnector.

In this PR I didn't change the assumption, just moved the code where the split was generated.

scaleFactor,
connectorId);

// Generate one splits per part.


Suggested change
// Generate one splits per part.
// Generate one split per part.

Comment on lines 200 to 203
/// Generates TPC-H data using dbgen. Note that generating data on the
/// fly is not terribly efficient, so one should generate TPC-H data,
/// write them to storage files, (Parquet, ORC, or similar), then
/// benchmark a query plan that reads those files.


It's not entirely clear from this whether this operator generates data on the fly or not (mainly because it also talks about generating files). Let's make it clearer.

Contributor Author


Rephrased it.

Comment on lines 188 to 191
Generates TPC-H data using dbgen. Note that generating data on the
fly is not terribly efficient, so one should generate TPC-H data,
write them to storage files, (Parquet, ORC, or similar), then
benchmark a query plan that reads those files.


Based on the comment above we should probably update this as well.

addFileSplit(inputFile, scanId, scanPair.first);
for (auto& [scanId, splits] : *scanFiles_) {
for (auto& split : splits) {
cursor_->task()->addSplit(scanId, exec::Split(std::move((split))));


Suggested change
cursor_->task()->addSplit(scanId, exec::Split(std::move((split))));
cursor_->task()->addSplit(scanId, exec::Split(std::move(split)));

Contributor Author

@pedroerp pedroerp left a comment


@kostasxmeta thanks for the review!


Summary:

Support generation of TPC-H datasets using the Python API. The idea is
to use this as a pre-step that generates datasets in the target file format,
then use the generated files for benchmarks.

Reviewed By: kostasxmeta, kgpai

Differential Revision: D69809891

@facebook-github-bot
Contributor

This pull request has been merged in 5f0c48d.
