feat(python): TPC-H dbgen #12381
Conversation
This pull request was exported from Phabricator. Differential Revision: D69809891
LGTM, just a few small comments.
std::vector<std::shared_ptr<connector::ConnectorSplit>> splits;
if (inputFiles.has_value()) {
  for (const auto& inputFile : *inputFiles) {
    splits.push_back(std::make_shared<connector::hive::HiveConnectorSplit>(
Should we be checking what kind of connector::ConnectorSplit type the splits are before going here? It seems like we're always assuming HiveConnectorSplit for now. Shouldn't this be based on the connectorId in some capacity?
Good question. This may need to be better decoupled in the future, but for now we are making the same assumption that PlanBuilder does: using a tableScan() node implies you are using HiveConnector.
In this PR I didn't change that assumption, just moved the code where the split is generated.
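To make the assumption concrete, here is a minimal sketch of what the quoted loop does, assuming HiveConnectorSplit's (connectorId, filePath, fileFormat) constructor overload; makeHiveSplits and its parameters are illustrative names, not part of the PR:

```cpp
// Illustrative sketch only: wraps each input file in a HiveConnectorSplit,
// mirroring the PlanBuilder convention that tableScan() implies HiveConnector.
#include <memory>
#include <string>
#include <vector>

#include "velox/connectors/hive/HiveConnectorSplit.h"
#include "velox/dwio/common/Options.h"

using namespace facebook::velox;

std::vector<std::shared_ptr<connector::ConnectorSplit>> makeHiveSplits(
    const std::string& connectorId,
    const std::vector<std::string>& inputFiles,
    dwio::common::FileFormat format) {
  std::vector<std::shared_ptr<connector::ConnectorSplit>> splits;
  splits.reserve(inputFiles.size());
  for (const auto& inputFile : inputFiles) {
    // One split per file. The split type is hardcoded to Hive, so a
    // connectorId for a different connector would get the wrong split type;
    // that is exactly the coupling discussed above.
    splits.push_back(std::make_shared<connector::hive::HiveConnectorSplit>(
        connectorId, inputFile, format));
  }
  return splits;
}
```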
        scaleFactor,
        connectorId);

// Generate one splits per part.
Suggested change:
- // Generate one splits per part.
+ // Generate one split per part.
/// Generates TPC-H data using dbgen. Note that generating data on the
/// fly is not terribly efficient, so one should generate TPC-H data,
/// write them to storage files, (Parquet, ORC, or similar), then
/// benchmark a query plan that reads those files.
It's not entirely clear from this whether this operator generates data on the fly or not (mainly because it also talks about generating files). Let's make it clearer.
Rephrased it.
Generates TPC-H data using dbgen. Note that generating data on the
fly is not terribly efficient, so one should generate TPC-H data,
write them to storage files, (Parquet, ORC, or similar), then
benchmark a query plan that reads those files.
Based on the comment above we should probably update this as well.
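For context on the recommended flow, the sketch below shows the "generate once" half using Velox's dbgen wrapper. This is a sketch under assumptions: it presumes the velox/tpch/gen/TpchGen.h helpers keep a (pool, maxRows, offset, scaleFactor) shape and that memory-pool setup follows current Velox conventions, both of which may differ across versions. Writing the batches to Parquet/ORC and benchmarking scans over those files is left out.

```cpp
// Sketch of the "generate once" step: materialize TPC-H rows via dbgen so a
// benchmark can later scan written files instead of regenerating every run.
#include "velox/common/memory/Memory.h"
#include "velox/tpch/gen/TpchGen.h"

using namespace facebook::velox;

int main() {
  // Memory setup; the exact initialization API varies across Velox versions.
  memory::MemoryManager::initialize({});
  auto pool = memory::memoryManager()->addLeafPool();

  // Generate a batch of the orders table at scale factor 1. In the intended
  // workflow these batches are written to Parquet/ORC once, and benchmarks
  // then read the files rather than paying dbgen's cost on the fly.
  auto orders = tpch::genTpchOrders(
      pool.get(), /*maxRows=*/10'000, /*offset=*/0, /*scaleFactor=*/1);
  return 0;
}
```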
velox/py/runner/PyLocalRunner.cpp (Outdated)
      addFileSplit(inputFile, scanId, scanPair.first);
for (auto& [scanId, splits] : *scanFiles_) {
  for (auto& split : splits) {
    cursor_->task()->addSplit(scanId, exec::Split(std::move((split))));
Suggested change:
- cursor_->task()->addSplit(scanId, exec::Split(std::move((split))));
+ cursor_->task()->addSplit(scanId, exec::Split(std::move(split)));
@kostasxmeta thanks for the review!
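For readers outside the diff context, here is a self-contained sketch of the split-feeding pattern the quoted loop implements, including the noMoreSplits() call that lets table scans finish. The function name and map layout are illustrative, and the Task/Split calls follow velox/exec as I understand them:

```cpp
// Sketch of feeding queued splits to a running task. Each scan node gets its
// splits moved in (the local vector is consumed, hence std::move), then is
// told no further splits will arrive.
#include <memory>
#include <unordered_map>
#include <vector>

#include "velox/exec/Task.h"

using namespace facebook::velox;

using ScanFiles = std::unordered_map<
    core::PlanNodeId,
    std::vector<std::shared_ptr<connector::ConnectorSplit>>>;

void addAllSplits(exec::Task& task, ScanFiles& scanFiles) {
  for (auto& [scanId, splits] : scanFiles) {
    for (auto& split : splits) {
      // exec::Split takes ownership of the shared connector split.
      task.addSplit(scanId, exec::Split(std::move(split)));
    }
    // Without this, the table scan would wait forever for more splits.
    task.noMoreSplits(scanId);
  }
}
```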
Summary: Support generation of TPC-H datasets using the Python API. The idea is to use this as a pre-step that generates datasets in the target file format, then use the generated files for benchmarks. Reviewed By: kostasxmeta, kgpai. Differential Revision: D69809891
(Force-pushed from 5fe0948 to 171438c.)
This pull request was exported from Phabricator. Differential Revision: D69809891
This pull request has been merged in 5f0c48d.
Summary:
Support generation of TPC-H datasets using the Python API. The idea is
to use this as a pre-step that generates datasets in the target file format,
then use the generated files for benchmarks.
Differential Revision: D69809891