docs(blog): Add Velox Primer Part1 Post #12348
base: main
Summary: Adding part of a series of blog posts introducing Velox concepts. Differential Revision: D69694896
✅ Deploy Preview for meta-velox ready!
This pull request was exported from Phabricator. Differential Revision: D69694896
Nice blog, and it helps me a lot!
from which the group by reads its input. There are file splits
(`velox::connector::ConnectorSplit`) and remote splits
(`velox::exec::RemoteSplit`). The first identifies data to read, the second
identifies a running Task.
May I ask a naive question: is the RemoteSplit only for an input "task" that would be used by an exchange?
Yes, you would add this specific type of split only if the task at hand is supposed to read data from a shuffle. If it reads directly from a file (table scan), you would use a regular ConnectorSplit.
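To make the distinction concrete, here is a minimal C++ sketch of how the kind of split could map to the source operator that consumes it. `FileSplit`, `RemoteTaskSplit`, `Split`, and `sourceOperatorFor` are hypothetical names invented for this illustration; they are not the Velox classes or API, and the real dispatch in Velox may look quite different.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical stand-ins for the two split kinds discussed above.
// These are NOT the Velox types; they only model the idea.
struct FileSplit {
  std::string path;  // identifies data to read
};
struct RemoteTaskSplit {
  std::string producerTaskId;  // identifies a running upstream task
};
using Split = std::variant<FileSplit, RemoteTaskSplit>;

// Returns the name of the source operator that would consume the split:
// a remote split feeds an Exchange, a file split feeds a TableScan.
inline std::string sourceOperatorFor(const Split& split) {
  if (std::holds_alternative<RemoteTaskSplit>(split)) {
    return "Exchange";
  }
  return "TableScan";
}
```

Under these assumptions, a task that reads from a shuffle receives only `RemoteTaskSplit` values, while a table scan task receives only `FileSplit` values.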
somewhere for a consumer to retrieve. This is typically a PartitionedOutput.
The consumer of the PartitionedOutput is an Exchange in a different task, where
the Exchange is in the source position. Operators that are neither sources or
sinks are things like filterProject, HashProbe, HashAggregation and so on, more
Suggested change:
- sinks are things like filterProject, HashProbe, HashAggregation and so on, more
+ sinks are things like FilterProject, HashProbe, HashAggregation and so on, more
Keep the same capitalization for FilterProject?
the second stage Tasks (group by), their Splits identify the table scan Tasks
from which the group by reads its input. There are file splits
(`velox::connector::ConnectorSplit`) and remote splits
(`velox::exec::RemoteSplit`). The first identifies data to read, the second
Isn't it RemoteConnectorSplit in the code?
Lgtm overall. Very excited for this series and looking forward to the next ones.
An example of a Task with two pipelines is a hash join, with separate pipelines
for the build and for the probe side. This makes sense because the build must
be complete before the probe can proceed. We will talk more about this later.
All vectors are subclasses of `velox::BaseVector`.
I think we should move this to the next paragraph, after vectors have been introduced as a concept.
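The build-before-probe ordering quoted above can be illustrated with a small standalone C++ sketch. It uses only the standard library; `Row`, `buildSide`, and `probeSide` are hypothetical names for this example and do not correspond to Velox operators or their interfaces.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<int, std::string>;  // (join key, payload)

// "Build pipeline": consumes the whole build side into a hash table.
inline std::unordered_multimap<int, std::string> buildSide(
    const std::vector<Row>& rows) {
  std::unordered_multimap<int, std::string> table;
  for (const auto& [key, payload] : rows) {
    table.emplace(key, payload);
  }
  return table;
}

// "Probe pipeline": needs the finished table, so it can only start after
// buildSide has returned. Emits (probe payload, build payload) pairs.
inline std::vector<std::pair<std::string, std::string>> probeSide(
    const std::unordered_multimap<int, std::string>& table,
    const std::vector<Row>& rows) {
  std::vector<std::pair<std::string, std::string>> out;
  for (const auto& [key, payload] : rows) {
    auto [first, last] = table.equal_range(key);
    for (auto it = first; it != last; ++it) {
      out.emplace_back(payload, it->second);
    }
  }
  return out;
}
```

Because `probeSide` takes the completed table as an argument, the dependency between the two pipelines is explicit in the function signatures: a probe row whose key was never built simply produces no output.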
child vector for every column of the relation - it is the equivalent of
RecordBatch in Arrow.

## Operators Souces, Sinks, and State
Suggested change:
- ## Operators Souces, Sinks, and State
+ ## Operator Sources, Sinks and State
Tasks. Tasks send back statistics, errors and other status information to the
distributed engine.

## Pipelines and Drivers and Operators
Suggested change:
- ## Pipelines and Drivers and Operators
+ ## Pipelines, Drivers and Operators
Inside a Task, there are *Pipelines*. Each pipeline is a linear sequence of
operators (`velox::exec::Operator`), and operators are the objects that implement
relational logic. In the case of the group by example, the first task has one
pipeline, with a TableScan and a PartitionedOutput. The second Task too has one
Could we link these two to their code references, e.g. `velox::exec::TableScan`?
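As a rough illustration of "a linear sequence of operators", the following C++ sketch models a pipeline as an ordered list of batch-to-batch functions that a driver applies in turn. `Batch`, `Op`, `Pipeline`, `runDriver`, `keepEven`, and `timesTen` are hypothetical names for this example only; the real `velox::exec::Operator` interface is richer and is not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using Batch = std::vector<int>;
// Hypothetical operator: takes a batch of rows and returns a batch of rows.
using Op = std::function<Batch(Batch)>;

// A pipeline is an ordered, linear list of operators.
struct Pipeline {
  std::vector<Op> operators;
};

// A driver pushes one batch through every operator, in order.
inline Batch runDriver(const Pipeline& pipeline, Batch input) {
  for (const auto& op : pipeline.operators) {
    input = op(std::move(input));
  }
  return input;
}

// Filter-like operator: keeps only even values.
inline Op keepEven() {
  return [](Batch in) {
    Batch out;
    for (int v : in) {
      if (v % 2 == 0) {
        out.push_back(v);
      }
    }
    return out;
  };
}

// Project-like operator: multiplies every value by ten.
inline Op timesTen() {
  return [](Batch in) {
    for (int& v : in) {
      v *= 10;
    }
    return in;
  };
}
```

In this simplified model the first operator in the list plays the source role and the last one the sink role; several drivers could run the same pipeline over different batches in parallel.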
stages, and present Velox concepts such as Tasks, Splits, Pipelines, Drivers,
and Operators.
Probably?
Suggested change:
- stages, and present Velox concepts such as Tasks, Splits, Pipelines, Drivers,
- and Operators.
+ stages, and present Velox concepts such as Tasks, Splits, Pipelines, Drivers,
+ and Operators that enable this in distributed compute engines.