docs(blog): Add Velox Primer Part1 Post #12348

Open
wants to merge 1 commit into
base: main

Conversation

pedroerp
Contributor

Summary: Adding part of a series of blog posts introducing Velox concepts.

Differential Revision: D69694896

@facebook-github-bot added the CLA Signed label on Feb 15, 2025

netlify bot commented Feb 15, 2025

Deploy Preview for meta-velox ready!

🔨 Latest commit: 6f5fb52
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67b0064d87762d00089f5f3f
😎 Deploy Preview: https://deploy-preview-12348--meta-velox.netlify.app

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D69694896

Contributor

@mapleFU left a comment

Nice blog, and it helps me a lot!

from which the group by reads its input. There are file splits
(`velox::connector::ConnectorSplit`) and remote splits
(`velox::exec::RemoteSplit`). The first identifies data to read, the second
identifies a running Task.
Contributor

May I ask a naive question: is RemoteSplit only for an input "task" that would be used by an exchange?

Contributor Author

Yes, you would only add this specific type of split if the task at hand is supposed to read data from a shuffle. If you're reading directly from a file (table scan), you would use a regular ConnectorSplit.
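
To make the distinction in this reply concrete, here is a minimal, self-contained C++ sketch. It does not use the Velox API: `Split`, `FileSplit`, `RemoteTaskSplit`, `SourceKind` and `makeSplit` are hypothetical stand-ins invented for illustration, and the real Velox split classes have different constructors and fields. It only shows the rule described above: a source that reads from a shuffle gets a split naming an upstream task, while a table scan gets a split naming data to read.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration only; NOT the real Velox classes.
struct Split {
  virtual ~Split() = default;
  virtual std::string describe() const = 0;
};

// Stand-in for a file split: identifies data to read (table scan input).
struct FileSplit : Split {
  explicit FileSplit(std::string path) : path(std::move(path)) {}
  std::string describe() const override { return "file:" + path; }
  std::string path;
};

// Stand-in for a remote split: identifies a running upstream task (shuffle input).
struct RemoteTaskSplit : Split {
  explicit RemoteTaskSplit(std::string taskId) : taskId(std::move(taskId)) {}
  std::string describe() const override { return "task:" + taskId; }
  std::string taskId;
};

// Which kind of source operator the task starts with (assumed enum).
enum class SourceKind { kTableScan, kExchange };

// Pick the split type from the source kind: an exchange reads from a shuffle,
// so it needs a remote split; a table scan reads a file, so it needs a file split.
std::unique_ptr<Split> makeSplit(SourceKind kind, const std::string& target) {
  assert(!target.empty());
  if (kind == SourceKind::kExchange) {
    return std::make_unique<RemoteTaskSplit>(target);
  }
  return std::make_unique<FileSplit>(target);
}
```

For example, `makeSplit(SourceKind::kExchange, "stage1.task0")` yields a remote split, and `makeSplit(SourceKind::kTableScan, "/data/part-0.parquet")` yields a file split; both the task id and the path are made-up values.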

somewhere for a consumer to retrieve. This is typically a PartitionedOutput.
The consumer of the PartitionedOutput is an Exchange in a different task, where
the Exchange is in the source position. Operators that are neither sources or
sinks are things like filterProject, HashProbe, HashAggregation and so on, more
Contributor

Suggested change
- sinks are things like filterProject, HashProbe, HashAggregation and so on, more
+ sinks are things like FilterProject, HashProbe, HashAggregation and so on, more

Keep the same capitalization for FilterProject?

the second stage Tasks (group by), their Splits identify the table scan Tasks
from which the group by reads its input. There are file splits
(`velox::connector::ConnectorSplit`) and remote splits
(`velox::exec::RemoteSplit`). The first identifies data to read, the second
Contributor

Isn't it `RemoteConnectorSplit` in the code?

Contributor

@kKPulla left a comment

LGTM overall. Very excited for this series and looking forward to the next ones.

An example of a Task with two pipelines is a hash join, with separate pipelines
for the build and for the probe side. This makes sense because the build must
be complete before the probe can proceed. We will talk more about this later.
All vectors are subclasses of `velox::BaseVector`.
Contributor

I think we should move this to next paragraph after introducing vectors as a concept.
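
As an aside on the quoted hash join passage, the reason the build must finish before the probe can proceed can be shown with a small, self-contained C++ sketch. It is not Velox code: `Row`, `HashTable`, `runBuildPipeline` and `runProbePipeline` are hypothetical names, and the two functions only loosely stand in for the build and probe pipelines.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not Velox vectors or operators.
using Row = std::pair<int, std::string>;  // (join key, payload)
using HashTable = std::unordered_multimap<int, std::string>;

// Stand-in for the build pipeline: consumes the entire build side and
// produces a hash table keyed on the join key.
HashTable runBuildPipeline(const std::vector<Row>& buildSide) {
  HashTable table;
  for (const Row& row : buildSide) {
    table.emplace(row.first, row.second);
  }
  return table;
}

// Stand-in for the probe pipeline: it takes the finished table as an argument,
// so it cannot start until the build has completed. Emits one
// (probe payload, build payload) pair per matching key.
std::vector<std::pair<std::string, std::string>> runProbePipeline(
    const HashTable& table, const std::vector<Row>& probeSide) {
  std::vector<std::pair<std::string, std::string>> matches;
  for (const Row& row : probeSide) {
    auto range = table.equal_range(row.first);
    for (auto it = range.first; it != range.second; ++it) {
      matches.emplace_back(row.second, it->second);
    }
  }
  return matches;
}
```

The dependency is visible in the signatures: the probe function needs a complete `HashTable`, which only exists after the build function returns. A probe row whose key was inserted late on the build side would otherwise be missed.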

child vector for every column of the relation - it is the equivalent of
RecordBatch in Arrow. 

## Operators Souces, Sinks, and State
Contributor

Suggested change
- ## Operators Souces, Sinks, and State
+ ## Operator Sources, Sinks and State

Tasks. Tasks send back statistics, errors and other status information to the
distributed engine. 

## Pipelines and Drivers and Operators
Contributor

Suggested change
- ## Pipelines and Drivers and Operators
+ ## Pipelines, Drivers and Operators

Inside a Task, there are *Pipelines*. Each pipeline is a linear sequence of
operators (`velox::exec::Operator`), and operators are the objects that implement
relational logic. In the case of the group by example, the first task has one
pipeline, with a TableScan and a PartitionedOutput. The second Task too has one
Contributor

Could we link these two to code references, e.g. `velox::exec::TableScan`?
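
For readers of this thread, the quoted idea of a pipeline as a linear sequence of operators, starting at a source and ending at a sink, can be sketched in a few lines of self-contained C++. This is not the `velox::exec::Operator` interface: `Batch`, `OperatorSketch`, `ScanSketch`, `FilterEvenSketch`, `OutputSketch` and `runPipeline` are hypothetical, and a real driver handles batching, blocking and state that this sketch ignores.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical batch type; a single column of ints stands in for real vectors.
using Batch = std::vector<int>;

// Hypothetical operator interface; the real Velox Operator API differs.
struct OperatorSketch {
  virtual ~OperatorSketch() = default;
  virtual Batch process(Batch input) = 0;
};

// Source: ignores its input and produces the values 0..count-1.
struct ScanSketch : OperatorSketch {
  explicit ScanSketch(std::size_t count) : count(count) {}
  Batch process(Batch /*unused*/) override {
    Batch out(count);
    std::iota(out.begin(), out.end(), 0);
    return out;
  }
  std::size_t count;
};

// Neither source nor sink: keeps only the even values.
struct FilterEvenSketch : OperatorSketch {
  Batch process(Batch input) override {
    Batch out;
    std::copy_if(input.begin(), input.end(), std::back_inserter(out),
                 [](int v) { return v % 2 == 0; });
    return out;
  }
};

// Sink: buffers rows for a consumer to retrieve and passes nothing on.
struct OutputSketch : OperatorSketch {
  Batch process(Batch input) override {
    buffered.insert(buffered.end(), input.begin(), input.end());
    return {};
  }
  Batch buffered;
};

// Stand-in for a driver: pushes one batch through the operators in order.
void runPipeline(const std::vector<std::unique_ptr<OperatorSketch>>& ops) {
  Batch batch;
  for (const auto& op : ops) {
    batch = op->process(std::move(batch));
  }
}
```

Running a pipeline of `ScanSketch(6)`, `FilterEvenSketch` and `OutputSketch` leaves the even values from the scan in the sink's `buffered` member, loosely mirroring a scan feeding an output operator that holds data for a downstream consumer.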

Comment on lines +11 to +12
stages, and present Velox concepts such as Tasks, Splits, Pipelines, Drivers,
and Operators.
Contributor

Probably?

Suggested change
- stages, and present Velox concepts such as Tasks, Splits, Pipelines, Drivers,
- and Operators.
+ stages, and present Velox concepts such as Tasks, Splits, Pipelines, Drivers,
+ and Operators that enable this in distributed compute engines.

Labels: CLA Signed, fb-exported
4 participants