Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate produces batch of data #31155

Merged
merged 1 commit into from
Jan 27, 2025

Conversation

antiguru
Copy link
Member

@antiguru antiguru commented Jan 23, 2025

Instead of flattening the outputs of consolidate, provide the output as a
whole batch, encoded as an Stream<_, Vec<C>>. This allows downstream operators
to act on a whole batch at once, without relying on undocumented properties
of how Timely channels behave. For operators that need to flatten the
output need slightly different logic.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@antiguru antiguru requested a review from a team as a code owner January 23, 2025 08:15
@antiguru antiguru requested review from petrosagg and teskje January 23, 2025 08:16
Copy link
Contributor

@teskje teskje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The looks reasonable to me, but I'll leave the approval to someone who knows the copy-to implementation.

@@ -777,7 +794,7 @@ pub fn consolidate_pact<Ba, P, G>(
stream: &StreamCore<G, Ba::Input>,
pact: P,
name: &str,
) -> StreamCore<G, Ba::Output>
) -> StreamCore<G, Rc<Vec<Ba::Output>>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the Rc isn't clear to me. Is it a safe-guard to avoid accidental expensive cloning of Vecs when the output stream is later forked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is so that a chain of updates is inseparable, no matter what channel it traverses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern being that Timely might otherwise try to pull the Vec apart while passing it through channels? Would be great to have a comment explaining this!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the PR such that consolidate_pact produces a Stream<G, Vec<Ba::Output>>, which means that the elements we're transferring are Vec<_>. This gives us the certainty that Timely will not look into the chains.

Instead of flattening the outputs of consolidate, provide the output as a
whole batch, encoded as an `Vec<C>`. This allows downstream operators
to act on a whole batch at once, without relying on undocumented properties
of how Timely channels behave. For operators that need to flatten the
output need slightly different logic.

The copy-to-s3 operator sees some changes, which are roughly:
* Accept the chain formed by `consolidate_pact`, and rely on it
  partitioning the data according to the exchange function.
* Skip encoding the batch, and recompute it when writing the data.

Signed-off-by: Moritz Hoffmann <[email protected]>
Copy link
Member

@ParkMyCar ParkMyCar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't totally understand the changes to consolidate_pact, but hopefully those are covered by Jan or Frank! But sending around batches of Rows instead of relying on the ordering of timely channels looks good to me!

@antiguru antiguru merged commit 4242bb3 into MaterializeInc:main Jan 27, 2025
79 checks passed
@antiguru antiguru deleted the consolidate_rc branch January 27, 2025 16:45
@antiguru
Copy link
Member Author

Thanks for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants