[copy_from] Proper cancelation via CancelOneshotIngestion message #31136

Open
wants to merge 1 commit into
base: main

ParkMyCar
Member

This PR fixes the TODO(cf1) related to canceling oneshot ingestions. It adds a StorageCommand::CancelOneshotIngestion that reduces/compacts away a corresponding StorageCommand::RunOneshotIngestion, much like ComputeCommand::Peek and ComputeCommand::CancelPeek.

We send a StorageCommand::CancelOneshotIngestion whenever a user cancels a COPY FROM statement. The storage controller also sends one whenever a RunOneshotIngestion command completes.
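
To illustrate the reduce/compact behavior described above, here is a minimal sketch. It is not the PR's implementation: `Command`, `IngestionId` (a `u64` standing in for the real `Uuid`), and this `reduce` function are simplified, hypothetical stand-ins, and the sketch assumes the cancel command itself is also dropped once it has compacted away its run command.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the real ingestion identifier (a `Uuid` in the PR).
type IngestionId = u64;

/// Simplified, hypothetical command type; the real `StorageCommand` has
/// more variants and carries payloads.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Command {
    RunOneshotIngestion { id: IngestionId },
    CancelOneshotIngestion { id: IngestionId },
}

/// Compacts a command history: a run command with a matching cancel is
/// dropped, and (by assumption) the cancel is dropped with it.
fn reduce(history: Vec<Command>) -> Vec<Command> {
    // First pass: collect every ingestion that has been canceled.
    let canceled: BTreeSet<IngestionId> = history
        .iter()
        .filter_map(|cmd| match cmd {
            Command::CancelOneshotIngestion { id } => Some(*id),
            _ => None,
        })
        .collect();

    // Second pass: keep only run commands that were never canceled.
    history
        .into_iter()
        .filter(|cmd| match cmd {
            Command::RunOneshotIngestion { id } => !canceled.contains(id),
            Command::CancelOneshotIngestion { .. } => false,
        })
        .collect()
}
```

Under these assumptions, a history containing runs for ingestions 1 and 2 followed by a cancel for 1 reduces to just the run for 2, so a replica that replays the history never starts the canceled ingestion.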

Motivation

Fix TODO(cf1) related to cancelation

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@ParkMyCar ParkMyCar requested a review from teskje January 22, 2025 03:49
@ParkMyCar ParkMyCar requested review from a team as code owners January 22, 2025 03:49
@ParkMyCar ParkMyCar requested a review from aljoscha January 22, 2025 03:49
@ParkMyCar ParkMyCar force-pushed the copy/cancel branch 2 times, most recently from 68b82bd to f16fa82 Compare January 22, 2025 19:57
* add CancelOneshotIngestion message to the storage controller
* handle new message in 'reduce' and 'reconcile' in the storage-controller
* emit a CancelOneshotIngestion whenever an ingestion completes
Contributor

@teskje teskje left a comment

Looks good overall. The thing I'm not clear about is what happens to oneshot ingestions when their target cluster is dropped. There are two places that have to deal with the possibility of a missing replica, and they handle it differently (one returns an error, the other gracefully ignores it). I think we should handle this consistently, but the right thing to do depends on whether we drop the controller state for pending oneshot ingestions when we drop an instance.

/// [`RunOneshotIngestion`]: crate::client::StorageCommand::RunOneshotIngestion
CancelOneshotIngestion {
ingestions: Vec<Uuid>,
},
Contributor

Is there a reason to make these batched commands? In compute we at one point transformed all commands into unbatched ones because the batching made various things more cumbersome (mainly keeping statistics about the number of commands in the history) and it didn't provide any benefits wrt. protobuf encoding size. I think there are plans for also moving to unbatched commands for storage (either @aljoscha or @petrosagg mentioned that), so if that's still the case it'd make sense to introduce new commands as unbatched immediately.

Contributor

I mentioned it, yeah. If possible we should use a flattened field here

Member Author

The reason I made these batched commands is that the loop in fn reconcile(...) doesn't remove commands. Instead it mutates each batch and removes the relevant entries, so I decided to stick with this existing pattern.

Chatted with @petrosagg about this today though and I'll first try to refactor the loop and actually remove commands instead of just draining batched ones.
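
The difference between the two approaches discussed in this thread can be sketched as follows. This is an illustration only, not the actual `reconcile` code: the command types, the `running` set, and both functions are hypothetical, and `IngestionId` is a `u64` standing in for `Uuid`. The sketch assumes reconciliation needs to skip ingestions that are already running.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the real ingestion identifier (a `Uuid` in the PR).
type IngestionId = u64;

/// Hypothetical batched form: one command carries many ingestions.
#[derive(Debug, Clone, PartialEq, Eq)]
enum BatchedCommand {
    RunOneshotIngestion { ingestions: Vec<IngestionId> },
}

/// Hypothetical unbatched form: one command per ingestion.
#[derive(Debug, Clone, PartialEq, Eq)]
enum UnbatchedCommand {
    RunOneshotIngestion { ingestion: IngestionId },
}

/// Batched pattern: the command stays in place and only its batch is
/// drained, which can leave an empty command behind.
fn reconcile_batched(commands: &mut Vec<BatchedCommand>, running: &BTreeSet<IngestionId>) {
    for cmd in commands.iter_mut() {
        match cmd {
            BatchedCommand::RunOneshotIngestion { ingestions } => {
                ingestions.retain(|id| !running.contains(id));
            }
        }
    }
}

/// Unbatched pattern: redundant commands are removed from the list entirely.
fn reconcile_unbatched(commands: &mut Vec<UnbatchedCommand>, running: &BTreeSet<IngestionId>) {
    commands.retain(|cmd| match cmd {
        UnbatchedCommand::RunOneshotIngestion { ingestion } => !running.contains(ingestion),
    });
}
```

With the unbatched form, the number of commands left in the list matches the number of ingestions still to be started, which is the kind of bookkeeping the reviewer notes is harder with batches.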


let instance = self.instances.get_mut(&pending.cluster_id).ok_or_else(|| {
// TODO(cf2): Refine this error.
StorageError::Generic(anyhow::anyhow!("missing cluster {}", pending.cluster_id))
Contributor

If we get here that would be because of a bug in the storage controller, not because of a usage error, right? I wouldn't return an error here, but do a (soft) panic instead.

for (ingestion_id, batches) in batches {
match self.pending_oneshot_ingestions.remove(&ingestion_id) {
Some(pending) => {
// Send a cancel command so our command history is correct.
Contributor

(Also to avoid duplicate work once we have active replication.)

match self.pending_oneshot_ingestions.remove(&ingestion_id) {
Some(pending) => {
// Send a cancel command so our command history is correct.
if let Some(instance) = self.instances.get_mut(&pending.cluster_id) {
Contributor

Is it possible to get here when the instance is already dropped? If not we should add a soft panic in the else branch.
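
A rough sketch of what the suggested else branch could look like, assuming a missing instance really does indicate a controller bug. Everything here is hypothetical: `soft_panic_or_log` is a local stand-in for whatever soft-assertion helper the codebase uses, `Instance` and `send_cancel` are invented for illustration, and the ids are plain `u64` values.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-ins for the real identifier types.
type ClusterId = u64;
type IngestionId = u64;

/// Hypothetical soft assertion: panics in debug builds so bugs surface in
/// tests, but only logs in release builds.
fn soft_panic_or_log(msg: &str) {
    if cfg!(debug_assertions) {
        panic!("{}", msg);
    } else {
        eprintln!("soft assertion failed: {}", msg);
    }
}

/// Hypothetical instance that records the cancel commands sent to it.
#[derive(Default)]
struct Instance {
    sent_cancels: Vec<IngestionId>,
}

/// Sends a cancel to the instance, or soft-panics if the instance is gone.
fn send_cancel(
    instances: &mut BTreeMap<ClusterId, Instance>,
    cluster_id: ClusterId,
    ingestion_id: IngestionId,
) {
    match instances.get_mut(&cluster_id) {
        Some(instance) => instance.sent_cancels.push(ingestion_id),
        None => soft_panic_or_log(&format!(
            "missing cluster {} for oneshot ingestion {}",
            cluster_id, ingestion_id
        )),
    }
}
```

If dropping an instance can legitimately race with a pending oneshot ingestion, the `None` arm should instead be a silent no-op, which is why the answer to the question above determines the right shape.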

.filter(|ingestion_id| {
let created = create_oneshot_ingestions.contains(ingestion_id);
let dropped = cancel_oneshot_ingestions.contains(ingestion_id);
!created && !dropped
Contributor

The check seems unnecessary. We shouldn't have any drop commands for ingestions we didn't previously create in the command stream, right? So !created && dropped shouldn't be possible.
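
The argument can be checked with a small sketch. It assumes the invariant stated above (every canceled id was also created), and the function names, the `candidates` slice, and the `u64` id type are all hypothetical, since the diff fragment does not show what is being filtered.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the real ingestion identifier (a `Uuid` in the PR).
type IngestionId = u64;

/// Mirrors the filter in the diff: keep ids that were neither created nor canceled.
fn keep_with_both_checks(
    candidates: &[IngestionId],
    created_ids: &BTreeSet<IngestionId>,
    canceled_ids: &BTreeSet<IngestionId>,
) -> Vec<IngestionId> {
    candidates
        .iter()
        .copied()
        .filter(|id| {
            let created = created_ids.contains(id);
            let dropped = canceled_ids.contains(id);
            !created && !dropped
        })
        .collect()
}

/// Simplified filter: if canceled ids are always a subset of created ids,
/// `!created` already implies `!dropped`.
fn keep_with_created_only(
    candidates: &[IngestionId],
    created_ids: &BTreeSet<IngestionId>,
) -> Vec<IngestionId> {
    candidates
        .iter()
        .copied()
        .filter(|id| !created_ids.contains(id))
        .collect()
}
```

As long as the canceled set is a subset of the created set, both functions return the same result for any input. The two only diverge if a cancel can appear without a preceding create, which is exactly the case the comment argues should not occur.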

3 participants