[copy_from] Proper cancelation via `CancelOneshotIngestion` message #31136

ParkMyCar · 2025-01-22T03:49:49Z

This PR fixes the TODO(cf1) related to canceling oneshot ingestions. It adds a StorageCommand::CancelOneshotIngestion that reduces/compacts away a corresponding StorageCommand::RunOneshotIngestion, much like ComputeCommand::Peek and ComputeCommand::CancelPeek.

We send a StorageCommand::CancelOneshotIngestion whenever a user cancels a COPY FROM statement. The storage controller also sends one whenever a RunOneshotIngestion command completes.
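To illustrate the compaction described above, here is a minimal sketch of how a cancel might reduce away its matching run in a command history. The enum is a simplified, hypothetical subset of `StorageCommand`, `IngestionId` is a `u64` stand-in for the `Uuid` used in the PR, and dropping the cancel command itself after compaction is an assumption rather than something the description confirms.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the ingestion ID (the PR uses `Uuid`).
type IngestionId = u64;

/// Simplified, hypothetical subset of the storage command enum.
#[derive(Debug, Clone, PartialEq, Eq)]
enum StorageCommand {
    RunOneshotIngestion { ingestion_id: IngestionId },
    CancelOneshotIngestion { ingestions: Vec<IngestionId> },
}

/// Compacts a command history: every canceled ingestion loses its
/// `RunOneshotIngestion`, and (by assumption) the cancel is dropped too,
/// since nothing remains for it to cancel.
fn reduce(history: Vec<StorageCommand>) -> Vec<StorageCommand> {
    // First pass: collect every ingestion ID that was canceled.
    let mut canceled = BTreeSet::new();
    for cmd in &history {
        if let StorageCommand::CancelOneshotIngestion { ingestions } = cmd {
            canceled.extend(ingestions.iter().copied());
        }
    }

    // Second pass: keep only runs that were never canceled.
    history
        .into_iter()
        .filter(|cmd| match cmd {
            StorageCommand::RunOneshotIngestion { ingestion_id } => {
                !canceled.contains(ingestion_id)
            }
            StorageCommand::CancelOneshotIngestion { .. } => false,
        })
        .collect()
}
```

With a history of two runs and a cancel for the first, only the second run survives, which mirrors how a `CancelPeek` removes its `Peek`.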

Motivation

Fix TODO(cf1) related to cancelation

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

* add CancelOneshotIngestion message to the storage controller
* handle new message in 'reduce' and 'reconcile' in the storage-controller
* emit a CancelOneshotIngestion whenever an ingestion completes

teskje

Looks good overall. The thing I'm not clear about is what happens to oneshot ingestions when their target cluster is dropped. There are two places that have to deal with the possibility of a missing replica and they handle it differently (returning an error or gracefully ignoring, respectively). I think we should handle this consistently, but the right thing to do depends on whether we drop the controller state for pending oneshot ingestions when we drop an instance.

teskje · 2025-01-27T15:27:02Z

src/storage-client/src/client.rs

+    /// [`RunOneshotIngestion`]: crate::client::StorageCommand::RunOneshotIngestion
+    CancelOneshotIngestion {
+        ingestions: Vec<Uuid>,
+    },


Is there a reason to make these batched commands? In compute we at one point transformed all commands into unbatched ones because the batching made various things more cumbersome (mainly keeping statistics about the number of commands in the history) and it didn't provide any benefits wrt. protobuf encoding size. I think there are plans for also moving to unbatched commands for storage (either @aljoscha or @petrosagg mentioned that), so if that's still the case it'd make sense to introduce new commands as unbatched immediately.

petrosagg · 2025-01-28

I mentioned it, yeah. If possible we should use a flattened field here.

ParkMyCar · 2025-01-28

The reason I made these batched commands is that the loop in fn reconcile(...) doesn't remove commands; instead it mutates each batch and removes the relevant entries, so I decided to stick with this existing pattern.

Chatted with @petrosagg about this today though and I'll first try to refactor the loop and actually remove commands instead of just draining batched ones.
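The difference between the two shapes discussed in this thread can be sketched as follows. Both enums and both pruning functions are hypothetical simplifications (they are not the actual `reconcile` code), and `IngestionId` is a `u64` stand-in for `Uuid`. The point is only that pruning a batched command mutates its inner list and can leave an empty command behind, while pruning unbatched commands removes whole entries, so the history length stays meaningful.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the ingestion ID (the PR uses `Uuid`).
type IngestionId = u64;

/// Batched shape, as in the diff under review.
#[derive(Debug, PartialEq, Eq)]
enum Batched {
    CancelOneshotIngestion { ingestions: Vec<IngestionId> },
}

/// Unbatched shape, as suggested by the reviewers.
#[derive(Debug, PartialEq, Eq)]
enum Unbatched {
    CancelOneshotIngestion { ingestion_id: IngestionId },
}

/// Drains stale IDs out of each batch; the commands themselves remain,
/// possibly with an empty `ingestions` list.
fn prune_batched(commands: &mut Vec<Batched>, stale: &BTreeSet<IngestionId>) {
    for cmd in commands.iter_mut() {
        let Batched::CancelOneshotIngestion { ingestions } = cmd;
        ingestions.retain(|id| !stale.contains(id));
    }
}

/// Removes stale commands outright, so `commands.len()` reflects the
/// number of commands that still matter.
fn prune_unbatched(commands: &mut Vec<Unbatched>, stale: &BTreeSet<IngestionId>) {
    commands.retain(|cmd| {
        let Unbatched::CancelOneshotIngestion { ingestion_id } = cmd;
        !stale.contains(ingestion_id)
    });
}
```

In the batched variant a fully pruned command is still counted in the history, which is one way the batching makes command statistics more cumbersome.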

teskje · 2025-01-27T15:37:24Z

src/storage-controller/src/lib.rs

+
+        let instance = self.instances.get_mut(&pending.cluster_id).ok_or_else(|| {
+            // TODO(cf2): Refine this error.
+            StorageError::Generic(anyhow::anyhow!("missing cluster {}", pending.cluster_id))


If we get here that would be because of a bug in the storage controller, not because of a usage error, right? I wouldn't return an error here, but do a (soft) panic instead.
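A rough sketch of what the suggested handling could look like, assuming a missing instance indicates a controller bug rather than a usage error. Everything here is hypothetical: `soft_panic_or_log` is a plain function standing in for a soft-panic facility (panic when soft assertions are enabled, log otherwise), and `Instance`, `ClusterId`, and `send_cancel` are illustrative names, not the storage controller's API.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a cluster ID.
type ClusterId = u64;

/// Hypothetical stand-in for a storage instance; records sent commands.
#[derive(Debug, Default)]
struct Instance {
    sent: Vec<String>,
}

/// Hypothetical soft panic: panics when soft assertions are enabled
/// (for example in tests), otherwise logs to stderr and continues.
fn soft_panic_or_log(soft_assertions: bool, msg: &str) {
    if soft_assertions {
        panic!("{msg}");
    } else {
        eprintln!("soft assertion failed: {msg}");
    }
}

/// Sends a cancel to the instance if it exists. A missing instance is
/// treated as an internal invariant violation instead of a user-facing
/// error. Returns whether the command was sent.
fn send_cancel(
    instances: &mut BTreeMap<ClusterId, Instance>,
    cluster_id: ClusterId,
    soft_assertions: bool,
) -> bool {
    match instances.get_mut(&cluster_id) {
        Some(instance) => {
            instance.sent.push("CancelOneshotIngestion".to_string());
            true
        }
        None => {
            soft_panic_or_log(soft_assertions, &format!("missing cluster {cluster_id}"));
            false
        }
    }
}
```

The same helper could serve both call sites that look up an instance, which would give the consistent handling asked for in the top-level review comment.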

teskje · 2025-01-27T15:38:58Z

src/storage-controller/src/lib.rs

+                for (ingestion_id, batches) in batches {
+                    match self.pending_oneshot_ingestions.remove(&ingestion_id) {
+                        Some(pending) => {
+                            // Send a cancel command so our command history is correct.


(Also to avoid duplicate work once we have active replication.)

teskje · 2025-01-27T15:40:53Z

src/storage-controller/src/lib.rs

+                    match self.pending_oneshot_ingestions.remove(&ingestion_id) {
+                        Some(pending) => {
+                            // Send a cancel command so our command history is correct.
+                            if let Some(instance) = self.instances.get_mut(&pending.cluster_id) {


Is it possible to get here when the instance is already dropped? If not, we should add a soft panic in the else branch.

teskje · 2025-01-27T15:58:30Z

src/storage/src/storage_state.rs

+            .filter(|ingestion_id| {
+                let created = create_oneshot_ingestions.contains(ingestion_id);
+                let dropped = cancel_oneshot_ingestions.contains(ingestion_id);
+                !created && !dropped


The check seems unnecessary. We shouldn't have any drop commands for ingestions we didn't previously create in the command stream, right? So !created && dropped shouldn't be possible.
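The reasoning can be checked with a small sketch. Assuming the invariant stated in the comment (every canceled ingestion was also created earlier in the same command stream), the two-part predicate and the single `!created` check select the same IDs. The function names and the `u64` ID type are hypothetical and only mirror the shape of the filter in the diff.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the ingestion ID (the PR uses `Uuid`).
type IngestionId = u64;

/// The predicate as written in the diff: keep IDs that were neither
/// created nor canceled in the new command stream.
fn keep_with_both_checks(
    active: &[IngestionId],
    created: &BTreeSet<IngestionId>,
    canceled: &BTreeSet<IngestionId>,
) -> Vec<IngestionId> {
    active
        .iter()
        .copied()
        .filter(|id| !created.contains(id) && !canceled.contains(id))
        .collect()
}

/// The simplified predicate: if `canceled` is a subset of `created`,
/// checking `created` alone gives the same result.
fn keep_with_created_only(
    active: &[IngestionId],
    created: &BTreeSet<IngestionId>,
) -> Vec<IngestionId> {
    active
        .iter()
        .copied()
        .filter(|id| !created.contains(id))
        .collect()
}
```

The equivalence only holds while the subset invariant does. If a cancel could ever arrive for an ingestion that was not created in the stream, the two filters would diverge.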

ParkMyCar requested a review from teskje January 22, 2025 03:49

ParkMyCar requested review from a team as code owners January 22, 2025 03:49

ParkMyCar requested a review from aljoscha January 22, 2025 03:49

ParkMyCar force-pushed the copy/cancel branch 2 times, most recently from 68b82bd to f16fa82 January 22, 2025 19:57

start, proper cancelation for Oneshot Ingestions

56ccb4d


ParkMyCar force-pushed the copy/cancel branch from f16fa82 to 56ccb4d January 23, 2025 14:13

teskje reviewed Jan 27, 2025


petrosagg Jan 28, 2025

ParkMyCar Jan 28, 2025
