-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stale volumesnapshot and volumesnapshotcontents created for volumegroupsnapshot #1050
Comments
I guess that an error like AlreadyExists on a Create() call should be handled as success (of a previous try), and if needed a Get() might be done instead. That is probably easier than garbage collecting created objects. |
external-snapshotter/pkg/sidecar-controller/groupsnapshot_helper.go Lines 767 to 775 in fa9de9c
|
Yes, tracking the timestamp of the initial request is required for this. All objects can use the same timestamp, possibly stored&retrieved in an annotation of the groupSnapshotContent object. |
How about if we generate the sha of func GetSnapshotNameForVolumeGroupSnapshotContent(groupSnapshotContentUUID, pvUUID string) string {
return fmt.Sprintf("snapshot-%x-%s", sha256.Sum256([]byte(groupSnapshotContentUUID+pvUUID+snapshotID)))
}
// GetSnapshotContentNameForVolumeGroupSnapshotContent returns a unique content name for the
// passed in VolumeGroupSnapshotContent.
func GetSnapshotContentNameForVolumeGroupSnapshotContent(groupSnapshotContentUUID, pvUUID string) string {
return fmt.Sprintf("snapcontent-%x-%s", sha256.Sum256([]byte(groupSnapshotContentUUID+pvUUID+snapshotID)))
} There could be a reason why timestamp was considered but not sure why, |
We use timestamp because that's usually how people check at which point in time they want to restore the data back. Also you could create multiple snapshots for the same groupSnapshotContentUUID+pvUUID+snapshotID combination. |
I think we need to do cleanups if steps 3, 4, 5 fail. |
@Madhu-1 Are the failures you encountered coming from VolumeSnapshot/VolumeSnapshotContent API object creation/update or from the creation of snapshot resource on the underlying storage system? |
@xing-yang Should users use the creationTime stamp on the volumesnapshotcontent to restore the data back? This timestamp represents when the volumesnapshot was created, not when the snapshot was created in the storage backend. |
i think it was due to the VolumeSnapshotContent update (i will run tests again and update the issue with logs) but again that is our source of the problem i was also thinking about the problem when the sidecar is restarted as well (this can also lead to the stale resources) |
@xing-yang here are the logs of csi-snapshotter
|
@Madhu-1 is it possible for the CephFS implementation to not return an error like this:
But instead return a VolumeGroupSnapshot that has |
@nixpanic i think its will contain only the groupID (it can change if something goes wrong and rollback happens) to avoid corner cases and as no snapshot operations are technically started yet i prefer to return internal error until the quiesce is completed. |
Ok, I was also not sure if the group/snapshot IDs would be available to return early. Other operations return |
Created a PR for that: container-storage-interface/spec#563 |
In my tests, I traced it down to this commit, which does not introduce the issue but increases the probability of triggering it by adding another call to And, unfortunately, every call to |
This patch replace the usage of `UpdateStatus` with a patch against the `status` subresource in the `updateSnapshotContentStatus`. The reason behind this change is to avoid conflicts that could potentially arise, as they pose a risk within the context of the calling function `createGroupSnapshotWrapper`. If `createGroupSnapshotWrapper` is called multiple times for the same GroupSnapshot, this will lead to multiple stale `VolumeSnapshot` and `VolumeSnapshotContent` being created. Partially closes: kubernetes-csi#1050
This patch replace the usage of `UpdateStatus` with a patch against the `status` subresource in the `updateSnapshotContentStatus`. The reason behind this change is to avoid conflicts that could potentially arise, as they pose a risk within the context of the calling function `createGroupSnapshotWrapper`. If `createGroupSnapshotWrapper` is called multiple times for the same GroupSnapshot, this will lead to multiple stale `VolumeSnapshot` and `VolumeSnapshotContent` being created. Related to: kubernetes-csi#1050
This patch replace the usage of `UpdateStatus` with a patch against the `status` subresource in the `updateSnapshotContentStatus`. The reason behind this change is to avoid conflicts that could potentially arise, as they pose a risk within the context of the calling function `createGroupSnapshotWrapper`. If `createGroupSnapshotWrapper` is called multiple times for the same GroupSnapshot, this will lead to multiple stale `VolumeSnapshot` and `VolumeSnapshotContent` being created. Related to: kubernetes-csi#1050
This patch replace the usage of `UpdateStatus` with a patch against the `status` subresource in the `updateSnapshotContentStatus`. The reason behind this change is to avoid conflicts that could potentially arise, as they pose a risk within the context of the calling function `createGroupSnapshotWrapper`. If `createGroupSnapshotWrapper` is called multiple times for the same GroupSnapshot, this will lead to multiple stale `VolumeSnapshot` and `VolumeSnapshotContent` being created. Related to: kubernetes-csi#1050
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten |
Hi, I tested the bug with latest builds,
Output for Volumegroupsnapshotcontent, volumesnapshotcontent and volumesnapshot
The above result is after following the bekow steps:
Result: total of 5 snapshotcontent can be seen at the end of the above testing which is expected. |
cc @Madhu-1 |
@yati1998 its still possible to reproduce this one which involves the restart of the snapshotter |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
external-snapshotter/pkg/sidecar-controller/groupsnapshot_helper.go
Lines 433 to 497 in fa9de9c
The above code seems to be buggy and i got around 200 volumesnapshots/volumesnapshotcontent created for a single volumegroupsnapshot with single PVC.
It works as below
Repeat steps 3,4,5 for all the snapshots in the volumegroupsnapshot RPC response and at last update the volumegroupsnapshotcontent status with VSC names.
If there are any issues in 3,4,5 steps or if the csi-snapshotter is restarted we might end up having stale resources that need to be garbage collected.
cc @xing-yang @RaunakShah @nixpanic
The text was updated successfully, but these errors were encountered: