Enable configuration to select backup deletion mechanism #18

akloss-cibo · 2024-12-12T05:53:18Z

Way back in kopeio/etcd-manager#325 it was observed that, in a bucket with object versioning enabled, deleting an object only creates a delete marker, so the old backup versions are never actually removed by etcd-manager and can create large storage costs.

However, in a bucket with S3 Object Lock enabled, it is typically not possible to delete a version until its lock is released. For users in this situation, using a regular S3 delete (which creates the delete marker) and then using an S3 lifecycle rule to actually delete the versions is pretty desirable.
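
A minimal sketch, in Go, of what such a configuration switch might look like. The names `DeletionMode`, `DeleteVersion`, `DeleteMarker`, and `buildDeleteRequest` are hypothetical and not part of etcd-manager. The sketch only builds the parameters for an S3 DeleteObject call and does not talk to AWS. It relies on documented S3 behavior: on a versioned bucket, a delete without a version ID adds a delete marker, while a delete with a version ID permanently removes that version.

```go
package main

import (
	"errors"
	"fmt"
)

// DeletionMode is a hypothetical setting; etcd-manager has no such option today.
type DeletionMode string

const (
	// DeleteVersion permanently deletes one specific object version.
	DeleteVersion DeletionMode = "version"
	// DeleteMarker issues a plain delete and leaves old versions for a lifecycle rule to expire.
	DeleteMarker DeletionMode = "marker"
)

// DeleteRequest holds the parameters that would be passed to S3 DeleteObject.
type DeleteRequest struct {
	Bucket    string
	Key       string
	VersionID string // empty means "create a delete marker" on a versioned bucket
}

// buildDeleteRequest chooses the request shape for the configured mode.
func buildDeleteRequest(mode DeletionMode, bucket, key, versionID string) (DeleteRequest, error) {
	switch mode {
	case DeleteVersion:
		if versionID == "" {
			return DeleteRequest{}, errors.New("version mode requires a version ID")
		}
		return DeleteRequest{Bucket: bucket, Key: key, VersionID: versionID}, nil
	case DeleteMarker:
		// Deliberately drop the version ID so S3 only adds a delete marker.
		return DeleteRequest{Bucket: bucket, Key: key}, nil
	default:
		return DeleteRequest{}, fmt.Errorf("unknown deletion mode %q", mode)
	}
}

func main() {
	req, err := buildDeleteRequest(DeleteMarker, "my-bucket", "backups/etcd/main/example/etcd.backup.gz", "v1")
	fmt.Printf("%+v %v\n", req, err)
}
```

With the marker mode, a delete never targets a locked version, so it would not hit the Object Lock denial; actual removal of the versions is left to the bucket's lifecycle configuration.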

This manifests very similarly to kubernetes/kops#14031, where the cost of the tier 1 operations that are trying to delete objects gets very high.

rifelpet · 2024-12-17T02:16:45Z

Can you give an example error message and HTTP response code when etcd-manager fails to delete an object that has S3 Object Lock? We could implement some caching and backoff for failures of this type.
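
One possible shape for such caching and backoff, as a rough sketch in Go. Everything here (`failureCache`, `backoffDelay`, the one-minute base and one-hour cap) is an assumption for illustration and not existing etcd-manager code. The idea is to remember which backups failed to delete and skip them until an exponentially growing delay has passed.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed tuning values, chosen only for illustration.
const (
	baseDelay = time.Minute
	maxDelay  = time.Hour
)

type failureEntry struct {
	failures int
	retryAt  time.Time
}

// failureCache remembers backups whose deletion failed.
type failureCache struct {
	entries map[string]failureEntry
}

func newFailureCache() *failureCache {
	return &failureCache{entries: make(map[string]failureEntry)}
}

// backoffDelay doubles the delay for each consecutive failure, capped at maxDelay.
func backoffDelay(failures int) time.Duration {
	d := baseDelay
	for i := 1; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// RecordFailure notes a failed delete and returns the earliest time to retry.
func (c *failureCache) RecordFailure(key string, now time.Time) time.Time {
	e := c.entries[key]
	e.failures++
	e.retryAt = now.Add(backoffDelay(e.failures))
	c.entries[key] = e
	return e.retryAt
}

// ShouldSkip reports whether the backup is still inside its backoff window.
func (c *failureCache) ShouldSkip(key string, now time.Time) bool {
	e, ok := c.entries[key]
	return ok && now.Before(e.retryAt)
}

// RecordSuccess clears any failure history for the backup.
func (c *failureCache) RecordSuccess(key string) {
	delete(c.entries, key)
}

func main() {
	c := newFailureCache()
	now := time.Now()
	retryAt := c.RecordFailure("2024-10-17T20:07:32Z-000525", now)
	fmt.Println(c.ShouldSkip("2024-10-17T20:07:32Z-000525", now), retryAt.Sub(now))
}
```

An in-memory cache like this is lost on restart, which seems acceptable here since the worst case is one extra failed delete per backup.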

akloss-cibo · 2024-12-17T03:21:20Z

etcd-manager doesn't actually record an error at all, but you can see here it tries to delete the same thing more than once (because it is unable to delete it):

% logs etcd-manager-main-i-02ddfc9eebbeb1a96 | grep 'remov.* backup "2024-10-17T20:07:32Z-000525"'
Defaulted container "etcd-manager" out of: etcd-manager, kops-utils-cp (init), init-etcd-3-4-13 (init), init-etcd-3-5-9 (init), init-etcd-symlinks-3-4-13 (init), init-etcd-symlinks-3-5-9 (init)
I1217 02:00:14.525485    4775 cleanup.go:173] removing backup "2024-10-17T20:07:32Z-000525"
I1217 02:00:14.704003    4775 cleanup.go:177] removed backup "2024-10-17T20:07:32Z-000525"
I1217 03:10:46.573513    4775 cleanup.go:173] removing backup "2024-10-17T20:07:32Z-000525"
I1217 03:10:46.686841    4775 cleanup.go:177] removed backup "2024-10-17T20:07:32Z-000525"
%

Attempting a versioned delete with the AWS CLI gives this (bucket and prefix obfuscated):

% aws s3api delete-object --bucket my-bucket --key kops/my-cluster.example.com/backups/etcd/main/2024-12-17T02:59:52Z-000019/etcd.backup.gz --version-id oJwBuWF04gUN4jImBxIpXFnxNe5_3bM6

An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied because object protected by object lock.
%

This is the relevant snippet of output from the CLI with --debug on:

2024-12-16 21:19:01,849 - MainThread - urllib3.connectionpool - DEBUG - https://my-bucket.s3.us-east-1.amazonaws.com:443 "DELETE /kops/my-cluster/backups/etcd/main/2024-12-17T02%3A59%3A52Z-000019/etcd.backup.gz?versionId=oJwBuWF04gUN4jImBxIpXFnxNe5_3bM6 HTTP/1.1" 403 None
2024-12-16 21:19:01,850 - MainThread - botocore.hooks - DEBUG - Event before-parse.s3.DeleteObject: calling handler <function _handle_200_error at 0x102209ee0>
2024-12-16 21:19:01,850 - MainThread - botocore.hooks - DEBUG - Event before-parse.s3.DeleteObject: calling handler <function handle_expires_header at 0x102209d00>
2024-12-16 21:19:01,850 - MainThread - botocore.parsers - DEBUG - Response headers: {'x-amz-request-id': 'DXB3VZGZRM9BGQN8', 'x-amz-id-2': 'j0DaBvsJDX1u/c4zq5l/fXII/XL/G9ehoFLN0XlKcl8igUGlRc9J4E/iBKDj3YsCltQ+DC8c4sc=', 'Content-Type': 'application/xml', 'Transfer-Encoding': 'chunked', 'Date': 'Tue, 17 Dec 2024 03:19:01 GMT', 'Server': 'AmazonS3'}
2024-12-16 21:19:01,850 - MainThread - botocore.parsers - DEBUG - Response body:
b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied because object protected by object lock.</Message><RequestId>DXB3VZGZRM9BGQN8</RequestId><HostId>j0DaBvsJDX1u/c4zq5l/fXII/XL/G9ehoFLN0XlKcl8igUGlRc9J4E/iBKDj3YsCltQ+DC8c4sc=</HostId></Error>'

You can see the 403 response code and the error in the XML body.
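
For detecting this case in code, a hedged Go sketch follows. It parses the XML error body shown above and checks for the 403 status, the `AccessDenied` code, and the "object lock" wording. The function name `isObjectLockDenied` is hypothetical, and matching on the message text is an assumption: the wording comes from this one response and may not be stable.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
	"strings"
)

// s3Error mirrors the fields of the <Error> body in the response above.
type s3Error struct {
	XMLName   xml.Name `xml:"Error"`
	Code      string   `xml:"Code"`
	Message   string   `xml:"Message"`
	RequestID string   `xml:"RequestId"`
}

// isObjectLockDenied reports whether a response looks like the Object Lock
// denial seen in this issue. The message match is a heuristic, not a contract.
func isObjectLockDenied(status int, body []byte) bool {
	if status != http.StatusForbidden {
		return false
	}
	var e s3Error
	if err := xml.Unmarshal(body, &e); err != nil {
		return false
	}
	return e.Code == "AccessDenied" &&
		strings.Contains(strings.ToLower(e.Message), "object lock")
}

func main() {
	body := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied because object protected by object lock.</Message><RequestId>DXB3VZGZRM9BGQN8</RequestId></Error>`)
	fmt.Println(isObjectLockDenied(http.StatusForbidden, body))
}
```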

akloss-cibo mentioned this issue Dec 16, 2024

Remove all versions of a file form the S3 bucket kubernetes/kops#9171

Merged

justinsb added a commit to justinsb/etcd-manager-1 that referenced this issue Dec 27, 2024

Don't run backup cleanup more often if it fails

890785f

We update the last-run timestamp before running cleanup, so that we don't retry immediately if the cleanup fails. Issue kubernetes-sigs#18
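
The ordering described in that commit message can be sketched in Go as follows. This is an illustration under assumptions, not the code from the commit: `cleaner`, `maybeCleanup`, `errLocked`, and the one-hour `cleanupInterval` are all hypothetical names and values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// cleanupInterval is an assumed value for illustration.
const cleanupInterval = time.Hour

// errLocked stands in for a delete that S3 refuses because of Object Lock.
var errLocked = errors.New("delete denied: object protected by object lock")

type cleaner struct {
	lastRun time.Time
	cleanup func() error
}

// maybeCleanup runs cleanup at most once per interval. The timestamp is
// recorded before cleanup runs, so a failure does not cause an immediate retry.
func (c *cleaner) maybeCleanup(now time.Time) (bool, error) {
	if !c.lastRun.IsZero() && now.Sub(c.lastRun) < cleanupInterval {
		return false, nil
	}
	c.lastRun = now
	return true, c.cleanup()
}

func main() {
	c := &cleaner{cleanup: func() error { return errLocked }}
	now := time.Now()
	ran, err := c.maybeCleanup(now)
	fmt.Println(ran, err)
	// One minute later the failed cleanup is not retried.
	ran, err = c.maybeCleanup(now.Add(time.Minute))
	fmt.Println(ran, err)
}
```

If the timestamp were only updated after a successful cleanup, every loop iteration following a failure would attempt the deletes again, which matches the repeated delete requests described earlier in this issue.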

justinsb mentioned this issue Dec 27, 2024

Don't run backup cleanup more often if it fails #19

Merged
