
Document manual kopia maintenance cleanup with --safety=none #8374

Open
kaovilai opened this issue Nov 6, 2024 · 3 comments
@kaovilai
Member

kaovilai commented Nov 6, 2024

Document that there is a way to clean up faster, but it has caveats and the user will have to run it manually.

> --safety=none could be documented for users as a workaround but not implemented in Velero code. If agreed, we can open a documentation issue for that.

Originally posted by @reasonerjt in #8365 (comment)
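For reference, a minimal sketch of what the manual workaround under discussion could look like, assuming direct access to the backup storage and to the repository password Velero's node agent uses. The bucket, prefix, namespace, and password below are placeholders, and the exact connect flags depend on the object-storage provider:

```sh
# Hypothetical sketch only: connect to the kopia repository Velero uses and run
# full maintenance without the safety margins. <bucket>, <prefix>, <namespace>,
# and <repo-password> are placeholders for a given install.
kopia repository connect s3 \
  --bucket=<bucket> \
  --prefix=<prefix>/kopia/<namespace>/ \
  --password=<repo-password>

# Full maintenance with safety disabled reclaims space immediately instead of
# waiting out kopia's default safety window.
kopia maintenance run --full --safety=none
```

Velero should not be writing to the repository while this runs; see the discussion below about shutting Velero down first.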

@Lyndon-Li
Contributor

Thinking about this again: once we document this, it means we allow users to do this unconditionally and that Velero will work with the repo after maintenance has been run manually with --safety=none.
However, we have never tested it, so we cannot say that maintenance with --safety=none never fails or that Velero always works with it afterward.

Since we already have a solution for the original problem in #8364, we don't want to take on the risk and extra work of testing or troubleshooting. Therefore, I would suggest we reconsider this. @kaovilai @reasonerjt @weshayutin

@weshayutin
Contributor

I agree with @Lyndon-Li generally. A couple of thoughts:

  • I don't know how we would know for sure, from a support perspective, whether a customer used --safety=none, which is the biggest concern to me.
  • I can see situations where a cluster admin is pressured to reduce cloud costs and needs to run without the safety checks.
  • Perhaps state that restoring a backup after maintenance with --safety=none has been executed is not supported. Customers are highly encouraged to test restores immediately if --safety=none has been executed.
  • Question: if a customer ran without safety (and it works), restore works, and then additional incremental backups are taken, are they back to a good-enough position for support?

At the end of the day, I think I would rather see a customer run a fresh backup in a fresh/new backup repository and then delete the old backup and repository than use --safety=none and expect support.
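A minimal sketch of that alternative, assuming an AWS-style provider and placeholder names for the new location, bucket, and backups:

```sh
# Hypothetical sketch: point new backups at a fresh repository instead of
# running --safety=none against the old one. Names and provider are placeholders.
velero backup-location create fresh-bsl \
  --provider aws \
  --bucket <new-bucket> \
  --prefix velero

# Take new backups against the fresh location.
velero backup create nightly-fresh --storage-location fresh-bsl

# After the new backups are verified, delete the old backups (and eventually
# the old bucket/repository).
velero backup delete <old-backup> --confirm
```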

@sseago
Collaborator

sseago commented Nov 25, 2024

@weshayutin So I think the only problem with safety=none arises if backups were run while this happened. In other words, if I run a backup, then shut down Velero, then run maintenance with safety=none -- if that backup wasn't deleted/expired, then none of its blobs should have been removed -- restoring it should work, and further backups are incremental. Basically, full maintenance with safety=none should be similar to what happens with restic maintenance, since restic doesn't have this sort of safety mechanism built in to begin with.

In any case, if we did document this, it would not be to suggest that it works unconditionally. We would absolutely need to recommend that Velero be shut down while this is done. If I'm understanding things correctly, the potential for harm here is limited to what happens if Velero tries to make a new backup while maintenance is running.
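As a rough sketch of that precaution, assuming the default velero namespace and deployment name (which may differ per install):

```sh
# Sketch only: make sure nothing is in progress, then stop the Velero server so
# no new backups start while manual maintenance runs.
velero backup get
kubectl -n velero scale deployment velero --replicas=0

# ... run the manual kopia maintenance with --safety=none here ...

# Bring the server back once maintenance completes.
kubectl -n velero scale deployment velero --replicas=1
```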

@weshayutin "fresh backup/repository" works as long as the user is able to delete all backups in the current repository without any data protection risk. I think the scenario that is most relevant here would be a repository with needed non-expired backups where there was an additional large backup taken that they want to get rid of right now. But maybe that's an edge case we don't need to support. "The only supported solution if there are other backups in the repository that you must retain is to delete the unnecessary backup and wait the time required for regular maintenance to clear it". While today that could be as long as 72 hours for a newly-created backup, once we implement the configurable full maint window, that will drop the worst-case scenario down to 36 hours.
