Configurable Kopia Maintenance Interval #8364

Open
kaovilai opened this issue Oct 30, 2024 · 15 comments · May be fixed by #8581

@kaovilai
Member

kaovilai commented Oct 30, 2024

Describe the problem/challenge you have

We want the ability to configure the maintenance interval so that changes to storage take effect more quickly in some cases.

These can be configured in the repo-maintenance-job-configmap.

cc: @shubham-pampattiwar @weshayutin

@kaovilai kaovilai added this to OADP Oct 30, 2024
@kaovilai
Member Author

I can be assigned this issue

@sseago
Collaborator

sseago commented Oct 30, 2024

@kaovilai I think what we want here is just a bool entry -- "alwaysUseFullMaintenance" or something.
Auto is the default (so bool is false), which results in Kopia doing one full maint per day, and the rest are quick. When we set this to true we'll want every maintenance to be full. I don't think we ever want to always do quick -- that would mean data is never cleaned up.

@kaovilai
Member Author

kaovilai commented Oct 30, 2024

Sure. Bool entry if that works for everyone.

@Lyndon-Li
Contributor

Lyndon-Li commented Oct 31, 2024

Could we clarify the scenarios in which we want users to configure the maintenance mode?

Basically, we don't want users to change the mode, because full maintenance and quick maintenance are very different from each other: they are designed to alternate, with the quick one running more frequently. Changing the mode manually may cause unexpected consequences:

  • Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss.
  • There are sub tasks inside the maintenance, and each sub task may control its own run schedule. Manually changing the mode may have no effect other than increasing the load, because full maintenance does heavy checks and takes more resources.
  • In many cases quick maintenance prepares the data for full maintenance. Forcibly changing the mode may lead to a situation where full maintenance always fails (e.g., because of a lack of resources) and quick maintenance never gets the chance to prepare the data that would help full maintenance succeed.

Keeping data for a reasonable time is a Kopia policy that ensures the system works correctly; manually changing the maintenance mode may not actually result in the data being deleted earlier.
Therefore, we should let users know that the repo maintains data at its own pace, which is there to ensure data safety.

@Lyndon-Li
Contributor

Another point:

  • Velero adopts the Unified Repository concept, which supports multiple types of repos.
  • There is no maintenance mode for the Unified Repository, because not all repos have these modes.
  • The frequency is also obtained from the repo itself, because maintenance is an operation private to the repo (as you can see from the details above); some repos require frequent maintenance, while others don't require it at all.

Therefore, it is neither safe nor necessary to add the maintenance mode to the Unified Repository. At present, we let the repo itself decide how to do maintenance, including the mode and frequency, and offload the maintenance work entirely to the repo.

@kaovilai
Member Author

Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss.

We probably want this to occur more often at least for testing/debugging. And we've been getting customer cases where they are saying maintenance does not actually work for them so backup expires but nothing is getting deleted.

https://hackmd.io/12AKVvCnRlmyBksgXJls5Q

@Lyndon-Li
Contributor

Lyndon-Li commented Oct 31, 2024

And we've been getting customer cases where they are saying maintenance does not actually work for them so backup expires but nothing is getting deleted

This may be the expected behavior; e.g., the data may be referenced by other backups and should not be deleted.
Otherwise, we need to treat it as a bug and find the root cause before making changes.

We probably want this to occur more often at least for testing/debugging

For this purpose, if the debugging happens in users' production environments, changing anything about the maintenance is still not recommended, since this may result in the loss of users' data. If the testing/debugging happens in our dev environments, I think we can change the code locally. Moreover, as mentioned above, many sub task margins need to be adjusted, so changing only the mode may not make it work as expected.

@sseago
Collaborator

sseago commented Oct 31, 2024

@Lyndon-Li "Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss."

There shouldn't be any risk here, since kopia requires 4 separate full maintenance cycles at least four hours apart before it will remove any data. The concern is that with the default "once a day" full maintenance, it will be 24 hours at the earliest, but up to 48 hours once a blob is no longer referenced by a needed snapshot. We could reduce this window to 4+ hours if full maintenance ran more often. But even if you ran full maintenance constantly (which we wouldn't actually want) it shouldn't put the data at risk because kopia's built-in safety mechanisms require GC to mark a blob as safe to delete during two separate full maint cycles at least 4 hours apart.

@sseago
Collaborator

sseago commented Oct 31, 2024

I don't know if this is possible, but maybe there's a way to configure the kopia repo to do full maintenance more than once per day when velero runs maintenance with "mode=auto" -- that might be cleaner than a config to always run full, but I don't know whether that can be done. Then we could have behavior where full is done every 6 hours but quick every hour.

@sseago
Collaborator

sseago commented Oct 31, 2024

It looks like we probably can do that here:

p.FullCycle.Interval = overwriteFullMaintainInterval

		if overwriteFullMaintainInterval != time.Duration(0) {
			logger.Infof("Full maintenance interval change from %v to %v", p.FullCycle.Interval, overwriteFullMaintainInterval)
			p.FullCycle.Interval = overwriteFullMaintainInterval
		}

Maybe making this configurable is preferable to an "always use full maint" flag. Then we could recommend for users who want data to be deleted more quickly to set this to 6 or 12 hours instead of the default 24.

@Lyndon-Li
Contributor

@Lyndon-Li "Full maintenance deletes data that quick maintenance doesn't. Running full maintenance too frequently causes data to be deleted earlier than necessary and may result in data loss."

There shouldn't be any risk here, since kopia requires 4 separate full maintenance cycles at least four hours apart before it will remove any data. The concern is that with the default "once a day" full maintenance, it will be 24 hours at the earliest, but up to 48 hours once a blob is no longer referenced by a needed snapshot. We could reduce this window to 4+ hours if full maintenance ran more often. But even if you ran full maintenance constantly (which we wouldn't actually want) it shouldn't put the data at risk because kopia's built-in safety mechanisms require GC to mark a blob as safe to delete during two separate full maint cycles at least 4 hours apart.

Yes. If you change the mode but not any of the margins, full maintenance has no effect other than consuming more resources; if you change the mode and also some of the margins, the data is put at risk.

@Lyndon-Li
Contributor

It looks like we probably can do that here:

p.FullCycle.Interval = overwriteFullMaintainInterval

		if overwriteFullMaintainInterval != time.Duration(0) {
			logger.Infof("Full maintenance interval change from %v to %v", p.FullCycle.Interval, overwriteFullMaintainInterval)
			p.FullCycle.Interval = overwriteFullMaintainInterval
		}

Maybe making this configurable is preferable to an "always use full maint" flag. Then we could recommend for users who want data to be deleted more quickly to set this to 6 or 12 hours instead of the default 24.

This looks more rational. The overwrite value could be passed in udmrepo.RepoOptions through the backupRepository config.
Besides, I have two more suggestions in this direction:

  1. It is neither necessary nor safe for users to set a specific time; p.FullCycle.Interval is also used and checked elsewhere in the Kopia code, so it should be kept within a reasonable range.
  2. We should keep the Unified Repo concepts even in the loose repo options, so we should avoid exposing Kopia parameters directly.
  3. Considering 1 and 2, I suggest we add the fastGC/eagerGC option. When this repo option is set, we overwrite the full maintenance interval to 12/6 hours.

@sseago
Collaborator

sseago commented Nov 4, 2024

@Lyndon-Li I think that's fine. 24/12/6 hour options should be sufficient. There's zero value in full maint more often than 4 hours, and exactly 4 hours could produce edge cases (i.e. last full maint marked this blob 3:59:58 ago and therefore it's too soon to delete now by 2 seconds), and 5 hours doesn't give you consistent day-to-day maint times. So 6 is realistically the smallest value that makes sense.

@mpryc
Contributor

mpryc commented Nov 5, 2024

@Lyndon-Li I really like your idea of having pre-set options. That makes it easy for the user to configure while preserving the underlying repo requirements (e.g. <4h doesn't make sense, so the user won't be able to set unacceptable parameters).

@kaovilai kaovilai changed the title from "Configurable Kopia Maintenance Mode" to "Configurable Kopia Maintenance Interval" Nov 5, 2024
@kaovilai kaovilai added this to the v1.16 milestone Dec 4, 2024
@reasonerjt
Contributor

@kaovilai
I'm tentatively adding "Needs Design", but if you think this is simple enough, feel free to clarify in a comment and remove the label.

@kaovilai kaovilai linked a pull request Jan 6, 2025 that will close this issue