Integrate tablet repair scheduler #4188
Comments
@asias Is there any possibility of tracking the repair progress (e.g. via the Task Manager API)? EDIT: I just assumed it, but this API is async, right? We don't hang on the call until the repair is finished, right?
Currently it is a sync API; we will add support for an async API too. It is pretty trivial and it is in my queue. It is sync for now because the task manager API for waiting on a task was not available when the tablet repair scheduler was merged. When an API request repairs multiple tablets, it makes sense to show progress as the number of tablets that have finished. Currently, the task manager does not support this for tablet repair. @Deexie As the initial integration, I think we can skip the detailed progress report and dcs/hosts selection. SM can still report some progress, e.g., n out of m tables have finished.
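To illustrate the "n out of m tables have finished" idea, here is a minimal Python sketch of table-level progress on top of a blocking per-table repair call. The `repair_table` callable is a hypothetical stand-in for whatever client function issues the sync tablet repair request; it is not a real Scylla or SM function, and the table names are made up.

```python
from typing import Callable, Iterable, Iterator, Tuple


def repair_tables(
    tables: Iterable[str],
    repair_table: Callable[[str], None],
) -> Iterator[Tuple[int, int, str]]:
    """Repair tables one by one and yield (done, total, table) after each."""
    pending = list(tables)
    total = len(pending)
    for done, table in enumerate(pending, start=1):
        # Hypothetical blocking call: assumed to return only once the
        # repair request for this table has finished.
        repair_table(table)
        yield done, total, table


if __name__ == "__main__":
    def fake_repair(table: str) -> None:
        # Placeholder for the real (sync) repair request.
        pass

    for done, total, table in repair_tables(["ks.t1", "ks.t2", "ks.t3"], fake_repair):
        print(f"{done} out of {total} tables have finished (last: {table})")
```

With a sync API this is about as much granularity as the caller can get; per-tablet progress would need support on the task manager side, as noted above.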
@asias thanks for the explanation!
So the suggestion is to repair all of the table's tablets in a single call (via the
But I guess that batching tablets would result in degraded performance (for the same reason why batching tokens was worse than sending them all with |
The number of tablets of a given table changes over time due to merges and splits, so it would be hard for the manager to track which tablets need to be repaired and to batch them. Yes, if SM batches, the cluster may not be fully utilized for repair even though it could repair more tablets. When a tablet repair API request is issued, it retries by itself if some tablets fail to repair. It would be best to also have a pause API for a given request, for the purpose of efficient resume.
Got it, so to summarize, SM will use this API only for repairing tablet tables:
Yes, the new tablet repair API uses the tablet repair scheduler, which integrates well with tablet migrations, so there is no need to stop tablet migration during repair.
Yes, repair with tokens=all, but we are going to add an async API very soon.
We do not support dcs or hosts selection with the tablet repair API. After we have scylladb/scylladb#22417, we can switch the partial repair to use the tablet repair API to select DCs and hosts. Does that make sense to you?
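As a rough sketch of how the caller could decide which path to take until dcs/hosts selection is supported, the check below is one possible reading of the points above. The function name and parameters are hypothetical and are not part of SM or Scylla.

```python
from typing import Sequence


def can_use_tablet_repair_api(
    is_tablet_table: bool,
    dcs: Sequence[str] = (),
    hosts: Sequence[str] = (),
) -> bool:
    """Return True only for a full repair of a tablet table.

    Assumption: any dcs/hosts selection (a partial repair) cannot go through
    the tablet repair API yet, so such requests need a different path.
    """
    return is_tablet_table and not dcs and not hosts
```

For example, `can_use_tablet_repair_api(True)` holds, while `can_use_tablet_repair_api(True, dcs=["dc1"])` does not.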
For pausing and resuming a "large" tablet repair request: scylladb/scylladb#22419
In Scylla commit 0d2583600d1325f2064a0d5d776bcf50660a5a42 (Merge 'Add tablet repair scheduler support' from Asias He), the tablet repair scheduler is implemented and a new tablet repair API is added. With this new API, repair requests are scheduled by Scylla core along with other tablet tasks, e.g., migration and rebuild. There is no longer any need for the management tool to schedule repair tasks on different nodes. One can call this API on any node to repair the tablets of a given table.
Note: Node and DC selection are not supported yet. We might support them later; there is a PR: scylladb/scylladb#21985.