[BugFix] Fix several problem for cluster snapshot backup #55177
base: main
automatedSnapshotSvName = data.getAutomatedSnapshotSvName();
automatedSnapshot = data.getAutomatedSnapshot();
historyAutomatedSnapshotJobs = data.getHistoryAutomatedSnapshotJobs();
}

@Override
The most risky bug in this code is:
The use of ClusterSnapshotUtils.clearAllAutomatedSnapshotFromRemote(null), which may result in unintended deletions if a null argument is used incorrectly, indicating an unsafe invocation.
You can modify the code like this:
public void setAutomatedSnapshotOff(AdminSetAutomatedSnapshotOffStmt stmt) {
setAutomatedSnapshotOff();
ClusterSnapshotLog log = new ClusterSnapshotLog();
log.setDropSnapshot(AUTOMATED_NAME_PREFIX);
GlobalStateMgr.getCurrentState().getEditLog().logClusterSnapshotLog(log);
try {
// Ensure the method correctly targets intended snapshots without using null.
String snapshotName = (automatedSnapshot != null) ? automatedSnapshot.getSnapshotName() : null;
if (snapshotName != null) {
ClusterSnapshotUtils.clearAllAutomatedSnapshotFromRemote(snapshotName);
}
} catch (StarRocksException e) {
LOG.warn("Cluster Snapshot delete failed, err msg: {}", e.getMessage());
}
}
This ensures that only defined snapshot names are passed to avoid accidental deletions or errors.
if (path.getName().startsWith(ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX)) {
HdfsUtil.deletePath(status.path, brokerDesc);
}
}
}
}
The most risky bug in this code is:
Potentially unsafe deletion of files without confirmation.
You can modify the code like this:
public static void clearAllAutomatedSnapshotFromRemote(String notAllowDelete) throws StarRocksException {
StorageVolume sv = GlobalStateMgr.getCurrentState().getClusterSnapshotMgr().getAutomatedSnapshotSv();
BrokerDesc brokerDesc = new BrokerDesc(sv.getProperties());
String snapshotImageRootPath = getSnapshotImagePath(sv, "");
List<TBrokerFileStatus> statuses =
HdfsUtil.listPath(snapshotImageRootPath + ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX + "*",
false, sv.getProperties());
for (TBrokerFileStatus status : statuses) {
Path path = new Path(status.path);
if (notAllowDelete != null && path.getName().equals(notAllowDelete)) {
continue;
}
if (path.getName().startsWith(ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX)) {
LOG.warn("Attempting to delete snapshot file: {}", status.path); // Log each path before deletion
// Consider adding a confirmation mechanism or revisiting the conditions here to ensure safe deletion.
HdfsUtil.deletePath(status.path, brokerDesc);
}
}
}
Explanation: The method clearAllAutomatedSnapshotFromRemote potentially deletes snapshots based on a prefix pattern without explicit confirmation or secondary checks. Logging each path before deletion leaves a trace of what was removed, and a confirmation mechanism could guard against accidental data loss.
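The selection rule in the suggested method can be sketched in isolation, without any remote storage. This is a minimal illustration, not the PR's code: the class name, the selectForDeletion helper, and the prefix value are hypothetical, and plain strings stand in for the TBrokerFileStatus paths. It shows that a null notAllowDelete excludes nothing, so every entry carrying the automated prefix is selected, while a non-null name is kept.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotCleanupSketch {
    // Assumed placeholder; the real value lives in ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX.
    static final String AUTOMATED_NAME_PREFIX = "automated_cluster_snapshot";

    // Hypothetical helper: returns the names that the deletion loop would remove.
    static List<String> selectForDeletion(List<String> names, String notAllowDelete) {
        List<String> toDelete = new ArrayList<>();
        for (String name : names) {
            // Skip the single snapshot that must be kept, if one was given.
            if (notAllowDelete != null && name.equals(notAllowDelete)) {
                continue;
            }
            // Only entries with the automated prefix are candidates for deletion.
            if (name.startsWith(AUTOMATED_NAME_PREFIX)) {
                toDelete.add(name);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "automated_cluster_snapshot_1", "automated_cluster_snapshot_2", "manual_backup");
        System.out.println(selectForDeletion(names, null));
        System.out.println(selectForDeletion(names, "automated_cluster_snapshot_2"));
    }
}
```

Under these assumptions, passing the current snapshot name instead of null keeps that snapshot on remote storage, so the two calls are not interchangeable.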
if (!createStarMgrImageRet.first) {
errMsg = "checkpoint failed for starMgr image: " + createStarMgrImageRet.second;
break;
}
}
LOG.info("Finished create image for starMgr image, version: {}", consistentIds.second);
The most risky bug in this code is:
There is a possibility of bypassing the creation of FE or starMgr images if their journal IDs are equal to or greater than the checkpoint journal IDs. This might be unintended behavior and could lead to missing checkpoints.
You can modify the code like this:
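// NOTE: as written, the break statements below need an enclosing loop (not shown in this snippet) to compile.
protected void runCheckpointScheduler() {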
job.logJob();
Pair<Long, Long> getFEIdsRet = feController.getCheckpointJournalIds();
long feImageJournalId = getFEIdsRet.first;
long feCheckpointJournalId = consistentIds.first;
// Ensure checkpoint is created even if IDs are equal
Pair<Boolean, String> createFEImageRet = feController.runCheckpointControllerWithIds(feImageJournalId, feCheckpointJournalId);
if (!createFEImageRet.first) {
errMsg = "checkpoint failed for FE image: " + createFEImageRet.second;
break;
}
LOG.info("Finished create image for FE image, version: {}", consistentIds.first);
Pair<Long, Long> getStarMgrIdsRet = starMgrController.getCheckpointJournalIds();
long starMgrImageJournalId = getStarMgrIdsRet.first;
long starMgrCheckpointJournalId = consistentIds.second;
// Ensure checkpoint is created even if IDs are equal
Pair<Boolean, String> createStarMgrImageRet = starMgrController.runCheckpointControllerWithIds(starMgrImageJournalId, starMgrCheckpointJournalId);
if (!createStarMgrImageRet.first) {
errMsg = "checkpoint failed for starMgr image: " + createStarMgrImageRet.second;
break;
}
LOG.info("Finished create image for starMgr image, version: {}", consistentIds.second);
}
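The comparison that this comment describes can be written as a small predicate. This is a sketch under assumptions: the class and method names are hypothetical and do not come from the PR. It only encodes the stated condition that image creation is bypassed when the image journal ID is equal to or greater than the checkpoint journal ID.

```java
final class CheckpointDecisionSketch {
    // Hypothetical helper: true when a new image would be created, that is,
    // when the existing image is behind the requested checkpoint journal ID.
    static boolean shouldCreateImage(long imageJournalId, long checkpointJournalId) {
        return imageJournalId < checkpointJournalId;
    }

    public static void main(String[] args) {
        System.out.println(shouldCreateImage(100L, 120L)); // image is behind the checkpoint
        System.out.println(shouldCreateImage(120L, 120L)); // IDs are equal
        System.out.println(shouldCreateImage(130L, 120L)); // image is ahead of the checkpoint
    }
}
```

The suggested code above removes this guard and always runs the checkpoint; whether that is the desired behavior for the equal and greater cases is a decision for the PR author.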
Signed-off-by: srlch <[email protected]>
429e5b9 to 50c592c
Quality Gate passed
[Java-Extensions Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[FE Incremental Coverage Report] ❌ fail : 81 / 121 (66.94%)
[BE Incremental Coverage Report] ❌ fail : 0 / 3 (00.00%)
What I'm doing:
Fix several problems for cluster snapshot backup:
cluster_snapshots sys table
if the checkpoint journal ID is greater than the image version
isUnFinishedState function
snapshot is created.
resetLastUnFinishedAutomatedSnapshotJob to reset the unfinished job if the FE restarts or the leader changes.