[BugFix] Fix several problem for cluster snapshot backup #55177

srlch · 2025-01-17T01:09:29Z

What I'm doing:

Fix several problem for cluster snapshot backup:

remove finished_time column in cluster_snapshots sys table
fix incorrect persistent and read meta data from image for ClusterSnapshotMgr
Skip first run for ClusterSnapshotCheckpointScheduler
Make sure the checkpoint begin in ClusterSnapshotCheckpointScheduler only
if the checkpointJournanlid is greater than image version
Fix incorrect isUnFinishedState function
Clear all automated snapshot from remote storage if use turn off or new
snapshot is created.
use resetLastUnFinishedAutomatedSnapshotJob to reset the unfinished job
if fe restart or leader change.

Fixes #issue

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function
This is a backport pr

Bugfix cherry-pick branch check:

starrocks-cr · 2025-01-17T01:11:27Z

fe/fe-core/src/main/java/com/starrocks/lake/snapshot/ClusterSnapshotMgr.java

+
+        automatedSnapshotSvName = data.getAutomatedSnapshotSvName();
+        automatedSnapshot = data.getAutomatedSnapshot();
+        historyAutomatedSnapshotJobs = data.getHistoryAutomatedSnapshotJobs();
    }

    @Override


The most risky bug in this code is:
The use of ClusterSnapshotUtils.clearAllAutomatedSnapshotFromRemote(null) which may result in unintended deletions if a null argument is used incorrectly, indicating an unsafe invocation.

You can modify the code like this:

public void setAutomatedSnapshotOff(AdminSetAutomatedSnapshotOffStmt stmt) { setAutomatedSnapshotOff(); ClusterSnapshotLog log = new ClusterSnapshotLog(); log.setDropSnapshot(AUTOMATED_NAME_PREFIX); GlobalStateMgr.getCurrentState().getEditLog().logClusterSnapshotLog(log); try { // Ensure the method correctly targets intended snapshots without using null. String snapshotName = (automatedSnapshot != null) ? automatedSnapshot.getSnapshotName() : null; if (snapshotName != null) { ClusterSnapshotUtils.clearAllAutomatedSnapshotFromRemote(snapshotName); } } catch (StarRocksException e) { LOG.warn("Cluster Snapshot delete failed, err msg: {}", e.getMessage()); } }

This ensures that only defined snapshot names are passed to avoid accidental deletions or errors.

starrocks-cr · 2025-01-17T01:11:30Z

fe/fe-core/src/main/java/com/starrocks/lake/snapshot/ClusterSnapshotUtils.java

+            if (path.getName().startsWith(ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX)) {
+                HdfsUtil.deletePath(status.path, brokerDesc);       
+            }
+        }
    }
 }


The most risky bug in this code is:
Potentially unsafe deletion of files without confirmation.

You can modify the code like this:

public static void clearAllAutomatedSnapshotFromRemote(String notAllowDelete) throws StarRocksException { StorageVolume sv = GlobalStateMgr.getCurrentState().getClusterSnapshotMgr().getAutomatedSnapshotSv(); BrokerDesc brokerDesc = new BrokerDesc(sv.getProperties()); String snapshotImageRootPath = getSnapshotImagePath(sv, ""); List<TBrokerFileStatus> statuses = HdfsUtil.listPath(snapshotImageRootPath + ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX + "*", false, sv.getProperties()); for (TBrokerFileStatus status : statuses) { Path path = new Path(status.path); if (notAllowDelete != null && path.getName().equals(notAllowDelete)) { continue; } if (path.getName().startsWith(ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX)) { LOG.warn("Attempting to delete snapshot file: " + status.path); // Add logging before deletion // Consider adding a confirmation mechanism or revisiting the conditions here to ensure safe deletion. HdfsUtil.deletePath(status.path, brokerDesc); } } }

Explanation: The method clearAllAutomatedSnapshotFromRemote potentially deletes snapshots based on a prefix pattern without explicit confirmation or secondary checks. Adding logging provides a trace for deleted paths, and considering a confirmation mechanism can guard against accidental data loss.

starrocks-cr · 2025-01-17T01:11:31Z

fe/fe-core/src/main/java/com/starrocks/lake/snapshot/ClusterSnapshotCheckpointScheduler.java

+                if (!createStarMgrImageRet.first) {
+                    errMsg = "checkpoint failed for starMgr image: " + createStarMgrImageRet.second;
+                    break;
+                }
            }
            LOG.info("Finished create image for starMgr image, version: {}", consistentIds.second);



The most risky bug in this code is:
There is a possibility of bypassing the creation of FE or starMgr images if their journal IDs are equal to or greater than the checkpoint journal IDs. This might be unintended behavior and could lead to missing checkpoints.

You can modify the code like this:

protected void runCheckpointScheduler() { job.logJob(); Pair<Long, Long> getFEIdsRet = feController.getCheckpointJournalIds(); long feImageJournalId = getFEIdsRet.first; long feCheckpointJournalId = consistentIds.first; // Ensure checkpoint is created even if IDs are equal Pair<Boolean, String> createFEImageRet = feController.runCheckpointControllerWithIds(feImageJournalId, feCheckpointJournalId); if (!createFEImageRet.first) { errMsg = "checkpoint failed for FE image: " + createFEImageRet.second; break; } LOG.info("Finished create image for FE image, version: {}", consistentIds.first); Pair<Long, Long> getStarMgrIdsRet = starMgrController.getCheckpointJournalIds(); long starMgrImageJournalId = getStarMgrIdsRet.first; long starMgrCheckpointJournalId = consistentIds.second; // Ensure checkpoint is created even if IDs are equal Pair<Boolean, String> createStarMgrImageRet = starMgrController.runCheckpointControllerWithIds(starMgrImageJournalId, starMgrCheckpointJournalId); if (!createStarMgrImageRet.first) { errMsg = "checkpoint failed for starMgr image: " + createStarMgrImageRet.second; break; } LOG.info("Finished create image for starMgr image, version: {}", consistentIds.second); }

Signed-off-by: srlch <[email protected]>

sonarqubecloud · 2025-01-20T11:19:05Z

Quality Gate passed

Issues
7 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.5% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2025-01-20T13:06:18Z

[Java-Extensions Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions · 2025-01-20T13:08:56Z

[FE Incremental Coverage Report]

❌ fail : 81 / 121 (66.94%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
🔵	com/starrocks/lake/snapshot/ClusterSnapshotJob.java	3	8	37.50%	[110, 116, 124, 128, 132]
🔵	com/starrocks/lake/snapshot/ClusterSnapshotMgr.java	47	75	62.67%	[101, 114, 115, 154, 170, 227, 228, 229, 230, 231, 233, 235, 240, 241, 242, 243, 244, 250, 251, 252, 253, 255, 256, 257, 259, 261, 262, 264]
🔵	com/starrocks/lake/snapshot/ClusterSnapshotCheckpointScheduler.java	19	26	73.08%	[93, 94, 97, 108, 109, 112, 131]
🔵	com/starrocks/catalog/CatalogRecycleBin.java	1	1	100.00%	[]
🔵	com/starrocks/server/GlobalStateMgr.java	4	4	100.00%	[]
🔵	com/starrocks/lake/snapshot/ClusterSnapshotUtils.java	1	1	100.00%	[]
🔵	com/starrocks/persist/ClusterSnapshotLog.java	1	1	100.00%	[]
🔵	com/starrocks/leader/CheckpointController.java	5	5	100.00%	[]

github-actions · 2025-01-20T13:18:07Z

[BE Incremental Coverage Report]

❌ fail : 0 / 3 (00.00%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
🔵	be/src/exec/schema_scanner/schema_cluster_snapshot_jobs_scanner.cpp	0	3	00.00%	[54, 55, 56]

srlch requested review from a team as code owners January 17, 2025 01:09

wanpengfei-git added the PROTO-REVIEW label Jan 17, 2025

wanpengfei-git requested a review from a team January 17, 2025 01:10

mergify bot assigned srlch Jan 17, 2025

starrocks-cr bot reviewed Jan 17, 2025

View reviewed changes

github-actions bot added the 3.4 label Jan 17, 2025

fix

50c592c

Signed-off-by: srlch <[email protected]>

srlch force-pushed the fix_cluster_snapshot_backup branch from 429e5b9 to 50c592c Compare January 20, 2025 06:03

wanpengfei-git added the META-REVIEW label Jan 20, 2025

wanpengfei-git requested a review from a team January 20, 2025 06:04

srlch added 3 commits January 20, 2025 15:06

fix

c71ca8b

Signed-off-by: srlch <[email protected]>

fix

0f2a27c

Signed-off-by: srlch <[email protected]>

fix

115c4e8

Signed-off-by: srlch <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BugFix] Fix several problem for cluster snapshot backup #55177

[BugFix] Fix several problem for cluster snapshot backup #55177

srlch commented Jan 17, 2025 •

edited

Loading

starrocks-cr bot Jan 17, 2025

starrocks-cr bot Jan 17, 2025

starrocks-cr bot Jan 17, 2025

sonarqubecloud bot commented Jan 20, 2025

github-actions bot commented Jan 20, 2025

github-actions bot commented Jan 20, 2025

github-actions bot commented Jan 20, 2025

[BugFix] Fix several problem for cluster snapshot backup #55177

Are you sure you want to change the base?

[BugFix] Fix several problem for cluster snapshot backup #55177

Conversation

srlch commented Jan 17, 2025 • edited Loading

What I'm doing:

What type of PR is this:

Checklist:

Bugfix cherry-pick branch check:

starrocks-cr bot Jan 17, 2025

Choose a reason for hiding this comment

starrocks-cr bot Jan 17, 2025

Choose a reason for hiding this comment

starrocks-cr bot Jan 17, 2025

Choose a reason for hiding this comment

sonarqubecloud bot commented Jan 20, 2025

Quality Gate passed

github-actions bot commented Jan 20, 2025

[Java-Extensions Incremental Coverage Report]

github-actions bot commented Jan 20, 2025

[FE Incremental Coverage Report]

file detail

github-actions bot commented Jan 20, 2025

[BE Incremental Coverage Report]

file detail

srlch commented Jan 17, 2025 •

edited

Loading