Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BugFix] Fix several problem for cluster snapshot backup #55177

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

srlch
Copy link
Contributor

@srlch srlch commented Jan 17, 2025

What I'm doing:

Fix several problem for cluster snapshot backup:

  1. remove finished_time column in cluster_snapshots sys table
  2. fix incorrect persistent and read meta data from image for ClusterSnapshotMgr
  3. Skip first run for ClusterSnapshotCheckpointScheduler
  4. Make sure the checkpoint begin in ClusterSnapshotCheckpointScheduler only
    if the checkpointJournanlid is greater than image version
  5. Fix incorrect isUnFinishedState function
  6. Clear all automated snapshot from remote storage if use turn off or new
    snapshot is created.
  7. use resetLastUnFinishedAutomatedSnapshotJob to reset the unfinished job
    if fe restart or leader change.

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@srlch srlch requested review from a team as code owners January 17, 2025 01:09
@wanpengfei-git wanpengfei-git requested a review from a team January 17, 2025 01:10
@mergify mergify bot assigned srlch Jan 17, 2025

automatedSnapshotSvName = data.getAutomatedSnapshotSvName();
automatedSnapshot = data.getAutomatedSnapshot();
historyAutomatedSnapshotJobs = data.getHistoryAutomatedSnapshotJobs();
}

@Override
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most risky bug in this code is:
The use of ClusterSnapshotUtils.clearAllAutomatedSnapshotFromRemote(null) which may result in unintended deletions if a null argument is used incorrectly, indicating an unsafe invocation.

You can modify the code like this:

public void setAutomatedSnapshotOff(AdminSetAutomatedSnapshotOffStmt stmt) {
    setAutomatedSnapshotOff();

    ClusterSnapshotLog log = new ClusterSnapshotLog();
    log.setDropSnapshot(AUTOMATED_NAME_PREFIX);
    GlobalStateMgr.getCurrentState().getEditLog().logClusterSnapshotLog(log);

    try {
        // Ensure the method correctly targets intended snapshots without using null.
        String snapshotName = (automatedSnapshot != null) ? automatedSnapshot.getSnapshotName() : null;
        if (snapshotName != null) {
            ClusterSnapshotUtils.clearAllAutomatedSnapshotFromRemote(snapshotName);
        }
    } catch (StarRocksException e) {
        LOG.warn("Cluster Snapshot delete failed, err msg: {}", e.getMessage());
    }
}

This ensures that only defined snapshot names are passed to avoid accidental deletions or errors.

if (path.getName().startsWith(ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX)) {
HdfsUtil.deletePath(status.path, brokerDesc);
}
}
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most risky bug in this code is:
Potentially unsafe deletion of files without confirmation.

You can modify the code like this:

    public static void clearAllAutomatedSnapshotFromRemote(String notAllowDelete) throws StarRocksException {
        StorageVolume sv = GlobalStateMgr.getCurrentState().getClusterSnapshotMgr().getAutomatedSnapshotSv();
        BrokerDesc brokerDesc = new BrokerDesc(sv.getProperties());
        String snapshotImageRootPath = getSnapshotImagePath(sv, "");
        List<TBrokerFileStatus> statuses =
                HdfsUtil.listPath(snapshotImageRootPath + ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX + "*",
                        false, sv.getProperties());

        for (TBrokerFileStatus status : statuses) {
            Path path = new Path(status.path);
            if (notAllowDelete != null && path.getName().equals(notAllowDelete)) {
                continue;
            }
            if (path.getName().startsWith(ClusterSnapshotMgr.AUTOMATED_NAME_PREFIX)) {
                LOG.warn("Attempting to delete snapshot file: " + status.path);  // Add logging before deletion
                // Consider adding a confirmation mechanism or revisiting the conditions here to ensure safe deletion.
                HdfsUtil.deletePath(status.path, brokerDesc);       
            }
        }
    }

Explanation: The method clearAllAutomatedSnapshotFromRemote potentially deletes snapshots based on a prefix pattern without explicit confirmation or secondary checks. Adding logging provides a trace for deleted paths, and considering a confirmation mechanism can guard against accidental data loss.

if (!createStarMgrImageRet.first) {
errMsg = "checkpoint failed for starMgr image: " + createStarMgrImageRet.second;
break;
}
}
LOG.info("Finished create image for starMgr image, version: {}", consistentIds.second);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most risky bug in this code is:
There is a possibility of bypassing the creation of FE or starMgr images if their journal IDs are equal to or greater than the checkpoint journal IDs. This might be unintended behavior and could lead to missing checkpoints.

You can modify the code like this:

protected void runCheckpointScheduler() {
    job.logJob();

    Pair<Long, Long> getFEIdsRet = feController.getCheckpointJournalIds();
    long feImageJournalId = getFEIdsRet.first;
    long feCheckpointJournalId = consistentIds.first;

    // Ensure checkpoint is created even if IDs are equal
    Pair<Boolean, String> createFEImageRet = feController.runCheckpointControllerWithIds(feImageJournalId, feCheckpointJournalId);
    if (!createFEImageRet.first) {
        errMsg = "checkpoint failed for FE image: " + createFEImageRet.second;
        break;
    }
    
    LOG.info("Finished create image for FE image, version: {}", consistentIds.first);

    Pair<Long, Long> getStarMgrIdsRet = starMgrController.getCheckpointJournalIds();
    long starMgrImageJournalId = getStarMgrIdsRet.first;
    long starMgrCheckpointJournalId = consistentIds.second;

    // Ensure checkpoint is created even if IDs are equal
    Pair<Boolean, String> createStarMgrImageRet = starMgrController.runCheckpointControllerWithIds(starMgrImageJournalId, starMgrCheckpointJournalId);
    if (!createStarMgrImageRet.first) {
        errMsg = "checkpoint failed for starMgr image: " + createStarMgrImageRet.second;
        break;
    }
    
    LOG.info("Finished create image for starMgr image, version: {}", consistentIds.second);
}

@github-actions github-actions bot added the 3.4 label Jan 17, 2025
Signed-off-by: srlch <[email protected]>
@srlch srlch force-pushed the fix_cluster_snapshot_backup branch from 429e5b9 to 50c592c Compare January 20, 2025 06:03
@wanpengfei-git wanpengfei-git requested a review from a team January 20, 2025 06:04
srlch added 3 commits January 20, 2025 15:06
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Copy link

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

Copy link

[FE Incremental Coverage Report]

fail : 81 / 121 (66.94%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/lake/snapshot/ClusterSnapshotJob.java 3 8 37.50% [110, 116, 124, 128, 132]
🔵 com/starrocks/lake/snapshot/ClusterSnapshotMgr.java 47 75 62.67% [101, 114, 115, 154, 170, 227, 228, 229, 230, 231, 233, 235, 240, 241, 242, 243, 244, 250, 251, 252, 253, 255, 256, 257, 259, 261, 262, 264]
🔵 com/starrocks/lake/snapshot/ClusterSnapshotCheckpointScheduler.java 19 26 73.08% [93, 94, 97, 108, 109, 112, 131]
🔵 com/starrocks/catalog/CatalogRecycleBin.java 1 1 100.00% []
🔵 com/starrocks/server/GlobalStateMgr.java 4 4 100.00% []
🔵 com/starrocks/lake/snapshot/ClusterSnapshotUtils.java 1 1 100.00% []
🔵 com/starrocks/persist/ClusterSnapshotLog.java 1 1 100.00% []
🔵 com/starrocks/leader/CheckpointController.java 5 5 100.00% []

Copy link

[BE Incremental Coverage Report]

fail : 0 / 3 (00.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/exec/schema_scanner/schema_cluster_snapshot_jobs_scanner.cpp 0 3 00.00% [54, 55, 56]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants