fix(meta): resolve deadlock caused by Hummock write stop #19989

Closed
wants to merge 3 commits into from

zwang28
Contributor

@zwang28 zwang28 commented Jan 2, 2025

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Before this PR, tables are unregistered from Hummock either after the drop-stream-job barrier succeeds or during recovery. However, there is a corner case that can cause a deadlock:

  1. The cluster encounters backpressure originating from Hummock, so the earliest barrier becomes stuck. The backpressure is expected to be resolved by Hummock compaction.
  2. A user drops the table. Meta removes the table from the catalog immediately on receiving the drop command, but the Hummock manager won't remove the table until the barrier finishes, and the barrier is already stuck.
  3. From then on, compaction tasks involving the dropped table always fail because the catalog and Hummock are inconsistent, so the backpressure never recovers. This is a deadlock: neither the barrier nor the compaction can make any progress.
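
The three steps above can be illustrated with a toy model. This is only a sketch: `MetaModel`, its fields, and its methods are hypothetical names invented for illustration and are not RisingWave's actual types or APIs. It assumes that compaction fails whenever Hummock still references a table missing from the catalog, and that no barrier can complete during write stop.

```rust
use std::collections::HashSet;

/// Toy model of the meta state described above. All names are hypothetical.
#[derive(Debug)]
struct MetaModel {
    /// Tables still present in the catalog.
    catalog: HashSet<u32>,
    /// Tables still registered in Hummock.
    hummock_tables: HashSet<u32>,
    /// True while Hummock is in write stop (the source of backpressure).
    write_stop: bool,
}

impl MetaModel {
    /// Step 2: the catalog forgets the table at once; Hummock does not.
    fn drop_table(&mut self, table_id: u32) {
        self.catalog.remove(&table_id);
    }

    /// Assumption: compaction fails while Hummock references a table
    /// that the catalog no longer knows. On success it lifts write stop.
    fn try_compact(&mut self) -> bool {
        if !self.hummock_tables.is_subset(&self.catalog) {
            return false;
        }
        self.write_stop = false;
        true
    }

    /// Assumption: no barrier completes during write stop. When the drop
    /// barrier does complete, Hummock unregisters the dropped tables.
    fn try_finish_drop_barrier(&mut self) -> bool {
        if self.write_stop {
            return false;
        }
        let catalog = &self.catalog;
        self.hummock_tables.retain(|t| catalog.contains(t));
        true
    }
}

/// Returns true if compaction or the barrier succeeds within `rounds` attempts.
fn makes_progress(model: &mut MetaModel, rounds: usize) -> bool {
    (0..rounds).any(|_| model.try_compact() || model.try_finish_drop_barrier())
}

/// Step 1: two tables, with Hummock already in write stop.
fn stalled_model() -> MetaModel {
    MetaModel {
        catalog: HashSet::from([1, 2]),
        hummock_tables: HashSet::from([1, 2]),
        write_stop: true,
    }
}
```

In this model, calling `drop_table(1)` on `stalled_model()` leaves both `try_compact` and `try_finish_drop_barrier` returning `false` forever, while the same model without the drop recovers on the first compaction attempt.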

This PR fixes it by having Hummock unregister tables immediately after the catalog has done so, but only if any of the dropped tables is causing the Hummock write stop. Note that during the period between the immediate unregistration of tables and the later dropping of actors by the mutation barrier:

  1. The actors related to the dropped tables will read incorrect data from Hummock. This is fine since those tables and their dependent tables will eventually be dropped anyway.
  2. Consequently, compute nodes may panic due to consistency check failures. This is expected, and the cluster will recover successfully afterwards.
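
The conditional unregistration described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `Unregister`, `unregister_policy`, and `on_drop` are hypothetical names, and the set of tables causing write stop is assumed to be available as an input.

```rust
use std::collections::HashSet;

/// When Hummock should unregister dropped tables. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
enum Unregister {
    /// Right after the catalog drop: the behavior this PR adds.
    Immediately,
    /// After the drop barrier succeeds or during recovery: the previous behavior.
    AfterDropBarrier,
}

/// Unregister immediately only if at least one dropped table is among
/// the tables causing write stop (assumed to be known by the caller).
fn unregister_policy(dropped: &HashSet<u32>, write_stop_tables: &HashSet<u32>) -> Unregister {
    if dropped.is_disjoint(write_stop_tables) {
        Unregister::AfterDropBarrier
    } else {
        Unregister::Immediately
    }
}

/// Applies the policy to the set of tables registered in Hummock.
fn on_drop(
    hummock_tables: &mut HashSet<u32>,
    dropped: &HashSet<u32>,
    write_stop_tables: &HashSet<u32>,
) -> Unregister {
    let policy = unregister_policy(dropped, write_stop_tables);
    if policy == Unregister::Immediately {
        // Remove the dropped tables so compaction no longer sees
        // tables that are missing from the catalog.
        hummock_tables.retain(|t| !dropped.contains(t));
    }
    policy
}
```

Keeping the deferred path as the default limits the window of incorrect reads and possible panics described in the two points above to the cases where a deadlock would otherwise occur.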

related #15144

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

@zwang28
Contributor Author

zwang28 commented Jan 3, 2025

This PR does not address the deadlock scenario where a write stall occurs between the issuance of a drop command and the application of a drop barrier.

@zwang28 zwang28 closed this Jan 3, 2025