KAFKA-15793 Fix ZkMigrationIntegrationTest#testMigrateTopicDeletions #17004

Merged: 27 commits merged into trunk on Sep 6, 2024

Conversation

mumrah
Member

@mumrah mumrah commented Aug 26, 2024

Increases a few timeouts and fixes some retry logic. Also removes the 3.4 metadata version (MV) test case, which is technically unsupported.

@mumrah
Member Author

mumrah commented Aug 31, 2024

After some timeout adjustments and removing the 3.4 case, this is looking a lot more stable.

I did a 10x deflake run and all the tests passed.

@mumrah mumrah changed the title from "Deflake ZkMigrationIntegrationTest#testMigrateTopicDeletions" to "KAFKA-15793 Fix ZkMigrationIntegrationTest#testMigrateTopicDeletions" Aug 31, 2024
@mumrah
Member Author

mumrah commented Aug 31, 2024

cc @soarez @divijvaidya who originally reported this flakiness

Member

@soarez soarez left a comment

Thanks for fixing this @mumrah.

Many of the new timeout values are oddly specific. How did you determine them?

@mumrah
Member Author

mumrah commented Sep 2, 2024

@soarez, thanks for taking a look!

> Many of the new timeout values are oddly specific

That's a trick I've used before when debugging timeout issues. Sometimes we don't get a good stack trace or log message on a timeout, so giving each timeout a unique value lets you correlate a failure back to a particular line.
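The unique-timeout trick described in this comment can be sketched as follows. This is a minimal illustration under assumptions, not the code from this PR: the `UniqueTimeouts` class, the `waitUntil` helper, and the constant names and values are all hypothetical, and Kafka's own test utilities may look different.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper for illustration; not Kafka's TestUtils. */
final class UniqueTimeouts {
    // Each wait site gets its own distinctive timeout, so a bare
    // "Timed out after N ms" message identifies the wait that failed.
    // Names and values are assumptions, not taken from the PR.
    static final long TOPIC_CREATED_TIMEOUT_MS = 60_001L;
    static final long TOPIC_DELETED_TIMEOUT_MS = 60_002L;

    /** Polls the condition until it holds or timeoutMs elapses. */
    static void waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                // The timeout value is the only clue in the message.
                throw new IllegalStateException("Timed out after " + timeoutMs + " ms");
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }
}
```

With this shape, a log line that says only `Timed out after 60002 ms` points at the wait that used `TOPIC_DELETED_TIMEOUT_MS`, even when the stack trace is missing or unhelpful.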

@mumrah mumrah requested a review from soarez September 2, 2024 15:20
@soarez
Member

soarez commented Sep 3, 2024

@mumrah having a look at the test report, it seems that all attempts at running this test are being skipped.
image

If any of the replicas is offline during the deletion, the topic is kept under zkClient.getTopicDeletions. Would that be useful to ensure the test can always be performed?

@mumrah mumrah added the tests Test fixes (including flaky tests) label Sep 3, 2024
@mumrah
Member Author

mumrah commented Sep 5, 2024

@soarez I never could figure out a reliable way to ensure there were pending deletions. Your suggestion could work, but I'd like to address that in another PR.

@soarez
Member

soarez commented Sep 5, 2024

@mumrah makes sense. Seems that was maybe an unlucky run anyway, as only one execution was skipped in the latest run:

Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [1] Type=ZK, MetadataVersion=3.5-IV2,Security=PLAINTEXT       11 sec  Passed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [1] Type=ZK, MetadataVersion=3.5-IV2,Security=PLAINTEXT       16 sec  Passed
Build / JDK 8 and Scala 2.12 / testMigrateTopicDeletions [1] Type=ZK, MetadataVersion=3.5-IV2,Security=PLAINTEXT        9.1 sec Passed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [2] Type=ZK, MetadataVersion=3.6-IV2,Security=PLAINTEXT       5 min 20 sec    Failed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [2] Type=ZK, MetadataVersion=3.6-IV2,Security=PLAINTEXT       5 min 19 sec    Failed
Build / JDK 8 and Scala 2.12 / testMigrateTopicDeletions [2] Type=ZK, MetadataVersion=3.6-IV2,Security=PLAINTEXT        9.5 sec Fixed
Build / JDK 8 and Scala 2.12 / testMigrateTopicDeletions [3] Type=ZK, MetadataVersion=3.7-IV0,Security=PLAINTEXT        11 sec  Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [3] Type=ZK, MetadataVersion=3.7-IV0,Security=PLAINTEXT       5 min 22 sec    Failed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [3] Type=ZK, MetadataVersion=3.7-IV0,Security=PLAINTEXT       5 min 19 sec    Failed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [4] Type=ZK, MetadataVersion=3.7-IV1,Security=PLAINTEXT       5 min 20 sec    Failed
Build / JDK 8 and Scala 2.12 / testMigrateTopicDeletions [4] Type=ZK, MetadataVersion=3.7-IV1,Security=PLAINTEXT        11 sec  Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [4] Type=ZK, MetadataVersion=3.7-IV1,Security=PLAINTEXT       12 sec  Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [5] Type=ZK, MetadataVersion=3.7-IV2,Security=PLAINTEXT       14 sec  Fixed
Build / JDK 8 and Scala 2.12 / testMigrateTopicDeletions [5] Type=ZK, MetadataVersion=3.7-IV2,Security=PLAINTEXT        8.7 sec Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [5] Type=ZK, MetadataVersion=3.7-IV2,Security=PLAINTEXT       10 sec  Fixed
Build / JDK 8 and Scala 2.12 / testMigrateTopicDeletions [6] Type=ZK, MetadataVersion=3.7-IV4,Security=PLAINTEXT        8.7 sec Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [6] Type=ZK, MetadataVersion=3.7-IV4,Security=PLAINTEXT       11 sec  Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [6] Type=ZK, MetadataVersion=3.7-IV4,Security=PLAINTEXT       9.6 sec Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [7] Type=ZK, MetadataVersion=3.8-IV0,Security=PLAINTEXT       5 min 19 sec    Failed
Build / JDK 8 and Scala 2.12 / testMigrateTopicDeletions [7] Type=ZK, MetadataVersion=3.8-IV0,Security=PLAINTEXT        8.8 sec Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [7] Type=ZK, MetadataVersion=3.8-IV0,Security=PLAINTEXT       10 sec  Fixed
Build / JDK 8 and Scala 2.12 / testMigrateTopicDeletions [8] Type=ZK, MetadataVersion=3.9-IV0,Security=PLAINTEXT        9.6 sec Fixed
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [8] Type=ZK, MetadataVersion=3.9-IV0,Security=PLAINTEXT       16 sec  Skipped
Build / JDK 21 and Scala 2.13 / testMigrateTopicDeletions [8] Type=ZK, MetadataVersion=3.9-IV0,Security=PLAINTEXT       17 sec  Fixed

However, there are 6 failures, all hitting the same 5-minute timeout while waiting for topics to be deleted:

org.opentest4j.AssertionFailedError: Timed out waiting for topics to be deleted
	at app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
	at app//org.junit.jupiter.api.Assertions.fail(Assertions.java:138)
	at app//kafka.zk.ZkMigrationIntegrationTest.testMigrateTopicDeletions(ZkMigrationIntegrationTest.scala:386)

Does this mean there's some race condition wherein after the migration the topics cannot be deleted? I'm wondering if the test should time out earlier and retry the deletion. If the topics cannot be deleted at all after the migration, that would be a serious bug.
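The "time out earlier and retry" idea raised here can be sketched generically as below. This is only an illustration of the suggestion, not what the PR ended up doing: `RetryingWait`, `retryUntil`, and all parameters are hypothetical names, and the `action` stands in for whatever step would be re-issued.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: several short attempts instead of one long wait. */
final class RetryingWait {
    /**
     * Runs the action, then polls the condition for up to attemptTimeoutMs.
     * Repeats up to maxAttempts times. Returns the 1-based attempt on which
     * the condition held, or -1 if every attempt timed out.
     */
    static int retryUntil(Runnable action, BooleanSupplier done,
                          int maxAttempts, long attemptTimeoutMs, long pollMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            action.run(); // e.g. re-issue the operation being waited on
            long deadline = System.nanoTime() + attemptTimeoutMs * 1_000_000L;
            while (System.nanoTime() - deadline < 0) {
                if (done.getAsBoolean()) {
                    return attempt;
                }
                sleep(pollMs);
            }
            // One last check so a condition met right at the deadline counts.
            if (done.getAsBoolean()) {
                return attempt;
            }
        }
        return -1;
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
    }
}
```

A caller would treat a return value of -1 as the real failure, so the worst case is bounded by `maxAttempts * attemptTimeoutMs` rather than a single 5-minute wait.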

@mumrah
Member Author

mumrah commented Sep 5, 2024

@soarez

> Does this mean there's some race condition wherein after the migration the topics cannot be deleted?

At this point in the test, we are waiting for ZK or KRaft to finalize one of the pending deletions. We're not actually doing an explicit delete. I wonder if the 30s timeout on the listTopics call is too short (others are 60s).

@mumrah
Member Author

mumrah commented Sep 6, 2024

@soarez I may have found a fix. In https://github.com/apache/kafka/actions/runs/10729396763 I was able to see the "Timed out waiting for topics to be deleted" failure case. In the logs, it looks like the topic hadn't finished being created by the time I manually put the deletion in ZK. I've added a condition to wait on a full ISR before continuing, and the next runs (both the 10x deflake run and this PR's build) were successful.

I've kicked off both again. If they both pass, I'd like to merge this 🤞
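The "wait on a full ISR" condition mentioned in this comment can be sketched roughly as follows. This is an assumption-laden illustration, not the check added in the PR: `FullIsrCheck`, `PartitionState`, and the in-memory map are hypothetical stand-ins for whatever metadata the real test reads from the cluster.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a "topic is fully created" check. */
final class FullIsrCheck {
    /** Snapshot of one partition: assigned replicas and in-sync replicas (broker ids). */
    static final class PartitionState {
        final Set<Integer> replicas;
        final Set<Integer> isr;

        PartitionState(Collection<Integer> replicas, Collection<Integer> isr) {
            this.replicas = new HashSet<>(replicas);
            this.isr = new HashSet<>(isr);
        }
    }

    /** True only if the topic is known and every partition's ISR covers all of its replicas. */
    static boolean hasFullIsr(Map<String, List<PartitionState>> topics, String topic) {
        List<PartitionState> partitions = topics.get(topic);
        if (partitions == null || partitions.isEmpty()) {
            return false; // topic not visible yet, so creation has not finished
        }
        for (PartitionState p : partitions) {
            if (!p.isr.containsAll(p.replicas)) {
                return false; // at least one replica has not caught up
            }
        }
        return true;
    }
}
```

A test would poll a condition like this before marking the topic for deletion, so the deletion is never written while the topic is still being created.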

Member

@soarez soarez left a comment

Thanks for fixing this @mumrah.

LGTM

@soarez soarez merged commit e4d108d into trunk Sep 6, 2024
7 of 8 checks passed
@mumrah mumrah deleted the gh-deflake-testMigrateTopicDeletions branch September 6, 2024 20:59
cmccabe pushed a commit that referenced this pull request Sep 10, 2024
Labels
tests Test fixes (including flaky tests)
3 participants