Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-49297][CORE][TESTS] Fix race condition in BlockManagerDecommis…
…sionIntegrationSuite ### What changes were proposed in this pull request? Fixing race condition in the test "SPARK-46957: Migrated shuffle files should be able to cleanup from executor" of `BlockManagerDecommissionIntegrationSuite`. There are at least two race conditions in this test: 1) The `SparkListener` is running in a different thread from the main thread of the unit test so `shuffleBlockUpdates` must be accessed from synchronised block. For this `ConcurrentLinkedQueue` is used similarly to `taskEndEvents` in the `runDecomTest()` method. There was not reported error for this yet. 2) The `SparkListener` is informed earlier about a removed executor than the block manager `stop()` is called/finished. So latter when the test recursively iterating over the files to find the shuffle files the block manager's `stop()` might be running and deleting the underlying files recursively via the disk block manager. This leads to `java.nio.file.NoSuchFileException`, like: ``` - SPARK-46957: Migrated shuffle files should be able to cleanup from executor *** FAILED *** 18848 java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /home/runner/work/spark/spark/core/target/tmp/spark-87f59bc6-b996-42cd-9775-2f704b67f773/executor-e0a030d4-434b-46b3-bdbf-81d4908bb0f5/blockmgr-d88ed5dd-c1c8-4713-9433-10694e736a8e/3a 18849 at java.base/java.nio.file.FileTreeIterator.fetchNextIfNeeded(FileTreeIterator.java:87) 18850 at java.base/java.nio.file.FileTreeIterator.hasNext(FileTreeIterator.java:103) 18851 at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132) 18852 at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939) 18853 at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) 18854 at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) 18855 at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) 18856 at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 18857 at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) 18858 at org.apache.commons.io.FileUtils.toList(FileUtils.java:3025) 18859 ... 18860 Cause: java.nio.file.NoSuchFileException: /home/runner/work/spark/spark/core/target/tmp/spark-87f59bc6-b996-42cd-9775-2f704b67f773/executor-e0a030d4-434b-46b3-bdbf-81d4908bb0f5/blockmgr-d88ed5dd-c1c8-4713-9433-10694e736a8e/3a 18861 at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) 18862 at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) 18863 at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) 18864 at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) 18865 at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:171) 18866 at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99) 18867 at java.base/java.nio.file.Files.readAttributes(Files.java:1853) 18868 at java.base/java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:226) 18869 at java.base/java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:277) 18870 at java.base/java.nio.file.FileTreeWalker.next(FileTreeWalker.java:374) 18871 ... ``` To address this issue the local directory used by the decommissioned executor was find out and the wait is changed from getting the `ExecutorRemoved` event to wait for the delete of this directory to be finished. This way the directory tree walked after the delete. Disclaimer: The jira also mentions failures like: ``` [info] - SPARK-46957: Migrated shuffle files should be able to cleanup from executor *** FAILED *** (35 seconds, 200 milliseconds) 15718[info] 0 was not greater than or equal to 4 (BlockManagerDecommissionIntegrationSuite.scala:423) 15719[info] org.scalatest.exceptions.TestFailedException: 15720[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) 15721[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) 15722[info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) 15723[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) 15724[info] at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$10(BlockManagerDecommissionIntegrationSuite.scala:423) ``` In my reproduction only the failure with `java.nio.file.NoSuchFileException` was coming so it might be there are still something to fix even after this changes. ### Why are the changes needed? As Maven daily test with Java 21 was failing from time to time (flaky because of the race condition). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? It was tested by running the existing unit test repeatedly. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #48683 from attilapiros/SPARK-49297-3. Lead-authored-by: attilapiros <[email protected]> Co-authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
- Loading branch information