WARN EXCEPTION "virtual-map: cache-cleaner StandardFuture: Future has already been cancelled" #17219

Closed
alex-kuzmin-hg opened this issue Jan 4, 2025 · 3 comments · Fixed by #17475
Assignees
Labels
Platform Tickets pertaining to the platform
Milestone

Comments

@alex-kuzmin-hg
Contributor

Description

v0.58.1, 100M accounts / 100M NFTs NLG load test on Latitude

new warning:

2025-01-04 07:21:06.147 231130   WARN  EXCEPTION        <<virtual-map: cache-cleaner #4>> StandardFuture: Future has already been cancelled
com.swirlds.common.threading.futures.StandardFuture.cancelWithError(StandardFuture.java:367)
	at com.swirlds.common.threading.futures.StandardFuture.cancelWithError(StandardFuture.java:351)
	at com.swirlds.virtualmap.internal.cache.ConcurrentArray.lambda$parallelTraverse$0(ConcurrentArray.java:357)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)

Reproduces on all nodes.

Log: https://perf.analytics.eng.hashgraph.io/ephemeral/v0.58.1_Latitude_N1_01042025/network-node1_swirlds.log

The warning did not appear in v0.58.0.

Steps to reproduce

Regular 100M NLG test

Additional context

All logs: https://perf.analytics.eng.hashgraph.io/ephemeral/v0.58.1_Latitude_N1_01042025

Hedera network

other

Version

v0.58.1

Operating system

Linux

@poulok poulok added the Platform Tickets pertaining to the platform label Jan 6, 2025
@artemananiev artemananiev self-assigned this Jan 6, 2025
@poulok
Member

poulok commented Jan 13, 2025

@alex-kuzmin-hg what was the impact to the node/network?

@alex-kuzmin-hg
Contributor Author

@poulok No visible side effects or performance degradation observed. However, since this warning is new and may indicate an unhealthy state, we need a proper evaluation from the engineers here.

@artemananiev
Member

The exception was thrown in ConcurrentArray.parallelTraverse(), which is called from three places:

  • VirtualNodeCache.deletedLeaves() - a part of a flush, called on the lifecycle thread
  • VirtualNodeCache.filterMutations() - a part of a flush, called on the lifecycle thread
  • VirtualNodeCache.purge() - a part of node cache release, called on a thread in the cleaning thread pool

In the first two cases, the future returned from parallelTraverse() is checked for exceptions with getAndRethrow(). Any exception rethrown this way on the lifecycle thread would be very visible in the logs, and would most likely kill or hang the process. I therefore assume the exception reported in this bug was thrown from purge(). It may cause a memory leak, but the leak would result in an OOME only if there were a lot of such exceptions.
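
To illustrate the difference between a caller that checks the returned future and one that does not, here is a minimal sketch using the JDK's CompletableFuture rather than StandardFuture. The class and method names (FutureCheckDemo, failingTraverse, checkedCall, uncheckedCall) are hypothetical and are not part of the platform code; get() stands in for getAndRethrow() only by analogy.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FutureCheckDemo {

    // Hypothetical stand-in for a traversal that fails on a worker thread.
    static CompletableFuture<Void> failingTraverse(ExecutorService pool) {
        return CompletableFuture.runAsync(() -> {
            throw new IllegalStateException("traversal failed");
        }, pool);
    }

    // Flush-like caller: waits on the future, so the failure is rethrown to it.
    static String checkedCall(ExecutorService pool) throws InterruptedException {
        try {
            failingTraverse(pool).get();
            return "no error";
        } catch (ExecutionException e) {
            return "caller saw: " + e.getCause().getMessage();
        }
    }

    // Purge-like caller (assumed behaviour): drops the future, so the
    // failure never reaches the caller.
    static String uncheckedCall(ExecutorService pool) {
        failingTraverse(pool);
        return "no error";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            System.out.println(checkedCall(pool));   // caller saw: traversal failed
            System.out.println(uncheckedCall(pool)); // no error
        } finally {
            pool.shutdown();
        }
    }
}
```

In the second call the only trace of the failure is whatever the worker itself logs, which would be consistent with a WARN appearing on a cache-cleaner thread without any visible effect on the node.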

I don't know what could go wrong during purge(), so as a first step I'm going to improve logging in parallelTraverse() so that all underlying exceptions are properly propagated.
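
As a rough sketch of what "propagating the underlying exception" could look like, assuming a simplified parallel traversal built on CompletableFuture: the stack trace above suggests the warning came from cancelWithError() being called on a future that was already cancelled, which would hide whatever failed first. This is a reading of the trace, not something confirmed in this thread. The helper below is hypothetical and is not the ConcurrentArray API. It relies on completeExceptionally() returning false, rather than throwing, when the future is already completed, so the first cause is kept.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class ParallelTraverseSketch {

    // Hypothetical helper: applies `action` to every element of every chunk,
    // one pool task per chunk, and returns a future for the whole traversal.
    static <T> CompletableFuture<Void> parallelTraverse(
            List<List<T>> chunks, ExecutorService pool, Consumer<T> action) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        if (chunks.isEmpty()) {
            result.complete(null);
            return result;
        }
        AtomicInteger remaining = new AtomicInteger(chunks.size());
        for (List<T> chunk : chunks) {
            pool.execute(() -> {
                try {
                    chunk.forEach(action);
                } catch (Throwable t) {
                    // Returns false instead of throwing if another chunk has
                    // already failed, so the first underlying cause survives.
                    result.completeExceptionally(t);
                } finally {
                    if (remaining.decrementAndGet() == 0) {
                        result.complete(null); // no-op if already failed
                    }
                }
            });
        }
        return result;
    }

    // Waits for the traversal; returns the underlying failure message,
    // or null if the traversal succeeded.
    static String awaitAndDescribe(CompletableFuture<Void> f) throws InterruptedException {
        try {
            f.get();
            return null;
        } catch (ExecutionException e) {
            return e.getCause().getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<List<Integer>> chunks = List.of(List.of(1, 2), List.of(3, 4));
            Consumer<Integer> failing = x -> {
                throw new IllegalStateException("bad element");
            };
            System.out.println(awaitAndDescribe(parallelTraverse(chunks, pool, failing)));
        } finally {
            pool.shutdown();
        }
    }
}
```

Here both chunks fail, but the caller still sees the original "bad element" cause rather than a secondary "already cancelled" style error.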

3 participants