-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
With low peers (< 10), sync manager may get stuck in with workers map full of 'w'/'Q' #5794
Comments
Example:
|
Regarding the point above:
e.g., by changing: proc checkPeerScore*[A, B](pool: PeerPool[A, B], peer: A): bool to not kick peers in such a situation, it appears to work well first, but it will trigger nasty endless loop after some hours of seemingly being alright. So, no. Syncing worker is run in this loop: while true:
var peer: A = nil
let doBreak =
try:
man.workers[index].status = SyncWorkerStatus.Sleeping
# This event is going to be set until we are not in sync with network
await man.notInSyncEvent.wait()
man.workers[index].status = SyncWorkerStatus.WaitingPeer
peer = await man.pool.acquire()
await man.syncStep(index, peer)
man.pool.release(peer)
false
except CancelledError:
if not(isNil(peer)):
man.pool.release(peer)
true
except CatchableError as exc:
debug "Unexpected exception in sync worker",
peer = peer, peer_score = peer.getScore(),
peer_speed = peer.netKbps(),
errName = exc.name, errMsg = exc.msg
true
if doBreak:
break Note how there is no yield point between Inside # Avoid a stampede of requests, but make them more frequent in case the
# peer is "close" to the slot range of interest
if peerStatusAge < StatusExpirationTime div 2:
await sleepAsync(StatusExpirationTime div 2 - peerStatusAge)
trace "Updating peer's status information", wall_clock_slot = wallSlot,
remote_head_slot = peerSlot, local_head_slot = headSlot
try:
let res = await peer.updateStatus()
if not(res):
peer.updateScore(PeerScoreNoStatus)
debug "Failed to get remote peer's status, exiting",
peer_head_slot = peerSlot
return
except CatchableError as exc:
debug "Unexpected exception while updating peer's status",
peer_head_slot = peerSlot, errName = exc.name, errMsg = exc.msg
return If the And, if the peer is internally already disconnected, So, endless loop where it acquires a peer with broken connection, then tries to update status, which immediately fails, then releases the peer, which doesn't disconnect due to the Sample of nimbus_beacon_node_gnosis.txt Smells a bit like a separate issue, that |
There is a problem that we somehow processing task which should be visible as |
Only the earliest task can enter Minimal example
|
But this scenario is not actually stuck scenario, i should check it, but even if it happens as soon as new peer appears it will continue working. |
Yes, if the network has enough healthy peers it recovers eventually. This was observed on a tiny devnet with only 10 nodes in total, half of them providing incomplete responses for downloads. In this situation, all healthy peers were acquired in Agree it's not a high priority fix for real public networks. |
I think you should either merge PR fixing issue or close issue :) |
SyncWorkersCount - 1
--> some workers will be stuck inw
modeQ
modew
mode. If it's a new task, assume it succeeds, then the worker is stuck inQ
mode.After a while, every peer is stuck in
w
or inQ
mode, and 0 ..< 16 request is not fulfilled.w
means 'waiting for peer that is not locked by another sync worker'Q
means 'waiting for parent result to become available before we can process our own result'Some ideas on possible approaches to tackle this:
Q
mode so thatw
workers can acquire them? We can still de-score them if they are still around by the time we transition toP
, but we no longer exclusively control them.SyncWorkersCount
dynamically based on number of available peers?SyncWorkersCount
peers?This popped up in a tiny devnet with just 10 peers, where some of the peers provided incomplete responses (blocks without blobs within the spec mandated retention period), leading to gradual descores / disconnects until this situation arised.
The text was updated successfully, but these errors were encountered: