Gossipsub Dynamic Message Diffusion #4031
I would like to oppose Approach 2, as it completely eliminates the anonymity of validators. Some people have said there is already no anonymity in the p2p network, so we don't have to worry about it. That's not true. The only way to guess the IP addresses of validators is to observe message receiving times in the GossipSub topics, which reduces the anonymity set but doesn't give you the exact IP addresses. With Approach 2, however, you would know the exact IP addresses.

Why do we care about anonymity? Some people may only think of the DoS attack potential. That's not actually the only problem. IP addresses tell "who you are". For example, if I solo-stake and later withdraw the money and transfer it to an EOA, I would automatically lose the pseudonymity of the EOA, because you can link the EOA address to "who you are".
I think if you have some techniques in mind, it would be helpful to list them here.
Currently we use the block propagation time as a metric to see if the network is doing well or not. See [1]. If Approach 3 increases the block propagation time, it will contradict that metric, which means the approach is not good. Remember that the goal is not to reduce the number of duplicates; the goal is to make everyone receive the block/blobs as fast as possible with a larger number of blobs (and with a reasonable network topology). Reducing the number of duplicates is one of the ways to get there. So it's hard to say whether the approach is good just because it reduces duplicates.
Looking at the mean or median doesn't make any sense. We have to look at one of the worst cases. Like in [1], where we use the P95. [1] https://ethresear.ch/t/block-arrivals-home-stakers-bumping-the-blob-count/21096
@yiannisbot - Thank you for making this issue. These have been on my to-do list but I haven't got around to them yet.

@ppopth - The anonymity argument has been a long-standing one and has historically prevented us from potentially making improvements to the protocol. I am a strong advocate for security; however, I think it is useful to understand the exact trade-offs. I don't know how much of a bandwidth/latency gain we could make if we sacrificed pseudo-anonymity. If it's >95%, perhaps it's worth it?

In the beginning (and I think still now, on some clients/versions), validators were required to subscribe to long-lived subnets. This meant that you could just search the DHT, grab any ENR that advertised long-lived subnets, and you knew that this ENR corresponded to X validators (where X was the number of long-lived subnets). An ENR contains that node's IP address. So, very easily, you can tie an IP address directly to validators. It would then be a small matter of figuring out which validators that IP address had, but you could do that by connecting to them and watching what subnets they subscribed to and the associated attestations. So we have had non-anonymity on the network for years, which I know is not a good argument for doing it, but I just wanted to add perspective on the trade-offs. We could go back to what we had, but potentially gain network efficacy.

In the current world (at least in Lighthouse), all BNs subscribe to long-lived subnets, so you can't identify validators from ENRs. However, I believe this is changing in PeerDAS with validator custody. I think you can search ENRs (which contain IP addresses) and find nodes that have larger custodies. You then have IP addresses of nodes that have validators. Again, once you have that, you connect to those nodes and watch their traffic to determine which validators they have.

In the early days, a researcher built a few sentry nodes on the network, watched gossipsub traffic, and tied IP addresses to all validators, with some statistical uncertainty. It showed that even with all the measures in place, it was possible to tie IP addresses to validator indexes, just through gossipsub traffic. I suspect the EF devops team has also done something similar (cc @parithosh @barnabasbusa)?

As a counter-measure (not saying it's the most effective or without its own caveats), Lighthouse gives users the option to create two kinds of nodes: normal BN nodes and separate …
Not sure I follow this. Withdrawals can be sent from any node on the network. You can sign these things offline and broadcast them anywhere. But let's say you broadcast from your local node: it just sends the validator's value to its withdrawal address. I guess if you match validators to IPs then yeah, you also match withdrawal addresses to IPs (assuming the node that is staking owns the ether).

Approach 1 - I'm not against this. We could add some levers in there that are time-based that help with diffusion. It does sound like it has some pitfalls, but if we don't drastically cut off the diffusion, it could help.

Approach 2 - I agree that this definitely reduces/eliminates the pseudo-anonymity of validators. I am of the opinion that it might be worth exploring in simulations and getting a better understanding of our current anonymity, to make an informed decision as to whether this might be something worth sacrificing.

Approach 3 - We have tried simulating a few of these. It's hard to get something that looks like mainnet. In the end, we have made the degree configurable in Lighthouse via the CLI. I think for a small period of time we did drop it to 6, but the default is currently 8.
Thanks for the input everyone! Very interesting thoughts and context! Some further input and comments:

Approach 1

I don’t have ideas here, so feel free to share approaches that are worth considering @AgeManning.

Approach 2
Here’s a rough draft: let’s hypothetically assume that the producer initialises `hop_count` to a small random value instead of always starting it at 0. The approach does not keep the block producer anonymised, but it reduces the chances of the next hop knowing with more than 30% certainty that the previous node was actually the block producer. The approach results in some messages incurring more duplicates and faster propagation, i.e., those that start with a lower `hop_count`. There could be several things done to streamline this according to what we’d want (e.g., setting the random `hop_count` range).

Approach 3
Well, the thing is that we don’t know, IIUC? :) It could very well be the case that blocks propagate at the same speed (i.e., no increase in propagation latency), just with fewer duplicates. Looking at the p95 is best, I agree, but if we only want to go by max arrival time (https://probelab.io/ethereum/block_arrival/2024-48/#message_arrivals_max_min_on_1536s_window_on_topic_mainnet_beacon_block-plot), we’re already in a very bad place. A good approach here would be to set the default `D` lower and monitor these metrics.
Well, we do know, from maths: assuming unsaturated bandwidth (which is broadly the case in the happy case at least) and assuming the initial spread has almost no duplicates, this can be used to derive a "probable total spread latency" - it's during the last hops that duplicates happen. If you send to fewer peers in the beginning, the number of peers that have received the message at each subsequent hop is exponentially lower, so it takes longer to reach all of them.
Assuming 5k nodes on the network, you can see how much of a difference this makes on average. You can also reason about the probability of a duplicate - it is next to none for the first 3 hops and only on the 4th do we start seeing duplicates. 8 was chosen roughly based on these numbers - they weren't picked randomly at the time. What the above numbers don't take into account is floodsub in the first hop - this is a problematic thing to reason about regardless, because it causes a lot of problems (privacy, bandwidth competition).
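For reference, a back-of-the-envelope Python sketch of the per-hop numbers behind this argument (the 5k-node figure and the degrees 8/6 come from the discussion; the uniform mesh, the D-1 fan-out after the first hop, and the no-early-duplicates assumption are simplifications):

```python
# How many new nodes a message can reach per hop, assuming a uniform mesh
# of degree D, no duplicates in the early hops, and a fan-out of D-1 after
# the first hop (a node forwards to everyone in its mesh except the sender).
N = 5_000          # assumed network size

for D in (8, 6):   # current default vs. the proposed reduction
    reached, frontier = 1, 0               # the publisher already has it
    print(f"D = {D}")
    for hop in range(1, 8):
        new = D if hop == 1 else frontier * (D - 1)
        frontier = min(new, N - reached)   # can't reach more than remain
        reached += frontier
        print(f"  hop {hop}: ~{frontier} new nodes, ~{reached} total")
        if reached >= N:
            break
```

With D=8 the network is essentially covered around hop 4-5, and saturation (hence duplicates) only kicks in at that point; with D=6 full coverage takes roughly one hop more.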
To extend on what @arnetheduck was saying, if we assume a system with uniform nodes, uniform latencies between nodes, and unsaturated bandwidth (i.e. we reason in time steps), we can try to model the diffusion process, the speed of diffusion, and what duplication happens when. It is obviously a lot more complicated to model this in the real network, but there are still lots of similarities with the simple time-step based model.

Time step based modelling

Worth noting that a node with degree D is first receiving from (at least) 1 node, and then sending to the remaining (at most) D-1. So the exponent is only D-1, except for the first step, where it is D or whatever the source floods to. Now, let's say that `n_k` is the number of nodes that first receive the message at step k; then roughly `n_1 = D` and `n_{k+1} = n_k * (D-1)`. However, this is overestimating `n_{k+1}`, because it assumes every send lands on a node that does not yet have the message. What one can think through is where the nodes in `n_k` actually send their D-1 copies: some land on nodes that already have the message, and some land on the same fresh node. So a node in `n_{k+1}` may receive the message from several nodes of `n_k` in the same step, and every such overlap is a duplicate.

More overlaps mean a smaller `n_{k+1}` and more duplicates. I'm far from having the exact stochastic formulation, and a more realistic continuous-time model would be even more difficult. But it is clear that lots of nodes sending with a high D to a limited set of nodes results in lots of duplicates.

Edge count based modelling

Another way of looking at the diffusion, as a whole, is to realise that we are sending, overall, between `N*D/2` and `N*D` copies of the message (for `N` nodes with degree `D`): every mesh link carries the message at least once, and at most once in each direction. So on average we have between `D/2` and `D` copies arriving per node. But where are we in this range? That depends on the speed of diffusion, which implicitly depends on D again. The faster the diffusion, the less informed we are about the state of the peer node, hence the more cross-sends we will have. So overall, a larger D is bad from the perspective of duplicates twice: it is more sending, and it is more cross-sending.

With this long premise, let me focus on the approaches

The main difference I see is that Approaches 1 and 2 try to somehow estimate the diffusion state and act based on that. Approach 3 is instead just a simple parameter modification.

Approach 3

Simply reducing D has advantages for reducing duplicates, but clearly at the cost of a delay increase. It is a parameter that has to be evaluated based on the expected number of nodes, and the latency allowed for the actual use case. I'm not convinced we are using the right D for all our topics, or that we should use the same for new topics. In fact, for DAS we've used smaller values in simulations, but there we are not in the "unsaturated bandwidth" case, which makes a huge difference.

Approach 1

The issue I see with this time-based approach is that it amplifies the "state" of the network. If all is good and we are on the happy path, it keeps diffusion fast while reducing duplicates. But if we are not on the happy path, and diffusion is already slow for any reason, it slows it down even more. This makes me a bit sceptical. What we need is to estimate the diffusion state, and time is a strange estimator for that. The problem, as I see it, is that we might not have a better estimator.

Approach 2

I'm a big fan of this approach as a generic GossipSub mechanism (that's why I've also proposed it in the past). This of course does not mean that it is the right tool for some specific use cases where privacy must be preserved. So we might not use it on all topics, but e.g. for DAS it might make sense.

The Push-Pull phase transition

Independent of how we estimate the diffusion state, with approach 1 or 2, I would modify node behaviour differently from what you describe here. This is what I've called the Push-Pull phase transition in https://ethresear.ch/t/fulldas-towards-massive-scalability-with-32mb-blocks-and-beyond/19529#duplicate-reduction-techniques-24.
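To make the time-step picture concrete, here is a toy Python simulation under the same simplifying assumptions (uniform nodes, unit latency per hop, unsaturated bandwidth). It is not the real mesh: forwarding targets are re-sampled uniformly at random instead of being fixed mesh neighbours, so it only illustrates the qualitative trade-off (a larger D means more sends and more cross-sends; a smaller D means fewer duplicates but more hops to full coverage), not exact numbers.

```python
import random

def simulate(n_nodes=5_000, D=8, seed=0):
    """Toy time-step diffusion: the publisher pushes to D random peers;
    every other node, on first receipt, forwards to D-1 random peers in
    the next step. Counts total sends and duplicate deliveries."""
    rng = random.Random(seed)
    received_at = {0: 0}          # node id -> time step of first receipt
    frontier = [0]                # nodes that forward in the next step
    sends = duplicates = step = 0
    while frontier:
        step += 1
        next_frontier = []
        for sender in frontier:
            fanout = D if sender == 0 else D - 1
            for peer in rng.sample(range(n_nodes), fanout):
                sends += 1
                if peer in received_at:
                    duplicates += 1          # cross-send: peer already had it
                else:
                    received_at[peer] = step
                    next_frontier.append(peer)
        frontier = next_frontier
    return max(received_at.values()), len(received_at), sends, duplicates

for D in (8, 6):
    last_step, covered, sends, dups = simulate(D=D)
    print(f"D={D}: last delivery at step {last_step}, {covered} nodes reached, "
          f"{sends} sends, {dups} duplicates")
```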
Maybe it doesn't matter here, but I would like to point out that the upper bound of the saving is 75%. Currently a node receives 4 copies and sends another 4 copies on average. The best you can do is to receive one copy and send one copy, which is just a 75% reduction.
But, I think I have a strong sense of how much anonymity the CL currently provides (let me address all the mentioned issues below).
I think claiming that it's already broken is kind of an over-claim (unless you take flood publishing into account). People keep telling me that anonymity is already broken without telling me how, so we don't have evidence yet that it's already broken.
That sounds easy, but I don't see how it's easy or a small matter at all. Watching which subnets they subscribe to doesn't tell you which validators that IP address has. Each att subnet has about 400 validators at one slot and about 1,000 nodes, so how is it a small matter to figure that out? Even if you watch traffic metrics like message arrival times and can reduce the number of candidates to, say, 10 (made-up number), that's still a kind of anonymity and it's not broken yet (you still have to guess which one of those 10 nodes it is).
I agree that flood publishing kills anonymity, but we don't want to do that anymore, so we are good.
As I said earlier, watching the traffic doesn't tell you exactly which validator they have.
That kind of research is very valuable. I would love to see how you do it and analyze the attack. I think it's worth making it a formal ethresearch post. In conclusion, I will accept it if we have a consensus that Ethereum is not supposed to provide (pseudo-)anonymity, but claiming that it's already broken is an over-claim.
I want to point out that, from this study https://ethresear.ch/t/number-duplicate-messages-in-ethereums-gossipsub-network/19921, we are already very close to the lower bound. That is, we already have around D/2 duplicates. So it's already unlikely that a message would travel on a link twice, which is kind of a corner case (at least for the current mainnet).
This has been done before, in the early days; maybe it's different now. For simplicity, let's say we target a specific peer, and therefore a specific IP (although it might be more beneficial to have a node with D = 5000 and try to get on everyone's mesh). Depending on the client, you can know instantly which long-lived subnets it will be subscribed to, so you avoid those subnets (but you remain subscribed to all subnets). When a validator produces an attestation, it will send that attestation to you (because you are subscribed to those subnets) with some statistical uncertainty. You wait for that, and that's it, you're done. It will send you the attestation, which indicates its validator id. You know it's that validator and not someone else, because it is not subscribed to the subnet, so it is not forwarding anyone else's attestations. Validators don't subscribe to subnets unless they are aggregators. Even if that wasn't the case, you could make sure you are on their mesh and watch which unique messages on the subnets come from them first. Over time you can identify their validators, because they will be the only ones sending you their attestations (it is unlikely you'd get an attestation from someone else before them, although not impossible). Correlating this timing over a few epochs should allow you to narrow it down to exact validators with decent accuracy, I'd imagine. If you are still not convinced, we can do this as a side project, to prove the idea.
@cskiraly thanks for the insightful input. A couple of clarifications:

Approach 1

How do you define "happy path"? And what would be reasons for "diffusion is already slow for any reason"? E.g., does it include nodes being CPU-overloaded and dropping network-related events?

The Push-Pull phase transition

What does this involve exactly? I'm not sure I follow. Reducing …
Yeah, I wasn't aware that validators don't subscribe to subnets. That kind of makes sense, because finding and connecting to mesh peers is not cheap and they would have to do it every epoch. If nodes use fan-out peers to publish, it kills anonymity, and I was aware of that.
I'm not convinced by this. In fact, it's very likely to get it from another person. Let me explain why. Let's assume you are 2 hops away from the publisher (publisher <--> middle man <--> you). Because there are only two hops, the latency will be low, so this path is a fast path. The question is: how am I sure that there is no faster path? Assume that there are 10k nodes in the network. You can see that there are only 8^1, 8^2, and 8^3 nodes in the first hop, second hop, and third hop respectively.
So I can say with confidence that all other paths from the publisher to you are at least 4 hops long. So the 2-hop path through that middle man is the only shortest path. It also means you will always receive the message from the middle man rather than the original publisher. Do you see now that it's very likely that you will always receive the message from someone that's not the publisher?
Now, I will show you that it's not accurate at all. As I said, there are only 8^2 nodes in the second hop, but there are only 8^1 in the first hop (direct mesh peers of the publisher). Assume that all the second-hop peers always receive the message from one of their peers (their middle men), which can happen with high probability as I mentioned above. Now you do the attack and always receive the message from one peer. But you can see that there are only 8 mesh peers of the publisher, so you have at most an 8/8^2 = 8/64 = 12.5% chance that you are directly connected to the publisher. So 12.5% is not "decent accuracy"; it's way too low. You can see from my numbers that gossipsub provides pretty good anonymity, but I agree that it's only for subscribing nodes. By the way, I'm sorry for the late reply.
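For reference, the arithmetic in this argument spelled out as a tiny Python calculation (the 10k-node figure, the degree 8, and the rough 8^h hop populations are taken from the comment):

```python
# ~10k nodes, mesh degree 8, and roughly 8**h candidate nodes at hop h.
N, D = 10_000, 8
for hop in (1, 2, 3):
    print(f"hop {hop}: ~{D**hop} nodes")        # 8, 64, 512

# An attacker who always hears the message first from the same peer sits,
# with high probability, in the second hop. Only D of those D**2 positions
# are direct mesh peers of the publisher, so:
print(f"P(directly connected to the publisher) <= {D / D**2:.1%}")  # 12.5%
```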
Happy path in this case is just diffusion as you would expect based only on bandwidth and baseline network latency constraints. The reasons for slow diffusion can be many, from temporary network outages to CPU overload, or simply late release of the message itself because of e.g. a disk I/O hiccup or waiting on something that was supposed to be immediate.
Switching to "pull mode" would mean sending only HAVE information on a mesh link, and letting the other side WANT the message. So this is changing how you send the message to your mesh peer: you do not "push" it, you just let it know you have it.
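A rough sketch of what that could look like, purely illustrative (the spread estimator, the threshold, and all names here are assumptions, not the actual gossipsub implementation or API):

```python
PULL_THRESHOLD = 0.5                     # assumed cut-over point

class Peer:
    """Stand-in for a mesh peer connection."""
    def __init__(self, name): self.name = name
    def send_message(self, payload): print(f"push full message to {self.name}")
    def send_have(self, msg_id): print(f"announce HAVE({msg_id}) to {self.name}")
    # a real peer would answer a HAVE with a WANT if it misses the message

def forward_to_mesh(msg_id, payload, mesh_peers, estimated_spread):
    """Push or merely announce, depending on how far the message has spread."""
    for peer in mesh_peers:
        if estimated_spread < PULL_THRESHOLD:
            peer.send_message(payload)   # push phase: classic mesh forwarding
        else:
            peer.send_have(msg_id)       # pull phase: HAVE only, payload on WANT

# e.g. a freshly received block (low spread estimate) is pushed, while one
# that has likely reached most of the network is only announced:
forward_to_mesh("msg-1", b"...", [Peer("A"), Peer("B")], estimated_spread=0.1)
forward_to_mesh("msg-1", b"...", [Peer("A"), Peer("B")], estimated_spread=0.9)
```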
@ppopth - All good on the late reply. Thanks for clarifying. I think the arguments you are making are for the block topic, where meshes are relatively stable, and assume that the de-anonymizing node has a mesh degree of 8. If I wanted to de-anonymize validators, I'd create a node on the subnets, set my mesh degree to 20,000, and continually send GRAFTs to every node until I was on every node's mesh. I would therefore have a direct connection to everyone on the subnets (which often get formed and unformed as nodes subscribe/unsubscribe when they are aggregators). I've had Lighthouse eclipsed in a similar manner before, so I know this kind of thing is possible. It would take a little bit of engineering to do this and keep track of which validators are which, but again, I think it's more possible than you might think. But also, because we publish to our fan-out when not subscribed, that is mostly game over for anonymization anyway.
It's also good to note that that requires something like 2,000 times more bandwidth than a normal full node. If you have that much bandwidth, you could do more extraordinary things, like splitting the network, which results in liveness issues for the chain. Even though it's achievable to get a 50Gbps machine, it makes more sense to exclude it from the threat model, because otherwise we are doomed anyway. But that doesn't matter anyway because, as you said, publishing to fan-out peers kills anonymity and nothing else matters.
I would like to correct this. We don't receive 4 copies and send 4 copies. We expect to receive 8 copies (which can be lower than that if some mesh peers are dead). I also put the correction in the original research result here: https://ethresear.ch/t/number-duplicate-messages-in-ethereums-gossipsub-network/19921/4?u=pop
Problem
As per the current Gossipsub spec, messages are pushed to all peers within the peer’s mesh, defined by the variable `D`, the mesh size. Inevitably, this results in duplicate messages, which is generally expected and a somewhat desirable property to defend against malicious behaviour or attacks. At the same time, the number of duplicates should be limited to the extent possible to avoid overloading nodes.

Intuitively, given that the rate of diffusion stays stable throughout the 4-second propagation window, we expect to see more duplicates towards the end of the window, when the message has already propagated to the majority of the network.
Solution Spectrum
Approach 1: In order to reduce the amount of duplicates as a result of excessive diffusion at the end of the 4-second window, we propose that dynamic diffusion should be considered. According to the dynamic diffusion scheme, more aggressive propagation of messages takes place at the beginning of the window and slower propagation towards the end. For example, a message is propagated to all `D` peers of a peer’s mesh during the first two seconds and to `D/2` peers during the latter two seconds of the window. However, setting diffusion parameters based on time doesn’t sound like a great choice: it makes things rigid and tied to the clocks of different nodes in the network, which is generally not a great idea.

Approach 2: An alternative way of defining the aggressiveness of message diffusion is through the number of hops that a message has propagated, e.g., a message is pushed to: `D` nodes when `hop_count < 2` from the message producer, `D/2` nodes when `2 < hop_count < 4`, and `D/4` when `hop_count > 4`. This, however, means that the message/block producer becomes known to its immediate peers, which see `hop_count = 0`, although there could be techniques to obfuscate that a little bit.

Approach 3: A much simpler alternative is to just reduce `D` from the current default of `8` down to `6`. This approach won’t solve the issue of more duplicates towards the end of the propagation window (i.e., it’s not dynamic), but it will reduce the diffusion velocity throughout the 4-second window and, as a result, the amount of duplicates throughout. From the Block Arrival times seen on ProbeLab’s dashboard [link], there seems to be space to reduce aggressiveness (and duplicates) and (inevitably) increase block propagation time: we see both mean and median block arrival times well below the 4-second mark, hovering on average around 2.1-2.3 secs.

The purpose of this issue is to gather feedback on whether investigating this further is desirable, as well as ideas on how to formulate it better (e.g., based on hop count or some other parameter).
[Initial ideas in this issue have been discussed with @cortze @AgeManning @cskiraly and others during Devcon 7.]