Network shards (Attnet Revamp + DAS Distribution Columns) #3623
Conversation
Co-authored-by: Divma <[email protected]>
@AgeManning thanks for making this PR 👍
From my perspective
I still hope we would mix in randao to the
Do we potentially want to provide the ability for a node to voluntarily participate in a number of additional network shards? This feature was part of the PeerDAS proposal.
@Nashatyrev What do you mean by leaving other shards vacant? Do you mean that the vacant shard won't be assigned any columns for the period? If that's the case, wouldn't it make retrieving columns harder, because nodes in those shards won't serve any columns? (They'd also not be very useful to other peers for the period.) Thinking in terms of 2D PeerDAS, is there any benefit for a node to custody both rows and columns? I was wondering if there's a 3rd option - instead of leaving the other shards vacant, could they be assigned row subnets instead? (Mixing columns and rows could be messy, but it came to mind and I thought I'd mention it anyway.)
I actually didn't mean any specific topic. In particular, DAS column subnets would most likely occupy all the shards. I was rather thinking of, for example, sync committee subnets (there are just 4 of them). We may decide to simply assign them to just 4 network shards, so the remaining shards would not serve any sync committee subnets.
I think sharding should go hand-in-hand with the DAS structure, with its own diffusion protocol which should (in the long term) move away from GossipSub. What we really need is a block diffusion protocol (Diffuse?) that is integrated with the 2D structure, mapped into the bits of a virtual ID space. This ID space can be linked to the node-ID space. With X rows and Y columns, you can easily map them to X+Y bits. Then we can have a compact representation of nodes' "interest" in given bits, representing specific columns (upper part of the bits) or rows (lower part of the bits). We can derive these interests systematically, eventually running them through randomization, or signal them with a bitmap of X+Y bits.
Handling it as an integrated protocol instead of a series of GS topics would also allow us to have compact representations of HAVE lists and the like, or to express "interest" in single segments (X,Y coordinates), or single segments run through some randomization, eventually some randomization including the node-ID, a.k.a. push sampling.
I think that's needed, since nodes are not equal, neither in their resources nor in their interest in verifying and custodying more of the block. I think instead of selecting one shard, we should randomize the node-ID into the column+row address space a few times, allowing us to select multiple columns and rows. The number of samples we need from the column ID space and from the row ID space can be controlled and advertised by the node, keeping a minimum requirement that allows other nodes to know a base set without looking up the advertised value. Another topic is voluntary participation, "hand-picking" specific columns or rows. I.e. participation is not derived from the node-ID; it is instead like voluntarily joining a GS topic. I think that could be allowed as well, but I can imagine nodes prioritizing peers based on where the interest comes from. Example priority order:
Thanks everyone for the detailed review so far! Some of my thoughts:
I didn't quite follow the complication here. I thought in the example of 32 column subnets, we could just do something like shard % 32 = subnet_column. This would be the second case you mentioned. From a peer management perspective, if I balanced my node such that its peers were uniformly distributed across the shards, then I'd have double the amount in the DAS subnets and they'd be uniformly distributed. If I need a peer on any column subnet, I can search 2 shards to find them if needed. I might be missing something?
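As a rough sketch of the mapping described here (the constant names and the 64-shard / 32-subnet split are assumptions for illustration, not the PR's definitions):

```python
SHARD_COUNT = 64          # assumed shard count
COLUMN_SUBNET_COUNT = 32  # assumed column subnet count for this example

def column_subnet_for_shard(shard: int) -> int:
    # "shard % 32 = subnet_column": each column subnet is served by
    # SHARD_COUNT // COLUMN_SUBNET_COUNT = 2 shards
    return shard % COLUMN_SUBNET_COUNT

def shards_serving_column(subnet: int) -> list:
    # to find a peer on any given column subnet, search these 2 shards
    return [s for s in range(SHARD_COUNT) if s % COLUMN_SUBNET_COUNT == subnet]
```

With a peer set balanced across the 64 shards, each column subnet would then get double the peer coverage, uniformly.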
Yep. I've decoupled the shard count from the subnet count to make this a bit easier to do, i.e. we have the
Appreciate the inline comments in compute_network_shard, it's very clear now.
I would like to clearly separate node roles with respect to network shards:
My original comment was related to the server role solely.
I was probably talking about a slightly different matter.
To simplify, let's even forget the rotation for the moment and just consider a single instance of a mapping f: node_id -> set of subnets to participate in. A desirable property is that, whenever data is made available on < 50% of the subnets, only a small fraction of node_ids are such that f(node_id) is a set of subnets that is completely available. In other words, no matter how you select one half of the subnets, most assignments are not contained within that half.

This is useful because any node outside of this small fraction would be unable to get all of the data they are looking for according to f(node_id), and so they would immediately consider this data to be unavailable before even trying to sample. In particular, such a node would not vote for the related block.

Whether or not we have this property is entirely dependent on the mapping f, and on the number of nodes that we have. It's essentially just a combinatorial property. As long as there are enough node_ids and f spreads their assignments out well enough, it is not possible to trick a high percentage of the nodes. Note that for this purpose f does not need to be unpredictable or changing over time.

Correct me if I am wrong: in order to achieve the goals of this PR (in particular "all clients will only have to optimise their peer set to be uniformly distributed based on network shards and they will be guaranteed to have a uniformly distributed set across all transient topics"), the mapping needs to be node_id -> shard -> set of subnets to custody. If that's the case, it doesn't really matter how random the node_id -> shard mapping is, or how random the shard -> set of subnets mapping is. Either way, there are only 64 possible sets of subnets that a node might be able to custody, which is the same as having only 64 nodes for the purpose of the property above.
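The "64 buckets" bottleneck can be made concrete with a toy sketch (the names, the hash choice, and the parameters here are assumptions, not the PR's actual functions): however many node_ids exist, a custody mapping that goes through the shard can only ever produce SHARD_COUNT distinct sets.

```python
import hashlib

SHARD_COUNT = 64
SUBNET_COUNT = 64
CUSTODY = 2

def shard_of(node_id: int) -> int:
    # stand-in for the node_id -> shard bucketing
    return node_id % SHARD_COUNT

def custody_set(node_id: int) -> tuple:
    # custody is derived only through the shard, so there are at most
    # SHARD_COUNT distinct outputs regardless of the node count
    digest = hashlib.sha256(shard_of(node_id).to_bytes(8, "little")).digest()
    return tuple(sorted({digest[i] % SUBNET_COUNT for i in range(CUSTODY)}))

# 10,000 nodes, yet at most 64 distinct custody sets
distinct = {custody_set(n) for n in range(10_000)}
```

For the availability property this is equivalent to having only 64 nodes, no matter how pseudorandom the two component mappings are.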
We could in principle just use network shards for the backbone of the topics, e.g., if the custody requirement is 8, we could map each shard to 2 subnets, while the other 6 could still be directly pseudorandomly chosen based on the node_id, without going through the node_id -> shard "bottleneck". That would work fine from this security perspective. But then doesn't it fail to achieve the goal of allowing a simple criterion for peer set optimization? If you just optimize for a uniform distribution of shards in your peer set, you are only taking into account a small portion of a node's custody, and possibly kicking out nodes which are in subnets you need.
As mentioned above in this message, the issue I see is that this mapping, while pseudorandom, only produces 64 (shard count) possible values.
@@ -136,9 +136,11 @@ MESSAGE_DOMAIN_VALID_SNAPPY: 0x01000000
SUBNETS_PER_NODE: 2
Should SUBNETS_PER_NODE be removed, while SUBNETS_PER_SHARD and SHARDS_PER_NODE are added instead?
Yeah, I'm wrong here; I hadn't seen the updated spec. Subnets are column-only, so an attacker needs to control 50%+ of the subnets to do any harm.
Reading through the discussion, what is not clear to me (and maybe here I'm just repeating @fradamt) is the relation between

I see two cases here:

1. This is really just a support function at the network overlay level, relaying messages not because these nodes are interested, but because they are obliged to do so. In this case it does help connect the overlay, but it also adds extra traffic to account for. We can however juggle the ratio of support nodes as we like; it is just a support function providing stable extra nodes, connected between them to form a "backbone". Hopefully not too connected, so that the nodes temporarily joining a topic can fit in their mesh. As a consequence, our overlay graph will be far from a d-regular random graph, but it will be at least connected. The danger I see here is that the topology will be much more hierarchical than we would like it to be, but this might be a reasonable compromise for these topics. One could even think of doing some peer exchange to re-randomize it quickly, "forcing" links between the short-term nodes as well.

2. Shard nodes are "forced" to also be interested in the topic, which means that sharding must influence the topic distribution logic on top of this mechanism. This would mean that network sharding is not a generic tool for topics with fast rotation (or high churn, or similar), but a cross-layer mechanism that is coordinated with some interest assignment logic. In that case, taking the example of applying this to PeerDAS column topics, we are indeed effectively reducing the number of columns we have to the number of shards.

Based on the original description, I tend to see this more as the first option, but I'm curious whether I understood this correctly. Another thing that puzzles me a bit is why finding peers takes that much time if we have a NodeID -> topic mapping function. Is it the number of peers vs. the number of topics, the mesh degree bounds, the warm-up time on mesh links, or something else that makes it so that
From what I understand from your messages, I think there is a misunderstanding of what this PR does and how it works in practice. Perhaps it might be useful to align on a call to discuss, but I'll also explain here as best I can. Let's, for now, ignore all the complexities of this PR, remove the idea of shards, and just talk about a node-id -> custody column mapping. For this explanation, I'm going to assume the custody columns are also going to form this "backbone gossipsub" structure I've been talking about, which I think is what we want in practice anyway. At this point, this PR is general with respect to the number of custody columns and the total number of column subnets, but for the sake of simplicity, let's say each node must custody 2 columns and in total there are 64 of them. Now let's make a node-id mapping that uniformly distributes node-ids to all 64 DAS columns (think node-id mod 64). In fact, because the custody requirement in this example is 2, each node-id maps to exactly 2 columns pseudo-randomly. So every node on the network now must custody 2 columns. There are now a few problems in practice we need to solve:
In practice, we can't do a), because it takes too much time (and is unreliable, for a variety of reasons I won't mention here), and sampling has to be fast. So we have to do b). Given our node-id to custody column mapping, we can do this discovery process in advance, or use some other technique, to ensure that at all times we always have 2 or so peers that can tell us a sample for every column, so we can sample when we need. I think one point that might be a source of confusion here is that we can connect and maintain a connection with these peers without subscribing to the topic. So we can have peers on all DAS columns, but we don't subscribe to those topics, therefore we don't form meshes with them, we don't receive messages from them, and those columns are essentially non-existent to us. We only use them to quickly sample via RPC when we need. Hopefully this makes the actual workings of clients in the DAS context clear and provides a bit more context around this PR. As far as I can tell, I think we still maintain the property that @fradamt wanted. Let me know if you think we should have a call, because I could also be misunderstanding the issue.
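A minimal sketch of this peer-selection logic (the names and the concrete mapping are assumptions; think "node-id mod 64" extended to 2 columns):

```python
COLUMN_SUBNET_COUNT = 64
CUSTODY_REQUIREMENT = 2

def custody_columns(node_id: int) -> list:
    # deterministic but pseudo-random-looking: 2 columns derived from the node-id
    first = node_id % COLUMN_SUBNET_COUNT
    second = (first + COLUMN_SUBNET_COUNT // CUSTODY_REQUIREMENT) % COLUMN_SUBNET_COUNT
    return [first, second]

def peers_for_column(peers: dict, column: int) -> list:
    # peers: {peer_id: node_id}. Connected peers custodying `column`; we can
    # request a sample from them over RPC without ever subscribing to the topic.
    return [p for p, n in peers.items() if column in custody_columns(n)]
```

The point of the second helper is exactly the "connect without subscribing" behaviour described above: maintaining such peers for every column lets a node sample quickly on demand.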
@AgeManning, thanks for the detailed reply. That you can connect without subscribing is clear to me. Also, it is clear to me that having a node-id -> custody column mapping is good for us, because it is enough to know a number of node-ids to see who to potentially subscribe to. The part where I'm confused is why you would not keep a reasonably large set of node-ids and ENRs. Your Discv5 buckets already have a large number of them, also doing liveness checks every now and then through UDP.
Dialing a peer to get a sample should not be so difficult, if we do not require peer scoring to play a large role there. I wonder whether we need peer scoring in that context. I'm a bit limited in call availability as I will have ZKSummit during the day, but I can try. I'm on GMT+3.
This conversation is probably starting to deviate from this PR and maybe we should move it to a DM. I'll give just short answers here and longer answers elsewhere. We do store large amounts of node-ids and ENRs in our routing table in discovery. The majority of these could be peers not related to Ethereum, or the consensus layer, or even useful to us at all (the reason we do this is for security and the desire to have a very large peer set in the DHT). It's a statistical game, but really we only have peers in around 10 or so buckets, so let's say on average every node will store around 160 ENRs or node-ids (16 per bucket if they are full). Most of these are irrelevant to us (by design).
I used to think this too in the early days. Let me highlight the main things that go wrong (very commonly). Here are all the steps that necessarily have to pass in order to successfully dial a peer and get a sample:
It's more often the case that 1 of these 7 steps fails than all of them succeeding. From experience, it's not a reasonable strategy to assume we can accomplish all these steps in a timely manner and have them succeed.
I've modified this PR to decouple the rotation periods. Because we are decoupling the rotation periods, the concept of a "shard" becomes more ambiguous (different topics rotate differently, so the "shards" for each topic are no longer linked). Therefore the whole concept of a "shard" becomes more confusing than helpful. Instead, we just have a function that maps node-ids to topics. We use this function for attnets and DAS columns. Implementers can choose to keep a uniform distribution of node-id prefixes, which will provide (on average) a uniform distribution over both topics.
@fradamt I think I understand what you are trying to say.
I think what you assume is that the validators will not actively request the samples using RPC (Req/Resp), as @AgeManning explained and as specified in the current PeerDAS spec, but will instead passively receive the samples from the subnets of the columns they are supposed to custody, and consider the data available if they receive all the samples. If that's the case, what you said makes sense. I guess many people in this PR haven't heard about this idea yet and that's why it caused confusion here. It's not even specified in the spec yet.
Is it really a problem that the number of possible sets of subnets that a node can custody is small? I think what matters is the sets of subnets that a node will be subscribed to in order to receive the samples. The two collections of sets don't have to be the same. Of course, the nodes have to custody all the samples that are in the subnets of the shard assigned to them, because shards are backbones and they have to serve samples to others, but they can also receive samples from other subnets in other shards as well. What they can do is subscribe to the subnets in other shards a few seconds before the samples are expected to arrive and unsubscribe a few moments later. This can be done quite fast because every node is already connected to nodes in every shard. So I think the function f: node_id -> set of subnets to receive samples can be uniformly random. Please let me know if I misunderstand anything.
each topic that remain subscribed for a long period of time (order 1 day). This allows nodes to search for and maintain a selection of these long-lived peers in order to publish/receive messages on the topic. As some topics are transient for most nodes (i.e. attestation subnets, DAS-related
Does it make sense to mention DAS in the phase0 spec?
I've spoken with @fradamt outside of this channel. To summarise our conversation: the problem @fradamt is raising is that this PR is essentially bucketing all node-ids into 64 buckets (what we are calling the prefixes or shards). We are doing this to make these nodes easily searchable in discovery. The problem is that there are only 64 unique combinations of subnets that node-ids will subscribe to. Therefore an attacker can send 50% of samples and, because there are only 64 unique combinations of column subscriptions, statistically convince a large portion of these nodes that the data is there. If there were more combinations, it would be harder to trick a large group of the network. So essentially we need a larger range of combinations to map node-ids to column subnets. Our current thinking therefore is to stick with this PR in that the first 6 bits can be used to map to a number of column subnets (i.e. 2), then use the remaining bits (ones that won't be searchable) to specify extra columns. The first bits should be sufficient for us to easily search and build a backbone structure, and the latter bits will be enough to split the nodes into more unique combinations of subscriptions to prevent the attack mentioned above. I believe this PR (which targets phase 0) may be alright as it is, and another PR specifically related to DAS will have to account for these extra column subnets.
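A sketch of that two-tier idea (bit widths, constants, and helper names are all assumptions for illustration): the first 6 bits give the searchable backbone columns, while later, non-searchable bits enlarge the space of subscription combinations.

```python
NODE_ID_BITS = 256
COLUMN_SUBNET_COUNT = 64   # 6 bits per column index
BACKBONE_COLUMNS = 2       # derived from the searchable 6-bit prefix
EXTRA_COLUMNS = 2          # derived from non-searchable bits

def backbone_columns(node_id: int) -> list:
    # first 6 bits: the discoverable shard, spread over 2 columns
    shard = node_id >> (NODE_ID_BITS - 6)
    step = COLUMN_SUBNET_COUNT // BACKBONE_COLUMNS
    return [(shard + i * step) % COLUMN_SUBNET_COUNT for i in range(BACKBONE_COLUMNS)]

def extra_columns(node_id: int) -> list:
    # subsequent 6-bit chunks (selection with replacement, for simplicity):
    # not searchable in discovery, but they break the 64-combination bottleneck
    return [(node_id >> (NODE_ID_BITS - 12 - 6 * i)) & (COLUMN_SUBNET_COUNT - 1)
            for i in range(EXTRA_COLUMNS)]
```

With 2 extra 6-bit selections the number of possible subscription sets grows from 64 towards 64^3, which is what defeats the "only 64 buckets" attack.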
So what you have here, in my understanding, is that you have N columns (is this NETWORK_TOPIC_COUNT?), and for each node you want to select L of these (I'm not sure what the name of this parameter is in your description). For the probabilistic arguments to work, you want to select these for each node independently, with uniform random selection, ideally without replacement. Now this would be easy if you would just roll a dice a few times in every node and publish the selection somehow. However, we want to introduce a few tricks:
Forgetting a little bit about the rotation aspect, the randomness required basically means that each node has to pick a random point in an N^L space (this is still with replacement, but we can deal with that later if we want). We want to connect this to the hierarchical Kad address space. A natural way to obtain this is to just simply map to the first L*log2(N) bits of the node-id (Kad address space). You can’t really use less bits, because you do need that amount of randomness. So we do an ordered selection, simply treating your node-id’s prefix bits as a column selection vector <C1,C2, .. CL>. With this you achieve fast peer candidate search in Kad. Say you are looking for candidates for column C
Now your probabilistic space is fine (although we still do selection with replacement). But you still have an issue from the networking perspective, because your topic network now has a hierarchy. Nodes that selected C as their first selection are in a privileged position, forming a backbone that everyone tries to connect to. Similarly, the later C appears in your column selection vector, the fewer nodes try to connect to you because of it. But there is enough randomness in the lookups that I think this is not such a big issue for this version. It's worth keeping in mind for later versions, where we might want to re-randomize the graph with peer exchange and the like. So we've solved 1 and 2, but we still need 3, the rotation. Here the real question is IMHO why we are doing it in the first place, and I don't think that has been sufficiently clarified.
Anyway, the challenge is now to change individual positions in the selection vector <C1, C2, .. CL>, but still keep it related to the Kad structure. Rotation becomes tricky because the only thing that differentiates your node from other nodes is the node-id. Everything else is global. (We can add things in the ENR, but that's fundamentally just like having a bigger node-id.) If we derive something deterministically from the same bits of the node-id, it will select the same nodes all the time, no matter how we randomise, shuffle, etc. We need to include other bits that differentiate nodes to make nodes jump individually between topics, and the more bits we add, the more individual this jumping behaviour will be. One idea here is to e.g. use the first 6 + S bits and the epoch to decide the first column selection (6 bits). Mapping from that (6+S)-bit ID space to a 6-bit ID space allows us to avoid keeping fixed groups, if that's what we want. If S is larger than log2 of the number of nodes, nodes essentially move independently of each other. If not, they are somewhat grouped. We also want this mapping to be easily reversible, so that we know what to search for in the Kad address space. We would end up with any specific C mapped to a few or a large number of prefixes, but still prefixes, which can be used for Kad search. Another idea is to use different shuffling periods for different column selection positions: rotate C1 slowly (or maybe keep it fixed?), giving a stable 1/NETWORK_TOPIC_COUNT portion of nodes on the same topic, rotate C2 faster, C3 even faster, etc. But really, I think the base question is what we want to achieve with the rotation.
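The ordered selection vector can be sketched like this (the parameter values are assumptions): read the first L*log2(N) bits of the node-id as <C1, .., CL>.

```python
N = 64           # columns; log2(N) = 6 bits per selection
L = 4            # selections per node (assumed)
NODE_ID_BITS = 256

def selection_vector(node_id: int) -> list:
    # the i-th 6-bit chunk of the node-id prefix is the i-th column selection
    return [(node_id >> (NODE_ID_BITS - 6 * (i + 1))) & (N - 1) for i in range(L)]

def selects_column(node_id: int, column: int) -> bool:
    # candidates for column C are exactly the node-ids whose prefix contains C
    # in one of the L chunk positions, which Kad prefix lookups can target
    return column in selection_vector(node_id)
```

This is selection with replacement, as noted above: nothing prevents the same column appearing twice in the vector.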
So you still want to use 6 bits to select 2 out of 64 (or from 128?). That's still limiting the space of combinations. Why would we do that instead of just using enough bits?
They are actually searchable, just way more complicated, and it would need Discv5 changes. Anyway, I think we agree on this.
Every node has 2 separate roles:
Sorry, I'm not following how performing propagator duties affects the security of sampling?
It is currently unclear whether peer sampling will be required before voting in the current slot, or whether it will only be required after one slot or something like that. If that's the case, this opens up lots of problems. The most straightforward solution involves leveraging subnet sampling, i.e., getting some security guarantees directly from the distribution phase. An example of this is Dankrad's idea that validators custodying 2 rows and 2 columns in Danksharding would make it so that at most 1/16 (because 16 = 2^4) of the honest validators can be made to vote on an unavailable proposal. See here for a longer explanation of the problem and the suggested solution. For more context, this idea is also what SubnetDAS was based on. Importantly, for this purpose (ensuring that at most X% of the nodes can be tricked), it is irrelevant whether it is predictable what nodes are sampling. They can even "sample" (i.e., subscribe to) the same subnets all the time, as long as the assignments were made sufficiently randomly (hence the issue with the 64 buckets)
This would work for the purpose of enabling "subnet sampling". What I don't like about it is that we pay the bandwidth cost to have nodes subscribing to
@cskiraly - You've nailed most of the points of this PR related to the improvements with discovery. Essentially we are partitioning the DHT into 64 sub-DHTs that we can search easily, as you've pointed out. One of the biggest points you've missed at the moment is enforceability. We need peers to subscribe to topics for a long period of time in order for the network to work. However, this costs bandwidth and peers have no incentive to do it. I know some node operators change a few lines of code and reduce their bandwidth by 50%+, and their nodes seemingly work better. Subscribing to subnets is kind of like a chore that we need to make everyone participate in. By tagging it to the peer-id, we can enforce nodes to do these subscriptions. It's a long read, but there is history to this PR where we have discussed quite a bit of this.
Quick answers to some of your questions. The current rotation splits nodes up within a rotation period and rotates them at different epochs. We do this so that there isn't a mass rotation of peers all at once from one topic, i.e. if all peers transitioned at the same time, that gossipsub topic would temporarily be unavailable, we'd miss information, and all nodes would be scrambling to find new peers for that topic. The reason we have a rotation at all comes from the original spec for attestation subnets, where we decided to rotate the long-lived subnet subscriptions every 24 hours. I believe the idea was that no single node sits on a single subnet and tries to censor it, i.e. builds up a large peer set all on one topic and doesn't forward messages. I'm aware the rotations don't solve this problem entirely, but it's a nice-to-have.
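The staggering can be sketched as follows (the constant and function names are assumptions, not the PR's definitions): each node gets a per-node epoch offset, so only a thin slice of a topic's subscribers rotates at any given epoch.

```python
EPOCHS_PER_ROTATION = 256  # assumed rotation period

def rotation_offset(node_id: int) -> int:
    # per-node offset inside the rotation period
    return node_id % EPOCHS_PER_ROTATION

def rotation_index(node_id: int, epoch: int) -> int:
    # increments once per period, at an epoch specific to this node, so
    # subscriptions never all flip at the same time
    return (epoch + rotation_offset(node_id)) // EPOCHS_PER_ROTATION
```

When the rotation index changes, the node re-derives its shard/topic assignment; different node-ids cross that boundary at different epochs.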
As there's a lot of conversation going on here, I propose we either have a dedicated call sometime for this PR, as to whether we want it or not, or make a decision in the next consensus call. If there isn't interest in a dedicated call, I'll add it to the agenda of the next consensus call.
@AgeManning thanks for all the pointers! Happy to have a dedicated call (e.g. something Wed CEST afternoon), or to discuss during the consensus call.
Yep, let's discuss in the consensus call. I'm also not convinced on the rotation any more, and it would simplify things dramatically to remove it. (See the consensus call issue.)
I created a PR at #3717 for what I mentioned here.
From earlier conversations, it bears repeating that this PR is no more enforceable than the status quo, due to the possibility for a node to opt out of the mesh instead of the subscription table to save bandwidth - i.e. we can annoy each other but not really address leeches. In this context, it's worth bringing up that light clients also rely on this non-enforced subscription grey zone to operate, so before we venture into enforcement territory, we should settle a policy for those use cases. Indeed, broad enforcement carries with it risks of doing more harm than good.
Sure. People can modify gossipsub and have a 0-mesh degree on some topics. I feel like combating this is more suited to gossipsub scoring and ensuring peers are actually gossiping on the topics they are subscribed to. This PR isn't trying to solve everything; rather, it's a first step towards improving what we currently have. I'm going to close this PR in favour of the simplest approach, which will be to use just the first prefix bits, in a static way, to define attestation subnet subscriptions without rotation. The current plan for us is to implement this in Lighthouse, and then we can modify the structure going forward to attempt to handle all of the possible complexities that this PR has raised.
Our (Lighthouse) current plan is to adopt the simplest approach in the meantime and then progress from there. Therefore, I'll close this PR in favour of #3735.
Tricky - broadly, it is legit to have a "full" mesh and prune others and not publish anything on subscribed topics (i.e. only occasionally publish some appropriately old IHAVEs, if at all, since IHAVE publishing isn't regular).
This supersedes #3545
Overview
Gossipsub topics require a backbone structure of peers in order to function properly. We have a few transient sets of topics (where nodes are not permanently subscribed). Currently this is the attestation subnets, but with the introduction of DAS, the broadcast columns also, and potentially the sync committee subnets if we wish to improve their structure in the future.
We need peers to stay subscribed to these topics long-term for a healthy network. See #3545 and related issues around these requirements.
This PR introduces the concept of a "network shard" which is just an abstract concept that tags a node-id to a number (a network shard). We can then use this network shard (number) to assign topics that a node must subscribe to for a long period of time.
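For illustration only (the PR defines the real compute_network_shard; the body below is an assumed prefix-bucketing with a simple rotation, not the spec's definition):

```python
NETWORK_SHARD_COUNT = 64   # shards drawn from 6 bits of the 256-bit node-id
EPOCHS_PER_ROTATION = 256  # assumed rotation period

def compute_network_shard(node_id: int, epoch: int) -> int:
    # tag the node-id to a shard via its top 6 bits, rotated over time
    prefix = node_id >> (256 - 6)
    return (prefix + epoch // EPOCHS_PER_ROTATION) % NETWORK_SHARD_COUNT
```

Topic assignment then becomes a pure function of the shard number, e.g. shard -> attestation subnet(s) and shard -> DAS column(s).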
Benefits of this approach
There are a number of benefits outlined in #3545 - mainly about the function we use to map node-ids to network shards. In short they are:
These benefits are solely related to how we map the node-id to a topic (which can be done without this network shard concept).
One of the main benefits of the network shard approach is that all transient topics are now linked to specific network shards. One problem we have faced in Lighthouse is: given a set of peers, how do we choose which ones to keep in order to get a balanced and uniform distribution of nodes subscribed to all subnets? This problem is relatively easy to solve because it's mostly one-dimensional (i.e. if we have too many peers on one subnet, remove some of them in favour of peers on another). However, once we introduce DAS columns, the problem becomes harder. If many peers are on one attestation subnet, but individually they are the only ones on a DAS column that we need, it is no longer straightforward which to remove and which to keep. If we couple all the transient topics to a network shard, then all clients will only have to optimise their peer set to be uniformly distributed based on network shards, and they will be guaranteed to have a uniformly distributed set across all transient topics (i.e. the problem is one-dimensional for all transient topics we add).
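A minimal sketch of the one-dimensional pruning criterion this enables (the function name is hypothetical):

```python
from collections import Counter

def peer_to_prune(peer_shards: dict):
    # peer_shards: {peer_id: network_shard}. To stay uniform across shards,
    # prune a peer from the currently most over-represented shard; this one
    # criterion covers every transient topic coupled to the shards.
    counts = Counter(peer_shards.values())
    fullest_shard, _ = counts.most_common(1)[0]
    for peer, shard in peer_shards.items():
        if shard == fullest_shard:
            return peer
```

Without the coupling, a client would instead have to weigh attestation-subnet coverage against DAS-column coverage peer by peer.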
Another nice side effect is that it simplifies implementation. We have one function with all these nice properties, so we don't have to keep re-inventing it each time we add transient topics - especially since when we re-make it, we usually want the same properties.
Downsides of this approach
The biggest downside is that we necessarily have to couple the rotation periods (i.e. DAS column peers cannot transition to different DAS columns over a different period from the attestation subnets). If we de-couple the rotation periods (which I originally considered), then we lose the nice benefit mentioned above and it becomes much harder to maintain a balanced peer set at any given time (at least I think).
My thinking is that the rotation periods are not necessarily a hard requirement here. For DAS I believe we're thinking on the order of the pruning period (~18 days), and for the attestation subnets it is around 27 hours. The requirement from the DAS side is for reconstruction; however, the network shards are calculable, i.e. for any epoch we can calculate a node's shards, so we can know at any given time which node had which column. In principle I think we could reconstruct blobs even if the shard rotation period was smaller than the 18 days. Likewise, I don't know of a hard limit keeping the attestation subnet rotation period at 27 hours.