Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network shards (Attnet Revamp + DAS Distribution Columns) #3623

Closed
wants to merge 13 commits into from

Conversation

AgeManning
Copy link
Contributor

This supersedes #3545

Overview

Gossipsub topics require a backbone structure of peers in order to function properly. We have a few transient sets of topics (where nodes are not permanently subscribed). Currently this is the attestation subnets, but with the introduction of DAS, the broadcast columns also. Potentially also sync committee subnets if we wish to improve their structure in the future.

We need to enforce peers to be long-lived subscribed to these topics for a healthy network. See #3545 and related issues around these requirements.

This PR introduces the concept of a "network shard" which is just an abstract concept that tags a node-id to a number (a network shard). We can then use this network shard (number) to assign topics that a node must subscribe to for a long period of time.

Benefits of this approach

There are number of benefits outlined in #3545 - which are mainly about the function that we use to map node-id to network shard. In short they are:

  • Enforceable - We know which nodes should be subscribed to which topics
  • Faster discovery searches - We can search node-id prefixes in discovery to find peers on specific topics
  • Smooth subnet transitions - The function facilitates smooth rotations of peers between topics

These benefits are solely related to how we map the node-id to a topic (which can be done without this network shard concept).

One of the main benefits of the network shard approach is that all transient topics are now linked to specific network shards. One problem we have faced in lighthouse, is that given a set of peers, how do we choose which ones to keep in order to get a balanced and uniform distribution of nodes subscribed to all subnets. This problem is relatively easily solvable because it's mostly one-dimensional (i.e if we have too many peers on one subnet, remove some of them in favour of peers on another). However, once we introduce DAS columns, the problem becomes harder. If many peers are on one attestation subnet, but individually they are the only ones on a DAS column that we need, it is no longer straight forward as to which to remove and which to keep. If we couple all the transient topics to a network shard, then all clients will only have to optimise their peer set to be uniformly distributed based on network shards and they will be guaranteed to have a uniformly distributed set across all transient topics (i.e the problem is one dimensional for all transient topics we add).

Another nice side effect is that is simplifies implementation. We have one function with all these nice properties, we don't have to keep re-inventing it each time we add transient topics, especially when we re-make it we usually want the same properties.

Downsides of this approach

The biggest downside is that we necessarily have to couple the rotation periods (i.e DAS columns peers transition to different DAS columns over a different period to the attestation subnets). If we de-couple the rotation periods (which I originally considered), then we loose the nice benefit mentioned above and it becomes much harder to maintain a balanced peer-set at any given time (at least I think).

My thinking is that the rotation periods are not necessarily a hard requirement here. For DAS I believe we're thinking the order of the pruning period (~18 days) and the attestation subnets is around 27 hours. The requirement from the DAS side is for reconstruction, however the network shards are calculable, i.e for any epoch we can calculate a node's shards so we can know at any given time which node had which column. In principle I think we could reconstruct blobs even if the shard rotation period was smaller than the 18 days. Alternatively, I don't know of the hard limit keeping the attestation subnet rotation period of 27 hours.

configs/mainnet.yaml Outdated Show resolved Hide resolved
@Nashatyrev
Copy link
Member

@AgeManning thanks for making this PR 👍

I think this number only becomes an issue if we want to introduce more than 64 topics and have individual granularity.

From my perspective

  • if more than 64 topics is required: it could simply be shard = subnet % 64. I'm not sure any major issues are expected due to modulo bias (if subnet count is not multiple of 64), however if there are several such topics then it could probably be desirable to balance shard throughput. All peers would serve the same throughput in average due to rotation, however throughput of a single peer may change at rotations.
  • if less than 64 topics are required: I see 2 options here:
    • If a topic doesn't require kind of strengthened security, then just assign 1 shard per subnet, leaving other shards vacant. In that case we also need to balance throughput across shards
    • Assign >1 shards per subnet. IMO this option would complicate peer management since a node would need manage connections to the peers of its own shard and also manage connections to peers from the neighbor shard(s) to smoothly serve a subnet assigned to this shard and 1 or more other shards.

@Nashatyrev
Copy link
Member

Alternatively, I don't know of the hard limit keeping the attestation subnet rotation period of 27 hours.

I still hope we would mix in randao to the shard() function at some point to eliminate sibyl attack surface on a shard. For this purpose it makes sense to keep rotation period short enough.

@Nashatyrev
Copy link
Member

Do we potentially want to provide the ability for a node to voluntary participate on a number of additional network shards? This feature was a part of PeerDAS proposal

@jimmygchen
Copy link
Contributor

If a topic doesn't require kind of strengthened security, then just assign 1 shard per subnet, leaving other shards vacant

@Nashatyrev What do you mean by leaving other shards vacant? do you mean that that the vacant shard won't be assigned any columns for the period? if that's the case, wouldn't it make retrieving columns harder, because nodes in those shards won't serve any columns? (and they'd also not be very useful to other peers for the period)

Thinking in terms of 2D PeerDAS, is there any benefit for a node to custody both rows and columns? Was wondering if there's a 3rd option - instead of leaving the other shards vacant, could they be assigned row subnets instead? (could be messy mixing columns and row though, but it came to my mind and I thought I'd mention anyway)

@Nashatyrev
Copy link
Member

@Nashatyrev What do you mean by leaving other shards vacant? do you mean that that the vacant shard won't be assigned any columns for the period?

I actually did't mean any specific topic. Particularly DAS column subnets would most likely occupy all the shards. I was rather meaning for example sync committee subnets (there are just 4 of them). We may decide to simply assign them to just 4 network shards so remaining shards would not serve any sync committee subnets

@cskiraly
Copy link
Contributor

If a topic doesn't require kind of strengthened security, then just assign 1 shard per subnet, leaving other shards vacant

@Nashatyrev What do you mean by leaving other shards vacant? do you mean that that the vacant shard won't be assigned any columns for the period? if that's the case, wouldn't it make retrieving columns harder, because nodes in those shards won't serve any columns? (and they'd also not be very useful to other peers for the period)

Thinking in terms of 2D PeerDAS, is there any benefit for a node to custody both rows and columns? Was wondering if there's a 3rd option - instead of leaving the other shards vacant, could they be assigned row subnets instead? (could be messy mixing columns and row though, but it came to my mind and I thought I'd mention anyway)

I think sharding should go hand-in-hand with the DAS structure, with its own diffusion protocol which should move away (on the long term) form GossipSub. What we really need is a block diffusion protocol (Diffuse?) that is integrated with the concept of the 2D structured, mapped into the bits of a virtual ID space. This ID space can be linked to the nodeID space. With X rows and Y columns, you can easily map them to X+Y bits. Then we can have compact representation of nodes "interest" in given bits, representing specific columns (upper part of bits) or rows (lower part of bits). We can derive these interests systematically, eventually running it through randomization, or signal these with a bitmap of X+Y bits.

@cskiraly
Copy link
Contributor

Handling it as an integrated protocol instead of a series of GS topics would also allows us to have compact representations of HAVE list and alike, or to express "interest" in single segments (X,Y coordinates), or single segments run through some randomization, eventually some randomization including the nodeID, a.k.a. push sampling.

@cskiraly
Copy link
Contributor

Do we potentially want to provide the ability for a node to voluntary participate on a number of additional network shards? This feature was a part of PeerDAS proposal

I think that's needed, since nodes are not equal, neither in their resources, nor in their interest to verify and custody more of the block.

I think instead of selecting one shard, we should randomize the nodeID into the column+row address space a few times, allowing us to select multiple columns and rows. The number of samples we need from the column ID space and from the row ID space can be controlled and advertised by the node, keeping a minimum requirement that allows other nodes to know a base set without looking up the advertised value.

Another topic is voluntary participation "hand-picking" specific columns or rows. I.e. participation is not derived from the NodeID, is is instead like voluntarily joining a GS topic. I think that could be allowed as well, but I can imagine nodes prioritizing peers based on where the interst comes from. Example priority order:

  • minimum custody requirement derived
  • derived from advertised (voluntary) custody
  • expressed interest in column or row

@AgeManning
Copy link
Contributor Author

Thanks everyone for the detailed review so far!

Some of my thoughts:

if less than 64 topics are required: I see 2 options here:

If a topic doesn't require kind of strengthened security, then just assign 1 shard per subnet, leaving other shards vacant. In that case we also need to balance throughput across shards

Assign >1 shards per subnet. IMO this option would complicate peer management since a node would need manage connections to the peers of its own shard and also manage connections to peers from the neighbor shard(s) to smoothly serve a subnet assigned to this shard and 1 or more other shards.

I didn't quite follow the complication here. I thought in the example of 32 column subnets, we could just do something like shard%32 = subnet_column. This would be the second case you mentioned. From a peer management perspective, if I balanced my node such that its peers were uniformly distributed across the shards, then i'd have double the amount in the das subnets and they'd be uniformly distributed. If I need a peer on any column subnet, I can search for 2 shards to find them if needed. I might be missing something?

Do we potentially want to provide the ability for a node to voluntary participate on a number of additional network shards? This feature was a part of PeerDAS proposal

Yep. I've decoupled the shard count from the subnet count to make this a bit easier to do. i.e we have the SHARD_PER_NODE and SUBNET_PER_SHARD variables. This will allow users to increase the subnet subscriptions independently from the DAS columns if they want. In principle, to subscribe to more, locally an implementation can just increase SUBNETS_PER_SHARD to 64 to subscribe to them all. We can do a similar thing for the DAS columns. I'm not sure SHARDS_PER_NODE is necessary, but maybe in the future we want to increase it. Perhaps it can be omitted.

Copy link
Member

@dapplion dapplion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the inline comments in compute_network_shard, it's very clear now

@Nashatyrev
Copy link
Member

I didn't quite follow the complication here. I thought in the example of 32 column subnets, we could just do something like shard%32 = subnet_column. This would be the second case you mentioned. From a peer management perspective, if I balanced my node such that its peers were uniformly distributed across the shards, then i'd have double the amount in the das subnets and they'd be uniformly distributed. If I need a peer on any column subnet, I can search for 2 shards to find them if needed. I might be missing something?

I would like to clearly separate node roles with respect to network shards:

  • server role: if a node is assigned to some shard it should reserve and maintain a number of connections to other nodes in the same shard to provide reliable messages dissemination in gossip topics assigned to this shard. I believe the reasonable number of connections should be > D
  • client role: the same node would require to join other subnets for performing its higher-level logic (sampling, attestation aggregation, etc.). Then number of peers across different shards and their rotation could decided by a concrete implementation in its own flexible means
  • custody role: (could be a part of server role) a node is obligated to custody messages of specific topics in assigned shards for specific custody period and serve req/resp queries. That could potentially be abstracted away from DAS or any other specific applications

My original comment was related to the server role solely.
For example some topic is spanned across shards #0 and #1. If nodes from the shard #0 maintain connections to nodes from the shard #0 only (as a server) and the nodes from the shard #1 do the same then this topic most likely be fragmented. Thus nodes assigned to the shard #0 should also maintain enough connections to the nodes from the shards #1.
It's probably not a major complication but something to keep in mind.

@Nashatyrev
Copy link
Member

Yep. I've decoupled the shard count from the subnet count to make this a bit easier to do. i.e we have the SHARD_PER_NODE and SUBNET_PER_SHARD variables. This will allow users to increase the subnet subscriptions independently from the DAS columns if they want. In principle, to subscribe to more, locally an implementation can just increase SUBNETS_PER_SHARD to 64 to subscribe to them all. We can do a similar thing for the DAS columns. I'm not sure SHARDS_PER_NODE is necessary, but maybe in the future we want to increase it. Perhaps it can be omitted.

I was probably talking about slightly another matter.
By the spec a node is obligated to participate in 2 shards (this includes both forming topics backbone and custody messages)
A node (with high-capacity) may voluntary participate in more than 2 shards. In this case it needs to advertise its intentions via special discovery ENR field. In this case a client node may search suitable shard nodes by both NodeID prefixes and ENR advertisements.

@fradamt
Copy link
Contributor

fradamt commented Apr 9, 2024

EDIT: I thought about this more and chatted with the team.

I think we do want to couple the custody with the long-lived subscriptions because the custody peers will necessarily be long-lived subscribed anyway. We will probably want to pin the rotation period to the custody duration to simplify a bunch of code. So if peers custody columns and have a regular rotation and form the gossipsub backbone, what security properties do we lose by not adding entropy to the rotation?

To simplify, let's even forget the rotation for the moment and just consider a single instance of a mapping of f: node_id -> sets of subnets to participate in. A desirable property is that, whenever data is made available on < 50% of the subnets, only a small fraction of node_ids are such that f(node_id) is a set of subnets that is completely available. In other words, no matter how you select one half of the subnets, most assignments are not contained within that half. This is useful because any node outside of this small fraction would be unable to get all of the data they are looking for according to f(node_id), and so they would immediately consider this data to be unavailable before even trying to sample. In particular, such a node would not vote for the related block.

Whether or not we have this property is entirely dependent on the mapping f, and on the number of nodes that we have. It's essentially just a combinatorial property. As long as there are enough node_ids and f spreads their assignments out well enough, it is not possible to trick a high percentage of the nodes. Note that for this purpose f does not need to be unpredictable or changing over time.

Correct me if I am wrong: in order to achieve the goals of this PR (in particular "all clients will only have to optimise their peer set to be uniformly distributed based on network shards and they will be guaranteed to have a uniformly distributed set across all transient topics"), the mapping needs to be node_ids -> shard -> set of subnets to custody. If that's the case, it doesn't really matter how random the node_ids -> shard mapping is, or how random the shard -> set of subnets mapping is. Either way, there are only 64 possible sets of subnets that a node might be able to custody, which is the same as having only 64 nodes for the purpose of the property above.

We could in principle just use network shards for the backbone of the topics, e.g., if the custody requirement is 8, we could map each shard to 2 subnets, while the other 6 could still be directly pseudorandomly chosen based on the node_id, without going through the node_id -> shard "bottleneck". That would work fine from this security perspective. But then doesn't it fail to achieve the goal of allowing a simple criterion for peer set optimization? If you just optimize for a uniform distribution of shards in your peer set, you are only taking into account a small portion of a node's custody, and possibly kicking out nodes which are in subnets you need.

@fradamt Maybe I am missing something here, why would network shards affect this ? In the current spec, the choice of which column subnet a node custodies is also pseudorandomly determined by the node id. Network shards changes in how we organize this so that we can extend this across topics (column subnets,att subnets and maybe sync subnets). With network shards we would still pseudorandomly decide how the mapping works.

As mentioned above in this message, the issue I see is that this mapping, while pseudorandom, only produces 64 (shard count) possible values.

@@ -136,9 +136,11 @@ MESSAGE_DOMAIN_VALID_SNAPPY: 0x01000000
SUBNETS_PER_NODE: 2
Copy link
Member

@ppopth ppopth Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should SUBNETS_PER_NODE be removed while SUBNETS_PER_SHARD and SHARDS_PER_NODE should be added instead?

@zilm13
Copy link
Contributor

zilm13 commented Apr 9, 2024

EDIT: I thought about this more and chatted with the team.

I think we do want to couple the custody with the long-lived subscriptions because the custody peers will necessarily be long-lived subscribed anyway. We will probably want to pin the rotation period to the custody duration to simplify a bunch of code. So if peers custody columns and have a regular rotation and form the gossipsub backbone, what security properties do we lose by not adding entropy to the rotation?

Say, we have 6400 honest peers, 100 per shard. An attacker generates 1000 PeerIds for 1 exact row (it's cheap), so whenever you are looking peers for this row you get just 1 honest peer out of 11. These nodes couldn't block reconstruction thanks to the columns, because it will require 1/2 of nodes even assuming everyone subscribes to only 1 column, and you still have the same number of honest peers as on any other subnet. But most of your connections for this subnet are not honest nodes (if not all). Reconstruction by columns will take significantly more time and distribution of this row could be delayed. If we cannot confirm DA for 1 blob out of 32, block could be considered delayed and it could lead to reorg. So it could be a cheap reorg attack.

For attestations attack on 1 subnet makes less harm, so it could be a reason it was never performed. Faster rotation doesn't have a lot of value, PeerId generation for this task is very fast. But for example, 2 rows + 2 columns requirement will make it more expensive (not by the factor of 2, a greater value).
Am I wrong here?

Yeah, I'm wrong here, didn't get updated spec. Subnets are column only, so attacker needs to control 50%+ subnets to make any harm.

@cskiraly
Copy link
Contributor

cskiraly commented Apr 9, 2024

Reading through the discussion, what is not clear to me (and maybe here I'm just repeating @fradamt) is the relation between facilitate the support of these topics by long-lived subscribing to them (and thereby validating and forwarding messages), and actually being interested in that topic (would have anyway subscribed to it, passing the information up beyond message validation).

I see two cases here:

1, this is really just a support function at the network overlay level, relaying messages not because these nodes are interested, but because obliged to do so. In this case it does help connect the overlay, but it also adds extra traffic to count with. We can however juggle the ratio of support nodes as we like, it is just a support function providing stable extra nodes, connected between them to form a "backbone". Hopefully not too connected, so that the nodes temporarily joining a topic can fit in their mesh. As a consequence, our overlay graph will be far from a d-regular random graph, but it will be at least connected. The danger I see is a bit in this in-fact, that the topology will be much more a hierarchical topology than we like it to be, but this might be a reasonable compromise for these topics. One could even think of doing some peer exchange to re-randomize it fast, "forcing" links between the short-term nodes as well.

2, shard nodes are "forced" to be also interested in the topic, which means that sharding must influence the topic distribution logic on top of this mechanism. This would mean that network sharding is not a generic tool for topics with fast rotation (or high churn, or similar), but a cross-layer mechanism that is coordinated with some interest assignment logic. In that case, taking the example of applying this to the case of PeerDAS column topics, we are indeed effectively reducing the number of columns we have to the number of shards.

Based on the original description, I tend to see this more as the first option, but I'm curious whether I understood this correctly.

Another thing that puzzles me a bit is why finding peers takes that much time if we have a NodeID -> topic mapping function. Is it the number of peers vs. number of topics, the mesh degree bounds, the warm-up time on mesh links, or something else that makes it so that In order for gossipsub to function, there must be a stable set of peers for each topic that remain subscribed for a long period of time (order 1 day). I would expect the overlay to start function much much faster than 1 day, allowing a relatively fast staggered rotation scheme (if needed) on the NodeID -> topic mapping.

@AgeManning
Copy link
Contributor Author

@fradamt @cskiraly

From what I understand from your messages, I think there is a misunderstanding of what this PR does and how it works in practice. Perhaps it might be useful to align a call to discuss, but i'll also explain here as best I can.

Lets, for now, ignore all the complexities of this PR and remove the idea of shards and just talk about a node-id -> custody column mapping. For this explanation, I'm going to assume the custody columns are also going to form this "backbone gossipsub" structure I've been talking about, which I think is what we want in practice anyway.

At this point, this PR is general with respect to the number of custody columns and the total number of column subnets, but for the sake of simplicity, lets say each node must custody 2 columns and in total there are 64 of them.

Now lets make a node-id mapping that uniformly distributes node-ids to all 64 DAS columns (think node-id mod 64). In fact because the custody requirement in this example is 2, then each node-id maps to exactly 2 columns pseudo-randomly. So every node on the network, now must custody 2 columns.

There are now a few problems in practice we need to solve:

  1. Get the data to custody - As a node on the network, i need to get the data for my two columns. The way to do this is to just subscribe to the columns I am assigned. However, this will only work if I have peers that are also subscribed to these columns, if I have no peers subscribed, I wont get any messages. This brings us to:

  2. Find Peers of my custody columns - I need to find other peers that are also assigned to my columns. In this example I can't find node-ids for a specific column (Because we're using node-id mod 64) but this PR has a way that allows us to find the first few bytes of a node-id to search for. We search for these first few bytes on discovery to find peers that are required to also be subscribed on this column. Now, we are using discovery here. It takes ~30 seconds to search. We have to search through the entire DHT, which contains 10s of thousands of nodes, nodes on all testnets, not just mainnet, nodes that are in the execution layer and even nodes that are not related to ethereum. The data on DHT updates slowly, so traditionally we need people to subscribe for a long period of time, because if they switch quickly it takes a while to propagate information on the DHT and we could constantly be finding old entries/chasing people (this is true for ENRs but probably less true for just node-id searching). In either case, longer rotation periods allow us find peers and allow for slow DHT updates. So lets say we find 3 or 4 peers that are also on our column subnet, then we are good. We have an overlay mesh and we can get our 2 column data samples and our custody requirements are met. I believe we satisfy the property we want, where if 50% of the data is not sent, we are only viewing 2/64=1/32 of all the columns and will never say a block is available when its not, because viewing only 2 of the subnets is not sufficient to make a call here.

  3. Sampling - We need to randomly sample in order to convince ourselves of the block availability. This process must be fast. So if at this point, we only have our peers subscribed to our subnet columns, those peers cannot give us information about any other column. When we sample we will randomly select other columns and need to request them via RPC. However, at this point, we don't have any peers that know about any other columns, so we have two options:
    a) Try and use discovery to find peers on sampled subnet, connect to them and ask for their samples
    b) Try and maintain (lets say 2) peers at all times that span all subnets, so we can just ask them straight up.

In practice, we can't do a), because it takes too much time (and is unreliable, for a variety of reasons I wont mention here), and sampling has to be fast. So we have to do b). Given our node-id to custody column mapping, we can do these discovery process in advance or some other technique to ensure that at all times we always have 2 or so peers that can tell us a sample for every column, so we can sample when we need.

I think one point that might be of confusion here, is that we can connect and maintain a connection with these peers, without subscribing to the topic. So we can have peers on all das columns, but we don't subscribe to that topic, therefore we don't form meshes with them, we dont receive messages from them and those columns are essentially non-existent to us. We only use them to quickly sample via RPC when we need.

Hopefully this makes it clear about the actual workings of clients in the DAS context and provides a bit more context around this PR.

As far as I can tell, I think we still maintain the property that @fradamt wanted. Let me know if you think we should have a call, because I could also be misunderstanding the issue.

@cskiraly
Copy link
Contributor

cskiraly commented Apr 10, 2024

@AgeManning, thanks for the detailed reply.

That you can connect without subscribing is clear to me. Also, it is clear to me that having a node-id -> custody column mapping is good for us, because it is enough to know a number of node-ids to see who to potentially subcribe to.

The part where I'm confused is why you would not keep a reasonably large set of node-ids and ENRs. Your Discv5 buckets already have a large number of them, also doing liveness check every now and then through UDP.

Now, the specific mapping you define is directly based on node-id bits. With this I think you have options with the following two extremes:
- use top bits, as you define here: In this case, with the default implementation, only your top bucket will be useful, which is small, but you can use cheap lookups (or maybe you can bump the size of that bucket?)
- use low bits: in this case all buckets are useful, but if there happens to be none mod 64 for a specific topic, you can only do random walk search.
I have to rethink this part, checking again the defined permutation. I might be fully misunderstanding something here.

Dialing a peer to get a sample should not be so difficult, if we do not require peer scoring to play a large role there. I wonder whether we need peer scoring in that context.

I'm a bit limited in call availability as I will have ZKSummit during the day, but I can try. I'm on GMT+3

@AgeManning
Copy link
Contributor Author

AgeManning commented Apr 10, 2024

@cskiraly

The part where I'm confused is why you would not keep a reasonably large set of node-ids and ENRs. Your Discv5 buckets already have a large number of them, also doing liveness check every now and then through UDP.

This conversation is probably starting to deviate from this PR and maybe we move this conversation to a DM. I'll give just short answers here and longer answers elsewhere.

We do store large amounts of node-ids and ENRs in our routing table in discovery. The majority of these could be peers not related to ethereum, the consensus or even useful to us at all (the reason we do this is for security and the desire to have a very large peer set in the DHT). Its a statistical game, but really we only have peers on like 10 or so buckets, so lets so on average every node will store like 160 Enr's or Node-ids (16 per bucket if they are full). Most of these are irrelevant to us (by design).

Dialing a peer to get a sample should not be so difficult,

I used to think this too in the early days. Let me highlight the main things that go wrong (very commonly). Here are all the steps that necessarily have to pass in order to successfully dial a peer and get a sample:

  1. Have to find them on discovery (this in itself isn't straight forward for a number of security reasons)
  2. Their ENR must be up to date with correct IP that is contactable
  3. Sometimes they are behind a NAT so IP is not reachable, connection fails
  4. The dial attempt works, we dont miss any tcp packets and we negotiate all the required protocols
  5. The peer allows the connection (this fails most of the time, because clients have a peer limit which is typically full)
  6. The peer is fully synced, on our chain and our fork and their ENR isn't out of date such that they are actually subscribed to the subnets their ENR says they are
  7. Download the sample.

Its more often the case that 1 of these 7 steps fail than all of them succeeding. From experience its not a reasonable strategy to assume we can accomplish all these steps in a timely manner and have them succeed.

@AgeManning
Copy link
Contributor Author

I've modified this PR to decouple the rotation periods.

Because we are decoupling the rotation periods, the concept of a "shard" becomes more ambiguous (due to different topics rotating differently, so the "shard" for each topic are no longer linked). Therefore the whole concept of a "shard" becomes more of a confusing concept than a helpful one.

Instead, we just have a function that maps node-ids to topics. We use this function for attnets and das columns. And implementers can choose to keep a uniform distribution of node-id prefixes, which will provide (on average) uniform distribution over both topics.

@ppopth
Copy link
Member

ppopth commented Apr 10, 2024

@fradamt I thin I understand what you are trying to say.

To simplify, let's even forget the rotation for the moment and just consider a single instance of a mapping of f: node_id -> sets of subnets to participate in. A desirable property is that, whenever data is made available on < 50% of the subnets, only a small fraction of node_ids are such that f(node_id) is a set of subnets that is completely available. In other words, no matter how you select one half of the subnets, most assignments are not contained within that half. This is useful because any node outside of this small fraction would be unable to get all of the data they are looking for according to f(node_id), and so they would immediately consider this data to be unavailable before even trying to sample. In particular, such a node would not vote for the related block.

I think what you assume is that the validators will not actively request the samples using RPC (Req/Resp) as @AgeManning explained and specified in the current PeerDAS spec, but instead passively receive the samples from the subnets of columns they are supposed to custody and consider that the data is available if they receive all the samples.

If that's the case, what you said makes sense.

I guess many people in this PR haven't heard about this idea yet and that's why it caused confusion here. It's not even specified in the spec yet.

Correct me if I am wrong: in order to achieve the goals of this PR (in particular "all clients will only have to optimise their peer set to be uniformly distributed based on network shards and they will be guaranteed to have a uniformly distributed set across all transient topics"), the mapping needs to be node_ids -> shard -> set of subnets to custody. If that's the case, it doesn't really matter how random the node_ids -> shard mapping is, or how random the shard -> set of subnets mapping is. Either way, there are only 64 possible sets of subnets that a node might be able to custody, which is the same as having only 64 nodes for the purpose of the property above.

Is it really a problem on the number of possible sets of subnets that a node can custody? I think what matters is the sets of subnets that a node will be subscribed to to receive the samples?

The two collections of sets don't have to be the same.

Of course, the nodes have to custody all the samples that are in the subnets of the shard assigned to them because shards are backbones and they have to serve samples to others, but they can also receive samples from other subnets in other shards as well.

What they can do is to subscribe to the subnets in other shards a few seconds before the samples are expected to arrive and unsubscribe a few moments later. This can be done quite fast because every node is already connected to nodes in every shard.

So I think the function, f: node_id -> sets of subnets to receive samples, can be uniformly random.

Please let me know if I misunderstand anything.

each topic that remain subscribed for a long period of time (order 1
day). This allows nodes to search for and maintain a selection of these
long-lived peers in order to publish/receive messages on the topic. As some
topics are transient for most nodes (i.e attestation subnets, DAS-related
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to mention DAS in the phase0 spec?

@AgeManning
Copy link
Contributor Author

I've spoken with @fradamt out of this channel.

To summarise our conversation:

The problem @fradamt is raising is that this PR is essentially bucketing all node-ids into 64 buckets (what we are calling the prefix's or shards). We are doing this to make these nodes easily searchable in discovery. The problem is that there are only 64 unique combinations of subnets all node-ids will subscribe to. Therefore an attacker can send 50% of samples and because there are only 64 unique combinations of column subscriptions they will satistically convince a large portion of these nodes that the data is there. If there were more combinations, it would be harder to trick a large group of the network.

So essentially we need a larger range combinations to map node-ids to column subnets.

Our current thinking therefore is to stick with this PR in that the first 6 bits can be used to map to a number of column subnets (i.e 2), then use the remaining bits (ones that won't be searchable) to specify extra columns. The first bits should be sufficient for us to easily search and build a backbone structure and the latter bits will be enough to split the nodes into more unique combinations of subscriptions to prevent the attack mentioned above.

I believe this PR (which targets phase 0) may be alright as it is, and another PR specifically related to DAS will have to account for these extra column subnets.

@cskiraly
Copy link
Contributor

So what you have here, in my understanding, is that you have N columns (is this NETWORK_TOPIC_COUNT?), and for each node you want to select L of these (I’m not sure what’s the name of this parameter in your description).
For the probabilistic arguments to work, you want to select these for each node independently, with uniform random selection, ideally without replacement.

Now this would be easy if you would just roll a dice a few times in every node and publish the selection somehow. However, we want to introduce a few tricks:

  1. we do not want to deal with publishing (and updating and retrieval) of the column selection, so we want to derive it from public data (node-id, ENR content, epoch, randao, etc.) This allows us to reuse our set of connected peers without an explicit “update of column selection” message. It also allows us to reuse our set of known (but not connected) peers. It makes GS setup and sample query much faster.
  2. If and when we need to search for new peers for a specific column, we want this to be better than a random walk on the Kademlia.
  3. Finally, we do not want a static allocation. We want a rotation of the selection, and we want this to be de-synched between nodes to avoid churn induced problems on a specific column topic.

Forgetting a little bit about the rotation aspect, the randomness required basically means that each node has to pick a random point in an N^L space (this is still with replacement, but we can deal with that later if we want). We want to connect this to the hierarchical Kad address space. A natural way to obtain this is to just simply map to the first L*log2(N) bits of the node-id (Kad address space). You can’t really use less bits, because you do need that amount of randomness. So we do an ordered selection, simply treating your node-id’s prefix bits as a column selection vector <C1,C2, .. CL>.

With this you achieve fast peer candidate search in Kad. Say you are looking for candidates for column C

  • you can immediately check your connected peers and known peer set
  • You can then look in Kad for nodes with a prefix of C. The nice thing about this is that of all nodes having C in their random selection, 1/L of them will be interested in C1 position, as their “first random selection”, so we get to them easily.
You can do a special Kad lookup with any address with the prefix C, and you start getting a stream of node-ids all starting with C. I’m saying special because you need a lookup primitive that “leeks” all addresses encountered, because you are not searching for the neighbours of the address, but for anything on the path. Anyway, that’s there in all implementation, even if not exposed. You can even do a BFS query, or a prefix constrained random walk, if you want. Discv5 FINDNODE with its restricted log distance is ideal for this.
  • If you really need to (but I don’t think you need it with PeerDAS numbers), you can also construct a special query for those that are interested in C as their second selection (C2=C). These are the nodes that have their 7..12 bits as C. This is not a typical Kademlia network primitive, but such a collect query can be made, and it still benefits from being prefix restricted.

Now your probabilistic space is fine (although we still do selection with replacement). But you still have an issue from the networking perspective, because your topic network now have a hierarchy. Nodes that selected C as their first selection are in a privileged position, forming a backbone that everyone tries to connect to. Similarly, the later in your column selection vector you have C, the less nodes try to connect to you because of it. But there is enough randomness in the lookups that I think this is not such a big issue for this version. Worth keeping in mind for later version, where we might want to re-randomize the graph with peer exchange and alike.

So we’ve solved 1 and 2, but we still need 3, the rotation. Here the real question is IMHO why we are doing it in the first place, and I don’t think that had been sufficiently clarified.
Anyway, the challenge is now to change individual positions in the selection vector <C1,C2, .. CL>, but still keep it related to the Kad structure.
One thing we have to decide here is whether we want nodes to “move together” between columns, or we want them to jump individually. E.g. take the nodes that have the same 6 bit prefix. Do we want them to be always on the same column because of this? My answer is no, because that would just mean we are “renaming” the columns every now and then. We would name that column “11” for some epoch, and then we would call it “33”, but it would not really change what happens. And this is my problem with the current rotation scheme, if I understand it correctly. I think you are just renaming columns every now and then, which does not add any extra guarantee. To be more precise, if we add the selection vector position (the 1 in C1 and the 2 in C2) in the function, we do shuffle a bit, but nodes still move in groups.

Rotation is becoming tricky because the only thing that differentiates your node from other nodes is the node-id. Everything else is global. (We can add things in the ENR, but that’s fundamentally just like having a bigger node-id). If we derive something deterministically from the same bits of the node-id, it will select the same nodes all the time, independent of how we randomise, shuffle, etc. We do need to include other bits that differentiate nodes to make nodes jump individually between topics, and the more bits we add, the more individual this jumping behaviour will be.

One idea here is to e.g. use the first 6 + S bits and the epoch to decide the first column selection (6 bits). Mapping from that 6+S bit ID space to a 6 bit ID space allows us to avoid keeping fixed groups, if that’s what we want to. If S is larger than log2 of the number of nodes, nodes essentially move independent from each other. If not, they are somewhat grouped. We also want this mapping to be easily reversible, so that we know what to search for in the Kad address space. We would end up with any specific C mapped to a few or a large amount of prefixes, but still prefixes, which can be used for Kad search.

Another idea is to do the shuffling period differently for different columns selection positions. Rotate C1 slowly (or maybe keep it fixed?) giving a stable 1/NETWORK_TOPIC_COUNT portion of nodes on the same topic, rotate C2 faster, C3 even faster, etc.

But really, I think the base question is what we want to achieve with the rotation.

@cskiraly
Copy link
Contributor

cskiraly commented Apr 12, 2024

Our current thinking therefore is to stick with this PR in that the first 6 bits can be used to map to a number of column subnets (i.e 2),

So you still want to use 6 bits to select 2 out of 64 (or from 128?). That's still limiting the space of combinations. Why would we do that instead of just using enough bits?

then use the remaining bits (ones that won't be searchable) to specify extra columns.

They are actually searchable, just way more complicated, and would need Discv5 changes. Anyway, I think we agree on this.

@Nashatyrev
Copy link
Member

Therefore an attacker can send 50% of samples and because there are only 64 unique combinations of column subscriptions they will satistically convince a large portion of these nodes that the data is there.

Every node has 2 separate roles:

  1. propagator : this is the role we are talking about here in this PR. A node just joins subnets derived from its nodeID and helps with samples dissemination. It has almost nothing to do with DAS
  2. sampler : as prescribed by DAS a node does random sampling from all the selected subnets. The selected random subnets doesn't depend on the subnets a node is subscribed in its propagator role.

Sorry, I'm not following how performing propagator duties affects the security of sampling ?

@fradamt
Copy link
Contributor

fradamt commented Apr 12, 2024

Therefore an attacker can send 50% of samples and because there are only 64 unique combinations of column subscriptions they will satistically convince a large portion of these nodes that the data is there.

Every node has 2 separate roles:

  1. propagator : this is the role we are talking about here in this PR. A node just joins subnets derived from its nodeID and helps with samples dissemination. It has almost nothing to do with DAS
  2. sampler : as prescribed by DAS a node does random sampling from all the selected subnets. The selected random subnets doesn't depend on the subnets a node is subscribed in its propagator role.

Sorry, I'm not following how performing propagator duties affects the security of sampling ?

It is currently unclear whether peer sampling will be required before voting in the current slot, or whether it will only be required after one slot or something like that. If that's the case, this opens up lots of problems. The most straightforward solution involves leveraging subnet sampling, i.e., getting some security guarantees directly from the distribution phase. An example of this is Dankrad's idea that validators custodying 2 rows and 2 columns in Danksharding would make it so that at most 1/16 (because 16 = 2^4) of the honest validators can be made to vote on an unavailable proposal. See here for a longer explanation of the problem and the suggested solution. For more context, this idea is also what SubnetDAS was based on. Importantly, for this purpose (ensuring that at most X% of the nodes can be tricked), it is irrelevant whether it is predictable what nodes are sampling. They can even "sample" (i.e., subscribe to) the same subnets all the time, as long as the assignments were made sufficiently randomly (hence the issue with the 64 buckets)

Of course, the nodes have to custody all the samples that are in the subnets of the shard assigned to them because shards are backbones and they have to serve samples to others, but they can also receive samples from other subnets in other shards as well.

What they can do is to subscribe to the subnets in other shards a few seconds before the samples are expected to arrive and unsubscribe a few moments later. This can be done quite fast because every node is already connected to nodes in every shard.

This would work for the purpose of enabling "subnet sampling". What I don't like about it is that we pay the bandwidth cost to have nodes subscribing to backbone_count subnets and to subnet_sampling_count subnets, but only backbone_count subnets are actually helpful in peer sampling, because they are the ones that are advertised by the node id. That seems like a big waste, so the subnet_sampling_count subnets should also be derived from node id imo, though using more bits.

@AgeManning
Copy link
Contributor Author

@cskiraly - You've nailed most of the points of this PR related to the improvements with discovery. Essentially we are partitioning the DHT into 64 sub-dht's that we can search for easily as you've pointed out.

One of the biggest points you've missed out at the moment is enforceability. We need peers to subscribe to topics for a long period of time in order for the network to work. However, this costs bandwidth and has no incentive for a peer to do. I know some nodes change a few lines of codes and reduce their bandwidth by 50%+ and their nodes work ideally better. Subscribing to subnets is kind of like a chore that we need to make everyone participate in. By tagging it to the peer-id, we can enforce node's to do these subscriptions.

It's a long read, but there is history to this PR where we have discussed quite a bit of this.
It might be helpful to read these in order:

Quick answers to some of your questions.

The current rotation splits nodes up in a rotation period and rotates them at different epochs. We do this so that there isn't a mass rotation of peers all at once from one topic. i.e If all peers transitioned at the same time, that gossipsub topic would temporarily be unavailable, we'd miss information and all nodes would be scrambling to find new peers to fit that topic.

I reason we have a rotation at all, comes from the original spec with attestation subnets where we decided to rotate the long-lived subscription for subnets to 24 hours. I believe the idea was that no single nodes sit on a single subnet and try to censor it. I.e build up a large peer set all on one topic and don't forward messages. I'm aware the rotations don't solve this problem entirely but its a nice-to-have.

@AgeManning
Copy link
Contributor Author

As there's a lot of conversation going on here. I propose either we have a dedicated call sometime for this PR as to whether we want it or not, or make a decision in the next consensus call.

If there isn't interest in a dedicated call, i'll add it to the agenda on the next consensus call.

@cskiraly
Copy link
Contributor

@AgeManning thanks for all the pointers!
I still have a few concerns (about actual enforcement, why we rotate, why 6 bits, etc.), but I want to read though everything to avoid repeating things already discussed, or not part of the PR.

Happy to have a dedicated call (e.g. something Wed CEST afternoon), or to discuss during the consensus call.

@AgeManning
Copy link
Contributor Author

@AgeManning thanks for all the pointers! I still have a few concerns (about actual enforcement, why we rotate, why 6 bits, etc.), but I want to read though everything to avoid repeating things already discussed, or not part of the PR.

Happy to have a dedicated call (e.g. something Wed CEST afternoon), or to discuss during the consensus call.

Yep, lets discuss in the consensus call. I'm also not convinced on the rotation any more and it would simplify things dramatically to remove it. (See the consensus call issue)

@ppopth
Copy link
Member

ppopth commented Apr 23, 2024

I think what you assume is that the validators will not actively request the samples using RPC (Req/Resp) as @AgeManning explained and specified in the current PeerDAS spec, but instead passively receive the samples from the subnets of columns they are supposed to custody and consider that the data is available if they receive all the samples.

I created a PR at #3717 for what I mentioned here

@arnetheduck
Copy link
Contributor

Enforceable - We know which nodes should be subscribed to which topics

From earlier conversations, it bears repeating that this PR is no more enforceable than the status quo due to the possibility for a node to opt out from the mesh instead of the subscription table to save bandwidth - ie we can annoy each other but not really address leeches.

In this context, it's worth bringing up that light clients also rely on non-enforced subscription grey zone to operate, so before we venture into enforcement territory, we should settle a policy for those use cases. Indeed, broad enforcement carries with it risks of doing more harm than good.

@AgeManning
Copy link
Contributor Author

Sure. People can modify gossipsub and have a 0-mesh degree on some topics.

I feel like combating this, is more suited to gossipsub scoring and ensuring peers are actually gossiping on the topic they are subscribed to.

This PR isn't trying to solve everything, rather be a first step towards improving what we currently have.

I'm going to close this PR in favour of the simplest approach, which will be to use just the first prefix bits, in a static way to define attestation subnet subscriptions without rotation.

The current plan is for us is to implement this in Lighthouse and then we can modify the structure going forward to attempt to handle all of the possible complexities that this PR has raised.

@AgeManning
Copy link
Contributor Author

Our (Lighthouse) current plan is to adopt the simplest approach in the meantime and then progress from there.

Therefore, I'll close this PR in favour of #3735

@AgeManning AgeManning closed this Apr 30, 2024
@arnetheduck
Copy link
Contributor

I feel like combating this, is more suited to gossipsub scoring and ensuring peers are actually gossiping on the topic they are subscribed to.

Tricky, broadly it is legit to have a "full" mesh and prune others and not publish anything on subscribed topics (ie only occasionally publish some appropriately old ihaves, if at all since the ihave publishing isn't regular)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.