Reorg Support - Goals and Design #13837

Open

axelKingsley (Contributor) opened this issue Jan 17, 2025 · 0 comments

Here are some thoughts about Reorg Handling in the Supervisor

Types (sources) of Reorgs:

Unsafe Reorg

In this situation, the Supervisor is syncing nodes and making forward progress. A Node in Managed Mode receives an updated unsafe block from the gossip network and drops the old unsafe block it was carrying.

When this happens, a new unsafe block notification will travel to the Supervisor, and the Supervisor will find that it holds conflicting log data for the given block.

There are two ways to consider this disagreement:

  1. The Supervisor is always the source of truth, and so the Supervisor should reset the Node and expect it to return to the originally indexed data
  2. Unsafe data is arbitrary, and if a node replaces old unsafe data with new unsafe data, that new data is more likely to be correct

My vote is for option 2: unsafe data cannot be considered canonical against any L1 because it is only delivered over the gossip network.
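
To make this concrete, here's a minimal sketch of what "new unsafe data wins" could look like on the Supervisor side. The types and method names are made up for illustration, not the real op-supervisor API:

```go
// Sketch only: type and method names are illustrative, not the actual
// op-supervisor API.
package reorg

import "log"

type BlockRef struct {
	Number uint64
	Hash   string
}

// unsafeIndex stands in for the Supervisor's per-chain unsafe log index.
type unsafeIndex struct {
	blocks map[uint64]BlockRef // height -> sealed unsafe block
}

// onUnsafeBlock follows option 2 above: if the node gossips an unsafe block
// that conflicts with what we already indexed, the new data wins and the
// index is rewound from the conflicting height before the block is accepted.
func (idx *unsafeIndex) onUnsafeBlock(incoming BlockRef) {
	if stored, ok := idx.blocks[incoming.Number]; ok && stored.Hash != incoming.Hash {
		log.Printf("unsafe reorg at height %d: replacing %s with %s",
			incoming.Number, stored.Hash, incoming.Hash)
		// Drop the old unsafe block and everything indexed on top of it.
		for n := range idx.blocks {
			if n >= incoming.Number {
				delete(idx.blocks, n)
			}
		}
	}
	idx.blocks[incoming.Number] = incoming
}
```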

Local Safety can't be promoted to Cross Safety

In this situation, a block is derived from the L1 and is recorded into the Local Derivation DB. Because the block contains some cross dependency, Cross Safety doesn't advance through the block. Then new data arrives for the cross-dependencies, revealing that the block we recorded as Locally Safe is actually invalid because the interop claims it made are incorrect.

When this happens, the Supervisor is the first to realize it and must correct the node. The node must replace the invalid block with a deposit-only block. Furthermore, the Supervisor must ensure it doesn't re-consume the same Locally Safe data, which would cause this issue to loop.

This work has been started here: #13645
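
As a rough illustration of the two obligations above (replace the block deposit-only, and don't re-consume the invalidated data), here is a sketch with placeholder interfaces; the real node and DB APIs will look different:

```go
// Sketch only: these interfaces are placeholders for the managed-mode node
// API and the local-safe DB, not their real signatures.
package reorg

// managedNode is what the Supervisor needs from the node for this flow.
type managedNode interface {
	// ReplaceInvalidBlock asks the node to rebuild the block at this height
	// as a deposit-only block.
	ReplaceInvalidBlock(height uint64) error
}

// localSafeDB is the Supervisor-side record of locally-safe blocks.
type localSafeDB interface {
	RewindTo(height uint64) error                     // drop blocks above height
	MarkInvalidated(height uint64, hash string) error // remember the bad block
}

// invalidateLocalSafe handles a locally-safe block that cross-safety checks
// prove invalid: rewind past it, record the invalidation so derivation does
// not re-admit the same data, then tell the node to replace it deposit-only.
func invalidateLocalSafe(db localSafeDB, node managedNode, height uint64, hash string) error {
	if err := db.RewindTo(height - 1); err != nil {
		return err
	}
	if err := db.MarkInvalidated(height, hash); err != nil {
		return err
	}
	return node.ReplaceInvalidBlock(height)
}
```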

L1 change

In this situation, all derivation and syncing is happening correctly, and then the L1 itself reorgs. Now all the data used to derive blocks is called into question.

When this happens, all Unsafe, Local and Cross data must be evaluated. Local and Cross Derivation Data will certainly need to be purged, as the L2 blocks now derive from different L1 blocks. And it is possible that the L1 reorg changed the L2 reality, invalidating unsafe data as well.
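
Roughly, the derivation-side bookkeeping could look like the following sketch: walk back from the previous L1 head until we find an L1 block we still agree on, then purge everything derived from the orphaned section. The interfaces are placeholders, not real components:

```go
// Sketch only: the interfaces below are illustrative stand-ins for an L1
// client and the Supervisor's derivation DB.
package reorg

import "errors"

type l1Source interface {
	// HashByNumber returns the canonical L1 block hash at the given height.
	HashByNumber(n uint64) (string, error)
}

type derivationDB interface {
	// DerivedFromHashAt returns the L1 hash we recorded as the derivation
	// source at the given L1 height.
	DerivedFromHashAt(n uint64) (string, error)
	// RewindToL1 drops all local- and cross-safe entries derived from L1
	// blocks above n.
	RewindToL1(n uint64) error
}

// onL1Reorg walks back from the previous L1 head until the recorded
// derived-from hash matches the canonical chain again, then purges everything
// derived from the orphaned L1 section. Unsafe data is re-checked separately.
func onL1Reorg(l1 l1Source, db derivationDB, prevL1Head uint64) (uint64, error) {
	for n := prevL1Head; ; n-- {
		local, err := db.DerivedFromHashAt(n)
		if err != nil {
			return 0, err
		}
		canonical, err := l1.HashByNumber(n)
		if err != nil {
			return 0, err
		}
		if local == canonical {
			return n, db.RewindToL1(n)
		}
		if n == 0 {
			return 0, errors.New("no common L1 ancestor found")
		}
	}
}
```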

Desync

This covers any situation in which the Node and Supervisor are working correctly, and then through some outside means (a restart, cosmic radiation, operator error) the database becomes inconsistent.

These situations are varied, so the recovery path is varied as well, but generally we will need to determine what data is missing or incorrect, prune the databases back to that point, and resync.
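
One generic tool that helps here is a divergence search between the node's view and the Supervisor's view. A sketch, assuming both sides can report a block hash by height (the interface is a placeholder):

```go
// Sketch only: a generic way to locate the divergence point between two
// views of the same chain (node vs. Supervisor), assuming each can report a
// block hash by height. The interface is a placeholder.
package reorg

import "sort"

type hashSource interface {
	HashByNumber(n uint64) (string, error)
}

// lastConsistentHeight returns the highest height at or below head where the
// two views still agree; everything above it should be pruned and resynced.
// This assumes agreement is a prefix property (a single divergence point),
// which lets us binary-search instead of scanning block by block.
func lastConsistentHeight(a, b hashSource, head uint64) uint64 {
	firstBad := sort.Search(int(head)+1, func(i int) bool {
		ha, errA := a.HashByNumber(uint64(i))
		hb, errB := b.HashByNumber(uint64(i))
		return errA != nil || errB != nil || ha != hb
	})
	if firstBad == 0 {
		return 0 // diverged at (or before) genesis
	}
	return uint64(firstBad) - 1
}
```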

Thinking about Reorgs Architecturally

It would really suck if we had to implement N different reorg handlers based on the N different situations above. Really, they all fall into the same basic behavior, where the inconsistency is identified, the bad data is destroyed, and the Node and Supervisor are resync'd to canonical data.

We can think of maintaining canonicity as an activity in two parts:

  • Allow valid data
  • Don't allow invalid data

We achieve the former through the on-write consistency checks to the ChainsDB which already exist: when new data is added to the event and derivation databases, it must build upon the previously recorded blocks.
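
For illustration, an on-write check of that kind boils down to something like this (the type is a placeholder, not the actual ChainsDB code):

```go
// Sketch only: a minimal on-write parent check of the kind described above;
// the type is a placeholder, not the actual ChainsDB.
package reorg

import "fmt"

type sealedBlock struct {
	Number     uint64
	Hash       string
	ParentHash string
}

type eventDB struct {
	head *sealedBlock
}

// appendBlock only admits data that builds on the previously recorded block;
// anything else is rejected, which is what surfaces the inconsistencies the
// Invalidator then has to resolve.
func (db *eventDB) appendBlock(b sealedBlock) error {
	if db.head != nil && (b.Number != db.head.Number+1 || b.ParentHash != db.head.Hash) {
		return fmt.Errorf("block %d (%s) does not build on head %d (%s)",
			b.Number, b.Hash, db.head.Number, db.head.Hash)
	}
	db.head = &b
	return nil
}
```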

We can achieve the latter using a centralized system. I'm imagining this utility as "The Invalidator".

The Invalidator would be a new component of the Supervisor whose entire purpose is to destroy incorrect data between the Node and Supervisor. All of the above cases can generate event signals which flow to the Invalidator and trigger similar workflows:

  • Identify the incorrect data
  • Destroy (prune) the incorrect data
  • Insert data where required (like to mark a Locally Invalidated block so we don't retry it)
  • Initiate resync

Likely most of this code could be a single shared implementation, meaning we can keep the complexity very low and the behavior easy to reason about.
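
To sketch what that could look like (event fields and interface methods are made up for illustration, not a concrete API proposal):

```go
// Sketch only: event fields and interface methods are illustrative, not a
// concrete API proposal.
package reorg

// InvalidationEvent can be emitted by any of the cases above (unsafe
// conflict, failed cross-safety promotion, L1 reorg, desync) once the
// incorrect data has been identified.
type InvalidationEvent struct {
	ChainID     uint64
	FromHeight  uint64 // first invalid L2 height on that chain
	DepositOnly bool   // whether the block must be replaced deposit-only
}

type pruner interface {
	PruneFrom(chainID, height uint64) error       // destroy incorrect data
	MarkInvalidated(chainID, height uint64) error // remember it, don't retry
}

type resyncer interface {
	Resync(chainID, fromHeight uint64) error // drive node + Supervisor back in sync
}

// Invalidator centralizes the destroy-and-resync workflow shared by every
// reorg source, so each case only has to describe what is invalid.
type Invalidator struct {
	db   pruner
	sync resyncer
}

func (inv *Invalidator) Handle(ev InvalidationEvent) error {
	// Destroy (prune) the incorrect data.
	if err := inv.db.PruneFrom(ev.ChainID, ev.FromHeight); err != nil {
		return err
	}
	// Insert data where required, e.g. mark a locally-invalidated block so
	// derivation does not re-consume it.
	if ev.DepositOnly {
		if err := inv.db.MarkInvalidated(ev.ChainID, ev.FromHeight); err != nil {
			return err
		}
	}
	// Initiate resync to canonical data.
	return inv.sync.Resync(ev.ChainID, ev.FromHeight)
}
```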

axelKingsley added this to the Interop: Stable Devnet milestone on Jan 17, 2025