[Docs] upgrade/chain halt recovery (#837)
## Summary

Performed the first upgrade on the Alpha TestNet. Add some documentation changes to prevent some issues in the future.

## Issue

N/A

## Type of change

Select one or more from the following:

- [ ] New feature, functionality or library
- [ ] Consensus breaking; add the `consensus-breaking` label if so. See #791 for details
- [ ] Bug fix
- [x] Code health or cleanup
- [x] Documentation
- [ ] Other (specify)

## Testing

- [x] **Documentation**: `make docusaurus_start`; only needed if you make doc changes
- [ ] **Unit Tests**: `make go_develop_and_test`
- [ ] **LocalNet E2E Tests**: `make test_e2e`
- [ ] **DevNet E2E Tests**: Add the `devnet-test-e2e` label to the PR.

## Sanity Checklist

- [ ] I have tested my changes using the available tooling
- [ ] I have commented my code
- [ ] I have performed a self-review of my own code; both comments & source code
- [ ] I create and reference any new tickets, if applicable
- [ ] I have left TODOs throughout the codebase, if applicable

---------

Co-authored-by: DK <[email protected]>
Co-authored-by: Daniel Olshansky <[email protected]>
Co-authored-by: Bryan White <[email protected]>
1 parent f17ea06, commit ea89904
Showing 10 changed files with 543 additions and 38 deletions.
docusaurus/docs/develop/developer_guide/recovery_from_chain_halt.md (196 changes: 196 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
--- | ||
sidebar_position: 7 | ||
title: Chain Halt Recovery | ||
--- | ||
|
||
## Chain Halt Recovery <!-- omit in toc --> | ||
|
||
This document describes how to recover from a chain halt. | ||
|
||
It assumes that the cause of the chain halt has been identified, and that the | ||
new release has been created and verified to function correctly. | ||
|
||
:::tip | ||
|
||
See [Chain Halt Troubleshooting](./chain_halt_troubleshooting.md) for more information on identifying the cause of a chain halt. | ||
|
||
::: | ||
|
||
- [Background](#background) | ||
- [Resolving halts during a network upgrade](#resolving-halts-during-a-network-upgrade) | ||
- [Manual binary replacement (preferred)](#manual-binary-replacement-preferred) | ||
- [Rollback, fork and upgrade](#rollback-fork-and-upgrade) | ||
- [Troubleshooting](#troubleshooting) | ||
- [Data rollback - retrieving snapshot at a specific height (step 5)](#data-rollback---retrieving-snapshot-at-a-specific-height-step-5) | ||
- [Validator Isolation - risks (step 6)](#validator-isolation---risks-step-6) | ||
|
||
## Background | ||
|
||
Pocket Network is built on top of `cosmos-sdk`, which utilizes the CometBFT consensus engine. | ||
Comet's Byzantine Fault Tolerant (BFT) consensus algorithm requires that **more than** 2/3 of Validators (by voting power) | ||
are online and voting for the same block to reach consensus. To maintain liveness | ||
and avoid a chain halt, a supermajority (> 2/3) of Validators must participate | ||
and run the same version of the software. | ||
|
||
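As a rough illustration of the threshold described above, the check below uses integer arithmetic to test whether the online voting power is strictly greater than 2/3 of the total. The function name `has_quorum` and its arguments are hypothetical and are not part of `poktrolld` or CometBFT; treat it as a sketch for reasoning about liveness, not as a monitoring tool.

```bash
# has_quorum ONLINE_POWER TOTAL_POWER
# Hypothetical helper (not part of poktrolld or CometBFT).
# Succeeds (exit status 0) only when ONLINE_POWER is strictly more than
# 2/3 of TOTAL_POWER. Cross-multiplying keeps everything in integers.
has_quorum() {
  [ "$1" -gt 0 ] && [ "$2" -gt 0 ] && [ $((3 * $1)) -gt $((2 * $2)) ]
}

# Example: 67 of 100 units of voting power is enough, 66 of 100 is not.
# if has_quorum 67 100; then echo "chain can progress"; fi
```

Note that exactly 2/3 (for example 2 of 3 equally weighted Validators) does not pass the check, which is why losing a single Validator in a small set can halt the chain.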
## Resolving halts during a network upgrade | ||
|
||
If the halt is caused by a network upgrade, the solution may be as simple as | ||
skipping the upgrade (i.e. `unsafe-skip-upgrade`) and creating a new (fixed) upgrade. | ||
|
||
Read more about [upgrade contingency plans](../../protocol/upgrades/contigency_plans.md). | ||
|
||
### Manual binary replacement (preferred) | ||
|
||
:::note | ||
|
||
This is the preferred way of resolving consensus-breaking issues. | ||
|
||
**Significant side effect**: this breaks the ability to sync from genesis **without manual intervention**. | ||
For example, when a node that is syncing from the first block reaches the consensus-breaking issue, its operator needs | ||
to manually replace the binary with the new one. There are efforts underway to mitigate this issue, including | ||
configuration for `cosmovisor` that could automate the process. | ||
|
||
<!-- TODO_MAINNET(@okdas): Add links to Cosmovisor documentation on how the new UX can be used to automate syncing from genesis without human input. --> | ||
|
||
::: | ||
|
||
Since the chain is not moving, **it is impossible** to issue an automatic upgrade with an upgrade plan. Instead, | ||
we need **social consensus** to manually replace the binary and get the chain moving. | ||
|
||
The steps to do so are: | ||
|
||
1. Prepare and verify a new binary that addresses the consensus-breaking issue. | ||
2. Reach out to the community and validators so they can upgrade the binary manually. | ||
3. Update [the documentation](../../protocol/upgrades/upgrade_list.md) to include the range of heights at which the binary needs | ||
to be replaced. | ||
|
||
:::warning | ||
|
||
TODO_MAINNET(@okdas): | ||
|
||
1. **For step 2**: Investigate if the CometBFT rounds/steps need to be aligned as in Morse chain halts. See [this ref](https://docs.cometbft.com/v1.0/spec/consensus/consensus). | ||
2. **For step 3**: Add `cosmovisor` documentation so it's configured to automatically replace the binary when syncing from genesis. | ||
|
||
::: | ||
|
||
```mermaid | ||
sequenceDiagram | ||
participant DevTeam | ||
participant Community | ||
participant Validators | ||
participant Documentation | ||
participant Network | ||
DevTeam->>DevTeam: 1. Prepare and verify new binary | ||
DevTeam->>Community: 2. Announce new binary and instructions | ||
DevTeam->>Validators: 2. Notify validators to upgrade manually | ||
Validators->>Validators: 2. Manually replace the binary | ||
Validators->>Network: 2. Restart nodes with new binary | ||
DevTeam->>Documentation: 3. Update documentation (GitHub Release and Upgrade List to include instructions) | ||
Validators-->>Network: Network resumes operation | ||
``` | ||
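For the manual replacement in step 2, a node operator might script the swap roughly as follows. This is a minimal sketch under stated assumptions: `replace_binary` is a hypothetical helper, both paths are supplied by the operator, and stopping and restarting the node (systemd, Docker, or otherwise) is left out because it depends on the deployment.

```bash
# replace_binary NEW_BINARY INSTALLED_BINARY
# Hypothetical helper, not shipped with poktrolld. Stop the node before
# calling it and start the node again afterwards.
replace_binary() (
  new_binary=$1
  installed_binary=$2
  if [ ! -f "$new_binary" ]; then
    echo "new binary not found: $new_binary" >&2
    exit 1
  fi
  # Keep a copy of the old binary so the change can be reverted.
  if [ -f "$installed_binary" ]; then
    cp -p "$installed_binary" "$installed_binary.bak" || exit 1
  fi
  # Copy next to the target first, then rename, so the final swap is one step.
  cp "$new_binary" "$installed_binary.new" || exit 1
  chmod +x "$installed_binary.new" || exit 1
  mv -f "$installed_binary.new" "$installed_binary"
)
```

Keeping the `.bak` copy is a design choice for easy reverts; operators who manage binaries through `cosmovisor` or a package manager would rely on those tools instead.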
|
||
### Rollback, fork and upgrade | ||
|
||
:::info | ||
|
||
These instructions are only relevant to Pocket Network's Shannon release. | ||
|
||
We do not currently use `x/gov` or on-chain voting for upgrades. | ||
Instead, all participants in our DAO vote on upgrades off-chain, and the Foundation | ||
executes transactions on their behalf. | ||
|
||
::: | ||
|
||
:::warning | ||
|
||
This approach should be avoided until it has been tested further. In our tests, full nodes kept | ||
propagating the existing blocks signed by the Validators, making it hard to roll back. | ||
|
||
::: | ||
|
||
**Performing a rollback is analogous to forking the network at the older height.** | ||
|
||
If a rollback is nevertheless necessary, follow these steps: | ||
|
||
1. Prepare & verify a new binary that addresses the consensus-breaking issue. | ||
2. [Create a release](../../protocol/upgrades/release_process.md). | ||
3. [Prepare an upgrade transaction](../../protocol/upgrades/upgrade_procedure.md#writing-an-upgrade-transaction) to the new version. | ||
4. Roll back the `validator set` to **3 blocks** prior to the height of the chain halt. For example: | ||
- Assume an issue at height `103`. | ||
- Revert the `validator set` to height `100`. | ||
- Submit an upgrade transaction at `101`. | ||
- Upgrade the chain at height `102`. | ||
- Avoid the issue at height `103`. | ||
5. Ensure all validators have rolled back to the same height and use the same snapshot ([how to get a snapshot](#data-rollback---retrieving-snapshot-at-a-specific-height-step-5)) | ||
- The snapshot should be imported into each Validator's data directory. | ||
- This is necessary to ensure data continuity and prevent forks. | ||
6. Isolate the `validator set` from full nodes ([why this is necessary](#validator-isolation---risks-step-6)). | ||
- This is necessary to prevent full nodes from gossiping blocks that have been rolled back. | ||
- This may require using a firewall or a private network. | ||
- Validators should only be permitted to gossip blocks amongst themselves. | ||
7. Start the `validator set` and perform the upgrade. For example, reiterating the process above: | ||
- Start all Validators at height `100`. | ||
- On block `101`, submit the `MsgSoftwareUpgrade` transaction with a `Plan.height` set to `102`. | ||
- `x/upgrade` will perform the upgrade in the `EndBlocker` of block `102`. | ||
- The node will stop advancing and report an error while it waits for the upgrade to be performed. | ||
- Cosmovisor deployments automatically replace the binary. | ||
- Manual deployments will require a manual replacement at this point. | ||
- Start the node back up. | ||
8. Wait for the network to reach the height of the previous ledger (`104`+). | ||
9. Allow validators to open their network to full nodes again. | ||
- **Note**: full nodes will need to perform the rollback or use a snapshot as well. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
participant DevTeam | ||
participant Foundation | ||
participant Validators | ||
participant FullNodes | ||
%% participant Network | ||
DevTeam->>DevTeam: 1. Prepare & verify new binary | ||
DevTeam->>DevTeam: 2 & 3. Create a release & prepare upgrade transaction | ||
Validators->>Validators: 4 & 5. Roll back to height before issue or import snapshot | ||
Validators->>Validators: 6. Isolate from Full Nodes | ||
Foundation->>Validators: 7. Distribute upgrade transaction | ||
Validators->>Validators: 7. Start network and perform upgrade | ||
break | ||
Validators->>Validators: 8. Wait until previously problematic height elapses | ||
end | ||
Validators-->FullNodes: 9. Open network connections | ||
FullNodes-->>Validators: 9. Sync with updated network | ||
note over Validators,FullNodes: Network resumes operation | ||
``` | ||
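The height arithmetic in the worked example (steps 4 and 7) can be sketched as a small helper. `plan_rollback` and its output keys are hypothetical names chosen for illustration; the offsets of 3, 2 and 1 blocks come directly from the example above, where a halt at `103` leads to a rollback to `100`, an upgrade transaction at `101`, and an upgrade at `102`.

```bash
# plan_rollback HALT_HEIGHT
# Hypothetical helper: prints the heights used in the worked example.
plan_rollback() (
  halt_height=$1
  # Reject empty or non-numeric input.
  case $halt_height in
    '' | *[!0-9]*)
      echo "halt height must be a positive integer" >&2
      exit 1
      ;;
  esac
  # The rollback height must stay at 1 or above.
  if [ "$halt_height" -le 3 ]; then
    echo "halt height must be greater than 3" >&2
    exit 1
  fi
  echo "rollback_height=$((halt_height - 3))"   # step 4: revert the validator set
  echo "upgrade_tx_height=$((halt_height - 2))" # step 7: submit MsgSoftwareUpgrade
  echo "plan_height=$((halt_height - 1))"       # step 7: Plan.height of the upgrade
)
```

Running `plan_rollback 103` prints `rollback_height=100`, `upgrade_tx_height=101` and `plan_height=102`, one per line.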
|
||
### Troubleshooting | ||
|
||
#### Data rollback - retrieving snapshot at a specific height (step 5) | ||
|
||
There are two ways to get a snapshot from a prior height: | ||
|
||
1. Execute | ||
|
||
```bash | ||
poktrolld rollback --hard | ||
``` | ||
|
||
repeatedly, until the command reports the desired block number. | ||
|
||
2. Use a snapshot taken below the halt height (e.g. `100`) and start the node with the `--halt-height=100` parameter so that it syncs only up to that height and then | ||
shuts down gracefully. Add this argument to `poktrolld start` like this: | ||
|
||
```bash | ||
poktrolld start --halt-height=100 | ||
``` | ||
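The first option (repeating the rollback until the desired height is reported) could be automated along these lines. This is a sketch with loud assumptions: `rollback_to` and `ROLLBACK_CMD` are hypothetical names, and the parsing assumes the command prints a line containing `height <number>`. Check the actual output of `poktrolld rollback --hard` on your version before relying on it.

```bash
# Command used for one rollback step. Overridable so the loop can be tested
# against a stub instead of a real node.
ROLLBACK_CMD="${ROLLBACK_CMD:-poktrolld rollback --hard}"

# rollback_to TARGET_HEIGHT [MAX_ATTEMPTS]
# Hypothetical helper: repeats the rollback until TARGET_HEIGHT is reported.
rollback_to() (
  target=$1
  max_attempts=${2:-50}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    output=$($ROLLBACK_CMD 2>&1) || { echo "rollback failed: $output" >&2; exit 1; }
    # ASSUMPTION: the output contains "height <number>"; adjust if it differs.
    height=$(printf '%s\n' "$output" | sed -n 's/.*height \([0-9][0-9]*\).*/\1/p' | head -n 1)
    if [ -z "$height" ]; then
      echo "could not parse a height from: $output" >&2
      exit 1
    fi
    if [ "$height" -eq "$target" ]; then
      echo "$height"
      exit 0
    fi
    if [ "$height" -lt "$target" ]; then
      echo "node is already below the target height ($height < $target)" >&2
      exit 1
    fi
    attempt=$((attempt + 1))
  done
  echo "gave up after $max_attempts attempts" >&2
  exit 1
)
```

The attempt limit is a safety net so that a parsing mistake cannot roll the node back indefinitely.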
|
||
#### Validator Isolation - risks (step 6) | ||
|
||
Even a single node with knowledge of the forked ledger can jeopardize the whole process. In particular, the | ||
following log errors are signs that nodes are syncing blocks from the wrong fork: | ||
|
||
- `found conflicting vote from ourselves; did you unsafe_reset a validator?` | ||
- `conflicting votes from validator` |
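To watch for these messages, a log file can be scanned with a fixed-string search. The helper name `scan_for_fork_signs` and the idea of a single log file are assumptions for illustration; adapt the input to wherever your node writes its logs (for example, piping from `journalctl` or `docker logs`).

```bash
# scan_for_fork_signs LOG_FILE
# Hypothetical helper: prints every line that contains one of the two error
# messages listed above. Succeeds (exit status 0) only if a match is found.
scan_for_fork_signs() {
  grep -F \
    -e 'found conflicting vote from ourselves' \
    -e 'conflicting votes from validator' \
    "$1"
}
```

A non-empty result during step 7 suggests that at least one Validator is still receiving blocks from the old fork and that the isolation in step 6 should be rechecked.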
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
--- | ||
title: Failed upgrade contingency plan | ||
sidebar_position: 5 | ||
--- | ||
|
||
:::tip | ||
|
||
This documentation covers failed upgrade contingency for `poktroll` - a `cosmos-sdk` based chain. | ||
|
||
While this can be helpful for other blockchain networks, it is not guaranteed to work for other chains. | ||
|
||
::: | ||
|
||
## Contingency plans <!-- omit in toc --> | ||
|
||
There's always a chance the upgrade will fail. | ||
|
||
This document is intended to help you recover without significant downtime. | ||
|
||
- [Option 0: The bug is discovered before the upgrade height is reached](#option-0-the-bug-is-discovered-before-the-upgrade-height-is-reached) | ||
- [Option 1: The migration didn't start (i.e. migration halt)](#option-1-the-migration-didnt-start-ie-migration-halt) | ||
- [Option 2: The migration is stuck (i.e. incomplete/partial migration)](#option-2-the-migration-is-stuck-ie-incompletepartial-migration) | ||
- [Option 3: The migration succeed but the network is stuck (i.e. migration had a bug)](#option-3-the-migration-succeed-but-the-network-is-stuck-ie-migration-had-a-bug) | ||
- [MANDATORY Checklist of Documentation \& Scripts to Update](#mandatory-checklist-of-documentation--scripts-to-update) | ||
|
||
### Option 0: The bug is discovered before the upgrade height is reached | ||
|
||
**Cancel the upgrade plan!** | ||
|
||
See the instructions on [how to do that here](./upgrade_procedure.md#cancelling-the-upgrade-plan). | ||
|
||
### Option 1: The migration didn't start (i.e. migration halt) | ||
|
||
**This is unlikely to happen.** | ||
|
||
Possible reasons for this are if the name of the upgrade handler is different | ||
from the one specified in the upgrade plan, or if the binary suggested by the | ||
upgrade plan is wrong. | ||
|
||
If the nodes on the network stopped at the upgrade height and the migration has not | ||
started yet (i.e. there are no logs indicating that the upgrade handler and store migrations are being executed), | ||
we **MUST** gather social consensus to restart validators with the `--unsafe-skip-upgrade=$upgradeHeightNumber` flag. | ||
|
||
This will skip the upgrade process, allowing the chain to continue and the protocol team to plan another release. | ||
|
||
`--unsafe-skip-upgrade` simply skips the upgrade handler and store migrations. | ||
The chain continues as if the upgrade plan was never set. | ||
The upgrade needs to be fixed, and then a new plan needs to be submitted to the network. | ||
|
||
:::caution | ||
|
||
`--unsafe-skip-upgrade` needs to be documented in the list of upgrades and added | ||
to the scripts so that the next time somebody syncs the network from genesis, | ||
they automatically skip the failed upgrade. | ||
See [documentation and scripts to update](#mandatory-checklist-of-documentation--scripts-to-update). | ||
|
||
<!-- TODO_MAINNET(@okdas): new cosmovisor UX can simplify this --> | ||
|
||
::: | ||
|
||
### Option 2: The migration is stuck (i.e. incomplete/partial migration) | ||
|
||
If the migration is stuck, it is possible that the upgrade handler was executed on-chain as scheduled but the migration did not complete. | ||
|
||
In such a case, we need: | ||
|
||
- **All full nodes and validators**: Roll back to the backup | ||
|
||
- A snapshot is taken by `cosmovisor` automatically prior to upgrade when `UNSAFE_SKIP_BACKUP` is set to `false` (the default recommended value; | ||
[more information](https://docs.cosmos.network/main/build/tooling/cosmovisor#command-line-arguments-and-environment-variables)) | ||
|
||
- **All full nodes and validators**: skip the upgrade | ||
|
||
- Add the `--unsafe-skip-upgrade=$upgradeHeightNumber` argument to the `poktrolld start` command like so: | ||
|
||
```bash | ||
poktrolld start --unsafe-skip-upgrade=$upgradeHeightNumber # ... the rest of the arguments | ||
``` | ||
|
||
- **Protocol team**: Resolve the issue with an upgrade and schedule a new plan. | ||
|
||
- The upgrade needs to be fixed, and then a new plan needs to be submitted to the network. | ||
|
||
- **Protocol team**: document the failed upgrade | ||
|
||
- Document and add `--unsafe-skip-upgrade=$upgradeHeightNumber` to the scripts (such as docker-compose and the cosmovisor installer) | ||
- The next time somebody syncs the network from genesis, they will automatically skip the failed upgrade; see [documentation and scripts to update](#mandatory-checklist-of-documentation--scripts-to-update) | ||
|
||
<!-- TODO_MAINNET(@okdas): new cosmovisor UX can simplify this --> | ||
|
||
### Option 3: The migration succeed but the network is stuck (i.e. migration had a bug) | ||
|
||
This should be treated as a consensus or non-determinism bug that is unrelated to the upgrade. See [Recovery From Chain Halt](../../develop/developer_guide/recovery_from_chain_halt.md) for more information on how to handle such issues. | ||
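The four options above can be summarized as a small decision helper. This is only a sketch of the decision order described in this document: `pick_contingency` and its yes/no arguments are hypothetical, and it assumes a problem with the upgrade has already been identified.

```bash
# pick_contingency HEIGHT_REACHED MIGRATION_STARTED MIGRATION_COMPLETED
# Hypothetical helper. Each argument is "yes" or "no". Prints the number of
# the option in this document that applies.
pick_contingency() {
  if [ "$1" != "yes" ]; then
    echo "0" # bug found before the upgrade height: cancel the upgrade plan
  elif [ "$2" != "yes" ]; then
    echo "1" # migration did not start: skip the upgrade
  elif [ "$3" != "yes" ]; then
    echo "2" # migration is stuck: roll back to the backup, then skip the upgrade
  else
    echo "3" # migration succeeded but the network is stuck: chain halt recovery
  fi
}
```

For example, `pick_contingency yes yes no` prints `2`, which corresponds to the incomplete migration case.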
|
||
### MANDATORY Checklist of Documentation & Scripts to Update | ||
|
||
- [ ] The [upgrade list](./upgrade_list.md) should reflect a failed upgrade and provide the range of heights that were served by each version. | ||
- [ ] Systemd service should include the `--unsafe-skip-upgrade=$upgradeHeightNumber` argument in its start command [here](https://github.com/pokt-network/poktroll/blob/main/tools/installer/full-node.sh). | ||
- [ ] The [Helm chart](https://github.com/pokt-network/helm-charts/blob/main/charts/poktrolld/templates/StatefulSet.yaml) should point to the latest version; consider exposing via a `values.yaml` file | ||
- [ ] The [docker-compose](https://github.com/pokt-network/poktroll-docker-compose-example/tree/main/scripts) examples should point to the latest version |