WIP review on docusaurus/docs/protocol/upgrades/contigency_plans.md
Olshansk committed Dec 12, 2024
1 parent 2a1b717 commit 58a7493
Showing 1 changed file with 31 additions and 24 deletions.
55 changes: 31 additions & 24 deletions docusaurus/docs/protocol/upgrades/contigency_plans.md
@@ -18,9 +18,9 @@ There's always a chance the upgrade will fail.
This document is intended to help you recover without significant downtime.

- [Option 0: The bug is discovered before the upgrade height is reached](#option-0-the-bug-is-discovered-before-the-upgrade-height-is-reached)
- [Option 1: The upgrade height is reached and the migration didn't start (halted)](#option-1-the-upgrade-height-is-reached-and-the-migration-didnt-start-halted)
- [Option 2: The migration is stuck](#option-2-the-migration-is-stuck)
- [Option 3: The network is stuck at the future height after the upgrade](#option-3-the-network-is-stuck-at-the-future-height-after-the-upgrade)
- [Option 1: The migration didn't start (i.e. migration halt)](#option-1-the-migration-didnt-start-ie-migration-halt)
- [Option 2: The migration is stuck (i.e. incomplete/partial migration)](#option-2-the-migration-is-stuck-ie-incompletepartial-migration)
- [Option 3: The migration succeed but the network is stuck (i.e. migration had a bug)](#option-3-the-migration-succeed-but-the-network-is-stuck-ie-migration-had-a-bug)
- [Documentation and scripts to update](#documentation-and-scripts-to-update)

### Option 0: The bug is discovered before the upgrade height is reached
@@ -29,11 +29,13 @@ This document is intended to help you recover without significant downtime.

See the instructions of [how to do that here](./upgrade_procedure.md#cancelling-the-upgrade-plan).

Check warning on line 30 in docusaurus/docs/protocol/upgrades/contigency_plans.md (GitHub Actions / misspell):

./docusaurus/docs/protocol/upgrades/contigency_plans.md:30:69: "cancelling" is a misspelling of "canceling"

### Option 1: The upgrade height is reached and the migration didn't start (halted)
### Option 1: The migration didn't start (i.e. migration halt)

This is unlikely to happen. Possible cases are if the name of the upgrade handler is
different from the one specified in the upgrade plan, or if the binary suggested by
the upgrade plan is wrong.
**This is unlikely to happen.**

Possible reasons for this are if the name of the upgrade handler is different
from the one specified in the upgrade plan, or if the binary suggested by the
upgrade plan is wrong.
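
The first cause above (a name mismatch) can be caught before the upgrade height with a simple comparison. The following is a minimal sketch, not part of the project's tooling: `check_upgrade_name` is a hypothetical helper, and how the two names are obtained (for example from the submitted plan and from the release's source code) is left to the caller.

```bash
# Hypothetical pre-flight check: the name in the upgrade plan must match
# the name the upgrade handler is registered under, character for character.
check_upgrade_name() {
  plan_name="$1"
  handler_name="$2"
  # Both names are required for a meaningful comparison.
  if [ -z "$plan_name" ] || [ -z "$handler_name" ]; then
    echo "error: both a plan name and a handler name are required" >&2
    return 2
  fi
  if [ "$plan_name" = "$handler_name" ]; then
    echo "ok: plan and handler are both named '$plan_name'"
  else
    echo "mismatch: plan is '$plan_name' but handler is '$handler_name'" >&2
    return 1
  fi
}
```

A non-zero exit status signals that the plan should not be submitted as-is.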

If the nodes on the network stopped at the upgrade height and the migration did not
start yet (i.e. there are no logs indicating the upgrade handler and store migrations are being executed),
@@ -47,32 +49,37 @@ The upgrade needs to be fixed, and then a new plan needs to be submitted to the

:::caution

`--unsafe-skip-upgrade` needs to be documented in the list of upgrades and added to the scripts so the next time somebody tries to sync the network from genesis - they will automatically skip the failed upgrade. [Documentation and scripts to update](#documentation-and-scripts-to-update)
`--unsafe-skip-upgrade` needs to be documented in the list of upgrades and added
to the scripts so the next time somebody tries to sync the network from genesis,
they will automatically skip the failed upgrade.
[Documentation and scripts to update](#documentation-and-scripts-to-update)

<!-- TODO_IMPROVE(@okdas): new cosmovisor UX can simplify this -->
<!-- TODO_MAINNET(@okdas): new cosmovisor UX can simplify this -->

:::

### Option 2: The migration is stuck
### Option 2: The migration is stuck (i.e. incomplete/partial migration)

If the migration is stuck, there's always a chance the upgrade handler was executed on-chain as scheduled, but the migration didn't complete.

In such a case, we need to:
In such a case, we need:

- **All full nodes and validators**: Roll back validators to the backup. A snapshot is taken by `cosmovisor` automatically prior to upgrade when`UNSAFE_SKIP_BACKUP` is set to `false` (which is a default and recommended value -
[more information](https://docs.cosmos.network/main/build/tooling/cosmovisor#command-line-arguments-and-environment-variables)).
- **All full nodes and validators**: skip the upgrade by adding `--unsafe-skip-upgrade=$upgradeHeightNumber`
argument to your `poktroll start` command. Like this:
```bash
poktrolld start --unsafe-skip-upgrade=$upgradeHeightNumber # ... the rest of the arguments
```
- **Protocol team**: document and add `--unsafe-skip-upgrade=$upgradeHeightNumber` to the scripts (such as docker-compose and cosmovisor installer) so the next time somebody
tries to sync the network from genesis they will automatically skip the failed upgrade. [Documentation and scripts to update](#documentation-and-scripts-to-update)
- **Protocol team**: Resolve the issue with an upgrade and schedule another plan.
- **All full nodes and validators**: Roll back validators to the backup
- A snapshot is taken by `cosmovisor` automatically prior to upgrade when `UNSAFE_SKIP_BACKUP` is set to `false` (the default recommended value;
[more information](https://docs.cosmos.network/main/build/tooling/cosmovisor#command-line-arguments-and-environment-variables))
- **All full nodes and validators**: skip the upgrade
- Add the `--unsafe-skip-upgrade=$upgradeHeightNumber` argument to the `poktrolld start` command like so:
```bash
poktrolld start --unsafe-skip-upgrade=$upgradeHeightNumber # ... the rest of the arguments
```
- **Protocol team**: Document the failed upgrade
- Document and add `--unsafe-skip-upgrade=$upgradeHeightNumber` to the scripts (such as docker-compose and cosmovisor installer)
- The next time somebody tries to sync the network from genesis they will automatically skip the failed upgrade; see [documentation and scripts to update](#documentation-and-scripts-to-update)
- **Protocol team**: Resolve the issue with an upgrade and schedule a new plan.
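
The skip step for node operators can be sketched as a small shell helper. This is a sketch under assumptions, not the project's tooling: `build_start_cmd` is a hypothetical name, and the flag spelling is copied verbatim from this document, so it is worth confirming against `poktrolld start --help` before relying on it. The helper only prints the command, so it can be reviewed before anything is run.

```bash
# Hypothetical helper: print the start command that skips a failed upgrade.
# Assumption: the flag name is taken as-is from this document.
build_start_cmd() {
  height="$1"
  # Reject anything that is not a plain non-negative integer.
  case "$height" in
    '' | *[!0-9]*)
      echo "error: upgrade height must be an integer, got '$height'" >&2
      return 1
      ;;
  esac
  shift
  # Append any remaining arguments unchanged after the skip flag.
  if [ "$#" -gt 0 ]; then
    printf 'poktrolld start --unsafe-skip-upgrade=%s %s\n' "$height" "$*"
  else
    printf 'poktrolld start --unsafe-skip-upgrade=%s\n' "$height"
  fi
}

# Example with a made-up height.
build_start_cmd 12345
```

Validating the height first guards against an unset `$upgradeHeightNumber` silently producing an empty flag value.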

<!-- TODO_IMPROVE(@okdas): new cosmovisor UX can simplify this -->
<!-- TODO_MAINNET(@okdas): new cosmovisor UX can simplify this -->

### Option 3: The network is stuck at the future height after the upgrade
### Option 3: The migration succeed but the network is stuck (i.e. migration had a bug)

This should be treated as a consensus or non-determinism bug that is unrelated to the upgrade. See [Recovery From Chain Halt](../../develop/developer_guide/recovery_from_chain_halt.md) for more information on how to handle such issues.

@@ -81,4 +88,4 @@ This should be treated as a consensus or non-determinism bug that is unrelated t
- The [upgrade list](./upgrade_list.md) should reflect a failed upgrade and provide the range of heights that were served by each version.
- Systemd service should include the `--unsafe-skip-upgrade=$upgradeHeightNumber` argument in its start command [here](https://github.com/pokt-network/poktroll/blob/main/tools/installer/full-node.sh).
- [Helm chart](https://github.com/pokt-network/helm-charts/blob/main/charts/poktrolld/templates/StatefulSet.yaml) (consider exposing via a `values.yaml` file)
- [docker-compose](https://github.com/pokt-network/poktroll-docker-compose-example/tree/main/scripts) example
- [docker-compose](https://github.com/pokt-network/poktroll-docker-compose-example/tree/main/scripts) example
