From 58a7493f6d00ca9e0b7bfba4e267e26c4eeb2382 Mon Sep 17 00:00:00 2001
From: Daniel Olshansky
Date: Thu, 12 Dec 2024 09:55:29 -0800
Subject: [PATCH] WIP review on docusaurus/docs/protocol/upgrades/contigency_plans.md

---
 .../protocol/upgrades/contigency_plans.md     | 55 +++++++++++--------
 1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/docusaurus/docs/protocol/upgrades/contigency_plans.md b/docusaurus/docs/protocol/upgrades/contigency_plans.md
index 16c88f26a..b23c00148 100644
--- a/docusaurus/docs/protocol/upgrades/contigency_plans.md
+++ b/docusaurus/docs/protocol/upgrades/contigency_plans.md
@@ -18,9 +18,9 @@ There's always a chance the upgrade will fail.
 This document is intended to help you recover without significant downtime.

 - [Option 0: The bug is discovered before the upgrade height is reached](#option-0-the-bug-is-discovered-before-the-upgrade-height-is-reached)
-- [Option 1: The upgrade height is reached and the migration didn't start (halted)](#option-1-the-upgrade-height-is-reached-and-the-migration-didnt-start-halted)
-- [Option 2: The migration is stuck](#option-2-the-migration-is-stuck)
-- [Option 3: The network is stuck at the future height after the upgrade](#option-3-the-network-is-stuck-at-the-future-height-after-the-upgrade)
+- [Option 1: The migration didn't start (i.e. migration halt)](#option-1-the-migration-didnt-start-ie-migration-halt)
+- [Option 2: The migration is stuck (i.e. incomplete/partial migration)](#option-2-the-migration-is-stuck-ie-incompletepartial-migration)
+- [Option 3: The migration succeed but the network is stuck (i.e. migration had a bug)](#option-3-the-migration-succeed-but-the-network-is-stuck-ie-migration-had-a-bug)
 - [Documentation and scripts to update](#documentation-and-scripts-to-update)

 ### Option 0: The bug is discovered before the upgrade height is reached
@@ -29,11 +29,13 @@ This document is intended to help you recover without significant downtime.

 See the instructions of [how to do that here](./upgrade_procedure.md#cancelling-the-upgrade-plan).

-### Option 1: The upgrade height is reached and the migration didn't start (halted)
+### Option 1: The migration didn't start (i.e. migration halt)

-This is unlikely to happen. Possible cases are if the name of the upgrade handler is
-different from the one specified in the upgrade plan, or if the binary suggested by
-the upgrade plan is wrong.
+**This is unlikely to happen.**
+
+Possible reasons for this are if the name of the upgrade handler is different
+from the one specified in the upgrade plan, or if the binary suggested by the
+upgrade plan is wrong.

 If the nodes on the network stopped at the upgrade height and the migration did not start yet (i.e. there are no logs indicating the upgrade handler and store migrations are being executed),
@@ -47,32 +49,37 @@ The upgrade needs to be fixed, and then a new plan needs to be submitted to the

 :::caution

-`--unsafe-skip-upgrade` needs to be documented in the list of upgrades and added to the scripts so the next time somebody tries to sync the network from genesis - they will automatically skip the failed upgrade. [Documentation and scripts to update](#documentation-and-scripts-to-update)
+`--unsafe-skip-upgrade` needs to be documented in the list of upgrades and added
+to the scripts so the next time somebody tries to sync the network from genesis,
+they will automatically skip the failed upgrade.
+[Documentation and scripts to update](#documentation-and-scripts-to-update)

-
+
 :::

-### Option 2: The migration is stuck
+### Option 2: The migration is stuck (i.e. incomplete/partial migration)

 If the migration is stuck, there's always a chance the upgrade handler was executed on-chain as scheduled, but the migration didn't complete.

-In such a case, we need to:
+In such a case, we need:

-- **All full nodes and validators**: Roll back validators to the backup. A snapshot is taken by `cosmovisor` automatically prior to upgrade when`UNSAFE_SKIP_BACKUP` is set to `false` (which is a default and recommended value -
-  [more information](https://docs.cosmos.network/main/build/tooling/cosmovisor#command-line-arguments-and-environment-variables)).
-- **All full nodes and validators**: skip the upgrade by adding `--unsafe-skip-upgrade=$upgradeHeightNumber`
-  argument to your `poktroll start` command. Like this:
-  ```bash
-  poktrolld start --unsafe-skip-upgrade=$upgradeHeightNumber # ... the rest of the arguments
-  ```
-- **Protocol team**: document and add `--unsafe-skip-upgrade=$upgradeHeightNumber` to the scripts (such as docker-compose and cosmovisor installer) so the next time somebody
-  tries to sync the network from genesis they will automatically skip the failed upgrade. [Documentation and scripts to update](#documentation-and-scripts-to-update)
-- **Protocol team**: Resolve the issue with an upgrade and schedule another plan.
+- **All full nodes and validators**: Roll back validators to the backup
+  - A snapshot is taken by `cosmovisor` automatically prior to upgrade when `UNSAFE_SKIP_BACKUP` is set to `false` (the default recommended value;
+    [more information](https://docs.cosmos.network/main/build/tooling/cosmovisor#command-line-arguments-and-environment-variables))
+- **All full nodes and validators**: skip the upgrade
+  - Add the `--unsafe-skip-upgrade=$upgradeHeightNumber` argument to `poktroll start` command like so:
+    ```bash
+    poktrolld start --unsafe-skip-upgrade=$upgradeHeightNumber # ... the rest of the arguments
+    ```
+- **Protocol team**: document the failed upgrade
+  - document and add `--unsafe-skip-upgrade=$upgradeHeightNumber` to the scripts (such as docker-compose and cosmovisor installer)
+  - The next time somebody tries to sync the network from genesis they will automatically skip the failed upgrade; see [documentation and scripts to update](#documentation-and-scripts-to-update)
+- **Protocol team**: Resolve the issue with an upgrade and schedule a new plan.

-
+

-### Option 3: The network is stuck at the future height after the upgrade
+### Option 3: The migration succeed but the network is stuck (i.e. migration had a bug)

 This should be treated as a consensus or non-determinism bug that is unrelated to the upgrade. See [Recovery From Chain Halt](../../develop/developer_guide/recovery_from_chain_halt.md) for more information on how to handle such issues.

@@ -81,4 +88,4 @@ This should be treated as a consensus or non-determinism bug that is unrelated t
 - The [upgrade list](./upgrade_list.md) should reflect a failed upgrade and provide a range of heights that served by each version.
 - Systemd service should include`--unsafe-skip-upgrade=$upgradeHeightNumber` argument in its start command [here](https://github.com/pokt-network/poktroll/blob/main/tools/installer/full-node.sh).
 - [Helm chart](https://github.com/pokt-network/helm-charts/blob/main/charts/poktrolld/templates/StatefulSet.yaml) (consider exposing via a `values.yaml` file)
-- [docker-compose](https://github.com/pokt-network/poktroll-docker-compose-example/tree/main/scripts) example
\ No newline at end of file
+- [docker-compose](https://github.com/pokt-network/poktroll-docker-compose-example/tree/main/scripts) example