Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

document a pattern for making Sled-Agent-owned config to be Nexus-owned instead #7279

Merged
merged 5 commits into from
Dec 21, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/control-plane-architecture.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -236,6 +236,8 @@ It is also important to note why the previous strategy works well and is largely

Now we get to what has been the hairiest of the problems for data compatibility across versions. As we add more features, and make our system more consistent in its promise that Nexus manages state for the control plane instead of sled-agent, we have realized that Nexus sometimes doesn't have enough information to take over this responsibility. In such cases when performing https://github.com/oxidecomputer/omicron/blob/5b865b74208ce0a11b8aec1bca12e2a6ea538bb6/sled-agent/src/sim/server.rs#L254[RSS handoff to Nexus], we have had to add new state to the handoff message so that Nexus can create a blueprint to drive the rest of the system to its desired state via https://github.com/oxidecomputer/omicron/blob/main/docs/reconfigurator.adoc[Reconfigurator]. However, this only works for new rack deployments when we actually run RSS. For existing deployments that have already gone through initial rack setup, the new Nexus code does not have enough information to proceed with running reconfigurator. In this case we must **backfill** that information. This can, and has, been done a variety of ways. We sometimes may have to add new data to CRDB, and sometimes modify a schema and backfill columns. Othertimes, we may need to retrieve important data from sled-agent and store it in existing placeholders in blueprints. In any event, doing this is tricky and influences how legible the code is to read, how testable it is, and how correct it is under all circumstances. It's for this reason that we proposed the rules for data compatibility in the prior section, which largely align with how we do ledger updates.

For more on this, see the xref:reconfigurator.adoc#_incorporating_existing_configuration_into_reconfigurator[Reconfigurator docs on this subject].

== Cold start

"Cold start" refers to starting the control plane from a rack that's completely powered off. Achieving this requires careful consideration of where configuration is stored and how configuration changes flow through the system.
Expand Down
160 changes: 160 additions & 0 deletions docs/reconfigurator.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -179,6 +179,166 @@ We're being cautious about rolling out that kind of automation. Instead, today,

To get to the long term vision where the system is doing all this on its own in response to operator input, we'll need to get confidence that continually executing the planner will have no ill effects on working systems. This might involve more operational experience with it, more safeties, and tools for pausing execution, previewing what it _would_ do, etc.

== Design patterns

=== Incorporating existing configuration into Reconfigurator

Something we've done several times now is taking some existing piece of configuration that was managed outside the control plane (i.e., not known to Nexus or CockroachDB) and brought it under the ownership of the control plane. Examples:

* Control plane zones: when we initially built the system, RSS deployed control plane zones to sleds and Nexus/CockroachDB was largely unaware of them. Nexus/CockroachDB did know about them, but did not have enough information to reconstruct the configuration on each sled. But of course the control plane _needs_ to be able to manage these components for upgrade, fault management, scale-out, etc. Migrating to a system that can do these things required that the control plane learn what zones were deployed already, where, and with what configuration.
* ZFS datasets: in releases prior to R12, Sled Agent automatically created ZFS datasets for zones that were requested by the control plane. Concretely, `PUT /omicron-zones` would create ZFS datasets for zones that needed a persistent dataset. Work is ongoing to make this more explicit so that the control plane manages datasets separately and then specifies with each zone which dataset it should use. Migrating to this approach on deployed systems requires that the control plane learn what datasets exist in the first place and what zones they're associated with.
* In the medium term, for online upgrade, we will be incorporating an image id or artifact id into the Omicron zone configuration. Currently, the artifact id is implied: sled agent uses whatever artifact was delivered by the last MUPdate. For online upgrade, the control plane will need to be able to specify a particular artifact.

In all of these cases:

* There's a piece of configuration managed outside of CockroachDB/Nexus (e.g., by Sled Agent, RSS, and/or MUPdate).
* We want to transition to a world where the configuration is owned by CockroachDB/Nexus.
* We need to bootstrap the initial control-plane-managed copy based on what's currently deployed.

The general pattern here is:

* Use the inventory system to collect the current configuration.
* Incorporate that configuration into the next blueprint generated by Reconfigurator.
* Incorporate the configuration from the blueprint into the next request to Sled Agent.

```mermaid
sequenceDiagram
participant SledAgent as Sled Agent
participant Nexus

Note over SledAgent: owns some piece<br/> of configuration
SledAgent ->> Nexus: reports current configuration<br/>via inventory
Note over Nexus: incorporates latest inventory<br/>into next blueprint
Note over Nexus: now owns the configuration

loop
Note over Nexus: changes configuration as needed
Nexus ->> SledAgent: sends new configuration<br/>
end
```

Below is a proposed pattern for doing this over two releases. We'll call the new piece of config `my_config`, represented with type `MyConfig`. This could be arbitrarily complicated. In our examples above, this could be a list of zones and their detailed configurations (IP addresses, ports, and all the properties they need to start), a list of ZFS dataset structs, a `dataset_id` property on an existing struct, an `artifact_id` property on an existing struct, etc. It may hang directly off the Sled Agent `Inventory` or it might be embedded in some existing struct.

NOTE: This is a work in progress. We hope this closely enough matches what we've done in the past that it should work. We should update this documentation as we discover better ways to do this.

CAUTION: This isn't the _only_ way to do it. However, many other ways to do it come with non-obvious problems. When we diverge from this, we should first try to understand why this procedure looks the way it does.

**In the first release** (we'll call it "release 1"): the configuration is totally managed in Sled Agent and unknown to Nexus and CockroachDB.

**In the next release** (we'll call it "release 2"):

. Add `my_config: MyConfig` to appropriate spot in Sled Agent inventory. Note that it is _not_ an optional field. In this use case, Sled Agent always knows the current value of this field.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in a case like adding the artifact/image ID, we're also adding the requirement that sled agent needs to learn about this value explicitly, right?

As an example -- "release 1" of the sled agent might launch instances using /opt/oxide/propolis-server.tar.gz -- and use whatever file that is, regardless of "version".

Just wanted to call that out, as "part of release 2 may involve making sled agent explicitly aware of what state it happens to be managing", since we've declared it should be non-optional in inventory.

(also, to be clear, it's non-optional in the sled agent API call to Nexus, not the inventory database structure, right? Otherwise, old inventory collections would be invalidated)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in a case like adding the artifact/image ID, we're also adding the requirement that sled agent needs to learn about this value explicitly, right?

As an example -- "release 1" of the sled agent might launch instances using /opt/oxide/propolis-server.tar.gz -- and use whatever file that is, regardless of "version".

Just wanted to call that out, as "part of release 2 may involve making sled agent explicitly aware of what state it happens to be managing", since we've declared it should be non-optional in inventory.

The code needs to be aware that this value, which may previously have been hardcoded or even absent altogether, is now a variable to be reported in inventory (this item) and controlled by Nexus (item 4). Is that what you mean by "learn about this value explicitly"?

(also, to be clear, it's non-optional in the sled agent API call to Nexus, not the inventory database structure, right? Otherwise, old inventory collections would be invalidated)

Good point! I will update this to be more clear that (1) the inventory API structure can make this required but the inventory database structures should have this value optional, and (2) in release 3, this can probably be made required in the inventory database structures, too. (This is annoyingly tricky: we might still have old collections at this point but it seems unlikely.)

Copy link
Collaborator

@smklein smklein Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code needs to be aware that this value, which may previously have been hardcoded or even absent altogether, is now a variable to be reported in inventory (this item) and controlled by Nexus (item 4). Is that what you mean by "learn about this value explicitly"?

Yeah, basically just calling this out -- the phrasing "In this use case, Sled Agent always knows the current value of this field" is true, but if it was an "implicit" value before, Sled Agent might need to do some legwork to get an explicit value that can be reported.

Re: Inventory stuff

Yeah, totally - another spot where it might be useful to somehow provide a barrier. If we could ensure that all inventory collections are "from this release or the one prior, and no older", that could be a useful way to gradually make these values non-optional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is annoyingly tricky: we might still have old collections at this point but it seems unlikely.)

Is this almost identical (or even exactly identical) to "changing an optional field to required in a blueprint requires dropping old blueprints", with the exception that inventory already rotates old collections out with relatively high frequency except in the case of errors (where we keep around the last good collection, which might be months old on a system like dogfood)? Do you think we #7278 should explicitly include inventory collections too? (It already implicitly covers them to some extent, since saving the reconfigurator state already includes them, but doesn't mention deleting old ones if they're still around.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is annoyingly tricky: we might still have old collections at this point but it seems unlikely.)

Is this almost identical (or even exactly identical) to "changing an optional field to required in a blueprint requires dropping old blueprints", with the exception that inventory already rotates old collections out with relatively high frequency except in the case of errors (where we keep around the last good collection, which might be months old on a system like dogfood)? Do you think we #7278 should explicitly include inventory collections too? (It already implicitly covers them to some extent, since saving the reconfigurator state already includes them, but doesn't mention deleting old ones if they're still around.)

Yes, it's analogous. Sure -- I'll update that issue.

. Add `my_config: Option<MyConfig>` to the blueprint structures (both in-memory and in the database). This field has to be optional so that when updating to this release, the system can still read the current target blueprint (that was written in the previous release that didn't have this field).
. In the Reconfigurator planner, when generating a blueprint based on a parent blueprint where `my_config` is `None`, fill in `my_config` (using a `Some` value) based on the contents in inventory.
. Add `my_config` to the Sled Agent request that will be used by Reconfigurator to _configure_ this on each sled.
** If a request already exists (e.g., if this will be part of `OmicronZoneConfig` that already gets sent by Reconfigurator to Sled Agent, as in the case of ZFS dataset id or artifact id): the new field should be optional: `my_config: Option<MyConfig>`. This is required for the system to be able to execute the last target blueprint that was written in the _previous_ release. This is typically also necessary because it's usually the same struct that Sled Agent records persistently. See the next item.
** If no request already exists for this purpose, then you'll be adding a whole new one (e.g., when we added a new `PUT /datasets`). The body of this request will generally be type `MyConfig` (_not_ optional). During execution, Reconfigurator can avoid making this request altogether if the blueprint does not specify it.
. Add `my_config` to the Sled Agent ledger that will store this information persistently. _This will almost always be the same as the previous step_. The structure that Sled Agent stores is generally the same one it accepts from Nexus.
+
This explains another reason why `my_config` should be optional in this structure: Sled Agent _must_ be able to read ledgers written by a previous release and those won't have this field.

**During the upgrade to this release:**

. Wait for at least one inventory cycle to complete successfully and verify that it contains the expected `my_config` field.
. Generate a new blueprint, make it the current target, and ensure that it executes successfully. It should make no actual changes to the system, but it will propagate the current values for `my_config` to the blueprint system and to sled ledgers.
Comment on lines +244 to +245
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree checking for the blueprint + ledgers is probably an ad-hoc operation that we need a human to do, but getting this standard process of:

  • Collect inventory
  • Make a new blueprint, check the diff, make sure it looks good
  • Set it as target, execute it

As a part of each release seems like it would be very useful for us to rely on.

(I think it would be nice to be able to assume "if you have a rack running on release N, part of that upgrade involved creating a blueprint at release N and making sure it could run")

. Verify that:
** the new blueprint has `my_config` filled in
** all Sled Agent ledgers have `my_config` filled in (value `Some`)

**In the next release** (we'll call it "release 3"): all the optional fields can be made non-optional:

* Blueprints' in-memory structure can go from `my_config: Option<MyConfig>` to `my_config: MyConfig`.
* Blueprints' database representation can go from NULL-able columns to non-NULL-able ones, though only if we can populate the value or drop it from old blueprints. More work is needed here.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another spot where "if we know we added a blueprint during upgrade", it also might give us the leniency to "delete all older blueprints as the final step of an upgrade". Or perhaps as one of the first steps of upgrading to N+1

* The Sled Agent API input types and ledgers that refer to `my_config` can go from `my_config: Option<MyConfig>` to `my_config: MyConfig`. No on-disk changes are needed for this.

**During the upgrade to the next release**: Blueprints that do not have `my_config` set will need to be deleted from the database prior to the upgrade. See https://github.com/oxidecomputer/omicron/issues/7278[omicron#7278] for more on operationalizing this.

Visually:

```mermaid
flowchart TD
subgraph R1 [Release 1]
Initial["**Config owned by Sled Agent**"]
end

subgraph R2 [Release 2]
Inventory["Sled Agent: reports current config in inventory"]
Blueprint["Reconfigurator Planner: incorporates latest inventory into blueprint"]
SledAgent["Reconfigurator Executor: sends blueprint config (unchanged) as configuration to Sled Agent"]
Handoff["**Config owned by Nexus**"]
Change21["Nexus wants to change the config"]
Change22["Reconfigurator Planner: uses new value in blueprint"]
Change23["Reconfigurator Executor: sends new value as new configuration to Sled Agent"]

Inventory --> Blueprint
Blueprint --> SledAgent
SledAgent --> Handoff
Handoff --> Change21
Change21 --> Change22
Change22 --> Change23
Change23 --> Change21
end

subgraph R3 [Release 3]
Owned["**Config owned by Nexus**"]
Cleanup["**Blueprint field, Sled Agent field are now required**"]
Change31["Nexus wants to change the config"]
Change32["Reconfigurator Planner: uses new value in blueprint"]
Change33["Reconfigurator Executor: sends new value as new configuration to Sled Agent"]
Owned --> Cleanup
Cleanup --> Change31
Change31 --> Change32
Change32 --> Change33
Change33 --> Change31
end

R1 --> R2
R2 --> R3
```

During release 1 and during release 2 _before_ Sled Agent has reported the configuration in inventory, things look like this:

```mermaid
sequenceDiagram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW since you mentioned going overboard on Mermaid: I found this diagram and the one below it very helpful. (I didn't find the ones above as helpful, personally, but I don't think they detract anything from the doc, either.)

box Nexus
participant Planner as Reconfigurator Planner
participant Executor as Reconfigurator Executor
end
participant SledAgent as Sled Agent
participant Database


loop while config is not part of inventory
Database ->> Planner: load latest inventory: config NOT present
Planner ->> Executor: generate blueprint:<br />config NOT present
Executor ->> SledAgent: write config:<br />config NOT present
Note over SledAgent: missing config<br/>treated as<br />"no change"
end
```

Shortly after the system comes up in release 2, Sled Agent starts reporting the config in inventory. After that point, things look like this:

```mermaid
sequenceDiagram
box Nexus
participant Planner as Reconfigurator Planner
participant Executor as Reconfigurator Executor
end
participant SledAgent as Sled Agent
participant Database

loop
SledAgent ->> Database: report config<br />in inventory
end

loop
Database ->> Planner: load latest inventory: config IS present
Planner ->> Executor: generate blueprint:<br />config IS present
Executor ->> SledAgent: write config:<br />config IS present
Note over SledAgent: config is present<br/>and honored
end
```

[bibliography]
== References

Expand Down
Loading