ota-lite outage
Signed-off-by: Andy Doan <[email protected]>
doanac committed Oct 30, 2024
1 parent 787530b commit 56b2678
Showing 2 changed files with 50 additions and 0 deletions.
1 change: 1 addition & 0 deletions index.md
@@ -28,6 +28,7 @@ on our [Slack channel](https://join.slack.com/t/foundriesio/shared_invite/enQtNT
* [2020-02-12 CI worker network maintenance](maintenance/2020-02-13-online-net)

### Previous Outages:
* [2024-10-09 api.foundries.io outage](outage/2024-10-09-ota-lite.md)
* [2024-06-06 Foundries.io resources outage](outage/2024-06-06-api-disruption.md)
* [2023-09-28 Foundries.io resources outage](outage/2023-09-28-resources-outage.md)
* [2023-07-26 Foundries.io resources outage](outage/2023-07-26-dashboard.md)
49 changes: 49 additions & 0 deletions outage/2024-10-09-ota-lite.md
@@ -0,0 +1,49 @@
## 2024-10-09 api.foundries.io outage

We have a service, referred to as "ota-lite", that is the user-facing portion
of our device management platform.

The service was unavailable from 20:13 UTC to 20:35 UTC. During this time, actions
like publishing TUF Targets and managing devices (from user-facing APIs)
were not possible. The device-gateway remained up during this outage.

### What Happened

On the previous day, 2024-10-08, we performed a routine task in which we rotate
the secure keys and tokens used by our backend. After rotating the keys, we
restarted each container affected by the rotation to make sure it continued to
function properly. We also deleted some old service accounts and tokens that are
no longer used by our system.

The next day, while deploying a minor API change, the service failed to start.
An attempt to roll back the change failed to run as well.

The failure logs showed that we were referencing a key that didn't exist.

The missing key belonged to a minor part of the system where we keep backups of
each version of Targets.json published by our customers. That functionality is not
strictly required, so the quick temporary fix was to comment out this code and give
ourselves some breathing room.

The root cause turned out to be a combination of a naive grep expression and poor
variable naming in our rotation tooling. As a result, we effectively rotated not only
the secret's contents but also the secret's name, while the API's Kubernetes specs
still pointed at the old secret name. Because the API was not restarted during this
change, it continued to work as if nothing had changed. When the new API version was
rolled out, it failed to start due to the missing required secret.
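
As a purely illustrative sketch (the identifiers are hypothetical and this is not our actual tooling), a blanket grep-style substitution rewrites the secret's `metadata.name` along with the value it was meant to rotate, so anything still referencing the old name points at a secret that no longer exists:

```python
import re

# Hypothetical identifiers, for illustration only.
OLD = "ota-lite-signing-key-v1"
NEW = "ota-lite-signing-key-v2"

manifest = """\
apiVersion: v1
kind: Secret
metadata:
  name: ota-lite-signing-key-v1      # the secret's *name*, referenced by Deployments
stringData:
  key-id: ota-lite-signing-key-v1    # the *value* we actually wanted to rotate
"""

# Naive substitution: it matches every occurrence, so metadata.name gets
# "rotated" too and the name the Deployments expect disappears.
rotated = re.sub(re.escape(OLD), NEW, manifest)
print(rotated)
```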

### Timeline of Events

* **20:12 UTC** - We rolled out a minor change to OTA Lite.
* **20:13 UTC** - We noticed the deployment had failed.
* **20:18 UTC** - We had a first workaround ready. However, GCP Cloud Build had a large cache miss, and the build took minutes rather than seconds.
* **20:30 UTC** - We got a second workaround in place to bring the system up.
* **20:35 UTC** - The workaround was built and deployed, and things were functional again.
* **21:00 UTC** - The real fix was deployed.

### Lessons Learned

* We need to fix the GCP Cloud Build caching problem in how we build our container. In a pinch, it was our bottleneck.
* We need better container health checks. The service was passing a check that only required it to be up for about a second, but it was crashing shortly after that (see the sketch after this list).
* Kubernetes secrets, GCP service accounts, and keys need two-way links so that each can be looked up from either direction.
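
As a rough sketch of the health-check point (the endpoint, port, and secret paths below are assumptions, not our real service), a readiness check can verify that the secrets the service depends on are actually present, rather than only proving the process stayed up for a second:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths where required secrets would be mounted.
REQUIRED_SECRET_FILES = [
    "/secrets/tuf-signing-key",
    "/secrets/targets-backup-key",
]


class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        missing = [p for p in REQUIRED_SECRET_FILES if not os.path.exists(p)]
        if self.path == "/readyz" and missing:
            # Fail readiness if any required secret is absent, so a bad
            # rollout is caught before anything depends on the new pods.
            self.send_response(503)
            self.end_headers()
            self.wfile.write(("missing: " + ", ".join(missing)).encode())
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")


if __name__ == "__main__":
    HTTPServer(("", 8080), Health).serve_forever()
```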
