From 56b26786aec7eece0b162d8759dbae885e23ed48 Mon Sep 17 00:00:00 2001
From: Andy Doan
Date: Thu, 10 Oct 2024 15:33:16 -0500
Subject: [PATCH] ota-lite outage

Signed-off-by: Andy Doan
---
 index.md                      |  1 +
 outage/2024-10-09-ota-lite.md | 98 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)
 create mode 100644 outage/2024-10-09-ota-lite.md

diff --git a/index.md b/index.md
index 6ca54f7..715aa1e 100644
--- a/index.md
+++ b/index.md
@@ -28,6 +28,7 @@ on our [Slack channel](https://join.slack.com/t/foundriesio/shared_invite/enQtNT
 * [2020-02-12 CI worker network maintenance](maintenance/2020-02-13-online-net)
 
 ### Previous Outages:
+* [2024-10-09 api.foundries.io outage](outage/2024-10-09-ota-lite.md)
 * [2024-06-06 Foundries.io resources outage](outage/2024-06-06-api-disruption.md)
 * [2023-09-28 Foundries.io resources outage](outage/2023-09-28-resources-outage.md)
 * [2023-07-26 Foundries.io resources outage](outage/2023-07-26-dashboard.md)
diff --git a/outage/2024-10-09-ota-lite.md b/outage/2024-10-09-ota-lite.md
new file mode 100644
index 0000000..a863b4a
--- /dev/null
+++ b/outage/2024-10-09-ota-lite.md
@@ -0,0 +1,98 @@
+## 2024-10-09 api.foundries.io outage
+
+We have a service referred to as "ota-lite" that is the user-facing portion of our device management platform.
+
+The service was unavailable from 20:13 UTC to 20:35 UTC. During this time, actions like publishing TUF Targets and managing devices (from user-facing APIs) were not possible. The device-gateway remained up during this outage.
+
+### What Happened
+
+On the previous day, 2024-10-08, we performed a periodic task in which we rotate the secret keys and tokens used by our backend. After rotating the keys, we restarted each container affected by the rotation to make sure it continued to function properly. We also deleted some old service accounts and tokens no longer used by our system.
+
+The next day, while deploying a minor API change, the service failed to start. An attempt to roll back the change failed as well.
+
+The failure logs showed the service was referencing a key that no longer existed.
+
+The missing key belonged to a minor part of the system that keeps backups of each version of Targets.json published by our customers. It's not strictly required, so the quick temporary fix was to comment out that code to give us breathing room.
+
+The root cause turned out to be a naive grep expression combined with poor variable naming in our rotation tooling. As a result, we rotated not only the secret's content but also its name, while the API's Kubernetes specs still pointed to the old secret name. Because the API was not restarted during that change, it continued to work as if nothing had changed. When we rolled out the new API version, however, it failed to start because the required secret was missing.
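+
+As a rough illustration of the failure mode (the names below are hypothetical, not our actual manifests): a pod spec references a secret by name, and if rotation renames the secret, that reference silently dangles. Running pods keep their already-injected values, but any newly created pod fails to start.
+
+```yaml
+# Hypothetical sketch - not our actual manifests.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: ota-lite
+spec:
+  selector:
+    matchLabels:
+      app: ota-lite
+  template:
+    metadata:
+      labels:
+        app: ota-lite
+    spec:
+      containers:
+        - name: api
+          image: example/ota-lite:latest
+          env:
+            - name: TARGETS_BACKUP_KEY
+              valueFrom:
+                secretKeyRef:
+                  # If rotation renames this secret, new pods fail
+                  # to start with CreateContainerConfigError.
+                  name: targets-backup
+                  key: key.pem
+```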
+
+### Timeline of Events
+
+* **20:12 UTC** - We rolled out a minor change to OTA Lite.
+* **20:13 UTC** - We noticed the deployment had failed.
+* **20:18 UTC** - We had a first workaround ready. However, GCP Cloud Build had a huge cache miss, and the build took minutes rather than seconds.
+* **20:30 UTC** - We got a second workaround to bring the system back up.
+* **20:35 UTC** - The workaround was built and deployed. Things were functional again.
+* **21:00 UTC** - The real fix was deployed.
+
+### Lessons Learned
+
+* We need to fix the GCP Cloud Build caching problem in how we build our container. In a pinch, it was our bottleneck.
+* We need better container health checks. The service was passing a check that only required it to be up for about a second, but it was crashing shortly after that. A sketch of a stricter check follows below.
+* Kubernetes secrets, GCP service accounts, and keys need two-way links so that things can be looked up from either direction.
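+
+For the health-check lesson, one possible direction is a readiness probe combined with `minReadySeconds`, so a new pod must stay healthy for a while before a rollout is considered successful. A minimal sketch, assuming a `/healthz` endpoint on port 8080 (both hypothetical):
+
+```yaml
+# Hypothetical sketch - not our actual manifests.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: ota-lite
+spec:
+  # A new pod must stay ready for 30s before the rollout proceeds,
+  # catching services that crash shortly after starting up.
+  minReadySeconds: 30
+  selector:
+    matchLabels:
+      app: ota-lite
+  template:
+    metadata:
+      labels:
+        app: ota-lite
+    spec:
+      containers:
+        - name: api
+          image: example/ota-lite:latest
+          readinessProbe:
+            httpGet:
+              path: /healthz  # assumed health endpoint
+              port: 8080      # assumed port
+            periodSeconds: 5
+            failureThreshold: 1
+          livenessProbe:
+            httpGet:
+              path: /healthz
+              port: 8080
+            periodSeconds: 10
+```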