ota-lite outage
Signed-off-by: Andy Doan <[email protected]>
doanac committed Oct 30, 2024
1 parent 787530b commit 56b2678
Showing 2 changed files with 50 additions and 0 deletions.
1 change: 1 addition & 0 deletions index.md
@@ -28,6 +28,7 @@ on our [Slack channel](https://join.slack.com/t/foundriesio/shared_invite/enQtNT
* [2020-02-12 CI worker network maintenance](maintenance/2020-02-13-online-net)

### Previous Outages:
* [2024-10-09 api.foundries.io outage](outage/2024-10-09-ota-lite.md)
* [2024-06-06 Foundries.io resources outage](outage/2024-06-06-api-disruption.md)
* [2023-09-28 Foundries.io resources outage](outage/2023-09-28-resources-outage.md)
* [2023-07-26 Foundries.io resources outage](outage/2023-07-26-dashboard.md)
49 changes: 49 additions & 0 deletions outage/2024-10-09-ota-lite.md
@@ -0,0 +1,49 @@
## 2024-10-09 api.foundries.io outage

We have a service, referred to as "ota-lite", that is the user-facing portion
of our device management platform.

The service was unavailable from 20:13 UTC to 20:35 UTC. During this time, actions
like publishing TUF Targets and managing devices (from user-facing APIs)
were not possible. The device-gateway remained up during this outage.

### What Happened

On the previous day, 2024-10-08, we performed a routine task in which we rotate
the secure keys and tokens used by our backend. After rotating the keys, we
restarted each container affected by the rotation to make sure it continued to
function properly. We also deleted some old service accounts and tokens that are
no longer used by our system.

The next day, while deploying a minor API change, the service failed to start.
An attempt to roll back the change failed to run as well.

The failure logs showed that we were referencing a key that didn't exist.

The missing key belonged to a minor part of the system where we keep backups of
each version of Targets.json published by our customers. That functionality is not
strictly required, so the quick temporary fix was to comment out this code and give
ourselves some breathing room.

The root cause turned out to be a combination of a naive grep expression and poor
variable naming in our rotation tooling. As a result, we effectively rotated not only
the secret's contents but also the secret's name, while the API's Kubernetes specs
still pointed at the old secret name. Because the API was not restarted during this
change, it continued to work as if nothing had changed. When the new API version was
rolled out, it failed to start due to the missing required secret.
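
As a purely illustrative sketch (the identifiers are hypothetical and this is not our actual tooling), a blanket grep-style substitution rewrites the secret's `metadata.name` along with the value it was meant to rotate, so anything still referencing the old name points at a secret that no longer exists:

```python
import re

# Hypothetical identifiers, for illustration only.
OLD = "ota-lite-signing-key-v1"
NEW = "ota-lite-signing-key-v2"

manifest = """\
apiVersion: v1
kind: Secret
metadata:
  name: ota-lite-signing-key-v1      # the secret's *name*, referenced by Deployments
stringData:
  key-id: ota-lite-signing-key-v1    # the *value* we actually wanted to rotate
"""

# Naive substitution: it matches every occurrence, so metadata.name gets
# "rotated" too and the name the Deployments expect disappears.
rotated = re.sub(re.escape(OLD), NEW, manifest)
print(rotated)
```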

### Timeline of Events

* **20:12 UTC** - We rolled out a minor change to OTA Lite.
* **20:13 UTC** - We noticed the deployment had failed.
* **20:18 UTC** - We had a first workaround ready. However, GCP Cloud Build had a large cache miss, and the build took minutes rather than seconds.
* **20:30 UTC** - We got a second workaround in place to bring the system up.
* **20:35 UTC** - The workaround was built and deployed, and things were functional again.
* **21:00 UTC** - The real fix was deployed.

### Lessons Learned

* We need to fix the GCP Cloud Build caching problem in how we build our container. In a pinch, it was our bottleneck.
* We need better container health checks. The service was passing a check that only required it to be up for about a second, but it was crashing shortly after that (see the sketch after this list).
* Kubernetes secrets, GCP service accounts, and keys need two-way links so that each can be looked up from either direction.
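
As a rough sketch of the health-check point (the endpoint, port, and secret paths below are assumptions, not our real service), a readiness check can verify that the secrets the service depends on are actually present, rather than only proving the process stayed up for a second:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths where required secrets would be mounted.
REQUIRED_SECRET_FILES = [
    "/secrets/tuf-signing-key",
    "/secrets/targets-backup-key",
]


class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        missing = [p for p in REQUIRED_SECRET_FILES if not os.path.exists(p)]
        if self.path == "/readyz" and missing:
            # Fail readiness if any required secret is absent, so a bad
            # rollout is caught before anything depends on the new pods.
            self.send_response(503)
            self.end_headers()
            self.wfile.write(("missing: " + ", ".join(missing)).encode())
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")


if __name__ == "__main__":
    HTTPServer(("", 8080), Health).serve_forever()
```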
