Finalize details of aws issue
Signed-off-by: Andy Doan <[email protected]>
doanac committed Apr 8, 2022
1 parent b4175f7 commit 7e0e03d
Showing 2 changed files with 66 additions and 5 deletions.
3 changes: 2 additions & 1 deletion index.md
@@ -5,9 +5,10 @@ they are worked on. You can also get real-time updates of service outages
on our [Slack channel](https://join.slack.com/t/foundriesio/shared_invite/enQtNTc5NDkxNTI5NTExLWQ1Yjc3NDA2MjI3NzA3YzkxYjEzNzlhZjQ0M2QxYTIzYmIzZjlmOThmZGU0NTk5MWEwZGIwMTU2YWE4N2I5NWQ).

### Planned/Ongoing Events:
no events

### Past Events:
* [2022-04-06 CI failures in AWS](outage/2022-04-06-aws.md)
* [2021-10-06 Compute cluster upgrade](maintenance/2021-10-06-infra-compute-upgrade)
* [2021-07-07 hub.foundries.io upgrade](maintenance/2021-07-07-hub-upgrade.md)
* [2021-05-19 Compute cluster upgrade](maintenance/2021-05-19-infra-compute-upgrade)
68 changes: 64 additions & 4 deletions outage/2022-04-06-aws.md
@@ -1,16 +1,76 @@
## CI Failures in AWS

**NOTE** - This is an ongoing issue. We still have a limited number of
customers whose LmP builds are done in AWS us-east-1 and who may see intermittent issues.

* **23:00 UTC (*April 5*)** - A CI run fails with network connectivity issues
* **11:00 UTC** - Increased number of CI failures
* **12:00 UTC** - Problems isolated to CI runs inside AWS's us-east-2 region
* **16:00 UTC** - Experiment with moving away from AWS NAT Gateway
* **17:00 UTC** - Problem persists - has nothing to do with NAT Gateway
* **17:30 UTC** - Move to stand up container build infrastructure in us-west-1
* **20:30 UTC** - Tests in us-west-1 look promising.
* **21:00 UTC** - All container builds are moved from us-east-1 into us-west-1 and online.net
* **04:00 UTC(*April 7*)** - LmP builds are much more stable. us-east-1 is showing connection timeouts, but these patches have dramatically reduced the frequency:
* **04:00 UTC(*April 7*)** - LmP builds are much more stable. us-east-1 is showing connection timeouts, but the following patches have dramatically reduced the frequency:
* https://github.com/foundriesio/jobserv/commit/b0327a09c4c0271412c93ac5c0af258fe1755411
* https://github.com/foundriesio/jobserv/commit/b4272c43effb0dc09e003efa3ff54e583cfa76e9
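
The linked commits themselves are not reproduced here. As an illustration only, a common way to blunt intermittent connection timeouts like these is to wrap calls to the backend in a retry-with-backoff helper; the sketch below is an assumption about that general technique, not the contents of the patches above, and the function name and timing values are hypothetical.

```python
# Hedged sketch: retry a backend request with exponential backoff so that a
# transient connect() hang fails one attempt rather than the whole CI job.
# This is illustrative only, not the actual jobserv change.
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=4, timeout=10, base_delay=2.0):
    """Return the response body, retrying transient network errors."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            if attempt == attempts:
                raise  # out of retries; let the caller see the failure
            delay = base_delay * (2 ** (attempt - 1))
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
```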

Over the course of April 7th, our us-east-2 connection issues began to
improve. By the morning of April 8th, we had observed no CI failures caused by
AWS networking issues.


### Background

Our main service is hosted in Google's Kubernetes Engine. However, we have a
multi-cloud solution. The primary reason is not that we love pain and
complexity, but that:

* We build containers natively on [ARM](https://foundries.io/insights/blog/2020/02/25/20200226-armh-containers/).
* Yocto Project builds have historically gotten better price/performance from
providers that host bare-metal servers, such as online.net.

Over the past year, we have continued to increase our use of AWS for CI builds. Its
price/performance has improved, and it also allows us to be much more
dynamic. We have reached the point where almost all new customers have their
CI jobs running in AWS's [us-east-2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) region.

Code comes into our backend in GCP; a CI job then runs in AWS and interacts
with that backend.
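
To make that cross-cloud dependency concrete, here is a minimal sketch of a worker in AWS polling a backend hosted in GCP. The host name and endpoint path are placeholders for illustration, not the real jobserv API.

```python
# Illustrative only: a CI worker in AWS asking the GCP-hosted backend for
# queued work. Every call like this crosses the AWS -> GCP network path,
# the path that later became unreliable. Host and path are placeholders.
import json
import time
import urllib.request

BACKEND = "https://backend.example.com"  # stands in for the GCP-hosted service


def poll_for_work():
    """Ask the backend whether any CI runs are queued for this worker."""
    with urllib.request.urlopen(f"{BACKEND}/runs/queued", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    while True:
        queued = poll_for_work()
        if queued:
            # claim a run, execute it, and report results back to the backend
            pass
        time.sleep(30)
```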

### What Happened

On April 5th, things went bad. After some [strace](https://strace.io/)
debugging, we observed that a high rate of [connect](https://man7.org/linux/man-pages/man2/connect.2.html)
system calls were unresponsive when hitting GCP services. It took us a while to
prove this was not our own misconfiguration or mistake, but eventually we could
see that AWS us-east-2 could not make reliable TCP connections to several of our
services inside GCP.
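
For readers who want to reproduce that kind of measurement, the sketch below times raw TCP connects from a build node to a GCP endpoint. It is a stand-in for the actual debugging (which watched the connect system calls directly with strace); the target host and timing values are illustrative assumptions.

```python
# Hedged sketch: time TCP connects to a GCP endpoint from a worker node.
# Healthy connects complete in milliseconds; during the incident many of
# them hung until the timeout expired. The target host is just an example.
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return connect latency in seconds, or None if the connect hung or failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


if __name__ == "__main__":
    for _ in range(20):
        latency = time_connect("storage.googleapis.com", 443)
        print("timed out" if latency is None else f"connected in {latency:.3f}s")
        time.sleep(1)
```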

At this point, it was a scramble. Our container builds are the easiest ones
to manage, so we were able to move those into a new AWS region, us-west-1.
This fixed container builds, but our Yocto Project builds, while less frequent, were
still having issues.

Yocto builds are more complex for us to host. A Yocto build produces an entire
operating system from scratch, and without a correctly managed
[cache](https://docs.yoctoproject.org/singleindex.html#shared-state-cache),
builds become painfully slow and expensive. That cache is handled via
a set of sharded NFS servers, so moving a customer to a new data center
also means helping set up NFS and caching there.
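
To make that dependency concrete, the sketch below shows one way a customer could be pinned to a particular NFS shared-state export so their builds stay cache-warm. The shard hostnames and hashing scheme are hypothetical illustrations, not our actual configuration.

```python
# Hypothetical illustration: deterministically map a customer to one NFS
# sstate export, which a build node would then mount and point the Yocto
# shared-state cache at. Hostnames and scheme are made up for this example.
import hashlib

SSTATE_SHARDS = [
    "nfs-sstate-01.example.net:/exports/sstate",
    "nfs-sstate-02.example.net:/exports/sstate",
    "nfs-sstate-03.example.net:/exports/sstate",
]


def sstate_shard_for(customer: str) -> str:
    """Pick the same NFS export for a customer on every build."""
    digest = hashlib.sha256(customer.encode()).hexdigest()
    return SSTATE_SHARDS[int(digest, 16) % len(SSTATE_SHARDS)]


print(sstate_shard_for("customer-a"))
```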

We were able to move some workloads over to a different data center, and
in the meantime AWS gradually resolved their networking issue(s).

By yesterday afternoon, I was able to breathe again. There was still a focus
on monitoring every single CI job for anomalies, but things were looking
stable. As of now, we've gone about 24 hours without seeing a CI job inside
us-east-2 fail due to infrastructure issues.

### What we learned

* Our ability to manage container builds has really matured.
* Our ability to manage Yocto builds needs to improve.
* Multi-cloud can be as much of a liability as it can be an asset.
* Even when you are "in the cloud", you need people who understand tools
like strace and systems programming concepts. Luckily, that's
in our DNA.
