Finalize details of aws issue
Signed-off-by: Andy Doan <[email protected]>
doanac committed Apr 8, 2022
1 parent b4175f7 commit 7e0e03d
Showing 2 changed files with 66 additions and 5 deletions.
3 changes: 2 additions & 1 deletion index.md
@@ -5,9 +5,10 @@ they are worked on. You can also get real-time updates of service outages
on our [Slack channel](https://join.slack.com/t/foundriesio/shared_invite/enQtNTc5NDkxNTI5NTExLWQ1Yjc3NDA2MjI3NzA3YzkxYjEzNzlhZjQ0M2QxYTIzYmIzZjlmOThmZGU0NTk5MWEwZGIwMTU2YWE4N2I5NWQ).

### Planned/Ongoing Events:
no events

### Past Events:
* [2022-04-06 CI failures in AWS](outage/2022-04-06-aws.md)
* [2021-10-06 Compute cluster upgrade](maintenance/2021-10-06-infra-compute-upgrade)
* [2021-07-07 hub.foundries.io upgrade](maintenance/2021-07-07-hub-upgrade.md)
* [2021-05-19 Compute cluster upgrade](maintenance/2021-05-19-infra-compute-upgrade)
68 changes: 64 additions & 4 deletions outage/2022-04-06-aws.md
@@ -1,16 +1,76 @@
## CI Failures in AWS

**NOTE** - This is an ongoing issue. We still have a limited number of
customers whose LmP builds are done in AWS us-east-1 and who may see intermittent issues.

* **23:00 UTC (*April 5*)** - A CI run fails with network connectivity issues
* **11:00 UTC** - Increased number of CI failures
* **12:00 UTC** - Problems isolated to CI runs inside AWS's us-east-2 region
* **16:00 UTC** - Experiment with moving away from AWS NAT Gateway
* **17:00 UTC** - Problem persists - has nothing to do with NAT Gateway
* **17:30 UTC** - Move to stand up container build infrastructure in us-west-1
* **20:30 UTC** - Tests in us-west-1 look promising.
* **21:00 UTC** - All container builds are moved from us-east-1 into us-west-1 and online.net
* **04:00 UTC(*April 7*)** - LmP builds are much more stable. us-east-1 is showing connection timeouts, but these patches have dramatically reduced the frequency:
* **04:00 UTC(*April 7*)** - LmP builds are much more stable. us-east-1 is showing connection timeouts, but the following patches have dramatically reduced the frequency:
* https://github.com/foundriesio/jobserv/commit/b0327a09c4c0271412c93ac5c0af258fe1755411
* https://github.com/foundriesio/jobserv/commit/b4272c43effb0dc09e003efa3ff54e583cfa76e9
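
The linked commits themselves are not reproduced here. As an illustration only, a common way to blunt intermittent connection timeouts like these is to wrap calls to the backend in a retry-with-backoff helper; the sketch below is an assumption about that general technique, not the contents of the patches above, and the function name and timing values are hypothetical.

```python
# Hedged sketch: retry a backend request with exponential backoff so that a
# transient connect() hang fails one attempt rather than the whole CI job.
# This is illustrative only, not the actual jobserv change.
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=4, timeout=10, base_delay=2.0):
    """Return the response body, retrying transient network errors."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            if attempt == attempts:
                raise  # out of retries; let the caller see the failure
            delay = base_delay * (2 ** (attempt - 1))
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
```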

Over the course of April 7th, our us-east-2 connection issues began to
improve. By the morning of April 8th, we had observed no CI failures caused by
AWS networking issues.


### Background

Our main service is hosted in Google's Kubernetes Engine. However, we have a
multi-cloud solution. The primary reason is not that we love pain and
complexity, but that:

* We build containers natively on [ARM](https://foundries.io/insights/blog/2020/02/25/20200226-armh-containers/).
* Yocto Project builds have historically gotten better price/performance from
providers that host bare-metal servers, such as online.net.

Over the past year, we have continued to increase our use of AWS for CI builds. Its
price/performance has improved, and it also allows us to be much more
dynamic. We have reached the point where almost all new customers have their
CI jobs running in AWS's [us-east-2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) region.

Code comes into our backend in GCP; a CI job then runs in AWS and interacts
with that backend.
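
To make that cross-cloud dependency concrete, here is a minimal sketch of a worker in AWS polling a backend hosted in GCP. The host name and endpoint path are placeholders for illustration, not the real jobserv API.

```python
# Illustrative only: a CI worker in AWS asking the GCP-hosted backend for
# queued work. Every call like this crosses the AWS -> GCP network path,
# the path that later became unreliable. Host and path are placeholders.
import json
import time
import urllib.request

BACKEND = "https://backend.example.com"  # stands in for the GCP-hosted service


def poll_for_work():
    """Ask the backend whether any CI runs are queued for this worker."""
    with urllib.request.urlopen(f"{BACKEND}/runs/queued", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    while True:
        queued = poll_for_work()
        if queued:
            # claim a run, execute it, and report results back to the backend
            pass
        time.sleep(30)
```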

### What Happened

On April 5th, things went bad. After some [strace](https://strace.io/)
debugging, we observed that a high rate of [connect](https://man7.org/linux/man-pages/man2/connect.2.html)
system calls were unresponsive when hitting GCP services. It took us a while to
prove this was not our own misconfiguration or mistake, but eventually we could
see that AWS us-east-2 could not make reliable TCP connections to several of our
services inside GCP.
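
For readers who want to reproduce that kind of measurement, the sketch below times raw TCP connects from a build node to a GCP endpoint. It is a stand-in for the actual debugging (which watched the connect system calls directly with strace); the target host and timing values are illustrative assumptions.

```python
# Hedged sketch: time TCP connects to a GCP endpoint from a worker node.
# Healthy connects complete in milliseconds; during the incident many of
# them hung until the timeout expired. The target host is just an example.
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return connect latency in seconds, or None if the connect hung or failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


if __name__ == "__main__":
    for _ in range(20):
        latency = time_connect("storage.googleapis.com", 443)
        print("timed out" if latency is None else f"connected in {latency:.3f}s")
        time.sleep(1)
```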

At this point, it was a scramble. Our container builds are the easiest ones
to manage, so we were able to move those into a new AWS region, us-west-1.
This fixed container builds, but our Yocto Project builds, while less frequent, were
still having issues.

Yocto builds are more complex for us to host. A Yocto build produces an entire
operating system from scratch, and without a correctly managed
[cache](https://docs.yoctoproject.org/singleindex.html#shared-state-cache),
builds become painfully slow and expensive. That cache is handled via
a set of sharded NFS servers, so moving a customer to a new data center
also means helping set up NFS and caching there.
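
To make that dependency concrete, the sketch below shows one way a customer could be pinned to a particular NFS shared-state export so their builds stay cache-warm. The shard hostnames and hashing scheme are hypothetical illustrations, not our actual configuration.

```python
# Hypothetical illustration: deterministically map a customer to one NFS
# sstate export, which a build node would then mount and point the Yocto
# shared-state cache at. Hostnames and scheme are made up for this example.
import hashlib

SSTATE_SHARDS = [
    "nfs-sstate-01.example.net:/exports/sstate",
    "nfs-sstate-02.example.net:/exports/sstate",
    "nfs-sstate-03.example.net:/exports/sstate",
]


def sstate_shard_for(customer: str) -> str:
    """Pick the same NFS export for a customer on every build."""
    digest = hashlib.sha256(customer.encode()).hexdigest()
    return SSTATE_SHARDS[int(digest, 16) % len(SSTATE_SHARDS)]


print(sstate_shard_for("customer-a"))
```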

We were able to move some workloads over to a different data center, and
in the meantime AWS gradually resolved their networking issue(s).

By yesterday afternoon, I was able to breathe again. There was still a focus
on monitoring every single CI job for anomalies, but things were looking
stable. As of now, we've gone about 24 hours without seeing a CI job inside
us-east-2 fail due to infrastructure issues.

### What we learned

* Our ability to manage container builds has really matured.
* Our ability to manage Yocto builds needs to improve.
* Multi-cloud can be as much of a liability as it can be an asset.
* Even when you are "in the cloud", you need people who understand tools
like strace and systems programming concepts. Luckily, that's
in our DNA.
