## CI Failures in AWS

**NOTE** - This is an ongoing issue. We still have a limited number of
customers whose LmP builds are done in AWS us-east-1 who may see intermittent
issues.

* **23:00 UTC (*April 5*)** - A CI run is seen failing with network connectivity issues
* **11:00 UTC (*April 6*)** - Increased number of CI failures
* **12:00 UTC** - Problems isolated to CI runs inside AWS's us-east-2 region
* **16:00 UTC** - Experiment with moving away from AWS NAT Gateway
* **17:00 UTC** - Problem persists - has nothing to do with the NAT Gateway
* **17:30 UTC** - Move to stand up container build infrastructure in us-west-1
* **20:30 UTC** - Tests in us-west-1 look promising.
* **21:00 UTC** - All container builds are moved from us-east-1 into us-west-1 and online.net
* **04:00 UTC (*April 7*)** - LmP builds are much more stable. us-east-1 is showing connection timeouts, but the following patches have dramatically reduced their frequency (an illustrative sketch of this kind of retry logic follows the timeline):
  * https://github.com/foundriesio/jobserv/commit/b0327a09c4c0271412c93ac5c0af258fe1755411
  * https://github.com/foundriesio/jobserv/commit/b4272c43effb0dc09e003efa3ff54e583cfa76e9

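The linked jobserv commits are the authoritative fix. Purely to illustrate the general shape of this kind of mitigation (retrying backend requests whose TCP connects time out, with backoff), here is a minimal Python sketch; the helper name, endpoint URL, and retry counts are assumptions for the example, not code taken from jobserv.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(connect_retries: int = 5, backoff: float = 0.5) -> requests.Session:
    """Build a requests Session that retries failed/timed-out connects with backoff."""
    retry = Retry(
        total=connect_retries,
        connect=connect_retries,           # retry connects that never complete
        read=2,
        backoff_factor=backoff,            # exponential backoff between attempts
        status_forcelist=(502, 503, 504),  # also retry common gateway errors
        allowed_methods=frozenset({"GET", "PUT", "POST", "DELETE"}),
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


if __name__ == "__main__":
    session = make_retrying_session()
    # Hypothetical backend endpoint; timeout is (connect, read) in seconds.
    resp = session.get("https://api.example.com/health", timeout=(5, 30))
    print(resp.status_code)
```
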
Throughout April 7th, our us-east-2 connection issues began to improve. By the
morning of April 8th, we had observed no CI failures caused by AWS networking
issues.

### Background

Our main service is hosted in Google Kubernetes Engine. However, we have a
multi-cloud solution. The primary reason is not that we love pain and
complexity, but that:

* We build containers natively on [ARM](https://foundries.io/insights/blog/2020/02/25/20200226-armh-containers/).
* Yocto Project builds have historically had a better price/performance value
  from places that host bare-metal servers, such as online.net.

Over the past year, we have continued to increase our use of AWS for CI builds.
Its price/performance value has improved, and it allows us to be much more
dynamic. We have reached the point where almost all new customers have their
CI jobs running in AWS's [us-east-2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) region.

Code comes into our backend (GCP); a CI job (AWS) then runs and interacts with
the backend.

### What Happened

On April 5th, things went bad. After some [strace](https://strace.io/)
debugging, we observed that a high rate of [connect](https://man7.org/linux/man-pages/man2/connect.2.html)
system calls were hanging when hitting GCP services. It took us a while to
prove this was not our own misconfiguration or mistake, but eventually we could
see that AWS us-east-2 could not make reliable TCP connections to our multiple
services inside GCP.
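
To prove that the fault was in the network rather than in our own code, the useful evidence is a probe that does nothing except attempt TCP connects and count failures, independent of our application stack. The sketch below is illustrative only; the hostnames, ports, and attempt counts are placeholders, not our actual GCP endpoints.

```python
import socket
import time
from typing import Optional

# Placeholder endpoints standing in for GCP-hosted services; the real
# hostnames are not part of this write-up.
TARGETS = [("api.example.com", 443), ("storage.example.com", 443)]


def probe(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Attempt one TCP connect; return latency in seconds, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


if __name__ == "__main__":
    attempts, failures = 200, 0
    for _ in range(attempts):
        for host, port in TARGETS:
            if probe(host, port) is None:
                failures += 1
                print(f"connect to {host}:{port} timed out or failed")
        time.sleep(1)
    print(f"{failures} failed connects out of {attempts * len(TARGETS)} attempts")
```
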
At this point, it was a scramble. Our container builds are the easiest ones
to manage, so we were able to move those into a new AWS region, us-west-1.
This fixed container builds, but our Yocto Project builds, while less
frequent, were still having issues.

Yocto builds are more complex for us to host. A Yocto build produces an entire
operating system from scratch, and without a correctly managed [cache](https://docs.yoctoproject.org/singleindex.html#shared-state-cache),
builds become painfully slow and expensive. That cache is handled via a set of
sharded NFS servers, so moving a customer to a new data center means we also
need to help set up NFS and caching.

We were able to move some workloads over to a different data center, and in
the meantime AWS gradually resolved their networking issue(s).

By yesterday afternoon, I was able to breathe again. There was still a focus
on monitoring every single CI job for anomalies, but things were looking
stable. As of now, we've gone about 24 hours without seeing a CI job inside
us-east-2 fail due to infrastructure issues.

### What we learned

* Our ability to manage container builds has really matured.
* Our ability to manage Yocto builds needs to improve.
* Multi-cloud can be as much of a liability as it can be an asset.
* Even when you are "in the cloud", you need people who understand tools
  like strace and understand systems programming concepts. Luckily, that's
  in our DNA.