Distributed systems are complex beasts and notoriously hard to debug. Sometimes it’s hard to understand how an outage on one service will affect another, and no matter how much we think we understand a given system, it will still surprise us in new and interesting ways. What follows is the story of one of those moments, when an outage of a provider that we don’t use, directly or indirectly, resulted in our service becoming unavailable.
It’s a quiet Friday night when, at 23:28 UTC, we start receiving the first reports of trouble. Our monitoring tells us that our website is intermittently becoming unavailable (periodic checks to both our website and our graph rendering API are timing out). As we always do during user-impacting incidents, we update our status page to ensure our users have up-to-date information on the status of our service.
As we begin our investigation, we gather everything we know about the incident so far. However, what we find only manages to confuse us even further:
- We observe an overall drop in traffic across all our ingestion protocols. This, coupled with the initial reports of unavailability on both our website and render API, makes us immediately suspect network connectivity issues.
- Checking our internal and external canaries, we notice that ingestion is affected in certain AWS regions (us-east-1 and us-west-2 in particular), but ingestion through our HTTP API is affected regardless of location.
- To further compound the issue, some of our internal (non-user-facing) services start to report timeouts.
At this point we are struggling to find a common thread to tie everything together. The data indicates that there are connectivity issues with certain AWS regions, which would seem to explain the decrease in ingested traffic (given that many of our users are hosted in AWS themselves). However, there’s still a problem – it doesn’t quite explain how all locations are having trouble sending data through our HTTP API, or how our website is still intermittently becoming unresponsive. So we’re not left with many options – given that AWS connectivity issues are the best lead we have, we decide to follow it through.
We Need to Go Deeper
After some further digging we realise that the internal services experiencing issues rely on S3, and that our integrations that depend on AWS (such as Heroku) are severely affected. This is later confirmed by the AWS status page itself, which reports connectivity issues in both the us-east-1 and us-west-2 regions. Unfortunately, this still doesn’t help us understand how an AWS outage can affect how we serve our own site, which is neither hosted in AWS nor physically located anywhere near the affected AWS regions.
Cause or symptom?
At this point our attention turns to the only part of our HTTP frontend that’s hosted in AWS: the Route53 health checks for our own (self-hosted) load balancers. To quickly detect issues with a load balancer, we rely on these health checks to identify an unhealthy load balancer and remove it from the DNS results for hostedgraphite.com, giving us a simple yet effective way to ensure we only advertise healthy load balancers for production traffic. The health check logs report failures from a variety of locations, and as it stands we’re not sure whether these failures are the cause or a symptom of our problem. We decide to temporarily disable the health checks to either confirm or rule out that theory.
Chatbot down
Our SRE team relies heavily on Glitter - our trusty SRE chatbot - to assist us with many of our operational tasks, so under normal circumstances it’d be a simple matter of disabling these health checks from our Slack channel. However, things aren’t quite as straightforward as we’d expect: the AWS outage has also impacted Slack’s IRC gateway, which our chatbot relies on. This means we have to make all the necessary changes manually...
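Doing it by hand boils down to something like the following minimal sketch, assuming boto3 and the Disabled flag on Route53’s UpdateHealthCheck API; the health check ID is made up, and this isn’t a transcript of the exact commands we ran.

```python
import boto3

route53 = boto3.client("route53")

def disable_health_checks(health_check_ids):
    """Temporarily disable a set of Route53 health checks.

    While a health check is disabled, Route53 stops probing the endpoint
    and treats it as healthy, so its DNS records stay in rotation.
    """
    for hc_id in health_check_ids:
        route53.update_health_check(HealthCheckId=hc_id, Disabled=True)

# Hypothetical ID for a health check attached to one of our load balancer records.
disable_health_checks(["11111111-2222-3333-4444-555555555555"])
```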
Unfortunately, disabling the health checks doesn’t quite fix things, so we continue digging for clues. One thing we notice is that our load balancer logs report failed SSL handshakes from IP addresses outside the normal AWS ranges. We test from several locations and confirm that the SSL handshake takes unacceptably long no matter where we try from.
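The kind of check we ran can be approximated with Python’s standard ssl and socket modules; this is just a sketch (the hostname is illustrative), and a real test repeats it from several vantage points:

```python
import socket
import ssl
import time

def handshake_times(host, port=443, timeout=10.0):
    """Return (tcp_connect_seconds, tls_handshake_seconds) for one connection."""
    ctx = ssl.create_default_context()
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        tcp_done = time.monotonic()
        # wrap_socket performs the TLS handshake immediately on an
        # already-connected socket (do_handshake_on_connect defaults to True).
        with ctx.wrap_socket(sock, server_hostname=host):
            tls_done = time.monotonic()
    return tcp_done - start, tls_done - tcp_done

# Illustrative target; a real check would be repeated from several locations.
tcp_s, tls_s = handshake_times("www.hostedgraphite.com")
print(f"TCP connect: {tcp_s:.3f}s  TLS handshake: {tls_s:.3f}s")
```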
It is at this point that, in a rather anticlimactic manner, the AWS incident gets resolved and our traffic rates start to slowly recover, further confirming that the AWS outage was the trigger for this incident. It’s now 01:46 UTC.
So things go back to normal. We now know what happened, but not why, so we dig deeper into our monitoring tools and come up with two visualizations built from our load balancer logs. These help us paint a clear picture of exactly what happened.
The first shows the number of active connections to our load balancers, which increases sharply and then flatlines for exactly the period during which we were experiencing issues.
This, in itself, explains our SSL handshake woes: the flatline suggests we hit a connection limit, and once that happens our load balancers stop accepting new connections. Any new connection is queued and may time out before it can be processed.
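The chart itself is straightforward to rebuild from connection logs. A minimal sketch, assuming a pandas DataFrame with hypothetical open/close timestamps for each connection (the real logs are far larger and messier):

```python
import pandas as pd

# Hypothetical per-connection log extract: open and close time of each connection.
log = pd.DataFrame(
    {
        "opened": pd.to_datetime(
            ["23:27:50", "23:28:05", "23:28:07", "23:28:20"], format="%H:%M:%S"
        ),
        "closed": pd.to_datetime(
            ["23:29:10", "23:33:40", "23:31:55", "23:34:02"], format="%H:%M:%S"
        ),
    }
)

# Turn each connection into a +1 event at open and a -1 event at close,
# then take a cumulative sum to get the number of active connections over time.
events = pd.concat(
    [
        pd.Series(1, index=log["opened"]),
        pd.Series(-1, index=log["closed"]),
    ]
).sort_index()
active = events.cumsum().resample("10s").max().ffill()
print(active)
```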
That goes a long way towards solving our mystery. However, we’re still not quite sure where these connections were originating from, as we know there was no increase in the number of requests received (we had actually seen a decrease, same as with the rest of our ingestion pipeline). Thankfully, our load balancer logs are detailed enough that we can build the following visualization, which helps us finally crack the case:
The visualization above is a heatmap, and (aside from looking like an enemy straight from Atari’s Adventure) it packs a lot of information into a very limited space, so let’s decipher it. The heatmap represents the average time it took for hosts from different ISPs to send a full HTTP request to our load balancers. Each row represents a different ISP, showing how long hosts on that ISP took to send a full request over time (times are in milliseconds, with warmer colours meaning longer times). In this particular case, the top row represents requests coming from AWS hosts (mainly EC2 instances). This lets us see how, during the incident, they were taking up to 5 seconds to send a full request, while other ISPs remained largely unaffected.
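Rebuilding this kind of heatmap from raw logs is mostly a group-by and a pivot. A rough sketch, assuming a hypothetical lb_requests.csv extract with timestamp, isp, and request_ms columns (not our actual log format or tooling):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical extract of load balancer logs: when each request completed,
# which ISP the client IP belongs to, and how long the full request took (ms).
log = pd.read_csv("lb_requests.csv", parse_dates=["timestamp"])

# Average request time per ISP in 5-minute buckets, pivoted into an ISP x time grid.
grid = (
    log.groupby([pd.Grouper(key="timestamp", freq="5min"), "isp"])["request_ms"]
    .mean()
    .unstack("isp")
    .T
)

fig, ax = plt.subplots(figsize=(12, 4))
mesh = ax.pcolormesh(grid.values, cmap="inferno", shading="auto")
ax.set_yticks([i + 0.5 for i in range(len(grid.index))])
ax.set_yticklabels(grid.index)
ax.set_xlabel("time (5-minute buckets)")
fig.colorbar(mesh, ax=ax, label="avg time to send a full request (ms)")
plt.show()
```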
A clear picture
Once we put all these pieces together, a clear picture emerged. AWS hosts in the affected regions were experiencing connectivity issues that significantly slowed down their connections to our load balancers (though not badly enough to break the connection outright). As a result, these hosts were hogging all the available connections until we hit a connection limit in our load balancers, leaving hosts in other locations unable to establish new connections. Confirming this was the first step towards making changes to prevent it from happening again, which in our case included enforcing stricter client timeouts and raising our connection limits. The new visualizations have also become a tool that both our SRE and Support teams now use regularly when troubleshooting our load balancing layer, and everything we learned from this incident was shared with our users as a postmortem.
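As a toy illustration of the client-timeout part of that fix: the idea is simply to refuse to let a slow sender hold a connection slot open indefinitely. The sketch below shows the principle as a minimal asyncio server; it is not our load balancer (the real change was a configuration tweak), and the 5-second budget is made up.

```python
import asyncio

# Hypothetical budget: a client gets this many seconds to send its full request.
CLIENT_REQUEST_TIMEOUT = 5.0

async def handle(reader, writer):
    try:
        # Read up to the end of the HTTP headers, but give up on clients that
        # trickle bytes in too slowly, freeing the connection slot for others.
        await asyncio.wait_for(
            reader.readuntil(b"\r\n\r\n"), timeout=CLIENT_REQUEST_TIMEOUT
        )
    except (asyncio.TimeoutError, asyncio.IncompleteReadError, asyncio.LimitOverrunError):
        writer.close()
        await writer.wait_closed()
        return
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```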
And that’s the story of how an AWS outage managed to break a service that’s not hosted in AWS (or anywhere near it!).