We gave a presentation a couple of weeks ago at Python Ireland's April meetup, where we described our experiences with PySyncObj, a relatively new but solid library for building fault-tolerant distributed systems in Python. Most of the services that run Hosted Graphite are built in Python, and this includes our alerting system. While that talk wasn't recorded, this blog post discusses what we did, the tech we chose, and why.
A monitoring system without alerting
We've been running Hosted Graphite as a big distributed custom time-series database since 2012, and once we mastered the monitoring side of things, the next obvious step was an integrated alerting system.
We released a beta version of the alerting feature in 2016, and we've kept it technically in "beta" for the same reasons Google kept Gmail in beta for more than five years: people are using it for real work and we're supporting it, but we know we still have work to do, so the "beta" label stays on it for now.
One of the biggest things holding us back from calling the alerting feature fully baked was its failover characteristics. Almost everything else at Hosted Graphite is a distributed, fault-tolerant system, and we knew we needed the same for the alerting system. After all, we're on-call 24/7 for it!
Keeping state
Some parts of the alerting system are easy to do failover for because they keep no state, like the streaming and polling monitors in this slide:
The trouble starts when we consider the state kept by the Alerter component. The Alerter receives a series of "OK", "OK", "not OK!" messages from the streaming and polling monitors, and needs to keep a lot of those in memory in order to decide which notifications to send, whether it needs to wait a few more minutes to satisfy a "must be alerting for more than five minutes" constraint, whether it has already sent enough notifications, and so on. Discarding the alerter state each time there is a failure would mean duplicate, delayed and missing notifications, and that's a user experience we can't tolerate.
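As a purely illustrative sketch (not our actual alerter code), the kind of state involved looks something like this: a per-alert history of check results, plus a record of when each alert started failing so the duration constraint can be enforced.

```python
import time
from collections import defaultdict, deque

# Purely illustrative, not the real alerter: track check results per alert
# and only consider notifying once it has been failing for five minutes.
ALERT_DURATION = 5 * 60                             # "alerting for more than five minutes"
first_failure = {}                                  # alert_id -> when it started failing
history = defaultdict(lambda: deque(maxlen=1000))   # alert_id -> recent results

def should_notify(alert_id, ok, now=None):
    now = now if now is not None else time.time()
    history[alert_id].append((now, ok))
    if ok:
        first_failure.pop(alert_id, None)           # healthy again, reset the clock
        return False
    first_failure.setdefault(alert_id, now)
    # A real alerter would also track which notifications it has already sent;
    # here we only check the duration constraint.
    return now - first_failure[alert_id] >= ALERT_DURATION
```

Lose those dictionaries in a restart and you get exactly the duplicate, delayed and missing notifications described above.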
Fine - just duplicate all the traffic!
We considered just duplicating all the traffic to keep multiple alerters in sync, with checkpointing to allow new nodes to catch up. Kafka looked like a pretty good option at first, but it didn't fare well in a review with the SRE team.
Here's why:
- The data volume flowing to the alerter nodes isn't huge - only a few thousand messages per second.
- We run everything on bare metal hardware for high performance and to avoid the noisy neighbour problem.
- While most of our users are served by our shared production infrastructure, we have many users whose needs demand a dedicated cluster. This isn't a problem in itself, but it does mean that we'll need to duplicate the services for each cluster, no matter how much load that customer puts on us.
- Kafka requires ZooKeeper for cluster management too, so that's yet another thing needing several machines. We already run etcd anyway, and we don't want to run two services that do similar things.
Basically, Kafka is designed for much bigger data volumes than we'd be pushing through it just to keep this one service highly available, so it didn't make a lot of sense to run it here.
Re-evaluating: what do we actually want?
We zoomed out a bit and reconsidered. What failover properties do we want the alerting system to have? Here's what we settled on:
- Nodes should exchange state among themselves.
- Nodes should detect failure of a peer.
- Nodes should figure out a new primary after the old one fails.
- Nodes should assemble themselves into a new cluster automatically when we provision them.
- ... all without waking anyone up.
PySyncObj - pure Python Raft consensus
PySyncObj is a pure Python library that solves all of these requirements except the basic service discovery one, which we figured we could handle with etcd. PySyncObj allows us to define an object that is replicated among all nodes in the cluster, and it takes care of all the details. It uses the Raft protocol to replicate the data, elect a leader, detect when nodes have failed, deal with network partitions, and so on.
Following Python's "batteries included" philosophy, PySyncObj includes a set of basic replicated data types: a counter, dictionary, list, and a set. You can also define custom replicated objects with a couple of extra lines and a decorator, which is pretty awesome.
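For a flavour of what that looks like, here's a minimal sketch of a custom replicated class, modelled on the counter example from the PySyncObj docs; the class name and the node addresses are just placeholders:

```python
from pysyncobj import SyncObj, replicated

class ReplicatedCounter(SyncObj):
    """A counter replicated across three nodes via Raft (illustrative only)."""

    def __init__(self, self_addr, partner_addrs):
        super(ReplicatedCounter, self).__init__(self_addr, partner_addrs)
        self.__value = 0

    @replicated
    def increment(self):
        # Calls to this method go through the Raft log and are applied
        # on every node in the same order.
        self.__value += 1
        return self.__value

    def get_value(self):
        # Reads are served locally, no consensus round required.
        return self.__value

counter = ReplicatedCounter('node1:4321', ['node2:4321', 'node3:4321'])
counter.increment(sync=True)   # blocks until the write is committed
```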
Here's a simple example of how easy it is to replicate a dictionary across three servers with PySyncObj:
https://gist.githubusercontent.com/HG-blog/92cd27b159b63fb750a76b178d82898f/raw/d44b561c75aa643ac588d74c94ffc0ccda630e05/gistfile1.txt
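In outline, it's just a handful of lines using the ReplDict battery. This is a sketch rather than the gist verbatim, and the host:port addresses are placeholders; each node runs the same code with its own address first:

```python
from pysyncobj import SyncObj
from pysyncobj.batteries import ReplDict

# A dictionary replicated across three nodes. Each node passes its own
# address first and its two peers second.
shared = ReplDict()
sync_obj = SyncObj('node1:4321', ['node2:4321', 'node3:4321'],
                   consumers=[shared])

# Writes go through the Raft log; sync=True blocks until the change commits.
shared.set('alerting_enabled', True, sync=True)
print(shared.get('alerting_enabled'))
```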
All the distributed Raft magic is done for you. Pretty fantastic. For more, check out the other examples on the PySyncObj GitHub page.
Monitoring the new Raft cluster
PySyncObj exports a bunch of data about the state of the Raft cluster with the getStatus() function:
https://gist.githubusercontent.com/HG-blog/f7899cf640b4a64d48f7da4004901bf2/raw/31ca4adaf8349ebb3d5e9361383182b2f17987cb/gistfile1.txt
To monitor this data while we built confidence in the new service before rolling it out, we just fired it into Hosted Graphite using the dead simple graphiteudp library:
https://gist.githubusercontent.com/HG-blog/015dcc21ffc2bd6c579f9166445dd5f6/raw/2ede63d083f9825203c911daba4d89257bfd427f/gistfile1.txt
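Put together, the reporting loop is only a few lines. Here's a rough sketch; the metric prefix, interval and node addresses are illustrative rather than our production values:

```python
import time

import graphiteudp
from pysyncobj import SyncObj

# Fire Raft cluster stats at Graphite over UDP under an illustrative prefix.
graphiteudp.init("localhost", prefix="alerting.raft")

sync_obj = SyncObj("node1:4321", ["node2:4321", "node3:4321"])

while True:
    for key, value in sync_obj.getStatus().items():
        # Only the numeric fields (log length, commit index, Raft term, ...)
        # make sense as Graphite metrics; skip things like the leader address.
        if isinstance(value, (int, float)):
            graphiteudp.send(key, value)
    time.sleep(10)
```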
The result is a pretty dashboard detailing the Raft cluster state for production and staging environments:
Chaos monkey testing
At this stage, we were pretty confident that this was the right route to take, but we wanted more. The only way to make sure your failover handling works is to continuously make it fail and see how it reacts, and that's exactly what we did. We stood up a full-size test cluster and ran a copy of all of the alerting traffic from the production environment through it. The output was discarded so customers didn't get duplicate alerts, but we were collecting all the same metrics about the output that we usually do.
In the style of the Netflix chaos monkey, we wrote a small tool to wait a random amount of time, and then restart one of the nodes without warning. We ran this for several days and it worked flawlessly - every time we lost a node, another was elected the new leader in a few seconds, the restarted node recovered the state it lost, and the overall output of the cluster (alerts checked, notifications that would have been sent, etc) was steady throughout.
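The tool itself barely deserves the name. Something along these lines captures the idea; the hostnames, timings and restart command below are placeholders for whatever your environment uses:

```python
import random
import subprocess
import time

# Hypothetical chaos script: every so often, restart one alerter node at
# random and let the Raft cluster sort itself out.
NODES = ["alerter-1", "alerter-2", "alerter-3"]   # placeholder hostnames

while True:
    time.sleep(random.uniform(600, 7200))         # wait 10 minutes to 2 hours
    victim = random.choice(NODES)
    print("Restarting %s" % victim)
    # Placeholder restart command; substitute your own service manager.
    subprocess.call(["ssh", victim, "sudo", "systemctl", "restart", "alerter"])
```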
Emergency options
Distributed systems are pretty fantastic when they work well, but it's easy to get into a state where you just want a simple way to take control of a situation. In case we ever need it, we built in a 'standalone' mode, where all the state is persisted to disk regularly. During a particularly chaotic incident, we have the option of quickly bypassing the clustered automatic failover functionality while we get the situation under control. We hope never to need it, but it's nice that it's there.
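The core of a mode like that is just periodic snapshotting of the in-memory state. A minimal sketch, with a hypothetical path and pickle as the format, might look like:

```python
import os
import pickle
import tempfile

STATE_FILE = "/var/lib/alerter/standalone_state.pkl"   # hypothetical path

def save_state(state):
    # Write to a temporary file and rename it into place, so a crash
    # mid-write can't corrupt the previous snapshot.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(STATE_FILE))
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.rename(tmp_path, STATE_FILE)

def load_state():
    # On startup in standalone mode, pick up where the last snapshot left off.
    if not os.path.exists(STATE_FILE):
        return {}
    with open(STATE_FILE, "rb") as f:
        return pickle.load(f)
```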
Current state
After a couple of weeks building confidence in the new alerter cluster, we quietly launched it to... no fanfare at all and no user interruption during the migration. In the weeks since launch, it has already faced several production incidents and fared well in all but one: a small bug that took the alerting service out of the 99.9% SLA for the month. Oops. Despite all the testing, there'll always be something to catch you out.
After more than a year of effort by our dev and SRE teams, the alerting service will soon have the 'beta' label torn off, and this automatic failover work, already in production, is a crucial piece of that.
Here's what it looks like now:
Service discovery for the alerter cluster is handled by etcd, and PySyncObj takes care of all other cluster operations. We're pretty happy with this - we can avoid waking someone on the SRE team when we lose a machine or two, resize and upgrade the cluster without maintenance periods, and scale it up and down to trade off SLA requirements against cost, all of which is pretty powerful stuff.
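As a rough sketch of how the discovery step fits together, a node could register itself in etcd and read its peers from the same prefix before joining the Raft cluster. The key layout and the etcd3 client library below are assumptions for illustration, not a description of our exact setup:

```python
import socket

import etcd3
from pysyncobj import SyncObj
from pysyncobj.batteries import ReplDict

# Hypothetical layout: each alerter registers under /alerting/nodes/<hostname>
# with its Raft address as the value.
hostname = socket.gethostname()
self_addr = "%s:4321" % hostname

client = etcd3.client(host="etcd.internal", port=2379)
client.put("/alerting/nodes/%s" % hostname, self_addr)

peers = [value.decode() for value, _meta in client.get_prefix("/alerting/nodes/")
         if value.decode() != self_addr]

# Hand the discovered peer list to PySyncObj and let Raft do the rest.
state = ReplDict()
cluster = SyncObj(self_addr, peers, consumers=[state])
```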
To learn more, visit hostedgraphite.com or follow us on twitter.