Can your infrastructure withstand the failure of an entire datacenter? How about three?
Traditional IT is dedicated to perfecting and protecting critical infrastructure, and control over hardware is the key to maintaining high availability. High availability is measured in uptime ex post facto; despite top-shelf hardware, double backup generators, etc., it is impossible to empirically prove availability in advance. Engineers cannot truly test what might happen if a datacenter fails completely or if a random and unpredictable error triggers the failure of multiple components.
Failure is inevitable and unpredictable, whatever the quality of engineers or hardware. This is equally or more true of cloud infrastructure. This is not to say that cloud infrastructure cannot become highly available, but that the failure of a single component may be more difficult to predict or control.
This is why it is more important to invest in preparing for and expecting failure in cloud infrastructure, and the most effective way to improve and prove resiliency in AWS is Netflix’s Simian Army.
Why Destructive Testing?
With little to no control over the condition of underlying infrastructure, engineers working in the cloud cannot guarantee the viability of critical hardware; instead, they control the results of failure. They cannot predict latency or instance failure, so they build failover systems and distribute load. IaaS platforms like AWS provide additional tools to protect against downtime (auto scaling, deployment automation), and configuring these tools is of course the first step to creating resilient architecture.
However, mission-critical applications cannot rely on the promise of availability in cloud infrastructure, no matter how sophisticated the failover and auto scaling capabilities. Engineers can simulate failure in development environments, or experts can analyze the system on paper, but both strategies only approximate the conditions in which availability can be proven. They have limited success in large-scale environments, where it is often impossibly complex to develop testing models that capture production dependencies across multiple deployments. These strategies can also only test the conditions engineers know exist, so the random and unpredictable remain difficult or impossible to simulate.
There is only one real way to prove availability on the cloud empirically: break, remove, overburden the infrastructure, and see if the applications stay up. Break everything, learn from what fails, eliminate single points of failure, and keep testing.
This is the philosophy behind the Simian Army, Netflix’s well-known suite of destructive, autonomous monkeys that wreak havoc on production environments. The monkeys are part of a sophisticated strategy that has made Netflix an object of admiration among AWS availability enthusiasts. Instead of simulating failure, the approach is to remove or modify infrastructure resources on purpose and learn from what breaks.
Ariel Tseitlin, in an excellent article about the Simian Army, explains this philosophy:
“A complex system is constantly undergoing varying degrees of failure….Increasing the frequency of failure reduces its uncertainty and the likelihood of an inappropriate or unexpected response.”
These are just a few of the monkeys Netflix has developed:
- Chaos Monkey: Randomly terminates virtual instances. As those familiar with AWS know, this is the most common type of failure. Netflix runs this every hour.
- Chaos Gorilla: Terminates all instances in an availability zone (AZ), or isolates them from any service outside the AZ.
- Janitor Monkey: Searches for unused resources and disposes of them.
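To make the Chaos Monkey idea concrete, here is a minimal Python sketch of random instance termination. It is not Netflix's implementation. The function names (`pick_victims`, `unleash`), the grouping of instances by name, and the `probability` setting are all assumptions made for illustration, and the `terminate` callback stands in for whatever cloud API call would actually stop an instance.

```python
import random
from typing import Callable, Dict, List, Optional


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Choose at most one random instance from each group.

    Each non-empty group is hit with the given probability (0.0 to 1.0).
    Group names and instance IDs are hypothetical placeholders.
    """
    rng = rng or random.Random()
    victims: List[str] = []
    for instances in groups.values():
        # Empty groups are skipped; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims.append(rng.choice(instances))
    return victims


def unleash(
    groups: Dict[str, List[str]],
    probability: float,
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Terminate the chosen instances via the supplied callback.

    `terminate` is an assumption: in a real deployment it would wrap the
    cloud provider's terminate call. Here it can be any function.
    """
    victims = pick_victims(groups, probability, rng)
    for instance_id in victims:
        terminate(instance_id)
    return victims


if __name__ == "__main__":
    # Hypothetical inventory: two groups of instances.
    inventory = {
        "web": ["i-web-1", "i-web-2", "i-web-3"],
        "api": ["i-api-1", "i-api-2"],
    }
    unleash(inventory, 0.5, lambda i: print(f"terminating {i}"))
```

Passing the termination step in as a callback keeps the selection logic testable without touching real infrastructure, which is useful when first trying the approach in a test environment.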
Preparing for Chaos
Not many enterprise AWS systems are ready for Netflix’s army. Even those that are ready may not be willing to take the risk of wiping out entire availability zones on customer-facing environments. This is a very reasonable reaction. Netflix is undoubtedly several years beyond most enterprises in terms of the maturity of cloud engineering practices. But as the industry reaches maturity, and IT teams demonstrate the success of such models, such testing will become more common.
For those interested in implementing such practices sooner rather than later, here are a number of factors that can reduce the risk of potentially customer-impacting events:
- “War Room”: This is how Netflix first ran their Simian Army tests, by bringing together senior engineers in a single room to monitor how the infrastructure was handling the test, and potentially stop or reverse the test if performance issues arose.
- Monitoring and detailed reporting: Whenever failure occurs, the first question is always what changed, and when. This can be especially difficult to answer in enterprise environments with multiple datacenters and multiple vendors, and potentially even more difficult in hybrid deployments if the appropriate monitoring has not been set up. Depending on resources, this can be either a custom monitoring interface or one of a handful of third-party tools, such as EM7 and New Relic.
- Begin with Chaos Monkey. Instance failure is the most common type of failure but also the failure engineers most frequently protect against.
- Run the monkey in a test environment first. Again, Tseitlin describes this process here.
- Hire a Managed Service Provider to manage your environment and take responsibility for Simian Army testing. Of course, very few MSPs offer this as a service, because it has the potential to highlight errors in their work; Logicworks uses these tools to harden client environments prior to handing them over for production. In this way, we guarantee our work.
- IT leaders can encourage a culture of failure-tolerance. Easier said than implemented, but this part of agile philosophy can only be implemented by leaders and nurtured in standups and postmortems. It means that a single mistake is a learning experience, and repeated mistakes are the only true failures. This is often challenging in enterprise IT environments where the threat of downtime is constant and incentives are weighted towards stasis over innovation.
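The "war room" and monitoring ideas above can be combined into an automated guard: inject failures one step at a time, check a health metric after each step, and stop and reverse the test when the metric degrades. The following Python sketch illustrates that pattern under stated assumptions. The function `run_guarded_experiment`, the `error_rate` callback, and the 5% default threshold are hypothetical and not taken from Netflix's tooling or any monitoring product.

```python
from typing import Callable, Iterable


def run_guarded_experiment(
    steps: Iterable[Callable[[], None]],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.05,
) -> bool:
    """Run failure-injection steps until done or until health degrades.

    steps          -- callables that each break something (hypothetical)
    error_rate     -- returns the current error rate as a fraction;
                      assumed to come from your monitoring system
    rollback       -- restores the environment when the test is aborted
    max_error_rate -- abort threshold; 0.05 is an arbitrary example value

    Returns True if every step ran without breaching the threshold,
    False if the experiment was aborted and rolled back.
    """
    for step in steps:
        step()
        # Check health after every injected failure, as engineers in a
        # war room would, and stop before doing further damage.
        if error_rate() > max_error_rate:
            rollback()
            return False
    return True
```

A human war room remains the safer starting point; a guard like this only covers the failure modes that the chosen metric can detect.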
Even with a sophisticated destructive testing suite, 100% availability is difficult to achieve. Netflix still goes down. Part of this is because Netflix has prioritized velocity; some enterprises will take a considerably more cautious approach. But a new breed of enterprise IT staff will know that there is no greater badge of high availability (and frankly, courage) than surviving the Simian Army. And as yet, there is no other testing methodology that guarantees infrastructure with the same level of confidence.
If your infrastructure would fail a Simian Army test, it may be time to have a managed service provider optimize your cloud deployments. Logicworks’ team of senior DevOps engineers creates fault-tolerant infrastructure, and we stand behind our work with controlled destructive testing. Contact us to learn more.