Are Your Deployments Working? Simian Army on AWS

DevOps
Tags: AWS, Cloud Infrastructure, Risk Management

Can your infrastructure withstand the failure of an entire datacenter? How about three?

Traditional IT is dedicated to perfecting and protecting critical infrastructure, and control over hardware is the key to maintaining high availability. High availability is measured in uptime ex post facto; despite top-shelf hardware, double backup generators, etc., it is impossible to empirically prove availability. Engineers cannot truly test what might happen if that data center fails completely or a random and unpredictable error triggers the failure of multiple components.

Failure is inevitable and unpredictable, whatever the quality of engineers or hardware. This is equally or more true of cloud infrastructure. This is not to say that cloud infrastructure cannot become highly available, but that the failure of a single component may be more difficult to predict or control.

This is why it is more important to invest in preparing for and expecting failure in cloud infrastructure, and the most effective way to improve and prove resiliency in AWS is Netflix’s Simian Army.

Why Destructive Testing?

With little to no control over the condition of underlying infrastructure, engineers working in the cloud do not carefully guarantee the viability of critical hardware; instead, they control the results of failure. They cannot predict latency or instance failure, so they build failover systems and distribute load. IaaS platforms like AWS provide additional tools to protect against downtime (auto scaling, deployment automation), and configuring these tools is of course the first step to creating resilient architecture.

However, mission-critical applications cannot rely on the promise of availability in cloud infrastructure, no matter how sophisticated the failover and auto scaling capabilities are. Engineers can simulate failure in development environments, or experts can analyze the system's design, but both strategies only approximate the conditions in which availability can be proven. They will have limited success in large-scale environments, where it is often impossibly complex to develop testing models that capture production dependencies across multiple deployments. These strategies can also only test the conditions engineers know exist, so the random and unpredictable remain difficult or impossible to simulate.

There is only one real way to prove availability on the cloud empirically: break, remove, overburden the infrastructure, and see if the applications stay up. Break everything, learn from what fails, eliminate single points of failure, and keep testing.
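
The loop described above (break something, see whether the application stays up, eliminate the single point of failure, repeat) can be sketched against a toy model. This is only an illustration under stated assumptions: the `Fleet` layout, the tier names, and the rule that the application counts as "up" when every tier has at least one running instance are all hypothetical and are not taken from Netflix's tooling.

```python
from typing import Dict, List

# Hypothetical toy model: tier name -> ids of the instances currently running.
Fleet = Dict[str, List[str]]


def application_is_up(fleet: Fleet) -> bool:
    """Assumed health rule: every tier needs at least one running instance."""
    return all(len(instances) > 0 for instances in fleet.values())


def find_single_points_of_failure(fleet: Fleet) -> List[str]:
    """Remove each instance in turn; record those whose loss takes the app down."""
    spofs = []
    for instances in fleet.values():
        for victim in instances:
            # Build a copy of the fleet without the victim; the original is untouched.
            degraded = {
                name: [i for i in members if i != victim]
                for name, members in fleet.items()
            }
            if not application_is_up(degraded):
                spofs.append(victim)
    return spofs


fleet = {
    "web": ["web-1", "web-2"],
    "app": ["app-1", "app-2"],
    "db": ["db-1"],  # only one database instance
}
print(find_single_points_of_failure(fleet))  # ['db-1']
```

In a real environment the "remove" step is an actual termination and the health rule is whatever the monitoring stack reports; the point of the sketch is only the shape of the loop.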

This is the philosophy behind the Simian Army, Netflix’s well-known suite of destructive, autonomous monkeys that wreak havoc on production environments, and are part of a sophisticated strategy that has made Netflix the object of admiration among AWS availability enthusiasts. Instead of simulating failure, the philosophy is to remove or modify infrastructure resources on purpose and learn from what breaks.

Ariel Tseitlin, in an excellent article about the Simian Army, explains this philosophy:

“A complex system is constantly undergoing varying degrees of failure….Increasing the frequency of failure reduces its uncertainty and the likelihood of an inappropriate or unexpected response.”

These are just a few of the monkeys Netflix has developed:

Chaos Monkey: Randomly terminates virtual instances. As those familiar with AWS know, this is the most common type of failure. Netflix runs this every hour.
Chaos Gorilla: All instances in an AZ are terminated or are isolated from any service outside the AZ.
Janitor Monkey: Searches for unused resources and disposes of them.
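
To make the first two behaviors concrete, here is a minimal simulation of what the list above describes. It is a sketch, not Netflix's implementation: the `Instance` class, the `chaos_monkey` and `chaos_gorilla` functions, and the `fake_terminate` callback (a stand-in for a real cloud API call) are all hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instance:
    instance_id: str
    group: str               # hypothetical: e.g. an auto scaling group name
    availability_zone: str
    running: bool = True


def chaos_monkey(
    fleet: List[Instance],
    terminate: Callable[[Instance], None],
    rng: Optional[random.Random] = None,
) -> Optional[Instance]:
    """Terminate one randomly chosen running instance, if any exist."""
    rng = rng or random.Random()
    candidates = [i for i in fleet if i.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    terminate(victim)
    return victim


def chaos_gorilla(
    fleet: List[Instance],
    zone: str,
    terminate: Callable[[Instance], None],
) -> List[Instance]:
    """Terminate every running instance in one availability zone."""
    victims = [i for i in fleet if i.running and i.availability_zone == zone]
    for victim in victims:
        terminate(victim)
    return victims


def fake_terminate(instance: Instance) -> None:
    # Stand-in for a real cloud API call; here it only flips a flag.
    instance.running = False


fleet = [
    Instance("i-a1", "web", "us-east-1a"),
    Instance("i-a2", "web", "us-east-1b"),
    Instance("i-b1", "app", "us-east-1a"),
    Instance("i-b2", "app", "us-east-1b"),
]

victim = chaos_monkey(fleet, fake_terminate, random.Random(7))
print("monkey terminated:", victim.instance_id)

lost = chaos_gorilla(fleet, "us-east-1a", fake_terminate)
print("gorilla terminated:", [i.instance_id for i in lost])
print("still running:", [i.instance_id for i in fleet if i.running])
```

Passing the termination step in as a callback keeps the selection logic testable without touching real infrastructure; a production version would replace `fake_terminate` with a call to the provider's API.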

Preparing for Chaos

Not many enterprise AWS systems are ready for Netflix’s army. Even those that are ready may not be willing to take the risk of wiping out entire availability zones on customer-facing environments. This is a very reasonable reaction. Netflix is undoubtedly several years beyond most enterprises in terms of the maturity of cloud engineering practices. But as the industry reaches maturity, and IT teams demonstrate the success of such models, such testing will become more common.

For those interested in implementing such practices sooner rather than later, here are a number of factors that can reduce the risk of potentially customer-impacting events:

“War Room”: This is how Netflix first ran their Simian Army tests, by bringing together senior engineers in a single room to monitor how the infrastructure was handling the test, and potentially stop or reverse the test if performance issues arose.
Monitoring and detailed reporting: Whenever failure occurs, the first question is always what changed, and when. As explained above, this can be especially difficult in enterprise environments with multiple datacenters and multiple vendors, and potentially even more difficult in hybrid deployments if the appropriate monitoring has not been set up. Depending on resources, this can either be a custom monitoring interface or one of a handful of third-party tools, such as EM7 and New Relic.
Begin with Chaos Monkey. Instance failure is the most common type of failure but also the failure engineers most frequently protect against.
Run the monkey in a test environment first. Again, Tseitlin describes this process here.
Hire a Managed Service Provider to manage your environment and take responsibility for Simian Army testing. Of course, very few MSPs offer this as a service, because it has the potential to highlight errors in their work; Logicworks uses these tools to harden client environments prior to handing them over for production. In this way, we guarantee our work.
IT leaders can encourage a culture of failure-tolerance. Easier said than implemented, but this part of agile philosophy can only be implemented by leaders and nurtured in standups and postmortems. It means that a single mistake is a learning experience, and repeated mistakes are the only true failures. This is often challenging in enterprise IT environments where the threat of downtime is constant and incentives are weighted towards stasis over innovation.
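
The "War Room" and monitoring points above amount to two mechanics: stop and reverse a test as soon as health degrades, and keep a timestamped record that answers "what changed, when." A minimal sketch of both follows. Everything here is an assumption made for illustration: `run_guarded_experiment`, its `inject`, `rollback`, and `is_healthy` callbacks, and the toy replica counter are hypothetical and do not come from any Simian Army tool.

```python
import time
from datetime import datetime, timezone
from typing import Callable, List, Tuple

Event = Tuple[str, str]  # (UTC timestamp, description)


def run_guarded_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
    checks: int = 5,
    interval_seconds: float = 0.0,
) -> Tuple[bool, List[Event]]:
    """Inject a failure, poll health, and roll back on the first failed check.

    Returns (survived, event_log).
    """
    log: List[Event] = []

    def record(message: str) -> None:
        log.append((datetime.now(timezone.utc).isoformat(), message))

    record("experiment started")
    inject()
    record("failure injected")

    for attempt in range(1, checks + 1):
        if not is_healthy():
            record(f"health check {attempt} failed; rolling back")
            rollback()
            record("rollback complete")
            return False, log
        record(f"health check {attempt} passed")
        time.sleep(interval_seconds)

    record("experiment finished; system stayed healthy")
    return True, log


# Toy usage: a service with two replicas that needs at least one to stay healthy.
state = {"replicas": 2}

survived, events = run_guarded_experiment(
    inject=lambda: state.update(replicas=state["replicas"] - 1),
    rollback=lambda: state.update(replicas=2),
    is_healthy=lambda: state["replicas"] >= 1,
    checks=3,
)
for timestamp, message in events:
    print(timestamp, message)
print("survived:", survived)
```

The event log is the part that matters most after the fact: when something does break, it shows exactly which injected change preceded the failed check.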

Even with a sophisticated destructive testing suite, 100% availability is difficult to achieve. Netflix still goes down. Part of this is because Netflix has prioritized velocity; some enterprises will take a considerably more cautious approach. But a new breed of enterprise IT staff will know that there is no greater badge of high availability (and frankly, courage) than surviving the Simian Army. And as yet, there is no other testing methodology that guarantees infrastructure with the same level of confidence.

If your infrastructure would fail a Simian Army test, it may be time to have a managed service provider optimize your cloud deployments. Logicworks’ team of senior DevOps engineers creates fault-tolerant infrastructure and stands behind its work with controlled destructive testing. Contact us to learn more.

May 7, 2015

4 Comments

krishna

April 23, 2019

Which all organisation are using simian army tool???
