Every public cloud service is susceptible to outages. It is possible to design fail-over systems on AWS with lower fixed costs than an on-premises DR solution and zero single points of failure. Ideally, failure of an entire datacenter or Availability Zone in AWS does not affect the availability of the application.
In traditional IT environments, engineers might duplicate mission-critical tiers to achieve resiliency. This can cost thousands or hundreds of thousands of dollars to maintain and is not even the most effective way to achieve resiliency.
Hundreds of small activities contribute to the overall resiliency of the system, but below are the most important foundational principles and strategies.
1. CREATE A LOOSELY COUPLED, LEAN SYSTEM
This basic system design principle bears repeating: decouple components such that each has little or no knowledge of other components. The more loosely coupled the system is, the better it will scale.
Loose coupling isolates the components of your system and eliminates internal dependencies so that the failure of a single component of your system is unknown by the other components. This creates a series of agnostic black boxes that do not care whether they serve data from EC2 instance A or B, thus creating a more resilient system in the case of the failure of A, B, or another related component.
- Deploy Vanilla Templates. At Logicworks, our standard practice is to use a “vanilla template” and configure at deployment time through configuration management. This gives us fine-grain control over instances at the time of deployment so that if, for example, we need to deploy a security update to our instance configuration, we only touch the code once on the Puppet/Chef/Ansible manifest, rather than having to manually patch every instance deployed with Golden Template. By eliminating your new instances’ dependency on your Golden Template, you reduce the failure risk of the system and allow the instance to be spun up more quickly.
- Simple Queuing Service or Simple Workflow Service. When you use a queue or buffer to relate components, the system can support spillover during load spikes by distributing requests to other components. Put SQS between layers so that the number of instances can scale on its own as needed based the length of the queue. If everything were to be lost, a new instance would pick up queued requests when your application recovers.
- Make your applications as stateless as possible. Application developers have long employed a variety of methods to store session data for users. This almost always makes the scalability of the application suffer, particularly if session state is stored in the database. If you must store state, saving it on the client reduces database load and eliminates server-side dependencies.
- Minimize interaction with the environment using CI tools, like Jenkins.
- Elastic Load Balancers. Distribute instances across multiple Availability Zones (AZs) in Auto Scaling groups. Elastic Load Balancers (ELBs) should distribute traffic among healthy instances based on frequent health checks, which you control the criteria for.
- Store static assets on S3. On the web serving front, best practice is storing static assets on S3 instead of going to the EC2 nodes themselves. Putting AWS CloudFront in front of S3 will let you deploy static assets so you do not have the throughput of those assets going to your application. This not only decreases the likelihood that your EC2 nodes will fail, but also reduces cost by allowing you to run leaner EC2 instance types that do not have to handle content delivery load.
2. AUTOMATE YOUR INFRASTRUCTURE
Human intervention is itself a single point of failure. To eliminate this, we create a self-healing, auto scaling infrastructure that dynamically creates and destroys instances and gives them the appropriate roles and resources with custom scripts. This often requires a significant upfront engineering investment.
However, automating your environment before build significantly cuts development and maintenance costs later. An environment that is fully optimized for automation can mean the difference between hours and weeks to deploy instances in new regions or create development environments.
- The infrastructure in action. In the case of the failure of any instance, it is removed from the Auto Scaling group and another instance is spun up to replace it.
- CloudWatch triggers the new instance spun up from an AMI in S3, copied to a hard drive about to be brought up.
- The CloudFormation template allows us to automatically set up a VPC, a NAT Gateway, basic security, and creates the tiers of the application and the relationship between them. The goal of the template is to minimally configure the tiers and then get connected to the Puppet master (or another configuration management tool). This template can then be held in a repository, from where it can be checked out as needed, by version (or branch), making it reproducible and easily deployable as new instances when needed – i.e., when existing applications fail or when they experience degraded performance.
- This minimal configuration lets the tiers be configured by configuration management. Configuring Puppet/Chef/Ansible manifests and making sure master knows what each instance they are spinning up does is one of the more time-consuming and custom solutions you can architect.
- Simple failover RDS. RDS offers a simple option for multiple availability-zone failover during disaster recovery. It also attaches the SQL Server instance to an Elastic Block Store with provisioned IOPS for higher performance.
3. BREAK AND DESTROY
If you know that things will fail, you can build mechanisms to ensure your system persists no matter what happens. In order to create a resilient application, cloud engineers must anticipate what could possibly develop a bug or be destroyed and eliminate those weaknesses.
This principle is so crucial to the creation of resilient deployments that Netflix – true innovators in resiliency testing – has created an entire squadron of Chaos Engineers “entirely focused on controlled failure injection.” Implementing best practices and then constantly monitoring and updating your system is only the first step to creating a fail-proof environment.
- Performance testing. In software engineering as in IaaS, performance testing is often the last and most-frequently ignored phase of testing. Subjecting your database or web tier to stress or performance tests from the very beginning of the design phase – and not just from a single location inside a firewall – will allow you to measure how your system will perform in the real world.
- Unleash the Simian Army. If you a survive Simian Army attack on your production environment with zero downtime or latency, it is proof that your system is truly resilient. Netflix’s open-source suite of chaotic destroyers is on GitHub. Induced failures prevent future failures.
Unfortunately, deploying resilient infrastructure is not just a set of to-dos. It requires a constant focus throughout AWS deployment on optimizing for automatic fail-over precise configuration of various native and 3rd party tools.
Want to build a more resilient AWS environment? Our engineers can provide advice, architect a new solution, or build a new cloud environment. Get in touch with our team of dedicated AWS DevOps engineers.