Written by Donovan Brady
Congratulations – You’ve successfully migrated to the cloud! After a long journey, you finally convinced your enterprise to forgo the traditional CapEx model in favor of an OpEx model, you designed a secure, foundational architecture that is the cloud equivalent of Fort Knox, and you’ve moved your critical workloads to AWS. First and foremost, give yourself a pat on the back – that’s quite the accomplishment!
The morning after your cut-over, you kick your feet up and sit down to enjoy your morning cup of coffee when you receive a call from your executive team inquiring about the most recent AWS outages and how you will guard your enterprise against challenges like these in the future.
“How could it go down?!”, they ask. “I thought the cloud was always supposed to be up! After all, that’s why we moved there in the first place!”
Now, you may be confident your environment is Well-Architected, but this is also a great opportunity to modernize your monolithic architecture. Luckily, you don’t have to do it alone. We specialize in helping customers architect and manage resilient cloud environments that deliver high security, availability, and scale. In the coming months, we will release a series of detailed eBooks on designing auto-scaling groups, decoupling your applications, and much more. But you need answers now, so we won’t leave you hanging. In this blog post, we will briefly cover three methodologies for building a resilient architecture and keeping your applications up and running.
Everything Breaks, All the Time
One critical thing to keep in mind is that everything breaks, all the time. You knew this was true in your datacenter, but hoped it would change by migrating to the cloud. Unfortunately, it didn’t. Preventing failure is impossible, but moving to the cloud makes it much easier to architect for failure and keep your apps online, despite a component outage.
The easiest way to guard against component failure is to design your environment for High Availability (HA) – a system designed to remain operational with minimal downtime, even when individual components fail. Understanding your applications’ availability requirements is key to setting Service Level Agreements (SLAs) with your customers and, from there, defining the monitoring and self-healing strategies needed to meet those SLAs.
Typically, HA is achieved through redundant deployments of your services, whether that is two load-balanced web servers, primary and secondary databases, etc. Building in redundancy at the component level is always a good idea, but the cloud lets you take it a step further by deploying across multiple Availability Zones. Availability Zones (AZs) are physically separate, isolated data centers within the same region. If one AZ experiences an outage, there’s no need to worry, as your applications are redundant across multiple AZs. Further, cloud-native load balancers can detect when resources in one AZ become unhealthy and seamlessly, automatically fail over to a second (or third) AZ, keeping your application online. You can even autoscale your resources across multiple AZs if necessary.
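The failover behavior described above can be illustrated with a minimal sketch. This is plain Python standing in for a cloud load balancer (the class, target names, and AZ names are illustrative, not AWS APIs): targets in every AZ are health-checked, and requests are routed only to targets that pass, so a single-AZ outage doesn’t take the application down.

```python
# Minimal sketch of multi-AZ load balancing with health checks.
# Names (MultiAzLoadBalancer, web-1, us-east-1a) are illustrative only.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Target:
    name: str
    az: str
    healthy: bool = True  # result of the most recent health check

class MultiAzLoadBalancer:
    def __init__(self, targets):
        self.targets = targets

    def healthy_targets(self):
        # Only targets passing health checks receive traffic.
        return [t for t in self.targets if t.healthy]

    def route(self, n_requests):
        """Round-robin requests across currently healthy targets."""
        pool = self.healthy_targets()
        if not pool:
            raise RuntimeError("no healthy targets in any AZ")
        rr = cycle(pool)
        return [next(rr).name for _ in range(n_requests)]

targets = [Target("web-1", "us-east-1a"), Target("web-2", "us-east-1b")]
lb = MultiAzLoadBalancer(targets)

targets[0].healthy = False  # simulate an outage in us-east-1a
assert lb.route(3) == ["web-2", "web-2", "web-2"]  # traffic shifts to the healthy AZ
```

The key design point is that routing decisions are re-evaluated on every request against current health, which is what makes the failover automatic rather than operator-driven.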
However, it is important to understand whether your application supports concurrent or parallelized processing, and which other servers depend on the now highly available servers. If your application is stateful, you may need to refactor its stateful components. If refactoring is out of the question, sticky sessions on your load balancer may be your best option.
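Sticky sessions can be sketched in a few lines. This is a hedged, generic illustration (not a specific load balancer’s implementation): the balancer derives a deterministic mapping from the session ID, so every request in a session lands on the same server, letting a stateful application run behind multiple servers without refactoring.

```python
# Minimal sketch of session affinity ("sticky sessions").
# Server names are illustrative; real load balancers typically use cookies.
import hashlib

servers = ["web-1", "web-2", "web-3"]

def pick_server(session_id: str) -> str:
    """Deterministically map a session to one server via a hash."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return servers[digest[0] % len(servers)]

# The same session always reaches the same server.
assert pick_server("sess-abc") == pick_server("sess-abc")
```

Note the trade-off: affinity preserves in-memory state, but if the chosen server fails, that session’s state is lost – which is why refactoring toward statelessness is the stronger long-term fix.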
If refactoring isn’t out of the question, you may want to go the extra mile to see which resources can be decoupled and replatformed.
One of the primary causes of a system outage is excessive interdependence between components: when one fails, the rest fail with it. Decoupling these components opens a whole new world of modernization – containerization, cloud-native solutions, autoscaling, and much more.
Monolithic applications carry risks beyond availability, including security and cost implications, but availability is the most frequent issue. Where possible, decouple dependent components and replatform them onto cloud-native services, which shifts the burden of availability and resiliency onto the cloud provider. Do you have an API? Leverage a cloud-native API gateway – you only need to worry about your application code, as the service itself is managed and already highly available.
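The decoupling idea can be shown with a minimal sketch. This is plain Python with an in-memory buffer standing in for a managed queue such as SQS (the function names are illustrative): the producer writes to the queue instead of calling the consumer directly, so a consumer outage no longer fails the producer.

```python
# Minimal sketch of queue-based decoupling.
# `deque` stands in for a managed, durable message queue (e.g. SQS).
from collections import deque

queue = deque()

def submit_order(order):
    """Producer: succeeds even if downstream processing is down."""
    queue.append(order)
    return "accepted"

def process_orders(handler):
    """Consumer: drains the backlog whenever it is healthy again."""
    processed = []
    while queue:
        processed.append(handler(queue.popleft()))
    return processed

# Downstream is "down" -- the producer still accepts work.
assert submit_order({"id": 1}) == "accepted"
assert submit_order({"id": 2}) == "accepted"

# Downstream recovers and works off the backlog in order.
assert process_orders(lambda o: o["id"]) == [1, 2]
```

The failure of one component now degrades the system (a growing backlog) instead of breaking it outright – the essence of decoupling.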
There are cloud-native solutions for most common application components. However, always make sure you understand the complexity of your applications’ dependencies, because these can quickly become hairy depending on your application requirements.
Failover & Disaster Recovery
Now you have a highly available application that’s replatformed to use cloud-native tools, but you remember that many of the AWS outage issues were isolated to a single region, which just so happens to be the region you’re deployed in. D’oh!
Disaster Recovery (DR) is a broad term that can refer to many different architectures and solutions. In the cloud, however, it usually refers to a multi-regional deployment following one of four basic strategies: Backup and Restore, Pilot Light, Warm Standby, and Active-Active. Which DR strategy you choose is determined by your business’s Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Logicworks recommends starting with a Pilot Light deployment and moving up if necessary, since the higher-level strategies can become quite expensive.
A Pilot Light strategy keeps a baseline deployment in a secondary region, with your most critical components – typically the primary database and a web server – always on and continuously replicated. If a failure occurs in your primary region, your application can automatically fail over to the secondary region and scale out components as needed to meet demand.
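The failover decision behind a Pilot Light setup can be sketched as simple logic. This is a hedged illustration (thresholds and region names are assumptions, not AWS defaults, and real setups typically use DNS-level health checks such as Route 53): if the primary region fails several consecutive health checks, traffic is pointed at the secondary region, where the critical tier is already running and can be scaled out.

```python
# Minimal sketch of a Pilot Light failover decision.
# Threshold and region names are illustrative assumptions.
FAILURE_THRESHOLD = 3  # consecutive failed checks before failing over

def choose_region(recent_checks, primary="us-east-1", secondary="us-west-2"):
    """Return the region traffic should target, given primary health checks
    ordered oldest to newest (True = check passed)."""
    streak = 0
    for ok in recent_checks:
        streak = 0 if ok else streak + 1  # count the current failure streak
    if streak >= FAILURE_THRESHOLD:
        return secondary  # fail over; scale-out of the pilot light starts here
    return primary

assert choose_region([True, True, False]) == "us-east-1"          # transient blip
assert choose_region([False, False, False]) == "us-west-2"        # sustained outage
```

Requiring a sustained streak of failures, rather than a single failed check, is a common way to avoid flapping between regions on transient errors.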
When determining a DR strategy, make sure you have both the infrastructure and the protocols, procedures, and escalation policies necessary to successfully complete a failover event. One cannot succeed without the other!
Logicworks has extensive content with respect to this specific topic – our Disaster Recovery for the Cloud Era eBook is a great place to start.
Migrating to the cloud is a great decision and is almost always recommended for its numerous benefits with respect to security, cost, and resiliency. However, as with all good things, it takes time and effort to design the right path forward and take full advantage of what the cloud has to offer. Now that you have three straightforward strategies to improve resiliency in your environment, you can walk into Monday’s executive meeting with your head held high, knowing you have the answers to quell your team’s concerns. And if you need more details on how we can help you architect a resilient cloud environment, please schedule a call with us.
Stay tuned for more resiliency content from Logicworks!
Donovan Brady is the Director of Solutions Architecture at Logicworks. He’s been helping customers achieve success in the cloud with Logicworks since 2014.