If you listen to marketing language about the public cloud, you’d be forgiven for believing the cloud requires little to no ongoing maintenance. You’ve probably heard stories of the IT team that manages fleets of thousands of instances with two engineers, or how Netflix launches and destroys entire systems without human intervention.
The consequence is that 76% of IT leadership underestimates the time and cost of cloud management. So in this epically long post, we’ve set out to document exactly how much time and money it actually takes to manage a cloud environment in terms of staffing and tooling.
Why Do Companies Underestimate Cloud Management?
The cost and time required for cloud management can vary widely depending on how frequently you add or remove cloud resources, how many VMs/instances you manage, and whether or not you take advantage of fully-managed cloud-native services (e.g., Amazon RDS). But by far the biggest impact on these estimates is whether or not you automate certain cloud management tasks, such as instance build-out, patching, or alerting. The enterprises that manage huge cloud environments with “minimal” effort actually spend a significant amount of time and budget automating every aspect of their cloud environment, and have an entire staff of people dedicated to maintaining this automation. They have probably also rebuilt their software to work with serverless services. Comparing your own internal team to these enterprises can create unrealistic expectations, and it is one of the main reasons leadership underestimates cloud management and under-hires.
Insufficient planning for cloud management can have serious consequences for the long-term success of cloud projects, possibly resulting in:
a. Stalled cloud migration projects due to overwhelmed operations staff
b. Isolated cloud projects without common, shared standards
c. Ad hoc security configurations, due to a lack of common benchmark or inconsistent application of that benchmark
d. Shortage of shared resources and learnings across teams, which could lead to further inefficiencies
e. Inadequate, inconsistent financial data for tracking purposes
When you combine insufficient cloud planning and increasing pressure to deliver infrastructure faster and more reliably, you can see why IT teams are struggling to keep up. These pressures sometimes cause cloud projects to falter and stagnate after the first wave of migration; teams have one or two moderately successful projects, but do not know how to expand usage with the proper controls across the enterprise. The company usually gets some cost benefits from migrating, but does not get the agility benefits it expected.
The foundation of any cloud management practice is your support desk / Network Operations Center (NOC) / operations support team. This is the team or individual responsible for responding to monitoring alerts, supporting and troubleshooting issues, and performing basic maintenance tasks like adding or changing instance configurations manually.
- 24×7 Support, 5 Days a Week: 3 Employees Minimum
- 24×7 Support, 7 Days a Week: 6 Employees Minimum
- M-F Business Hours Support: 1.2 Employees Minimum
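As a sanity check on the minimums above, here is a rough sketch of the raw coverage arithmetic. It assumes a 40-hour work week per employee (our assumption, not a figure from the list) and ignores vacation, sick leave, and shift handoffs, so treat the result as a floor rather than a hiring plan. The function name `minimum_fte` is hypothetical.

```python
def minimum_fte(hours_per_day, days_per_week, hours_per_employee_week=40.0):
    """Raw lower bound on full-time employees needed to cover a support window.

    Assumption: each employee works `hours_per_employee_week` hours and
    there is no allowance for time off, training, or shift overlap.
    """
    if hours_per_employee_week <= 0:
        raise ValueError("hours_per_employee_week must be positive")
    covered_hours = hours_per_day * days_per_week
    return covered_hours / hours_per_employee_week


# Coverage windows loosely matching the three support tiers listed above.
for label, hours, days in [
    ("24x7 support, 5 days a week", 24, 5),
    ("24x7 support, 7 days a week", 24, 7),
    ("Business hours, Monday to Friday", 8, 5),
]:
    print(f"{label}: at least {minimum_fte(hours, days):.1f} FTE before any headroom")
```

Round-the-clock, seven-day coverage needs more than four people on paper before anyone takes a day off, which is why a realistic minimum for that tier sits well above the raw number.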
Employee Skill Requirements
- 1+ years of hands-on experience with Linux Administration, Engineering, and Automation (e.g., Red Hat, Debian/Ubuntu, CentOS)
- Strong understanding of the OSI Model and how it relates to cloud computing
- Ability to manage the lifecycle of IaaS VMs – troubleshoot VM boot issues, Backup/Restore VM and Disks (Snapshot and Recovery Vault), Imaging, Resizing and Scaling VMs
- Expert understanding and experience in monitoring tool diagnostic settings and log analytics
- Understanding of cloud resource templates (e.g., Azure Resource Manager templates or CloudFormation templates)
Annual Salary Requirements per Employee: $45,000 – $75,000
Annual Salary Requirements for Support Team Lead: $80,000 – $95,000
Common Tasks of Operations Support:
*The following chart assumes approximately 1 Hour per VM patched.
Common Tasks for OS Patching:
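If you want to turn the one-hour-per-VM assumption into a number for your own fleet, a rough sketch looks like this. The patch cadence (`patch_cycles_per_month`) is our assumption for illustration; substitute whatever schedule your team actually follows.

```python
def monthly_patching_hours(vm_count, hours_per_vm=1.0, patch_cycles_per_month=1):
    """Estimate engineer hours spent on OS patching per month.

    hours_per_vm defaults to the guide's assumption of roughly 1 hour per VM.
    patch_cycles_per_month is a hypothetical cadence, not a figure from the guide.
    """
    if vm_count < 0 or hours_per_vm < 0 or patch_cycles_per_month < 0:
        raise ValueError("inputs must be non-negative")
    return vm_count * hours_per_vm * patch_cycles_per_month


# Example fleet sizes, chosen only for illustration.
for vms in (10, 50, 200):
    print(f"{vms} VMs: about {monthly_patching_hours(vms):.0f} hours/month")
```

The estimate scales linearly with fleet size, so anything that lowers `hours_per_vm` (automated patching, for example) has an outsized effect on large environments.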
This can vary widely based on how you take backups. At the lower end, this requires reviewing automated backups. Higher end estimates reflect backups that frequently fail or must be taken manually.
Common Tasks for Backups:
Incident response is obviously highly variable, so we won’t be providing time estimates for this section. But incidents do occur and must be planned for.
All too often, there are business leaders operating at 30,000 feet and engineering teams at ground level. Engineers see a bulleted list of strategies that have little impact on their tickets, while business leaders see a list of in-progress features with unclear ties back to their vision.
For a mature cloud environment to be managed properly, it needs a Project Manager (PM). Sometimes this role can belong to the manager of the cloud support team, or the CIO or CTO, but mature cloud teams know that having a separate office for the PM function is critical to keeping cloud projects moving. Most companies are never “done” migrating to the cloud; their current cloud environment is always growing. Even cloud projects in “steady state” have myriad issues that need good project management.
Project Managers are normally the first to identify when strategy and execution diverge. After the initial planning phase of a project, CTOs are typically not involved in the execution of the day-to-day. PMs fill the gap and report directly to CTOs and line of business directors. When engineers go off-plan, the PM has the authority to say “no” — and their knowledge of the full plan allows them to back up their case.
Similarly, when business leaders make the case for the engineers to work on a new project, PMs can articulate exactly what the impact of such a decision would be. They know the backlog, can assess what other high-priority items would get pushed back, how this would affect delivery timelines and budgets, and make a business case for either maintaining the current plan or changing it.
On a daily basis, PMs schedule engineering resources, email the engineers at the cloud platforms for special requests, attend daily standups, troubleshoot access issues, train engineers, field questions about new cloud products — things that no engineer should do, and no business manager has time to do either. Sometimes, the single most important thing a cloud PM can do is answer a question. End users often have relatively complex questions, and if they tried to find an answer alone, it could take hours or even days of trial and error. A PM’s job is to prioritize that question, either with their own engineers or with the cloud platform, and get an answer quickly so their projects stay on track.
When a strong project manager is not in place on a cloud project, engineers are often the ones who suffer most. Frequently, it means an overload of tickets for unrelated projects in multiple lines of business, which lowers efficiency and increases dissatisfaction.
Successful project managers for cloud projects should be experienced technologists with the skills mentioned above. This gives them a place of authority, allowing them to drive the discussion around planning with engineers and business leaders alike.
- 5+ years managing projects, preferably including a client-facing role
- Familiarity with JIRA, Excel/Office, Salesforce
- Ability to learn new systems/software quickly
- Strong IT background, preferably in IaaS environments (AWS/Azure)
- Exceptional organizational, presentation, and communication skills
- Demonstrated ability to deal with change and quick deadlines
Annual Salary Requirements per Employee: $75,000 – $120,000
New Resource Creation & Configuration
Your cloud environment will change over time, even if your applications are relatively static. A critical part of planning for cloud management is identifying who is responsible for these changes:
- Creating and configuring new instances
- Creating and configuring new environments (dev/test/QA)
- Creating and configuring new accounts (if applicable)
- Changing instance type or size
Each of these tasks can be done manually in the cloud management console. If you don’t manage a large cloud environment, launching new resources manually in the console might actually be the fastest way of accomplishing something. It could work for your team. However, it is not recommended for the long term, especially if you have high security or compliance requirements.
Humans are fallible. We make mistakes, especially under pressure. The bigger the infrastructure, the easier it is to forget to close a security loophole when launching a new instance, or to forget to require MFA. Engineers are a smart group, but automation created by expert engineers is smarter.
- Option #1: Build new resources manually in the console
- Option #2: Build instances/VMs using golden AMIs
- Option #3: Use a configuration management tool to configure vanilla AMIs
- Option #4 (the most common): A combination of all of the above
Most companies will settle on some system that involves a combination of AMI maintenance and configuration management. According to a recent survey, 42% of enterprises use Puppet and 37% use Chef (many survey respondents likely overlapped). Nearly 20% of enterprises plan to adopt Puppet or Chef next year.
A mature IT team relies on configuration management to maintain a single source of consistent, documented system configuration. As enterprise infrastructure becomes code and instances can be spun up or down with a few clicks, that consistency is essential for complex deployments. An instance-resident configuration tool like Puppet is also the key configuration engine during Auto Scaling events. It keeps configuration under version control and has monitoring and reporting capacities, along with other benefits.
It is often tempting to not bother with configuration management in an initial cloud set-up, especially if the team is new to using it. But the hard work upfront pays off, because every consumer-facing application changes. The more that is automated, the more time engineers can spend on new projects, and the quicker the team can adapt to change.
It is often very difficult to find the balance between what is baked into the AMI (to create a “Golden AMI” or “Master”) and what is done on launch with a configuration management tool (on top of a “Vanilla AMI”). In reality, how you configure an instance depends on how fast the instance needs to be spun up, how often auto scaling events happen, the average life of an instance, etc.
What we’ve discussed here is just scratching the surface of this very complex topic. We’ve talked about automation extensively in other resources:
- DevOps on AWS: eBook
- Puppet and AWS: DevOps Best Practices
- Continuous Compliance eBook
- Vanilla AMI vs. Golden AMI
- Common Misconceptions about Auto Scaling
Environment or Account Creation
Whether you’re just spinning up and down dev/test environments on weekends, or have to build entire new accounts whenever you add a new software client, automation is critical to any IT team where environment or account creation is frequent.
The foundation of this automation is infrastructure templates. CloudFormation, Terraform, and ARM templates are popular examples.
An infrastructure template is a very simple concept. You tell the cloud platform what you want the environment to look like (in JSON, or in YAML or HCL depending on the tool), and the platform takes care of performing the manual actions of provisioning those services. Hand-coding JSON is not a pleasant experience, but it’s not complicated. You can get a head start with pre-built templates.
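To make the idea concrete, here is a minimal sketch that builds a CloudFormation-style template in Python and prints it as JSON. The resource name `AppServer`, the `AmiId` parameter, the tag values, and the default instance type are all placeholders we made up for illustration; a real template would describe your own network, security groups, and so on.

```python
import json


def build_template(instance_type="t3.micro"):
    """Return a minimal CloudFormation-style template as a Python dict.

    All logical names and values here are illustrative placeholders.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative single-instance environment",
        "Parameters": {
            # The AMI is passed in at launch time rather than hard-coded.
            "AmiId": {
                "Type": "AWS::EC2::Image::Id",
                "Description": "AMI to launch (golden or vanilla)",
            }
        },
        "Resources": {
            "AppServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "AmiId"},
                    "InstanceType": instance_type,
                    "Tags": [{"Key": "Environment", "Value": "dev"}],
                },
            }
        },
    }


# Serialize the dict to the JSON document you would hand to the platform.
template_json = json.dumps(build_template(), indent=2)
print(template_json)
```

Generating the JSON from code is only one option. Many teams write the template by hand or start from a pre-built one, and either way the file itself is what gets version-controlled.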
In an ideal world, your systems engineers create these templates and then version-control them, either in a GitHub repository or using a tool like AWS Service Catalog. You’re not just using these templates to build out an environment once; rather than manually changing your environment, your engineers will change the template and relaunch the entire stack. That means your template is always a true reflection of the configuration of your live systems. Security professionals love this, and your auditors will, too.
If you frequently need to launch entire new accounts, an additional layer of templating and automation is required. This is often the case for SaaS companies with single-tenant application structures, for example, which need to build out a new account whenever they onboard a new customer. Setting up a new account is time-consuming and extremely complex. You can use a tool like AWS Control Tower and Account Factory to automate this.
Automation Takes Time
None of the automation we’ve discussed comes out-of-the-box with any cloud platform. You must build and maintain AMIs, configuration management scripts, and related documentation yourself. The more complex your environment becomes, the more valuable this approach is.
Employee Skill Requirements for DevOps Engineer:
- 5+ years DevOps/Sys Engineer experience in a production capacity
- 5+ years of Linux and/or Windows system administration experience
- CI/CD platform experience, including installation and configuration of CI platforms, proficiency with application build/package/distribution tools, ideally experienced in creating pipelines from scratch
- Production experience in Terraform, CloudFormation, or other IaC tools
- Production experience with container technology for application packaging, delivery and operation
- Production experience with distributed systems architectures and related tools
- Experience maintaining and troubleshooting backend and AWS infrastructure
Annual Salary Requirement of DevOps Engineer: $110,000 – $150,000
Common Tasks of Resource Creation:
Agent Management is a subset of resource creation and maintenance that’s often overlooked when planning for cloud management. When you operate in the cloud, each instance will have several agents, including, but not limited to:
- Cost management agent (e.g., a tool like CloudHealth)
- Monitoring agent (e.g., CloudWatch agent or Datadog agent)
- IDS agent (e.g., Threat Stack)
- Antivirus agent (e.g., Trend Micro)
Not only do you have to remember to install these agents on every instance, but if you have an Auto Scaling event, they often fall off, fail, or have other problems. It’s a tedious and sometimes annoying aspect of any cloud management practice, but a configuration management tool can help you manage agents.
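Even before you automate installation, a simple audit can tell you which instances have lost an agent. The sketch below is a minimal illustration: the agent labels, instance IDs, and the `inventory` structure are hypothetical, and in practice the inventory would come from your configuration management or monitoring tool.

```python
# Agent categories every instance is expected to run (labels are illustrative).
REQUIRED_AGENTS = {"cost", "monitoring", "ids", "antivirus"}


def missing_agents(inventory, required=REQUIRED_AGENTS):
    """Map each instance ID to the sorted list of required agents it lacks.

    Instances that have every required agent are left out of the result.
    """
    gaps = {}
    for instance_id, installed in inventory.items():
        missing = sorted(required - set(installed))
        if missing:
            gaps[instance_id] = missing
    return gaps


# Hypothetical inventory: instance ID -> agents currently reporting in.
inventory = {
    "i-aaa111": ["cost", "monitoring", "ids", "antivirus"],
    "i-bbb222": ["monitoring"],  # e.g., launched during an Auto Scaling event
}
print(missing_agents(inventory))
```

A report like this is only a stopgap. The longer-term fix is still to have a configuration management tool install and verify agents on every launch.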
Cost control is a crucial aspect of any cloud management practice. According to a recent survey, 37 percent of organizations list unpredictable costs as a top issue, and nearly one-third struggle with a lack of visibility into cloud resource usage.
These frustrations have big down-stream impacts and can cause companies to stop migrating workloads to the cloud, stall major projects, and cause long-term tension between technical and business teams.
Cost Management Platforms Are Great…But Not Enough
Many IT teams try to control costs by using a cost management platform. There are hundreds of great cloud cost management platforms on the market and all of them provide better alerting and historical tracking for cloud costs.
Cost management platforms can tell you that costs are spiking or unusual, but they can’t investigate and fix the underlying issues for your engineers, and they can’t report to the finance team for you. In addition, they can’t purchase and manage Reserved Instances/VMs, which in some companies is a full-time job.
Cloud cost management requires engineer hours every month and regular review from IT leadership.
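To show why a spike alert is only the starting point, here is a rough sketch of the kind of check a cost tool might run. This is our own simplified illustration, not how any particular platform works; the 7-day window, the 1.5x threshold, and the sample figures are arbitrary assumptions.

```python
def flag_cost_spikes(daily_costs, window=7, threshold=1.5):
    """Return indexes of days whose cost exceeds threshold x the trailing mean.

    The trailing mean is computed over the `window` days before each day,
    so the first `window` days are never flagged.
    """
    spikes = []
    for day in range(window, len(daily_costs)):
        baseline = sum(daily_costs[day - window:day]) / window
        if baseline > 0 and daily_costs[day] > threshold * baseline:
            spikes.append(day)
    return spikes


# Hypothetical daily spend in dollars; day index 7 is the unusual one.
costs = [100, 102, 98, 101, 99, 100, 100, 180, 101]
print(flag_cost_spikes(costs))
```

The function can tell you which day looks unusual, but not which team launched what, whether the spend was intentional, or what to do about it. That follow-up is the part that takes engineer hours.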
Who Owns Cost Management?
The #1 struggle with cloud cost management we’ve seen is lack of ownership in cloud teams. Billing is not usually the responsibility of engineers, and cloud bills are too technical for finance teams, so cost management often ends up on the desk of the IT manager or CIO/CTO, who doesn’t have time to examine costs closely. This is why it’s so easy for costs to get out of control.
Common Cost Management Tasks
Security and Compliance
Many of the items already listed in this Guide are directly related to your security and compliance efforts: installing third party security tool agents, monitoring alerts, incident response, etc. But what we haven’t discussed is the ongoing effort of maintaining your security model, regularly assessing your systems, and meeting compliance requirements.
Cloud Security: Who Owns It?
In an ideal world, your CISO owns cloud security and is deeply involved in security planning and execution. However, only 65% of companies have a CISO or an equivalent who reports to the CIO function. Many mid-sized companies don’t have anyone in their IT department with a security title, so it falls on the cloud engineers to train themselves and follow official guidance from organizations like the Cloud Security Alliance and NIST.
More than 77% of IT leaders don’t think their company would pass all of its cloud compliance audits if they happened today. 88% say that having to meet a compliance standard prevents them from migrating an application with regulated data to the cloud to begin with.
So, why is compliance such a big roadblock? Is it because of the lack of experience among IT teams or is it the cloud platforms themselves?
Major cloud platforms like Amazon Web Services (AWS) and Microsoft Azure service thousands of customers in highly-regulated industries. AWS has over 20 certifications and is annually audited for an additional 25 regulations or frameworks. It has prioritized services that automate security and compliance tasks.
In short, public cloud providers have made significant investments in tools, documentation, and audits to enable compliance on their platforms. It is unlikely that cloud platforms themselves are inhibiting adoption and more likely that compliance on the cloud is just confusing, hard, and expensive.
Confusion About Compliance on the Cloud
49% of IT decision makers believe cloud providers are more responsible for compliance in the cloud. Unfortunately…they’re wrong.
Cloud providers like AWS are very clear on this point: “While AWS manages security of the cloud, customers remain responsible for compliance and security in the cloud.” The company, not the cloud provider, bears contractual responsibility for compliance, unless a special contract or BAA is signed. Even in this circumstance, cloud providers assume very limited liability, usually limited to physical security.
Executives must understand compliance responsibility in order to successfully operate on the cloud. Further training is required to educate IT decision makers and engineers on how to host regulated data on the public cloud, which will reduce resistance to migrating compliant workloads to the cloud in the future.
Compliance is Time-Consuming and Expensive
A recent Globalscape and Ponemon study found that on average, 14.3 percent of total IT budgets were spent on compliance. A mid-sized company in a highly regulated industry — like a healthcare SaaS company — can spend an even more significant portion of its overall budget on compliance. All of the third party tools, licenses, and machine images you pay for or custom build on-premises now have to be duplicated in the cloud, causing a major drain on engineering hours and IT budgets. It’s no surprise then that 82% of IT professionals wish more cloud compliance tasks were automated.
On top of that, hiring the right engineers to complete cloud compliance tasks is also time-consuming and expensive. A majority of IT leaders say it is difficult to find engineers with compliance expertise. This confirms related reports that major financial firms are paying a large premium for compliance talent. With high turnover rates, existing compliance staff is difficult or impossible to replace.
Common Cloud Security Tasks
**Does not include operations support tasks (ex. Patching, agent management) listed above.**
Annual Cloud Security Engineer Salary: $100,000 – $150,000
Annual CISO Salary: $180,000 – $300,000
Cloud Security Engineer Skills Requirements:
- Experience creating and managing enterprise information security architectures and solutions across multiple disciplines: network, cloud, endpoint, software development, etc.
- In depth understanding and knowledge of network security capabilities and best-practices (e.g. IPS/IDS, firewalls, proxies, BYOD, SIEM, wireless security)
- Experience performing security/vulnerability reviews of network environments
- Experience with SAML / Single Sign On tools, techniques, and authentication with SaaS applications
- CISSP or other security architecture methodology certifications
Putting it Together: Budgeting for Cloud Management
Now that we’ve broken down cloud management tasks by group, let’s take a look at the full picture and try to estimate the time and cost of cloud management.
Cloud Management Time and Staff Estimates
Staff Salary Estimates
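To roll the salary ranges quoted throughout this guide into a single figure, you can use a rough sketch like the one below. The ranges are the ones listed in each section; the example team composition is purely hypothetical, and the total covers base salary only (no benefits, recruiting, or tooling).

```python
# Annual salary ranges (low, high) in USD, as quoted in the sections above.
SALARY_RANGES = {
    "operations_support": (45_000, 75_000),
    "support_team_lead": (80_000, 95_000),
    "project_manager": (75_000, 120_000),
    "devops_engineer": (110_000, 150_000),
    "cloud_security_engineer": (100_000, 150_000),
    "ciso": (180_000, 300_000),
}


def annual_salary_budget(headcount):
    """Return (low, high) total annual salary for a {role: count} mapping."""
    low = sum(SALARY_RANGES[role][0] * count for role, count in headcount.items())
    high = sum(SALARY_RANGES[role][1] * count for role, count in headcount.items())
    return low, high


# Hypothetical team: adjust the counts to match your own plan.
example_team = {
    "operations_support": 3,
    "support_team_lead": 1,
    "project_manager": 1,
    "devops_engineer": 1,
    "cloud_security_engineer": 1,
}
low, high = annual_salary_budget(example_team)
print(f"Estimated annual salaries: ${low:,} to ${high:,}")
```

Swapping in your own headcount makes it easier to compare an in-house team against the MSP quotes discussed in the next section.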
Insource or Outsource Cloud Management?
40% of IT managers plan to hire outside consultants to help them with cloud management. This number is highest among companies that have tried to manage the cloud on their own for several years.
Companies that want to reduce the IT management burden on their in-house team hire a Managed Services Provider (MSP).
What does an MSP do?
MSPs do most of the tasks described in this Guide. In some cases, they replace an in-house cloud support team and in other cases, MSPs act as an extension of the in-house team, doing the “heavy lifting” so that the in-house cloud team can do more differentiating tasks, like innovation and experimentation.
Early in the process of interviewing MSPs, you should ask for a copy of their RACI charts to see exactly who owns what. (If they don’t even have one, run away!)
Cost of MSP vs. In-House
Use the estimates in the previous section about the cost of managing cloud in-house, and then compare them with several quotes from MSPs.
In our experience, outsourcing cloud management costs significantly less than maintaining resources in-house; see chart below.
Of course, these numbers vary based on what your environment looks like, but the team at Logicworks has been repeatedly told that we cost a fraction of what it would cost to build a team in-house.
Choosing the Right Cloud Partner
- Ask for NPS Scores. See how all customers rate the MSP, not just how their references rate them.
- Cloud Platform Validation. Limit your choices to the top partners at your cloud platform of choice. MSPs have to go through extensive business and technical validations to be in those programs.
- Gartner Magic Quadrant (MQ). Gartner puts out an annual Magic Quadrant for Public Cloud Infrastructure Managed Services Providers. It’s external validation that a company meets stringent platform, staffing, financial, and customer success standards.
- Takeovers. It’s not discussed often, but most MSPs are unable to take over your current cloud account and provide critical services. They must rebuild your entire cloud account in order to manage it, then migrate your existing apps over to the new target environment. So before you engage a partner, make sure they can manage the accounts they take over, sometimes called a “brownfield takeover”. It’s an often overlooked but crucial capability that very few MSPs offer.
- Cloud R&D. Cloud technology changes every day. Old-guard MSPs are highly proficient in maintaining a system, but may not build cloud infrastructure that can evolve efficiently. Businesses should find an MSP that prioritizes ongoing changes, not just ongoing monitoring.
- Compliance. If you have a compliance obligation that you must meet on the cloud, hire an MSP that’s certified in that compliance framework. It can save you hundreds or even thousands of hours of in-house effort if you can rely on your MSP’s Attestation of Compliance in your own audits. Additionally, the ability to earn such certifications indicates that the MSP possesses a high level of security and compliance expertise, because they require extensive (and expensive) investigations by third party auditors of physical infrastructure and team practices.
Many companies are struggling to maintain their cloud environments, not because cloud management is harder than traditional server management, but because they’ve either underestimated the time and effort of cloud management, don’t have enough staff, or haven’t automated common management tasks.
Planning for cloud management is the first step to getting back on the right track. Whether you’re planning to migrate to the cloud or trying to reform an existing cloud team, we hope that this guide helps.
Logicworks is a cloud consulting and managed services company that helps organizations plan, architect, and manage complex cloud environments. Our team of cloud experts has helped 400+ organizations migrate to AWS and Azure with our unique approach to cloud strategy and design, including MassMutual, Major League Soccer, and Choice Hotels.
As an AWS Premier Consulting Partner and Azure Expert MSP with HIPAA, HITRUST, PCI, ISO 27001, SOC1, and SOC2 certifications, Logicworks specializes in complex workloads for companies with high security and compliance requirements. If you’re planning new cloud projects in 2020 and want expert help in avoiding common migration stumbling blocks, visit www.logicworks.com or contact us at (212) 625-5300.