Usually when we think of the cloud, we think of computers and databases and stuff someplace far away. It “just works” and we do not think of the numerous folks that maintain it and countless people that use it. But it has a different feel now, as folks around the world are hunkering down at home or in crisis command centers, facing the new realities COVID-19 has brought us, and how its impact continues to evolve and develop.
We’re aware that many individuals and industries are facing challenges they’ve never faced before. Those challenges vary, from wild fluctuations in supply chains to enormous digital demand on the technologies and platforms that enable us to stay connected and live productive lives.
This is why Jay Chapel, CEO of ParkMyCloud, is looking for ways to help organizations reduce costs across the board. He has collected a lot of knowledge about saving money in the public cloud from some great companies and is happy to share some ideas that may help in some small way.
M.R. Rangaswami: How do you best describe cloud waste, and what customers can be hurt by it?
Jay Chapel: Cloud waste occurs when you consume more cloud resources than you actually need. It can take many different forms. Perhaps the most common is simply resources left running 24×7 in development, test, demo, and training environments. In a lot of cases, this is a bad habit that started in the previous era of on-premise data centers. Then, users thought, “It’s a sunk cost so why bother turning it off?” Of course, with cloud resources charged on an on-demand model, it is not a sunk cost anymore.
Other examples of cloud waste include orphaned volumes (volumes not attached to any servers), old snapshots of volumes, out-of-date machine images, inefficient containerization, underutilized databases, instances running on legacy resource types, unused reserved instances and more.
As for who is hurt by cloud waste, the answer is almost everyone. For cloud customers, it erodes their return on assets, return on equity and net revenue. All of these ultimately impact earnings per share for their investors as well.
It also hurts the public cloud providers and their bottom line. Providers are most profitable when they can oversubscribe their data centers. Cloud waste forces them to build more data centers than they need, killing their oversubscription rates and hurting their profitability as well. This is why you see cloud providers offering certain types of cost-cutting solutions. For example, AWS offers Reserved Instances, where you can pay up front for a break on on-demand pricing.
M.R.: By your calculations, how big do you see the cloud waste problem being?
Jay: We recently estimated that cloud waste will exceed $17.6 billion in 2020. That works out to about $48 million worth of cloud waste every day! It is a growing problem that not many people are talking about.
While I personally do find that cloud customers are more aware of the potential for wasted spending than they were just a few years ago, that awareness does not seem to translate into infrastructure that is cost-optimized from the beginning. The fact is, it’s simply not a default behavior. After all, engineering and development teams are more focused on delivering value through product development than on optimizing costs, as they should be. But the costs still add up.
I frequently run reports for companies having issues with cloud waste. Invariably, I find wasted spend in these accounts. For example, one healthcare IT provider was found to be wasting $5.24 million annually on their cloud spend—an average of more than $1,000 per resource per year.
In general, the total waste is coming from:
Idle Resources: Idle resources are VMs and instances being paid for by the hour, minute or second, that are not actually being used 24/7. Typically, these are non-production resources being used for development, staging, testing and QA. Based on data collected from our users, about 44% of their compute spend is on non-production resources. Most non-production resources are only used during a 40-hour work week, and do not need to run 24/7. That means that for the other 128 hours of the week (76%), the resources sit idle, but are still paid for.
So, I find the following wasted spend from idle resources: $33.3 billion in compute spend, times 0.44 non-production, times 0.76 of the week idle, equals about $11 billion wasted on idle cloud resources in 2020.
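The arithmetic above can be checked in a few lines of Python. This is a minimal sketch that uses only the figures quoted in the interview; the function and variable names are illustrative, not part of any ParkMyCloud tooling.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def idle_fraction(hours_used_per_week: float) -> float:
    """Share of the week a resource is running but not in use."""
    return (HOURS_PER_WEEK - hours_used_per_week) / HOURS_PER_WEEK


def idle_waste(compute_spend: float, non_prod_share: float,
               hours_used_per_week: float) -> float:
    """Spend on non-production resources during the hours they sit idle."""
    return compute_spend * non_prod_share * idle_fraction(hours_used_per_week)


compute_spend_2020 = 33.3e9  # compute spend figure quoted above
non_prod_share = 0.44        # share of compute spend on non-production
idle_waste_2020 = idle_waste(compute_spend_2020, non_prod_share, 40)

print(f"Idle share of the week: {idle_fraction(40):.0%}")
print(f"Estimated idle waste: ${idle_waste_2020 / 1e9:.1f} billion")
```

Because the sketch uses the unrounded 128/168 rather than 0.76, its result comes out slightly above the rounded $11 billion quoted above.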
Overprovisioned Resources: Another source of wasted cloud spend is overprovisioned infrastructure, that is, paying for resources that are larger in capacity than needed. That means you’re paying for resource capacity you’re rarely (or never) using. About 40% of instances are sized at least one size larger than needed for their workloads. Just by reducing an instance by one size, the cost is reduced by 50%. Downsizing by two sizes saves 75%.
The data I see in client infrastructure confirms this, and the problem may well be even larger. The infrastructure I see has an average CPU utilization of 4.9%. Of course, this could be skewed by the fact that the resources I deal with are more commonly non-production. However, it still paints a picture of gross underutilization, ripe for rightsizing and optimization.
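The downsizing figures follow from each size step halving the price. The sketch below works that out and then applies it to the compute spend quoted earlier. It rests on assumptions that the interview does not state: that 40% of instances corresponds to 40% of compute spend, and that each of those instances is downsized by exactly one size. All names are illustrative.

```python
def downsizing_savings(steps: int, price_ratio_per_step: float = 0.5) -> float:
    """Fraction of cost saved by moving down `steps` instance sizes,
    assuming each step multiplies the price by `price_ratio_per_step`."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return 1 - price_ratio_per_step ** steps


compute_spend_2020 = 33.3e9   # compute spend figure quoted earlier
overprovisioned_share = 0.40  # assumption: 40% of instances ~ 40% of spend

# Assumption: every overprovisioned instance drops exactly one size.
overprovisioning_waste = (
    compute_spend_2020 * overprovisioned_share * downsizing_savings(1)
)

print(f"One size down saves {downsizing_savings(1):.0%}")
print(f"Two sizes down saves {downsizing_savings(2):.0%}")
print(f"Estimated overprovisioning waste: ${overprovisioning_waste / 1e9:.2f} billion")
```

Under those assumptions the result is roughly $6.7 billion. Added to the idle-resource estimate, that appears to be in line with the $17.6 billion total mentioned earlier.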
M.R.: What are your thoughts on how organizations can reduce cloud waste?
Jay: The easiest way to reduce cloud waste is by simply turning off non-production environments when they are not being used. Companies are eliminating wasted cloud spending through things like scheduling, rightsizing, and optimization. A few more ways include:
When you turn on resources in non-production environments, turn on the minimum size needed to get the job done and only grudgingly move up to the next size.
Clean up old volumes, snapshots and machine images.
Use AWS Savings Plans, Azure Reserved Instances, and Google Committed Use Discounts for your production environments, but manage them closely so that they actually match what your users are provisioning. Otherwise, you could end up paying twice.
Investigate Spot fleets for your production batch workloads that run at night. It could save you a bundle.
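The first and simplest of these ideas, turning non-production resources off outside working hours, can be sketched as a small scheduling rule. This is only an illustration under stated assumptions: the 9-to-5, Monday-to-Friday window, the environment labels, and the function names are all hypothetical, and it does not describe how any particular product or cloud provider implements scheduling.

```python
from datetime import datetime, time, timedelta

# Hypothetical schedule: Monday to Friday, 09:00 to 17:00 local time.
WORKDAYS = {0, 1, 2, 3, 4}  # datetime.weekday(): Monday is 0
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def should_be_running(now: datetime, environment: str) -> bool:
    """Decide whether a resource should be on at `now`.

    Production stays on around the clock; anything else follows
    the working-hours schedule above.
    """
    if environment == "production":
        return True
    return now.weekday() in WORKDAYS and WORK_START <= now.time() < WORK_END


def scheduled_hours_per_week(environment: str) -> int:
    """Count the hours a resource would run in one week under the schedule."""
    monday = datetime(2020, 4, 6)  # an arbitrary Monday at midnight
    return sum(
        should_be_running(monday + timedelta(hours=h), environment)
        for h in range(7 * 24)
    )


print(scheduled_hours_per_week("development"))  # hours on per week
print(scheduled_hours_per_week("production"))
```

With this schedule a development resource runs 40 of the 168 hours in a week, which corresponds to the 40-hour work week discussed earlier. A real scheduler would also need to handle time zones, holidays, and manual overrides.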