An Eagle’s Eye on the Cloud: Early-Warning Systems for Cloud Outages

Too often, companies learn too late that their site is crawling to a halt or that their provider is experiencing an outage. By then, the damage is done and the cleanup can be nasty. Customers might leave your site for a competitor or disparage your name on social media. Your IT department may spend precious hours restoring the site and developing workarounds. Sensitive data might be gone, forever.
Despite the enterprise-level features and services that major providers such as Amazon and Rackspace offer, there are no guarantees in the cloud, and the snowball effect of an outage can be daunting. In a shared, public cloud infrastructure, when performance begins to suffer, hundreds or thousands of clients will attempt to move their applications off the problematic provider at once, creating a bottleneck. It's like a run on the bank: everyone rushes for the exit, and no one can get their money.
If your business is running critical, customer-facing applications in the cloud, a few minutes of downtime could cost thousands of dollars and harm your reputation in a very public way. Depending upon your industry, products or brand, you may not be able to afford for applications or sites to be offline for even a few minutes.
Take the Obama campaign's expansive data-mining project, which was hosted entirely on AWS. The 180 terabytes of data housed in various systems helped the President and his massive army of staffers and volunteers keep their finger on the pulse of campaign needs up to the final hour of voting. One of the project's key functions was to run real-time calculations across these large and diverse data sets to determine where and how the campaign should focus its grassroots efforts. Imagine the disaster if those systems had gone down during the last few days before the election.
So, is it possible to build an effective early-warning system that avoids both the stampede of panic and the potential loss of revenue, reputation and critical intelligence? Yes, although warning systems differ from company to company, depending on the applications and their specific performance and networking requirements. There's no silver bullet for constructing such a warning system; you'll need to draw upon several methods and tools.
Naturally, application issues can stem from problems at the cloud provider, but not always. The cloud service might be running perfectly for everyone else and failing only for your particular application. Ideally, your company would combine monitoring and alert data from the cloud provider with data you collect using your own systems or third-party services.
Breaking down the early-warning system
Early-warning solutions fall into three main categories, and mission-critical applications in the cloud need all three:

Code instrumentation solutions that provide insight into the performance of application transactions by instrumenting the application code
Systems monitoring solutions that monitor the cloud instance for system metrics such as CPU and memory
Cloud infrastructure monitoring solutions that provide insight into how the underlying cloud/hosting infrastructure is affecting your application performance

These solutions should offer early-warning signs that application performance or the behavior of the underlying cloud infrastructure has changed, along with information about what has changed.
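To make the first category concrete, here is a minimal sketch of code instrumentation using only the Python standard library. It is not any vendor's agent or API; the names `timed`, `timings` and `is_slow`, and the threshold values, are assumptions chosen for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: transaction name -> list of durations (seconds).
timings = defaultdict(list)

@contextmanager
def timed(name):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append(time.perf_counter() - start)

def is_slow(name, threshold_s):
    """True if the most recent sample for `name` exceeded the threshold."""
    samples = timings[name]
    return bool(samples) and samples[-1] > threshold_s

# Wrap a transaction at one tier of the application.
with timed("db.query"):
    time.sleep(0.01)  # stand-in for a real database call
```

A commercial monitoring library would typically ship these samples to a central service rather than keep them in memory, but the idea of wrapping each transaction is the same.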
For example, you may see a fourfold increase in packet retransmissions in the underlying cloud network between your application and the database tier. IT needs to determine whether the problem was introduced internally, for example by someone pushing a service configuration change, or whether it originates with the cloud provider or the ISP.
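A check like the retransmission example can be sketched as a comparison against a recent baseline. The packet counts, the function names and the default factor of 4 are illustrative assumptions, not values from any particular monitoring tool.

```python
from statistics import mean

def retransmit_rate(retransmitted, sent):
    """Fraction of sent packets that were retransmitted."""
    return retransmitted / sent if sent else 0.0

def exceeds_baseline(baseline_rates, current_rate, factor=4.0):
    """True if the current rate is at least `factor` times the baseline mean."""
    if not baseline_rates:
        return False
    baseline = mean(baseline_rates)
    return baseline > 0 and current_rate >= factor * baseline

# Illustrative samples: roughly 0.5% retransmissions during normal operation.
baseline = [retransmit_rate(r, 10_000) for r in (48, 52, 50)]
current = retransmit_rate(210, 10_000)  # 2.1% in the latest interval
alert = exceeds_baseline(baseline, current)
```

A check of this kind only says that something changed; deciding whether the cause is internal or external still requires the change information described next.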
The following measurements and records can deliver this information through email alerts, a dashboard, or both:
1. Application transaction performance: This metric measures the performance of an application transaction, for example a database call, at each tier in the application. It is collected by adding a monitoring vendor's code or libraries to your application source code and is tracked through monitoring tools.
2. Instance systems monitoring: This covers key performance metrics of the instance itself, such as CPU and memory. Typically, the cloud provider supplies tools that collect and deliver this data, such as Amazon CloudWatch.
3. Cloud infrastructure monitoring: This measures the underlying health of the cloud infrastructure and how it is affecting your application. It cannot be done manually; you must use a monitoring tool.
4. Change log: This is a real-time list of everything that has changed, from application deployments (code changes) to configuration changes (to a service or to the underlying infrastructure), alerts on abnormal behavior, notices from the cloud provider, and so on. It can be maintained manually, although for large enterprises that's not realistic. IT will need a monitoring tool or log management solution.
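To show how a change log helps separate internal causes from provider-side ones, the sketch below lists the changes recorded shortly before an alert. The `Change` record, the `source` labels, the 30-minute window and the sample entries are all hypothetical; a real log management tool would supply its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Change:
    when: datetime
    source: str        # "internal" or "provider" (labels are assumptions)
    description: str

def changes_before(log, alert_time, window=timedelta(minutes=30)):
    """Return changes in the `window` leading up to the alert, newest first."""
    recent = [c for c in log if alert_time - window <= c.when <= alert_time]
    return sorted(recent, key=lambda c: c.when, reverse=True)

# Made-up entries for illustration only.
log = [
    Change(datetime(2013, 1, 15, 9, 0), "internal", "deployed app build"),
    Change(datetime(2013, 1, 15, 10, 50), "internal", "changed DB pool config"),
    Change(datetime(2013, 1, 15, 10, 55), "provider", "network maintenance notice"),
]
suspects = changes_before(log, datetime(2013, 1, 15, 11, 0))
```

Here `suspects` holds the provider notice and the configuration change, in that order, which gives IT a short list to investigate before looking further afield.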
If your company is running apps in the cloud, you are by default giving up some measure of control in exchange for scalability, flexibility and quick access to on-demand resources. However, a multifaceted early-warning system, along with a solid dose of proactive vigilance, enables the IT department to act early on impending problems. Just like an insurance policy, this strategy protects against revenue and reputation loss and is critical for success in the cloud today.
Gary Read, CEO and president of Boundary, previously served as CEO of Nimsoft, providers of the award-winning cloud monitoring solution, where he grew the business from zero to over $100 million in bookings and 300 people. As CEO, Gary guided Nimsoft to a successful acquisition by CA for $350 million. Prior to Nimsoft, Gary held executive positions at BMC Software, Riversoft, and Boole and Babbage.
