There has been a rash of shocking, reputation-damaging headlines exposing a potential lack of testing in very high-specification tech infrastructures. NASDAQ’s recent technology “glitch” meant that investors were unable to view their stock quotes for around six minutes, while Goldman Sachs missed a multibillion-dollar order at an auction of Treasury bills as a result of another “glitch in a computer system.”
Incidents at both organizations were cited in a recent Standard & Poor’s report, which warned that issues like these could put their ratings at risk. The report highlights the real and lasting damage that can result from technology failures. The cost goes beyond reputation, too. If glitches like these accelerate sinking market confidence, borrowing costs could rise, leaving exchanges at a significant disadvantage.
The media and the organizations involved have been very quick to label such issues technical “glitches,” but the administrative leave rumored to have been given to some Goldman Sachs employees involved in the errors points directly to the real issue. While it can be covered by the catch-all term “glitch,” this type of incident is most often the result of human error, not unreliable technology.
The errors typically begin in the data-collection phase or in the testing and operation of the system itself. In our experience, working with major businesses around the world, the common element in almost all incidents like this is an overlooked need for risk assessment and evaluation.
Use the “check engine” light
Automation can help companies resolve and avoid a wide range of errors often associated with repetitive business processes and tasks left to manual effort. As a result, automation is often thought of as something of a panacea — keeping an eye on the problems the IT team doesn’t have time to worry about or manage. To an extent, that’s true. But, as with all technology, automation is only as good as the information underpinning it. This is where things can go awry.
There is a tendency for IT teams working with very high-specification and sophisticated environments to rely on their systems to fail over automatically in the event of a problem. However, when planning for such events, the IT department needs to identify at what point, and how, human intervention should happen. Notification is a critical element in using automation wisely.
One thing these recent incidents had in common is that the point of human intervention had been neither identified nor planned for. In each case this resulted in catastrophic failure, as the technology didn’t behave as expected when the parameters around it changed. To plan beyond the assumption that the technology will do what is needed in all situations, the IT department needs to identify all likely scenarios and preemptively test against them. It also needs to build in automated notification.
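To make the idea of a planned point of human intervention concrete, here is a minimal sketch in Python. It assumes a hypothetical `FailoverPlan` class, a `notify` callback and a `try_failover` function, none of which come from any of the systems discussed here. Automation gets a fixed number of attempts, people are told when it starts, and the incident is explicitly handed to a human once those attempts run out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverPlan:
    """Hypothetical plan: try automated recovery, then escalate to a person."""

    max_auto_attempts: int
    notify: Callable[[str], None]  # assumed hook: email, pager, chat, etc.

    def handle(self, incident: str, try_failover: Callable[[], bool]) -> str:
        # Tell someone as soon as automation kicks in, not only when it fails.
        self.notify(f"{incident}: automated failover started")
        for attempt in range(1, self.max_auto_attempts + 1):
            if try_failover():
                self.notify(f"{incident}: recovered on attempt {attempt}")
                return "recovered"
        # This is the planned point of human intervention.
        self.notify(f"{incident}: automation exhausted, human action required")
        return "escalated"
```

The design choice worth noting is that the escalation path is written down and testable before anything goes wrong, rather than discovered during an outage.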
In situations such as the one that recently affected NASDAQ, the cause is often as simple as an unexpectedly high volume of transactions taking place simultaneously across the network. Because the network had always coped under normal volumes, it is possible that no one had anticipated or, more importantly, tested what the increased loads would do to performance. Furthermore, no parameters had been set to alert the team that something was going wrong. That left the IT department essentially driving a car without a “check engine” light. The first time the team knew anything was amiss was when the car slowed to a stop.
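A “check engine” light can be as simple as a pair of thresholds on a load metric. The sketch below is illustrative only: the function name `check_engine_light`, the transactions-per-second metric and the threshold values are assumptions, not details of any exchange’s monitoring. It shows a warning level that fires well before the level at which the system is expected to struggle.

```python
from typing import Callable, Iterable, List


def check_engine_light(
    samples: Iterable[int],
    warn_at: int,
    critical_at: int,
    notify: Callable[[str], None],
) -> List[str]:
    """Raise an alert for every load sample that crosses a threshold."""
    alerts = []
    for index, value in enumerate(samples):
        if value >= critical_at:
            level = "CRITICAL"
        elif value >= warn_at:
            level = "WARNING"
        else:
            continue  # normal load, nothing to report
        message = f"{level}: sample {index} at {value} transactions/sec"
        notify(message)  # assumed hook into whatever alerting channel exists
        alerts.append(message)
    return alerts
```

In practice the thresholds would come from load testing, which is exactly the step the incidents above suggest was skipped.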
Murphy’s Law says that “anything that can go wrong will.” To a certain extent, this is the adage that should inform all business technology planning. No matter how a business operates, the IT team should always examine and assess where automation can add value. They should use automation to run processes and automatically notify the right people if an incident occurs.
Automated processes and notification can be used to prevent unplanned failures at both the infrastructure and application layers. Automation shouldn’t be used in spite of potential errors. In fact, it can (and should) help you and your team find, diagnose and correct them before they become a real problem. At the application level, this can speed recovery from any issue. Nevertheless, it requires consistent testing to prevent a failure from happening in a real-world scenario.
No pain, no gain
Netflix offers a shining example of this type of approach. It uses an automated problem-creating application called “Chaos Monkey” in its network. This application frequently causes failures by seeking out Auto Scaling Groups (ASGs) and terminating instances (virtual machines) in each group. It does this during certain days and hours when the IT team can respond to the outages.
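The core idea can be sketched in a few lines of Python. This is not Netflix’s implementation or the Chaos Monkey API. The function names, the weekday 09:00 to 15:00 window and the plain dictionary of groups are all assumptions made for illustration. The sketch picks one instance per group to terminate, and only during hours when someone is assumed to be available to respond.

```python
import random
from datetime import datetime
from typing import Dict, List, Optional


def in_staffed_window(now: datetime) -> bool:
    # Assumed window: Monday to Friday, from 09:00 up to (not including) 15:00.
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victims(
    groups: Dict[str, List[str]],
    now: datetime,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Choose one instance per non-empty group to terminate, if staff are on hand."""
    if not in_staffed_window(now):
        return {}  # never cause failures when nobody can respond
    rng = rng or random.Random()
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances
    }
```

A real tool would then call the infrastructure provider to terminate the chosen instances. That step is deliberately left out here because it depends entirely on the environment.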
It may seem ridiculous for the organization to cause problems in its own systems, but this way the team can preemptively solve them, taking action against their own imitation outages before a real one sneaks up on them. By forcing ad-hoc failures across the system, the IT department builds in resilience against outages without risking performance or reputation by falling victim to a failure that results in a true business loss. This kind of testing is an exercise in pushing limits for top performance. It also helps protect against issues that come from real-world complexity, which can build up from small problems.
There is rarely one clear reason for a failure of the magnitude of the recent NASDAQ or Goldman Sachs incidents. Instead, situations like this are more often the result of a collection of smaller issues that build the perfect, unforeseen — but completely preventable — storm. To ensure that businesses make the most of the open road ahead, they need to examine the worst-case scenario as the key driver for business improvement. Let automation sweat the small stuff and inform humans to take the wheel when necessary. Keeping everything running is a collaboration between automated processes and human planning, intelligence and insight.
Jeff Rauscher, director of solutions design for Redwood Software, has more than 31 years of diversified MIS/IT experience working with a wide variety of technologies including SAP, HP, IBM and many others. He has worked in operations management, data center relocation, hardware planning, installation and de-installation, production control management, quality assurance and customer support.