Reliability: How to Stop Your Business From Coming to a Standstill

by Richard Raistrick, July/August 2011

Technology, for everything it brings us, can let us down. The more we rely on it, the more vulnerable we are to its occasional failings, in both our business and our personal lives. But while components can fail, there is plenty that can be done to minimize the impact on the business. CHP Consulting’s Richard Raistrick advises how businesses can set a gold standard for their commercial websites.

As a business grows, so do its systems, supporting infrastructure and processes, and, most importantly, its dependency on them. Some organizations make use of the latest technologies and thinking to optimize the uptime of key systems and mitigate the impact of any failure. They analyze the key risks, then implement the systems and processes that strike a suitable balance between cost and system availability. Others run aging, complex and fragile systems, where concerns over reliability and resilience can be key factors in the decision to renew technology platforms.

Today’s business is 24/7: it is multilingual, it spans continents and crosses time zones. Growth — be it through success or acquisition — brings increased volumes, longer working days, and new and expanded systems to support these demands. But alongside these advantages, growth brings new potential points of failure, and the resulting business costs are often underestimated.

How much business disruption is acceptable? How expensive is it, and how can we measure this? What benefits does technology bring us? Forward-thinking businesses need to use the latest technologies and thinking to ensure that disruption is kept to a minimum.

Sometimes, Things Fail
There aren’t many among us who can say they’ve never been stopped in their tracks by an Outlook server going down, a file server crash, a laptop hard drive failure, an office-wide power outage or even a simple nosedive in network performance. Seeing your server room flooded is less common, but these things happen too, just as malware and viruses are a regular threat and, if you’re in the public eye, so are denial-of-service attacks.

A well-designed system recognizes that at some point components will fail. Think about the business impact of how your systems cope with the unexpected. How well does your system cope if bad data is entered, a remote server’s name has changed or a core database is not responding? Does it simply go down and stay down until someone discovers the problem, or does it have built-in resilience and recover gracefully? Is data left in a workable state, with key individuals notified automatically that the problem needs to be fixed? Are end-users given meaningful information about what’s happening, or an impenetrable string of technical jargon?
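
To make the idea concrete, here is a minimal Python sketch of graceful degradation. The hooks (fetch, notify_ops) and the fallback to yesterday’s rates are hypothetical, but the pattern is the one the questions above describe: retry, alert the people who can fix it, and give end-users a meaningful message rather than a stack trace.

```python
import time

def fetch_exchange_rates(fetch, notify_ops, retries=3, delay=2):
    """Try a flaky remote call a few times before giving up gracefully."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()  # e.g. a call to a remote rates service
        except ConnectionError as exc:
            if attempt == retries:
                # Alert the people who can fix it, with the technical detail...
                notify_ops(f"Rates service unreachable after {retries} attempts: {exc}")
                # ...but give end-users something meaningful, not a stack trace.
                raise RuntimeError(
                    "Exchange rates are temporarily unavailable; "
                    "yesterday's rates will be used until the service recovers."
                ) from exc
            time.sleep(delay * attempt)  # simple backoff before the next attempt
```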

For some companies, even a poorly thought-out configuration change or a missing currency exchange rate can mean a lost day. So in addition to robust and resilient systems, thorough change management processes are required to ensure that human factors do not undermine reliability.

How Bad Can It Get?
In 1999, eBay’s main website went down for 22 hours, causing its stock price to drop from $182 to $135 and wiping $5.7 billion off its market capitalization. Business dependency today is such that outages like this are unthinkable, yet even in 2008 a failure at Amazon took its site down for two hours, with the losses estimated at $31,000 for every minute.

Clearly, failure like this can be costly, and in so competitive a marketplace the damage is rarely temporary: if customers can’t use your site, they’ll use the competition’s, and quite possibly stay there.

For internal business outages, not only can work in progress be lost, but while information is unavailable your team cannot do any work at all. There is a knock-on effect on your partners, customers and suppliers, and whoever relies on them is affected too.

How Modern Systems Mitigate Downtime
Businesses that are smart about reliability look to use technology to minimize the risk of downtime. They ensure there are plans to guarantee reliability wherever there are potential points of failure. They achieve reliability and resilience through techniques such as clustering: running at least two servers for each requirement (firewalls, DNS, database, application and web servers), so that if one fails — and it will one day fail — another is there ready to carry on the work.
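
As an illustration of the principle only, the failover logic can be sketched in a few lines of Python. The two host names below are hypothetical stand-ins for a clustered pair, and in practice this logic usually lives in a load balancer or database driver rather than in application code.

```python
import socket

# Two servers fulfil the same role; if the primary is down, use the standby.
CANDIDATES = ["db-primary.example.com", "db-standby.example.com"]  # hypothetical

def connect_with_failover(port=5432, timeout=2.0):
    """Return a connection to the first healthy server in the cluster."""
    last_error = None
    for host in CANDIDATES:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:  # refused, timed out, name not resolving...
            last_error = exc    # note the failure and try the next server
    raise ConnectionError(f"No server in the cluster is reachable: {last_error}")
```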

At times, however, one or more components will fail in a way that requires a more substantial restoration. Where once a restore time of six to eight hours would have been viewed by the business as reasonable, today half an hour can be too long. Disaster recovery, which covers both taking backups and performing restorations, is a complex process, and many companies running legacy applications still depend on restoration processes put in place years ago.

Retrieving and restoring backup tapes stored safely off-site can take hours or even days, and setting up replacements for failed hardware can be just as time-consuming. Businesses that understand the importance of their key systems design their disaster recovery to let them switch to backup hardware and systems almost instantaneously. They run virtualized server farms that let them create new server environments in moments. And if they run a “hot” backup, a replicated database running in Houston can become the primary information source for the whole organization if the original in Chicago fails.
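
Reduced to its essentials, that kind of hot switchover is a monitoring loop. In the Python sketch below, primary_healthy and promote_replica are hypothetical hooks standing in for whatever health checks and promotion commands your database layer actually provides.

```python
import time

def monitor_and_fail_over(primary_healthy, promote_replica, check_interval=5):
    """Poll the primary; if it stops responding, promote the hot standby."""
    while True:
        if not primary_healthy():   # e.g. a ping against the Chicago database
            promote_replica()       # the Houston replica becomes the primary
            return                  # monitoring now belongs to the new primary
        time.sleep(check_interval)  # wait before the next health check
```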

Another option that many companies are taking more seriously is moving systems to the cloud. There are potentially significant cost savings, as well as considerable improvements in reliability.

Striking a Balance
The key to deciding how reliable your business should be is knowing how much downtime it can tolerate. Most companies accept some downtime, because guaranteeing 100% uptime is prohibitively expensive. But how quickly can your key systems switch to backup infrastructure in the event of a failure, and are your processes tested and robust? Find out today what the main risks are and what the recovery lead times on your most critical systems would be. This will help you judge whether the current situation is acceptable.

It’s important not to address systems in isolation. Map out the dependencies in place across your systems landscape — software, hardware, systems, people — then make sure your recovery processes are tested fully and regularly; it’s too easy to assume that everything will happen as promised. Netflix runs a tool (its Chaos Monkey) that deliberately brings down a random part of its production infrastructure to see what the impact is, to ensure the whole system can return to normal service, and to make sure everyone is on the ball. If someone walked into your server room and unplugged something at random, how confident are you that the business would cope? How soon would it get back to normal? Would it recover at all?
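
In the same spirit, a toy chaos test can be a few lines of Python. The sketch below is illustrative only: the service names and the systemctl-based stop command are assumptions, not Netflix’s actual tooling.

```python
import random
import subprocess

SERVICES = ["web-frontend", "order-queue", "report-worker"]  # hypothetical services

def unplug_something_at_random():
    """Stop one service at random, then watch how well everything else copes."""
    victim = random.choice(SERVICES)
    print(f"Chaos test: stopping {victim}. Does the business still function?")
    subprocess.run(["systemctl", "stop", victim], check=True)
    return victim  # so the test harness knows what to restart afterwards
```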

Set targets, monitor failure rates and business downtime, then schedule regular reviews to ensure you are meeting your objectives. The gold standard for the top commercial websites is 99.999% uptime (or 5.26 minutes of downtime per year), and the few that achieve this do so only through very careful design and hard work. Everyone should aspire to match the best.
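
The arithmetic behind that figure is easy to check for whatever target you set yourself; a quick Python sketch:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # about 525,960 minutes

def downtime_budget(availability):
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

print(f"{downtime_budget(0.99999):.2f} minutes")  # 'five nines': 5.26 minutes
print(f"{downtime_budget(0.999):.0f} minutes")    # 'three nines': 526 minutes
```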


Richard Raistrick is global marketing director at CHP Consulting. Along with his marketing brief, Raistrick is responsible for project delivery for some of CHP’s largest clients. He has carried out consultancy and project management engagements around the globe and has worked in the asset finance sector for 15 years. Raistrick holds a Master’s degree in Electrical and Information Sciences from Cambridge University. He worked as an IT consultant at Capgemini before joining CHP. For more information on CHP Consulting, visit www.chpconsulting.com.
