The latest research from Vertiv reminds me of “The Old Man in the Cave” episode of the Twilight Zone. We experience a data center outage, recover from it, and ask the old man in the cave for advice on how to make it never happen again. But then we repeat the cycle.
Vertiv has been working with The Ponemon Institute regularly to look at data center outages. The recently released report, Data Center Downtime at the Core and the Edge: A Survey of Frequency, Duration and Attitudes, paints a discouraging picture, at least to those of us who understand that downtime is largely avoidable when data centers adhere to best practices. But it also reveals an opportunity. We just need to seize it.
Current Levels of Data Center Downtime Are Unacceptable
Unfortunately, the Ponemon Institute study makes it clear that even today, when the data center is more important than ever, a failure to follow best practices leads to unacceptable instances of downtime. The core data centers surveyed in the new research experienced, on average, 2.4 total facility shutdowns per year and an additional 10 downtime events isolated to specific racks or servers.
Not only are data center outages still happening too often (any outage at all is one too many), but total outages are also lasting longer. The average duration of a total outage in a core data center rose to 138 minutes, an increase of eight minutes over the previous study. At the edge, total facility shutdowns were even more frequent, although those outages lasted less than half as long as the ones in core data centers.
Simply put: Downtime events are still happening, and they are lasting longer than at any point in the last 10 years. With the advances in infrastructure technologies and architectures we’ve seen over that time, these results are unacceptable.
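To put the core figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes the survey averages can simply be multiplied (2.4 total shutdowns per year at 138 minutes each), which holds only if frequency and duration are independent. It also ignores the 10 rack- or server-level events, since their durations are not given here. The function names are illustrative and not taken from the report.

```python
# Rough estimate only: multiplies the survey's average outage frequency by
# its average outage duration, assuming the two are independent.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(outages_per_year, avg_minutes_per_outage):
    """Expected minutes of total-facility downtime in one year."""
    return outages_per_year * avg_minutes_per_outage


def availability_percent(downtime_minutes):
    """Share of the year the facility is up, as a percentage."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_YEAR)


# Core data center averages quoted from the survey.
core_downtime = annual_downtime_minutes(2.4, 138)
core_availability = availability_percent(core_downtime)

print(f"Estimated total-outage downtime: {core_downtime:.1f} minutes per year")
print(f"Implied availability: {core_availability:.3f}%")
```

Under those assumptions, the average core facility loses roughly five and a half hours a year to total shutdowns alone, before counting any rack- or server-level events.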
Interestingly, the Ponemon Institute study showed that only 46% of core data centers reported using best practices in design and redundancy to maximize availability, compared with 62% of edge locations. More companies felt cost constraints increased the risk of downtime at the core (69%) than at the edge (62%). Edge data centers did lag core facilities, however, in having ample resources to get facilities up and running after an unplanned outage.
The Hardening of the Edge Data Center
These numbers, though they paint a challenging picture overall, do reflect what we are seeing among our customers: the hardening of edge data centers.
Edge locations, by nature, are remote and often without IT expertise or personnel on site — at the bottom of a cell tower, in a cornfield, in a train station, or on the roof of a car park.
Downtime at such a site means someone has to travel there, which is just one example of the resources that aren’t readily available to recover from a downtime event.
The scarcity of talent and the growing responsibilities of already time-strapped IT staff are another reason hardening at the edge is becoming a trend.
Customer satisfaction is another element. Edge data centers often enable end-user connectivity, which is nearly as expected an amenity as breathable air. That’s likely why we are seeing higher adoption of best practices at the edge, and why cost-cutting measures pose less risk there than at the core.
Heeding the Advice of the Old Man in the Cave
Deploying redundant data center infrastructure was a strategy cited almost equally at the core (50%) and edge (47%). Though the survey respondents don’t suggest a lot of redundancy additions in response to equipment downtime, we at Vertiv are seeing an increasing use of higher redundancy architectures at the edge. In fact, redundancy is a best practice we suggest in our accompanying piece, Reversing the Trend of Rising Data Center Downtime.
Data center infrastructure monitoring and management systems are another impactful strategy we talk about in that companion piece, though this is a best practice that remains relatively low on respondents’ lists: only 38% at the core and 36% at the edge referenced it. Yet monitoring and management are even more important at the edge, since many sites are not manned.
Remote IT management systems, such as serial consoles and KVMs, can be heroes, especially at the edge. Approximately 80% of IT equipment failures are software or firmware related. In these cases, engineers using remote access tools can typically resolve the situation quickly and remotely to minimize the duration of downtime events.
But when you look at the Ponemon Institute survey, and actions that can be taken to prevent data center outages in the future, the number one tactic cited at both the core and the edge was investment in IT equipment. This is likely due to the high frequency of downtime events related to individual servers.
Simply repairing the problem and moving on is like putting a Band-Aid on a wound: applying it doesn’t mean the wound will heal. Unless the root cause is studied and comprehensive solutions are implemented, including procedures, training, and measures to harden infrastructure, we’re going to see this trend of downtime with increasing durations continue, and that’s even more detrimental in our digitalized world. We just have to heed the advice of the old man in the cave. Start by reading Data Center Downtime at the Core and the Edge: A Survey of Frequency, Duration and Attitudes for the full results and Reversing the Trend of Rising Data Center Downtime for strategies to prevent downtime in your facilities.