As much as I’d love to write about Negan and that cliffhanger, this post is not about Rick, Daryl, Michonne and The Walking Dead. No, we are not about to suggest you roam the halls of your data center with a Colt Python, crossbow, or samurai sword. However, you may want to bring a similar zeal and your best survival skills to the task of identifying and eliminating zombie servers within your data center. We know from Dr. Koomey’s report that there may be a lot of those zombies hiding within your facility.
Let’s start by defining what we mean by zombie, comatose, orphaned, idle or underutilized servers. Each of these categories of devices has a similar negative impact on data center productivity and total energy consumption. They also occupy valuable rack space while consuming critical power and cooling resources. But with a smart power distribution system working in concert with a data center infrastructure management (DCIM) application that tracks IT assets and monitors and controls the power distribution unit (PDU) at the individual socket level, we can find and eliminate those zombies. Our weapon? A rules- and policy-based DCIM application that reduces total energy consumption within the data center and frees up valuable space, conditioned power, and cooling in the process.
Zombie servers come in many forms. However we define them, they all consume energy and eat up valuable space, power and cooling. Some may warrant rehabilitation, but the question you have to ask is this: If they idled this long without impact, isn’t it time to pull the plug?
The Big List of Energy-Sucking Zombie Servers
|Category|DCIM Status|Network Status|Ownership Status|Load/Use Case|
|---|---|---|---|---|
|Zombie|In DCIM*|Not in DNS|Owner Unknown|No Load|
|Orphan|In DCIM*|In DNS|No Owner|No Load|
|Abandoned|In DCIM*|In DNS|Owner|No Load|
|Underutilized|In DCIM*|In DNS|Owner|Low Use|
|Ghost|Not in DCIM|Not in DNS|Owner Unknown|Unknown|
*Giving your DCIM the benefit of the doubt
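To make the categories above concrete, here is a minimal Python sketch of how the table could be turned into a lookup rule. The `ServerRecord` fields and the `classify` function are hypothetical names invented for illustration, not part of any DCIM product. The sketch also assumes that “Owner Unknown” and “No Owner” can both be represented as a missing owner, which is a simplification of the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerRecord:
    """Hypothetical view of one server, assembled from DCIM, DNS and PDU data."""
    in_dcim: bool
    in_dns: bool
    owner: Optional[str]  # None covers both "Owner Unknown" and "No Owner"
    load: str             # one of "none", "low", "normal", "unknown"


def classify(server: ServerRecord) -> str:
    """Map a server record onto the categories in the table above."""
    # Ghost: nobody has a record of it anywhere.
    if not server.in_dcim and not server.in_dns:
        return "Ghost"
    # Zombie: tracked in DCIM but missing from DNS, no known owner, no load.
    if server.in_dcim and not server.in_dns:
        if server.owner is None and server.load == "none":
            return "Zombie"
        return "Unclassified"
    # The remaining table rows are all in DCIM and in DNS.
    if server.in_dcim and server.in_dns:
        if server.load == "none":
            return "Orphan" if server.owner is None else "Abandoned"
        if server.load == "low" and server.owner is not None:
            return "Underutilized"
    # Anything else (for example, a healthy server) is outside the table.
    return "Unclassified"
```

For example, `classify(ServerRecord(True, True, None, "none"))` returns `"Orphan"`. Combinations the table does not list fall through to `"Unclassified"` rather than being forced into a category.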
As you can see, we’re looking at a long list of zombies and similar data center monsters. Incredibly, in the average enterprise data center, computer room, and network closet, as many as 30 percent of installed servers may well fall into these categories.
Developing a rules- and policy-based automatic zombie killer is a fairly simple task, certainly on a go-forward basis. It is well worth the investment of time and resources, whether as an update to the existing suite or as part of a comprehensive DCIM rollout. It all starts with understanding a few key points of your server’s power consumption profile: sleep, idle, power-saving mode (OEM application), partial load, normal mode, and, just in case, peak. Once the server is deployed within the data center, we validate this information against the intelligent PDU’s socket-level power meter. Establish a degree of certainty (meter accuracy plus a small safety margin) and begin real-time data collection via the PDU and DCIM interface.
Before we get too crazy, allow the servers to normalize and settle into “production.” Once they’re up and running normally, we collect a little more data, then begin to develop our rules and policies. One important rule will be the definition of “idle” for this specific class of server and application. Your data may show the server’s idle current draw to be 2.5 amps @ 208V AC (assuming a UPS with tight voltage regulation). Your testing and real-world data indicate “low-load” as 2.9 amps and “normal” mode as above 3.3 amps. You would then set the DCIM monitor function to automatically start a clock when it sees this device drop to 2.5 amps and to reset that clock whenever the server exceeds 2.8 amps; the gap between the two thresholds allows for accuracy, drift, uncertainty, etc. (For more on the impacts of idle servers on your data center’s efficiency, read our recent post calling for a new idle performance standard.)
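The start-and-reset rule just described is a simple form of hysteresis, and a short Python sketch may make it easier to picture. This is an illustration only: the `IdleClock` class and its method names are hypothetical, and a real DCIM tool would express the same rule through its own configuration. The 2.5 amp and 2.8 amp thresholds are the example figures from the paragraph above.

```python
from datetime import datetime
from typing import Optional

IDLE_AMPS = 2.5   # start the clock at or below this reading (example value)
RESET_AMPS = 2.8  # reset the clock above this reading (example value)


class IdleClock:
    """Tracks how long one PDU socket has stayed at idle current draw."""

    def __init__(self, idle_amps: float = IDLE_AMPS,
                 reset_amps: float = RESET_AMPS) -> None:
        self.idle_amps = idle_amps
        self.reset_amps = reset_amps
        self.idle_since: Optional[datetime] = None

    def update(self, amps: float, now: datetime) -> None:
        """Feed in one socket-level current reading taken at time `now`."""
        if amps > self.reset_amps:
            # Real work detected: clear the clock.
            self.idle_since = None
        elif amps <= self.idle_amps and self.idle_since is None:
            # First idle reading: start the clock.
            self.idle_since = now
        # Readings between the two thresholds leave the clock unchanged,
        # which absorbs meter accuracy and drift.

    def idle_days(self, now: datetime) -> int:
        """Whole days of continuous idle time as of `now` (0 if not idle)."""
        if self.idle_since is None:
            return 0
        return (now - self.idle_since).days
```

A reading of 2.7 amps arriving ten days after the clock started would leave the clock running, while a single reading of 2.9 amps would reset it to zero.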
We now have a time stamp and running clock for an idle server. Koomey’s report suggests a zombie or comatose server has no network demands or executable actions for well over six months, but there’s no reason to wait that long to take action. We would recommend setting additional rules within our DCIM to flag and report idle server status, with 30-day and 60-day notices sent to local IT administration and the identified owner. If no owner is identified, the search begins in earnest.
Upon reaching 90 days of continuous idle performance, the server is reported to appropriate IT administrators with the understanding that upon reaching 120-day inactive/idle status, the facilities manager and/or IT manager may flip the switch to turn the server off. Upon reaching 150 days of idle, the DCIM system will send a notice advising of impending shutdown. At 180 days, the DCIM tool will shut down any asset that remains in idle mode.
Your use rules and shutdown policies will vary. I suggest 90 days is too long, with perhaps the exception of an initial install, in which case it may be time to institute tighter approval policies for acquiring and deploying new IT hardware and software. Consider a 30/60/90-day program with the DCIM having total autonomy to shut off any server that remains continuously idle for 90 days.
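Both schedules, the 30-through-180-day escalation and the tighter 30/60/90-day program, can be written as plain data with one small lookup function. The sketch below is one hedged way to do that in Python. The names `EXTENDED_POLICY`, `TIGHT_POLICY` and `current_stage` are hypothetical, and the action strings paraphrase this post rather than any DCIM vendor’s terminology. The notice wording at 30 and 60 days in the tight policy is an assumption carried over from the longer schedule.

```python
from typing import List, Optional, Tuple

Policy = List[Tuple[int, str]]  # (days of continuous idle, action), ascending

# The longer schedule: notices at 30 and 60 days, shutdown at 180 days.
EXTENDED_POLICY: Policy = [
    (30, "send notice to local IT administration and owner"),
    (60, "send second notice to local IT administration and owner"),
    (90, "report server to IT administrators"),
    (120, "facilities or IT manager may switch the server off"),
    (150, "send notice of impending shutdown"),
    (180, "DCIM shuts down the server"),
]

# The tighter 30/60/90-day program, with autonomous shutdown at 90 days.
TIGHT_POLICY: Policy = [
    (30, "send notice to local IT administration and owner"),
    (60, "send second notice to local IT administration and owner"),
    (90, "DCIM shuts down the server"),
]


def current_stage(idle_days: int, policy: Policy) -> Optional[str]:
    """Return the action for the highest threshold reached, or None."""
    stage = None
    for threshold, action in policy:
        if idle_days >= threshold:
            stage = action
    return stage
```

Keeping the schedule as data means that tightening the policy is a matter of editing a list, not rewriting logic. A server idle for 95 days is at the “report” stage under `EXTENDED_POLICY` but has already been shut down under `TIGHT_POLICY`.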
We now have the basic resources to automate zombie server identification, reporting, and shutdown. All that remains is to remove the server, perform a thorough data wipe, and return it to the vendor (or an authorized third party) for responsible disposal. In the meantime, we have cleared up much-needed data center space along with mission-critical power and cooling to make room for new, more powerful servers capable of running the company’s vital workloads. Now you can effectively automate a key component of Energy Logic 2.0.
One parting thought: Cybersecurity should stay top-of-mind. Please consider the implications associated with network-connected infrastructure as bad actors lurk on the internet. You may want to consider an out-of-band network for your mission-critical infrastructure. Talk with your DCIM security expert about moving beyond firewalls.