Helping Our Data Center Managers Sleep at Night

Data center managers have lots of reasons not to sleep at night. They are responsible for critical I.T. operations and must keep them running through all types of catastrophic events: mechanical and electrical system failures, fires, storms, and other external threats. The requirement for continual data center operation should read almost like the postman’s creed: “Neither snow nor rain nor heat nor gloom of night.”

Stepping into the gap for the data center manager, to keep the facility up and operational, is the field of data center monitoring. Data center monitoring can be defined as the process of gathering real-time data on the systems associated with the data center in order to ascertain its operational state. This is to be differentiated from “DCIM” (data center infrastructure management), which is associated with a best-practices process for change control built on the philosophy of “Model, then build”. There seems to be some confusion in this regard, since many DCIM tools incorporate real-time monitoring features and many monitoring systems refer to themselves as “DCIM” systems.

The ultimate goal of data center monitoring is not only peace of mind, but the confidence of knowing the operational status of the data center so that services can be maintained for the corporation, institution or organization. Data center managers also want some advance warning when things are starting to go haywire. It sounds fairly simple, and I would venture to say that EVERY data center makes use of some form of remote monitoring. Data center monitoring systems can also assist with preventative maintenance, energy efficiency and, if properly understood and implemented, optimization of the data center operation.

That said, one of the biggest problems with data center monitoring systems is the contention between two apparently contradictory streams of thought. The first is that there isn’t nearly enough information being monitored. It is a fallacy to believe that if you have pulled a cable to every major piece of equipment in a mission-critical infrastructure, then you have all the information you need to be sure things are “good”. That simply isn’t true. Equipment manufacturers do their best to report the operational status of their devices over the network interface card, but there are many failures that are not positively sensed. This is the problem of a false sense of security: if I have no alarms on my monitoring system, then I must be OK. It would be great if that were the case!
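To make the “no alarms means OK” trap concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `DeviceStatus` record, its fields, and the 120-second silence limit are all hypothetical and do not come from any particular monitoring product. The idea is that a device which has stopped reporting is treated as “UNKNOWN” rather than “OK”, even though it shows zero alarms.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    """Hypothetical snapshot of one monitored device (illustrative only)."""
    name: str
    active_alarms: int
    seconds_since_last_report: float


def classify(device: DeviceStatus, max_silence_seconds: float = 120.0) -> str:
    """Return "UNKNOWN", "ALARM", or "OK" for a device.

    The silence check comes first: a device that has gone quiet cannot
    raise alarms, so its alarm count of zero tells us nothing.
    The 120-second default is an assumed value, not a recommendation.
    """
    if device.seconds_since_last_report > max_silence_seconds:
        return "UNKNOWN"
    if device.active_alarms > 0:
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    for dev in (
        DeviceStatus("UPS-A", active_alarms=0, seconds_since_last_report=30.0),
        DeviceStatus("Chiller-1", active_alarms=0, seconds_since_last_report=900.0),
    ):
        print(dev.name, classify(dev))
```

A staleness check like this only catches one kind of unsensed failure, a device that goes silent. It does nothing for failures the equipment itself never detects.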

The second stream of thought is that there is way too much information being monitored. If a monitoring system is connected to all the major site infrastructure components (e.g. generators, UPS, service entrance switchgear, chillers, air conditioners, etc.), the system may have access to very detailed diagnostic information that is primarily intended for service technicians troubleshooting failures, and is of limited use to a data center manager who needs to know whether there is a real issue to deal with. Attempting to monitor and alarm on equipment diagnostic information can result in alarm overload and eventual desensitization toward alarms over time. Essentially, it’s the monitoring system that cried “wolf.”
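One common way to keep diagnostic detail without crying “wolf” is to separate events that need a person’s attention from events that are simply recorded for the service technician. The Python sketch below is a hedged illustration of that split. The `Event` record and its `actionable` flag are hypothetical; in practice someone has to decide, point by point, which equipment signals deserve that flag.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """Hypothetical event from a piece of site infrastructure."""
    source: str
    message: str
    actionable: bool  # assumed flag: True if the manager must respond


def split_events(events: List[Event]) -> Tuple[List[Event], List[Event]]:
    """Split events into (notify, log_only).

    Actionable events are sent to the data center manager as alarms.
    Everything else is kept in a log for technicians, so no diagnostic
    information is discarded, it just does not page anyone.
    """
    notify = [e for e in events if e.actionable]
    log_only = [e for e in events if not e.actionable]
    return notify, log_only


if __name__ == "__main__":
    sample = [
        Event("UPS-A", "On battery", actionable=True),
        Event("UPS-A", "Fan speed adjusted", actionable=False),
        Event("Generator-1", "Self-test completed", actionable=False),
    ]
    notify, log_only = split_events(sample)
    print(len(notify), "alarm(s),", len(log_only), "logged event(s)")
```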

Too much information, and not enough.

 

Meet the Author:

Nate Clyde, PE

Nate has worked since 1995 in the design and specification of high-reliability HVAC systems for support of critical data processing applications. He joined Parallel Technologies in January 2012 as a Senior Mechanical Engineer after serving as Vice President of an IT services company for over 16 years. He moved into his Director role here at Parallel in January 2013. Nate has an enthusiastic passion for what he does and a talent for communicating its complexities to clients.