How to get better metrics for data center optimization

  

Silos between IT operations, security operations, and facilities have long made data center availability a problem for IT operations. Enterprises must close these gaps to make more accurate and comprehensive decisions, especially about data center optimization.

The draft Data Center Optimization Initiative (DCOI) policy released in November 2018 proposed new metrics for measuring optimization efforts at US federal data centers, including new metrics for data center availability. If made mandatory, the DCOI availability metrics may bring new challenges for the US government. Although the availability of data center facilities can be measured with a single metric, that approach has proved very inaccurate, and it may actually stifle agencies' ability to predict and resolve the problems that threaten the availability of data centers and of any interdependencies critical to an agency's mission.

This is why U.S. federal agencies can benefit from measuring sub-metrics that represent the health, availability, and risk of data centers and their infrastructure. This business service approach to data center optimization, which dynamically groups components by geographic location, application type, or technology stack, enables agencies to predict and resolve problems faster and so better ensure availability.

A business service structure collects metrics on the health, availability, and risk of the IT components underlying the service, along with a dynamic, real-time map of the infrastructure and applications that support it. This gives IT managers a real-time operational view that helps them identify root causes and isolate the impact on the service. Individual devices and IT services can be abstracted and "bubbled up" into composite metrics that represent the overall state of the business service. Exposing the sub-metrics as well, however, lets the executive or management view of a business service give a much deeper understanding of the data center's overall availability status.
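The roll-up of component metrics into a composite view can be illustrated with a minimal Python sketch. The component groups, their values, and the roll-up policy (average health, weakest-group availability, highest-group risk) are all assumptions chosen for illustration; the article does not prescribe a formula, and an IT operations team would define its own.

```python
from statistics import mean

# Hypothetical sub-metrics per component group, each in the range 0.0-1.0.
components = {
    "web":      {"health": 1.00, "availability": 1.0, "risk": 0.00},
    "database": {"health": 0.75, "availability": 1.0, "risk": 0.25},
    "network":  {"health": 0.50, "availability": 1.0, "risk": 0.50},
}


def roll_up(groups):
    """Combine per-group sub-metrics into one service-level view.

    Assumed policy (not from the article): health is averaged, the service
    is only as available as its weakest group, and risk follows the
    riskiest group.
    """
    return {
        "health": mean(g["health"] for g in groups.values()),
        "availability": min(g["availability"] for g in groups.values()),
        "risk": max(g["risk"] for g in groups.values()),
    }


service = roll_up(components)
print(service)
```

Keeping the per-group values alongside the composite is what lets a management view show both the headline number and the group that is dragging it down.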

Suppose an agency has four identical servers, any one of which can carry the entire workload. The other three servers are essentially backups that can take over if one of the others fails. In this example, if one server fails, the service is still 100% available. However, the health of the system drops to 75%, and risk therefore rises to 25%. These metrics matter because they remove barriers that keep executives from monitoring business services. Previously, a data center administrator might simply receive an alert that a server's CPU utilization had crossed a certain threshold. With more granular metrics, a utilization alert can automatically trigger the addition of one or two servers to support more traffic, and business service policies can be adjusted automatically to recalculate health, availability, and risk without human intervention. Redundancy and self-healing should be built into every layer of the data center.
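The four-server arithmetic can be reproduced with a short Python sketch. The class name, field names, and formulas are assumptions made for illustration: health is taken as the share of deployed servers running normally, risk as its complement, and availability as the share of required capacity that is still online, capped at 100%.

```python
from dataclasses import dataclass


@dataclass
class ServicePool:
    """A pool of identical servers behind one business service (hypothetical model)."""
    total_servers: int      # servers deployed
    required_servers: int   # servers needed to carry the full workload
    healthy_servers: int    # servers currently running normally

    def health(self) -> float:
        """Fraction of deployed servers that are running normally."""
        return self.healthy_servers / self.total_servers

    def risk(self) -> float:
        """Assumed here to be the complement of health."""
        return 1.0 - self.health()

    def availability(self) -> float:
        """Fraction of required capacity that is online, capped at 100%."""
        return min(1.0, self.healthy_servers / self.required_servers)


# Four servers, one is enough for the workload, one has failed.
pool = ServicePool(total_servers=4, required_servers=1, healthy_servers=3)
print(f"availability={pool.availability():.0%} "
      f"health={pool.health():.0%} risk={pool.risk():.0%}")
# availability=100% health=75% risk=25%
```

Under these assumed definitions, availability only starts to fall once fewer healthy servers remain than the workload requires, while health and risk move with every failure, which is why they give earlier warning.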

For data center optimization, health, availability, and risk have no one-size-fits-all definitions. IT operations teams can define them, and create automation and event policies, as needed. As software-defined services, artificial intelligence, machine learning, and advanced analytics enter the data center, IT operations teams will have more ways to gain operational insight, understand the interdependencies between infrastructure and applications, and automate manual tasks to improve efficiency. Topology mapping between business processes and the systems that run them can facilitate automation, including remediation, configuration management database (CMDB) enrichment, and advanced event enrichment, thereby reducing the effort spent on management, maintenance, and troubleshooting.