Tips for monitoring OpenStack

If you have worked with cloud platforms before, you will be familiar with the distributed and decoupled nature of these systems. A decoupled, distributed system relies on microservices to perform specific tasks, and each microservice exposes its own REST (Representational State Transfer) API. These microservices usually communicate with each other through a lightweight messaging layer, in the form of message middleware such as RabbitMQ or Qpid.

This is exactly how OpenStack works. Each major OpenStack component (Keystone, Glance, Cinder, Neutron, Nova, etc.) exposes REST endpoints, and components and subcomponents communicate through message middleware such as RabbitMQ. The advantage of this approach is that it allows faults to be isolated to specific components. It also allows cloud infrastructure operators to scale every service horizontally and distribute load intelligently.
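As a quick illustration of this layout, the minimal sketch below lists the REST services that each component registers in the Keystone catalog. It assumes the openstacksdk package and a clouds.yaml entry named "mycloud"; both are assumptions for the example rather than part of any particular deployment.

    import openstack

    # Connect using a clouds.yaml profile; "mycloud" is a placeholder name.
    conn = openstack.connect(cloud="mycloud")

    # Every major component registers its REST API as a service in Keystone.
    for service in conn.identity.services():
        print(f"{service.name:<12} {service.type}")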

However, beneficial as this distributed, decoupled design is, it also brings an inherent challenge: how to correctly monitor the OpenStack services and, more specifically, how to identify potential single points of failure.

The following sections focus on the real challenges involved in monitoring OpenStack services, along with possible solutions to each problem.

Challenge 1: The system is not a monolith

OpenStack's non-monolithic, decoupled architecture is usually emphasized as one of its main advantages, and it certainly is an important one. However, it clearly complicates any attempt to monitor overall service status. In a distributed system where each component performs a specific task, every component is further split into multiple subcomponents, so it is easy to see how hard it becomes to determine the impact on the service when one specific piece of software fails.

The first step in overcoming this difficulty is to map out the cloud. You need to determine the relationships between all the major components, and then determine how the failure of each individual service can affect the overall service. Simply put, you need to know how every component in the cloud relates to the others.

With this in mind, you not only need to monitor the status of each individual component (running or failed), but also determine how other services are affected when it fails.

For example, if Keystone crashes, no one can retrieve the service catalog or log in to any service, but this usually does not affect virtual machines or other cloud resources that are already running (object storage, block storage, load balancers, etc.), unless those services are restarted while Keystone is still down. However, if Apache fails, Keystone and the other API services that run through Apache may be affected.

Therefore, the monitoring platform or solution must not only be able to assess the status of each service, but also correlate service failures so that it can determine the real impact on the system as a whole and send alerts or notifications accordingly.
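One simple way to capture these relationships is a dependency map that the monitoring logic can walk whenever a component fails. The sketch below is a minimal Python example; the component names and the shape of the map are illustrative assumptions, not a complete model of any particular deployment.

    # Map each component to the services that depend on it (illustrative only).
    DEPENDENTS = {
        "apache2":  ["keystone", "horizon"],          # APIs served through Apache/WSGI
        "keystone": ["nova", "glance", "cinder", "neutron"],
        "rabbitmq": ["nova", "cinder", "neutron"],    # RPC between components
        "libvirtd": ["nova-compute"],
    }

    def impacted_services(failed, seen=None):
        """Return every service transitively affected when 'failed' goes down."""
        seen = set() if seen is None else seen
        for dependent in DEPENDENTS.get(failed, []):
            if dependent not in seen:
                seen.add(dependent)
                impacted_services(dependent, seen)
        return seen

    # Example: correlate an Apache failure with its downstream impact.
    print("apache2 down ->", sorted(impacted_services("apache2")))

An alerting rule can then use a map like this to mark alerts from dependent services as symptoms of the original failure instead of raising them as separate incidents.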

Challenge 2: OpenStack is not just OpenStack

An OpenStack-based cloud is not only a distributed, decoupled system; it is also an orchestration solution that creates resources on the operating system and on other cloud infrastructure or related devices. These resources include virtual machines (Xen, KVM, or other hypervisor software), persistent volumes (NFS storage servers, Ceph clusters, SAN-based LVM volumes, or other storage back ends), network entities (ports, bridges, networks, routers, load balancers, firewalls, VPNs, etc.), ephemeral disks (Qcow2 files residing in an operating system directory), and many other small systems.

Therefore, the monitoring solution must take these underlying components into account. Although these resources may be less complex and less prone to failure, when they do stop working, the logs of the main OpenStack services can obscure the real cause: they only show the effect in the affected OpenStack service, not the actual root cause in the failed device or operating system software.

For example, if libvirt fails, Nova cannot deploy virtual instances. The nova-compute service will still start and run as a service, but instances will fail during the deployment phase (instance status: error). To detect this, you need to monitor libvirt (service status, metrics, and logs) in addition to the nova-compute logs.
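As a sketch of that idea, the check below asks systemd about both the hypervisor daemon and the Nova agent. The unit names (libvirtd, nova-compute) vary between distributions, so treat them as assumptions to adapt to your deployment.

    import subprocess

    def unit_active(unit):
        """Return True when systemd reports the unit as active."""
        return subprocess.run(
            ["systemctl", "is-active", "--quiet", unit]).returncode == 0

    # Unit names are distribution-dependent (e.g. openstack-nova-compute on RDO).
    for unit in ("libvirtd", "nova-compute"):
        state = "active" if unit_active(unit) else "NOT active"
        print(f"{unit}: {state}")

An alert on libvirtd alone then explains instance build errors even while nova-compute itself still reports as a running service.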

Therefore, you need to trace the relationships between the underlying software and the main components, monitor every link in the chain, and consider consistency tests for all of the end services. You need to monitor everything: storage, network, the hypervisor layer, each individual component, and the relationships between them.

Challenge 3: Think outside the box

Cacti, Nagios, and Zabbix are good examples of open-source monitoring solutions. They define a set of very specific metrics for identifying possible problems at the operating-system level, but they do not provide the specific metrics needed to detect more complex failure conditions, or even overall service status.

This is where you need to be creative. You can implement custom metrics and tests that define whether a service is healthy, degraded, or completely failed.

In a distributed system such as OpenStack, every core service exposes a REST API and connects to TCP-based messaging services, which makes it vulnerable to network bottlenecks, connection-pool exhaustion, and other related problems. Many of these services also connect to SQL-based databases, which can exhaust their maximum connection pools. This means the monitoring solution needs proper connection-state metrics (established, time-wait, closed, etc.) to detect connection-related problems affecting the APIs. In addition, CLI tests can be built that check the status of an endpoint and measure its response time, which can be turned into a metric that actually reflects the true state of the service.
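For example, the sketch below counts the TCP connection states seen on an API port using "ss". The port number (5000, the conventional Keystone port) is an assumption; point it at whichever endpoint you actually expose.

    import subprocess
    from collections import Counter

    def connection_states(port):
        """Count the TCP states reported by ss for connections on this port."""
        output = subprocess.run(
            ["ss", "-tan", f"( sport = :{port} or dport = :{port} )"],
            capture_output=True, text=True, check=True).stdout
        states = Counter()
        for line in output.splitlines()[1:]:   # skip the header row
            columns = line.split()
            if columns:
                states[columns[0]] += 1        # first column is the TCP state
        return states

    # 5000 is the conventional Keystone port; adjust for the API being watched.
    for state, count in sorted(connection_states(5000).items()):
        print(f"{state}: {count}")

A sudden spike in time-wait connections, or a plateau at the configured connection limit, then becomes an actionable signal rather than an invisible cause of API errors.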

Each of the monitoring solutions mentioned above, and most other commercial or open-source solutions, can be extended with custom metrics like these.

A command such as "time openstack catalog list" can measure the response time of the Keystone API, evaluate the result, and raise a synthetic failure state when the result does not meet expectations. You can also use simple operating-system tools such as "netstat" or "ss" to monitor the different connection states of the API endpoints and spot possible problems in a service. The same can be done for critical dependencies of the OpenStack cloud, such as the message brokers and database services. Note that a failure of the message middleware will essentially "kill" the OpenStack cloud.
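Putting that idea into a reusable check, the sketch below times "openstack catalog list" and converts the latency into a Nagios-style exit code. The warning and critical thresholds are assumptions to tune, and the command relies on the usual OS_* authentication environment (or a clouds.yaml entry) being available.

    import subprocess
    import sys
    import time

    WARN_SECONDS = 2.0   # assumed thresholds; tune them for your deployment
    CRIT_SECONDS = 5.0

    def check_catalog_latency():
        start = time.monotonic()
        result = subprocess.run(
            ["openstack", "catalog", "list", "-f", "value"],
            capture_output=True, text=True)
        elapsed = time.monotonic() - start

        if result.returncode != 0:
            print(f"CRITICAL: catalog list failed after {elapsed:.2f}s")
            return 2
        if elapsed > CRIT_SECONDS:
            print(f"CRITICAL: catalog list took {elapsed:.2f}s")
            return 2
        if elapsed > WARN_SECONDS:
            print(f"WARNING: catalog list took {elapsed:.2f}s")
            return 1
        print(f"OK: catalog list took {elapsed:.2f}s")
        return 0

    if __name__ == "__main__":
        sys.exit(check_catalog_latency())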

The key is not to be lazy! Instead of relying only on the default metrics, use metrics that are relevant to your own services.

Challenge 4: Human Factors

Human factors are everything. As the saying goes, a bad workman blames his tools.

Without tested incident-response procedures, a single failure is not only a problem in itself; it will also cause further problems. Every incident in the cloud infrastructure, and the alerts it raises in your monitoring solution, should be clearly documented, with clear steps explaining how to detect, contain, and resolve the problem.

Human factors must be considered even if you have a smart system (one with a certain degree of artificial intelligence) that can detect incidents, correlate events, and suggest appropriate solutions. Remember that if what goes into that system is incorrect or incomplete, its output will be inaccurate or incomplete as well.

To sum up, monitoring OpenStack is not necessarily difficult, but it does require thoroughness. Each individual service, and its interactions with the other services, needs to be carefully monitored. You can even implement the custom metrics yourself. With a little TLC, you can monitor your OpenStack cloud easily and successfully.