High availability

"High Availability" (HA) usually describes a system that has been specially designed to reduce downtime and keep its services available for as high a proportion of time as possible.
Chinese name: High availability
Foreign name: High Availability
Measure: Mean time between failures
Abbreviation: HA
Purpose: Reduce downtime
Role: Maintain high availability of system services

Introduction

High availability of computers
For non-repairable systems, the mean life of the system is its average working (or storage) time before failure, also called the mean time to failure and recorded as MTTF (Mean Time To Failure). For repairable systems, the life of the system refers to the working time between two adjacent failures, rather than the time until the whole system is scrapped; its mean value is the mean time between failures, recorded as MTBF (Mean Time Between Failures). For a repairable product, the average time from failure to completed repair is recorded as MTTR (Mean Time To Repair); the shorter the MTTR, the better the recoverability.
Availability (also known as effectiveness) refers to the ability of a repairable product to perform or maintain its function when used under specified conditions. Its quantitative parameter, also called availability and usually recorded as A, represents the probability that the product has or maintains its function at a given time under those conditions. It can be calculated from the mean time between failures (MTBF) and the mean time to repair (MTTR): A = MTBF / (MTBF + MTTR).
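As a minimal sketch, the relation A = MTBF / (MTBF + MTTR) can be checked numerically; the figures below are illustrative, not taken from the source.

```python
# Steady-state availability from mean time between failures (MTBF)
# and mean time to repair (MTTR). The example figures are illustrative.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A server that fails every 1000 hours on average and takes
# 2 hours to repair:
a = availability(1000.0, 2.0)
print(f"Availability: {a:.4%}")  # -> Availability: 99.8004%
```

Note that shortening MTTR raises availability just as effectively as lengthening MTBF, which is why fast recovery is treated as a first-class HA measure later in this article.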
High availability of load balancing servers
To mask the failure of a load balancing server, a backup machine must be set up. Both the primary server and the backup run a High Availability monitoring program that checks the peer's health by exchanging messages such as "I am alive". When the backup fails to receive such a message within a set time, it takes over the service IP of the primary server and continues to provide service. When the backup later receives "I am alive" messages from the primary again, it releases the service IP address and the primary resumes cluster management. To ensure the system keeps working after a primary failure, the configuration of the load balancing cluster is synchronized and backed up between the primary and backup machines so that the two systems remain essentially consistent.
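The heartbeat logic above can be sketched as follows. This is an illustrative model only; the class and method names (`BackupMonitor`, `on_heartbeat`, `check`) are hypothetical, and real implementations also handle the network-level takeover of the service IP.

```python
import time

# Minimal sketch of the primary/backup heartbeat described above.
# The backup takes over the service IP when no "I am alive" message
# arrives within the timeout, and releases it when messages resume.

class BackupMonitor:
    def __init__(self, timeout_s: float = 5.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic()
        self.holds_service_ip = False

    def on_heartbeat(self):
        """The primary reported 'I am alive'."""
        self.last_heartbeat = time.monotonic()
        if self.holds_service_ip:
            # Primary is back: release the service IP so it can
            # resume cluster management.
            self.holds_service_ip = False

    def check(self) -> bool:
        """Called periodically; take over if the primary is silent."""
        silent_for = time.monotonic() - self.last_heartbeat
        if silent_for > self.timeout_s and not self.holds_service_ip:
            self.holds_service_ip = True  # take over the service IP
        return self.holds_service_ip
```

In practice the timeout must be chosen carefully: too short and transient network delays trigger spurious failover; too long and clients see an extended outage.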
Fault-tolerant backup operation process of HA
In the automatic detection (Auto Detect) phase, software on each host probes the other over redundant detection lines, using layered monitoring procedures and logical judgment to check the peer's operation. The items checked include host hardware (CPU and peripherals), the host network, the host operating system, the database engine and other applications, and the connection between the host and the disk array. To ensure correct detection and prevent misjudgment, safety parameters such as the detection interval and the number of detection attempts can be set to adjust the safety factor, and the redundant communication links between hosts record the collected messages for maintenance reference.
In the automatic switching (Auto Switch) phase, once a host confirms that its peer has failed, the healthy host not only continues its original tasks but also takes over the preset backup operation procedures according to the configured fault-tolerant backup mode, and carries out the subsequent processing and services.
In the automatic recovery (Auto Recovery) phase, after the healthy host has taken over for the faulty one, the faulty host can be repaired offline. Once repaired, it is reconnected to the healthy host over the redundant communication line, and operation automatically switches back to the repaired host. The whole recovery process is completed automatically by EDI-HA; depending on the pre-configuration, failback can instead be semi-automatic or disabled.
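The three phases above (detect, switch, recover) can be summarized as a small state machine. This is a hypothetical sketch; the class and phase names are illustrative, not part of EDI-HA.

```python
from enum import Enum, auto

# Hypothetical sketch of the three HA phases described above:
# automatic detection, automatic switching, automatic recovery.

class Phase(Enum):
    MONITORING = auto()   # auto-detect: peers probe each other
    TAKEN_OVER = auto()   # auto-switch: healthy host runs both workloads
    RECOVERED = auto()    # auto-recovery: repaired host resumes its services

class HAPair:
    def __init__(self):
        self.phase = Phase.MONITORING

    def peer_failed(self):
        """Detection confirmed a peer fault: take over its procedures."""
        if self.phase is Phase.MONITORING:
            self.phase = Phase.TAKEN_OVER

    def peer_repaired(self, auto_failback: bool = True):
        """Failback after repair; it may be automatic, semi-automatic,
        or disabled, depending on pre-configuration."""
        if self.phase is Phase.TAKEN_OVER and auto_failback:
            self.phase = Phase.RECOVERED
```

The `auto_failback` flag mirrors the article's note that recovery can be configured as automatic, semi-automatic, or "no recovery".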

Operation modes

(1) Master-slave mode (asymmetric mode)
Working principle: the primary host runs the services while the standby machine monitors in readiness. When the primary goes down, the standby takes over all of its work; after the primary returns to normal, services are switched back to it automatically or manually, according to the user's settings. Data consistency is handled by a shared storage system.
(2) Dual-duplex mode (mutual backup and mutual assistance)
Working principle: two hosts run their own services simultaneously and monitor each other. When either host goes down, the other immediately takes over all of its work so that service continues in real time. The key data of the application service systems is kept in a shared storage system.
(3) Cluster mode (multi-server mutual backup mode)
Working principle: multiple hosts work together, each running one or more services, with one or more standby hosts defined for each service. When a host fails, the services running on it are taken over by other hosts.
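The cluster mode's per-service standby lists can be sketched as a simple mapping; all host and service names below are hypothetical.

```python
# Illustrative sketch of the cluster (multi-server mutual backup) mode:
# each service has an ordered list of standby hosts, and on failure the
# first healthy standby takes over. All names are hypothetical.

standbys = {
    "web":  ["host2", "host3"],
    "db":   ["host3"],
    "mail": ["host1"],
}
running_on = {"web": "host1", "db": "host2", "mail": "host3"}

def fail_host(host: str, healthy: set):
    """Reassign every service running on a failed host to its first
    healthy standby."""
    healthy.discard(host)
    for svc, owner in running_on.items():
        if owner == host:
            for candidate in standbys[svc]:
                if candidate in healthy:
                    running_on[svc] = candidate
                    break

healthy = {"host1", "host2", "host3"}
fail_host("host1", healthy)
print(running_on["web"])  # -> host2
```

Ordering the standby list per service is one simple way to express takeover priority; real cluster managers also weigh current load and resource constraints when choosing the takeover host.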

Measures

The availability calculation formula:
%availability = (Total Elapsed Time − Sum of Inoperative Times) / Total Elapsed Time
Elapsed time is operating time plus downtime.
Availability is related to the failure rate of system components. One indicator of a component's failure rate is the mean time between failures, MTBF. This indicator is usually applied to system components, such as disks.
MTBF = Total Operating Time / Total Number of Failures
Operating time is the time the system is in use (excluding downtime).
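The two formulas above can be applied to operating data as follows; the one-year figures are illustrative only.

```python
# Computing the two measures above from illustrative operating data.
# Total elapsed time = operating time + downtime.

def percent_availability(total_elapsed_h: float, inoperative_h: float) -> float:
    """%availability = (elapsed - downtime) / elapsed * 100."""
    return (total_elapsed_h - inoperative_h) / total_elapsed_h * 100

def mtbf(total_operating_h: float, failures: int) -> float:
    """MTBF = total operating time / total number of failures."""
    return total_operating_h / failures

# One year of operation (8760 h) with 6 h of downtime across 3 failures:
print(round(percent_availability(8760, 6), 4))  # -> 99.9315
print(mtbf(8760 - 6, 3))                        # -> 2918.0
```

Note that MTBF is computed over operating time only, which is why the 6 downtime hours are subtracted before dividing by the failure count.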

System design

To design for system availability, the most important thing is to meet users' needs: a failure affects the availability metric only when the service interruption it causes is severe enough to affect users' requirements, and users' sensitivity depends on the applications the system provides. For example, a failure repaired within 1 second [1] may go unnoticed by an online transaction system, but is unacceptable for a real-time scientific computing application.
The high availability design of a system depends on your application. For example, if a few hours of planned downtime is acceptable, it may be unnecessary to design the storage system with hot-pluggable disks; conversely, you may need a disk system that is hot-swappable and mirrored.
So when designing a high availability system, consider:
Determine the acceptable duration of business interruptions. From the formula for the HA metric, you can derive how long the system may be down over a given period; but note that many short interruptions may be tolerable while even a few long ones are not.
Statistics show that not all unplanned downtime is caused by hardware problems: hardware accounts for only about 40%, software 30%, human factors 20%, and environmental factors 10%. Your high availability design should take all of these factors into account as far as possible.
When a business interruption does occur, recovering as quickly as possible is itself a means of maintaining availability. [2]
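The first consideration above, deriving an allowed downtime budget from an availability target, works out as follows; the targets shown are common illustrative values, not figures from the source.

```python
# Translating an availability target into a yearly downtime budget.
# The target percentages below are illustrative examples.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525600

def downtime_budget_min(availability_pct: float) -> float:
    """Allowed downtime per year, in minutes, for a given target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {round(downtime_budget_min(target), 1)} min/year")
```

This makes the article's point concrete: a 99.9% target allows roughly 525 minutes per year, but whether that budget is spent as one nine-hour outage or hundreds of two-second blips matters enormously to users.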

Cluster system

Building a cluster on a UNIX system is also a very effective way to create a high availability computer system, and it is common practice in the industry: a group of host systems is organically combined through a network or other means to jointly provide external services. A cluster system uses high availability software to combine redundant hardware components and software components, eliminating single points of failure:
Eliminate single points of power-supply failure
Eliminate single points of disk failure
Eliminate single points of SPU (System Processing Unit) failure
Eliminate single points of network failure
Eliminate single points of software failure
Eliminate, as far as possible, single points of failure during single-system operation