The most serious service outage in Didi's history: is the culprit the underlying software, or "cutting costs and increasing laughter"?

Source: contribution
2023-11-29 11:14:00

On the evening of November 27, 2023, Didi's app malfunctioned because of a system failure: locations were not displayed and rides could not be hailed. That same evening, Didi Chuxing responded with an apology, attributing the problem to a system failure.

On the morning of November 28, 2023, Didi Chuxing announced that online car hailing and other services had been restored, and that bike sharing and other services were being repaired one after another. After Didi issued the announcement, a reporter tried to hail cars with the Didi app in Shanghai, Shenzhen and other cities and found that the car-hailing function had not been restored: the network failed to load and rides still could not be booked. Later on November 28, Didi told reporters that the online car hailing service had been restored, along with the rights and interests of drivers and passengers.

On November 29, Didi apologized again, saying its preliminary finding was that the incident had been caused by a failure of the underlying system software.

Source: https://weibo.com/2838754010/NuMAAaUEl

Before Didi officially released this announcement, a senior IT technician had offered this analysis: "Judging by the symptoms, ride hailing and bike sharing went down together. Different business segments should be isolated from one another, so this indicates that the problem lies in the underlying infrastructure. Attackers can normally reach only the application layer, not the infrastructure. Either an attacker broke through to it, or a careless operation brought the system down. Even the former would be a system defect, since it means the system could be breached."

Security experts at 360 believe there are six possible technical reasons behind Didi's sudden crash:

First, programming errors, logic errors or unhandled exceptions introduced during a system update or upgrade. Internet companies generally release updates at night, which matches the time of Didi's failure. Of course, business upgrades and maintenance can involve large-scale updates, but the fact that Didi's entire platform and all of its businesses failed indicates that the problem must lie in its own "home".

Second, server failure. For example, the temperature and humidity controls in Didi's core data center may have failed, causing servers to overheat and CPUs to burn out, or a natural disaster such as an earthquake, flood or tsunami may have struck the data center's location. In that case the hardware would need to be replaced and the service software on it reconfigured, so recovery would take relatively long. This possibility is relatively small.

Third, third-party service failure. Didi's back-end architecture may use third-party services or components, and a problem at a third party could affect Didi's normal operation. However, for security reasons Didi is unlikely to entrust its core business to a third party, so this possibility is also small.

Fourth, DDoS attacks. Hackers use distributed denial of service to tie up large amounts of server resources so that users cannot get access. This is unlikely, because a DDoS attack does not cause data errors, and a company of Didi's size has the budget and capacity to withstand one.

Fifth, other network attacks. Some black-market and gray-market groups may steal data by dumping databases and then sell it on the dark web. It cannot be ruled out that a mistake during this process damaged the database.

Sixth, ransomware. Attackers may have encrypted Didi's underlying data and business code. According to reports, users' bills and ride data were miscalculated, so Didi may have suspended its business on its own initiative to avoid greater losses. Ransomware attacks have been frequent recently. Earlier this month, a financial institution was hit by a ransomware attack that forced it to suspend business.

However, an expert at a network security company believes that if this had been an external attack, the company would generally have said so right away. He speculated that Didi had made a major internal business adjustment, or had connected new businesses to the original system, without a contingency plan, leading to major failures in the related businesses or systems. This is the most common cause of system failures at large companies.

As for Didi's large-scale, prolonged outage, some industry insiders believe that "cost reduction and efficiency increase" may also be one of the causes.

One insider believes that frequent, prolonged downtime in the core business of Internet companies is one of the side effects of cost reduction and efficiency increase: less investment in systems, fewer maintenance resources, frequent turnover of programmers, and more bugs.

He gave an example. When a business is growing, capacity is generally provisioned with redundancy. To handle orders that can surge at any time, the everyday load ceiling is kept from getting too high during the growth phase, at around 70%, for instance. That leaves enough headroom to absorb small peaks without worry. The logic in a downturn is different: even when load is very high, the approach is simply to tough it out first. Small peaks later may be painful, but the overall load will decline over time.
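The headroom argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not a description of Didi's capacity planning: the function names and the capacity figures are hypothetical, and only the 70% ceiling comes from the insider's example.

```python
def spike_headroom(utilization_ceiling: float) -> float:
    """Return the fractional demand increase that still fits within capacity.

    A ceiling of 0.70 means normal load uses 70% of capacity, so demand can
    grow by 1/0.70 - 1 (about 43%) before capacity is exhausted.
    """
    if not 0 < utilization_ceiling <= 1:
        raise ValueError("utilization_ceiling must be in (0, 1]")
    return 1 / utilization_ceiling - 1


def can_absorb(baseline_load: float, spike_fraction: float, capacity: float) -> bool:
    """Check whether a spike on top of the baseline load stays within capacity."""
    return baseline_load * (1 + spike_fraction) <= capacity


# Hypothetical numbers: capacity of 1000 units of load.
growth_phase = can_absorb(baseline_load=700, spike_fraction=0.30, capacity=1000)
downturn_phase = can_absorb(baseline_load=950, spike_fraction=0.10, capacity=1000)

print(f"headroom at a 70% ceiling: {spike_headroom(0.70):.1%}")
print(f"growth phase absorbs a 30% spike: {growth_phase}")
print(f"downturn phase absorbs a 10% spike: {downturn_phase}")
```

Under these assumed numbers, the system running at 70% rides out a 30% spike, while the one running at 95% is pushed past capacity by a 10% spike.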


Finally, let's look at what is circulating online. Some industry peers say that Didi's serious failure was caused by an upgrade of its k8s (Kubernetes) version, and that SRE engineers spent three hours trying to locate the problem without success.


According to Didi's public technical sharing, Didi Elastic Cloud upgraded its k8s version last month, from k8s 1.12 to 1.20.



Source: Scheduling Practice of Didi Elastic Cloud Based on K8S
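To give a sense of how wide the gap between k8s 1.12 and 1.20 is, here is a minimal sketch that lists the minor releases in between. It assumes a hop-by-hop upgrade, which is what the kubeadm upgrade documentation calls for (skipping minor versions is described there as unsupported). How Didi actually performed its upgrade is not stated in this article, and the function name is hypothetical.

```python
from typing import List


def minor_upgrade_path(start: str, target: str) -> List[str]:
    """List each minor release to pass through when upgrading one hop at a time.

    Versions are given as "MAJOR.MINOR" strings, for example "1.12".
    """
    start_major, start_minor = (int(part) for part in start.split("."))
    target_major, target_minor = (int(part) for part in target.split("."))
    if start_major != target_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if target_minor < start_minor:
        raise ValueError("target must not be older than start")
    return [f"{start_major}.{minor}" for minor in range(start_minor + 1, target_minor + 1)]


path = minor_upgrade_path("1.12", "1.20")
print(f"{len(path)} minor-version hops: {' -> '.join(path)}")
```

For the versions cited above, this yields eight minor-version hops, from 1.13 through 1.20.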
