Bilibili went down twice, Tencent's "3.29" Level 1 incident... a roundup of 2023's top ten outage "ghost scenes"

Source: OSCHINA
Edit: game
2023-12-31 12:02:00

Famous scenes? More like horror scenes!

Here are the "ghost scenes" of the top ten outages of 2023:


Bilibili (Station B) went down twice

At around 20:20 on March 5, 2023, many users reported that Bilibili's video detail pages would not load on either phones or computers, and that favorites and watch history could not be viewed on mobile. Some users said the home page loaded normally but displayed all text in traditional Chinese.

On the evening of August 4, five months after the previous incident, many users reported that Bilibili's images (video covers) failed to load, videos would not open, and playback was stuck buffering.

 

Tencent's "3.29" Level 1 incident

In the early morning of March 29, 2023, Tencent's WeChat and QQ services went down. Many functions were unavailable, including WeChat voice calls, Moments, and WeChat Pay, as well as QQ file transfer, Qzone, and QQ Mail.

It was not until later on the morning of the 29th that the WeChat team responded, saying that engineers had made repairs and the system was gradually being restored.

The incident was caused by a cooling system failure in a Guangzhou Telecom data center. Tencent classified it as a company-level Level 1 incident and disciplined a large number of the managers responsible.

On April 12, the Information and Communication Administration of the Ministry of Industry and Information Technology heard Tencent's report on the "3.29" WeChat service anomaly. It required Tencent to further improve its production safety management system, implement network operation safeguards, resolutely avoid major production safety incidents, and effectively improve the safe and stable operation of services used by the public.

 

Vipshop's 329 incident penalty: head of the basic platform department removed

On March 29 this year, "Vipshop crashed" became a trending search topic. Because the outage lasted so long, many consumers were unable to place orders. Vipshop responded officially that, due to a short-term system failure, functions on the main site such as "add to cart" might behave abnormally.

On June 5, Vipshop released its "Announcement on the Handling of the 329 Data Center Outage". According to the announcement, on March 29 (00:14-12:01) the cooling system at the Nansha IDC failed, causing equipment temperatures in the data center to rise rapidly and taking the online mall down. The outage lasted nearly 12 hours, cost Vipshop more than 100 million yuan in lost sales, and affected 8 million customers. Vipshop classified the failure as a P0 fault. P0 is understood to be the highest incident level, covering crashes, unreachable pages, failure of the main flow or main functions, or faults with very broad impact (even if the underlying bug is not itself serious).
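The outage window quoted in the announcement can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `outage_duration` is made up here, and the start and end times are taken at face value from the announcement as reported above.

```python
from datetime import datetime

# Outage window on March 29, 2023, as quoted in the announcement (00:14-12:01).
start = datetime(2023, 3, 29, 0, 14)
end = datetime(2023, 3, 29, 12, 1)


def outage_duration(start, end):
    """Return the length of the outage as (hours, minutes). Hypothetical helper."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


hours, minutes = outage_duration(start, end)
print(f"{hours} h {minutes} min")  # prints "11 h 47 min"
```

The result, 11 hours 47 minutes, is just under the 12 hours usually cited for this incident.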

The announcement stated that Vipshop had decided to deal with the incident seriously: the direct managers of the departments involved would be held responsible, and the head of the basic platform department would be removed from the post.

 

Microsoft Azure failed and 17 production databases were deleted

On May 24, Microsoft Azure DevOps suffered an outage in a scale unit in the Brazil South region, with downtime of about 10.5 hours. Eric Mattingly, a principal software engineering manager at Microsoft, later apologized for the outage and disclosed its cause: a simple typo led to the deletion of 17 production databases.

 

China Telecom suffers a large-scale service outage

On the afternoon of June 8, 2023, China Telecom's network and communication services suffered failures such as loss of signal. Most of the users reporting problems were in Guangdong, which suggested the fault lay within Guangdong Province.

China Telecom customer service then responded that telecom base stations across the province (Guangdong Telecom) had failed and calls could not be made for the time being. It asked users to wait patiently, said urgent repairs were under way, and apologized for the inconvenience.

After about 4 hours, the telecommunications network in Guangdong Province was fully restored.

 

Yuque's 10.23 major service failure lasted more than 7 hours

On October 23, 2023, Yuque experienced a major service failure that lasted more than 7 hours before service was fully restored. The Yuque team subsequently published the cause of the fault and how it was handled:

On the afternoon of October 23, while the data storage operations team that serves Yuque was carrying out an upgrade, a bug in a new operations upgrade tool caused storage servers in the East China production environment to be taken offline by mistake. As a result, Yuque's data services failed seriously, causing widespread service interruption.

 

Alibaba Cloud 11.12 major service failure affects all products

On the afternoon of November 12, 2023, Alibaba Cloud experienced a serious failure, affecting all products.

Alibaba Cloud later confirmed that the cause of the failure was related to an underlying service component. About 5 hours later, it announced that all affected cloud products had recovered. Data pushes (such as monitoring and billing data) for some of the affected cloud products might be delayed, but this would not affect business operations.

 

Didi's 11.27 system failure: technical team worked on repairs overnight

On the evening of November 27, 2023, a system failure caused the Didi app to malfunction: it did not display locations and users could not hail rides. That same evening, Didi Chuxing responded: "We are very sorry. Due to a system failure, the Didi app service was abnormal this evening. After emergency repairs by our technical colleagues, service is now gradually being restored."

On the morning of November 28, 2023, Didi Chuxing reported that ride-hailing and other services had been restored, and that bike-sharing and other services were being repaired one by one. When Didi issued that announcement, a reporter tried to hail rides with Didi in Shanghai, Shenzhen, and other places and found that the ride-hailing function had not been restored: the network failed to load and rides still could not be booked. Later on November 28, Didi told reporters that ride-hailing had been restored and that driver and passenger benefits had been restored as well.

On November 29, Didi apologized again, saying the preliminary finding was that the incident was caused by a failure in underlying system software.

 

Twitter suffers severe outages, and Musk is furious

In February 2023, Musk urgently summoned about 80 people late at night to fix the algorithm, because his tweet about the Super Bowl had received less exposure than US President Biden's.

In March, Twitter suffered serious downtime due to an engineer modifying the configuration, and Musk threatened to refactor all the code.

In July, users reported that the platform was having problems again: they could not post new tweets and received a "rate limit exceeded" error. Musk responded that Twitter was trying to address "extreme levels of data scraping" and "system manipulation", and that the new limits were an important measure for curbing these urgent problems.

 

ChatGPT down for nearly 2 hours; CEO Altman apologizes: usage far exceeded expectations

At about 22:00 Beijing time on November 8, OpenAI's ChatGPT and related APIs suffered an outage, leaving user-facing and developer-facing services unavailable for nearly two hours.

OpenAI then updated its incident report, saying it had identified an issue causing high error rates in the API and ChatGPT and was working on a fix.

Meanwhile, OpenAI CEO Sam Altman apologized publicly, saying that usage of the new features released that week had far exceeded expectations. The company had planned to enable GPTs for all subscribers on Monday but had not yet been able to do so. Because of the load, service might be unstable in the short term, and he apologized to users for that.

 

Further reading: The Cyberspace Administration of China issued the Administrative Measures for Cybersecurity Incident Reporting (Draft for Comments)


Other year-end roundups:

For more reviews of the year's events, see the 2023 China Open Source Developer Report
