Google Cloud Platform outage highlights more than a network failure

  

A few days ago, a bug took Google's main public cloud offline for 18 minutes. Some users reacted strongly, and the outage raised questions about the security and reliability of Google Cloud Platform.

At 7:00 p.m. on April 11, Google Compute Engine users in every region lost connectivity for 18 minutes. The outage was reportedly caused by a network failure. It damaged Google's image of uninterrupted network connectivity and cost the company the confidence of some enterprise customers.

The network seems to be Google's Achilles' heel, and the network layer is a common cause of cloud outages, said Lydia Leong, vice president and distinguished analyst at research firm Gartner. What was different this time is that the outage affected not just one availability zone but every region.

"The most important thing is that customers expect multiple availability zones to give them reasonable protection against service interruption, but here services in all regions were interrupted," Leong said.

Similar things have happened in the industry. Although Amazon's services have suffered regional outages, it has avoided the disruption of its entire platform. Microsoft Azure has had several global outages, including a major outage at the end of 2014, but this scenario was not repeated in 2015.

Jason Read, founder of CloudHarmony (acquired by Gartner), said that as far as he can recall, it is rare for a major public cloud provider to suffer an outage in all regions, and this may be the first time. Read's company has been monitoring the uptime of cloud platforms since 2010.

Google said it had safeguards in place. But perhaps it should have run more tests to ensure that this type of failure could be prevented, Read said.

It sounds as though, in theory, Google had measures in place to prevent this, but those measures failed, Read said.

Google declined to comment.

Leong said that before Google and Microsoft moved into the public cloud on a large scale, they had built their data centers according to their own needs. "Users need a different degree of redundancy and a different attention to detail, and that takes time to get right," Leong said.

Google's cloud platform has a relatively small market share and hosts relatively few applications, so the outage may not have been a major problem for some companies. Leong said some Google customers may not have noticed the event unless they were transferring data during those 18 minutes, because many companies run batch computing workloads that do not involve much interactive traffic.

According to a post by Google executive Benjamin Treynor Sloss on the Google Cloud status page, Google has taken measures to prevent a recurrence, reviewed its existing systems and added new safeguards. All affected customers will receive credits of 10% of their monthly Compute Engine charges and 25% of their monthly VPN charges. Google's service level agreement calls for monthly Compute Engine uptime of at least 99.95%.
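To put the 99.95% figure next to the 18-minute outage, the arithmetic can be sketched as below. This is only an illustration: the 30-day month and the function names are assumptions, and the article does not describe how Google actually measures downtime under its SLA.

```python
# Assumption: a 30-day billing month. The article gives only the 99.95%
# uptime target and the 18-minute outage duration.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent, days_in_month=30):
    """Minutes of downtime a monthly uptime target leaves room for."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100.0 - sla_percent) / 100.0


def uptime_percent(downtime_minutes, days_in_month=30):
    """Monthly uptime, in percent, after the given minutes of downtime."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


budget = allowed_downtime_minutes(99.95)
observed = uptime_percent(18)
print(f"Downtime budget at 99.95%: {budget:.1f} minutes")
print(f"Uptime after an 18-minute outage: {observed:.3f}%")
```

Under these assumptions the monthly budget is about 21.6 minutes, so a single 18-minute outage would, by this arithmetic alone, leave little room for any further downtime in the same month.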

Network failure causes service interruption of Google Cloud Platform

The incident began with a network disruption that prevented inbound Compute Engine traffic from being routed correctly. It also affected VPN and Layer 3 network load balancers. The management software attempted to revert to the previous configuration, but the failsafe triggered a previously unknown bug. When an IP block was removed from one configuration file, the other configuration files used for network configuration management were not updated to match, so propagation of the change failed.

When propagation fails, Google's system normally reverts the affected part to the previous configuration and then retries with the new block. This time, a previously unseen software bug was triggered. Instead of restoring the previous configuration after the failure, the software reconfigured all IP blocks on Google Cloud Platform and pushed out an update built from an incomplete set of IP blocks.

In the end, more than 95% of inbound traffic was lost. Google engineers reverted the most recent configuration change, and service was restored 18 minutes after the outage began.

The outage did not affect Google App Engine, Google Cloud Storage, internal connections between Compute Engine services and virtual machines, outbound Internet traffic, or HTTP and HTTPS load balancing.
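The intended failsafe behavior, keeping the last known-good configuration when a change fails to propagate, can be sketched as follows. This is a minimal illustration and not Google's management software: `NetworkConfig`, `propagate`, and `apply_change` are hypothetical names, and the IP ranges are reserved documentation addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkConfig:
    """Hypothetical network configuration: a version and its IP blocks."""
    version: int
    ip_blocks: frozenset


class PropagationError(Exception):
    """Raised when a dependent configuration file disagrees with the change."""


def propagate(config, dependent_files):
    """Pretend to push a config; fail if any dependent file is out of sync."""
    for name, blocks in dependent_files.items():
        if blocks != config.ip_blocks:
            raise PropagationError(
                f"{name} does not match config version {config.version}"
            )


def apply_change(current, candidate, dependent_files):
    """Return the configuration that should be live after the attempt."""
    try:
        propagate(candidate, dependent_files)
    except PropagationError:
        # Failsafe: leave the last known-good configuration untouched
        # instead of pushing a partial or rebuilt one.
        return current
    return candidate


current = NetworkConfig(1, frozenset({"192.0.2.0/24", "198.51.100.0/24"}))
candidate = NetworkConfig(2, frozenset({"192.0.2.0/24"}))

# A dependent file that still lists the removed block makes propagation fail.
stale_files = {"routing": current.ip_blocks}
live = apply_change(current, candidate, stale_files)
print("live version with stale files:", live.version)
```

In this sketch the failed change leaves version 1 live. In the outage described above, the equivalent step went wrong and an incomplete configuration was pushed instead.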

SearchCloudComputing asked Google Cloud customers how the outage might have affected their business. Some high-profile users that rely heavily on Google's resources declined to comment or did not respond, while some smaller users said that although their business runs on Google Cloud, the outage had little impact on them.

Vendasta Technologies, which sells sales and marketing software to media companies, did not even notice the outage. Dale Hopkins, Vendasta's chief architect, said the company has built-in retry mechanisms. Most system usage at the Saskatoon, Sask.-based company is front-end traffic served through App Engine.
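The article does not say how the company's retry mechanism works. As a rough sketch of the general technique, a retry loop with exponential backoff and jitter might look like the following. The function name, the parameters, and the choice of `ConnectionError` as the retryable failure are all assumptions made for illustration.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with backoff.

    The delay ceiling doubles after each failed attempt, capped at
    max_delay, and the actual wait is a random fraction of that ceiling
    (jitter) so that many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Usage: a hypothetical operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "ok"


result = call_with_retries(flaky_request, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

The delays here are illustrative. A retry window this short hides brief network errors but would not bridge an 18-minute outage by itself.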

Vendasta has used Google's cloud products for five years and has had only one outage serious enough that it had to call customers. That high uptime means the company does not spend much time worrying about outages, so it was not greatly concerned by this event.

"If the business is interrupted, it is very bad and difficult to explain to customers, but it happens so rarely that we do not consider preventing outages one of our top priorities," Hopkins said.

For enterprises with a low tolerance for risk, reluctance to trust the cloud is easier to understand, but most operations teams cannot achieve in their own data centers the level of uptime that Google promises, Hopkins said.

Vendasta uses specific services from multiple clouds where they are cheaper or better, but it has not considered using another cloud platform for redundancy. Doing so would add cost and skill requirements, and it would keep the company from taking advantage of some platform-specific optimizations.

All public cloud platforms fail, and it seems that Google has learned a lesson about testing network configuration changes, said Dave Bartoletti. The timing is unfortunate, however, because just last month Google Cloud brought in a new enterprise-focused management team.

"Google Cloud has only just begun to win the trust of enterprise customers. Although these large companies will certainly like running their businesses on the low-cost Google Cloud platform, in the long run its reliability will matter more," Bartoletti said.