Analyzing This Summer’s Google Cloud Outage

Moving critical applications and services to the cloud brings power and agility to IT teams who no longer have to worry about building and maintaining infrastructure. But it also brings risk, because cloud computing introduces a solid dose of unpredictability due to the sheer complexity of the Internet and cloud connectivity. We are reminded of this reality on a fairly regular basis whenever an outage hits a cloud or other service provider. On June 2nd, between 12pm and 12:15pm PDT, ThousandEyes detected a network outage in Google Cloud Platform (GCP) that lasted more than four hours and affected global access to services including YouTube, G Suite and Google Compute Engine across multiple hosting regions. The following is a recap of what we saw.

Early Signs of Trouble

We started to see both user-level impacts and macro network impacts from the Google Cloud outage shortly after 12:00pm PDT on June 2nd, right around the time initial reports began appearing on social media and Downdetector. Figure 1 shows the disruption as seen from 249 of our global monitoring agents in 170 cities that were monitoring access to content hosted in GCP but served dynamically via a CDN edge network. Users accessing other GCP-hosted services would have started to feel the impact of the outage at this time.

Figure 1: The outage started shortly after 12pm on June 2nd, impacting global user sites

Figure 2 shows the extent of packet loss — 100% — for global users attempting to access a GCP-hosted service.

Figure 2: Packet loss was total between our global monitoring agents and content hosted in Google Compute Engine
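For readers who want a rough feel for how this kind of loss shows up from a single vantage point, here is a minimal sketch (not ThousandEyes agent code) that probes a service with repeated TCP connection attempts and reports the failure rate. The hostname and port are placeholders for a hypothetical GCP-hosted endpoint.

```python
import socket
import time

# Hypothetical GCP-hosted endpoint; substitute your own service's host and port.
HOST = "example-service.example.com"
PORT = 443
PROBES = 50
TIMEOUT = 2.0  # seconds per probe

failures = 0
for _ in range(PROBES):
    try:
        # A TCP connect stands in for a network-layer probe; ICMP would require raw sockets.
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
            pass
    except OSError:
        failures += 1
    time.sleep(0.2)  # pace the probes

loss_pct = 100.0 * failures / PROBES
print(f"{failures}/{PROBES} probes failed ({loss_pct:.0f}% loss-equivalent)")
```

A fleet of agents running this kind of check from many cities, as in Figure 2, is what turns isolated failures into a global picture of loss.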

At a macro level, we were also able to see connectivity and packet loss issues that impacted Google network locations in the eastern US, including Ashburn, Atlanta and Chicago, as seen in Figure 3 below. We also saw network outages impacting the Google network in India.

Figure 3: Packet loss within Google’s US network impacted eastern US Google locations

From a connectivity point of view, our Path Visualization showed traffic dropping at the edge of Google’s network, as seen in Figure 4 below. In fact, for 3.5 hours of the 4+ hour outage, we saw 100% packet loss from global monitoring locations.

Figure 4: High packet loss conditions at Google’s network edge
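Our Path Visualization constructs hop-by-hop paths at scale, but the underlying idea can be approximated from a single machine with the system traceroute utility: the last hop that still answers before a run of unanswered probes suggests where traffic is being dropped. A rough sketch, assuming a Unix-like host with traceroute installed and a placeholder target hostname:

```python
import subprocess

TARGET = "example-service.example.com"  # placeholder GCP-hosted endpoint

# -n: numeric output, -m 30: at most 30 hops, -q 3: three probes per hop
out = subprocess.run(
    ["traceroute", "-n", "-m", "30", "-q", "3", TARGET],
    capture_output=True, text=True, timeout=120,
).stdout

last_responding = None
for line in out.splitlines()[1:]:  # skip the traceroute header line
    parts = line.split()
    if not parts or not parts[0].isdigit():
        continue
    hop, rest = int(parts[0]), parts[1:]
    if all(tok == "*" for tok in rest):
        # No responses at this hop; the previous hop marks the edge of visibility.
        print(f"Probes go unanswered after hop {hop - 1} ({last_responding})")
        break
    last_responding = rest[0]  # with -n, the first token after the hop number is the responder's IP
else:
    print(f"Path completed; last hop responded from {last_responding}")
```

During this outage, that "edge of visibility" consistently fell at the boundary of Google's network, which is what Figure 4 captures across many vantage points at once.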

The outage impacted multiple, but not all, regions. Figure 5 is a view of availability in GCP's Europe West4 region, which is located in the Netherlands. The green status shows that global monitoring agents were encountering little-to-no network packet loss in accessing the region.

Figure 5: Negligible packet loss for GCP Europe West4 region

Google Publishes its Analysis

Google was aware of the issues early on and announced service impacts on Google Compute Engine beginning at 12:25pm PDT, as seen in Figure 6. They subsequently posted an update at 12:53pm that the service impacts were tied to broader networking issues. By 1:36pm, Google had identified the issue as related to high levels of network congestion in the eastern US, which aligned with the elevated packet loss we started seeing before the announcement (see Figure 2 above).

Figure 6: Google first reported the issues on its status page at 12:25pm PDT

Services Restored

Beginning at roughly 3:30pm PDT, we started to see packet loss conditions easing and reachability to Google services improving, as seen in Figure 7. Reachability continued to improve over the next hour as we tracked it.

Figure 7: Service reachability started to improve from global locations around 3:30pm PDT

Service finally appeared fully restored by 4:45pm PDT, as seen in Figure 8. 

Figure 8: Availability of GCP-hosted service at 4:45pm PDT

At 5:09pm, Google reported that the network congestion issue was resolved for all impacted users, as seen in Figure 9. They also promised an internal investigation of the issue.

Figure 9: Google announced resolution of the issue at 5:09pm PDT

Key Takeaways

Any organization hosting in GCP that was impacted by this outage will be rightly concerned and will likely be asking Google for a detailed root cause post-mortem report. That's entirely appropriate, and there will no doubt be important lessons for Google cloud services users to take away from such a report. Likewise, it's vitally important to ensure your cloud architecture has sufficient resiliency measures, whether on a multi-region or even multi-cloud basis, to protect against future outages. However, lest we be tempted to pillory Google for the outage, it's important to remember that every IT team experiences service outages.
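As one illustration of what a resiliency measure can look like at the application level, the sketch below checks the health of a primary regional endpoint and falls back to a secondary region when it is unreachable. The endpoint URLs are hypothetical; real deployments would typically rely on DNS- or load-balancer-based failover rather than client-side logic, but the principle is the same.

```python
import urllib.request
import urllib.error

# Hypothetical regional endpoints for the same service; substitute your own.
ENDPOINTS = [
    "https://us-east.example.com/healthz",
    "https://europe-west.example.com/healthz",
]

def first_healthy(endpoints, timeout=3.0):
    """Return the first endpoint that answers its health check, else None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # unreachable or unhealthy; try the next region
    return None

active = first_healthy(ENDPOINTS)
if active:
    print(f"Routing requests to {active}")
else:
    print("No region is currently reachable")
```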

The more important takeaway is that the cloud and the Internet are inherently unpredictable because they are massive, complex and endlessly interconnected. While the cloud is arguably still the best way to do IT for most businesses today, it carries risks that no team should be unprepared for. When you no longer control the infrastructure, software and networks that run your business, you need immediate visibility into what's going on so you can get to resolution as quickly as possible. The good news is that you don't need the global infrastructure and engineering team of a FAANG-sized company to see through the cloud and the Internet.

Author's Bio

Angelique Medina

Director, Product Marketing, ThousandEyes