GCP Troubleshooting — Trivia & Interesting Facts

Surprising, historical, and little-known facts about Google Cloud Platform outages, debugging, and operational surprises.


Google's 2020 global outage was caused by an internal quota issue

On December 14, 2020, Google experienced a 47-minute global outage affecting Gmail, YouTube, Google Drive, and GCP services. The root cause was the central User ID Service running out of storage quota for its authentication tokens. With authentication unavailable, nearly every Google service that requires a signed-in user broke simultaneously.


GCP's IAM model is fundamentally different from AWS, and this confuses everyone

GCP uses a hierarchical IAM model (Organization > Folder > Project) in which permissions granted at a higher level inherit downward to every resource below. AWS uses a flat account model with separate IAM policies per account. Engineers moving between the two clouds frequently misconfigure permissions because the mental models are incompatible, making IAM one of the most common GCP troubleshooting topics.
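The downward-inheritance rule can be sketched in a few lines. This is a toy model, not the real IAM engine, and all resource and member names below are invented:

```python
# Toy model of GCP-style IAM inheritance: a binding granted on an
# ancestor (organization or folder) is effective on every descendant.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    bindings: dict = field(default_factory=dict)  # role -> set of members

def grant(node: Node, role: str, member: str) -> None:
    node.bindings.setdefault(role, set()).add(member)

def effective_members(node: Node, role: str) -> set:
    # Walk upward to the organization, unioning bindings along the way:
    # this is what "permissions inherit downward" means in practice.
    members = set()
    while node is not None:
        members |= node.bindings.get(role, set())
        node = node.parent
    return members

org = Node("organizations/123456")
folder = Node("folders/engineering", parent=org)
project = Node("projects/app-prod", parent=folder)

grant(org, "roles/viewer", "group:auditors@example.com")
grant(project, "roles/editor", "user:dev@example.com")

# The project has no viewer binding of its own, yet the org-level
# grant still applies to it.
print(effective_members(project, "roles/viewer"))  # → {'group:auditors@example.com'}
```

This also shows why misconfigurations are hard to spot: a grant you are hunting for may live two levels above the resource you are inspecting.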


Google Cloud's "Andromeda" virtual network has sub-millisecond overhead

GCP's Andromeda virtual networking stack adds less than 100 microseconds of latency overhead compared to bare-metal networking, achieved through custom SmartNICs and kernel-bypass techniques. When troubleshooting GCP network latency, the virtual network is rarely the bottleneck; the culprit is almost always the application or cross-region distance.


Preemptible VMs (now Spot VMs) were inspired by Google's internal Borg priorities

GCP's Spot VMs (formerly Preemptible VMs) mirror how Google's internal Borg cluster manager assigns priorities. Low-priority "batch" jobs in Borg are preempted when high-priority jobs need resources. GCP exposed this same mechanism to customers at 60-91% discounts. The maximum 24-hour lifetime of original Preemptible VMs was a Borg-inherited constraint.


Cloud SQL's connection limit has surprised thousands of serverless developers

Cloud SQL instances have a hard connection cap that scales with machine size (on the order of 4,000 connections at the larger tiers). Serverless platforms like Cloud Functions and Cloud Run can scale out to many instances, each opening its own database connections, exhausting that limit rapidly. The standard mitigations are reusing a single connection per instance (declared in global scope) and putting a pooler such as PgBouncer in front of the database; the Cloud SQL Auth Proxy secures and authenticates connections, but it does not pool them.


GKE was the first managed Kubernetes service, beating EKS by 3 years

Google Kubernetes Engine (originally Google Container Engine) launched in August 2015, roughly three years before AWS EKS and Azure AKS (both June 2018). This head start gave GCP a significant Kubernetes credibility advantage, though AWS eventually captured more Kubernetes market share through sheer AWS ecosystem dominance.


The "zones" naming convention in GCP has caused costly mistakes

GCP zones are named like us-central1-a, superficially similar to AWS AZ names (us-east-1a), but the semantics differ. Zones within a GCP region share a metropolitan area yet are separate failure domains, often in different facilities. Engineers have placed resources in different zones assuming same-building proximity and free traffic, only to discover that cross-zone egress within a region is billed.
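One quick way to keep the two naming schemes straight: in GCP the region is everything before the final hyphen, whereas AWS fuses the AZ letter directly onto the region name. A trivial helper, for illustration only:

```python
def split_zone(zone: str) -> tuple[str, str]:
    """Split a GCP zone name into (region, zone suffix).

    'us-central1-a' -> ('us-central1', 'a'). Contrast with AWS, where
    'us-east-1a' has no separator between region and AZ letter.
    """
    region, _, suffix = zone.rpartition("-")
    return region, suffix

print(split_zone("us-central1-a"))   # → ('us-central1', 'a')
print(split_zone("europe-west4-b"))  # → ('europe-west4', 'b')
```

Parsing the region out correctly matters when auditing resources for accidental cross-zone placement within the same region.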


GCP's error reporting is surprisingly good but hidden

GCP Error Reporting automatically groups and deduplicates application errors from Cloud Logging without any configuration. Many teams don't know it exists because it's tucked away in the console. It can identify new error types, track error frequencies, and link directly to the offending log entries — for GCP-hosted applications, that often makes it quicker to get value from than a separately configured tool like Sentry.


BigQuery's slot-based pricing confuses troubleshooters

BigQuery charges either by bytes scanned (on-demand) or by "slots" (reserved capacity). When troubleshooting slow BigQuery queries, the answer depends entirely on which model you're using. On-demand queries compete for shared slots and can be throttled; reserved queries have dedicated slots but may be under-provisioned. Misunderstanding the model leads to wrong conclusions.
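For the on-demand model, a back-of-envelope cost check helps confirm which side of the pricing you are on. A sketch, assuming the published list price of $6.25 per TiB scanned and a roughly 10 MB minimum billed per query (verify current pricing and minimums before relying on these numbers):

```python
TIB = 1024 ** 4
MIN_BILLED_BYTES = 10 * 1024 ** 2   # assumed ~10 MB billing minimum

def on_demand_cost_usd(bytes_scanned: int, usd_per_tib: float = 6.25) -> float:
    """Estimate on-demand query cost from bytes scanned.

    Reserved (slot) capacity is flat-rate and not modeled here; if you
    are on reservations, slow queries are a provisioning question, not
    a bytes-scanned question.
    """
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / TIB * usd_per_tib

print(round(on_demand_cost_usd(1 * TIB), 2))        # → 6.25  (full TiB scan)
print(round(on_demand_cost_usd(500 * 1024**3), 2))  # → 3.05  (500 GiB scan)
```

A query's dry-run byte estimate plugged into a function like this is a fast sanity check before running anything expensive on-demand.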