GCP Troubleshooting Footguns¶
Mistakes that cause outages, access failures, or misdiagnosis on GCP.
1. Forgetting that firewall rules require network tags¶
You create a firewall rule to allow port 8080, but traffic is still blocked. The rule targets instances with the tag http-server, and your instance does not have that tag. A GCP firewall rule applies only to its specified targets (network tags or service accounts); it applies to every instance in the network only if no target is set at all.
Fix: Always check instance tags: gcloud compute instances describe <instance> --zone <zone> --format="value(tags.items)". Match the tags against the firewall rule's target tags.
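A minimal sketch of the check, assuming bash and the gcloud CLI; the instance, zone, and rule names below are placeholders:

```shell
# Compare an instance's network tags with a firewall rule's target tags.
# All names (instance, zone, rule) are placeholders for your own.

instance_tags() {
  gcloud compute instances describe "$1" --zone "$2" \
    --format="value(tags.items)"
}

rule_target_tags() {
  gcloud compute firewall-rules describe "$1" \
    --format="value(targetTags)"
}

# If the rule's target tag is missing from the instance, attach it:
#   gcloud compute instances add-tags my-instance --zone us-central1-a \
#     --tags http-server
```

Adding the tag takes effect without restarting the instance, so this is usually the fastest fix once the mismatch is confirmed.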
2. Missing health check firewall rules for load balancers¶
Your load balancer shows all backends as unhealthy. The app is running fine. The problem: GCP health checks come from 35.191.0.0/16 and 130.211.0.0/22, and you have no firewall rule allowing these ranges to reach your backends.
Fix: Create a firewall rule allowing the health check source ranges to reach your backend port. This is required for every GCP load balancer and is the most common cause of all backends showing unhealthy.
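A sketch of such a rule, assuming a backend listening on port 8080 and tagged lb-backend (the network, port, rule name, and tag are all illustrative):

```shell
# Allow Google's health check probes to reach the backends.
# Network, port, and target tag are placeholders; adjust to your setup.
gcloud compute firewall-rules create allow-gcp-health-checks \
  --network=my-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8080 \
  --source-ranges=35.191.0.0/16,130.211.0.0/22 \
  --target-tags=lb-backend
```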
3. IAM inheritance causing unexpected access¶
You grant roles/editor at the organization level for a dev team. They now have Editor on every project in the org, including production. Someone accidentally deletes a Cloud SQL instance in the wrong project.
Fix: Grant roles at the most specific scope possible, using project-level or resource-level bindings. Note that gcloud projects get-iam-policy shows only bindings made directly on the project; audit inherited access separately with gcloud organizations get-iam-policy and gcloud resource-manager folders get-iam-policy.
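A sketch of the audit, assuming the gcloud CLI; ORG_ID and FOLDER_ID are placeholders for your own resource IDs:

```shell
# List who holds roles/editor at the org level (flows down to every project).
gcloud organizations get-iam-policy ORG_ID \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/editor" \
  --format="table(bindings.members)"

# Repeat at the folder level; folder grants also flow down to child projects.
gcloud resource-manager folders get-iam-policy FOLDER_ID \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/editor" \
  --format="table(bindings.members)"
```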
4. Workload Identity misconfigured — pods silently use node SA¶
Your GKE pod is meant to run as a dedicated GCP service account, but the Workload Identity setup is incomplete: the KSA annotation is missing, the IAM binding is wrong, or Workload Identity is not enabled on the node pool. An unconfigured pod does not fail; it silently falls back to the node's service account, which may have broader permissions than intended. No error, just a silent privilege escalation.
Fix: Verify the KSA-to-GSA binding: gcloud iam service-accounts get-iam-policy <gsa>. Ensure roles/iam.workloadIdentityUser is granted to the KSA's Workload Identity member. Then confirm from inside the pod (if the image includes gcloud): kubectl exec <pod> -- gcloud auth list.
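The three pieces that must line up, as a sketch; GSA_EMAIL, PROJECT_ID, NAMESPACE, KSA_NAME, and POD are placeholders:

```shell
# 1. The GSA must allow the KSA to impersonate it.
gcloud iam service-accounts add-iam-policy-binding GSA_EMAIL \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

# 2. The KSA must carry the annotation pointing at the GSA.
kubectl annotate serviceaccount KSA_NAME -n NAMESPACE \
  iam.gke.io/gcp-service-account=GSA_EMAIL

# 3. From inside the pod, confirm which identity is actually active:
#    it should list the GSA, not the node's service account.
kubectl exec POD -n NAMESPACE -- gcloud auth list
```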
5. Not using --project and hitting the wrong project¶
Your gcloud default project is set to dev-project. You run a destructive command without --project and it executes against dev when you intended prod. Or worse — against prod when you intended dev.
Fix: Always verify project before destructive operations: gcloud config get-value project. Use --project explicitly in scripts. Set CLOUDSDK_CORE_PROJECT in your shell per-session.
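A small guard function you can put at the top of destructive scripts; a sketch assuming bash, where the expected project name is whatever the script is meant to run against:

```shell
# Refuse to run unless the active gcloud project matches the expected one.
confirm_project() {
  local expected="$1"
  local active
  active=$(gcloud config get-value project 2>/dev/null)
  if [ "$active" != "$expected" ]; then
    echo "refusing to run: active project is '$active', expected '$expected'" >&2
    return 1
  fi
}

# Usage (placeholder names):
#   confirm_project prod-project && \
#     gcloud sql instances delete my-db --project prod-project
```

Combining the guard with an explicit --project on the destructive command means both the check and the action name the same project, so a stale default cannot slip through.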
6. Ignoring quota limits¶
You ask for 20 GKE nodes but the autoscaler provisions only 12, because the project has hit a regional CPU quota. GKE raises no hard error; the cluster autoscaler just logs a quiet warning. Your workload runs under-provisioned and you do not realize it until users report latency.
Fix: Check quotas before scaling: gcloud compute project-info describe --format="table(quotas.metric,quotas.limit,quotas.usage)" for project-wide quotas, and gcloud compute regions describe <region> with the same format for per-region quotas such as CPUs. Request quota increases proactively. Monitor autoscaler logs.
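One way to make headroom visible is to filter the quota listing for metrics near their limit; a sketch assuming bash and awk, with an arbitrary 80% default threshold:

```shell
# Reads lines of "METRIC USAGE LIMIT" (e.g. extracted from the quota
# tables above) and prints any metric at or above the given usage
# percentage. The threshold argument is optional and defaults to 80.
flag_near_limit() {
  awk -v pct="${1:-80}" '$3 > 0 && ($2 / $3) * 100 >= pct { print $1 }'
}

# Example: printf 'CPUS 90 100\n' | flag_near_limit 80   prints CPUS
```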
7. Deleting a service account that other services depend on¶
You clean up unused service accounts and delete one that a Cloud Function and three GKE workloads use. Everything breaks at the next credential rotation. The error messages say "token invalid" — not "service account deleted."
Fix: Before deleting a SA, check who can use it (gcloud iam service-accounts get-iam-policy <sa>) and search Cloud Audit Logs for its recent activity. Disable the SA first with gcloud iam service-accounts disable and wait 7 days before deleting; disabling is reversible.
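The sequence above as a sketch; SA_EMAIL and PROJECT_ID are placeholders:

```shell
# 1. Look for recent activity by the service account in audit logs.
gcloud logging read \
  'protoPayload.authenticationInfo.principalEmail="SA_EMAIL"' \
  --project PROJECT_ID --freshness=30d --limit 10

# 2. Disable first: anything still using the SA breaks loudly, and the
#    change is reversible with `enable`.
gcloud iam service-accounts disable SA_EMAIL

# 3. Only after the waiting period passes with no breakage:
#    gcloud iam service-accounts delete SA_EMAIL
```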
8. Using gcloud compute ssh with default firewall open to 0.0.0.0/0¶
The default VPC includes a firewall rule default-allow-ssh that allows port 22 from everywhere. You deploy production workloads in the default VPC. Every instance is reachable from the internet on port 22.
Fix: Delete the default VPC. Create custom VPCs for production. Use IAP tunneling for SSH: gcloud compute ssh --tunnel-through-iap. Port 22 then only needs to be open to the IAP range 35.235.240.0/20, never to the internet.
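A sketch of the IAP setup; the network, rule, instance, and zone names are placeholders:

```shell
# Allow only IAP's TCP-forwarding range to reach SSH on the VPC.
gcloud compute firewall-rules create allow-iap-ssh \
  --network=my-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=35.235.240.0/20

# SSH via an IAP tunnel instead of a public IP.
gcloud compute ssh my-instance --zone us-central1-a --tunnel-through-iap
```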
9. Private Google Access not enabled on subnet¶
Your instances have no external IP. They cannot reach Google APIs (GCS, Cloud Logging, etc.). The instances are effectively disconnected from managed services. Logs stop shipping, health checks fail.
Fix: Enable Private Google Access on the subnet: gcloud compute networks subnets update my-subnet --region us-central1 --enable-private-ip-google-access. Or use Cloud NAT for full internet access.
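It is worth checking the current state before changing it; a sketch, with the subnet and region as placeholders:

```shell
# Prints True or False for the subnet's Private Google Access setting.
gcloud compute networks subnets describe my-subnet \
  --region us-central1 \
  --format="value(privateIpGoogleAccess)"

# If False, enable it (this is the command from the fix above).
gcloud compute networks subnets update my-subnet \
  --region us-central1 --enable-private-ip-google-access
```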
10. Cloud Logging retention defaults causing data loss¶
Cloud Logging retains logs for 30 days by default (_Default bucket). After an incident, you try to review logs from 45 days ago. They are gone. Your compliance team is unhappy.
Fix: Configure a log sink to BigQuery or Cloud Storage for long-term retention: gcloud logging sinks create my-sink storage.googleapis.com/my-log-bucket --log-filter="...". Set custom retention on the _Default bucket if needed.
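A sketch of a Cloud Storage sink, with one step that is easy to miss: the sink's writer identity must be granted write access on the destination, or exports silently fail. The sink, bucket, and filter below are placeholders:

```shell
# Create the sink (bucket and filter are illustrative).
gcloud logging sinks create archive-sink \
  storage.googleapis.com/my-log-archive \
  --log-filter='resource.type="gce_instance"'

# Find the sink's service account and let it write objects to the bucket.
WRITER=$(gcloud logging sinks describe archive-sink \
  --format="value(writerIdentity)")
gsutil iam ch "$WRITER:roles/storage.objectCreator" gs://my-log-archive
```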