GCP Troubleshooting - Primer

Why This Matters

GCP powers production workloads from GKE clusters to Cloud Run services, and when something breaks, you need to move fast through a stack of IAM bindings, VPC firewall rules, and service-specific configurations. GCP's IAM model differs from AWS's: policies are attached to resources and inherited down the resource hierarchy, and misunderstanding that model is a frequent source of access issues. Knowing where to look and which gcloud commands to run separates a 5-minute fix from a 2-hour firefight.

Core Concepts

1. IAM and Service Accounts

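GCP IAM is hierarchical: Organization > Folder > Project > Resource. Permissions granted at a higher level are inherited by lower levels. Allow policies are additive, so a role granted on a folder cannot be taken away by the project-level policy beneath it.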
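
Because bindings are inherited, a role that is missing from the project policy may still be granted on a parent folder or on the organization. The following is a minimal sketch, not a definitive tool: the helper names policy_cmd_for and print_policy_cmds are hypothetical, and the functions only print the command to run at each level rather than calling gcloud themselves.

```shell
# Hypothetical helper: map an ancestor type and ID to the gcloud command
# that prints the IAM policy at that level of the hierarchy.
policy_cmd_for() {
  local type="$1" id="$2"
  case "$type" in
    organization) echo "gcloud organizations get-iam-policy ${id}" ;;
    folder)       echo "gcloud resource-manager folders get-iam-policy ${id}" ;;
    project)      echo "gcloud projects get-iam-policy ${id}" ;;
    *)            echo "unknown resource type: ${type}" >&2; return 1 ;;
  esac
}

# Read "type id" pairs from stdin and print one command per level.
print_policy_cmds() {
  local type id
  while read -r type id; do
    if [ -n "$type" ]; then
      policy_cmd_for "$type" "$id"
    fi
  done
}

# Assumed usage (requires gcloud and permission to view the ancestors):
#   gcloud projects get-ancestors my-project --format="value(type,id)" | print_policy_cmds
```

Printing the commands instead of running them keeps the sketch safe to try; each printed command can then be filtered for the principal you are debugging.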

# Check who you are
gcloud auth list
gcloud config get-value project

# List IAM bindings on a project
gcloud projects get-iam-policy my-project --format=json | jq '.bindings[] | select(.role | contains("editor"))'

# Check what roles a service account has
gcloud projects get-iam-policy my-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

# List service accounts in a project
gcloud iam service-accounts list

# Check service account keys (look for leaked or old keys)
gcloud iam service-accounts keys list --iam-account my-sa@my-project.iam.gserviceaccount.com

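# List the roles bound to a service account (this shows roles, not individual permissions)
gcloud projects get-iam-policy my-project --format=json | \
  jq -r '.bindings[] | select(any(.members[]; contains("my-sa@"))) | .role'

# See which permissions a given role contains
gcloud iam roles describe roles/storage.objectViewer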

Common IAM debugging flow:

1. Confirm identity: gcloud auth list, check active account
2. Check project-level bindings (gcloud projects get-iam-policy)
3. Check resource-level bindings (bucket, topic, instance)
4. Check organization policies and IAM deny policies (either can block an action even when an allow binding exists)
5. Verify Workload Identity bindings (for GKE pods)
6. Run Policy Troubleshooter to see why access was denied

# Check IAM policy troubleshooter (why was access denied?)
gcloud policy-troubleshoot iam \
  //cloudresourcemanager.googleapis.com/projects/my-project \
  --permission=storage.objects.get \
  --principal-email=my-sa@my-project.iam.gserviceaccount.com

# Workload Identity: check KSA to GSA binding
gcloud iam service-accounts get-iam-policy my-gsa@my-project.iam.gserviceaccount.com \
  --format=json | jq '.bindings[] | select(.role == "roles/iam.workloadIdentityUser")'
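
A binding with roles/iam.workloadIdentityUser only helps if its member string names the right namespace and Kubernetes service account. As a small sketch, the hypothetical helper wi_member below builds the member string in the PROJECT.svc.id.goog[NAMESPACE/KSA] form so it can be compared against the policy output; the namespace and KSA names in the usage comment are placeholders.

```shell
# Hypothetical helper: build the IAM member string that Workload Identity
# uses for a Kubernetes service account (KSA) in a given namespace.
wi_member() {
  local project="$1" namespace="$2" ksa="$3"
  echo "serviceAccount:${project}.svc.id.goog[${namespace}/${ksa}]"
}

# Assumed usage (requires gcloud; my-ksa is a placeholder name):
#   gcloud iam service-accounts get-iam-policy my-gsa@my-project.iam.gserviceaccount.com \
#     --format=json | grep -F "$(wi_member my-project production my-ksa)"
```

An empty grep result suggests the binding is missing or names a different namespace or service account.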

2. VPC Networking

# List VPC networks and subnets
gcloud compute networks list
gcloud compute networks subnets list --network=my-vpc

# List firewall rules (GCP's closest analogue to AWS security groups)
gcloud compute firewall-rules list --filter="network:my-vpc" --format="table(name,direction,allowed[].map().firewall_rule().flat(),sourceRanges[],targetTags[])"

# Check a specific firewall rule
gcloud compute firewall-rules describe allow-http

# Test connectivity between instances
gcloud compute ssh instance-1 -- ping -c3 10.0.1.5
gcloud compute ssh instance-1 -- curl -v http://10.0.1.5:8080/health

# Connectivity tests (built-in network diagnostic)
gcloud network-management connectivity-tests create test-web-to-db \
  --source-instance=projects/my-project/zones/us-central1-a/instances/web-01 \
  --destination-instance=projects/my-project/zones/us-central1-a/instances/db-01 \
  --destination-port=5432 --protocol=TCP

gcloud network-management connectivity-tests describe test-web-to-db

Common VPC issues:

Traffic blocked:
  1. Check firewall rules (gcloud compute firewall-rules list)
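  2. Firewall rules often target network tags or service accounts; verify the instance carries the tag (or runs as the service account) the rule targets
  3. Check routes (gcloud compute routes list)
  4. Check if Private Google Access is needed (for instances without external IPs that must reach Google APIs)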

Cannot reach internet:
  1. Check if instance has external IP or NAT
  2. Check Cloud NAT configuration
  3. Check default route to internet gateway exists
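
The tag check in the "Traffic blocked" list can be sketched as a small function. This is an illustration only: rule_applies_by_tag is a hypothetical helper, it looks at network tags alone, and it ignores target service accounts, rule priority, and deny rules.

```shell
# Hypothetical helper: succeed if a firewall rule applies to an instance,
# judged by network tags only. Both arguments are comma-separated tag lists.
# An empty target list means the rule applies to every instance in the network.
rule_applies_by_tag() {
  local target_tags="$1" instance_tags="$2" tag
  if [ -z "$target_tags" ]; then
    return 0
  fi
  local IFS=','
  for tag in $target_tags; do
    case ",${instance_tags}," in
      *",${tag},"*) return 0 ;;
    esac
  done
  return 1
}

# Assumed usage with placeholder tag lists:
#   rule_applies_by_tag "web,api" "prod,web" && echo "rule applies"
```

The tag lists themselves would come from the targetTags field of the rule and the tags.items field of the instance.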

3. Cloud Logging

Cloud Logging (formerly Stackdriver) is GCP's centralized logging service:

# Read recent logs for a GKE workload
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="production" AND resource.labels.container_name="api-server"' \
  --limit=50 --format=json

# Filter by severity
gcloud logging read 'severity>=ERROR AND resource.type="k8s_container"' \
  --limit=20 --freshness=1h

# Read logs for a specific instance
gcloud logging read 'resource.type="gce_instance" AND resource.labels.instance_id="123456789"' \
  --limit=50

# Audit logs (who did what)
gcloud logging read 'logName:"cloudaudit.googleapis.com" AND protoPayload.methodName:"SetIamPolicy"' \
  --limit=10 --format=json

# Stream logs (tail -f equivalent)
gcloud logging tail 'resource.type="k8s_container" AND resource.labels.namespace_name="production"'

# Check log-based metrics
gcloud logging metrics list

# List log sinks (exports to BigQuery, Cloud Storage, or Pub/Sub)
gcloud logging sinks list

Logs Explorer queries (for the console, but the syntax works with gcloud logging read):

# Find OOM kills
resource.type="k8s_container"
severity=ERROR
textPayload:"OOMKilled"

# Find slow requests
resource.type="k8s_container"
jsonPayload.duration_ms > 5000

# Find IAM denied actions
logName:"cloudaudit.googleapis.com/activity"
protoPayload.status.code=7
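
The queries above are annotated with # lines, which are notes in this primer rather than part of the filter. To reuse one on the command line, a sketch like the following may help. The helper to_filter is hypothetical; it drops the # lines and blank lines and joins the rest with AND, which matches how separate lines are combined in a query.

```shell
# Hypothetical helper: turn a multi-line query read from stdin into a
# single-line filter. Lines starting with # and blank lines are dropped;
# the remaining lines are joined with " AND ".
to_filter() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' |
    awk 'NR > 1 { printf " AND " } { printf "%s", $0 } END { print "" }'
}

# Assumed usage (requires gcloud; query.txt is a placeholder file
# holding one of the queries above):
#   gcloud logging read "$(to_filter < query.txt)" --limit=20
```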

4. GKE Debugging

# Check cluster status
gcloud container clusters list
gcloud container clusters describe my-cluster --zone us-central1-a

# Get credentials (configure kubectl)
gcloud container clusters get-credentials my-cluster --zone us-central1-a

# Check node pool status
gcloud container node-pools list --cluster=my-cluster --zone us-central1-a

# Check cluster autoscaler events (scale-up and scale-down decisions)
gcloud logging read 'resource.type="k8s_cluster" AND log_id("container.googleapis.com/cluster-autoscaler-visibility")' --limit=20

# GKE-specific diagnostics
kubectl get nodes -o wide
kubectl describe node gke-my-cluster-default-pool-abc123
kubectl top nodes
kubectl top pods -n production

# Check if a pod is stuck due to resource quota
kubectl describe resourcequota -n production
kubectl get events -n production --sort-by=.lastTimestamp

# Node auto-repair status
gcloud container node-pools describe default-pool \
  --cluster=my-cluster --zone us-central1-a \
  --format="value(management.autoRepair,management.autoUpgrade)"

# Check workload identity configuration
kubectl get serviceaccount -n production -o yaml | grep -A5 annotations
# Should show: iam.gke.io/gcp-service-account: my-gsa@my-project.iam.gserviceaccount.com

5. Load Balancer Issues

# List forwarding rules (load balancer frontends)
gcloud compute forwarding-rules list

# Check backend service health
gcloud compute backend-services get-health my-backend-service --global
# Output shows HEALTHY/UNHEALTHY for each instance or NEG

# Check health check configuration
gcloud compute health-checks list
gcloud compute health-checks describe my-health-check

# Debug unhealthy backends
# 1. Check health check path/port matches app config
# 2. Check firewall rule allows health check source ranges
gcloud compute firewall-rules list --filter="name~health"
# GCP health check source ranges: 35.191.0.0/16, 130.211.0.0/22

# Check URL map (routing rules for HTTP(S) LB)
gcloud compute url-maps describe my-url-map

# Check SSL certificate status
gcloud compute ssl-certificates list
gcloud compute ssl-certificates describe my-cert

# View load balancer logs
gcloud logging read 'resource.type="http_load_balancer" AND httpRequest.status>=500' \
  --limit=20 --format=json

Common LB issues:

502 Bad Gateway:
  - Backend is unhealthy — check health check status
  - Backend not ready — app starting up, check readiness probe
  - Timeout — backend takes too long, increase timeout setting

403 Forbidden:
  - Cloud Armor policy blocking request
  - IAP (Identity-Aware Proxy) denying access

Health check failing:
  - Missing firewall rule for 35.191.0.0/16 and 130.211.0.0/22
  - Health check port does not match container port
  - Health check path returns non-200
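
The missing-firewall-rule case can be checked mechanically. The sketch below assumes you pass a rule's source ranges as arguments; missing_hc_ranges is a hypothetical helper, and it compares strings exactly, so a broader range that happens to cover the probe ranges (such as 0.0.0.0/0) is reported as missing.

```shell
# Hypothetical helper: print the health check source ranges that are NOT
# present among the given firewall source ranges (exact string match).
missing_hc_ranges() {
  local required range found
  for required in 35.191.0.0/16 130.211.0.0/22; do
    found=no
    for range in "$@"; do
      if [ "$range" = "$required" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      echo "$required"
    fi
  done
}

# Assumed usage (requires gcloud and jq; allow-health-check is a placeholder rule name):
#   missing_hc_ranges $(gcloud compute firewall-rules describe allow-health-check \
#     --format=json | jq -r '.sourceRanges[]')
```

No output means both ranges are listed on the rule; the rule must still allow the health check port and apply to the backend instances.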

6. gcloud CLI Patterns

# Find resources across all projects you have access to
for project in $(gcloud projects list --format="value(projectId)"); do
  echo "=== ${project} ==="
  gcloud compute instances list --project="${project}" 2>/dev/null
done

# Get serial port output (boot diagnostics)
gcloud compute instances get-serial-port-output my-instance --zone us-central1-a

# SSH with IAP tunneling (no public IP needed)
gcloud compute ssh my-instance --zone us-central1-a --tunnel-through-iap

# Port forward through IAP
gcloud compute start-iap-tunnel my-instance 8080 --local-host-port=localhost:8080 --zone us-central1-a

# Quick resource inventory
gcloud asset search-all-resources --scope=projects/my-project --asset-types="compute.googleapis.com/Instance"

# Check project-level quotas (an exhausted quota can surface as a failed create or scale-up operation)
gcloud compute project-info describe --project my-project \
  --flatten="quotas[]" \
  --format="table(quotas.metric,quotas.limit,quotas.usage)" | grep -i cpu

# List recent operations (what changed recently?)
gcloud compute operations list --filter="status=DONE" --limit=20 --sort-by=~insertTime
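
For the quota check, scanning the table by eye is error-prone. As a rough sketch, the hypothetical helper quotas_near_limit below reads whitespace-separated "metric limit usage" rows and prints only those at or above a threshold percentage; the default of 80 is an arbitrary choice.

```shell
# Hypothetical helper: read "metric limit usage" rows on stdin and print the
# metrics whose usage is at or above a threshold percentage (default 80).
quotas_near_limit() {
  awk -v threshold="${1:-80}" '
    $2 > 0 && ($3 / $2) * 100 >= threshold {
      printf "%s %.0f%%\n", $1, ($3 / $2) * 100
    }'
}

# Assumed usage (requires gcloud):
#   gcloud compute project-info describe --project my-project \
#     --flatten="quotas[]" \
#     --format="value(quotas.metric,quotas.limit,quotas.usage)" | quotas_near_limit 80
```

Rows with a limit of zero are skipped to avoid dividing by zero.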

Key Takeaway

GCP troubleshooting follows a pattern: confirm your identity and project (gcloud auth list), check IAM bindings (project-level and resource-level, remembering inheritance), check the network path (firewall rules with correct tags, routes, NAT), and use Cloud Logging to find errors. For GKE, combine gcloud diagnostics with kubectl: check node pools, workload identity, and resource quotas. The gcloud policy-troubleshoot and gcloud network-management connectivity-tests commands are purpose-built debugging tools that are easy to overlook.

