GCP Troubleshooting¶
36 cards — 🟢 7 easy | 🟡 11 medium | 🔴 7 hard
🟢 Easy (7)¶
1. A developer gets "403 Forbidden" calling a GCP API. What do you check first?
Show answer
1) Confirm the caller identity with 'gcloud auth list'.
2) Check IAM bindings on the resource for the correct role.
3) Look for Deny policies (IAM Deny).
4) Check Organization Policy constraints.
5) Verify the API is enabled on the project.
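A quick diagnostic pass over these steps might look like the following sketch. The project ID, principal, and API name are placeholders; substitute your own:

```shell
# Placeholders: my-project, dev@example.com (substitute your own values).
gcloud auth list                                    # 1) who is the caller?
gcloud projects get-iam-policy my-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:user:dev@example.com" \
  --format="table(bindings.role)"                   # 2) which roles do they hold?
gcloud services list --enabled | grep compute       # 5) is the API enabled?
```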
2. You created a firewall rule to allow port 443 but traffic is still blocked. What do you check?
Show answer
1) Verify the rule targets the correct network and instances (check target tags or target service accounts).
2) Confirm the source range is correct (0.0.0.0/0 for public).
3) Check rule priority — a higher-priority deny rule may be winning.
4) Verify the instance's network tag matches the firewall rule's target tag.
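A sketch of checks 3 and 4, assuming a VM named my-vm in us-central1-a (both placeholders):

```shell
# List rules by priority; a lower number wins, so look for a deny above your allow.
gcloud compute firewall-rules list --sort-by=priority \
  --format="table(name,priority,direction,sourceRanges.list(),targetTags.list())"
# Compare the rule's target tags against the instance's network tags.
gcloud compute instances describe my-vm --zone=us-central1-a \
  --format="value(tags.items)"
```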
3. You cannot SSH into a GCE instance. What do you check?
Show answer
1) Firewall rule allowing TCP:22 from your IP (the default-allow-ssh rule may have been deleted).
2) Instance must have an external IP or you must use IAP tunneling.
3) OS Login or metadata-based SSH keys must be configured.
4) The instance must be running and the guest OS must have sshd running.
5) Check the serial console output for boot issues.
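For steps 2 and 5, a minimal recipe (my-vm and us-central1-a are placeholders; IAP tunneling also needs roles/iap.tunnelResourceAccessor and a firewall rule allowing 35.235.240.0/20 on TCP:22):

```shell
# SSH over IAP when the VM has no external IP.
gcloud compute ssh my-vm --zone=us-central1-a --tunnel-through-iap
# If SSH still fails, read the boot log without logging in.
gcloud compute instances get-serial-port-output my-vm --zone=us-central1-a
```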
4. A backend service behind an HTTP(S) load balancer shows all instances as unhealthy. What do you check?
Show answer
1) The health check firewall rule must allow traffic from 130.211.0.0/22 and 35.191.0.0/16 (Google health check probe ranges).
2) The health check path must return HTTP 200.
3) The health check port must match the port the application listens on.
4) The instance must be in a running state.
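The probe-range rule from step 1 can be created like this (my-network is a placeholder; adjust the port to the one your health check targets):

```shell
# Allow Google health check probes to reach the backends.
gcloud compute firewall-rules create allow-health-checks \
  --network=my-network --direction=INGRESS --allow=tcp:80 \
  --source-ranges=130.211.0.0/22,35.191.0.0/16
```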
5. A Cloud Storage upload fails with "403 Access Denied" even though the user has roles/storage.objectCreator. What could be wrong?
Show answer
1) roles/storage.objectCreator allows creating new objects but NOT overwriting existing ones; an overwrite also requires storage.objects.delete, which is included in roles/storage.objectAdmin.
2) A bucket-level IAM policy or ACL may deny the action.
3) VPC Service Controls may block access from outside the perimeter.
4) Uniform bucket-level access may conflict with legacy ACLs.
6. You cannot find expected logs in Cloud Logging. What do you check?
Show answer
1) The correct project is selected.
2) The log sink/router has not excluded the logs (check _Default sink exclusion filters).
3) The logging agent (Ops Agent) is installed and running on the VM.
4) The retention period has not expired (default 30 days).
5) The resource type and log name filters in the Logs Explorer are correct.
7. A GCE instance starts but the application does not come up. How do you debug?
Show answer
1) Check the serial port output: 'gcloud compute instances get-serial-port-output'.
2) Review the startup script logs at /var/log/syslog or /var/log/messages.
3) Use the Cloud Console's serial console for interactive debugging.
4) Check the metadata server for the startup script: 'curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/startup-script'.
5) Verify the startup script exit code in Cloud Logging.
🟡 Medium (11)¶
1. A workload on GCE has no access to Cloud Storage despite the VM's service account having roles/storage.objectViewer. What could be wrong?
Show answer
1) The VM may have been created with insufficient access scopes — scopes AND IAM both must allow the action.
2) Check that the correct service account is attached (gcloud compute instances describe).
3) Bucket-level IAM or ACL may have an explicit deny.
2. What is the difference between GCP firewall rules and routes, and how can confusing them cause connectivity issues?
Show answer
Firewall rules filter traffic (allow/deny by protocol, port, source/target). Routes determine where traffic is forwarded (next hop). A route can send traffic to the correct destination, but a firewall rule can still block it. Always check both: 'gcloud compute routes list' and 'gcloud compute firewall-rules list'. The implied allow-egress and deny-ingress defaults also matter.
3. Instances in two different VPC networks cannot communicate. Peering is set up. What do you check?
Show answer
1) VPC peering must be established in both directions (each side must create a peering connection).
2) Peering does not export/import custom routes by default — enable 'export custom routes' and 'import custom routes' if using custom routes.
3) Firewall rules must explicitly allow traffic from the peer CIDR.
4) Subnet CIDRs must not overlap.
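A sketch of how to verify steps 1 and 4 (my-network is a placeholder):

```shell
# ACTIVE means both sides created the peering; a one-sided peering stays INACTIVE.
gcloud compute networks peerings list --network=my-network
# List subnet CIDRs so you can compare the two networks for overlap.
gcloud compute networks subnets list --filter="network:my-network" \
  --format="table(name,region,ipCidrRange)"
```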
4. Instances in a private subnet cannot reach the internet after setting up Cloud NAT. What do you troubleshoot?
Show answer
1) Cloud NAT must be configured on the correct Cloud Router in the correct region.
2) The NAT must be mapped to the correct subnets (or set to all subnets).
3) Check that the route table has a default route (0.0.0.0/0) to the default internet gateway.
4) Cloud NAT only covers TCP, UDP, and ICMP.
5) Check NAT gateway logs in Cloud Logging for dropped connections or port exhaustion.
5. Users get 502 errors from a GCP HTTP(S) load balancer. What do you investigate?
Show answer
1) Backend instances may be unhealthy — check health check status.
2) The backend may be timing out (default timeout is 30s).
3) The backend may be returning responses the LB cannot parse.
4) Check that the named port on the instance group matches the backend service port.
5) Review Cloud Logging for the load balancer (resource.type='http_load_balancer') to see statusDetails.
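For step 5, a query sketch that surfaces the LB's own explanation for each 502 (statusDetails values such as backend_timeout or failed_to_pick_backend):

```shell
# Pull recent 502s from the external HTTP(S) LB request logs.
gcloud logging read \
  'resource.type="http_load_balancer" AND httpRequest.status=502' \
  --limit=10 --format="value(jsonPayload.statusDetails, httpRequest.requestUrl)"
```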
6. A pod in GKE is stuck in CrashLoopBackOff. What GCP-specific things do you check beyond standard Kubernetes debugging?
Show answer
1) If the pod uses Workload Identity, verify the KSA-to-GSA binding and that the GSA has correct IAM roles.
2) Check if the node pool has sufficient resources (may need to check cluster autoscaler).
3) If pulling from Artifact Registry/GCR, verify the node service account has roles/artifactregistry.reader.
4) Check GKE-specific logs in Cloud Logging under k8s_container.
7. A GKE pod using Workload Identity gets "403 Permission Denied" calling a GCP API. What do you check?
Show answer
1) KSA annotation: 'iam.gke.io/gcp-service-account' must reference the correct GSA.
2) GSA must have the IAM binding: roles/iam.workloadIdentityUser for the KSA.
3) The GSA must have the required role on the target resource.
4) The pod spec must set serviceAccountName to the annotated KSA.
5) Workload Identity must be enabled on the cluster and node pool.
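Steps 1 and 2 can be wired up as follows. All names (my-project, my-namespace, my-ksa, my-gsa) are placeholders:

```shell
# 2) Allow the KSA to impersonate the GSA.
gcloud iam service-accounts add-iam-policy-binding \
  my-gsa@my-project.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"
# 1) Annotate the KSA with the GSA it maps to.
kubectl annotate serviceaccount my-ksa -n my-namespace \
  iam.gke.io/gcp-service-account=my-gsa@my-project.iam.gserviceaccount.com
```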
8. Objects in a Cloud Storage bucket are disappearing unexpectedly. What do you check?
Show answer
1) Inspect the bucket's lifecycle rules (gsutil lifecycle get gs://bucket) — there may be a delete rule based on age, storage class, or creation date.
2) Check object versioning status — if disabled, overwrites destroy previous data.
3) Review Cloud Audit Logs for storage.objects.delete events to identify the actor.
4) Check for retention policies that may have expired.
9. An application on GCE cannot connect to a Cloud SQL instance. What do you troubleshoot?
Show answer
1) Use the Cloud SQL Auth Proxy for secure connections — direct IP requires authorized networks.
2) Check that the GCE instance's IP is in Cloud SQL's authorized networks (for public IP).
3) For private IP: verify both are on the same VPC or peered VPC, and private services access is configured.
4) Verify the Cloud SQL Admin API is enabled.
5) Check the service account has roles/cloudsql.client.
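A minimal sketch of step 1 using the v2 Cloud SQL Auth Proxy (the connection name my-project:us-central1:my-instance is a placeholder):

```shell
# Listen locally on 5432; --private-ip forces the instance's private address.
cloud-sql-proxy --private-ip --port 5432 my-project:us-central1:my-instance
```

The application then connects to 127.0.0.1:5432 and the proxy handles TLS and IAM authorization.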
10. A Cloud Monitoring alert is not firing even though the condition appears met. What do you investigate?
Show answer
1) Check the alignment period and aggregation — the metric may be averaged over a window that hides spikes.
2) The alert may have a duration requirement (e.g., condition must be true for 5 minutes).
3) Verify the notification channel is correctly configured and verified.
4) Check if the alert policy is enabled.
5) Use Metrics Explorer to visualize the raw metric and compare it to the alert threshold.
11. Messages published to a Pub/Sub topic are not being received by the subscriber. What do you check?
Show answer
1) Verify the subscription exists and is attached to the correct topic.
2) Check the subscription type (pull vs push) — pull requires the subscriber to actively call pull.
3) For push subscriptions, verify the endpoint URL is reachable and returns 2xx.
4) Check the ackDeadline — messages not acknowledged are redelivered.
5) Look at the subscription's 'oldest_unacked_message_age' metric to see if messages are accumulating.
🔴 Hard (7)¶
1. A service account in project A cannot access a resource in project B. What do you troubleshoot?
Show answer
1) The service account must be granted an IAM role on the resource in project B.
2) The target project must have the required API enabled.
3) Check for Org Policy constraints like iam.allowedPolicyMemberDomains restricting cross-project access.
4) VPC Service Controls perimeters may block cross-project API calls.
2. Cloud NAT is configured but you see intermittent connection failures from multiple VMs. What is the likely cause?
Show answer
Port exhaustion. Each Cloud NAT IP provides 64,512 ports. With many VMs or high connection rates, you run out of source ports. Fixes:
1) Allocate more NAT IPs.
2) Increase minPortsPerVm.
3) Enable Dynamic Port Allocation.
4) Reduce idle connection timeouts.
5) Check the 'nat_allocation_failed' metric in Cloud Monitoring.
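The capacity arithmetic behind the 64,512 figure is worth internalizing; a sketch at default settings:

```shell
# 65,536 ports minus the first 1,024 reserved ones = 64,512 usable per NAT IP.
PORTS_PER_IP=64512
MIN_PORTS_PER_VM=64          # default static allocation per VM
# Upper bound on VMs one NAT IP can serve at defaults.
echo $(( PORTS_PER_IP / MIN_PORTS_PER_VM ))   # prints 1008
```

Raising minPortsPerVm helps chatty VMs but divides that 1008 ceiling accordingly.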
3. After attaching a Google-managed SSL certificate to an HTTPS load balancer, the certificate stays in PROVISIONING status. What is wrong?
Show answer
1) DNS for the domain must resolve to the load balancer's IP before Google can provision the cert.
2) The domain must be publicly resolvable (not just internal DNS).
3) Port 443 must be open and the forwarding rule must be configured.
4) It can take up to 24 hours.
5) Check 'gcloud compute ssl-certificates describe' for domainStatus — FAILED_NOT_VISIBLE means DNS is not pointing correctly.
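Steps 1 and 5 combined, as a sketch (my-cert and www.example.com are placeholders):

```shell
# domainStatus per domain: FAILED_NOT_VISIBLE usually means DNS is wrong.
gcloud compute ssl-certificates describe my-cert --global \
  --format="get(managed.status, managed.domainStatus)"
# Confirm the public DNS answer actually matches the LB's IP.
dig +short www.example.com
```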
4. Pods in a GKE cluster cannot reach external services. Nodes have no public IPs. What do you troubleshoot?
Show answer
1) Verify Cloud NAT is configured for the GKE subnet (including the secondary ranges used for pods).
2) For private clusters, note that master authorized networks only control access to the control plane; pod egress to the internet still needs Cloud NAT or a proxy.
3) Check if IP masquerade agent is running and configured correctly.
4) Network Policy or Dataplane V2 policies may be blocking egress.
5) Check the default route (0.0.0.0/0) has not been deleted from the VPC.
5. After enabling private IP on Cloud SQL, connections from a peered VPC fail. What is the likely issue?
Show answer
1) Private services access uses VPC peering, and transitive peering is not supported in GCP. If the app is in a VPC peered to the Cloud SQL VPC, you cannot reach Cloud SQL over two hops.
2) Fix: use the Cloud SQL Auth Proxy running in the same VPC with private IP, or configure custom route export/import.
3) DNS resolution must also work — check if the Cloud SQL private IP resolves correctly from the app VPC.
6. A Pub/Sub subscriber is receiving duplicate messages and messages out of order. What do you do?
Show answer
1) Pub/Sub guarantees at-least-once delivery — duplicates are expected. Design the consumer to be idempotent.
2) For ordering, you must set an ordering key on publish AND enable message ordering on the subscription.
3) Even with ordering keys, different keys have no ordering guarantee between them.
4) Ensure the subscriber acks promptly — slow acks cause redelivery.
5) Consider enabling exactly-once delivery (supported for pull subscriptions).
7. API calls from within a VPC Service Controls perimeter succeed, but the same calls from a different project fail with PERMISSION_DENIED. What do you troubleshoot?
Show answer
1) The calling project must be inside the same service perimeter or connected via a service perimeter bridge.
2) Check access levels — an access level based on IP, identity, or device trust may be needed for the external project.
3) Check ingress/egress rules on the perimeter for the specific API and identity.
4) Review VPC-SC audit logs (resource.type='audited_resource') for violations that show the exact rule that blocked the call.
5) Use the VPC Service Controls troubleshooter in the console.