Runbook: Load Balancer Health Check Failure

Domain: Networking
Alert: Unhealthy targets in LB target group, or kube_service_status_load_balancer_ingress missing
Severity: P1
Est. Resolution Time: 20-40 minutes
Escalation Timeout: 30 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: kubectl access; cloud CLI (aws/gcloud/az) installed and authenticated; load balancer console access or CLI permissions

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get svc -A --field-selector spec.type=LoadBalancer
If the EXTERNAL-IP column shows <pending> for more than 5 minutes → LB provisioning failed; see Step 1.
If an external IP exists but traffic is failing → health checks are failing on the existing LB; start at Step 2.

Step 1: Check Service External IP/Hostname

Why: The LoadBalancer service must have an assigned external IP or hostname before any health checks can run.

# Get full details of the service
kubectl describe svc <SERVICE_NAME> -n <NAMESPACE>

# Look at events for provisioning errors
kubectl get events -n <NAMESPACE> --field-selector involvedObject.name=<SERVICE_NAME> --sort-by=.lastTimestamp
Expected output:
LoadBalancer Ingress:     a1b2c3d4e5f.us-east-1.elb.amazonaws.com
If this fails: If LoadBalancer Ingress is blank, check cloud provider quota (too many LBs), subnet tags (on AWS, subnets must be tagged for ELB use), or the cloud-controller-manager logs: kubectl logs -n kube-system -l component=cloud-controller-manager.
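If subnet tagging is the suspect on AWS, a quick sketch for inspecting the tags the cloud controller looks for (placeholders as elsewhere in this runbook; tag keys per AWS EKS/ELB conventions):

```shell
# List subnets associated with the cluster and dump their tags.
# The controller expects kubernetes.io/role/elb (public LBs) or
# kubernetes.io/role/internal-elb (internal LBs) on candidate subnets.
aws ec2 describe-subnets \
  --filters Name=tag-key,Values=kubernetes.io/cluster/<CLUSTER_NAME> \
  --query 'Subnets[*].{Id:SubnetId,Tags:Tags}' --output json
```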

Step 2: Check NodePort Reachability

Why: A LoadBalancer service routes traffic through NodePorts on cluster nodes. If the NodePort is unreachable, the LB health check fails even if the app is healthy.

# Find the NodePort assigned to the service
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.ports[*].nodePort}'

# Get node IPs
kubectl get nodes -o wide

# Test NodePort directly from a machine that can reach the nodes
curl -v http://<NODE_IP>:<NODE_PORT>/
Expected output:
< HTTP/1.1 200 OK
If this fails: If connection is refused or times out, the app is not listening or the firewall blocks the NodePort. Check Step 3.
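If no machine outside the cluster can reach the nodes directly, one workaround is to run the same test from a throwaway pod inside the cluster (pod name and image here are illustrative, not part of the original procedure):

```shell
# Curl the NodePort from inside the cluster; --max-time keeps a
# firewall-blackholed connection from hanging the terminal
kubectl run nodeport-test --image=curlimages/curl --restart=Never --rm -it \
  -- curl -v --max-time 5 http://<NODE_IP>:<NODE_PORT>/
```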

Step 3: Check Security Group / Firewall Rules

Why: Cloud firewalls (AWS Security Groups, GCP Firewall Rules, Azure NSGs) frequently block the health check port or source IP range.

# For AWS: find the node security group
aws ec2 describe-security-groups \
  --filters Name=tag:kubernetes.io/cluster/<CLUSTER_NAME>,Values=owned \
  --query 'SecurityGroups[*].[GroupId,GroupName]' --output table

# Check inbound rules for the health check port
aws ec2 describe-security-groups --group-ids <SG_ID> \
  --query 'SecurityGroups[*].IpPermissions' --output json

# For GCP:
gcloud compute firewall-rules list --filter="network=<NETWORK_NAME>"

# For Azure:
az network nsg rule list --resource-group <RG_NAME> --nsg-name <NSG_NAME> --output table
Expected output:
# Should show an inbound rule allowing the health check source CIDR
# For AWS ALB and NLB, health checks originate from the load balancer's
# private IPs inside its subnets
If this fails: Add an inbound rule allowing TCP on <NODE_PORT> from the LB health check source: the LB subnet CIDRs, or the whole VPC CIDR as a coarser rule.
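On AWS, the fix can be applied from the CLI; a sketch, with the CIDR left as a placeholder since the correct source depends on your LB subnets:

```shell
# Allow the LB health check source to reach the NodePort on the nodes.
# <LB_SUBNET_CIDR> is a placeholder — substitute the LB subnet CIDRs
# (or the VPC CIDR for a coarser rule).
aws ec2 authorize-security-group-ingress \
  --group-id <SG_ID> \
  --protocol tcp \
  --port <NODE_PORT> \
  --cidr <LB_SUBNET_CIDR>
```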

Step 4: Check Health Check Endpoint Responds

Why: The LB probes a specific HTTP path and port. If the app returns non-2xx or the port does not match, targets stay unhealthy.

# Find what health check path and port the LB is configured with
# (Check cloud console or annotation on the service)
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o yaml | grep -i health

# Test the exact health check path from inside a pod on the same node
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  wget -qO- http://localhost:<APP_PORT><HEALTH_CHECK_PATH>

# Test the pod directly from a throwaway debug pod in the cluster
kubectl run debug-pod --image=curlimages/curl --restart=Never --rm -it \
  -- curl -v http://<POD_IP>:<APP_PORT><HEALTH_CHECK_PATH>
Expected output:
HTTP/1.1 200 OK
If this fails: If the health endpoint returns 4xx or 5xx, the app itself is unhealthy. Check pod logs: kubectl logs <POD_NAME> -n <NAMESPACE>. If returning 200 from inside but failing from LB, it is a network path issue — re-examine Steps 2-3.
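To reproduce the LB's exact network path more closely, a pod sharing the node's network namespace can probe the NodePort the way the health check does (pod name is illustrative; the --overrides JSON is one way to request hostNetwork from kubectl run):

```shell
# Probe from the node's own network namespace, mimicking the
# health check path from outside the pod network
kubectl run node-net-test --image=curlimages/curl --restart=Never --rm -it \
  --overrides='{"apiVersion":"v1","spec":{"hostNetwork":true}}' \
  -- curl -v http://<NODE_IP>:<NODE_PORT><HEALTH_CHECK_PATH>
```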

Step 5: Check Backend Pod Readiness

Why: Kubernetes will not forward traffic to pods that are not Ready, so if all pods are unready the service has no endpoints and the LB targets stay unhealthy.

# Check pod readiness
kubectl get pods -n <NAMESPACE> -l <SELECTOR_LABEL>=<SELECTOR_VALUE>

# Check endpoints — should list pod IPs
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>

# Describe pods to see failing readiness probes
kubectl describe pods -n <NAMESPACE> -l <SELECTOR_LABEL>=<SELECTOR_VALUE> \
  | grep -A 10 "Readiness\|Conditions"
Expected output:
NAME           ENDPOINTS                        AGE
my-service     10.244.1.5:8080,10.244.2.3:8080  5d
If this fails: If ENDPOINTS shows <none>, all pods are unready. Fix the failing readiness probe or the underlying app issue first.
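Once the probe or app issue is fixed, the recovery can be watched live (same placeholders as above; the rollout restart is only needed if pods must pick up a config change):

```shell
# Pod IPs should appear in ENDPOINTS as pods turn Ready
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE> -w

# If pods stay unready after a config fix, restart the rollout
kubectl rollout restart deployment <DEPLOYMENT_NAME> -n <NAMESPACE>
```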

Step 6: Check if Health Check Port/Path Matches

Why: A common misconfiguration is the LB health check pointing at a different port or path than what the app actually serves.

# For AWS ALB (annotation-driven):
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o yaml \
  | grep -E "healthcheck|health-check"

# Common AWS ALB annotations to check:
# alb.ingress.kubernetes.io/healthcheck-path
# alb.ingress.kubernetes.io/healthcheck-port

# Check what port/path the app actually exposes
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o yaml \
  | grep -A 10 "readinessProbe\|livenessProbe"
Expected output:
# Health check path in annotation should match readinessProbe httpGet path
# Health check port should match containerPort
If this fails: Update the service annotation to the correct path and port. With the AWS Load Balancer Controller, annotation changes are reconciled automatically within a minute or so; for GCP/Azure, you may need to update the backend service health check via the cloud console.
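For the AWS ALB case, the annotation can be corrected in place; a sketch, where the /healthz path is illustrative and should be replaced with whatever the readinessProbe actually uses:

```shell
# Point the health check at the path the app actually serves;
# --overwrite replaces any existing value of the annotation
kubectl annotate svc <SERVICE_NAME> -n <NAMESPACE> \
  alb.ingress.kubernetes.io/healthcheck-path=/healthz --overwrite
```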

Verification

# Confirm the issue is resolved — check target health from cloud CLI
# AWS:
aws elbv2 describe-target-health \
  --target-group-arn <TARGET_GROUP_ARN> \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table

# GCP:
gcloud compute backend-services get-health <BACKEND_SERVICE_NAME> --global

# Then test end-to-end from outside
curl -v https://<LB_HOSTNAME>/
Success looks like: All targets show healthy in the cloud CLI output, and the external curl returns HTTP 200.
If still broken: Escalate — see below.
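Targets can take a few probe intervals to flip back to healthy; a convenience loop for AWS, assuming watch is available on your workstation:

```shell
# Poll target health every 10s until every target reports "healthy"
watch -n 10 "aws elbv2 describe-target-health \
  --target-group-arn <TARGET_GROUP_ARN> \
  --query 'TargetHealthDescriptions[*].TargetHealth.State' --output text"
```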

Escalation

  • Not resolved in 30 min → page Platform/Infrastructure on-call: "LB health checks failing for <SERVICE_NAME>, all targets unhealthy, external traffic completely down"
  • Data loss suspected → page Application team lead: "Traffic routing failure for <SERVICE_NAME>, requests may be dropping silently"
  • Scope expanding to multiple services → page SRE lead: "Multiple LBs failing health checks, possible cloud provider networking issue or CNI problem"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Wrong health check path configured on LB: The LB sends a GET to a specific path — if the annotation says /health but the app only exposes /healthz, all targets will be unhealthy indefinitely even though the app is perfectly fine.
  2. Security group blocking health check port: For AWS ALB/NLB, the health check source IPs must be allowed by the node security group. Forgetting this rule is the single most common cause of this alert.
