- networking
- l2
- runbook
- load-balancing
- networking-troubleshooting

Portal | Level: L2: Operations | Topics: Load Balancing, Networking Troubleshooting | Domain: Networking
Runbook: Load Balancer Health Check Failure¶
| Field | Value |
|---|---|
| Domain | Networking |
| Alert | Unhealthy targets in LB target group or kube_service_status_load_balancer_ingress missing |
| Severity | P1 |
| Est. Resolution Time | 20-40 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cloud CLI (aws/gcloud/az) installed and authenticated, load balancer console access or CLI permissions |
Quick Assessment (30 seconds)¶
```shell
# Run this first — it tells you the scope of the problem
kubectl get svc -A --field-selector spec.type=LoadBalancer
```
- If the EXTERNAL-IP column shows <pending> for more than 5 minutes → LB provisioning failed, start at Step 1
- If an external IP exists but traffic is failing → health checks are failing on an existing LB, start at Step 2
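The triage decision above can be sketched as a small helper. This is a minimal sketch: classify_lb is a hypothetical function, not a kubectl feature, and the jsonpath expression in the commented usage is an assumption about your service objects.

```shell
# classify_lb EXTERNAL_IP : map the EXTERNAL-IP column to the step to start at
# (hypothetical helper for triage, not part of kubectl)
classify_lb() {
  case "$1" in
    "<pending>"|"") echo "provisioning failed: start at Step 1" ;;
    *)              echo "LB exists: start at Step 2" ;;
  esac
}

# Feed it live data (requires cluster access):
# kubectl get svc -A --field-selector spec.type=LoadBalancer \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.loadBalancer.ingress[0].ip}{"\n"}{end}' |
#   while read -r name ip; do echo "$name: $(classify_lb "$ip")"; done
```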
Step 1: Check Service External IP/Hostname¶
Why: The LoadBalancer service must have an assigned external IP or hostname before any health checks can run.
```shell
# Get full details of the service
kubectl describe svc <SERVICE_NAME> -n <NAMESPACE>

# Look at events for provisioning errors
kubectl get events -n <NAMESPACE> --field-selector involvedObject.name=<SERVICE_NAME> --sort-by=.lastTimestamp
```
If LoadBalancer Ingress is blank, check cloud provider quota (too many LBs), subnet tags (on AWS, subnets must be tagged for ELB use), or the cloud-controller-manager logs: kubectl logs -n kube-system -l component=cloud-controller-manager.
Step 2: Check NodePort Reachability¶
Why: A LoadBalancer service routes traffic through NodePorts on cluster nodes. If the NodePort is unreachable, the LB health check fails even if the app is healthy.
```shell
# Find the NodePort assigned to the service
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.ports[*].nodePort}'

# Get node IPs
kubectl get nodes -o wide

# Test NodePort directly from a machine that can reach the nodes
curl -v http://<NODE_IP>:<NODE_PORT>/
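Testing one node at a time gets tedious on a large cluster. A minimal sketch of a loop over every node follows; probe_curl and probe_all are hypothetical helpers, and the jsonpath filter in the commented usage assumes nodes expose reachable InternalIP addresses.

```shell
# probe_curl IP PORT : print the HTTP status code (000 means no answer at all)
probe_curl() { curl -s -o /dev/null -w '%{http_code}' --max-time 3 "http://$1:$2/"; }

# probe_all PORT IP... : probe the NodePort on each node IP (hypothetical helper)
probe_all() {
  port=$1; shift
  for ip in "$@"; do
    printf '%s:%s -> %s\n' "$ip" "$port" "$(${PROBER:-probe_curl} "$ip" "$port")"
  done
}

# With cluster access, collect node InternalIPs and probe them:
# probe_all "$NODE_PORT" $(kubectl get nodes \
#   -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')
```

The prober is swappable via PROBER, so the same loop works with wget or a TCP-only check if curl is unavailable.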
Step 3: Check Security Group / Firewall Rules¶
Why: Cloud firewalls (AWS Security Groups, GCP Firewall Rules, Azure NSGs) frequently block the health check port or source IP range.
```shell
# For AWS: find the node security group
aws ec2 describe-security-groups \
  --filters Name=tag:kubernetes.io/cluster/<CLUSTER_NAME>,Values=owned \
  --query 'SecurityGroups[*].[GroupId,GroupName]' --output table

# Check inbound rules for the health check port
aws ec2 describe-security-groups --group-ids <SG_ID> \
  --query 'SecurityGroups[*].IpPermissions' --output json

# For GCP:
gcloud compute firewall-rules list --filter="network=<NETWORK_NAME>"

# For Azure:
az network nsg rule list --resource-group <RG_NAME> --nsg-name <NSG_NAME> --output table

# Should show an inbound rule allowing the health check source CIDR.
# For AWS NLB, health checks come from the node's own IP.
# For AWS ALB, health checks come from the LB subnet CIDRs.
```
Ensure an inbound rule allows traffic to <NODE_PORT> from the LB health check source. For AWS ALB this is the VPC CIDR; for NLB it is the node IP itself.
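When scripting the fix, the ALB/NLB distinction above can be captured in a helper. A minimal sketch follows; hc_source is hypothetical, and the commented authorize call assumes you hold the ec2:AuthorizeSecurityGroupIngress permission.

```shell
# hc_source TYPE VPC_CIDR NODE_IP : which source CIDR to allow (hypothetical helper)
hc_source() {
  case "$1" in
    alb) echo "$2" ;;      # ALB health checks come from the LB subnets / VPC CIDR
    nlb) echo "$3/32" ;;   # NLB health checks come from the node's own IP
    *)   echo "unknown LB type: $1" >&2; return 1 ;;
  esac
}

# Then open the NodePort from that source (AWS example):
# aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
#   --protocol tcp --port "$NODE_PORT" --cidr "$(hc_source nlb "" "$NODE_IP")"
```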
Step 4: Check Health Check Endpoint Responds¶
Why: The LB probes a specific HTTP path and port. If the app returns non-2xx or the port does not match, targets stay unhealthy.
```shell
# Find what health check path and port the LB is configured with
# (check cloud console or annotations on the service)
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o yaml | grep -i health

# Test the exact health check path from inside a pod on the same node
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  wget -qO- http://localhost:<APP_PORT><HEALTH_CHECK_PATH>

# Test from a debug pod on the node
kubectl run debug-pod --image=curlimages/curl --restart=Never --rm -it \
  -- curl -v http://<POD_IP>:<APP_PORT><HEALTH_CHECK_PATH>
```
If the endpoint does not return 2xx, check the application logs: kubectl logs <POD_NAME> -n <NAMESPACE>. If it returns 200 from inside the cluster but fails from the LB, it is a network path issue; re-examine Steps 2-3.
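Because the LB counts only 2xx responses as healthy, a small status classifier makes it easy to bulk-check every backend pod. This is a sketch: hc_ok is a hypothetical helper, and the commented loop assumes wget is available inside the pods.

```shell
# hc_ok CODE : succeed only for 2xx status codes, mirroring typical LB health check rules
hc_ok() { case "$1" in 2[0-9][0-9]) return 0 ;; *) return 1 ;; esac; }

# Check every backend pod (requires cluster access; wget-in-pod is an assumption):
# for pod in $(kubectl get pods -n "$NAMESPACE" -l "$SELECTOR" -o name); do
#   code=$(kubectl exec "${pod#pod/}" -n "$NAMESPACE" -- \
#     wget -O /dev/null -S "http://localhost:$APP_PORT$HEALTH_CHECK_PATH" 2>&1 \
#     | awk '/HTTP\//{print $2; exit}')
#   hc_ok "$code" && echo "${pod#pod/}: OK ($code)" || echo "${pod#pod/}: FAIL ($code)"
# done
```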
Step 5: Check Backend Pod Readiness¶
Why: Kubernetes will not forward traffic to pods that are not Ready, so if all pods are unready the service has no endpoints and the LB targets stay unhealthy.
```shell
# Check pod readiness
kubectl get pods -n <NAMESPACE> -l <SELECTOR_LABEL>=<SELECTOR_VALUE>

# Check endpoints — should list pod IPs
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>

# Describe pods to see failing readiness probes
kubectl describe pods -n <NAMESPACE> -l <SELECTOR_LABEL>=<SELECTOR_VALUE> \
  | grep -A 10 "Readiness\|Conditions"
```
If ENDPOINTS shows <none>, all pods are unready. Fix the failing readiness probe or the underlying app issue first.
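The endpoints check can be scripted the same way. A minimal sketch: has_endpoints is a hypothetical helper, and the jsonpath in the commented usage assumes the classic Endpoints object (not EndpointSlices).

```shell
# has_endpoints VALUE : succeed if the service has at least one backing pod IP
has_endpoints() { case "$1" in ""|"<none>") return 1 ;; *) return 0 ;; esac; }

# Against a live cluster:
# eps=$(kubectl get endpoints "$SERVICE_NAME" -n "$NAMESPACE" \
#   -o jsonpath='{.subsets[*].addresses[*].ip}')
# has_endpoints "$eps" || echo "no ready endpoints: fix readiness probes first"
```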
Step 6: Check if Health Check Port/Path Matches¶
Why: A common misconfiguration is the LB health check pointing at a different port or path than what the app actually serves.
```shell
# For AWS ALB (annotation-driven):
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o yaml \
  | grep -E "healthcheck|health-check"

# Common AWS ALB annotations to check:
#   alb.ingress.kubernetes.io/healthcheck-path
#   alb.ingress.kubernetes.io/healthcheck-port

# Check what port/path the app actually exposes
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o yaml \
  | grep -A 10 "readinessProbe\|livenessProbe"

# Health check path in annotation should match readinessProbe httpGet path
# Health check port should match containerPort
```
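To catch the mismatch automatically, compare the annotation value against the readiness probe path. This is a sketch under assumptions: paths_match is a hypothetical helper, and the annotation key and container index in the commented jsonpath expressions may differ in your setup.

```shell
# paths_match LB_PATH PROBE_PATH : report whether LB and readiness probe agree
# (hypothetical helper)
paths_match() {
  if [ "$1" = "$2" ]; then echo "match: $1"; else echo "MISMATCH: LB=$1 probe=$2"; fi
}

# Pull both values from the cluster (annotation key and container index are assumptions):
# lb_path=$(kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" \
#   -o jsonpath='{.metadata.annotations.alb\.ingress\.kubernetes\.io/healthcheck-path}')
# probe_path=$(kubectl get deployment "$DEPLOYMENT_NAME" -n "$NAMESPACE" \
#   -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.httpGet.path}')
# paths_match "$lb_path" "$probe_path"
```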
Verification¶
```shell
# Confirm the issue is resolved — check target health from cloud CLI
# AWS:
aws elbv2 describe-target-health \
  --target-group-arn <TARGET_GROUP_ARN> \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table

# GCP:
gcloud compute backend-services get-health <BACKEND_SERVICE_NAME> --global

# Then test end-to-end from outside
curl -v https://<LB_HOSTNAME>/
```
Resolved when: every target reports healthy in the cloud CLI output and the external curl returns HTTP 200.
If still broken, escalate (see Escalation below).
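Rather than re-running the describe-target-health command by hand, a small polling loop can watch for recovery. This is a sketch: wait_healthy is a hypothetical helper that takes any command printing target states plus an attempt count, and POLL_SLEEP (default 10s) controls the interval.

```shell
# wait_healthy CMD ATTEMPTS : poll CMD until no target is unhealthy/initial/draining
# (hypothetical helper; CMD must print the target states on stdout)
wait_healthy() {
  attempts=$2
  while [ "$attempts" -gt 0 ]; do
    states=$("$1")
    case "$states" in
      *unhealthy*|*initial*|*draining*)
        attempts=$((attempts - 1)); sleep "${POLL_SLEEP:-10}" ;;
      *) echo "all targets healthy"; return 0 ;;
    esac
  done
  echo "timed out; last states: $states"; return 1
}

# AWS example (requires elasticloadbalancing:DescribeTargetHealth):
# check() { aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" \
#   --query 'TargetHealthDescriptions[*].TargetHealth.State' --output text; }
# wait_healthy check 18   # roughly 3 minutes at the default 10s interval
```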
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Platform/Infrastructure on-call | "LB health checks failing for <SERVICE_NAME>, runbook steps exhausted, not resolved after 30 min" |
| Data loss suspected | Application team lead | "Traffic routing failure for <SERVICE_NAME>, requests may be getting dropped" |
| Scope expanding to multiple services | SRE lead | "Multiple LBs failing health checks, possible cloud provider networking issue or CNI problem" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes¶
- Wrong health check path configured on LB: The LB sends a GET to a specific path. If the annotation says /health but the app only exposes /healthz, all targets will be unhealthy indefinitely even though the app is perfectly fine.
- Security group blocking health check port: For AWS ALB/NLB, the health check source IPs must be allowed by the node security group. Forgetting this rule is the single most common cause of this alert.
Cross-References¶
- Topic Pack: Cloud Load Balancing and Kubernetes Services (deep background)
- Related Runbook: Network Partition