- networking
- l2
- runbook
- load-balancing
- networking-troubleshooting

Portal | Level: L2: Operations | Topics: Load Balancing, Networking Troubleshooting | Domain: Networking
Runbook: Load Balancer Health Check Failure¶
| Field | Value |
|---|---|
| Domain | Networking |
| Alert | Unhealthy targets in LB target group or kube_service_status_load_balancer_ingress missing |
| Severity | P1 |
| Est. Resolution Time | 20-40 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cloud CLI (aws/gcloud/az) installed and authenticated, load balancer console access or CLI permissions |
Quick Assessment (30 seconds)¶
```shell
# Run this first — it tells you the scope of the problem
kubectl get svc -A --field-selector spec.type=LoadBalancer
```
- If the EXTERNAL-IP column shows <pending> for more than 5 minutes → LB provisioning failed, start at Step 1
- If an external IP exists but traffic is failing → health checks are failing on an existing LB, start at Step 2
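The triage decision above can be sketched as a small helper. This is a minimal sketch: classify_lb is a hypothetical function, not a kubectl feature, and the jsonpath expression in the commented usage is an assumption about your service objects.

```shell
# classify_lb EXTERNAL_IP : map the EXTERNAL-IP column to the step to start at
# (hypothetical helper for triage, not part of kubectl)
classify_lb() {
  case "$1" in
    "<pending>"|"") echo "provisioning failed: start at Step 1" ;;
    *)              echo "LB exists: start at Step 2" ;;
  esac
}

# Feed it live data (requires cluster access):
# kubectl get svc -A --field-selector spec.type=LoadBalancer \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.loadBalancer.ingress[0].ip}{"\n"}{end}' |
#   while read -r name ip; do echo "$name: $(classify_lb "$ip")"; done
```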
Step 1: Check Service External IP/Hostname¶
Why: The LoadBalancer service must have an assigned external IP or hostname before any health checks can run.
```shell
# Get full details of the service
kubectl describe svc <SERVICE_NAME> -n <NAMESPACE>

# Look at events for provisioning errors
kubectl get events -n <NAMESPACE> --field-selector involvedObject.name=<SERVICE_NAME> --sort-by=.lastTimestamp
```
If LoadBalancer Ingress is blank, check cloud provider quota (too many LBs), subnet tags (on AWS, subnets must be tagged for ELB use), or the cloud-controller-manager logs: kubectl logs -n kube-system -l component=cloud-controller-manager.
Step 2: Check NodePort Reachability¶
Why: A LoadBalancer service routes traffic through NodePorts on cluster nodes. If the NodePort is unreachable, the LB health check fails even if the app is healthy.
```shell
# Find the NodePort assigned to the service
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.ports[*].nodePort}'

# Get node IPs
kubectl get nodes -o wide

# Test NodePort directly from a machine that can reach the nodes
curl -v http://<NODE_IP>:<NODE_PORT>/
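Testing one node at a time gets tedious on a large cluster. A minimal sketch of a loop over every node follows; probe_curl and probe_all are hypothetical helpers, and the jsonpath filter in the commented usage assumes nodes expose reachable InternalIP addresses.

```shell
# probe_curl IP PORT : print the HTTP status code (000 means no answer at all)
probe_curl() { curl -s -o /dev/null -w '%{http_code}' --max-time 3 "http://$1:$2/"; }

# probe_all PORT IP... : probe the NodePort on each node IP (hypothetical helper)
probe_all() {
  port=$1; shift
  for ip in "$@"; do
    printf '%s:%s -> %s\n' "$ip" "$port" "$(${PROBER:-probe_curl} "$ip" "$port")"
  done
}

# With cluster access, collect node InternalIPs and probe them:
# probe_all "$NODE_PORT" $(kubectl get nodes \
#   -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')
```

The prober is swappable via PROBER, so the same loop works with wget or a TCP-only check if curl is unavailable.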
Step 3: Check Security Group / Firewall Rules¶
Why: Cloud firewalls (AWS Security Groups, GCP Firewall Rules, Azure NSGs) frequently block the health check port or source IP range.
```shell
# For AWS: find the node security group
aws ec2 describe-security-groups \
  --filters Name=tag:kubernetes.io/cluster/<CLUSTER_NAME>,Values=owned \
  --query 'SecurityGroups[*].[GroupId,GroupName]' --output table

# Check inbound rules for the health check port
aws ec2 describe-security-groups --group-ids <SG_ID> \
  --query 'SecurityGroups[*].IpPermissions' --output json

# For GCP:
gcloud compute firewall-rules list --filter="network=<NETWORK_NAME>"

# For Azure:
az network nsg rule list --resource-group <RG_NAME> --nsg-name <NSG_NAME> --output table

# Should show an inbound rule allowing the health check source CIDR.
# For AWS NLB, health checks come from the node's own IP.
# For AWS ALB, health checks come from the LB subnet CIDRs.
```
Ensure an inbound rule allows traffic to <NODE_PORT> from the LB health check source. For AWS ALB this is the VPC CIDR; for NLB it is the node IP itself.
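When scripting the fix, the ALB/NLB distinction above can be captured in a helper. A minimal sketch follows; hc_source is hypothetical, and the commented authorize call assumes you hold the ec2:AuthorizeSecurityGroupIngress permission.

```shell
# hc_source TYPE VPC_CIDR NODE_IP : which source CIDR to allow (hypothetical helper)
hc_source() {
  case "$1" in
    alb) echo "$2" ;;      # ALB health checks come from the LB subnets / VPC CIDR
    nlb) echo "$3/32" ;;   # NLB health checks come from the node's own IP
    *)   echo "unknown LB type: $1" >&2; return 1 ;;
  esac
}

# Then open the NodePort from that source (AWS example):
# aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
#   --protocol tcp --port "$NODE_PORT" --cidr "$(hc_source nlb "" "$NODE_IP")"
```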
Step 4: Check Health Check Endpoint Responds¶
Why: The LB probes a specific HTTP path and port. If the app returns non-2xx or the port does not match, targets stay unhealthy.
```shell
# Find what health check path and port the LB is configured with
# (check cloud console or annotations on the service)
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o yaml | grep -i health

# Test the exact health check path from inside a pod on the same node
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  wget -qO- http://localhost:<APP_PORT><HEALTH_CHECK_PATH>

# Test from a debug pod on the node
kubectl run debug-pod --image=curlimages/curl --restart=Never --rm -it \
  -- curl -v http://<POD_IP>:<APP_PORT><HEALTH_CHECK_PATH>
```
If the endpoint does not return 2xx, check the application logs: kubectl logs <POD_NAME> -n <NAMESPACE>. If it returns 200 from inside the cluster but fails from the LB, it is a network path issue; re-examine Steps 2-3.
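Because the LB counts only 2xx responses as healthy, a small status classifier makes it easy to bulk-check every backend pod. This is a sketch: hc_ok is a hypothetical helper, and the commented loop assumes wget is available inside the pods.

```shell
# hc_ok CODE : succeed only for 2xx status codes, mirroring typical LB health check rules
hc_ok() { case "$1" in 2[0-9][0-9]) return 0 ;; *) return 1 ;; esac; }

# Check every backend pod (requires cluster access; wget-in-pod is an assumption):
# for pod in $(kubectl get pods -n "$NAMESPACE" -l "$SELECTOR" -o name); do
#   code=$(kubectl exec "${pod#pod/}" -n "$NAMESPACE" -- \
#     wget -O /dev/null -S "http://localhost:$APP_PORT$HEALTH_CHECK_PATH" 2>&1 \
#     | awk '/HTTP\//{print $2; exit}')
#   hc_ok "$code" && echo "${pod#pod/}: OK ($code)" || echo "${pod#pod/}: FAIL ($code)"
# done
```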
Step 5: Check Backend Pod Readiness¶
Why: Kubernetes will not forward traffic to pods that are not Ready, so if all pods are unready the service has no endpoints and the LB targets stay unhealthy.
```shell
# Check pod readiness
kubectl get pods -n <NAMESPACE> -l <SELECTOR_LABEL>=<SELECTOR_VALUE>

# Check endpoints — should list pod IPs
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>

# Describe pods to see failing readiness probes
kubectl describe pods -n <NAMESPACE> -l <SELECTOR_LABEL>=<SELECTOR_VALUE> \
  | grep -A 10 "Readiness\|Conditions"
```
If ENDPOINTS shows <none>, all pods are unready. Fix the failing readiness probe or the underlying app issue first.
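The endpoints check can be scripted the same way. A minimal sketch: has_endpoints is a hypothetical helper, and the jsonpath in the commented usage assumes the classic Endpoints object (not EndpointSlices).

```shell
# has_endpoints VALUE : succeed if the service has at least one backing pod IP
has_endpoints() { case "$1" in ""|"<none>") return 1 ;; *) return 0 ;; esac; }

# Against a live cluster:
# eps=$(kubectl get endpoints "$SERVICE_NAME" -n "$NAMESPACE" \
#   -o jsonpath='{.subsets[*].addresses[*].ip}')
# has_endpoints "$eps" || echo "no ready endpoints: fix readiness probes first"
```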
Step 6: Check if Health Check Port/Path Matches¶
Why: A common misconfiguration is the LB health check pointing at a different port or path than what the app actually serves.
```shell
# For AWS ALB (annotation-driven):
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o yaml \
  | grep -E "healthcheck|health-check"

# Common AWS ALB annotations to check:
#   alb.ingress.kubernetes.io/healthcheck-path
#   alb.ingress.kubernetes.io/healthcheck-port

# Check what port/path the app actually exposes
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o yaml \
  | grep -A 10 "readinessProbe\|livenessProbe"

# Health check path in annotation should match readinessProbe httpGet path
# Health check port should match containerPort
```
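To catch the mismatch automatically, compare the annotation value against the readiness probe path. This is a sketch under assumptions: paths_match is a hypothetical helper, and the annotation key and container index in the commented jsonpath expressions may differ in your setup.

```shell
# paths_match LB_PATH PROBE_PATH : report whether LB and readiness probe agree
# (hypothetical helper)
paths_match() {
  if [ "$1" = "$2" ]; then echo "match: $1"; else echo "MISMATCH: LB=$1 probe=$2"; fi
}

# Pull both values from the cluster (annotation key and container index are assumptions):
# lb_path=$(kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" \
#   -o jsonpath='{.metadata.annotations.alb\.ingress\.kubernetes\.io/healthcheck-path}')
# probe_path=$(kubectl get deployment "$DEPLOYMENT_NAME" -n "$NAMESPACE" \
#   -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.httpGet.path}')
# paths_match "$lb_path" "$probe_path"
```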
Verification¶
```shell
# Confirm the issue is resolved — check target health from cloud CLI
# AWS:
aws elbv2 describe-target-health \
  --target-group-arn <TARGET_GROUP_ARN> \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table

# GCP:
gcloud compute backend-services get-health <BACKEND_SERVICE_NAME> --global

# Then test end-to-end from outside
curl -v https://<LB_HOSTNAME>/
```
Resolved when: every target reports healthy in the cloud CLI output and the external curl returns HTTP 200.
If still broken, escalate (see Escalation below).
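Rather than re-running the describe-target-health command by hand, a small polling loop can watch for recovery. This is a sketch: wait_healthy is a hypothetical helper that takes any command printing target states plus an attempt count, and POLL_SLEEP (default 10s) controls the interval.

```shell
# wait_healthy CMD ATTEMPTS : poll CMD until no target is unhealthy/initial/draining
# (hypothetical helper; CMD must print the target states on stdout)
wait_healthy() {
  attempts=$2
  while [ "$attempts" -gt 0 ]; do
    states=$("$1")
    case "$states" in
      *unhealthy*|*initial*|*draining*)
        attempts=$((attempts - 1)); sleep "${POLL_SLEEP:-10}" ;;
      *) echo "all targets healthy"; return 0 ;;
    esac
  done
  echo "timed out; last states: $states"; return 1
}

# AWS example (requires elasticloadbalancing:DescribeTargetHealth):
# check() { aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" \
#   --query 'TargetHealthDescriptions[*].TargetHealth.State' --output text; }
# wait_healthy check 18   # roughly 3 minutes at the default 10s interval
```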
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Platform/Infrastructure on-call | "LB health checks failing for <SERVICE_NAME>, runbook steps exhausted, not resolved after 30 min" |
| Data loss suspected | Application team lead | "Traffic routing failure for <SERVICE_NAME>, requests may be getting dropped" |
| Scope expanding to multiple services | SRE lead | "Multiple LBs failing health checks, possible cloud provider networking issue or CNI problem" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes¶
- Wrong health check path configured on LB: The LB sends a GET to a specific path. If the annotation says /health but the app only exposes /healthz, all targets will be unhealthy indefinitely even though the app is perfectly fine.
- Security group blocking health check port: For AWS ALB/NLB, the health check source IPs must be allowed by the node security group. Forgetting this rule is the single most common cause of this alert.
Cross-References¶
- Topic Pack: Cloud Load Balancing and Kubernetes Services (deep background)
- Related Runbook: Network Partition