Kubernetes Ecosystem - Street-Level Ops

Quick Diagnosis Commands

# --- Cluster & Context Navigation ---

# See all contexts and current context
kubectl config get-contexts
kubectx                          # with kubectx installed (faster)

# Switch context
kubectl config use-context prod-cluster
kubectx prod-cluster             # shorthand

# Switch namespace
kubectl config set-context --current --namespace=production
kubens production                # with kubens installed

# --- Helm ---

# List all releases across all namespaces
helm list -A

# Check a release's status
helm status myapp -n production
helm history myapp -n production

# Diff before upgrade
helm diff upgrade myapp ./chart -n production -f values-prod.yaml  # requires helm-diff plugin

# Upgrade with dry run
helm upgrade myapp ./chart -n production -f values-prod.yaml --dry-run

# Rollback
helm rollback myapp 2 -n production  # roll back to revision 2

# Show rendered manifests
helm template myapp ./chart -f values-prod.yaml

# --- ArgoCD ---

# List all apps and their sync status
argocd app list

# Check a specific app
argocd app get myapp
argocd app get myapp --show-operation   # show last sync operation

# Force sync
argocd app sync myapp
argocd app sync myapp --force --replace  # nuclear: deletes and recreates resources (downtime risk)

# Fix out-of-sync app
argocd app diff myapp           # show what's different
argocd app sync myapp --prune   # apply and delete orphaned resources

# App of apps status
argocd app list --selector app-of-apps=true

# --- Flux ---

# Check all Flux resources
flux get all -A

# Check kustomization reconciliation
flux get kustomization -A
flux reconcile kustomization myapp --with-source  # force reconcile + pull latest

# Check helm release status
flux get helmrelease -A
flux reconcile helmrelease myapp -n production

# Check source sync (GitRepository, HelmRepository)
flux get sources all -A

# Suspend and resume
flux suspend kustomization myapp
flux resume kustomization myapp

# --- cert-manager ---

# List all certificates and their ready state
kubectl get certificate -A
kubectl get certificaterequest -A
kubectl get challenges -A        # ACME challenges in progress

# Check a certificate
kubectl describe certificate myapp-tls -n production

# Force renewal (requires the cert-manager kubectl plugin, or use cmctl)
kubectl cert-manager renew myapp-tls -n production
cmctl renew myapp-tls -n production  # standalone CLI equivalent

# Check issuer health
kubectl describe clusterissuer letsencrypt-prod

# --- External Secrets Operator ---

# Check ExternalSecret sync status
kubectl get externalsecret -A
kubectl describe externalsecret myapp-secret -n production

# Force refresh (before next scheduled sync)
kubectl annotate externalsecret myapp-secret \
  force-sync=$(date +%s) -n production --overwrite

# Check SecretStore connectivity
kubectl get secretstore -A
kubectl get clustersecretstore
kubectl describe clustersecretstore vault-backend

# --- Ingress-NGINX ---

# Check ingress controller pod
kubectl get pods -n ingress-nginx

# View NGINX config for a specific ingress
kubectl exec -n ingress-nginx deploy/ingress-nginx-controller -- \
  nginx -T | grep -A20 "server_name myapp.example.com"

# List all ingresses and their addresses
kubectl get ingress -A

# Check NGINX controller logs
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=50

# --- Cluster Autoscaler / Node Scaling ---

# Check cluster autoscaler status
kubectl logs -n kube-system deploy/cluster-autoscaler --tail=50 | grep -E "scale|node"

# Check for unschedulable pods (trigger for scale-up)
kubectl get pods -A --field-selector=status.phase=Pending

# HPA status
kubectl get hpa -A
kubectl describe hpa myapp -n production

# VPA recommendations
kubectl describe vpa myapp -n production
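
The Pending-pod check earlier in this section only lists names; pulling the scheduler's message explains why a pod is stuck. A sketch assuming standard pod JSON, where the `PodScheduled` condition carries the message:

```shell
# Print "NAMESPACE/NAME: scheduling message" for every Pending pod.
# Reads `kubectl get pods -A -o json` on stdin.
pending_reasons() {
  jq -r '.items[]
         | select(.status.phase == "Pending")
         | .metadata.namespace + "/" + .metadata.name + ": "
           + ((.status.conditions[]? | select(.type == "PodScheduled") | .message) // "no message")'
}

# Usage: kubectl get pods -A -o json | pending_reasons
```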

Common Scenarios

Scenario 1: ArgoCD App Out of Sync — Find What's Different

App shows OutOfSync but you're not sure what changed.

# Step 1: Show what ArgoCD thinks is different
argocd app diff myapp

# Step 2: Check for orphaned resources (in the app's namespace but not managed by ArgoCD)
# (requires orphaned-resource monitoring enabled on the AppProject)
argocd app resources myapp --orphaned

# Step 3: Check if it's a mutation webhook adding annotations
argocd app get myapp -o json | jq '.status.resources[] | select(.status != "Synced")'

# Step 4: Look at the live vs desired state
argocd app get myapp --output json | jq '.status.sync.comparedTo'

# Step 5: Common cause — resource drift from manual kubectl edit
# Check who modified it
kubectl get <resource> <name> -n <ns> -o json | jq '.metadata.annotations'

# Step 6: Sync and let ArgoCD reconcile back to git state
argocd app sync myapp

# Step 7: If you want to preserve the live state, update git to match
# (i.e., "refresh" the desired state from live)
# Do NOT use --replace unless you understand what will be deleted
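
Step 3's per-resource drift check can be wrapped in a small helper. A sketch, assuming the JSON shape emitted by `argocd app get -o json` (a `.status.resources[]` array with `kind`, `name`, and `status` fields):

```shell
# Print only the resources ArgoCD reports as drifted, one "KIND NAME STATUS" per line.
# Reads an `argocd app get -o json` document on stdin.
drifted_resources() {
  jq -r '.status.resources[]
         | select(.status != "Synced")
         | "\(.kind) \(.name) \(.status)"'
}

# Usage: argocd app get myapp -o json | drifted_resources
```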

Scenario 2: Flux HelmRelease Stuck in Failed State

Flux keeps trying to reconcile a HelmRelease but it keeps failing.

# Step 1: Check the HelmRelease status
kubectl describe helmrelease myapp -n production
# Look at: Status.Conditions — Failed, Reason, Message

# Step 2: Check the Helm release itself
helm list -n production -a  # -a includes failed/pending
helm history myapp -n production

# Step 3: Get detailed Flux error
flux get helmrelease myapp -n production

# Step 4: Check the source (HelmRepository or GitRepository) is synced
flux get sources helm -A
flux get sources git -A

# Step 5: If Helm chart is broken — get the actual Helm error
kubectl get helmrelease myapp -n production -o json | \
  jq '.status.conditions[] | select(.type == "Ready") | .message'

# Step 6: Force a reconcile to pick up fixes
flux reconcile helmrelease myapp -n production --with-source

# Step 7: If in a terminal failure loop, reset the release state
# (Flux will retry from scratch)
kubectl patch helmrelease myapp -n production --type=merge \
  -p '{"spec":{"suspend":true}}'
kubectl patch helmrelease myapp -n production --type=merge \
  -p '{"spec":{"suspend":false}}'
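
The suspend/resume reset in Step 7 is easy to fat-finger under pressure; a hypothetical wrapper (the name `flux_hr_reset` is illustrative) keeps the two patches together:

```shell
# Toggle spec.suspend off and on so Flux retries the HelmRelease from scratch.
flux_hr_reset() {
  name=$1; ns=$2
  kubectl patch helmrelease "$name" -n "$ns" --type=merge -p '{"spec":{"suspend":true}}'
  kubectl patch helmrelease "$name" -n "$ns" --type=merge -p '{"spec":{"suspend":false}}'
}

# Usage: flux_hr_reset myapp production
```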

Scenario 3: cert-manager Certificate Stuck Not Ready

Certificate shows READY: False and ACME challenge isn't completing.

# Step 1: Get the certificate status
kubectl describe certificate myapp-tls -n production
# Look at Events and Status.Conditions

# Step 2: Find the associated CertificateRequest
kubectl get certificaterequest -n production
kubectl describe certificaterequest <name> -n production

# Step 3: Check ACME challenges
kubectl get challenges -A
kubectl describe challenge <name> -n production

# Step 4: For HTTP-01 challenges — verify the challenge URL is reachable
CHALLENGE_TOKEN=$(kubectl get challenge <name> -n production -o jsonpath='{.spec.token}')
curl http://myapp.example.com/.well-known/acme-challenge/$CHALLENGE_TOKEN

# Step 5: For DNS-01 challenges — verify DNS propagation
dig _acme-challenge.myapp.example.com TXT

# Step 6: Check cert-manager logs for detailed error
kubectl logs -n cert-manager deploy/cert-manager --tail=100 | grep -i "myapp\|error\|fail"

# Step 7: Check for Let's Encrypt rate limits
# (50 certificates per registered domain per week; 5 duplicate certificates per week)
# If rate limited, wait or test against the staging issuer
kubectl describe clusterissuer letsencrypt-prod | grep -A5 "Status"
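
For HTTP-01 failures specifically, the Step 4 check can be made a reusable probe. A sketch (`check_http01` is a hypothetical helper); per RFC 8555 the served body is the key authorization, which begins with the token:

```shell
# Return 0 if the ACME HTTP-01 challenge for a domain is being served correctly.
check_http01() {
  domain=$1; token=$2
  body=$(curl -fsS "http://$domain/.well-known/acme-challenge/$token") || return 1
  case "$body" in
    "$token"*) echo "challenge reachable" ;;
    *)         echo "unexpected body: $body"; return 1 ;;
  esac
}

# Usage: check_http01 myapp.example.com "$CHALLENGE_TOKEN"
```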

Scenario 4: External Secrets Not Syncing from Vault

ExternalSecret shows SecretSyncedError or is stuck in Pending.

# Step 1: Check ExternalSecret status
kubectl describe externalsecret myapp-secret -n production
# Status.Conditions will show the error

# Step 2: Check the SecretStore connection
kubectl describe clustersecretstore vault-backend
# Status should show: Ready

# Step 3: Test Vault connectivity from within the cluster
kubectl run vault-test --rm -it --image=hashicorp/vault:latest --restart=Never -- \
  vault status -address=http://vault.vault:8200

# Step 4: Check the Vault token/auth used by ESO
kubectl get clustersecretstore vault-backend -o json | jq '.spec.provider.vault'

# Step 5: Verify the Vault path exists
# (from a pod with Vault access)
vault kv get secret/production/myapp/config

# Step 6: Check ESO operator logs
kubectl logs -n external-secrets deploy/external-secrets --tail=50 | grep -i "error\|vault"

# Step 7: Force a sync
kubectl annotate externalsecret myapp-secret \
  force-sync=$(date +%s) -n production --overwrite
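
When scanning many ExternalSecrets, pulling just the Ready condition is faster than reading full describe output. A sketch assuming the standard `.status.conditions` shape:

```shell
# Print the Ready condition of an ExternalSecret as "STATUS REASON: MESSAGE".
# Reads `kubectl get externalsecret ... -o json` on stdin.
es_ready() {
  jq -r '.status.conditions[]
         | select(.type == "Ready")
         | "\(.status) \(.reason): \(.message)"'
}

# Usage: kubectl get externalsecret myapp-secret -n production -o json | es_ready
```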

Key Patterns

GitOps Health Check Workflow

# Quick sanity check of the full GitOps stack
echo "=== ArgoCD Apps ==="
argocd app list --output table | grep -v Synced

echo "=== Flux Kustomizations ==="
flux get kustomization -A | grep -v True

echo "=== Flux HelmReleases ==="
flux get helmrelease -A | grep -v True

echo "=== cert-manager Certificates ==="
kubectl get certificate -A | grep -v True

echo "=== ExternalSecrets ==="
kubectl get externalsecret -A | grep -v True

echo "=== HPA Status ==="
kubectl get hpa -A
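
The `grep -v True` pattern above can be made a reusable filter with a meaningful exit code. A sketch (`not_ready_rows` is a hypothetical helper); it assumes a table where healthy rows contain `True`, as in the flux/cert-manager/ESO listings:

```shell
# Print rows that are not Ready/True (skipping the header); exit 1 if any were found.
not_ready_rows() {
  awk 'NR > 1 && $0 !~ /True/ { print; n++ } END { exit (n > 0 ? 1 : 0) }'
}

# Usage: kubectl get certificate -A | not_ready_rows || echo "certificates need attention"
```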

Ecosystem Stack Integration Debugging

# Problem: Ingress 502 — walk the stack
# 1. Is the pod running?
kubectl get pods -n production -l app=myapp

# 2. Is the Service pointing to the pods?
kubectl get endpoints myapp -n production  # should have IP:port entries

# 3. Is the Ingress configured correctly?
kubectl describe ingress myapp -n production
kubectl get ingress myapp -n production   # check ADDRESS is populated

# 4. Is the ingress controller healthy?
kubectl get pods -n ingress-nginx

# 5. Is TLS cert valid?
curl -v https://myapp.example.com 2>&1 | grep -E "expire|subject|issuer"

# 6. Is the cert-manager certificate ready?
kubectl get certificate -n production | grep myapp

# 7. Did ArgoCD/Flux apply the latest version?
argocd app get myapp | grep "Last Sync"
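
Step 2 is the most common 502 culprit; counting ready endpoint addresses makes the check binary. A sketch assuming the classic Endpoints JSON shape (`.subsets[].addresses[]`):

```shell
# Count ready endpoint addresses for a Service; 0 means the ingress has nothing to route to.
# Reads `kubectl get endpoints <svc> -o json` on stdin.
endpoint_count() {
  jq '[.subsets[]?.addresses[]?] | length'
}

# Usage: kubectl get endpoints myapp -n production -o json | endpoint_count
```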

Helm + Kustomize Combined Pattern

# Common pattern: Helm for installation, Kustomize for environment overrides
# 1. Render Helm chart as base
helm template myapp ./chart -f values-base.yaml > k8s/base/rendered.yaml

# 2. Apply Kustomize overlay on top
kubectl kustomize overlays/production/ | kubectl apply -f -

# OR: Kustomize's helmCharts integration (Kustomize >= 4.1; requires --enable-helm)
# In kustomization.yaml:
# helmCharts:
#   - name: myapp
#     repo: https://charts.example.com
#     version: 1.2.3
#     valuesFile: values.yaml
# Render with: kustomize build --enable-helm overlays/production/

Operator Debugging Commands

# List all CRDs
kubectl get crds

# Check a CRD's schema
kubectl get crd databases.example.com -o yaml | head -80

# List all custom resources of a type
kubectl get databases -A

# Check operator logs
kubectl logs deploy/<operator-name> -n <ns> --tail=100

# Check if operator is running
kubectl get pods -n <ns> -l app=<operator-label>

# View CR status
kubectl get database my-db -n grokdevops -o jsonpath='{.status}' | jq .

# Check owner references on child resources
kubectl get statefulset my-db -n grokdevops -o jsonpath='{.metadata.ownerReferences}' | jq .
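
Reading the raw ownerReferences blob gets old quickly; a small predicate answers "is this child managed by a given controller kind?". A sketch (`owned_by` is a hypothetical helper):

```shell
# Exit 0 if the resource on stdin has a controller ownerReference of the given kind.
# Reads a resource's JSON (`kubectl get ... -o json`) on stdin.
owned_by() {
  jq -e --arg k "$1" \
     '[.metadata.ownerReferences[]? | select(.controller == true and .kind == $k)] | length > 0' \
     >/dev/null
}

# Usage: kubectl get statefulset my-db -n grokdevops -o json | owned_by Database && echo managed
```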

Gotcha: CRD Deletion Stuck

Debug clue: A hanging kubectl delete crd almost always means custom resources (CRs) of that type still exist somewhere in the cluster. The API server's customresourcecleanup finalizer on the CRD waits for every CR to be deleted, and a CR's own finalizers need the operator to process them. If the operator is already gone, nothing clears the finalizers and deletion hangs forever. Always run kubectl get <cr-type> -A and delete all CRs before deleting the CRD.

If kubectl delete crd hangs, a finalizer is blocking it:

# Check for finalizers
kubectl get crd databases.example.com -o jsonpath='{.metadata.finalizers}'

# Remove finalizers (careful - this skips cleanup)
kubectl patch crd databases.example.com --type=json \
  -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

Root cause: CRs still exist that reference this CRD. Delete all CRs first, then delete the CRD.
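
That cleanup can be scripted so no namespace is missed. A sketch (`delete_all_crs` is a hypothetical helper, for namespaced CRDs):

```shell
# Delete every CR of a given type in every namespace; then the CRD can be removed.
delete_all_crs() {
  crd=$1   # e.g. databases.example.com
  kubectl get "$crd" -A --no-headers 2>/dev/null |
    while read -r ns name _; do
      kubectl delete "$crd" "$name" -n "$ns" --wait=false
    done
}

# Usage: delete_all_crs databases.example.com && kubectl delete crd databases.example.com
```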

Gotcha: CR Deletion Stuck

If kubectl delete database my-db hangs:

# Check for finalizers
kubectl get database my-db -n grokdevops -o jsonpath='{.metadata.finalizers}'

# Is the operator running? (it needs to process the finalizer)
kubectl get pods -n <operator-namespace>

# If operator is dead and you need to force delete:
kubectl patch database my-db -n grokdevops --type=json \
  -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

Gotcha: RBAC Missing

Operators often fail silently when they lack permissions:

# Check operator logs for RBAC errors
kubectl logs deploy/<operator> -n <ns> | grep -i forbidden

Fix: Check the operator's ClusterRole/Role and add missing verbs.

Pattern: Check Operator Health

# Operator pod running?
kubectl get pods -n <ns> -l control-plane=controller-manager

# Leader election working? (for HA operators)
kubectl get lease -n <ns>

# Reconciliation metrics (if exposed)
kubectl port-forward deploy/<operator> -n <ns> 8080:8080
curl localhost:8080/metrics | grep controller_runtime_reconcile

Key metrics to watch:

  - controller_runtime_reconcile_total — reconciliation count
  - controller_runtime_reconcile_errors_total — error count
  - controller_runtime_reconcile_time_seconds — reconciliation latency
  - workqueue_depth — queue depth (should be near 0)
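
The first two counters combine into a quick error-rate check. A sketch that parses the Prometheus text format on stdin (a rough rate summed across all controllers):

```shell
# Print the overall reconcile error rate from controller-runtime metrics text.
reconcile_error_rate() {
  awk '
    /^controller_runtime_reconcile_total/        { total  += $NF }
    /^controller_runtime_reconcile_errors_total/ { errors += $NF }
    END { if (total > 0) printf "%.1f%%\n", 100 * errors / total; else print "no data" }
  '
}

# Usage: curl -s localhost:8080/metrics | reconcile_error_rate
```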

Pattern: Safe CRD Upgrades

  1. Add new optional fields (backward compatible)
  2. Never remove or rename existing fields
  3. For breaking changes, create a new API version
  4. Test with kubectl apply --dry-run=server
  5. Use kubectl diff to preview changes

Emergency: Operator Causing Cluster Issues

If an operator is hammering the API server or causing problems:

# Scale operator to 0 (stop reconciliation)
kubectl scale deploy/<operator> -n <ns> --replicas=0

# Once stable, check operator logs for the root cause
kubectl logs deploy/<operator> -n <ns> --previous --tail=200

# Fix the issue, then scale back up
kubectl scale deploy/<operator> -n <ns> --replicas=1
