Azure Troubleshooting - Street-Level Ops

Real-world Azure debugging workflows for production incidents.

First move: Who am I and which subscription?

az account show --output table
# Name            CloudName    SubscriptionId                     State    IsDefault
# prod-sub        AzureCloud   aaaaaaaa-bbbb-cccc-dddd-eeeeeeee  Enabled  True

az account list --output table
# (shows all subscriptions you have access to)

Gotcha: The az CLI remembers your last-used subscription. If you switched to dev yesterday and forgot, the commands you meant for production run against dev. Always verify with az account show before destructive operations. In production terminals, pin the subscription with az account set --subscription <prod-sub-id>, or pass --subscription <prod-sub-id> explicitly on destructive commands.
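
One way to make that check hard to skip is a guard function that runs before anything destructive. This is a minimal sketch, not an official pattern: require_subscription is a hypothetical helper name and the subscription ID is a placeholder. The only CLI call it relies on is az account show.

```shell
# Hypothetical guard: succeed only if the active subscription matches the expected ID.
require_subscription() {
  expected_sub="$1"
  # Ask the CLI which subscription is currently active.
  active_sub="$(az account show --query id --output tsv)" || return 2
  if [ "$active_sub" != "$expected_sub" ]; then
    echo "Refusing to run: active subscription is ${active_sub}, expected ${expected_sub}" >&2
    return 1
  fi
}

# Usage (placeholder ID and resource group):
#   require_subscription "<prod-sub-id>" && az group delete --name my-rg
```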

RBAC Access Denied — systematic debug

# 1. Check role assignments for the user or managed identity
#    (--all includes resource-group and resource scopes, not only subscription scope)
az role assignment list --assignee user@company.com --all --output table
# Principal           Role                Scope
# user@company.com    Reader              /subscriptions/SUB_ID/resourceGroups/my-rg

# 2. Check what a role actually includes
az role definition list --name "Contributor" --output json | \
  jq '.[0].permissions[].actions'
# ["*"]  (with notActions excluding authorization/roleAssignment writes)

# 3. Check managed identity assignments
PRINCIPAL_ID=$(az identity show --name my-identity --resource-group my-rg --query principalId -o tsv)
az role assignment list --assignee "${PRINCIPAL_ID}" --all --output table

# 4. Check Activity Log for denied operations
az monitor activity-log list \
  --resource-group my-rg \
  --status Failed \
  --output json | jq '.[] | {caller,operationName,status: .status.value,time: .eventTimestamp}'
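
To answer the narrower question "does this principal hold role X at scope Y?", the listing can be collapsed into a yes/no check. A sketch under assumptions: has_role is a hypothetical helper, and the scope string is whatever resource ID you are debugging. It uses --include-inherited so that assignments on parent scopes count too.

```shell
# Hypothetical helper: does the principal hold the role at this scope or a parent scope?
has_role() {
  # $1 = assignee (UPN or object ID), $2 = role name, $3 = scope (resource ID)
  count="$(az role assignment list \
    --assignee "$1" --role "$2" --scope "$3" \
    --include-inherited \
    --query "length(@)" --output tsv)" || return 2
  # Succeed when at least one matching assignment exists.
  [ "${count:-0}" -gt 0 ]
}

# Usage (placeholder scope):
#   has_role user@company.com "Contributor" "/subscriptions/SUB_ID/resourceGroups/my-rg" \
#     || echo "No Contributor assignment at or above my-rg"
```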

NSG blocking traffic

# Check effective security rules for a NIC (combined result of all NSGs)
az network nic list-effective-nsg --name my-nic --resource-group my-rg

# List NSG rules including defaults
az network nsg rule list --nsg-name my-nsg --resource-group my-rg \
  --include-default --output table
# Priority  Name               Access  Direction  SourceAddr  DestPort
# 100       AllowHTTP          Allow   Inbound    *           80,443
# 65000     AllowVnetInBound   Allow   Inbound    VirtualNet  *
# 65500     DenyAllInBound     Deny    Inbound    *           *

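# List flow log configurations; the DENY entries themselves live in the
# storage account (or Traffic Analytics workspace) each flow log writes to
az network watcher flow-log list --location eastus --output table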

# Remember: NSGs apply at both subnet AND NIC level
# Both must allow the traffic for it to pass

Debug clue: Use az network watcher show-next-hop to trace how Azure routes traffic for a specific source/destination pair. This reveals UDR (User Defined Route) issues that nsg rule list cannot show — traffic might be routed to a firewall appliance that drops it before the NSG even matters.
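
A sketch of that next-hop check, assuming Network Watcher is enabled in the VM's region. The helper names, resource names, and IP addresses are hypothetical; the flags are those of az network watcher show-next-hop.

```shell
# Hypothetical helper: print the next hop type for traffic leaving a VM.
next_hop_type() {
  # $1 = resource group, $2 = VM name, $3 = source IP (on the VM), $4 = destination IP
  az network watcher show-next-hop \
    --resource-group "$1" --vm "$2" \
    --source-ip "$3" --dest-ip "$4" \
    --query nextHopType --output tsv
}

# Hypothetical helper: flag routes that go through an appliance or go nowhere.
check_route() {
  hop="$(next_hop_type "$@")" || return 2
  case "$hop" in
    VirtualAppliance|None)
      echo "Suspicious next hop: ${hop} (check the route table)"
      return 1 ;;
    *)
      echo "Next hop: ${hop}" ;;
  esac
}

# Usage (placeholder names and addresses):
#   check_route my-rg my-vm 10.0.0.4 10.0.2.10
```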

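Under the hood: NSG rules are evaluated by priority (lowest number = highest priority), and custom rules use priorities 100 to 4096. Azure adds default rules at 65000 and above; inbound, they allow VNet and Azure Load Balancer traffic and deny everything else. Because every custom rule has a lower number than the defaults, a custom rule that matches always takes effect before a default rule does.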

Remember: Azure NSG evaluation mnemonic: "Lowest number wins, first match stops." Priority 100 beats 200. Azure evaluates inbound rules at both the subnet NSG and the NIC NSG — traffic must be allowed by BOTH to pass through. Think of it as two security guards at two doors.
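
The mnemonic can be played out offline with a toy model. This sketch is an illustration only: it does not call Azure, the rule format ("priority access port") is invented for the example, and real NSG rules also match on source, destination, and protocol.

```shell
# Toy model of NSG evaluation. Each rule line is "priority access port";
# a port of "*" matches anything.
nsg_decision() {
  # $1 = destination port; rules arrive on stdin in any order.
  # Sort by priority (lowest number first), then stop at the first match.
  sort -n | awk -v port="$1" '$3 == port || $3 == "*" { print $2; exit }'
}

# Two guards, two doors: traffic passes only if both NSGs allow it.
# $1 = port, $2 = file with subnet NSG rules, $3 = file with NIC NSG rules.
both_allow() {
  [ "$(nsg_decision "$1" < "$2")" = "Allow" ] &&
    [ "$(nsg_decision "$1" < "$3")" = "Allow" ]
}

# Usage:
#   printf '65500 Deny *\n200 Deny 443\n100 Allow 443\n' | nsg_decision 443
```

In the usage line, priority 100 is matched first, so the Deny at 200 and the default Deny at 65500 are never reached.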

AKS cluster issues

# Cluster status
az aks show --name my-cluster --resource-group my-rg --output table

# Get credentials
az aks get-credentials --name my-cluster --resource-group my-rg

# Node pool health
az aks nodepool list --cluster-name my-cluster --resource-group my-rg --output table
# Name       OsType  VmSize          Count  ProvisioningState
# nodepool1  Linux   Standard_D4s_v3 3      Succeeded

# Pods stuck Pending — check events
kubectl get events -n production --sort-by=.lastTimestamp | grep -i fail | tail -10
# FailedScheduling: 0/3 nodes are available: 3 Insufficient cpu

# ImagePullBackOff — check ACR permissions
az aks check-acr --name my-cluster --resource-group my-rg --acr myregistry.azurecr.io
# ACR 'myregistry.azurecr.io' is reachable from AKS cluster 'my-cluster'

# Quick fix for ImagePullBackOff: attach ACR to AKS
# az aks update --name my-cluster --resource-group my-rg --attach-acr myregistry

# Run diagnostics on the cluster
kubectl get nodes -o wide
kubectl describe node <node-name> | grep -A10 Conditions

Gotcha: az aks get-credentials names the kubeconfig context after the cluster by default. If you manage multiple AKS clusters with the same name in different resource groups, the second get-credentials collides with the first entry in ~/.kube/config and can replace it; the CLI asks before overwriting unless you pass --overwrite-existing, so scripted runs are the usual victims. Use --context to set distinct context names, or use kubelogin with --server-id for AAD-integrated clusters.
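
A small wrapper keeps context names unique by construction. This is a sketch: get_aks_credentials and the "<resource-group>-<cluster>" naming scheme are assumptions for illustration, not a CLI default.

```shell
# Hypothetical wrapper: name each kubeconfig context after resource group + cluster.
get_aks_credentials() {
  # $1 = resource group, $2 = cluster name
  az aks get-credentials \
    --resource-group "$1" --name "$2" \
    --context "${1}-${2}"
}

# Usage (placeholder names): two clusters called my-cluster no longer collide.
#   get_aks_credentials rg-a my-cluster   # context: rg-a-my-cluster
#   get_aks_credentials rg-b my-cluster   # context: rg-b-my-cluster
```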

Application Gateway returning 502

# Check backend health
az network application-gateway show-backend-health \
  --name my-appgw --resource-group my-rg --output table
# BackendServer       Health     HealthProbeLog
# 10.0.1.5:8080       Unhealthy  Connection refused

# Check health probe configuration
az network application-gateway probe list \
  --gateway-name my-appgw --resource-group my-rg --output table
# Name       Protocol  Host  Path     Interval  Timeout
# my-probe   Http      *     /health  30        30

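# Common fix: allow the Application Gateway subnet in the backend NSG
# (health probes come from the gateway instances; <appgw-subnet-cidr> is a placeholder)
# Note: "Connection refused" usually means nothing is listening on the port,
# so also confirm the app is up on 8080
az network nsg rule create \
  --nsg-name my-nsg --resource-group my-rg \
  --name AllowAppGwProbe --priority 110 \
  --direction Inbound --access Allow \
  --protocol Tcp --destination-port-ranges 8080 \
  --source-address-prefixes "<appgw-subnet-cidr>"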

Activity Log — who changed what?

# Recent changes in the resource group (last 24 hours; the date expression uses GNU date syntax)
az monitor activity-log list \
  --resource-group my-rg \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --output table

# Find failed deployments
az deployment group list --resource-group my-rg --output table
az deployment group show --name my-deployment --resource-group my-rg \
  --query "properties.error"
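
The date -d '24 hours ago' expression in the --start-time argument is GNU-specific and fails on macOS. A hedged, portable sketch follows: iso_hours_ago is a hypothetical helper that tries GNU syntax first and falls back to the BSD -v form. It has not been checked against every date implementation.

```shell
# Print a UTC ISO 8601 timestamp for N hours ago.
# Tries GNU date syntax first, then falls back to BSD/macOS syntax.
iso_hours_ago() {
  date -u -d "$1 hours ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v-"$1"H +%Y-%m-%dT%H:%M:%SZ
}

# Usage:
#   az monitor activity-log list --resource-group my-rg \
#     --start-time "$(iso_hours_ago 24)" --output table
```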

VM diagnostics without SSH

# Boot diagnostics (serial console output)
az vm boot-diagnostics get-boot-log --name my-vm --resource-group my-rg

# Run a command remotely
az vm run-command invoke --name my-vm --resource-group my-rg \
  --command-id RunShellScript \
  --scripts "df -h && free -m && systemctl status myapp"

# One-liner: get serial console output for boot failures
# az vm boot-diagnostics get-boot-log --name my-vm --resource-group my-rg 2>&1 | tail -50

# Check VM status
az vm get-instance-view --name my-vm --resource-group my-rg \
  --query "instanceView.statuses[].{Code:code,Status:displayStatus}" --output table
# Code                     Status
# PowerState/running       VM running

Debug clue: If az vm run-command invoke returns "Provisioning failed," the VM agent (waagent on Linux, WindowsAzureGuestAgent on Windows) is likely unhealthy. Run-command, extensions, and password reset all depend on this agent. SSH directly to the VM and check systemctl status walinuxagent (the unit is named waagent on RHEL-family and SUSE images), or run waagent -version.
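
Before reaching for SSH, the agent's state can usually be read from the instance view. A minimal sketch: vm_agent_status is a hypothetical helper, and the query path assumes the instance view exposes vmAgent.statuses, as it does for VMs that have reported agent status at least once.

```shell
# Hypothetical helper: print the VM agent's display status from the instance view.
vm_agent_status() {
  # $1 = resource group, $2 = VM name
  az vm get-instance-view \
    --resource-group "$1" --name "$2" \
    --query "instanceView.vmAgent.statuses[0].displayStatus" --output tsv
}

# Usage (placeholder names): a healthy agent typically reports "Ready".
#   [ "$(vm_agent_status my-rg my-vm)" = "Ready" ] || echo "VM agent is not ready"
```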

Log Analytics (KQL queries)

# Query container errors
az monitor log-analytics query \
  --workspace "workspace-id" \
  --analytics-query "ContainerLog | where LogEntry contains 'error' | take 20" \
  --output table

# Find OOM kills in AKS
az monitor log-analytics query \
  --workspace "workspace-id" \
  --analytics-query "KubeEvents | where Reason == 'OOMKilling' | project TimeGenerated, Name, Namespace, Message | order by TimeGenerated desc" \
  --output table

Check Azure service health

# Is it Azure's fault?
az rest --method get \
  --url "https://management.azure.com/subscriptions/SUB_ID/providers/Microsoft.ResourceHealth/availabilityStatuses?api-version=2023-07-01-preview" | \
  jq '.value[] | select(.properties.availabilityState != "Available") | {resource: .id, status: .properties.availabilityState}'

# Check resource locks (prevent accidental deletion)
az lock list --resource-group my-rg --output table

Default trap: Azure resource locks (CanNotDelete, ReadOnly) apply to ALL users, including automation. If Terraform or Ansible fails with "resource is locked," check az lock list before assuming a permissions issue. ReadOnly locks are especially sneaky: they block tag updates too, not just content changes.
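
For automation, the lock check can run as a pre-flight step. This sketch uses hypothetical helper names and only inspects what az lock list --resource-group returns, so treat a clean result as a hint rather than proof that nothing is locked.

```shell
# Hypothetical helper: list lock names and levels for a resource group, one per line.
list_rg_locks() {
  az lock list --resource-group "$1" \
    --query "[].[name, level]" --output tsv
}

# Hypothetical helper: succeed only when no locks are reported.
rg_unlocked() {
  locks="$(list_rg_locks "$1")" || return 2
  if [ -n "$locks" ]; then
    echo "Locks on ${1}:" >&2
    echo "$locks" >&2
    return 1
  fi
}

# Usage (placeholder resource group):
#   rg_unlocked my-rg && terraform apply
```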

One-liner: Quick "is it an Azure outage?" check: az rest --method get --url "https://management.azure.com/providers/Microsoft.ResourceHealth/events?api-version=2022-10-01" | jq '.value[] | {title: .properties.title, status: .properties.status}'