Azure Troubleshooting - Street-Level Ops¶
Real-world Azure debugging workflows for production incidents.
First move: Who am I and which subscription?¶
az account show --output table
# Name      CloudName   SubscriptionId                    State    IsDefault
# prod-sub  AzureCloud  aaaaaaaa-bbbb-cccc-dddd-eeeeeeee  Enabled  True
az account list --output table
# (shows all subscriptions you have access to)
Gotcha: The `az` CLI remembers your last-used subscription. If you switched to dev yesterday and forgot, your production commands run against dev. Always verify with `az account show` before destructive operations. Set `export AZURE_DEFAULTS_SUBSCRIPTION=<prod-sub-id>` in production terminals, or pass `--subscription <prod-sub-id>` explicitly on destructive commands.
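The check described in the gotcha can be made mechanical. The following is a minimal sketch, assuming a hypothetical helper named `require_subscription` and a placeholder subscription ID; it uses only `az account show` with a JMESPath `--query`.

```shell
# Guard: refuse to continue unless the active az subscription matches the expected ID.
# require_subscription is a hypothetical helper name, not part of the az CLI.
require_subscription() {
  local expected="$1"
  local actual
  # Prints only the ID of the currently active subscription.
  actual="$(az account show --query id --output tsv)" || return 2
  if [ "$actual" != "$expected" ]; then
    echo "Refusing to run: active subscription is ${actual}, expected ${expected}" >&2
    return 1
  fi
}

# Usage (commented out so sourcing this file has no side effects):
# require_subscription "<prod-sub-id>" && az group delete --name my-rg
```

Chaining the destructive command with `&&` means it never runs when the guard fails.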
RBAC Access Denied — systematic debug¶
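# 1. Check role assignments for the user or managed identity
# (--all lists assignments at every scope in the subscription; the default shows subscription scope only)
az role assignment list --assignee user@company.com --all --output table
# Principal         Role    Scope
# user@company.com  Reader  /subscriptions/SUB_ID/resourceGroups/my-rg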
# 2. Check what a role actually includes
az role definition list --name "Contributor" --output json | \
jq '.[0].permissions[].actions'
# ["*"] (with notActions excluding authorization/roleAssignment writes)
# 3. Check managed identity assignments
PRINCIPAL_ID=$(az identity show --name my-identity --resource-group my-rg --query principalId -o tsv)
az role assignment list --assignee "${PRINCIPAL_ID}" --all --output table
# 4. Check Activity Log for denied operations
az monitor activity-log list \
--resource-group my-rg \
--status Failed \
--output json | jq '.[] | {caller, operationName: .operationName.value, status: .status.value, time: .eventTimestamp}'
NSG blocking traffic¶
# Check effective security rules for a NIC (combined result of all NSGs)
az network nic list-effective-nsg --name my-nic --resource-group my-rg
# List NSG rules including defaults
az network nsg rule list --nsg-name my-nsg --resource-group my-rg \
--include-default --output table
# Priority  Name              Access  Direction  SourceAddr      DestPort
# 100       AllowHTTP         Allow   Inbound    *               80,443
# 65000     AllowVnetInBound  Allow   Inbound    VirtualNetwork  *
# 65500     DenyAllInBound    Deny    Inbound    *               *
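# List flow log configurations for the region (the DENY entries themselves live in the
# storage account or Log Analytics workspace each flow log writes to)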
az network watcher flow-log list --location eastus --output table
# Remember: NSGs apply at both subnet AND NIC level
# Both must allow the traffic for it to pass
Debug clue: Use `az network watcher show-next-hop` to trace how Azure routes traffic for a specific source/destination pair. This reveals UDR (User Defined Route) issues that `nsg rule list` cannot show: traffic might be routed to a firewall appliance that drops it before the NSG even matters.
Under the hood: NSG rules are evaluated by priority (lowest number = highest priority). Azure adds default rules at priorities 65000 to 65500 that allow VNet and load balancer traffic inbound (and VNet and Internet traffic outbound) and deny everything else. Your custom rules must have lower priority numbers to take effect.
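The routing check from the debug clue can be wrapped in a small function. This is a sketch under assumptions: `next_hop` is a hypothetical helper name, the resource names and IPs in the usage line are placeholders, and Network Watcher is assumed to be enabled in the VM's region.

```shell
# Ask Network Watcher where Azure would send a packet leaving a VM.
# next_hop is a hypothetical wrapper; all arguments are supplied by the caller.
next_hop() {
  local rg="$1" vm="$2" src_ip="$3" dst_ip="$4"
  az network watcher show-next-hop \
    --resource-group "$rg" --vm "$vm" \
    --source-ip "$src_ip" --dest-ip "$dst_ip" \
    --query "{type: nextHopType, ip: nextHopIpAddress, routeTable: routeTableId}" \
    --output json
}

# Usage (placeholders):
# next_hop my-rg my-vm 10.0.1.4 10.0.2.5
```

A `type` of `VirtualAppliance` means a route is steering the traffic through an appliance, and `None` means the packet is dropped by routing; in both cases the route table named in `routeTable` is the place to look before touching NSG rules.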
Remember: Azure NSG evaluation mnemonic: "Lowest number wins, first match stops." Priority 100 beats 200. For inbound traffic Azure evaluates the subnet NSG first and then the NIC NSG (outbound is the reverse), and traffic must be allowed by BOTH to pass through. Think of it as two security guards at two doors.
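To make "lowest number wins, first match stops" concrete, here is a toy model in plain shell. It illustrates the matching order only and is not how Azure implements NSGs; the rule format (`priority access port`, with `*` for any port) and the function names are invented for this sketch.

```shell
# Toy model of NSG evaluation. Rules are "priority access port" lines ("*" = any port).
nsg_decision() {
  # $1 = rules (one per line), $2 = destination port
  # Sort by priority, then stop at the first rule whose port matches.
  printf '%s\n' "$1" | sort -n -k1,1 | awk -v port="$2" '
    $3 == "*" || $3 == port { print $2; found = 1; exit }
    END { if (!found) print "Deny" }'
}

# Inbound traffic passes only if BOTH the subnet NSG and the NIC NSG allow it.
inbound_allowed() {
  # $1 = subnet NSG rules, $2 = NIC NSG rules, $3 = port
  [ "$(nsg_decision "$1" "$3")" = "Allow" ] && [ "$(nsg_decision "$2" "$3")" = "Allow" ]
}

subnet_rules='100 Allow 443
65500 Deny *'
nic_rules='200 Deny 443
300 Allow 443
65500 Deny *'

nsg_decision "$subnet_rules" 443   # prints Allow
nsg_decision "$nic_rules" 443      # prints Deny (200 matches before 300 is reached)
```

With these sample rules `inbound_allowed "$subnet_rules" "$nic_rules" 443` fails: the subnet guard waves the packet through, but the NIC guard stops it at priority 200.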
AKS cluster issues¶
# Cluster status
az aks show --name my-cluster --resource-group my-rg --output table
# Get credentials
az aks get-credentials --name my-cluster --resource-group my-rg
# Node pool health
az aks nodepool list --cluster-name my-cluster --resource-group my-rg --output table
# Name       OsType  VmSize           Count  ProvisioningState
# nodepool1  Linux   Standard_D4s_v3  3      Succeeded
# Pods stuck Pending — check events
kubectl get events -n production --sort-by=.lastTimestamp | grep -i fail | tail -10
# FailedScheduling: 0/3 nodes are available: 3 Insufficient cpu
# ImagePullBackOff — check ACR permissions
az aks check-acr --name my-cluster --resource-group my-rg --acr myregistry.azurecr.io
# ACR 'myregistry.azurecr.io' is reachable from AKS cluster 'my-cluster'
# Quick fix for ImagePullBackOff: attach ACR to AKS
# az aks update --name my-cluster --resource-group my-rg --attach-acr myregistry
# Run diagnostics on the cluster
kubectl get nodes -o wide
kubectl describe node <node-name> | grep -A10 Conditions
Gotcha: `az aks get-credentials` merges into `~/.kube/config` under a context named after the cluster. If you manage multiple AKS clusters with the same name in different resource groups, the second `get-credentials` replaces the first entry (silently when `--overwrite-existing` is passed, as scripts often do). Use `--context` to set distinct context names, or use `kubelogin` with `--server-id` for AAD-integrated clusters.
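One way to avoid the collision is to derive the context name from both the resource group and the cluster. The sketch below assumes a hypothetical helper `aks_creds` and a `<rg>-<cluster>` naming scheme, which is a choice made for this example rather than an Azure convention.

```shell
# Fetch AKS credentials under a context name that includes the resource group,
# so same-named clusters in different resource groups do not collide.
# aks_creds is a hypothetical helper name.
aks_creds() {
  local rg="$1" cluster="$2"
  az aks get-credentials \
    --resource-group "$rg" --name "$cluster" \
    --context "${rg}-${cluster}"
}

# Usage (placeholders):
# aks_creds rg-east my-cluster && kubectl config use-context rg-east-my-cluster
```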
Application Gateway returning 502¶
# Check backend health
az network application-gateway show-backend-health \
--name my-appgw --resource-group my-rg --output table
# BackendServer  Health     HealthProbeLog
# 10.0.1.5:8080  Unhealthy  Connection refused
# Check health probe configuration
az network application-gateway probe list \
--gateway-name my-appgw --resource-group my-rg --output table
# Name      Protocol  Host  Path     Interval  Timeout
# my-probe  Http      *     /health  30        30
# Common fix: allow the Application Gateway subnet in the backend NSG
# (probes originate from the gateway instances; <appgw-subnet-cidr> is a placeholder)
az network nsg rule create \
--nsg-name my-nsg --resource-group my-rg \
--name AllowAppGwProbe --priority 110 \
--direction Inbound --access Allow \
--protocol Tcp --destination-port-ranges 8080 \
--source-address-prefixes "<appgw-subnet-cidr>"
Activity Log — who changed what?¶
# Recent changes in the resource group
# (date -d requires GNU date; --offset 24h is a portable alternative to --start-time)
az monitor activity-log list \
--resource-group my-rg \
--start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
--output table
# Find failed deployments
az deployment group list --resource-group my-rg --output table
az deployment group show --name my-deployment --resource-group my-rg \
--query "properties.error"
VM diagnostics without SSH¶
# Boot diagnostics (serial console output)
az vm boot-diagnostics get-boot-log --name my-vm --resource-group my-rg
# Run a command remotely
az vm run-command invoke --name my-vm --resource-group my-rg \
--command-id RunShellScript \
--scripts "df -h && free -m && systemctl status myapp"
# One-liner: get serial console output for boot failures
# az vm boot-diagnostics get-boot-log --name my-vm --resource-group my-rg 2>&1 | tail -50
# Check VM status
az vm get-instance-view --name my-vm --resource-group my-rg \
--query "instanceView.statuses[].{Code:code,Status:displayStatus}" --output table
# Code                Status
# PowerState/running  VM running
Debug clue: If `az vm run-command invoke` returns "Provisioning failed," the VM agent (waagent on Linux, WindowsAzureGuestAgent on Windows) is likely unhealthy. This agent is required for run-command, extensions, and password reset. SSH directly to the VM and check `systemctl status walinuxagent` (the unit is named `waagent` on RHEL-family and SUSE images) or run `waagent -version`.
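Before reaching for SSH, the agent's reported state can usually be read from the instance view. This is a sketch, assuming hypothetical helper names and assuming that a healthy agent reports the display status `Ready`; verify that string against your own VMs.

```shell
# Read the guest agent status that Azure last received from the VM.
# vm_agent_status and vm_agent_ready are hypothetical helper names.
vm_agent_status() {
  local rg="$1" vm="$2"
  az vm get-instance-view --resource-group "$rg" --name "$vm" \
    --query "instanceView.vmAgent.statuses[0].displayStatus" --output tsv
}

# Succeeds only when the agent reports "Ready" (assumed healthy value).
vm_agent_ready() {
  [ "$(vm_agent_status "$1" "$2")" = "Ready" ]
}

# Usage (placeholders):
# vm_agent_ready my-rg my-vm || echo "VM agent not ready; run-command will likely fail"
```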
Log Analytics (KQL queries)¶
# Query container errors
az monitor log-analytics query \
--workspace "workspace-id" \
--analytics-query "ContainerLog | where LogEntry contains 'error' | take 20" \
--output table
# Find OOM kills in AKS
az monitor log-analytics query \
--workspace "workspace-id" \
--analytics-query "KubeEvents | where Reason == 'OOMKilling' | project TimeGenerated, Name, Namespace, Message | order by TimeGenerated desc" \
--output table
Check Azure service health¶
# Is it Azure's fault?
az rest --method get \
--url "https://management.azure.com/subscriptions/SUB_ID/providers/Microsoft.ResourceHealth/availabilityStatuses?api-version=2023-07-01-preview" | \
jq '.value[] | select(.properties.availabilityState != "Available") | {resource: .id, status: .properties.availabilityState}'
# Check resource locks (prevent accidental deletion)
az lock list --resource-group my-rg --output table
Default trap: Azure resource locks (`CanNotDelete`, `ReadOnly`) apply to ALL users including automation. If Terraform or Ansible fails with "resource is locked," check `az lock list` before assuming a permissions issue. `ReadOnly` locks are especially sneaky: they block tag updates too, not just content changes.
One-liner: Quick "is it an Azure outage?" check:
az rest --method get --url "https://management.azure.com/providers/Microsoft.ResourceHealth/events?api-version=2022-10-01" | jq '.value[] | {title: .properties.title, status: .properties.status}'
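Returning to the resource-lock trap: automation can check for locks up front instead of failing halfway through an apply. The following is a minimal sketch with hypothetical helper names. It only inspects what `az lock list --resource-group` returns, so locks set at subscription level would need a separate `az lock list` without the resource group filter.

```shell
# Pre-flight check for automation: fail fast if the resource group carries any locks.
# rg_locks and assert_no_locks are hypothetical helper names.
rg_locks() {
  # One line per lock: name <TAB> level (CanNotDelete or ReadOnly)
  az lock list --resource-group "$1" --query "[].[name, level]" --output tsv
}

assert_no_locks() {
  local locks
  locks="$(rg_locks "$1")" || return 2
  if [ -n "$locks" ]; then
    echo "Locks present on $1:" >&2
    echo "$locks" >&2
    return 1
  fi
}

# Usage (placeholders):
# assert_no_locks my-rg && terraform apply
```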