
Azure Troubleshooting - Primer

Why This Matters

Azure runs a significant share of enterprise infrastructure, and its troubleshooting patterns differ from AWS and GCP in important ways. Azure uses RBAC with scope inheritance, Network Security Groups with priority-based rules, and a monitoring stack (Azure Monitor) that ties together metrics, logs, and alerts. When something breaks in Azure, the path to resolution runs through the az CLI, the Activity Log, and NSG flow logs — and knowing those tools well is the difference between a quick fix and hours of clicking through the portal.

Core Concepts

1. RBAC and Managed Identities

Azure RBAC assigns roles at a scope: Management Group > Subscription > Resource Group > Resource. Permissions are inherited downward.

# Check current account context
az account show
az account list --output table

# List role assignments for a resource group
az role assignment list --resource-group my-rg --output table

# Check what a specific principal can do
az role assignment list --assignee user@company.com --output table

# List roles assigned to a managed identity
IDENTITY_PRINCIPAL_ID=$(az identity show --name my-identity --resource-group my-rg --query principalId -o tsv)
az role assignment list --assignee "${IDENTITY_PRINCIPAL_ID}" --output table

# Check a specific role definition (what permissions does "Contributor" include?)
az role definition list --name "Contributor" --output json | jq '.[0].permissions'

# Create a role assignment
az role assignment create \
  --assignee "${IDENTITY_PRINCIPAL_ID}" \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/SUB_ID/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystorageacct"

Managed Identities vs Service Principals:

System-assigned managed identity:
  - Tied to a specific resource (VM, App Service, AKS)
  - Lifecycle managed by Azure (created/deleted with resource)
  - Use for: workloads that run on a single Azure resource

User-assigned managed identity:
  - Standalone resource, can be shared across multiple resources
  - You manage the lifecycle
  - Use for: workloads that span multiple resources or need to survive redeployment

Service Principal:
  - Traditional app registration with credentials
  - Use for: external systems, CI/CD pipelines, multi-tenant apps

# Check managed identity on a VM
az vm identity show --name my-vm --resource-group my-rg

# Check managed identity on AKS
az aks show --name my-cluster --resource-group my-rg --query identity

# Verify AAD pod identity bindings (legacy add-on; workload identity federation uses federated credentials and service account annotations instead)
kubectl get azureidentity -A
kubectl get azureidentitybinding -A
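
From inside a resource with a managed identity, the token comes from the Azure Instance Metadata Service (IMDS) at a fixed link-local address; no credential is stored on the resource. A minimal sketch that builds the token request (the target resource shown is just an example):

```shell
# IMDS is only reachable from inside the Azure VM/resource itself
IMDS="http://169.254.169.254/metadata/identity/oauth2/token"
RESOURCE="https://management.azure.com/"
URL="${IMDS}?api-version=2018-02-01&resource=${RESOURCE}"
echo "$URL"
# On the VM you would fetch the token with:
#   curl -s -H Metadata:true "$URL" | jq -r .access_token
```

A 400 from this endpoint usually means no identity is assigned; a token that then gets 403 downstream points back at role assignments.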

Gotcha: Azure RBAC role assignments can take up to 5 minutes to propagate. If you assign a role and immediately test with az role assignment list, the permission may not show yet. This is a common source of "I gave it permissions but it still gets 403." Wait 5 minutes and retry before investigating further. The same propagation delay applies when revoking access — a removed role may still work briefly.
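
The wait-and-retry advice above can be scripted instead of done by hand. A minimal sketch of a generic retry helper; the az invocation in the comment is an assumed example, not part of the helper:

```shell
# Run a command until it succeeds, waiting between attempts; non-zero on exhaustion
retry_until() {
  local attempts=$1 delay=$2; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# e.g. poll up to 10 times, 30s apart, until the role assignment is visible:
#   retry_until 10 30 sh -c 'az role assignment list --assignee "$ID" -o tsv | grep -q .'
```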

2. Network Security Groups (NSGs)

Analogy: NSGs work like a nightclub bouncer with a numbered list. Rules are checked in priority order, lowest number first, and the first matching rule wins: if rule 100 says "Allow" and rule 200 says "Deny" for the same traffic, rule 100 wins. iptables is also first-match-wins, but it evaluates rules by their position in the chain; NSGs evaluate by priority number, regardless of the order in which the rules were created.

NSGs are Azure's firewall rules. They process rules by priority (lowest number = highest priority):

# List NSGs in a resource group
az network nsg list --resource-group my-rg --output table

# Show all rules for an NSG (including default rules)
az network nsg rule list --nsg-name my-nsg --resource-group my-rg --include-default --output table

# Check effective security rules for a NIC (combined NSG result)
az network nic list-effective-nsg --name my-nic --resource-group my-rg

# Add an allow rule
az network nsg rule create \
  --nsg-name my-nsg --resource-group my-rg \
  --name AllowHTTP --priority 100 \
  --direction Inbound --access Allow \
  --protocol Tcp --destination-port-ranges 80 443 \
  --source-address-prefixes '*'

# Check NSG flow logs (requires Network Watcher)
az network watcher flow-log list --location eastus --output table

NSG debugging order:

1. Check effective rules: az network nic list-effective-nsg
2. NSGs can be on BOTH subnet AND NIC — both must allow traffic
3. Rules are evaluated by priority (100 before 200)
4. Default rules allow VNet-to-VNet and deny internet inbound
5. Check if NSG flow logs show DENY for the traffic
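
Because the lowest priority number wins, you can replay the evaluation order offline against the JSON that az network nsg rule list emits. The two rules below are a canned sample (port 22, conflicting verdicts) so the jq filter itself is reproducible:

```shell
# Priority 100 beats 200, so the effective verdict for port 22 is Allow
rules='[{"name":"DenySSH","priority":200,"access":"Deny","destinationPortRange":"22"},
        {"name":"AllowSSH","priority":100,"access":"Allow","destinationPortRange":"22"}]'
echo "$rules" | jq -r 'map(select(.destinationPortRange == "22")) | sort_by(.priority) | .[0].access'
```

Sorting by priority before taking the first match mirrors what the platform does; running this against real rule output is a quick sanity check before digging into flow logs.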

3. Azure Monitor

Azure Monitor is the unified monitoring platform — metrics, logs, alerts, diagnostics:

# Check Activity Log (who changed what — like CloudTrail)
az monitor activity-log list \
  --resource-group my-rg \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --output table

# Filter activity log for failures
az monitor activity-log list \
  --resource-group my-rg \
  --status Failed \
  --output json | jq '.[] | {caller,operationName,status,timestamp}'
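
A portability note on the timestamps above: -d '24 hours ago' is GNU date syntax and fails on BSD/macOS, which uses -v offsets instead. A sketch of both forms:

```shell
# GNU date (Linux):
START=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
# BSD date (macOS) equivalent, shown for reference:
#   START=$(date -u -v-24H +%Y-%m-%dT%H:%M:%SZ)
echo "$START"
```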

# Query Log Analytics workspace (KQL — Kusto Query Language)
az monitor log-analytics query \
  --workspace "workspace-id" \
  --analytics-query "ContainerLog | where LogEntry contains 'error' | take 20" \
  --output table

# Check VM diagnostics
az vm diagnostics get-default-config | jq .

# List metric alert rules (the legacy "az monitor alert" classic-alerts group is gone)
az monitor metrics alert list --resource-group my-rg --output table

# Check metric values
az monitor metrics list \
  --resource "/subscriptions/SUB_ID/resourceGroups/my-rg/providers/Microsoft.Compute/virtualMachines/my-vm" \
  --metric "Percentage CPU" \
  --interval PT5M \
  --output table

Common KQL queries for AKS:

// Pod errors in the last hour
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "error" or LogEntry contains "fatal"
| project TimeGenerated, ContainerID, LogEntry
| order by TimeGenerated desc

// Node CPU pressure
Perf
| where ObjectName == "K8SNode"
| where CounterName == "cpuUsageNanoCores"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)

// Failed pod events
KubeEvents
| where Reason in ("Failed", "BackOff", "FailedScheduling", "OOMKilling")
| project TimeGenerated, Name, Namespace, Reason, Message

Under the hood: Azure Monitor uses Kusto Query Language (KQL), originally developed for Azure Data Explorer. KQL is not SQL — it uses a pipe-based syntax where each operator transforms the result set. The pattern is Table | where condition | project columns | summarize aggregation. If you know PromQL or LogQL, KQL follows a similar data pipeline model. The most common mistake for newcomers is writing SELECT/FROM — KQL does not use SQL syntax at all.

Remember Azure's troubleshooting hierarchy, I-N-A-M: Identity (az account show), Network (NSGs, effective rules), Activity Log (who changed what), Metrics/Logs (Azure Monitor, KQL). Work through these layers in order and you will find most Azure issues before reaching into service-specific diagnostics.

4. AKS Debugging

# Check AKS cluster status
az aks show --name my-cluster --resource-group my-rg --output table

# Get credentials
az aks get-credentials --name my-cluster --resource-group my-rg

# Check node pool status
az aks nodepool list --cluster-name my-cluster --resource-group my-rg --output table

# Run AKS diagnostics
az aks kollect --name my-cluster --resource-group my-rg --storage-account diagstorage

# Check cluster autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Check AKS-specific logs
az monitor log-analytics query \
  --workspace "workspace-id" \
  --analytics-query "AzureDiagnostics | where Category == 'kube-apiserver' | where log_s contains 'error' | take 20"

# Upgrade cluster
az aks get-upgrades --name my-cluster --resource-group my-rg --output table
az aks upgrade --name my-cluster --resource-group my-rg --kubernetes-version 1.29.0

# Scale node pool
az aks nodepool scale --cluster-name my-cluster --resource-group my-rg \
  --name nodepool1 --node-count 5

Common AKS issues:

Pods stuck Pending:
  1. kubectl describe pod <pod> and check the Events section
  2. Check node pool has capacity (az aks nodepool show)
  3. Check resource quotas (kubectl describe resourcequota)
  4. Check if autoscaler is hitting max (cluster-autoscaler-status configmap)
  5. Check for taints preventing scheduling
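
Step 1 can be narrowed quickly by filtering kubectl's JSON output for Pending pods. A canned sample stands in for a live cluster here so the filter is reproducible:

```shell
# Shape mirrors `kubectl get pods -o json`, trimmed to the fields the filter needs
pods='{"items":[{"metadata":{"name":"web-1"},"status":{"phase":"Pending"}},
               {"metadata":{"name":"web-2"},"status":{"phase":"Running"}}]}'
echo "$pods" | jq -r '.items[] | select(.status.phase == "Pending") | .metadata.name'
# Against a live cluster: kubectl get pods -A -o json | <the same jq filter>
```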

ImagePullBackOff:
  1. Check ACR permissions (AKS needs AcrPull role on the registry)
  2. az aks check-acr --name my-cluster --resource-group my-rg --acr myregistry.azurecr.io

Node NotReady:
  1. az aks nodepool list and check provisioningState
  2. kubectl describe node <node> and check conditions
  3. Check VM instance status in the VMSS

5. App Gateway / Load Balancer Issues

# Check Application Gateway health
az network application-gateway show-backend-health \
  --name my-appgw --resource-group my-rg --output table

# Check backend pool status
az network application-gateway show --name my-appgw --resource-group my-rg \
  --query "backendAddressPools[]" --output table

# Check probe configuration
az network application-gateway probe list --gateway-name my-appgw --resource-group my-rg --output table

# Check Load Balancer health probe status
az network lb probe list --lb-name my-lb --resource-group my-rg --output table

# Check backend pool members
az network lb address-pool list --lb-name my-lb --resource-group my-rg --output json | jq '.[].backendIPConfigurations[].id'

# View App Gateway access logs
az monitor log-analytics query \
  --workspace "workspace-id" \
  --analytics-query "AzureDiagnostics | where Category == 'ApplicationGatewayAccessLog' | where httpStatus_d >= 500 | take 20"

Common App Gateway issues:

502 Bad Gateway:
  - Backend health probe failing: check probe path/port
  - NSG blocking probe traffic (allow AzureLoadBalancer source)
  - Backend not listening on configured port
  - SSL certificate mismatch (if using HTTPS backend)

Probe failing:
  - Health probe path returns non-200
  - Probe timeout too short
  - NSG blocking health probe source IPs
  - App not ready (startup probe not configured)
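
When debugging a 502, it helps to isolate just the unhealthy members from the backend-health JSON. The canned sample below only approximates the real show-backend-health output (the field names are assumptions), but the jq pattern carries over:

```shell
# Canned sample loosely shaped like `show-backend-health` JSON (fields assumed)
health='{"backendAddressPools":[{"backendHttpSettingsCollection":[{"servers":[
  {"address":"10.0.1.4","health":"Healthy"},
  {"address":"10.0.1.5","health":"Unhealthy"}]}]}]}'
echo "$health" | jq -r '.backendAddressPools[].backendHttpSettingsCollection[].servers[]
  | select(.health != "Healthy") | .address'
```

The addresses this prints are the backends to inspect first: probe path, listening port, and the NSG in front of them.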

6. az CLI Patterns

# Find resources across subscriptions
az resource list --query "[?type=='Microsoft.Compute/virtualMachines']" --output table

# Check recent deployments (ARM template / Bicep)
az deployment group list --resource-group my-rg --output table
az deployment group show --name my-deployment --resource-group my-rg --query "properties.error"

# Serial console output (boot diagnostics)
az vm boot-diagnostics get-boot-log --name my-vm --resource-group my-rg

# Run a command on a VM (remote execution)
az vm run-command invoke --name my-vm --resource-group my-rg \
  --command-id RunShellScript --scripts "df -h && free -m && top -bn1 | head -20"

# Check resource locks (prevent accidental deletion)
az lock list --resource-group my-rg --output table

# Check Azure service health (is it Azure's fault?)
az rest --method get --url "https://management.azure.com/subscriptions/SUB_ID/providers/Microsoft.ResourceHealth/availabilityStatuses?api-version=2023-07-01-preview" | jq '.value[] | {resource: .id, status: .properties.availabilityState}'

# Cost analysis
az consumption usage list \
  --start-date $(date -d '-7 days' +%Y-%m-%d) \
  --end-date $(date +%Y-%m-%d) \
  --output table

Key Takeaway

Azure troubleshooting follows the same principles as other clouds but with Azure-specific tools: confirm your identity and subscription (az account show), check RBAC assignments (remember scope inheritance), check NSGs (both subnet-level and NIC-level, priority-ordered), and use Azure Monitor with KQL queries for logs and metrics. For AKS, combine az aks diagnostics with kubectl. The Activity Log is your audit trail, NSG flow logs show network decisions, and az vm run-command lets you debug instances without SSH access.

