Azure Troubleshooting Footguns
Mistakes that cause outages, access failures, or misdiagnosis on Azure.
1. Forgetting NSGs apply at both subnet AND NIC level
You add an allow rule to the subnet NSG. Traffic is still blocked. There is a second NSG on the NIC that denies the traffic. Both layers must allow it for packets to pass.
Fix: Check effective rules that combine both NSGs: az network nic list-effective-nsg --name my-nic --resource-group my-rg. Always check both levels.
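Conceptually, the two NSG layers combine with a logical AND. A minimal sketch, with the two lookup results hardcoded as stand-ins for what the effective-rules query would tell you:

```shell
# A packet passes only if BOTH the subnet NSG and the NIC NSG allow it.
# These booleans are hypothetical stand-ins for the real lookups
# (e.g. the output of `az network nic list-effective-nsg`).
subnet_allows=true
nic_allows=false

if $subnet_allows && $nic_allows; then
  verdict="traffic passes"
else
  verdict="traffic blocked"   # blocked here: the NIC-level NSG denies it
fi
echo "$verdict"
```

Flipping `nic_allows` to `true` is the only way the packet gets through, which is exactly why fixing the subnet NSG alone did nothing.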
2. NSG priority ordering confusion
You add a deny rule at priority 200 and an allow rule at priority 300 for the same port. The deny wins because lower numbers are evaluated first. Your allow rule is never reached.
Fix: Understand that NSG rules are evaluated by priority (lowest number first). Plan your rule numbering. Leave gaps (100, 200, 300) so you can insert rules later.
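The evaluation order can be sketched in a few lines. Assuming both rules below match the same traffic (rule names and priorities are hypothetical), sorting by priority and taking the first match reproduces what the NSG does:

```shell
# NSG evaluation sketch: rules are sorted by priority ascending and the
# first rule that matches the traffic wins; evaluation then stops.
rules="300 Allow allow-https
200 Deny deny-https"

# Numeric sort puts priority 200 first, so the Deny decides the packet's
# fate and the Allow at 300 is never reached.
winner=$(printf '%s\n' "$rules" | sort -n | head -n 1)
echo "winning rule: $winner"
```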
3. Deleting a resource group without checking for shared resources
You delete a resource group to clean up a dev environment. The resource group contained a shared Key Vault that three production applications depend on. Secrets are gone. Production apps start failing.
Fix: Check for resource locks before deletion: az lock list --resource-group my-rg. List all resources first: az resource list --resource-group my-rg --output table. Use resource locks on critical resources.
Gotcha: Azure does not support restoring a deleted resource group. Key Vaults with soft-delete enabled can be recovered (90-day default retention), but Container Apps, VMs, and most other resources are permanently gone. Apply CanNotDelete locks on any resource group containing shared infrastructure. Enable soft-delete and purge protection on Key Vaults by default.
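A dry-run sketch of the pre-deletion checklist. The resource group and lock name (`my-rg`, `protect-shared`) are placeholders; the script only prints the commands so you can review them before running anything for real:

```shell
# Pre-deletion checklist (dry run): print the checks to perform before
# deleting a resource group, instead of executing them.
rg="my-rg"

echo "RUN: az lock list --resource-group $rg --output table"
echo "RUN: az resource list --resource-group $rg --output table"

# If anything shared lives here, lock the group instead of deleting it.
lock_cmd="az lock create --name protect-shared --lock-type CanNotDelete --resource-group $rg"
echo "RUN: $lock_cmd"
```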
4. Managed identity not assigned to the correct scope
Your VM has a managed identity, but the role assignment is scoped to the wrong resource group. The identity can read storage in rg-shared but your data is in rg-data. Access Denied, and the error message does not tell you the scope is wrong.
Fix: Check role assignment scope: az role assignment list --assignee <principal-id> --output table. Verify the scope column matches where the target resource lives.
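Azure scopes are resource-ID path prefixes: an assignment grants access only at its scope and below. That makes the check a simple prefix test, sketched here with illustrative IDs:

```shell
# A role assignment scoped to rg-shared cannot cover a resource that
# lives in rg-data, because the resource id does not start with the
# assignment's scope. Both ids below are illustrative.
assignment_scope="/subscriptions/0000/resourceGroups/rg-shared"
resource_id="/subscriptions/0000/resourceGroups/rg-data/providers/Microsoft.Storage/storageAccounts/mydata"

case "$resource_id" in
  "$assignment_scope"/*) covered="yes" ;;
  *)                     covered="no"  ;;
esac
echo "assignment covers resource: $covered"
```

This is the check to do mentally when reading the Scope column of `az role assignment list`.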
5. Using the wrong subscription context
Your az CLI is set to the dev subscription. You run a deployment command without --subscription. Resources deploy to dev instead of prod. Or worse, you delete something in prod thinking it is dev.
Fix: Always check context: az account show. Use --subscription explicitly in scripts. Set AZURE_SUBSCRIPTION_ID per-session for safety.
Remember:
az account show is Azure's equivalent of aws sts get-caller-identity. Run it at the start of every troubleshooting session. In CI, always pass --subscription explicitly — never rely on the default context, which varies per machine and per user.
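A guard sketch for destructive scripts, assuming a helper along these lines (function name and subscription IDs are hypothetical). In real use, the current ID would come from `az account show --query id --output tsv`:

```shell
# Refuse to proceed when the active subscription is not the expected one.
check_subscription() {
  current="$1"
  expected="$2"
  if [ "$current" = "$expected" ]; then
    echo "ok: running in $current"
  else
    echo "refusing: active subscription $current is not $expected"
  fi
}

# Hardcoded ids for the sketch; the mismatch triggers the refusal path.
result=$(check_subscription "dev-sub-id" "prod-sub-id")
echo "$result"
```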
6. AKS missing AcrPull role on container registry
Your AKS pods show ImagePullBackOff. The container image exists in ACR. But the AKS cluster's managed identity does not have the AcrPull role on the registry. Kubernetes cannot authenticate to pull the image.
Fix: Verify with az aks check-acr --name my-cluster --resource-group my-rg --acr myregistry.azurecr.io. Grant the role: az role assignment create --assignee <aks-identity> --role AcrPull --scope <acr-resource-id>.
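The identity that actually pulls images is the kubelet identity, reported under `identityProfile` in `az aks show` output. A sketch of extracting it, using a trimmed illustrative JSON sample rather than a real response:

```shell
# Illustrative sample shaped like `az aks show` output (not a real
# response); the objectId is a fake placeholder.
sample='{"identityProfile":{"kubeletidentity":{"objectId":"11111111-aaaa"}}}'

kubelet_id=$(printf '%s' "$sample" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["identityProfile"]["kubeletidentity"]["objectId"])')
echo "grant AcrPull to: $kubelet_id"

# Then, for real:
# az role assignment create --assignee "$kubelet_id" --role AcrPull --scope <acr-resource-id>
```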
7. Not checking Activity Log during incidents
You spend an hour debugging a broken deployment. It turns out someone changed the NSG 30 minutes before the incident. You would have found this in 60 seconds by checking the Activity Log.
Fix: Early in any incident, check recent changes: az monitor activity-log list --resource-group my-rg --start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ) --output table.
Debug clue: The Activity Log answers "who changed what, and when" for the past 90 days. Filter by operation type for faster results:
--filters "eventTimestamp ge '2026-01-01' and operationType eq 'Write'". For changes older than 90 days, you need to have configured diagnostic settings to export Activity Logs to Log Analytics or a storage account.
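Building the lookback timestamp is the only fiddly part of that query. A sketch using GNU date syntax (on macOS/BSD, use `date -u -v-2H +...` instead):

```shell
# ISO-8601 start time for a two-hour lookback window.
start=$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)
echo "querying activity log since $start"

# Then, for real:
# az monitor activity-log list --resource-group my-rg --start-time "$start" --output table
```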
8. App Gateway health probe path mismatch
Your Application Gateway health probe checks / but your app only responds on /health. The probe returns 404. App Gateway marks all backends as unhealthy and returns 502 to every client request.
Fix: Match the health probe path to your app's actual health endpoint. Verify probe config: az network application-gateway probe list --gateway-name my-appgw --resource-group my-rg.
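The failure condition reduces to a membership test: the probe path must be one the app actually serves. A trivial sketch (paths are examples):

```shell
# The gateway probes "/" but the app only serves these health paths.
probe_path="/"
app_paths="/health /ready"

case " $app_paths " in
  *" $probe_path "*) probe_ok="yes" ;;
  *)                 probe_ok="no"  ;;
esac
echo "probe path served by app: $probe_ok"   # "no" => probe 404s => 502s
```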
9. Forgetting to allow AzureLoadBalancer source in NSG
You lock down your NSG and only allow your VNet CIDR. Health probes from Azure Load Balancer or Application Gateway are blocked because they come from the AzureLoadBalancer service tag, not your VNet range.
Fix: Add an NSG rule allowing AzureLoadBalancer as source for your health check port. This is required for every Azure load-balanced service.
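A dry-run sketch of such a rule; the NSG name, rule name, priority, and port are examples to adapt, and the script only prints the command rather than executing it:

```shell
# Allow Azure's health-probe traffic via the AzureLoadBalancer service
# tag. Printed for review instead of executed.
cmd="az network nsg rule create --nsg-name my-nsg --resource-group my-rg \
  --name allow-azure-lb-probes --priority 110 --access Allow \
  --direction Inbound --source-address-prefixes AzureLoadBalancer \
  --destination-port-ranges 8080 --protocol Tcp"
echo "RUN: $cmd"
```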
10. ARM deployment failures with no useful error in the CLI
Your deployment fails with a generic error. The CLI shows "DeploymentFailed" but no details. The actual error is buried in the deployment operations, not the top-level response.
Fix: Drill into deployment operations: az deployment group show --name my-deployment --resource-group my-rg --query "properties.error". Or list all operations: az deployment operation group list --name my-deployment --resource-group my-rg --output table.
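Extracting the nested error can be sketched like this. The JSON below is an illustrative sample shaped like `az deployment group show` output, not a real response:

```shell
# The useful detail sits under properties.error, not at the top level.
response='{"properties":{"error":{"code":"InvalidTemplate","message":"Deployment template validation failed"}}}'

err_code=$(printf '%s' "$response" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["properties"]["error"]["code"])')
echo "deployment failed with: $err_code"
```

The same drill-down works with `--query "properties.error"` directly on the az command once you are pointed at the right deployment.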