Azure Troubleshooting
25 cards: 🟢 7 easy | 🟡 12 medium | 🔴 6 hard
🟢 Easy (7)
1. A user has the Contributor role on a resource group but cannot assign roles to others. Why?
Contributor grants full resource management, but its NotActions exclude Microsoft.Authorization/*/Write and Microsoft.Authorization/*/Delete, so it cannot create or remove role assignments. Role assignments require the Owner role or the User Access Administrator role. Check with: az role assignment list --assignee (followed by the user's sign-in name or object ID).
Remember: Azure Resource Health shows whether the platform itself is having issues. Check this before debugging your own configuration; the problem might be on Azure's side.
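A minimal sketch of why Contributor cannot write role assignments: an operation is permitted only if it matches an Actions pattern and matches no NotActions pattern. The pattern list below is an abridged, illustrative copy of the Contributor definition, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Abridged, illustrative copy of Contributor's permissions (an assumption,
# not the authoritative role definition).
CONTRIBUTOR = {
    "actions": ["*"],
    "not_actions": [
        "Microsoft.Authorization/*/Delete",
        "Microsoft.Authorization/*/Write",
        "Microsoft.Authorization/elevateAccess/Action",
    ],
}


def _matches(operation: str, patterns: list[str]) -> bool:
    """Case-insensitive wildcard match of one operation against patterns."""
    op = operation.lower()
    return any(fnmatchcase(op, pattern.lower()) for pattern in patterns)


def is_allowed(role: dict, operation: str) -> bool:
    """Allowed when the operation is in Actions and not carved out by NotActions."""
    return _matches(operation, role["actions"]) and not _matches(
        operation, role["not_actions"]
    )
```

With this model, Microsoft.Authorization/roleAssignments/write is rejected while Microsoft.Authorization/roleAssignments/read and ordinary resource operations are allowed.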
2. Traffic to a VM on port 443 is blocked despite an NSG allow rule. What do you check?
1) Check rule priority: a lower-numbered Deny rule takes precedence over a higher-numbered Allow rule.
2) Check both subnet-level and NIC-level NSGs (both must allow).
3) Test with Network Watcher IP flow verify (az network watcher test-ip-flow).
4) Verify the VM's OS firewall also allows port 443.
Remember: An Azure NSG is roughly the equivalent of an AWS Security Group. Check NSG rules, effective routes, and NIC association when troubleshooting connectivity.
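The priority and two-level checks above can be sketched as a small model. This is a simplification under stated assumptions: rules match on port only, an NSG is attached at both the subnet and the NIC, and the `Rule` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int        # lower number is evaluated first
    access: str          # "Allow" or "Deny"
    port: Optional[int]  # None means any port


def evaluate(rules: list[Rule], port: int) -> str:
    """Return the access of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port is None or rule.port == port:
            return rule.access
    return "Deny"  # stands in for the default deny-all inbound rule


def inbound_allowed(subnet_rules: list[Rule], nic_rules: list[Rule], port: int) -> bool:
    """Inbound traffic must be allowed by the subnet NSG and the NIC NSG."""
    return evaluate(subnet_rules, port) == "Allow" and evaluate(nic_rules, port) == "Allow"
```

For example, a subnet NSG holding Deny 443 at priority 100 and Allow 443 at priority 200 evaluates to Deny, so the traffic is blocked even if the NIC NSG allows it.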
3. AKS pods are stuck in Pending state. What do you check?
1) Node pool capacity: run az aks nodepool list to check node count and VM size.
2) Resource requests may exceed available node resources (kubectl describe pod shows the scheduling events).
3) The cluster autoscaler may be disabled or already at its max count.
4) Taints on nodes may prevent scheduling.
5) PVC binding issues if pods require persistent volumes.
4. When should you use Azure Load Balancer vs Application Gateway?
Azure Load Balancer is Layer 4 (TCP/UDP): use it for non-HTTP workloads, low latency, or internal load balancing. Application Gateway is Layer 7 (HTTP/HTTPS): use it when you need path-based routing, SSL termination, WAF, cookie-based session affinity, or URL rewriting. Common mistake: using Azure Load Balancer for HTTP workloads and missing out on L7 features, or using Application Gateway for non-HTTP protocols, where it will not work.
5. An Azure VM fails to start with an allocation error. What do you do?
1) The VM size may not be available in the current fault domain or availability zone.
2) Try deallocating the VM fully (Stop-Deallocate, not just Stop), then start it again.
3) Resize to a different VM size in the same family.
4) Redeploy the VM to a different host cluster.
5) Try a different availability zone or region if flexibility allows.
6. An application using a service principal suddenly gets AADSTS7000215 (invalid client secret). What happened?
The client secret the application sends is no longer accepted: it expired, was rotated or deleted, or the secret ID was configured instead of the secret value (an expired secret may instead surface as AADSTS7000222). Azure AD/Entra ID secrets have a configurable expiry (the portal allows a maximum of 24 months). Fix: create a new secret in App Registrations and update the application's configuration with the new secret value. Proactive fix: use managed identity where possible to avoid secret rotation issues.
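Because client secrets carry an expiry date, a periodic check can flag the ones about to lapse before an application breaks. The sketch below assumes you have already collected each secret's end date into a dictionary by some means; the function name and the 30-day window are illustrative choices.

```python
from datetime import datetime, timedelta


def expiring_secrets(
    secrets: dict[str, datetime], now: datetime, within_days: int = 30
) -> list[str]:
    """Return names of secrets that are expired or expire within the window."""
    cutoff = now + timedelta(days=within_days)
    return sorted(name for name, end in secrets.items() if end <= cutoff)
```

Already-expired secrets are included in the result on purpose, since they need action first.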
7. An application that was accessing a storage account fine suddenly gets connection timeouts. What changed?
Most likely the storage account firewall was enabled. Check:
1) Storage account > Networking: if 'Enabled from selected virtual networks and IP addresses' is set, the caller must be in an allowed VNet or IP list.
2) Private endpoints may be required.
3) Check if 'Allow Azure services on the trusted services list' is enabled for Azure-to-Azure access.
Gotcha: Azure managed disk IOPS limits vary by tier. Standard HDD: 500 IOPS, Premium SSD: up to 20,000 IOPS. Check SKU limits first.
🟡 Medium (12)
1. A custom RBAC role works at subscription scope but fails at resource group scope. What do you check?
1) Verify the custom role's assignableScopes includes the target resource group or a parent scope.
2) Check that the Actions/NotActions in the role definition cover the needed resource provider operations at that scope.
3) Use az role definition list --custom-role-only to inspect the role.
Remember: Azure RBAC hierarchy: Management Group > Subscription > Resource Group > Resource. Permissions inherit downward.
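The assignableScopes check in step 1 amounts to asking whether the target scope equals, or sits beneath, one of the listed scopes. A minimal sketch follows, assuming scopes are plain resource ID strings; it does not resolve management group ancestry, and the function name is hypothetical.

```python
def is_assignable(assignable_scopes: list[str], target_scope: str) -> bool:
    """True if target_scope equals or is a child of any assignable scope."""
    target = target_scope.rstrip("/").lower()
    for scope in assignable_scopes:
        parent = scope.rstrip("/").lower()
        # Require a "/" boundary so ".../app" does not match ".../app-rg".
        if target == parent or target.startswith(parent + "/"):
            return True
    return False
```

A role whose only assignable scope is a different resource group returns False for the target, which matches the failure described in the card.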
2. What is the difference between NSG and ASG, and when does ASG-based filtering fail?
NSGs filter traffic with rules based on IP/port/protocol. ASGs are logical groupings of NICs that simplify NSG rules (use an ASG as source/destination instead of IPs). ASG rules fail when:
1) the NIC is not associated with the ASG,
2) source and destination ASGs are in different VNets, or
3) multiple ASGs are used in a single rule incorrectly.
3. VNet peering is connected but VMs cannot communicate. What do you check?
1) Peering status must be Connected on BOTH sides.
2) 'Allow forwarded traffic' and 'Allow gateway transit' settings on each side.
3) Route tables: UDRs may override peering routes.
4) NSGs on both subnets/NICs.
5) Address spaces must not overlap.
6) DNS resolution may need custom DNS or Azure Private DNS zone links on both VNets.
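The address-space check in step 5 is easy to automate with the standard library. A small sketch, assuming each VNet's address space is given as a list of CIDR strings (the function name is illustrative):

```python
from ipaddress import ip_network
from itertools import product


def overlapping_prefixes(vnet_a: list[str], vnet_b: list[str]) -> list[tuple[str, str]]:
    """Return every pair of prefixes (one per VNet) that overlap."""
    return [
        (a, b)
        for a, b in product(vnet_a, vnet_b)
        if ip_network(a).overlaps(ip_network(b))
    ]
```

An empty result means the two address spaces are disjoint.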
4. AKS pods cannot reach external services. What do you check?
1) Check if the AKS cluster uses kubenet or Azure CNI; routing differs.
2) With kubenet, verify the UDR on the AKS subnet includes pod CIDR routes.
3) NSG on the AKS subnet must allow outbound.
4) If using an egress firewall or Azure Firewall, check that required FQDNs are allowed.
5) Check CoreDNS pods for DNS resolution failures.
6) For private clusters, check that DNS resolution to the API server works.
5. Backend pool VMs are showing as unhealthy in Azure Load Balancer. What do you check?
1) Health probe: verify the probe port and path match what the application listens on.
2) NSG on the backend subnet must allow the health probe source IP 168.63.129.16 (the AzureLoadBalancer service tag).
3) The application must respond with HTTP 200 for HTTP probes.
4) OS firewall must allow probe traffic.
5) Check that the VM's NIC is associated with the backend pool correctly.
6. Application Gateway returns 502 Bad Gateway. What do you check?
1) Backend health: check the App Gateway backend health blade for specific errors.
2) Backend VMs/pods must be listening on the configured port.
3) NSG on the backend subnet must allow traffic from the App Gateway subnet.
4) If using HTTPS to the backend, the backend certificate must be trusted or allowlisted.
5) Health probe timeout may be too short.
6) The backend may take longer to respond than the request timeout setting allows.
7. A VM is running but you cannot RDP/SSH to it. What do you check?
1) Check the boot diagnostics screenshot for OS-level issues (BSOD, kernel panic, disk full).
2) NSG rules allowing inbound 22/3389.
3) Check if a public IP is assigned or if you need Bastion/VPN.
4) Serial console access for emergency troubleshooting.
5) Run az vm repair create to create a repair VM and mount the OS disk.
6) Check Azure Instance Metadata Service for scheduled events.
8. A managed disk cannot be attached to a VM. What are common causes?
1) Disk and VM must be in the same region.
2) Disk may already be attached to another VM (non-shared disks are exclusive).
3) VM size may have reached its max data disk count.
4) Ultra Disk or Premium SSD v2 requires specific VM sizes and availability zones.
5) Disk encryption set (DES) mismatch between disk and VM.
9. A service principal can authenticate but gets 403 Forbidden accessing Azure resources. What do you check?
1) Verify the service principal has the correct RBAC role at the right scope (subscription, RG, or resource).
2) Check that you are using the correct tenant.
3) For a multi-tenant app, verify the SP is provisioned in the target tenant.
4) Check Conditional Access policies that may block non-interactive logins.
5) If accessing the data plane (e.g., Key Vault, Storage), check whether RBAC or access policies govern access.
10. An application gets a 403 when reading a Key Vault secret. What do you check?
1) Determine if the vault uses Access Policies or Azure RBAC for the data plane.
2) For access policies: verify the principal has Get permission under Secrets.
3) For RBAC: verify the principal has the Key Vault Secrets User role.
4) Check the Key Vault firewall: if enabled, the caller's IP or VNet must be allowed, or use a private endpoint.
5) Check if the secret is disabled or expired.
11. Azure Monitor shows no data for a VM. What do you check?
1) Verify the Azure Monitor Agent (AMA) extension is installed and healthy on the VM.
2) Check that a Data Collection Rule (DCR) exists and is associated with the VM.
3) DCR must target the correct Log Analytics workspace.
4) Workspace must not have hit its daily cap.
5) NSG must allow outbound to Azure Monitor endpoints (or use Private Link).
6) Allow 10-15 minutes for initial data ingestion delay.
12. A VM in one VNet cannot resolve records in an Azure Private DNS zone. What do you check?
1) The Private DNS zone must be linked to the VM's VNet (check Virtual Network Links).
2) Auto-registration link is only needed for automatic record creation, not resolution.
3) The VM must use Azure-provided DNS (168.63.129.16) or a custom DNS that forwards to it.
4) If using hub-spoke with custom DNS in the hub, ensure the forwarder is configured to reach Azure DNS.
Gotcha: Azure VMs use 168.63.129.16 as the internal DNS resolver. If custom DNS is configured, ensure it can resolve Azure Private DNS zones.
🔴 Hard (6)
1. A user has Owner on a resource group but still gets AuthorizationFailed on certain resources. What could cause this?
1) Azure Deny Assignments (often from Blueprints or managed apps) override even Owner. Check with az rest --method GET on the denyAssignments API.
2) Resource locks (CanNotDelete or ReadOnly) block modifications.
3) Azure Policy with Deny effect can block specific operations regardless of RBAC.
2. In a hub-spoke topology, spoke-to-spoke traffic through an NVA in the hub fails. What is wrong?
1) VNet peering on each spoke must have 'Allow forwarded traffic' enabled.
2) Hub peering must have 'Allow gateway transit' if applicable.
3) UDRs on spoke subnets must route to the NVA's private IP as a virtual appliance next hop (not to the peering).
4) The NVA must have IP forwarding enabled at both the Azure NIC level AND inside the OS.
5) NVA NSG must allow the spoke-to-spoke traffic.
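To see why the UDR in step 3 matters, it helps to model route selection: the longest matching prefix wins, and on a tie a user-defined route beats a BGP route, which beats a system route. The sketch below is a simplified model; the `Route` type, the sample prefixes, and the next-hop labels are assumptions for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional

# Tie-break order when prefixes are identical: user-defined, then BGP, then system.
SOURCE_RANK = {"User": 0, "BGP": 1, "System": 2}


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str    # "User", "BGP", or "System"
    next_hop: str


def select_route(routes: list[Route], destination: str) -> Optional[Route]:
    """Pick the effective route: longest prefix first, then source priority."""
    dest = ip_address(destination)
    candidates = [r for r in routes if dest in ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ip_network(r.prefix).prefixlen, SOURCE_RANK[r.source]),
    )
```

In a spoke without a UDR for the other spoke's prefix, traffic falls through to a broader system route and never reaches the NVA; adding a user-defined route for that prefix makes it the most specific match.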
3. AKS workload identity pods get 401 errors when accessing Azure resources. What do you check?
1) Verify the managed identity exists and has the correct RBAC role on the target resource.
2) Check the federated identity credential matches the service account namespace and name exactly.
3) Ensure the pod spec has the correct serviceAccountName.
4) Verify the azure-workload-identity webhook is running (kube-system).
5) Check the AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_FEDERATED_TOKEN_FILE environment variables are injected into the pod.
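Step 5 can be checked from inside the pod with a few lines. This sketch only reports which of the three variables named above are missing or empty; the function name is illustrative, and it does not validate the token itself.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_FEDERATED_TOKEN_FILE")


def missing_workload_identity_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running it in the pod (for example via kubectl exec) and getting a non-empty list suggests the webhook did not mutate the pod.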
4. A VM experiences high disk latency despite using Premium SSD. What do you investigate?
1) Check that the VM size supports Premium storage, and check its VM-level IOPS/throughput caps.
2) Disk IOPS and throughput are capped by disk size tier: a P10 (128 GiB) is capped at 500 IOPS.
3) Analyze disk bursting: sustained load above the baseline exhausts burst credits.
4) Check the host caching setting: ReadOnly for read-heavy, None for write-heavy.
5) Use Azure Monitor Disk IO metrics to compare actual vs. max IOPS.
6) Consider the temp disk or local NVMe storage for ephemeral workloads.
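Steps 1 and 2 combine into a simple rule: the IOPS a disk can actually deliver is the lower of the disk-tier cap and the VM-level cap. A minimal sketch, ignoring bursting and host caching; the VM cap values used in the usage note are hypothetical, so look up the real figure for your VM size.

```python
def effective_iops(disk_cap: int, vm_cap: int) -> int:
    """Deliverable IOPS for one disk: the tighter of the disk and VM limits."""
    return min(disk_cap, vm_cap)


def is_near_cap(observed_iops: float, cap: int, threshold: float = 0.95) -> bool:
    """Flag likely throttling when observed IOPS sits at or above the threshold."""
    return observed_iops >= cap * threshold
```

For a P10 (500 IOPS) on a VM with a hypothetical 6,400 IOPS cap, the disk is the bottleneck; for a 5,000 IOPS disk on a VM capped at a hypothetical 3,200 IOPS, the VM is.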
5. Key Vault works from local machine but not from a VM in a VNet with private endpoint. What do you check?
1) Private endpoint must be in an approved state.
2) Private DNS zone privatelink.vaultcore.azure.net must be linked to the VNet.
3) Verify that DNS resolves the vault FQDN to the private IP (for example with nslookup).
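The DNS check can also be scripted from the VM. The sketch below resolves a name and reports whether the answer is a private address; it assumes the private endpoint uses a private (RFC 1918) range, and the injectable `resolve` parameter is an illustrative device so the logic can be exercised without real DNS.

```python
import socket
from ipaddress import ip_address
from typing import Callable


def resolves_to_private_ip(
    fqdn: str, resolve: Callable[[str], str] = socket.gethostbyname
) -> bool:
    """True if the name resolves to a private address, as a private endpoint should."""
    return ip_address(resolve(fqdn)).is_private
```

A False result from a VM that should use the private endpoint usually points at a missing private DNS zone link or a custom DNS server that does not forward to Azure DNS.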
6. After configuring a private endpoint for a storage account, some VMs can connect but others in a different VNet cannot. What do you investigate?
1) The private DNS zone (privatelink.blob.core.windows.net) must be linked to ALL VNets that need access.
2) VMs in unlinked VNets resolve the public IP and get blocked by the storage firewall.
3) Verify with nslookup.