Azure Troubleshooting — Trivia & Interesting Facts¶
Surprising, historical, and little-known facts about Azure outages, debugging, and operational surprises.
Azure's biggest outage was caused by an expired TLS certificate¶
On February 22, 2013, Azure Storage experienced a worldwide outage because an internal SSL certificate expired. The incident lasted roughly 12 hours and affected virtually every Azure service that depended on storage. Microsoft implemented automated certificate rotation afterward, but certificate-related outages still recur across the industry.
Azure AD had an outage that locked users out of everything Microsoft¶
On September 28, 2020, a configuration update to Azure Active Directory caused a global authentication outage lasting over 5 hours. Since Azure AD underpins authentication for Microsoft 365, Teams, Xbox Live, and Azure itself, hundreds of millions of users were affected simultaneously.
The Azure CLI and Portal sometimes show different states¶
A well-known Azure debugging gotcha: the Azure Portal, Azure CLI, and ARM API can show different resource states during provisioning or failure scenarios. This happens because they query different caching layers. Experienced Azure engineers learn to always verify with the ARM API directly during troubleshooting.
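When the Portal and CLI disagree, the ARM control plane is the source of truth. A minimal sketch of that check, using illustrative resource names (`my-rg`, `my-vm`) and an assumed compute API version:

```shell
# Query the provisioning state straight from ARM rather than trusting
# a cached Portal or CLI view (resource names here are placeholders):
az resource show \
  --resource-group my-rg \
  --name my-vm \
  --resource-type Microsoft.Compute/virtualMachines \
  --query "properties.provisioningState" \
  --output tsv

# Or call the ARM REST endpoint directly with az rest, which skips
# client-side convenience layers entirely:
az rest --method get \
  --url "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.Compute/virtualMachines/my-vm?api-version=2023-09-01"
```

A state of `Succeeded` from ARM while the Portal still shows "Updating" usually just means the Portal's cache has not refreshed yet.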
Azure's DNS outage of 2021 was caused by a code bug that lay dormant for months¶
On April 1, 2021, Azure DNS experienced an outage when a code change from months earlier was triggered by a specific operational command. The latent bug had passed all testing but manifested under production-specific conditions, demonstrating that time-delayed bugs are among the hardest infrastructure problems to prevent.
Resource locks have prevented many accidental deletions — and caused many outages¶
Azure Resource Locks (CanNotDelete and ReadOnly) have saved countless production resources from accidental deletion. However, ReadOnly locks have also caused outages of their own, because ARM classifies some read-style operations as POST requests: a ReadOnly lock on a storage account, for example, blocks the List Keys operation, breaking any application that fetches access keys at runtime.
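The two lock types and the ReadOnly gotcha can be reproduced with the Azure CLI; resource names below are placeholders:

```shell
# Protect a resource group from deletion (modifications still allowed):
az lock create --name no-delete --lock-type CanNotDelete --resource-group my-rg

# A ReadOnly lock is stricter than it sounds:
az lock create --name freeze --lock-type ReadOnly --resource-group my-rg

# With the ReadOnly lock in place, this fails, because listing storage
# keys is a POST operation under the hood:
az storage account keys list --resource-group my-rg --account-name mystorageacct
```

Listing locks before a change (`az lock list --resource-group my-rg`) is a cheap habit that avoids both failure modes.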
NSG flow logs are separate from NSG effective rules, and this confuses everyone¶
When troubleshooting Azure networking, NSG flow logs show what traffic was actually allowed or denied, while "Effective Security Rules" shows the computed rule set. These can disagree due to processing order, ASG membership, or service tags, and the mismatch is a perennial source of Azure networking support tickets.
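A sketch of checking both sides of that disagreement with the Azure CLI, assuming a NIC named `my-nic` and a flow log named `my-flow-log` in `eastus`:

```shell
# The computed rule set for a NIC: NSG rules merged across subnet and
# NIC scope, with ASGs and service tags expanded:
az network nic list-effective-nsg --resource-group my-rg --name my-nic

# Flow logs record what actually happened on the wire; confirm the flow
# log exists and note the storage account it writes to:
az network watcher flow-log show --location eastus --name my-flow-log
```

When the two disagree, the flow log reflects reality; effective rules reflect what ARM believes should be enforced at evaluation time.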
Azure has over 60 regions, more than any other cloud provider¶
By 2024, Azure operated in over 60 geographic regions — more than AWS or GCP. This was a deliberate strategy to win government and regulated-industry customers who needed data residency guarantees. However, not all services are available in all regions, and service parity gaps are a persistent troubleshooting challenge.
The South Central US datacenter fire of 2018¶
On September 4, 2018, a lightning strike near Azure's South Central US datacenter triggered a voltage spike that damaged cooling equipment. The thermal protection systems then shut down servers to prevent hardware damage. The outage lasted over 24 hours and destroyed data on some virtual machines — a painful reminder that cloud infrastructure has physical dependencies.
ARM template deployment errors are famously unhelpful¶
Azure Resource Manager (ARM) template error messages are notorious for being vague or misleading. A common one — "DeploymentFailed: At least one resource deployment operation failed" — tells you nothing about what actually went wrong. The community has created multiple blog posts and tools specifically to decode ARM error codes into human-readable explanations.
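The actionable detail usually survives one level down, in the per-resource deployment operations rather than the top-level deployment. A sketch with placeholder names:

```shell
# The top-level error is the generic "At least one resource deployment
# operation failed" message:
az deployment group show \
  --resource-group my-rg --name my-deployment \
  --query "properties.error"

# The real cause lives in the individual operations; filter to the
# failed ones and print their status messages:
az deployment operation group list \
  --resource-group my-rg --name my-deployment \
  --query "[?properties.provisioningState=='Failed'].properties.statusMessage"
```

The `statusMessage` field typically contains the underlying resource provider error (quota exceeded, name conflict, policy denial) that the top-level message hides.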
Activity Log entries expire after 90 days¶
Azure Activity Log entries are retained for only 90 days by default. Teams conducting post-incident reviews more than 90 days after an event have discovered their audit trail has simply disappeared. Forwarding the Activity Log to a Log Analytics workspace or storage account for long-term retention is essential, but it is not configured by default.
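One way to set up that forwarding is a subscription-level diagnostic setting; the workspace path and categories below are illustrative:

```shell
# Route Activity Log entries to a Log Analytics workspace so they
# outlive the 90-day default retention (placeholder IDs throughout):
az monitor diagnostic-settings subscription create \
  --name export-activity-log \
  --location eastus \
  --workspace "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.OperationalInsights/workspaces/my-law" \
  --logs '[{"category":"Administrative","enabled":true},{"category":"Security","enabled":true},{"category":"ServiceHealth","enabled":true}]'
```

Once forwarded, entries are queryable via KQL in the workspace for as long as the workspace's own retention allows, independent of the Activity Log's 90-day window.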