Around 2:30am Pacific Time on September 4, 2018, Microsoft identified problems with the cooling systems in one part of its Texas data center complex. The resulting temperature spike forced it to shut down equipment to prevent a more catastrophic failure, according to the Azure status page. The issues have also had cascading effects for some Microsoft Office 365 users and for those who rely on Microsoft Active Directory to log into their accounts.
The cooling system is one of the most critical parts of a modern data center, given the intense heat produced by thousands of servers cranking away in an enclosed area. Most cloud companies have automatic shutdown procedures that are triggered by a sharp rise in temperature. That protects the hardware, but it also means admins have to bring everything back up afterward, and that takes time.
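For readers curious what such a trigger might look like, here is a minimal sketch in Python. It is not Microsoft's implementation: the `ThermalGuard` class, its thresholds, and the three-reading window are all hypothetical, chosen only to illustrate why a tripped system stays down until someone restarts it.

```python
from collections import deque


class ThermalGuard:
    """Hypothetical monitor that powers equipment down on a sharp temperature rise."""

    def __init__(self, max_temp_c=35.0, max_rise_c=5.0, window=3):
        # All thresholds are illustrative assumptions, not real data center values.
        self.max_temp_c = max_temp_c          # hard ceiling in degrees Celsius
        self.max_rise_c = max_rise_c          # allowed rise across the window
        self.readings = deque(maxlen=window)  # most recent readings only
        self.powered_on = True

    def record(self, temp_c):
        """Store a reading and return True while the equipment is still powered on."""
        self.readings.append(temp_c)
        rise = self.readings[-1] - self.readings[0]
        if temp_c >= self.max_temp_c or rise >= self.max_rise_c:
            self.powered_on = False  # latch off; a cooler reading does not undo it
        return self.powered_on

    def manual_restart(self):
        """The slow part: an admin has to bring things back up deliberately."""
        self.readings.clear()
        self.powered_on = True


guard = ThermalGuard()
for temp in (24.0, 25.0, 31.0, 24.0):
    print(temp, guard.record(temp))
```

The shutdown latches in this sketch: once the jump from 24 to 31 degrees trips the guard, the later 24-degree reading does not switch anything back on, which is the time-consuming part described above.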
Microsoft said it would provide an update on its progress by 10am PT on its Azure status page, which, in accordance with Murphy’s Law, has itself been down at several points this morning. At the moment the main issues seem confined to Texas, but the problems with Active Directory and Visual Studio Team Services — Microsoft’s hosted developer environment — could affect customers in multiple regions.
Update 10:01am: Microsoft pushed back its deadline for providing a status update to 1pm PT. It is still reporting that multiple services are affected, and Visual Studio Team Services appears to be down across multiple regions, giving developers around the world a snow day.
Update 10:39am: Microsoft’s own services appear to be the most widely used apps and sites affected, with Xbox Live and OneDrive also experiencing problems at the moment.
Update 11:40am: Blame it on the weather: Microsoft updated the Azure status page to say that a lightning strike in the San Antonio area (where Microsoft’s Texas data center complex is located) “resulted in a power voltage increase that impacted cooling systems. Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process.” Power has been restored to the buildings and systems are coming online, but Microsoft has yet to signal the all-clear for the dozens of services affected in the South Central US data center, as well as the broader Active Directory and Visual Studio Team Services issues.
Update 3:15pm: One of the biggest Azure meltdowns in recent memory continues well into Tuesday afternoon. Microsoft has yet to declare an end to the issues, but it updated the Azure status page with its current recovery plan:
- Restore power to the South Central US datacenter (COMPLETED)
- Recover software load balancers for Azure Storage scale units in South Central US (COMPLETED)
- Recover impacted Azure Storage scale units in South Central US (In Progress)
- Recover the remaining Storage-dependent services in South Central US (In Progress)
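
Read top to bottom, the plan looks like a dependency chain, with each step waiting on the one before it. That reading is an assumption, since Microsoft published only the list itself. Under that assumption, a short Python sketch using the standard library's `graphlib` module shows how such an ordering could be derived; the step names are shortened from the status page and the `depends_on` mapping is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each step lists the steps it must wait for.
# The chain is inferred from the order of the published plan, not confirmed.
depends_on = {
    "restore power": set(),
    "recover software load balancers": {"restore power"},
    "recover storage scale units": {"recover software load balancers"},
    "recover storage-dependent services": {"recover storage scale units"},
}

# static_order() yields every step only after all of its prerequisites.
recovery_order = list(TopologicalSorter(depends_on).static_order())

for number, step in enumerate(recovery_order, start=1):
    print(number, step)
```

If the chain holds, it would explain why services that depend on Azure Storage are last in line: they cannot come back until the storage layer underneath them has recovered.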