Microsoft Updated on Azure Storage Service Interruption

Microsoft Azure network recently faced service interruption on its storage platform. As a result of the interruption, websites of certain section of customers were affected.

“The incident has now been resolved and we are seeing normal activity in the system,” said Jason Zander, Corporate Vice President, Microsoft Azure Team.

Microsoft provides a special status page which displays the details of service interruptions if there is any. However, as of the time of this writing I am unable to see any downtime. Microsoft has issued a detailed e-mail to all affected customers. It begins with the content as shown below:

As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services.

Incident Information

Incident ID
3071402

Incident Title
Microsoft Azure Service Incident: Connectivity to multiple Azure Services – Partial Service Interruption

Service(s) Impacted
Azure Storage, Virtual Machines, SQL Geo-Restore, SQL Import/export, Websites, Azure Search, Azure Cache, Management Portal, Service Bus, Event Hubs, Visual Studio, Machine Learning, HDInsights, Automation, Virtual Network, Stream Analytics, Active Directory, StorSimple and Azure Backup Services

The Azure service interruption occured on 11/19/2014 00:51:00 AM (UTC) and restored by 11/19/2014 11:45:00 AM (UTC).

The customers in the following regions are affected by the recent azure service interruption

Regions
Central US
East US
East US 2
North Central US
South Central US
West US
North Europe
West Europe
East Asia
Southeast Asia
Japan East
Japan West

The main cause of the service interruption is that a bug in the Blob Front-Ends which was exposed by the configuration change made as a part of the performance improvement update, which resulted in the Blob Front-Ends to going into an infinite loop.

Mistakes occur and Microsoft has taken bold steps to make sure that this bug is not repeated in the future

Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches is always followed.
Improve the recovery methods to minimize the time to recovery.
Fix the infinite loop bug in the CPU reduction improvement from the Blob Front-Ends before it is rolled out into production.
Improve Service Health Dashboard Infrastructure and protocols.

You will find a complete text of the email issued by Microsoft on the official blog.