This week, Microsoft customers encountered authentication errors across Microsoft services and third-party applications that depend on Azure Active Directory (Azure AD).
It turns out that Microsoft mistakenly removed a digital key used for cryptographic signing operations.
The key was being used in a complex cloud-to-cloud migration and had been marked as “retain” for longer than usual.
A bug in Azure AD's automated key cleanup ignored the “retain” state and removed the key, which meant users could no longer authenticate and use their applications.
Microsoft rolled back the key metadata after the issue was identified, but cached metadata led to residual impact for a further twelve hours.
What is Azure Active Directory (Azure AD)?
In Azure Active Directory (Azure AD), authentication involves more than just the verification of a username and password.
To improve security and reduce the need for help desk assistance, Azure AD authentication includes the following components:
- Self-service password reset
- Azure AD Multi-Factor Authentication
- Hybrid integration to write password changes back to the on-premises environment
- Hybrid integration to enforce password protection policies for an on-premises environment
- Passwordless authentication
Mitigation for the Azure AD service was finalized at 21:05 UTC on 15 March 2021. A growing percentage of traffic for services then recovered.
Below is a list of the major services with their extended recovery times:
- 22:39 UTC 15 March 2021 Azure Resource Manager.
- 01:00 UTC 16 March 2021 Azure Key Vault (for most regions).
- 01:18 UTC 16 March 2021 Azure Storage configuration update was applied to the first production tenant as part of a safe deployment process.
- 01:50 UTC 16 March 2021 Azure Portal functionality was fully restored.
- 04:04 UTC 16 March 2021 Azure Storage configuration change applied to most regions.
- 04:30 UTC 16 March 2021 the remaining Azure Key Vault regions (West US, Central US, and East US 2).
- 09:25 UTC 16 March 2021 Azure Storage completed its recovery, and Microsoft declared the incident fully mitigated.
Root Cause and Mitigation:
Azure AD utilizes keys to support the use of OpenID and other Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use.
Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key.
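The intended behavior of such a cleanup job can be sketched as follows. This is a minimal illustration, not Microsoft's actual automation: the `SigningKey` class, the `retain` field, the `keys_to_remove` function, and the 30-day idle threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SigningKey:
    key_id: str
    last_used: datetime
    retain: bool = False  # hypothetical flag: keep the key even if it looks unused


def keys_to_remove(keys, now, max_idle=timedelta(days=30)):
    """Return the keys a time-based cleanup may delete.

    The 30-day default is an assumption made for this sketch only.
    """
    removable = []
    for key in keys:
        # The check the buggy automation effectively skipped:
        # a key marked "retain" must never be removed.
        if key.retain:
            continue
        if now - key.last_used > max_idle:
            removable.append(key)
    return removable


if __name__ == "__main__":
    now = datetime(2021, 3, 15, 19, 0, tzinfo=timezone.utc)
    keys = [
        SigningKey("old", now - timedelta(days=90)),
        SigningKey("migration", now - timedelta(days=90), retain=True),
    ]
    print([k.key_id for k in keys_to_remove(keys, now)])
```

Without the `retain` check, the "migration" key in this example would be treated like any other idle key and removed, which is the class of failure described above.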
Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC on 15 March 2021, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end-users were no longer able to access those applications.
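To illustrate why a metadata change has this effect, here is a simplified sketch of how an application might look up a token's signing key in the published key set, with a local cache. It assumes a JWKS-style document with a `keys` array whose entries carry a `kid` (key ID), as used with OpenID Connect. The `CachedKeySet` class, the `fetch` callable, and the TTL value are hypothetical, and real validation would also verify the token signature, which is omitted here.

```python
class CachedKeySet:
    """Caches the published signing keys for ttl_seconds (hypothetical helper)."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch          # callable returning {"keys": [...]}
        self._ttl = ttl_seconds
        self._keys = None
        self._fetched_at = None

    def keys(self, now):
        # Refresh only when nothing is cached yet or the cache has expired.
        if self._keys is None or now - self._fetched_at >= self._ttl:
            self._keys = self._fetch()["keys"]
            self._fetched_at = now
        return self._keys


def find_signing_key(token_header, published_keys):
    """Return the published key whose kid matches the token header, or None."""
    kid = token_header.get("kid")
    for key in published_keys:
        if key.get("kid") == kid:
            return key
    return None


def is_token_trusted(token_header, published_keys):
    # Lookup step only: a token signed with an unpublished key is rejected
    # before any signature check could take place.
    return find_signing_key(token_header, published_keys) is not None
```

In this sketch, a token whose `kid` no longer appears in the published key set is rejected as soon as the application refreshes its cache. The same caching also means that a corrected key set is only picked up once the cached copy expires, which is consistent with the residual impact seen after the rollback.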
Microsoft's investigation is ongoing, and some issues have been fixed. However, some regions are still experiencing intermittent failures when performing operations on these resources and are still waiting for a fix.