Storm-0558 key acquisition commentary

Yesterday, Microsoft published a blog post stating that it has completed its investigation and detailing what it found about the key acquisition. The writeup helps security teams understand how the threat actor was able to acquire the signing key, and it also gives the public valuable insight into the security controls that Microsoft uses internally to protect this type of data.

Breaches like this serve as invaluable case studies for all security teams to learn from. Even more important than the technical details of the attack is what they show about how security controls can still fail in unexpected ways, and about the importance of a layered defense.

We will briefly summarize Microsoft's article on the known steps in the attack, then share our own learnings from this event.

How it happened

We recommend that you read Microsoft's blog article on this topic. It is fairly accessible, and can be found at the following link:

https://msrc.microsoft.com/blog/2023/09/results-of-major-technical-investigations-for-storm-0558-key-acquisition/

In summary, the events that led to Storm-0558 being able to access customer emails include the following:

In 2021, a consumer signing system (presumably the production system that signs tokens for end users) crashed.
A crash dump was collected from the system.
The crash dump was supposed to be stripped of sensitive data. This is presumably automated, although the article does not go into detail. Importantly, a race condition prevented the Microsoft account (MSA) signing key from being redacted from the crash dump. This is the first control that failed.
The crash dump, including the MSA signing key, was moved to a dedicated debugging environment for analysis. At the time, engineers assumed there was no sensitive data in the crash dump.
At some unspecified time after April 2021, a Microsoft engineer's account was compromised. That account had access to the debugging environment and the crash dump.
Storm-0558 exfiltrated the signing key from the dump at some unknown time after April 2021. Because of log retention policies, no logs remain that would allow a specific date to be identified.
Storm-0558 forged tokens to access the Outlook web application as arbitrary users from arbitrary tenants. This was done using Python and PowerShell scripts from the attacker's infrastructure, and did not require the adversary to have further access to Microsoft infrastructure. The earliest known forged-token activity dates to May 15, 2023, about two years after the signing key was accidentally moved into the debugging environment.
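The redaction step in the timeline above failed silently, so nothing downstream knew the dump still held a key. Below is a minimal sketch of an independent check that scans a dump for key material before it is allowed to leave production. Microsoft has not published how its redaction or scanning works, so the patterns and the function names (find_key_material, safe_to_export) are our own hypothetical choices. Key material in a real memory dump may also be raw bytes rather than PEM text, so treat this as an illustration of a fail-closed gate, not a complete scanner.

```python
import re

# Hypothetical patterns for key material; a real scanner would need many more,
# including checks for raw (non-PEM) keys held in memory.
KEY_PATTERNS = [
    re.compile(rb"-----BEGIN (?:RSA |EC |ENCRYPTED )?PRIVATE KEY-----"),
    re.compile(rb"-----BEGIN OPENSSH PRIVATE KEY-----"),
]


def find_key_material(dump: bytes) -> list[int]:
    """Return the byte offsets of every match of a known key pattern."""
    offsets = []
    for pattern in KEY_PATTERNS:
        offsets.extend(match.start() for match in pattern.finditer(dump))
    return sorted(offsets)


def safe_to_export(dump: bytes) -> bool:
    """Fail closed: the dump may leave production only if nothing matched."""
    return not find_key_material(dump)


# Made-up dump contents for illustration only.
clean_dump = b"\x00\x01stack frame data\x00"
leaky_dump = b"\x00heap\x00-----BEGIN RSA PRIVATE KEY-----\nMIIE...\x00"

print(safe_to_export(clean_dump))  # True
print(safe_to_export(leaky_dump))  # False
```

The point of the design is that the export decision depends on a check that runs after redaction and independently of it, so a bug in the redaction step alone is not enough to leak the key.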

Commentary

While there are still many unanswered questions about the breach, the part that is most striking to me is how much Microsoft did right while still falling victim to this attack. Let's take a look at the primary security controls described in Microsoft's article, which should have prevented this attack:

Microsoft has separate debugging and production environments.
Production environments can only be accessed via dedicated secure access workstations and dedicated administrative users.
Microsoft automates the removal of secrets from production data before moving it into non-production environments.

These three controls significantly reduce the likelihood of an attacker getting access to sensitive key material in production. In fact, if all three of these controls work properly, it should not be possible for an attacker to gain access to these secrets without compromising both an administrative account and a secure access workstation.

But there's the rub - a single defense-in-depth control failed, and that failure compromised the entire zero-trust security architecture Microsoft has implemented to protect these secrets. In the end, this event boiled down to a secret scanner that failed.

So what is the takeaway here? Review your secret scanners?

Sure, you can do that. But far more importantly, review your defense-in-depth strategy. This event shows that a "zero trust architecture" is not by itself enough to prevent attacks - zero trust controls can also fail. So what happens if one of your controls fails? How many failed controls does it take for an attacker to gain access to your critical assets?

In this case, let's take a look at a few more controls Microsoft could have had in place to prevent this:

1. Hardware Security Module (HSM) - MSA signing keys are extremely sensitive assets for Microsoft. They are used to perform cryptographic operations that are central to end-user authorization. HSMs are made for this exact use case: the key material does not need to leave the HSM. The HSM simply performs cryptographic operations on its inputs using the key it stores, and returns the outputs.

If an HSM were used in production, the crash dump on the consumer signing system should not have contained a key in the first place, because the key should only be processed by the HSM. As a secondary defense-in-depth control, using an HSM should have prevented this event from occurring.
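To make the HSM boundary concrete, here is a minimal sketch of the interface idea in Python. It is a simulation only: SimulatedHsm and SigningService are hypothetical names, HMAC stands in for the asymmetric signature a real token issuer would use, and in this toy everything lives in one process. With a real HSM the key sits in a separate device, so a crash dump of the signing service would contain a key ID but no key bytes.

```python
import hashlib
import hmac
import secrets


class SimulatedHsm:
    """Stand-in for an HSM: keys are created inside and never returned."""

    def __init__(self) -> None:
        self._keys: dict[str, bytes] = {}

    def generate_key(self, key_id: str) -> None:
        # The key is generated inside the module; callers only keep the ID.
        self._keys[key_id] = secrets.token_bytes(32)

    def sign(self, key_id: str, message: bytes) -> bytes:
        # HMAC-SHA256 is an assumption made to keep the sketch stdlib-only.
        return hmac.new(self._keys[key_id], message, hashlib.sha256).digest()

    def verify(self, key_id: str, message: bytes, signature: bytes) -> bool:
        return hmac.compare_digest(self.sign(key_id, message), signature)


class SigningService:
    """Holds a key ID and a handle to the module, never the key bytes."""

    def __init__(self, hsm: SimulatedHsm, key_id: str) -> None:
        self._hsm = hsm
        self.key_id = key_id

    def issue_token(self, claims: bytes) -> bytes:
        return self._hsm.sign(self.key_id, claims)


hsm = SimulatedHsm()
hsm.generate_key("consumer-signing")  # hypothetical key ID
service = SigningService(hsm, "consumer-signing")

signature = service.issue_token(b"user=alice")
print(hsm.verify("consumer-signing", b"user=alice", signature))    # True
print(hsm.verify("consumer-signing", b"user=mallory", signature))  # False
```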

2. Key rotation and certificate validation - In a blog post by Wiz, security researchers identified that the key was created as early as April 2016 and expired in April 2021. The earliest known use of this key by Storm-0558 was in May 2023, which indicates that Outlook was accepting tokens signed by an expired key. This is another failed security control, this time in the application code of Outlook online, which should not accept tokens signed by expired keys.

In addition, five years is a long time for a highly sensitive key to be active. For keys this sensitive, Microsoft could rotate at least yearly to prevent the use of old, lost keys, such as the one used in this attack.
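Both checks are cheap to express in code. The sketch below shows a validity-window check that a token validator could run before trusting a signing key, plus a simple rotation-age check. The SigningKey type, the function names, and the key ID are hypothetical, signature verification itself is left out, and the exact days of the month are placeholders because only the months (April 2016 and April 2021) are public in the sources above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    not_before: datetime  # start of the validity window
    not_after: datetime   # expiry


def key_is_current(key: SigningKey, now: datetime) -> bool:
    """A key is usable only inside its validity window."""
    return key.not_before <= now <= key.not_after


def needs_rotation(key: SigningKey, now: datetime,
                   max_age: timedelta = timedelta(days=365)) -> bool:
    """Flag keys that have been active longer than the rotation policy allows."""
    return now - key.not_before >= max_age


def accept_token(key_id: str, trusted_keys: dict[str, SigningKey],
                 now: datetime) -> bool:
    """Reject tokens signed by unknown or expired keys (signature check omitted)."""
    key = trusted_keys.get(key_id)
    return key is not None and key_is_current(key, now)


# Placeholder dates: only the months are known from public reporting.
old_key = SigningKey(
    key_id="msa-2016",
    not_before=datetime(2016, 4, 1, tzinfo=timezone.utc),
    not_after=datetime(2021, 4, 1, tzinfo=timezone.utc),
)
trusted = {old_key.key_id: old_key}

print(accept_token("msa-2016", trusted, datetime(2020, 1, 1, tzinfo=timezone.utc)))   # True
print(accept_token("msa-2016", trusted, datetime(2023, 5, 15, tzinfo=timezone.utc)))  # False
```

With a check like this in the validation path, a token presented in May 2023 and signed by a key that expired in April 2021 would be rejected regardless of how the key was obtained.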

What should we do?

The most important thing we can learn from this event is how volatile our security posture can be. To avoid falling victim to an event such as this, consider going back to the basics - threat model your infrastructure and understand what your security dependencies are. Where are your critical assets? What security controls are in place to prevent them from being compromised? How many “layers” of controls would have to fail for that to happen? Are your assumptions about the controls you have implemented correct?
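One lightweight way to start answering the "how many layers" question is to write the paths to an asset down as data and count what is left when a control fails. The sketch below does that. The control names are our own shorthand, loosely based on the controls discussed in this post, and the model is deliberately simplistic: every path is just a list of controls an attacker must defeat.

```python
def remaining_layers(paths: list[list[str]], failed: set[str]) -> int:
    """Smallest number of working controls left on any path to the asset."""
    return min(
        sum(1 for control in path if control not in failed)
        for path in paths
    )


# Hypothetical model: each inner list is one route to the signing key and
# names the controls that must all fail for that route to work.
paths_to_signing_key = [
    ["secure access workstation", "administrative account"],
    ["crash dump redaction", "engineer account security"],
]

print(remaining_layers(paths_to_signing_key, set()))                     # 2
print(remaining_layers(paths_to_signing_key, {"crash dump redaction"}))  # 1
print(remaining_layers(
    paths_to_signing_key,
    {"crash dump redaction", "engineer account security"},
))                                                                       # 0
```

A result of 1 means a single further failure exposes the asset, and 0 means it is already exposed. Re-running a model like this whenever a control is added, removed, or found to be broken keeps the documented assumptions honest.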

Oftentimes, it's the basics that are missing from our security architecture. Review and document your expected defense-in-depth controls, and validate them on an ongoing basis.