Takeaways from the CrowdStrike Incident
What happened?
Almost a month ago, in what security consultant Troy Hunt (Have I Been Pwned) described as “basically what we were all worried about with Y2K,” CrowdStrike shipped a flawed configuration update to its Falcon Sensor at 04:09 UTC on Friday, July 19, roughly nine minutes after midnight on the US East Coast. The update remained live until 05:27 UTC, roughly an hour and 20 minutes after the initial release, and affected a reported 8.5 million Windows devices, according to Microsoft. That number may be an undercount, however, as it depended upon crash reports being sent to Microsoft. The sensor configuration update included a broken channel file (a metadata file used to group customer systems for specific software updates) that caused an out-of-bounds memory read, triggering an unhandled exception on Windows systems with the sensor installed.
The Falcon Sensor, part of CrowdStrike Falcon’s Endpoint Protection Platform capabilities, operates at the kernel level to protect baseline Windows processes. The channel file at issue, C-00000291.sys, “controls how Falcon evaluates named pipe execution on Windows systems,” according to CrowdStrike. “Named pipes are used for normal, interprocess or intersystem communication in Windows.” The update was pushed to address new malicious actor abuse of named pipes, but a logic error instead caused a Blue Screen of Death (BSOD) and endless reboot loops on affected systems. According to CrowdStrike’s Preliminary Post Incident Review, the bug slipped through because of a separate bug in CrowdStrike’s in-house content validation tool, which is used to verify that an update will work, allowing the flawed content to pass review.
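As a rough illustration of the failure class, and emphatically not CrowdStrike’s actual code, the Python sketch below shows what can happen when a parser trusts a count field inside a configuration blob and uses it to index memory. In user space this surfaces as a catchable exception; in kernel mode, the equivalent invalid memory access crashes the entire machine.

```python
import struct

# Toy "channel file" layout (hypothetical): a 4-byte record count followed by
# fixed-size 8-byte records.
good = struct.pack("<I", 2) + b"A" * 16   # header says 2 records, carries 2
bad = struct.pack("<I", 5) + b"A" * 16    # header says 5 records, carries only 2

def parse(blob: bytes):
    (count,) = struct.unpack_from("<I", blob, 0)
    records = []
    for i in range(count):
        # Trusting `count` lets the read offset run past the end of the buffer,
        # the user-space analogue of an out-of-bounds memory read.
        records.append(struct.unpack_from("<8s", blob, 4 + 8 * i)[0])
    return records

parse(good)  # parses cleanly
parse(bad)   # raises struct.error here; in a kernel driver, the same class of
             # invalid access crashes the system (BSOD) instead of raising a
             # recoverable exception
```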
The affected release channels included CrowdStrike data centers in EU-1, US-1, US-2, US-GOV-1, and US-GOV-2. While the impact of the error was felt globally, the three countries most affected appear to be Australia, the United Kingdom, and the United States. The update was issued just after midnight on the US East Coast. The UK is five hours ahead, so the update was live between 5:09 and 6:27 am there, while Australia is 12 to 14 hours ahead of the US East Coast, putting the window between roughly noon and 3:30 pm local time, in the middle of the workday. This time difference is key: far more Australian systems were online and receiving updates while the broken file was available, whereas the UK and US had far fewer systems online at that hour.
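To make the time zone arithmetic concrete, here is a short Python check converting the 04:09 to 05:27 UTC window into local times:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Window during which the faulty channel file was live, per CrowdStrike.
start = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
end = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)

for tz in ("America/New_York", "Europe/London", "Australia/Perth", "Australia/Sydney"):
    zone = ZoneInfo(tz)
    print(f"{tz:18} {start.astimezone(zone):%H:%M} - {end.astimezone(zone):%H:%M}")

# America/New_York   00:09 - 01:27
# Europe/London      05:09 - 06:27
# Australia/Perth    12:09 - 13:27
# Australia/Sydney   14:09 - 15:27
```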
The issue hit the United States hard, with 911 systems down in parts of Alaska, Arizona, Florida, Iowa, Indiana, Kansas, Michigan, Minnesota, New York, Ohio, Pennsylvania, and Virginia, and down across all of New Hampshire and most of Oregon. It caused traffic delays at US-Canada border crossings, slowed public transit systems, and disabled vehicle tracking in Boston, New York, and Washington, D.C. Banks including Chase, Bank of America, Wells Fargo, Capital One, U.S. Bank, TD Bank, and Charles Schwab were all impacted, with many services inaccessible. Healthcare systems were also heavily affected: many hospitals canceled all non-urgent surgeries and visits, and some, including the Memorial Sloan Kettering Cancer Center in New York, canceled all procedures that required anesthetic.
In the UK, systems at Edinburgh Airport, Gatwick Airport, and London Heathrow were affected, and the National Health Service (NHS) was unable to handle general practitioner (GP) appointments, outpatient care, or access to patient records. Airlines were among the hardest hit, with United, Delta, American, and Allegiant all issuing ground stops. American was the fastest to recover, using resiliency systems built for weather delays: it canceled 408 flights on Friday but only 50 on Saturday. Delta was the hardest hit, canceling 1,207 flights on Friday and 1,208 on Saturday, with delays persisting until the following Wednesday and 4,961 flights canceled overall.
It could happen to anyone, and with growing centralization, it WILL happen again.
While some reporting has debated how to apportion blame between CrowdStrike and Microsoft, it misses a key part of the bigger picture: not only is an outage of this magnitude possible, it will likely happen again. CrowdStrike is the second-largest cybersecurity company in the United States, according to the New York Times. Its own website boasts that its customer base includes “298 of the Fortune 500” and “538 of the Fortune 1,000.” More than half of the largest companies in the United States by annual revenue use CrowdStrike. Microsoft Windows is the most ubiquitous desktop operating system in the world, with a global market share of 72%. While CrowdStrike also has offerings for macOS and Linux-based systems, neither was impacted in this incident; Apple, notably, stopped offering kernel-level access to third-party developers in 2020.
In a statement to the Wall Street Journal, Microsoft argued that a December 2009 agreement with the European Commission, intended to support competition by third-party security vendors, also played a role in enabling this incident by allowing those vendors access to the same kernel-level APIs, access that Apple does not permit. Microsoft also noted that part of the reason this issue was so widely damaging was the ubiquity and interconnectedness of specific software around the world. “This incident demonstrates the interconnected nature of our broad ecosystem — global cloud providers, software platforms, security vendors and other software vendors, and customers,” it said in its statement.
Others have made the case that this incident was due at least in part to the global ubiquity of CrowdStrike and Microsoft. “This was not, however, an unforeseeable freak accident, nor will it be the last of its kind,” wrote Brian Klaas in The Atlantic. “Instead, the devastation was the inevitable outcome of modern social systems that have been designed for hyperconnected optimization, not decentralized resilience.”
The RAND Corporation’s Jonathan Welburn wrote an op-ed in the Wall Street Journal titled “CrowdStrike is too big to fail,” drawing parallels between the systemic importance of tech companies like CrowdStrike and the systemically important banks at the center of the 2008 financial crisis. In that piece, Welburn points to the Cybersecurity and Infrastructure Security Agency’s (CISA) April 30 memo about the development of a list of Systemically Important Entities (SIE), which would be “organizations that own, operate, or otherwise control critical infrastructure…[that could] cause nationally significant and cascading negative impacts to national security (including national defense and continuity of Government), national economic security, or national public health or safety.”
CrowdStrike and Microsoft would certainly both be on such a list. Arguably, all security companies that operate at the kernel level and with the reach of a company like CrowdStrike should be.
None of this is to say that CrowdStrike is fully at fault in this case. To be sure, the process improvements detailed in its preliminary report, and certain to be detailed further in a finalized one, are important next steps. But it is unfair to argue that CrowdStrike alone holds the blame for a failure so inherent to the system we live in, particularly given past incidents, from the Mirai botnet DDoS attack that shut down Dyn and thereby everything from Twitter and Reddit to Netflix and CNN, to the malware and ransomware attacks on companies like Maersk and Colonial Pipeline. Rather, CrowdStrike is the latest example of a growing problem inherent in the trust we place in an imperfect system online.
A lack of safeguards on automation, and trust in third-party vendors
These issues of trust lead directly to how CrowdStrike could have mitigated the impact of this flawed update. By rolling it out to all of its customers simultaneously, relying only on a code-validation testing tool rather than deployment testing, CrowdStrike affected a much broader swath of systems than it had to.
A retrospective by Bruce Schneier and Barath Raghavan for Lawfare argued that the scope of the problem would have been far smaller had CrowdStrike traded some of the speed with which it deploys the latest protection updates to clients for additional validation. “Last week’s update wouldn’t have been a major failure if CrowdStrike had rolled out this change incrementally: first 1 percent of their users, then 10 percent, then everyone,” they write. They acknowledge that doing so costs engineering time and slows the rollout of fixes, but the tradeoff goes further: such an approach deliberately leaves some client systems vulnerable to attack for longer. This is the essential tradeoff of automating and scaling deployment of protection at such a fundamental software level, at such a scale.
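A minimal sketch of the incremental rollout Schneier and Raghavan describe might look like the following; the host inventory, push_update call, and health check are hypothetical placeholders, not any vendor’s actual pipeline.

```python
import random

# Hypothetical fleet; in practice this would come from fleet-management tooling.
hosts = [f"host-{i:05d}" for i in range(100_000)]
random.shuffle(hosts)

RINGS = [0.01, 0.10, 1.00]  # 1% of hosts, then 10%, then everyone

def push_update(ring):
    """Placeholder for the actual deployment call to the listed hosts."""
    pass

def healthy(deployed):
    """Placeholder health gate: compare crash and telemetry rates for updated
    hosts against a baseline before promoting the release to the next ring."""
    return True

deployed = 0
for fraction in RINGS:
    target = int(len(hosts) * fraction)
    ring = hosts[deployed:target]
    print(f"Deploying to {len(ring)} hosts ({fraction:.0%} cumulative)")
    push_update(ring)
    deployed = target
    if not healthy(hosts[:deployed]):
        print("Regression detected; halting rollout and rolling back.")
        break
```

The health gate between rings is exactly the tradeoff the authors describe: each pause buys confidence in the update at the cost of leaving the not-yet-updated rings exposed to the new threat a little longer.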
Automation is not inherently bad. It makes sure that systems are prepared for attack, even when the IT team is literally sleeping. But it comes with a requirement of trust. Trust in CrowdStrike that they will fix threats before they can be exploited. Trust in CrowdStrike that their update will prevent malicious actors from successful exploitation. In this case, trust in CrowdStrike that the update will not cause other problems. To be clear, CrowdStrike did not err in trying to protect its clients from a new method of attack. They were doing their job. But the way CrowdStrike did so is where the danger lay, and this time, where they failed to execute successfully.
This isn’t a flaw in CrowdStrike so much as in the broader security environment. Security companies have an incentive to encourage customers to buy more of their tools, attempting to become a one-stop shop for every client need. The problem arises when this encourages an all-the-eggs-in-one-basket strategy, in which redundancy and safeguards are reduced or eliminated because of the trust that must be placed in a single vendor.
Agent and agentless systems, trust, and how to maintain it
Finally, this trust issue ties into the agent-based model through which CrowdStrike and other endpoint detection and response (EDR) services operate. CrowdStrike’s Falcon Sensor, the system that broke on Friday, is one such agent. To do its job of detecting malware, blocking exploits, and finding indicators of attack, CrowdStrike’s agent needs to be on every system being protected. But that strength can become a weakness: every system with the Falcon Sensor installed was vulnerable to this flawed update, and all that were online to receive it were affected, 8.5 million of them. As we touched on above, Microsoft has associated the problem with its 2009 agreement with the European Commission, arguing that had CrowdStrike only been able to run in user space, rather than at the kernel level, such an incident could not have occurred. Some have pointed to a kernel panic that hit Linux machines running CrowdStrike’s Falcon Sensor in April, on some Red Hat and Debian distributions, but it appears that the cause there was actually a Linux kernel bug encountered while the sensor was operating in user space. Rather than undercutting Microsoft’s claim, that episode seems to lend credence to it, as does the fact that macOS devices, which have not allowed third-party kernel access since 2020, were not affected.
While there will always be a need for agent-based models like CrowdStrike’s, there is potential for change in the wake of this incident. We may see a push by Microsoft to roll back its 2009 agreement in order to reclaim exclusive access to kernel-level operations, or a shift in how much security teams are willing to trust third-party vendors to operate in that space, moving instead to a user-space API model. Another alternative exists as well: rather than installing a tool like the Falcon Sensor on every system, organizations can rely on agentless detection and remediation capabilities in cases where that is a possibility.
Both options, agent-based and agentless, require a significant amount of trust: either in installing software from a trusted third-party vendor at an incredibly deep level across the enterprise, or in trusting a third party like SixMap to remotely access a system and deploy a patch to a production environment without using agents. When we look back on this incident five years from now, it remains to be seen whether there will have been a shift away from kernel-level access for security products, and how much the security landscape will have evolved in terms of agent versus agentless approaches.