On Friday, July 19, 2024 at 04:09 UTC, as part of regular operations, CrowdStrike released a content configuration update for its Windows-based “Falcon Sensor” to gather telemetry on possible novel threat techniques. Unfortunately, the update was defective and caused Windows to crash and restart with the infamous “blue screen of death,” a common occurrence when Windows detects a defective driver operating in “kernel mode” that could potentially damage the Windows OS. The blue screen of death persists until the problem is remedied.

Almost all virus and threat scanners must operate in kernel mode (as opposed to “user mode,” where applications run) to be able to protect the OS from potential viruses and threats. This means that a problematic threat scanner driver or its definition/configuration file could accidentally keep a Windows computer from rebooting after downloading an update. Removing the defective definition/configuration file or disabling the scanner/driver can temporarily fix the issue, allowing a Windows computer to restart, albeit without protection, until a new definitions file is distributed by the vendor, in this case, CrowdStrike.

In this situation, it was not a virus or threat that caused the failed reboots of millions of Windows computers, both clients and servers: it was a defective CrowdStrike update. And the worst part of it was that IT technicians had to fix the problem by physically being on location to log into the machines in “Administrator Mode.” Using Remote Desktop was not an option, as Windows has to be in full operation to run that service. The same applied to Windows virtual machines operating in datacenters, which needed the same personal attention: One. By. One. This is why it took so long to recover machines.

FOOD ENGINEERING asked experts from Tenable, an exposure management platform catering to industrial systems, for comments.

FE: How can Windows industrial users protect their equipment from “accidental” updates that shut down systems?

Scott Caveza, staff research engineer, Tenable: Accidental updates like this are fortunately quite rare. While we’ve observed bad Windows operating system updates that have crashed systems or caused reboot loops in some circumstances, these issues are still rare. To combat this, some organizations will test updates on non-critical systems to verify that the updates do not cause any inadvertent issues. However, delaying the patching of systems can introduce more risk, especially when the updates are security patches. Taking a “defense-in-depth” approach is the best option, and part of that involves ensuring that planning for outages and recovery is part of the playbook for all organizations.

FE: How should industrial users vet cybersecurity problems to avoid failures of this nature in the future? 

Caveza: In an industrial environment, uptime is mission critical and downtime can incur major costs. The question becomes, does the impact from a cyberattack or crashed systems from other technical issues have more impact? In the CrowdStrike scenario, recovery was relatively quick for those impacted, once guidance was released. A cyberattack could result in more downtime as incident response processes take over. While both incur significant operational costs, it’s important to remember that a cyberattack could be exponentially worse. To that end, organizations need to plan for both contingencies and have plans and processes in place to understand what recovery looks like and how it will be performed. While it’s impossible to predict if a bad update will plague your security solution or block data flows inadvertently, time is better spent on planning for the next outage and understanding how to recover systems quickly and effectively.

FE: Can security systems like CrowdStrike be set up to run outside of kernel mode (in user mode) and still protect a machine?

Satnam Narang, senior staff research engineer, Tenable: There are some security solutions that don’t utilize kernel mode but those are limited to other types of solutions. Most Endpoint Detection and Response (EDR) solutions require kernel-mode access and it is, unfortunately, the nature of the beast when it comes to EDR. 

FE: What options can be set up to make it easy to bypass a bad “virus/malware definitions update,” which shuts down a driver, in turn shutting down Windows? Does a Windows user really want to reboot a dozen times, hoping that Windows OS might recognize the problem and disable the driver automatically?

Narang: The onus starts with more thorough testing by the vendors prior to deploying these updates. It may also require organizations to conduct additional testing of said updates themselves. There are also a variety of lessons to be learned from what happened as a result of the bad/faulty update. 

FE: What are some of the lessons to be learned? Would one of them be having a successful backup ready for unforeseen circumstances?

Narang: Plan for the unexpected. Unexpected interruptions, whether from a cyberattack or a failed update, are both continuity-impacting incidents. Planning for one helps build resilience for the other. Identification of your mission-critical assets and how you would recover and restore functionality as quickly as possible should be core to the plan. Once a plan is established, the next crucial step is to actually test it. Tabletop the backup plan and treat it as an actual incident. There’s much to learn for each environment and, despite having a well-documented plan, it means nothing if it’s not been put through its paces. The final step is to remember that the backup and recovery plan is a living process. It will require continued maintenance and testing, just as with the systems themselves.

For more information from CrowdStrike, see its Tech Info: “Remediation And Guidance Hub: Falcon Content Update For Windows Hosts” on its website. More information on Tenable can be found at tenable.com.

About the interviewees

Scott Caveza joined Tenable in 2012 as a research engineer on the Nessus Plugins team. Over the years, he has written hundreds of plugins for Nessus and reviewed code for even more as a team lead and manager of the Plugins team. Having previously led the Security Response team and the Zero Day Research team, Caveza is currently a member of the Security Response team, helping the research organization respond to the latest threats. He has more than a decade of experience in the industry, with previous work in the security operations center (SOC) for a major domain registrar and web hosting provider. Caveza is a current CISSP and actively maintains his GIAC GWAPT Web Application Penetration Tester certification.


Satnam Narang joined Tenable in 2018. He has more than 15 years’ experience in the industry, including at M86 Security and Symantec. He contributed to the Anti-Phishing Working Group, helped develop a social networking guide for the National Cyber Security Alliance, uncovered a huge spam botnet on Twitter and was the first to report on spam bots on Tinder. He’s appeared on NBC Nightly News, Entertainment Tonight, Bloomberg West, and the Why Oh Why podcast.