The CrowdStrike bug: A technical explanation for the non-technical
Oh the irony. On 19th July 2024 the CrowdStrike software sold to prevent cyber-attacks caused what many have called the world's worst IT outage. Its range and devastation were close to that feared from the ultimately non-event millennium bug, taking down businesses from airlines and TV stations to health systems.
What happened (in layman's terms)?
The CrowdStrike security tool Falcon was updated with a slight mistake - like a typo; the lack of a super-simple line of code that checks whether something exists before trying to access it.
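For readers who want to see what such a check looks like, here is a minimal sketch in C. It is an illustration only: the `rule` struct and the function names are invented for this example and are not taken from CrowdStrike's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record, invented for this example. */
struct rule {
    int id;
    int severity;
};

/* Unsafe: assumes the rule exists. Passing NULL here would crash. */
int severity_unchecked(const struct rule *r) {
    return r->severity;
}

/* Safe: the one-line check happens before the access. */
int severity_checked(const struct rule *r) {
    if (r == NULL) {
        return -1; /* report "no such rule" instead of crashing */
    }
    return r->severity;
}

/* Usage sketch: both calls return normally. */
void guard_demo(void) {
    struct rule r = { 7, 3 };
    assert(severity_checked(&r) == 3);
    assert(severity_checked(NULL) == -1);
}
```

The only difference between the two functions is the `if (r == NULL)` line, which is the kind of "does it exist?" check described above.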
Falcon, like Windows itself and most low-level code even today in 2024, is written in the 1972 programming language C (or rather, its 1985 update C++). C comes from a time when programs could access (or address) any part of a machine's memory. Modern operating systems continually check for anything attempting to access parts of memory not belonging to it (so your word processor cannot access your banking app etc). As soon as a user program strays outside of its boundaries the operating system simply kills it.
However, if a piece of system software (such as the kind CrowdStrike makes) breaks out of its box the OS has no option but to think of itself as broken; that's when you get the Blue Screen of Death.
So what went wrong with that?
The memory of a computer is just a huge list of numbered slots, from 0 to e.g. 4,294,967,295 for a 4GB system, each of which can contain a single small number (0 to 255). So a program might be allocated say slots 1,000,000 to 1,200,000 to work within, and anything of substance will need a few slots. Even the word HELLO needs 5 slots - 1 for a number representing each letter.
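The "numbered slots" picture can be checked directly in C. The sketch below is illustrative only: the helper names are invented here, and it assumes the usual ASCII numbering of letters (where H is 72).

```c
#include <assert.h>
#include <stddef.h>

/* How far apart two slots are, counted in slots (bytes). */
ptrdiff_t slots_between(const char *first, const char *last) {
    return last - first;
}

/* The number stored in one slot: each letter is kept as a small integer. */
int number_in_slot(const char *slot) {
    return (unsigned char)*slot;
}

/* Usage sketch: five letters held in five neighbouring slots. */
void hello_demo(void) {
    const char word[5] = { 'H', 'E', 'L', 'L', 'O' };
    assert(sizeof word == 5);                       /* 5 slots in total */
    assert(slots_between(&word[0], &word[4]) == 4); /* consecutive addresses */
    assert(number_in_slot(&word[0]) == 72);         /* 'H' is 72, assuming ASCII */
}
```

In everyday C a string usually takes one extra slot for an end-of-text marker, so HELLO would occupy 6 slots in practice; the 5-slot version is kept here to match the description.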
In C, slot 0 is special, meaning "null" or empty - something doesn't exist (or hasn't been allocated any space yet). The sleuths who pulled apart the CrowdStrike error noticed it was trying to access memory address 156 (a very low address, and clearly invalid for the running CrowdStrike program), so it was probably trying to get a sub-part of an object that didn't exist: the part that would sit 156 slots on from the start of the object, counted from null instead of from a real address. That's why Windows said no.
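Why a null object would lead to an address like 156 can be shown with a little arithmetic. The layout below is purely hypothetical, invented so that the wanted field sits 156 slots into the object; the real Falcon data structures are not public in this form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout, invented for illustration: 156 bytes of other
   fields, followed by the field the code wants to read. */
struct big_object {
    unsigned char other_fields[156];
    unsigned char wanted;
};

/* Address the program would touch when reading `wanted` from an object
   that starts at `base`. Pure arithmetic: nothing is dereferenced. */
size_t address_of_wanted(size_t base) {
    return base + offsetof(struct big_object, wanted);
}

/* Usage sketch: a real object versus one that "starts" at null (0). */
void offset_demo(void) {
    assert(address_of_wanted(1000000) == 1000156); /* inside the program's range */
    assert(address_of_wanted(0) == 156);           /* the tell-tale low address */
}
```

A valid object starting at slot 1,000,000 gives a perfectly reasonable address, while an object that does not exist (starting at 0) gives 156, which matches the pattern the sleuths noticed.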
What can be done in future?
There is a modern successor to C called Rust that offers essentially the same low-level control and verbosity (taking more or less the same amount of code to write something) but checks memory access itself to prevent these types of things.
But also testing. Why was it not tested? Or not tested successfully?
CrowdStrike is a hack. It is attempting to change the functionality of the host system. I suspect this event will make IT departments think twice about the sensibleness of that approach. It may also ultimately prove an existential event for CrowdStrike itself.
Other plausible explanations (for those with a technical background)
A second hypothesis exists: that Windows is uniquely stupid in allowing storage device drivers to be paged out to the swap file.
The Falcon Sensor inserts itself between the file system and the hardware driver (so it can monitor all reads/writes from kernel space) and could have been paged out, leading to the catch-22 situation of "how do I get the driver responsible for reading the page file, out of the page file".