Saying "Have you tried turning it off and on again?" is getting tiresome
July 19, 2024 4:27 AM   Subscribe

Evesham Journal: Live: Worcestershire hit by global IT outage. BBC Live: Worldwide travel and banking hit after cybersecurity update causes IT chaos. Guardian Live: Global IT outage live: software update causes chaos with transport, banks and businesses. Daily Mirror: Emergency Cobra meeting held as NHS and airports hit by global IT outage. Online service stability checker.
posted by Wordshore (84 comments total) 13 users marked this as a favorite
 
According to Crowdstrike, Mac and Linux hosts are not impacted.
posted by needled at 4:34 AM on July 19 [6 favorites]


The fun part is that in many circumstances it isn't fixable remotely, so end users will have to navigate the windows recovery environment to delete C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys. Have fun talking the average WFH employee through that.
posted by Klipspringer at 4:35 AM on July 19 [6 favorites]
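For the curious, the whole manual fix really is removing that one file. Here is a minimal illustrative sketch in Python of the same match-and-delete step - not an official remediation tool, and in practice you would do this by hand from Safe Mode or the recovery console, since a machine stuck in a boot loop is not going to run a script for you. The directory and wildcard are the ones quoted above.

from pathlib import Path

# Directory and filename pattern from the published workaround.
CS_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_CHANNEL_GLOB = "C-00000291*.sys"

def remove_bad_channel_files(dry_run: bool = True) -> list[Path]:
    """Find, and optionally delete, the faulty channel file(s)."""
    matches = sorted(CS_DRIVER_DIR.glob(BAD_CHANNEL_GLOB))
    for path in matches:
        print(("Would delete: " if dry_run else "Deleting: ") + str(path))
        if not dry_run:
            path.unlink()
    return matches

if __name__ == "__main__":
    # Defaults to a dry run; pass dry_run=False only once you are sure.
    remove_bad_channel_files()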


Haha! My decision to not use any anti-malware is vindicated!
posted by Molesome at 4:41 AM on July 19 [5 favorites]


There's probably meetings happening right now in the editorial rooms of the Daily Express and Daily Mail, with the topic being "How can we blame this on Keir Starmer in a way our readership will find plausible?"
posted by Wordshore at 4:41 AM on July 19 [10 favorites]


Their ads that replaced the obnoxious Prime Day ones on Prime Video didn’t have long to age poorly!
posted by Captaintripps at 4:48 AM on July 19 [1 favorite]


The telecom company that laid me off a few weeks ago seems to be particularly hard-hit, to which I say HAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHA
posted by DirtyOldTown at 4:54 AM on July 19 [41 favorites]


How can we blame this on Keir Starmer...
Just x days into the Starmer government, essential UK air traffic has ground to a halt and stranded y UK citizens with no way to get to their destinations safely. Observers are watching the death toll closely. Under Labour, airline delays are already up z percent, and there's no way to be sure these delays won't increase forever.
posted by pracowity at 4:54 AM on July 19 [2 favorites]


The scale of failure here and the remediation problems are going to be wild. No incremental rollout, no rollback strategy, nothing? Please walk to all your computers to log in as administrator and uninstall our software by clicking things?

Bonkers.
posted by mhoye at 4:54 AM on July 19 [9 favorites]


I got a notification from NYC's emergency alert app about this a few hours ago. Imagine fucking up your deploy so bad it's reported through the same system as natural disasters.
posted by phooky at 4:54 AM on July 19 [10 favorites]


Sorry. Accidentally hit "post" too soon. Continuing my train of thought: HAHAHAHAHAHAHAHA.
posted by DirtyOldTown at 4:55 AM on July 19 [48 favorites]


CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts.
https://www.crowdstrike.com/blog/statement-on-windows-sensor-update/

Oh well that's not so bad then, imagine if they had all gone wrong.
posted by Lanark at 5:02 AM on July 19


This is a much larger scale than some mere natural disaster. This was a planetary event impacting millions and costing billions, probably trillions.

It doesn't seem to have caused any direct loss of human life, and the execs and CEOs who put all this culture in place for one engineer and his low-level manager to fuck up this much will be sure to pay NOTHING.

But that's just capitalism.
posted by Comstar at 5:02 AM on July 19 [10 favorites]


Let's see how this affects:

* The MS Teams-powered virtual job interview I have at noon today, and
* My roommate's flight home from Germany this afternoon.
posted by EmpressCallipygos at 5:06 AM on July 19 [3 favorites]


It's a side-issue but it's driving me up the wall just how bad some of the reporting on this - the Guardian's in particular - is.

Like, CrowdStrike's CEO has come out and said "Yes, it was us, we pushed a bad update, we did this" and still all the fucking stories are "MICROSOFT UPDATE CLOSES HOSPITALS".
posted by parm at 5:07 AM on July 19 [8 favorites]


Who needs cyber crime when we’ve got cybersecurity professionals (and, occasionally, guys with backhoes)?
posted by GenjiandProust at 5:07 AM on July 19 [4 favorites]


You can tell it wasn't a Russian op because it didn't happen on election day
posted by seanmpuckett at 5:10 AM on July 19 [1 favorite]


For clarity:

This was caused by an update to a piece of third-party security software. It affects Windows (mostly desktop) computers that are specifically running the CrowdStrike endpoint security software, because those are the computers CrowdStrike endpoint security software runs on.

Saying this is a Microsoft problem is broadly the same as reporting that a chain of gas stations who suddenly started filling your tank with sand was a Ford/GM problem.

(the fallout from this is going to be a fucking nightmare, be kind to your IT folks, they're going to be hella stressed right now)
posted by parm at 5:10 AM on July 19 [10 favorites]


The local tea room has a handwritten sign saying "Sorry cash only because computers". Their afternoon tea is particularly delightful and I'm meeting a moth collector there later, so I'm relieved they haven't closed completely. Just going to have to use some of the cash I'd put by for tombola this weekend, but no hardship. #SconeTime
posted by Wordshore at 5:11 AM on July 19 [16 favorites]


Let's see how this affects:
* My roommate's flight home from Germany this afternoon.


We are flying home from Romania via Germany tomorrow so we have been monitoring this closely. If your roommate's flight is on Lufthansa, they should be fine. They only had issues with their budget domestic subsidiary. Their international flights are all operating normally without interruption.
posted by DirtyOldTown at 5:13 AM on July 19 [3 favorites]


Causing global kernel panic bsods with a midnight Friday update to your cybersecurity software is incredible.
posted by lucidium at 5:15 AM on July 19 [15 favorites]


LOfucknL - this bigass hoity-toity security firm simultaneously pushed out a crap update to their entire worldwide corporate userbase in the middle of the night. LOL
posted by sammyo at 5:15 AM on July 19 [4 favorites]


"Their afternoon tea is particularly delightful and I'm meeting a moth collector"

Wordshore... may I please just come over and .... be you.
posted by sammyo at 5:19 AM on July 19 [19 favorites]


Types into an LLM “We should have a general strike. Stop working!”

LLM replies “Crowdstrike stops working! Great idea! Off to implement!"
posted by lalochezia at 5:23 AM on July 19 [9 favorites]


My cable modem has been on the fritz since a thunderstorm the other day, so access to my job has been entirely through my iphone's hotspot (no VOIP phone, no Teams meetings or calls). Outlook is running slower than if I had an old 14.4k modem attached. And so this global outage that is affecting hundreds of my customers is perfectly timed. I'm barely tapping out one, "I'm sorry, our tech guys are on the case" email before five more complaints come in.
posted by mittens at 5:28 AM on July 19 [1 favorite]


This is a Microsoft problem inasmuch as it is Microsoft Windows that is crashing. I'm sure some people in Redmond will now be wondering if they should limit or ring-fence the access which 3rd party apps have to the system kernel. Perhaps adding some more stringent validation tests to the Code Signing process.
posted by Lanark at 5:30 AM on July 19 [7 favorites]


I’m at a medical procedure this morning and they have a few machines up, but they are doing a paper check-in process in part and the login kiosks are out. Also, I got laid off from a tech job in February where they have like 50,000+ hosts running Crowdstrike, and metrics and dashboards where they slap you if you’re missing it or running an old version. If you don’t know, Crowdstrike is probably THE main intrusion/virus-protection type app run by big companies. Luckily the Linux version seems OK, or this would be wayyy worse, if that’s imaginable.
posted by caviar2d2 at 5:34 AM on July 19 [2 favorites]


Also, at my wife’s medical company (US), all the remote employees are at home with the BitLocker key prompt. Most of them are nurses with low tech literacy.
posted by caviar2d2 at 5:36 AM on July 19


This was caused by an update to a piece of third-party security software. It affects Windows (mostly desktop) computers that are specifically running the CrowdStrike endpoint security software, because those are the computers CrowdStrike endpoint security software runs on.

Also windows servers, which are often running critical bits of glue infrastructure that allow desktop windows computers to access other services, or run those services directly - those have been an even bigger problem I think.

We're not running this software locally thankfully, but at least two of our SaaS cloud providers *are* somewhere in their infrastructure, including our MIS software that is the key source for HR, payroll, and various other depts so they're still down; fortunately, we can survive an outage for a bit as this isn't the first time, but if it drags on beyond a couple of days that'll be fun. And of course the fix needs to be manual per device.

Some hospitals and GP surgeries are down due to a patient records system going offline, airports and various train services are struggling with similar supply-chain issues for handling check-ins, and Sky News *stopped broadcasting* for 3 hours this morning and basically came back running on paper sheets for the presenters! I am so glad I'm not going anywhere near an airport this week.

Whoever designed their systems to allow this push to worldwide production of an immediate BSOD boot-loop crash, without having a test system that would have picked it up first (which is as complicated to detect as RUNNING IT, it appears), deserves to be sacked, and the C-suite executives who probably tried to save money by cutting back on pre-roll testing should be absolutely nailed to the wall. I mean, there's IT cockups, and there's this 'knock vast amounts of critical infrastructure offline globally' effort - and from an anti-malware purveyor, to boot!

Of course, the lucky low-level peon who happened to push the button is likely going to carry the entire can for what is a massive, massive process failure.

Hopefully senior people will pay a bit more attention to their IT teams in the inevitable aftermath meetings when they witter on about 'Disaster Recovery' and 'non-homogenous infrastructure' and 'this could destroy our business if it breaks' rather than immediately reach for the big red Funding Denied stamp as usual.
posted by Absolutely No You-Know-What at 5:37 AM on July 19 [10 favorites]


Crowdstrike's blog post - linked above - says that the issue has been isolated and that a "fix has been deployed". Is there anybody who would like to have a go at explaining how successful they are going to be in deploying that fix - or at explaining the likely scale of the event (the initial reports were from the UK)? I speak as somebody unaware of what CrowdStrike was until today.
posted by rongorongo at 5:41 AM on July 19


Ugh. I hope this gets fixed soon, as my youngest son is flying home from the international Scout Jamboree in Iceland tonight, and has a connection on Porter at Pearson Airport. All Porter flights are cancelled right now because of this. I don’t relish the thought of an unplanned round-trip drive from Ottawa to Toronto.
posted by fimbulvetr at 5:45 AM on July 19


Well, the fix is to stop machines that have not yet deployed the dodgy update from doing so; so if it hadn't killed your windows stuff yet, the fix will stop it from happening to you.

If you're already in a bluescreen loop-of-death though, i.e. windows boots, immediately crashes, and reboots, your only option is to roll back the OS to a backup image or boot into safe mode/recovery mode and remove the offending driver by hand before it'll start again. So uh, not really a fix, more a 'stop the axe murderer rampage we unleashed from axing any more people, sorry about all the ones already missing limbs' type deal.
posted by Absolutely No You-Know-What at 5:46 AM on July 19 [2 favorites]




The sucky thing about being an IT security professional is that no one knows who you are until something goes wrong. Many many IT security people are finally having their day in the sun today.
posted by Tell Me No Lies at 5:53 AM on July 19 [3 favorites]


Also windows servers, which are often running critical bits of glue infrastructure that allow desktop windows computers to access other services, or run those services directly - those have been an even bigger problem I think.

Yes, sorry - I used "desktop PCs" to differentiate from cloud services, as a lot of the misinformed reporting about this was saying "this is the problem with putting everything in the cloud!". It's also affecting VMs, of which there are probably even more than physical machines. "Instances of Windows running CrowdStrike" is the most technically correct.

This is a Microsoft problem inasmuch as it is Microsoft Windows that is crashing. I'm sure some people in Redmond will now be wondering if they should limit or ring-fence the access which 3rd party apps have to the system kernel. Perhaps adding some more stringent validation tests to the Code Signing process.

Microsoft is caught between a rock and a hard place here. If it blocks third-party security software (which, by its very nature, requires incredibly low-level access to the system such that this kind of fuckup is a very plausible outcome) then it gets hit by massive antitrust cases. If it permits it then this kind of thing is always a possibility.

Realistically, your alternative is to turn desktop PCs into iPhones - third party apps are installable only via a controlled app store, don't get root, have to declare what interfaces and system components they want to interact with up-front, etc etc (and honestly - I think that's a good idea for the vast majority of users) - but that's never going to fly for a variety of technical and political reasons. Not least that, with that model, third-party security software simply wouldn't be possible.
posted by parm at 5:56 AM on July 19 [2 favorites]


So my windows laptop (that definitely has crowdstrike) that's been powered down all night might be OK?
posted by paper chromatographologist at 5:56 AM on July 19


Yeah, you're probably fine. You have a backup of your critical data too, right? Just in case...
posted by Absolutely No You-Know-What at 5:58 AM on July 19 [2 favorites]


It's hard to imagine how this could even happen. For Crowdstrike to even operate, they must have pretty robust automated testing of updates on a variety of systems, right? Else they never would have survived to get this big. I hope and assume the Real Dirt will be known soon.
posted by McBearclaw at 5:59 AM on July 19 [1 favorite]


John Scalzi: Great, this Crowdstrike thing means Mac users are going to be even more intolerably smug than they already are, he typed, from his exquisite Space Black Mac Pro 16 inch laptop, which accesses the Internet perfectly
posted by Wordshore at 6:02 AM on July 19 [20 favorites]


Yeah, you're probably fine. You have a backup of your critical data too, right? Just in case...

Nope .... because I don't keep any critical data on it!
posted by paper chromatographologist at 6:02 AM on July 19 [3 favorites]


A Twitter wag described a worldwide outage that doesn't impact Teams or Outlook as like a snowfall that doesn't stick and isn't enough to cause a school cancellation or delay.
posted by delfin at 6:06 AM on July 19 [3 favorites]


I thank the baby Jesus that none of my clients use Crowdstrike. It does seem like there is a short window between the system booting and the agent crashing the kernel, though, so at least some systems may come good on their own. If not, I hope your remote access solution works in Safe Mode with Networking, otherwise you're gonna have a fun time talking people through deleting the update file. Either that or spending a lot of money on shipping and/or travel this week if you have a lot of remote users.
posted by wierdo at 6:07 AM on July 19


One of the mods at r/crowdstrike, which is run by the company, has stickied this post with instructions for the workaround to the top of the sub:

Edit 11:27 PM PT:
CrowdStrike Engineering has identified a content deployment related to this issue and reverted those changes.

Workaround Steps:
Boot Windows into Safe Mode or the Windows Recovery Environment
Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
Locate the file matching “C-00000291*.sys”, and delete it.
Boot the host normally.


So, do folks have to do this for every single affected computer, one-by-one? Yow.
posted by mediareport at 6:07 AM on July 19 [4 favorites]


I speak as somebody unaware of what CrowdStrike was until today.

a-and as an extra bonus, today you know who's not using Crowdstrike, too!
posted by chavenet at 6:08 AM on July 19 [1 favorite]


The view of the global crisis from Evesham is delightful, reminiscent of those century-old headlines TITANIC SINKS: FATE OF LOCAL MAN UNKNOWN. I feel enlightened. Wordshore, please take (and post on Bluesky) a picture of that sign at your tea-room: it is one for the ages.
posted by Hogshead at 6:08 AM on July 19 [4 favorites]


The company I work for ALMOST switched to CrowdStrike last year; we're all very glad they didn't.

And Nthing the questions about WTF happened to their process for deployment, and MS' code review for that matter, that an insta-BSOD update got through? Like seriously, HOW did a zillion-dollar security company decide it was totally cool to push a completely untested change of any sort?

I don't care if it's just supposed to be a change to alter the background color of a button from 4E21F7 to 4E21F8, you'd think there'd be a requirement for some pretty serious testing prior to rollout.

Yes, some poor programmer was the person who made the mistake and hit commit, but if your organization is structured in such a way that it's possible for a problem this huge, and obvious, to be rolled out, then it indicates deep, possibly intractable, problems with their technological culture, standards, procedures, and general approach and attitude.

A lot of people will be canceling Crowdstrike, not because of this one specific error but because of what it indicates about the underlying culture at Crowdstrike and either their ineptitude or their unwillingness to actually set things up correctly.
posted by sotonohito at 6:12 AM on July 19 [7 favorites]


So, do folks have to do this for every single affected computer, one-by-one? Yow.

Yup, or nuke it and restore a backup from prior to the update. For virtualised systems there's some additional options where you can remove the files from the disk without booting it, but for desktop PCs? I'm pouring one out for all the helldesk operatives trying to walk the vast swathes of non-technical remote workers through that repair for their laptop over the phone. Hell, even in-house desktop PCs are in for a loooong weekend cleaning this mess up.
posted by Absolutely No You-Know-What at 6:13 AM on July 19


Saying this is a Microsoft problem is broadly the same as reporting that a chain of gas stations who suddenly started filling your tank with sand was a Ford/GM problem.

… if there were only three brands of cars in the world, the gas station specialized in filling only one of them, and it came to your house and did it automatically.

Are Microsoft servers wildly and visibly affected? Yes. Is anything that is not a Microsoft product affected? No. That makes this a Microsoft problem whether they caused it or not.
posted by Tell Me No Lies at 6:14 AM on July 19 [2 favorites]


I don’t relish the thought of an unplanned round-trip drive from Ottawa to Toronto.

Well at least you don't have to drive to Iceland.
posted by terrapin at 6:20 AM on July 19 [4 favorites]


Worse for those helldesk types: if the company had BitLocker enabled you need to not only try to talk someone non-technical through safe mode booting but ALSO try to get them to enter the 48-digit BitLocker recovery key. And that's assuming you have access to it, which you might not.

BitLocker is a great idea right up until it breaks, at which point you realize everything it was securing is totally inaccessible if anything goes wrong. Which is GOOD from a certain point of view - if everything is backed up in the cloud you want your local machine to lose the data rather than let it get out - but it's a pain in the ass regardless.
posted by sotonohito at 6:21 AM on July 19 [3 favorites]


whoever let the Crowdstrike CEO go on Good Morning America looking like hungover Tintin without a full explanation or an apology and with a frog in his throat should probably evaluate their options as well.

"go ahead ... take a drink of water"
posted by chavenet at 6:22 AM on July 19 [1 favorite]


whatever happened to IT departments managing big fleets only rolling out software updates after testing them out in a small lab environment? absolutely amazing to me that they are doing automatic updating of third party software for the entire infrastructure. this is especially driving me nuts because my in-laws are staying with me and are supposed to fly out today finally but now allegiant air doesn’t even have a website and nobody knows if the flight is cancelled or what
posted by dis_integration at 6:22 AM on July 19 [1 favorite]


Crowdstrike's blog post - linked above - says that the issue has been isolated and that a "fix has been deployed". Is there anybody who would like to have a go at explaining how successful they are going to be in deploying that fix - or at explaining the likely scale of the event (the initial reports were from the UK)? I speak as somebody unaware of what CrowdStrike was until today.


"A fix has been deployed" is of... limited help in the near term. IT does mean "this won't keep going indefinitely" and "a fix is available." Also, it means any computers that weren't running overnight are fine, because they'll download a valid update. However, any computers that were updated overnight are screwed. My wife actually ran into this in the morning: SHe and all her colleagues who had their laptops in bags from the office overnight are fine; those who did WFH and had their laptops plugged in overnight are SOL.

The issue is that the problematic update causes systems to crash, almost instantly after starting up. There is a fix process but it's relatively unwieldy and executing it requires modest-but-real technical savvy on the part of somebody with physical access. There is simply no way to fix this remotely. So if you're a remote worker, you either need significant technical awareness to execute the fix instructions, or you need to get your laptop to a technician, physically, who then must do a relatively time-consuming and tedious series of operations to restore it. And there's only so many of those folks to go around.

This is gonna suck.
posted by Tomorrowful at 6:23 AM on July 19 [2 favorites]



Worse for those helldesk types: if the company had BitLocker enabled you need to not only try to talk someone non-technical through safe mode booting but ALSO try to get them to enter the 48-digit BitLocker recovery key. And that's assuming you have access to it, which you might not.


Oh boy, I didn't even consider bitlocker. Just call them into the office and re-image the laptops from scratch, I think. Or buy new laptops, might be cheaper. Or today might be a good day to find a new job chopping wood somewhere in Alaska.
posted by Absolutely No You-Know-What at 6:23 AM on July 19 [2 favorites]


The local tea room has a handwritten sign saying "Sorry cash only because computers".

I like that Wordshore, in full Wordshorocity, has identified the cruelest possible future — being denied tea rooms in a cyberpunk dystopia.
posted by GenjiandProust at 6:25 AM on July 19 [1 favorite]


There’s a good argument to be made now that CrowdStrike has CAUSED more outages than it has PREVENTED. That kind of calls into question the WHOLE FUCKING POINT of the software in the first place.
posted by 1970s Antihero at 6:25 AM on July 19 [11 favorites]


So I can take Monday off, yeah?
posted by pompomtom at 6:29 AM on July 19 [3 favorites]


So happy I signed our business up for Bitdefender instead.
posted by pipeski at 6:30 AM on July 19


I would like to see CrowdStrike executives in a horse-drawn cage driven before an angry mob, pelted with offal and feces. Then they can fail up to director positions at KPMG, recommending their ex-competitors' programs.
posted by seanmpuckett at 6:31 AM on July 19 [1 favorite]


crowdstrike stock down 20% and falling on market open. see, qa and automated testing is in fact very valuable to your business, it turns out
posted by dis_integration at 6:34 AM on July 19 [6 favorites]


The upside of this from an IT perspective is that your boss knows that half the world has gone down, therefore it's not your fault.

It's like the old Keynes quote about the advantages of failing conventionally.
posted by clawsoon at 6:34 AM on July 19 [3 favorites]


I have Crowdstrike Falcon on my work laptop…..guess I’m going to find out in a few minutes…….

*Pours one out for Crowdstrike marketing teams a few weeks out from Black Hat / Defcon.*

Can’t imagine being on the Crowdstrike booth this year at BH will be fun….they better have some absolute grade A swag being handed out like candy. I remember after the Edward Snowden data leak the vendor across from Booz Allen at Black Hat that year set up a full life size Edward Snowden across the aisle from them…..just smirking at them. I expect some similar trolling of Crowdstrike this year…..
posted by inflatablekiwi at 6:34 AM on July 19 [4 favorites]


Tired: Five Nines Uptime
Wired: Nine-to-five downtime
posted by thecaddy at 6:35 AM on July 19 [6 favorites]


Dammit, my work's Microsoft 365 is not affected so I have to do work

Thank you sweet baby Jesus for Linux on my home laptop tho
posted by Kitteh at 6:37 AM on July 19 [2 favorites]


Oh, and it seems that by the time I woke up this morning most of the affected airlines were flying again, so that's good. Props to the folks who were able to get their endpoints working in the 10-second window they had.

As an aside, this situation further reinforces my belief that companies that dump their mainframe/midrange systems for Windows are short-sighted dumbasses. I'll take dumb terminals connected with Twinax for my mission critical shit, thanks. Use Windows or whatever for the stuff that's just annoying if it breaks or only a problem in the long term. If your business shuts down entirely without it, cheaping out is stupid.

"Oh, but what about all the redundancy we built in," people ask. First, you can do that with the old systems. Second, how much good did that do you today? Oh yeah, fuck all. I want my bank and my airline and my water system and electric power provider running on shit that isn't Internet first. Serial lines are perfectly fucking fine, thanks.
posted by wierdo at 6:38 AM on July 19


Saying this is a Microsoft problem is broadly the same as reporting that a chain of gas stations who suddenly started filling your tank with sand was a Ford/GM problem.

There is a lot of bad reporting calling this a Microsoft problem, for sure, but it's being exacerbated by the fact that there was also a huge, apparently unrelated, Azure outage yesterday that seems to have been mostly fixed by now, but that took down Teams and OneDrive and some other MS cloud services for large numbers of customers yesterday and into this morning (US time).

So yes, the current appropriate headline is not "Giant Microsoft Screwup Causes Massive Business Disruption" but only because somebody else managed an even bigger screwup a few hours later.
posted by The Bellman at 6:40 AM on July 19 [2 favorites]


whatever happened to IT departments managing big fleets only rolling out software updates after testing them out in a small lab environment?

Short-sighted and tight-fisted executives who only see 'line goes up' departments as valuable. And good marketing for 'our service does it better in the cloud!' And, if I'm being fair, ossified IT departments who lived by 'computer says no' when asked to ever change anything ever.
posted by Absolutely No You-Know-What at 6:41 AM on July 19 [3 favorites]


I'm on vacation (and not traveling), so this is just amusing to me.

But I feel bad for any underlings who are fired over this. The company should have a system that prevents the rollout of anything that hasn't been thoroughly tested to prevent disasters like this, and that should be verified from the top.
posted by pracowity at 6:45 AM on July 19 [1 favorite]


Good times over in the crowdstrike subreddit. Here’s the pinned thread. Even if you’re not technical, it will give you an idea of the huge scale of this and there is lots of gallows humor. Sample post: “this is what Y2K wishes it was”. Plenty of anecdotes about entire businesses down, unable to buy groceries in Australia even with cash, etc.
posted by caviar2d2 at 6:46 AM on July 19 [2 favorites]


Saying this is a Microsoft problem is broadly the same as reporting that a chain of gas stations who suddenly started filling your tank with sand was a Ford/GM problem.

What if Ford/GM endorsed or mandated that gas station?

This is the way of the software world now, innit? Outsource your security, data management, etc, and when that supplier fucks up, shrug and point. We allegedly had our data leaked when one fund's 3rd party provider screwed up; we got a "gosh, sorry" letter and a free 2-year sign-up to a credit-monitoring service. Wow.

I feel the same about security outsourcing as i do about AI: the risk and responsibility must remain with the client-facing entity. If Microsoft used Crowdstrike, and Crowdstrike fucked me up... Microsoft is on the hook, and they can't hide behind Crowdstrike.

But on the other hand, sloppiness is the price of advances coming fast and cheap. Would we be willing to pay the extra costs of a safer, better-managed IT system?

In my software career, I was there overnight for one or two bad rollouts. It happens. But it was one big but non-critical website. I've also been part of teams where we planned and worked tirelessly for safe, well-managed rollouts of more critical infrastructure... and they were successful.
posted by Artful Codger at 6:47 AM on July 19


What really boggles my mind about this is that it appears to affect 100% of the updated systems. It's one thing for it to be triggered by a specific configuration that might be overlooked in testing, but how did it not get noticed immediately?

I'd love to know how many hours this was made available...
posted by rambling wanderlust at 6:48 AM on July 19 [1 favorite]


John Scalzi: Great, this Crowdstrike thing means Mac users are going to be even more intolerably smug than they already are, he typed, from his exquisite Space Black Mac Pro 16 inch laptop, which accesses the Internet perfectly

While true, that doesn't mean Mac users will be able to do a lot of things like access their bank accounts, book tickets, pay for lunch, etc. I just heard from my daughter that they can't access their Chase account because of this. Quite possibly their debit cards won't be able to work.
posted by Thorzdad at 6:50 AM on July 19 [1 favorite]


CEOs who put all this culture in place for one engineer and his low-level manager to fuck up this much will be sure to pay NOTHING.

I would not bet on that at all. This is a PR disaster, and investors will expect some C-suite heads to roll because of that.
posted by Tell Me No Lies at 6:51 AM on July 19


ceo might get fired, but 911 is down in many places, which means this software update bug is going to probably result in someone dying and really the consequences should be greater than that, something i’m ok with as a software engineer
posted by dis_integration at 6:54 AM on July 19 [1 favorite]


The sneaky Russian plan to sit back and watch America crumble on its own is going well I see.
posted by Space Coyote at 6:58 AM on July 19


Lots of people on that Reddit thread wondering if this is the biggest cybersecurity event ever. Too early to tell, but imagine a threat that can knock a machine out totally (from Reddit .. “Safe mode? Bitlocker says fuck you!”). Now imagine instead of it slowly spreading to outdated or unpatched systems, it has an express lane to hit every Windows machine in the world at once. Basically around 300 of the Fortune 500 uses Crowdstrike and has IT policies and automation in place that mandated and forced this to happen to all their Windows machines all over the world, including Windows servers in the cloud . Yikes.
posted by caviar2d2 at 6:59 AM on July 19 [1 favorite]


By the by, if you’ve ever heard myself or some other networking expert say “if you understand the underpinnings, you know that there’s no way that the Internet can actually work. It’s way too complicated”, this is the sort of thing we have in mind.

The miracle is that it doesn’t happen once a week.
posted by Tell Me No Lies at 7:01 AM on July 19 [4 favorites]


whatever happened to IT departments managing big fleets only rolling out software updates after testing them out in a small lab environment?

Don't quote me, but I believe that the corporate IT zeitgeist over the past few years has switched from being most afraid of internal rollout failures to being most afraid of external hacks.

Not without reason, of course - seems like lately every big company has been getting hacked one way or another. Zero-day stuff gets taken advantage of very quickly, so you have to be ready to roll out fixes immediately rather than waiting on lab testing.

And now you've got multiple paths into your network with the explosion of work-from-home.

A couple of big security companies have responded to all that with very attractive sales pitches of, "Just install our product on all your machines and all your employees' WFH machines, and you'll never have to think too hard about security ever again."

Security audit firms are happy to chime in and tell you that buying one of these products is the easiest way to pass the security audit, so put it all together and it becomes a no-brainer for CTOs.
posted by clawsoon at 7:02 AM on July 19 [3 favorites]


whoever let the Crowdstrike CEO go on Good Morning America looking like hungover Tintin without a full explanation or an apology and with a frog in his throat should probably evaluate their options as well

(That interview on Youtube) - the lack of preparedness extends to the knowledge of the presenters too, however. Rather than have an interviewer preface their questions with "it's all magic to me but..." they needed somebody sufficiently knowledgeable to kick off with "Does your company test its software before it deploys it?"
posted by rongorongo at 7:04 AM on July 19 [4 favorites]


whatever happened to IT departments managing big fleets only rolling out software updates after testing them out in a small lab environment?

Yeah, I'm deep in vulnerability review at the moment, and one that's kind of biting us on the ass is due to some downrev libraries in a vendor's firewall product, but the last three updates we've received from them have caused our lab machines to seize up after about 15 minutes and go offline.

I'm torn on whether I'd rather just have had the whole dev environment shit the bed three times and cleaned it up (so everybody in the company would be like "yeah those were shit updates and the vendor should feel bad") versus explaining to an auditor that the reason we currently have downrev libraries is that we didn't let the update shit the bed.
posted by Kyol at 7:05 AM on July 19 [2 favorites]


I sure do wonder how a lot of places are feeling about laying off a bunch of infosec people this year.

I sure do.
posted by humbug at 7:08 AM on July 19


the execs and CEOs who put all this culture in place for one engineer and his low-level manager to fuck up this much will be sure to pay NOTHING

Surely they will be handsomely, I mean richly, rewarded for eventually patching up the damage?

Emergency Cobra meeting held

People talk about Cobra as though it were magic, but I’ve been told it just means ‘Cabinet Office Briefing Room A’.
posted by Phanx at 7:14 AM on July 19


Another way to gauge the severity:

* Go to downdetector.com, a well-known crowdsourced app/site status page.
* Look at the widgets from major companies on the home page.
* Note that almost all the outage graphs are hockey sticks.
posted by caviar2d2 at 7:14 AM on July 19 [2 favorites]


but 911 is down in many places, which means this software update bug is going to probably result in someone dying

Wow!

Outages in Australia have included trains (in Victoria), banks, supermarkets and shops, but 000 (fire/ambulance/police) has not been affected, and neither have telecommunications.

(We did have a high-profile telecommunications outage earlier this year which did affect 000, but that was the phone company's fault, and only affected customers who were with one specific phone company.)
posted by chariot pulled by cassowaries at 7:19 AM on July 19


Security audit firms are happy to chime in and tell you that buying one of these products is the easiest way to pass the security audit, so put it all together and it becomes a no-brainer for CTOs.

There are ways to square that circle: automate your routine security updates for e.g. desktops, but put most machines on a time delay, with a subset of guinea pigs/canaries, so that there's a window to halt the wider rollout if there's an upstream cockup like this - or override and push early if the risk of the zero-day is greater than the risk of a bad update (sometimes you do get both; a Windows update for a zero-day that also nuked user profile files for some springs to mind).

For critical infrastructure, you stage patching and have a rollback plan, including hot spares and disaster recovery, and infosec professionals testing fast. The window between patch release and rollout does have to be far shorter these days, but it doesn't have to be zero for everything and everyone.

But of course, it requires adaptability, foresight, and paying for capable infosec professionals rather than outsourcing the whole kaboodle and decision chain to a relatively cheap provider. Such IT people should be highly trained, skilled and flexible, and thus usually not cheap, so paying for that is anathema to modern management practices.
posted by Absolutely No You-Know-What at 7:21 AM on July 19
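To make that "time delay plus canary subset" idea concrete, here is a small hypothetical sketch in Python. None of the names or numbers are CrowdStrike's (or any vendor's) actual mechanism; it just illustrates ring-based gating, where most of the fleet waits long enough for someone to pull the brake.

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical rollout rings: canaries update immediately, everyone else waits.
RING_DELAYS = {
    0: timedelta(hours=0),    # guinea pigs / canaries
    1: timedelta(hours=4),    # early adopters
    2: timedelta(hours=24),   # general fleet
    3: timedelta(hours=72),   # critical infrastructure, last to move
}

@dataclass
class UpdateChannel:
    published_at: datetime
    halted: bool = False     # flipped when the canaries start blue-screening
    emergency: bool = False  # override the delays for a truly urgent zero-day

def should_apply(channel: UpdateChannel, host_ring: int,
                 now: datetime | None = None) -> bool:
    """Decide whether a host in the given ring should take the update now."""
    if channel.halted:
        return False
    if channel.emergency:
        return True
    now = now or datetime.now(timezone.utc)
    return now - channel.published_at >= RING_DELAYS[host_ring]

if __name__ == "__main__":
    update = UpdateChannel(published_at=datetime.now(timezone.utc) - timedelta(hours=5))
    for ring in RING_DELAYS:
        print(f"ring {ring}: apply now? {should_apply(update, ring)}")

The point of the delay for the later rings is simply that it gives whoever is watching the canaries a chance to set halted before a bad push reaches everything.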

