The untold story of QF72:
June 4, 2019 4:21 PM Subscribe
In 2008, rogue automation caused an Airbus A330 to fall out of the sky, a harbinger of the 737MAX debacle a decade later. An uncommanded nose dive on the A330 caused a mass casualty incident, injuring 9 of the 12 crew and over 100 passengers, 14 seriously enough that they required life flights to Perth. An attempt to reboot the malfunctioning flight control computer sent the plane into a second nose dive, forcing the pilot to fly the crippled aircraft by hand to the nearest airport with multiple systems disabled. Initial reporting blamed clear air turbulence and reminded passengers to keep their seat belts on for safety, but it soon became clear that something terrible had gone wrong with the plane's automation.
Interview excerpts from the documentary describe the extent of the injuries suffered by the passengers and how the pilot managed to restore control of the aircraft - by counter-intuitively letting go of the stick and giving up control.
Wikipedia link.
ATSB report.
Re the Wikipedia page's explanation of how the AOA units work: I'm sure many alternate designs were discussed and that there's a reason I'm not privy to that explains why they went with the design they did, but if I'm understanding it right, it sounds like, even though the output of all three units was considered to determine the proper AOA reading, units 1 and 2 were primaries and that 3 was a sort of hot spare, since a persistent malfunction of either unit 1 or unit 2 was sufficient to lead the flight computer to believe that a corrupt AOA reading was the correct value. I don't really understand why they wouldn't have equal voting rights to prevent exactly this sort of situation, since you need an odd number of participants to prevent ties. I understand that "voting" gets a little complicated when you're dealing with three redundant systems that are outputting continuous values as opposed to a fixed set of labels, but it seems like my intuition that you'd average the two closest values and reject the outlier is not entirely contrary to the logic they were using.
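(For illustration only, and certainly not the actual ADIRU voting logic: the "average the two closest values and reject the outlier" intuition might look something like this sketch in Python.)

# Hypothetical three-way sensor vote: drop the reading farthest from the
# other two, then average the pair that agree. Not Airbus's algorithm,
# just the intuition described above.
def vote_aoa(readings):
    a, b, c = readings
    distances = {
        0: abs(a - b) + abs(a - c),
        1: abs(b - a) + abs(b - c),
        2: abs(c - a) + abs(c - b),
    }
    outlier = max(distances, key=distances.get)
    kept = [value for i, value in enumerate(readings) if i != outlier]
    return sum(kept) / 2

# One rogue unit spiking to 50.8 degrees is outvoted by the two that agree:
print(vote_aoa([2.1, 2.3, 50.8]))  # -> 2.2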
posted by invitapriore at 5:44 PM on June 4, 2019 [3 favorites]
I’m a pilot and I work in software so this is very very interesting to me. Unfortunately the article is very light on technical detail, I need to read the full report. (Yes one rogue angle of attack sensor should not be able to outvote the other two!)
posted by phliar at 6:13 PM on June 4, 2019 [4 favorites]
Turn pilots into sysops, decoupled from direct authority over the aircraft's controls, and software errors will crash the plane too. This is an advantage that Boeing used to have over Airbus from a design-philosophy standpoint, but they moved away from it during the rushed design of the 737Max.
There's probably also a lesson here for designers of self-driving cars, but I digress.
posted by killdevil at 6:19 PM on June 4, 2019 [13 favorites]
A related story to this is that Fuzzy Maiava (the flight attendant who suffered permanent psychological and physical injury and is unable to work as a result of this incident) was screwed out of compensation by Airbus and Northrop Grumman.
posted by L.P. Hatecraft at 6:57 PM on June 4, 2019 [5 favorites]
I'm interested to hear how an automated train that operated fine for 25 years suddenly decided to go the wrong direction and hit the end of the track.
posted by ctmf at 6:58 PM on June 4, 2019 [1 favorite]
There's probably also a lesson here for designers of self-driving cars, but I digress.
When an Airbus airliner says to the pilot: "You take over, I can't deal", the pilot has several minutes to finish his sudoku, reacquaint himself with the current position and state of the aircraft, and take control.
When a self driving car says it, the driver is a split second away from the next major hazard. It's apples and oranges: the only way for a self driving car to hand control to a human is to come to a full stop first, and alert all nearby self driving cars that it is doing so.
posted by ocschwar at 7:21 PM on June 4, 2019 [5 favorites]
The article I'm looking forward to reading will be about how the French government bailed out Airbus by funding research into software verification. A French national research institute, INRIA, has been pushing the state of the art in that field, to the point that Grenoble may wind up being the nucleus of the next Silicon Valley. Go to a top notch programming language theory conference these days, and you will be overhearing a lot of French. And Airbus's mistakes were a driving force behind it.
posted by ocschwar at 8:22 PM on June 4, 2019 [16 favorites]
So anyway, next year for self driving cars is it? AI the blockchain folks!
posted by GallonOfAlan at 11:58 PM on June 4, 2019 [1 favorite]
The bit-flip angle has been in the back of my mind for a while. Years ago, I worked on (what was then considered) a large disk-based data storage device, and while it wasn't rad-hardened, it did include elements to resist bitflips. It had ECC memory that could correct single-bit errors, and detect double-bit errors (which would trigger an emergency shutdown & reboot). In particular, I remember it had a "Memory Scrub" task, the lowest-priority task in the system, that would just endlessly loop, reading every byte in memory. The values it read were discarded - the only point was to try to trigger the ECC to correct any latent single-bit errors before they mutated to double-bit or worse.
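(A toy sketch of what such a scrub task amounts to - a byte buffer standing in for physical RAM, and definitely not the original code:)

import time

def memory_scrub(memory, chunk=4096):
    # Lowest-priority background loop: read every location and discard the
    # value, purely so the ECC hardware gets a chance to notice and correct
    # a latent single-bit error before a second flip lands next to it.
    while True:
        for offset in range(0, len(memory), chunk):
            _ = memory[offset:offset + chunk]  # read and throw away
        time.sleep(0.01)  # yield so real work always preempts the scrubber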
Anyhow, since then, whenever I hear about self-driving, AI, deep learning... I always wonder, are they guarding against stray radiation flipping bits?
posted by Rat Spatula at 5:51 AM on June 5, 2019 [2 favorites]
The neural net weights at least will shrug off corrupted bits from flying space neutrons, by their nature. If they use SRAM-based FPGAs for anything, the upset rate is quite high. I'm guessing engineers know about this stuff by now, given how many doohickeys are in modern autos.
posted by RobotVoodooPower at 6:38 AM on June 5, 2019
When an Airbus airliner says to the pilot: "You take over, I can't deal", the pilot has several minutes to finish his sudoku, reacquaint himself with the current position and state of the aircraft, and take control.
That depends on where in the flight they are, doesn't it?
posted by TedW at 7:05 AM on June 5, 2019 [1 favorite]
Interesting non-technical read. The key issue seems to be this:
"The bottom line is that automation of the computer codes and the algorithms are designed by people, which is what they are actually being designed to protect against ... People make mistakes and that is never going to change. There needs to be more understanding of who is designing these things and what processes are in place."
"People make mistakes" applies both to pilots and to the designers of the flight control setup. I would guess that integrated over the huge number of flights every day, pilots make many, many more errors, but those errors have far less catastrophic consequences – partly because the flight control systems can pick up after them. Whereas flight control system mistakes are rarer but more catastrophic.
What was it, back from USENET days? To err is human - it takes a computer to really fuck it up?
posted by RedOrGreen at 10:43 AM on June 5, 2019 [2 favorites]
I was helping to prepare a software release when, as a final step before mastering the CD, we ran checksums on all of the files. One file came up with a bad checksum. Running the checksum again, it came up correct. Ran it hundreds of times and the answer was randomly either correct or a particular incorrect value. On further investigation it was a single bit in the file that was flipping.
So I tried copying the file, figuring there was something marginal in that bit of the disc. And the funny thing was it kept happening in the copy. It did not matter where on the disc the file was, that particular bit would flip randomly. I looked again at the file and the flipping bit was after a long string that happened to be 0101010101010. My conclusion was that there was a marginal circuit, cable or connector somewhere, and that data from the disc was being encoded using NRZI encoding, which flips every other bit in an attempt to prevent long strings of no-change regions. But in this case it creates them, and as a result the data stream can become unsynchronized.
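(For the curious, a toy sketch of NRZI as it's usually described - a 1 bit toggles the line level, a 0 bit holds it - which is why a long transition-free run can let a marginal receiver's clock drift out of sync:)

def nrzi_encode(bits):
    # NRZI: a 1 toggles the output level, a 0 leaves it alone. Long runs
    # without transitions give the receiver nothing to recover its clock from.
    level = 0
    out = []
    for bit in bits:
        if bit:
            level ^= 1
        out.append(level)
    return out

print(nrzi_encode([1, 0, 1, 0, 1, 0]))  # transitions every other bit
print(nrzi_encode([0, 0, 0, 0, 0, 0]))  # flat line: nothing to sync on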
They always say it's never the hardware. But sometimes it really is the hardware.
posted by sjswitzer at 10:43 AM on June 5, 2019 [10 favorites]
We were long told that an overloaded computer nearly caused the Apollo 11 landing to be aborted. Maybe. Or maybe what we 'learned' changed in 2005 ... 36 years after the incident.
"Software engineer Don Eyles concluded in a 2005 Guidance and Control Conference paper that the problem was due to a hardware design bug....", as the result of "an electrical phasing mismatch" which caused "spurious cycle stealing...."
Source of quote. Eyles paper.
Gremlins have been around for a long time!
posted by Twang at 4:51 PM on June 5, 2019 [1 favorite]
"Software engineer Don Eyles concluded in a 2005 Guidance and Control Conference paper that the problem was due to a hardware design bug....", as the result of "an electrical phasing mismatch" which caused "spurious cycle stealing...."
Source of quote. Eyles paper.
Gremlins have been around for a long time!
posted by Twang at 4:51 PM on June 5, 2019 [1 favorite]
Or as we say in our office - "one computer talking to another computer - what could possibly go wrong?"
Part of the problem is that if there is bad data/info - it gets transmitted through all the interconnected systems. An example right at the WTF end of the spectrum - the wrong date of birth was put on the hospital certificate at the time the baby was born (major stressful birth - both mother and child at risk - more than 24 hours of surgery - etc - so no-one paying much attention to paperwork). The baby is now 5 years old and the parents have a filing cabinet full of documentation trying to keep up with/correct the error.
posted by Barbara Spitzer at 6:23 PM on June 5, 2019 [1 favorite]
I've written production test code for computing hardware, and it's harder than you might think. Take memory tests. You might think that if you write all ones to all memory locations and then all zeros, you'll find any stuck bits, lines or chips - and you will. But there are other error conditions that will only come out if you write alternating ones and zeros, or alternating pages of ones and zeros, or if you hit a particular magic boundary when this or that chip is out of spec.
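(A toy version of that kind of pattern testing - solid zeros, solid ones, then a 0x55/0xAA checkerboard - just to show why one pattern isn't enough. The write/read callbacks are stand-ins for whatever bus access a real tester would use.)

def run_memory_test(write, read, size):
    # Each pattern targets a different fault class: the solid patterns catch
    # stuck bits and lines, while the checkerboard exercises coupling between
    # adjacent cells that solid patterns never touch.
    patterns = [
        [0x00] * size,
        [0xFF] * size,
        [0x55 if addr % 2 == 0 else 0xAA for addr in range(size)],
    ]
    for pattern in patterns:
        for addr, value in enumerate(pattern):
            write(addr, value)
        for addr, value in enumerate(pattern):
            if read(addr) != value:
                return "fault at address {:#x}".format(addr)
    return "pass"

# Exercising it against a plain list standing in for RAM:
ram = [0] * 1024
print(run_memory_test(lambda a, v: ram.__setitem__(a, v), lambda a: ram[a], len(ram)))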
You end up, as always, making tradeoffs. If you're making 100,000 home computers and you spend an extra hour per unit in testing looking for a one-in-ten-million error, it's probably not worth it. The cost of replacing units that fail in the field is a lot less than the cost of doing the testing. (However, 25 percent DOA will kill you, as it did for one company I worked for.)
If you're building avionics, your equations are different. You spend much longer in testing, you build in redundancy, and you keep a much closer eye on failure stats.
The famous Apollo 11 1201 alarm on descent was... a checklist failure. It didn't cause a mission abort or worse because of many factors, including a very robust executive code architecture, a very great deal of pre-flight simulation and testing, and a mission ops culture that emphasized personal responsibility and trust. Technology and people both will fail, and you design not just your product but your organisation for that in the knowledge that the design process itself is fallible.
posted by Devonian at 5:56 AM on June 6, 2019 [3 favorites]
The neural net weights at least will shrug off corrupted bits from flying space neutrons, by their nature.
I'm curious about this, can anyone provide a reference?
posted by Rat Spatula at 7:51 PM on June 6, 2019
posted by q*ben at 5:21 PM on June 4, 2019