Troubleshooting a spacecraft nine hours away, as the photon flies.
May 21, 2018 4:20 PM

"We’ve lost contact with the spacecraft." Unintentional loss of contact with Earth should never happen to any spacecraft. It had never before happened to New Horizons over the entire nine-year flight from Earth to Pluto. How could this be happening now, just 10 days out from Pluto?
posted by bitmage (37 comments total) 35 users marked this as a favorite

It was the Mi-Go, wasn’t it?
posted by GenjiandProust at 4:28 PM on May 21, 2018 [16 favorites]

Puts all of my IT war stories into perspective. Great read.
posted by clawsoon at 5:06 PM on May 21, 2018 [3 favorites]

Aliums.
posted by BaffledWaffle at 5:19 PM on May 21, 2018 [1 favorite]

Pluto took revenge for its demotion.
posted by dances_with_sneetches at 5:39 PM on May 21, 2018 [6 favorites]

Rogue Octopodes.
posted by Homo neanderthalensis at 5:39 PM on May 21, 2018 [6 favorites]

Goddamned Thargoids.
posted by delfin at 5:40 PM on May 21, 2018 [2 favorites]

They're using our satellites against us

I would love to see some footage of those few days.
posted by Query at 5:49 PM on May 21, 2018

I can't find any details of the "hard-to-detect timing flaw" so I imagine the details are either too boring or too embarrassing to disseminate.
posted by RobotVoodooPower at 5:51 PM on May 21, 2018 [1 favorite]

This is good, with the caveat that "architect" is a shitty verb.
posted by killdevil at 6:00 PM on May 21, 2018 [3 favorites]

Wow, that brought tears to my eyes.

I think I've been spending too much time in the politics threads: a real-life story of competent, prepared people doing their best in an emergency situation to bring good science to the world is like a freaking miracle.
posted by medusa at 6:03 PM on May 21, 2018 [31 favorites]

Near Pluto, you say? Obviously it's interference from the Mass Relay.
posted by scaryblackdeath at 7:04 PM on May 21, 2018 [6 favorites]

If I had to guess, it sounds like an out-of-memory error. I can’t imagine the spacecraft would run an OOM killer.
posted by advicepig at 7:13 PM on May 21, 2018

It’s every IT professional’s dream, isn’t it, to be on-call for a mission that involves a target system millions of miles away that exists solely to advance the state of human knowledge rather than to maintain a payment infrastructure that exists solely to make some yacht-aspiring fuckface another few million dollars. I’ve always been a pretty poor technical firefighter on account of just not giving much of a fuck about the “mission,” but I’d show up for a problem like this. Please disregard this comment if you’re a poster on the Jobs subsite going over my comment history here, I have many other strengths.
posted by invitapriore at 7:17 PM on May 21, 2018 [51 favorites]

To be pedantic, they did not "lose control"; the ship's computer crashed. The backup was a slower box and took an hour to get back up. Then they were bumping up against an astronomical deadline. Pretty amazing story, and it gives background to the intense photos of the team that were in the media at the time.
posted by sammyo at 7:17 PM on May 21, 2018

Klingon Bird of Prey.
posted by entropicamericana at 7:53 PM on May 21, 2018 [2 favorites]

I always sorta wonder how much permanent damage the long-term systems end up taking. I mean, it's a very small sample pool so it's not like a hundred node Hadoop cluster with three thousand spinning disks (and you have hardware failures every other day to deal with), but over time in the environment they're in how much damage is there that they have to work around?
posted by Kyol at 7:54 PM on May 21, 2018 [1 favorite]

There is a little cryostat (tin can for keeping things cold) that sits in one of the clean rooms I work in. It has a little label on it that says 'New Horizons'. I'm going to think of this story every time I look at it from now on.
posted by runcibleshaw at 7:55 PM on May 21, 2018

Your circuit's dead, there's something wrong...

Also, what are the odds that the Mission Operations Manager (MOM) would be named Bowman?
posted by Halloween Jack at 8:15 PM on May 21, 2018 [6 favorites]

What happened is that they never sent a satellite; it was all fake, and when they reached the moment when they had to answer the "OK guys, what's happening and what are you seeing?" question, they had no other choice but to say "Oops, we lost it." End of story.
posted by CRESTA at 9:08 PM on May 21, 2018 [1 favorite]

The spacecraft obviously reached the limit of the simulation.
posted by um at 3:13 AM on May 22, 2018 [6 favorites]

The Register story from two days after the initial incident:

"To prepare for the final days of its mission, the probe was doing two things at once. First, it was taking the scientific data it has already harvested, compressing it, and writing it to a portion of its 128GBit storage (two 8GB solid-state recorders). At the same time the instrument command sequence for the flyby was being uploaded.

"The combined workload slightly exceeded the processor's capabilities, and triggered a watchdog feature designed to prevent the spacecraft's software from crashing. This watchdog switched the main computer system over to the backup computer, while putting the main system into sleep mode as a safety measure.

"The processor is a Synova Mongoose-V: a 12MHz MIPS R3000 CPU hardened against radiation. The R3000 is a 32-bit chip that's pretty similar to the one used in the original 1994-era Sony PlayStation among many other devices."

posted by kersplunk at 4:35 AM on May 22, 2018 [6 favorites]

The spacecraft obviously reached the limit of the simulation.

No, that was Voyager 1.
posted by Johnny Assay at 4:42 AM on May 22, 2018 [1 favorite]

The Reg story certainly makes it sound less of a skin-of-the-teeth event, if they'd already got contingency plans that came close enough to the problem.

As for the original problem, I guess it was that the watchdog process wasn't quite high enough priority compared to the upload and compression tasks, and the sleep/reboot sequence was a bit ferocious. I'm a bit surprised that a compression task could block a watchdog and I imagine that some configuration decisions were made prior to that which in retrospect may have been ill-advised, but I'd have to know more about the hardware and software architecture. Which I think is all public, so perhaps I'll go looking later and test my assumptions. 12 MHz isn't much CPU, and compression to disk will exercise a lot of bus and IO...
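
For what it's worth, here is a rough Python sketch of the failover behaviour described in the Register excerpt quoted above, using simulated ticks rather than real time. Everything in it is hypothetical: the class names, the tick costs, and the timeout are made up for illustration, and clearing the volatile files on failover is an assumption about why an uploaded sequence might not survive, not a description of the actual flight software.

```python
from dataclasses import dataclass, field


@dataclass
class Computer:
    name: str
    active: bool = False
    # Assumption: anything held here is lost when the computer is put to sleep.
    volatile_files: list = field(default_factory=list)


class Watchdog:
    """Trips if it is not kicked within `timeout_ticks` simulated ticks."""

    def __init__(self, timeout_ticks: int):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_kick = 0

    def kick(self) -> None:
        self.ticks_since_kick = 0

    def advance(self, ticks: int) -> bool:
        """Advance simulated time; return True if the watchdog has tripped."""
        self.ticks_since_kick += ticks
        return self.ticks_since_kick > self.timeout_ticks


def run_cycle(main: Computer, backup: Computer, watchdog: Watchdog, task_costs: dict) -> Computer:
    """Run one cycle of tasks on `main`; fail over to `backup` if the watchdog trips."""
    total = sum(task_costs.values())
    if watchdog.advance(total):
        # Workload overran the budget: sleep main, wipe its volatile state,
        # and hand control to the backup.
        main.active = False
        main.volatile_files.clear()
        backup.active = True
        watchdog.kick()
        return backup
    # Cycle finished in time, so the watchdog gets kicked as normal.
    watchdog.kick()
    return main


main = Computer("main", active=True, volatile_files=["flyby_sequence"])
backup = Computer("backup")
wd = Watchdog(timeout_ticks=10)

# Compression alone fits inside the (made-up) 10-tick budget.
first = run_cycle(main, backup, wd, {"compress": 6})

# Compression plus the upload slightly exceeds it, so the watchdog trips.
second = run_cycle(main, backup, wd, {"compress": 6, "upload": 5})
print(first.name, second.name, main.volatile_files)
```

The point of the toy is only that two tasks that are each fine on their own can jointly overrun the deadline by a small margin, and that the recovery path deliberately throws away in-flight state.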
posted by Devonian at 5:59 AM on May 22, 2018 [1 favorite]

I think the most surprising part to me was that the command sequence that had been submitted was both a) not easily replayable from Earth and b) not stored somewhere nonvolatile on the spacecraft? Or did the watchdog reboot kill "b" - the reason the system had a watchdog event might've been a command sequence, better fall back to the SMbus and let the ground station try again and "a" is just lesson learned that they should keep a transmittable version of the mission on hand.

So many questions.
posted by Kyol at 7:20 AM on May 22, 2018

IMO, there is no way that NASA doesn't have logs of every bit they've ever sent to or received from a spacecraft.

I'm not in spacecraft operations, but you'd be surprised how much stuff isn't saved or isn't saved someplace easily accessible. A lot of institutional knowledge is solely in the minds of people who were around at the time the thing happened that you're trying to find out about.
posted by runcibleshaw at 8:35 AM on May 22, 2018 [7 favorites]

I'm not in spacecraft operations, but you'd be surprised how much stuff isn't saved or isn't saved someplace easily accessible. A lot of institutional knowledge is solely in the minds of people who were around at the time the thing happened that you're trying to find out about.

And getting/maintaining drives to read old media is a non-trivial task.
posted by mikelieman at 8:39 AM on May 22, 2018 [4 favorites]

IMO, there is no way that NASA doesn't have logs of every bit they've ever sent to or received from a spacecraft.

NASA used to have a problem with tape archival and reuse — they misplaced a lot of data and, like the BBC with Doctor Who, reused tapes from previous missions.
posted by nathan_teske at 9:01 AM on May 22, 2018 [3 favorites]

I think the most surprising part to me was that the command sequence that had been submitted was both a) not easily replayable from Earth and b) not stored somewhere nonvolatile on the spacecraft?

IMO, there is no way that NASA doesn't have logs of every bit they've ever sent to or received from a spacecraft.

My dad's company used to make the telemetry recorders used by the Deep Space Network. Every bit sent up or down was indeed stored on site at the ground stations, at least for a while.
posted by killdevil at 9:25 AM on May 22, 2018 [1 favorite]

The R3000 is a 32-bit chip that's pretty similar to the one used in the original 1994-era Sony PlayStation among many other devices

I know seasoned programmers who were reduced to tears trying to write for the PlayStation, so that's great.
posted by lumpenprole at 12:49 PM on May 22, 2018

That was more about Sony's wacky graphics setup than the comprehensibility of the MIPS processor. All in all, MIPS isn't a bad architecture.

Working in embedded, I get pits in my stomach when a customer calls with a system problem or crash and I'm only 30 miles away from the test device.

3,000,000,000 miles and one chance to get it right? I'd go catatonic. Bravo/brava to all of these engineers for figuring it out.
posted by JoeZydeco at 1:24 PM on May 22, 2018 [1 favorite]

mikelieman: "And getting/maintaining drives to read old media is a non-trivial task."

Not to downplay this as an issue but New Horizons is only 12 years old; not quite enough time for this to be a serious problem.
posted by Mitheral at 2:55 PM on May 22, 2018

Right, it's more that I'm sorta surprised there isn't a mechanism for them to take the current state off of the NHOPS and pitch it to a bootloader or something along those lines. Or just "ok, we know we're in state x, we need to get it to state y so we can load the science program z, how do we get from x to y?"

Or was it more that, with the time crunch pushing the beginning of the sequence back, some of the inter-measurement timings and motions would need to be recalculated to ensure they'd get the measurements they expected and not a big beautiful inky black nothingness? I'll be the first to admit that I'm usually just sort of amazed that Very Smart People™ can actually precalculate how many degrees to turn at what time after a star sighting to know what they'll be looking at when the shutters fire.

On the other hand, what are the odds it would have a fault in a barely recoverable period before maximum science, and is it worth building recovery systems that meet that challenge, and what functionality would need to be cut out to make it fit?

On the gripping hand, wait, why wasn't NHOPS in exactly the same state prior to the NHOPS upload, so it would (probably) encounter the same overload scenario to warn them off the task upload while it was busy compressing data? Soooo many questions. I'd love to see the after action, lessons learned version of this story if you know what I mean.
posted by Kyol at 6:23 PM on May 22, 2018 [1 favorite]

Goofy! Put that back! That's Pluto's.
posted by Splunge at 9:19 AM on May 23, 2018

why wasn't NHOPS in exactly the same state prior to the NHOPS upload, so it would (probably) encounter the same overload scenario to warn them off the task upload while it was busy compressing data?

The spacecraft had untransmitted science data - the Pluto images - that were being compressed, so the testbeds couldn't be in the same state, and as the behaviour of compression software depends on the data being compressed it wouldn't have been possible to create an accurate synthetic data set. And when would the test have been run? It would have to have been at least nine hours before the live operations took place to have been of any use, and how do you keep a testbed synched to a future state of a spacecraft that's constantly changing? Relativity sucks.
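
To illustrate the point about compression depending on the data, here is a small Python sketch using the standard library's zlib. This is a stand-in only: nothing in the thread says which compression algorithm the spacecraft used, and output size is used here as a simple proxy for how differently the same compressor behaves on different inputs.

```python
import os
import zlib


def compressed_size(data: bytes, level: int = 6) -> int:
    """Return the number of bytes zlib produces for `data` at the given level."""
    return len(zlib.compress(data, level))


size = 64 * 1024
uniform = bytes(size)      # all zero bytes: about as compressible as data gets
noisy = os.urandom(size)   # random bytes: effectively incompressible

uniform_out = compressed_size(uniform)
noisy_out = compressed_size(noisy)

print(f"input: {size} bytes each")
print(f"uniform data compresses to {uniform_out} bytes")
print(f"random data compresses to {noisy_out} bytes")
```

Same code, same input length, wildly different results, which is why a synthetic test image set on the ground wouldn't necessarily have loaded the processor the way the real Pluto images did.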

As to why files didn't persist across the main computer reboot - if something's gone wrong and the watchdog has triggered, it's possible that whatever it was that went wrong was a software issue that would reoccur unless you set everything back to a known good, clean condition. The most important thing is to get the spacecraft back to a baseline condition where ground control can investigate and rebuild the system status to operational. As the OP says, most missions are normally in a state where they'll just carry on cruising or orbiting or sitting on a surface while the computer comes back; time crunches like this one are unusual.
posted by Devonian at 9:19 AM on May 24, 2018 [2 favorites]

Great article. I'm in the middle of reading The Martian right now, and the style of this article was similar enough that I had to scroll back up to check that it wasn't written by the same author. Which I guess means that if you really liked the article (and if you show up to this thread even later than I did), you'd probably like that book.
posted by mabelstreet at 1:41 PM on May 29, 2018

The article was excerpted from Chasing New Horizons: Inside the Epic First Mission to Pluto, which I bought and consumed over the weekend. Very highly recommended.

There's not much more on this incident - apparently, they never tested for the case of managing the upload while compressing images, because they never thought it would happen and it was an unusual coincidence - but it was clearly an oversight.

There is a very great deal on issues they did catch, and the sheer scale and number of problems you have to overcome to get a mission to fly. The politics! The dirty tricks! The dirty tricks you have to play to counter their dirty tricks! And a very satisfying amount of engineering and operational detail.

One of the things the book goes into is how they begged for Hubble time to image Pluto space during New Horizons' cruise from Jupiter, to see if there were any other moons there prior to encounter. Hubble didn't want to give up the time, so it took a while to shake that out of the tree (an experience repeated later when they needed to target KBOs), and then, of course, there were new moons revealed. Which was exciting - except that, given Hubble's resolution and the normal spread of object sizes, there was a good chance that there were even more moons - and rings - below Hubble's resolving limits. And if you do the sums on the chances of a catastrophic collision, the numbers aren't good.

Which meant evolving a huge set of contingencies on what to do if you detect danger on the way in, coming up with large numbers of alternate fly-by trajectories (each of which has to be tested and have its own contingencies) to be triggered depending on what you see and when, plus deciding to send back early images at the expense of doing more science, and so on.

If that sounds like your sort of thing (and you like hearing about how JPL can really muck stuff up) then this book's for you...
posted by Devonian at 5:50 AM on May 30, 2018 [1 favorite]

As it turned out, a decision made years earlier proved to be a life-saver during the recovery. Alan had become so concerned that the team did not have a fully complete backup to NHOPS, that a second one was built.

You just know there was some discussion around whether that was cost-effective and the "told you so" must have felt EPIC!
posted by fullerine at 7:03 AM on May 30, 2018
