Terabytes of Enron data have quietly gone missing
January 30, 2019 7:10 AM
From MuckRock: Government investigations into California’s electricity shortage, ultimately determined to be caused by intentional market manipulations and capped retail electricity prices by the now infamous Enron Corporation, resulted in terabytes of information being collected by the Federal Energy Regulatory Commission. This included several extremely large databases, some of which had nearly 200 million rows of data, including Enron’s bidding and price processes, their trading and risk management systems, emails, audio recordings, and nearly 100,000 additional documents. That information has quietly disappeared, and not even its custodians seem to know why.
... While terabytes of information has disappeared, up to 4,516 documents remain available through a pair of predefined searches of FERC’s eLibrary. While FERC claims that they, not Lockheed Martin or CACI, do offer a trio of Enron datasets on CD, FERC has not responded to repeated requests for these datasets sent over the past two months.
FERC, the Federal Energy Regulatory Commission, is a US government agency established in 1977 to oversee the country's interstate transmission and pricing of a variety of energy resources, including electricity, natural gas, and oil.
MuckRock is a non-profit, collaborative news site that mixes work by journalists, researchers, activists, and regular citizens in requesting, analyzing, and sharing government documents. MuckRock is currently offering free accounts to recently laid off journalists.
Enron 15 Years Later: Where Are They Now (2016)
Original MF post on the Enron databases
Enron and California previously: 1, 2, 3
Sample MuckRock-related previous posts/comments: 1, 2, 3
I find it ominous. So much can be gleaned from that data; removing it entirely sounds like a pretext to doing it again.
posted by ZeusHumms at 7:24 AM on January 30, 2019 [12 favorites]
The greatest trick the devil ever pulled was convincing corporations that all data had to be archived forever.
posted by Damienmce at 7:33 AM on January 30, 2019 [1 favorite]
Why would you need to keep 2TB of 17 year old data?
The emails, at least, are commonly used as a demonstration and training dataset for electronic discovery software. They're the de facto standard for teaching people how to find incriminating emails in a pile of boring business pablum. It's a nice bit of justice that thousands of lawyers, paralegals, and other law staff are regularly reminded of these crooks' names.
Beyond that, they're also used as one of the few large-scale corpuses of real emails available for text analysis and social network analysis research. For all of its deficiencies (e.g. it's one company, it's pretty old, it mostly predates smartphones), it's been used in hundreds of papers because it's what exists.
Luckily that data has also been mirrored in several places. For example, you can get the Enron emails (210GB, 1.2 million emails) via Amazon S3.
I don't know if similar work has been done on the other data (e.g. forensic accounting analysis of the bidding and price data), but I can at least imagine value in it.
posted by jedicus at 7:34 AM on January 30, 2019 [39 favorites]
Having seen how easily bits rot when no-one is actively checking every month whether the backups still restore properly, I'm not surprised at all. I wouldn't be surprised if the people in charge of maintaining the data have all moved on to other projects or retired. I wouldn't be surprised if the systems maintained by Lockheed Martin and CACI were decommissioned during a routine accounting exercise. Nobody has used this for years? Turn it off.
If no-one asks for a piece of data for a decade, that data will go missing unless you are paying a librarian to maintain it. Businesses don't like to pay librarians.
posted by clawsoon at 7:40 AM on January 30, 2019 [15 favorites]
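One way to make the "is it still there, and is it still intact?" check routine is to keep a checksum manifest beside the archive and re-verify it on a schedule. The following is only a minimal sketch: the function names `sha256_of`, `build_manifest`, and `find_problems` are hypothetical, and it assumes the archive is a plain directory of files rather than whatever FERC or its contractors actually ran.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_problems(root, manifest):
    """Compare the archive on disk with a previously saved manifest.

    Returns (missing, changed): files that have disappeared, and files
    whose contents no longer match the recorded checksum.
    """
    current = build_manifest(root)
    missing = sorted(name for name in manifest if name not in current)
    changed = sorted(
        name
        for name in manifest
        if name in current and current[name] != manifest[name]
    )
    return missing, changed
```

A script like this only detects loss. Somebody still has to be paid to run it, read the result, and act on it, which is the point of the comment above.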
The greatest trick the devil ever pulled was convincing corporations that all data had to be archived forever.
Blank tape sales are a good business, don't go screwing anything up for us!
( I kid, it's all in an Azure recovery volume these days )
posted by mikelieman at 7:51 AM on January 30, 2019
This all sounds like standard bureaucratic behavior, but I guess we can call it a conspiracy.
posted by Brocktoon at 7:52 AM on January 30, 2019 [1 favorite]
The greatest trick the devil ever pulled was convincing corporations that all data had to be archived forever.
Or convincing us that corporations would be good stewards of data of public interest.
posted by mhoye at 8:11 AM on January 30, 2019 [11 favorites]
Why would you need to keep 2TB of 17 year old data? I mean the company no longer exists and they were found guilty.
Throughout the early years of the Bush Jr. administration, Dick Cheney was repeatedly rumored to be connected to the Enron situation. He was even facing some initial investigation on the matter, going to greater and greater lengths to try to dodge it, including claiming that he was protected by executive privilege. However, right after the Senate ruled that executive privilege didn't cover that, Bush suddenly started talking about WMD's in Iraq and no one ever followed up. (How convenient, eh?)
And now that those 2TB are missing, no one ever can.
posted by EmpressCallipygos at 8:26 AM on January 30, 2019 [24 favorites]
I'm sure there are a lot of Accenture and former Arthur Andersen alumni who would like their roles in the fraud erased.
posted by benzenedream at 9:25 AM on January 30, 2019 [2 favorites]
The emails, at least, are commonly used as a demonstration and training dataset for electronic discovery software. They're the de facto standard for teaching people how to find incriminating emails in a pile of boring business pablum.
They're particularly useful for showing off the capacities of modern review software because the Enron people used a lot of code words for their more nefarious schemes. One of the skills you spend a lot of time developing if you have to review documents is figuring out your targets' vocabulary, so that you can do searches on words and phrases that maybe you wouldn't choose cold. The new software is good at co-location and other techniques so that you can see right away that "Death Star" occurs in a lot of the same places as, e.g., "special purpose entity." Thus obsoleting that skill, at least for projects where you can justify the cost of the software, sigh.
posted by praemunire at 9:29 AM on January 30, 2019 [6 favorites]
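For anyone wondering what "occurs in a lot of the same places" looks like mechanically, here is a toy sketch of document-level co-occurrence counting. It is not how any particular review product works: the function name `cooccurring_terms` is hypothetical, the four sample "emails" are invented for illustration and are not taken from the Enron corpus, and real tools use far more sophisticated matching than plain substring tests.

```python
from collections import Counter


def cooccurring_terms(documents, seed, vocabulary):
    """Count how many documents mention both `seed` and each vocabulary term.

    Matching is a case-insensitive substring test, which is crude but
    enough to show the idea.
    """
    seed_lower = seed.lower()
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        if seed_lower not in lowered:
            continue  # only documents that contain the code word count
        for term in vocabulary:
            term_lower = term.lower()
            if term_lower != seed_lower and term_lower in lowered:
                counts[term] += 1
    return counts


# Invented sample messages, for illustration only.
sample_docs = [
    "Death Star update: route it through the special purpose entity.",
    "Lunch order: turkey sandwich, no mayo.",
    "The special purpose entity paperwork is attached; see Death Star notes.",
    "Quarterly report draft is attached.",
]
vocabulary = ["special purpose entity", "sandwich", "quarterly report"]

hits = cooccurring_terms(sample_docs, "Death Star", vocabulary)
print(hits.most_common())
```

On this toy input, only "special purpose entity" shares documents with the code word, which is the kind of signal a reviewer would otherwise have to find by reading.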
I am reminded, tangentially, that Elizabeth "Theranos" Holmes' father once worked for Enron.
posted by doctornemo at 9:42 AM on January 30, 2019 [3 favorites]
To second jedicus, it seems like every e-discovery vendor's database of choice when demo-ing their platform is the Enron emails. I have sat through at least three trainings featuring them (Disco, Relativity, and I think Everlaw). Look, here's Ken Lay specifying what kind of sandwich he wants for lunch, and oh wait, here's some Enron traders talking about gaming the California electricity market. It's a great database for training because it's big enough to demonstrate features that only make sense for large data sets (sampling, filtering, analytics) and because most people are familiar enough with the main cast of characters that the search demonstrations are easy to follow.
posted by Aubergine at 2:24 PM on January 30, 2019 [2 favorites]
The survival of the dataset despite the failure of the people who were supposed to take care of it reminds me of Linus Torvalds's quip about real men doing their backups by uploading to ftp and letting the rest of the world mirror it.
posted by clawsoon at 2:50 PM on January 30, 2019
Why would you need to keep 2TB of 17 year old data? I mean the company no longer exists and they were found guilty. That's 2TB you could be using for cat pics.
Are you kidding? 2TB is $60 at Walmart these days. In addition to the reasons given above, there's just plain old historical interest. Sometimes people's dismissive stances here mystify me.
posted by JHarris at 4:07 PM on January 30, 2019 [6 favorites]
2TB is also about 111 seconds of uncompressed 8K video of cats, if I'm doing my math right, according to Wikipedia. (it's claiming that 120fps is the standard which makes for 144 Gbit/s at 7680×4320)
posted by XMLicious at 6:05 PM on January 30, 2019
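The arithmetic above can be checked in a few lines. This sketch takes the 144 Gbit/s figure quoted from Wikipedia at face value and assumes decimal terabytes (10^12 bytes). The 36 bits per pixel used to re-derive the bitrate is an assumption (12 bits per colour channel, no chroma subsampling) rather than something stated in the comment, and the function names are hypothetical.

```python
def seconds_of_video(storage_bytes, bitrate_bits_per_s):
    """How long a stream at the given bitrate takes to fill the storage."""
    return storage_bytes * 8 / bitrate_bits_per_s


def raw_bitrate(width, height, fps, bits_per_pixel):
    """Uncompressed video bitrate in bits per second."""
    return width * height * fps * bits_per_pixel


TWO_TB = 2 * 10**12            # assumed: decimal terabytes
QUOTED_BITRATE = 144 * 10**9   # 144 Gbit/s, as quoted in the comment

duration = seconds_of_video(TWO_TB, QUOTED_BITRATE)
estimate = raw_bitrate(7680, 4320, 120, 36)  # assumed: 36 bits per pixel

print(f"2 TB at 144 Gbit/s lasts about {duration:.1f} s")
print(f"Raw 8K at 120 fps is about {estimate / 1e9:.1f} Gbit/s")
```

Under those assumptions the duration comes out at roughly 111 seconds, and the re-derived bitrate lands just under the quoted 144 Gbit/s, so the math holds up.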
If you budget $60 to store something you won't have it a decade later. Doesn't matter if it's a terabyte or a kilobyte and media costs are not really a relevant variable.
As someone who has been affiliated with the same company off-and-on for a quarter century, I get the occasional "please give me this data that no one has thought about in two decades from a department that no longer exists" request routed my way because no one has even a clue where to start. I cringed a little looking at the correspondence because, yeah, that does happen.
posted by mark k at 8:19 PM on January 30, 2019 [3 favorites]
I get what you are saying but I think your angle isn't quite right.
There are some academically interesting public-domain images I cropped out of Google books scans and uploaded to English Wikipedia more than ten years ago, which have now been copied to dozens of Wikipedias in other languages, sites trying to achieve SEO objectives by republishing Wikipedia for free, Google Images and several other image search engines, offline copies of Wikipedia...
If I had fiddled around with steganographic tools and embedded your lower-limit 1K of data into one of the images, I'd have stored the data for ten years, possibly with a degree of durability such that it would still be findable in the aftermath of a nuclear war on a tablet with an offline copy of Wikipedia and OsmAnd offline maps and a spank bank of Biblical proportions, verily a multitude of gigacubits of porn, tucked away in some prepper's Faraday cage, at no cost.
So I think you need a few more constraints to make your challenge insurmountable.
posted by XMLicious at 9:17 PM on January 30, 2019
No, you're missing my point. It's not an insurmountable challenge; it was an insignificant goal.
Like if you embedded my 1K of data in some image as described but then changed jobs, then someone e-mails me and I say well, that was XMLicious' job, maybe I e-mail you and you even bother to reply saying you stored it steganographically and (sigh) documented it all when you left in a memo to your boss. But your boss quit seven years ago and the group was split into two parts, well, at this point I have better things to do than chase this down. Solving the engineering challenge in this situation, when it's not an engineering problem, is rather pointless.
Obviously if you think a set of data is important there are ways to make sure you don't lose it, like procedures that include routine review and are enforced by internal audits. Obviously on some level Lockheed-Martin didn't consider Enron's 2 TB of data important to them as a whole, which is OK to criticize but the solution is not "swing by Walmart."
posted by mark k at 9:35 PM on January 30, 2019 [5 favorites]
I can't believe I started this derail.
posted by JHarris at 11:03 PM on January 30, 2019 [2 favorites]
Oh, I see: you're talking about the kind of organizational dysfunction where even if you could figure out how to carve 1k of data into a granite tablet, and erected it as a monument outside your office's front door all for $60, within ten years they'd have moved to a different office and lost the address of the original one where the monument is. Or not even that, they'd all just say, "I never noticed that monument with the data we're looking for on" after walking right past it for ten years.
posted by XMLicious at 4:31 AM on January 31, 2019 [2 favorites]
(wasn't the point simply that 2TB isn't a whole heck of a lot of data in our modern world?)
posted by epersonae at 1:05 PM on January 31, 2019
YES, I thought that was obvious. I was going to say something bitterly sarcastic to that effect, but one doesn't want to hurt feelings, you know?
posted by JHarris at 1:59 PM on January 31, 2019
Are you kidding? 2TB is $60 at Walmart these days.
Great. How much was it in 200x? Because that's the price and time that matters.
posted by MikeKD at 4:29 PM on February 2, 2019 [1 favorite]
This thread has been archived and is closed to new comments
posted by zeoslap at 7:17 AM on January 30, 2019 [5 favorites]