ETAION SHRDLCUM
January 7, 2013 6:20 PM
New letter and word frequency counts. Peter Norvig has used Google Books data to generate new lists of letter frequency, the most common English words and their frequencies, and lots of other fun stuff (though I don't know if forschungsgemeinschaft is really an English word, unless it means forcing a mine shaft).
His (Metafilter) Longest Palindrome has gotten much longer, too.
Here's a better Hangman strategy. It's best to take it very seriously, as one should all games of life and death.
posted by Llama-Lime at 6:24 PM on January 7, 2013 [17 favorites]
Is he related to Russell Norvig?
posted by miyabo at 6:28 PM on January 7, 2013 [1 favorite]
All words are English words, eventually.
posted by pompomtom at 6:37 PM on January 7, 2013 [4 favorites]
Now I want an English word generator, the output of which matches these distributions.
posted by TwelveTwo at 6:40 PM on January 7, 2013
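One rough way to approach a generator like that is a letter-bigram Markov chain: count which letter tends to follow which, then sample new "words" from those counts. This is only a sketch under stated assumptions. The tiny `sample_words` list stands in for real frequency data (it is not Norvig's table), and the names `build_bigram_model` and `generate_word` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def build_bigram_model(words):
    """Count, for each letter, the letters that follow it.

    '^' marks the start of a word and '$' marks the end.
    """
    model = defaultdict(Counter)
    for word in words:
        padded = "^" + word.lower() + "$"
        for current, following in zip(padded, padded[1:]):
            model[current][following] += 1
    return model


def generate_word(model, rng, max_len=12):
    """Sample one pseudo-word, letter by letter, weighted by bigram counts."""
    letters = []
    current = "^"
    while len(letters) < max_len:
        followers = model[current]
        choices = list(followers.keys())
        weights = list(followers.values())
        current = rng.choices(choices, weights=weights, k=1)[0]
        if current == "$":  # the model chose to end the word here
            break
        letters.append(current)
    return "".join(letters)


# Stand-in corpus (an assumption): a handful of very common English words.
sample_words = "the of and to in a is that for it as was with be by on not he".split()
model = build_bigram_model(sample_words)
rng = random.Random(0)  # seeded so repeated runs give the same words
print([generate_word(model, rng) for _ in range(5)])
```

With real bigram counts in place of the stand-in list, the output would follow the published letter-pair distribution more closely, though it would still produce plenty of non-words.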
miyabo: Either I don't know who Russell Norvig is or you're thinking of the AI book by Stuart Russell and Peter Norvig.
posted by jjwiseman at 6:42 PM on January 7, 2013
We recede ever further into the future with each passing month.
I mean, think about this a minute: this guy downloaded word counts for a very substantial fraction of every work published in the English language. Ever. For free. From some unspecified location somewhere on a global network. And he massaged and manipulated all this data with a desktop computer. It probably wasn't even that expensive, just a fairly routine, run of the mill, garden-variety home PC.
743 billion written words analyzed, and he could do it sitting in his house, sipping tea, with his slippers on, for the cost of several hours of heavy traffic on his Internet connection, and a few dollars in electricity.
We are living in a goddamn science fiction novel.
posted by Malor at 6:53 PM on January 7, 2013 [24 favorites]
This isn't much different than I recall reading in "Gödel, Escher, Bach" thirty years ago. Is this really breaking news?
posted by hwestiii at 6:55 PM on January 7, 2013
Etaoin Shrdl Cum on Feel the Noize
posted by jonp72 at 7:06 PM on January 7, 2013 [1 favorite]
I swear I remember reading a book when I was a kid that listed the nine most frequently used letters--in descending order--as ETAONIRSH. The book supplied, as a mnemonic, the following piece of dialogue: "Ever tickle an ostrich?" "No, I'd rather suck hummingbirds." I've remembered this now for over 25 years, and it has proven to come in very handy when playing hangman. I guess according to this new research I will need to revise my strategy and reverse a few letters in my list.
Seems rather peculiar, but I am pretty sure I am remembering this mnemonic correctly. The Googles do not return any enlightenment on its source, however. Anyone else remember this? I would love to know what book it was from.
posted by hurdy gurdy girl at 7:17 PM on January 7, 2013 [3 favorites]
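For anyone who wants to check a letter ordering like ETAONIRSH against their own text, a minimal sketch follows. The function name `rank_letters` and the sample sentence are assumptions for illustration; a short sample will not reproduce the ranking from a large corpus.

```python
from collections import Counter


def rank_letters(text, top_n=9):
    """Return the top_n most frequent letters A-Z in text, most frequent first."""
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    return "".join(letter for letter, _ in counts.most_common(top_n))


# Any text will do; longer samples give a steadier ranking.
sample = "Ever tickle an ostrich? No, I'd rather suck hummingbirds."
print(rank_letters(sample))
```

Ties are broken by whichever letter appeared first in the text, so the tail of the ranking is not meaningful for short inputs.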
This isn't much different than I recall reading in "Gödel, Escher, Bach" thirty years ago. Is this really breaking news?
Did you take a wrong turn on the way to the Huffington Post?
posted by empath at 7:23 PM on January 7, 2013
"Ever tickle an insane ostrich?" "No, somehow he really doesn't like contact." "...um."
posted by wanderingmind at 7:28 PM on January 7, 2013 [5 favorites]
run of the mill, garden-variety home PC
Waaait a minute, was the PC from the mill or the garden?
posted by moonmilk at 8:27 PM on January 7, 2013
We are living in a goddamn science fiction novel.
This sure does seem like a weird thing to have this reaction about. The amazing ability of Google Translate to provide, well, not really good or accurate, but adequate translations of natural text, using the vast storehouses of data Google has available? That is really kind of startling, and shows the power of having data.
This is counting to a bigger number than you could count to before.
I liked it a lot, though!
But I'm not going to start going around saying ETAOIN SRHDLC, that sounds terrible.
posted by escabeche at 9:03 PM on January 7, 2013
posted by Brent Parker at 9:36 PM on January 7, 2013
To help your touch type speed, practice until you learn the most common 50 words on his list.
posted by JujuB at 10:53 PM on January 7, 2013
This is counting to a bigger number than you could count to before.
It's being able to do, at your desktop, what once would have taken the resources of a good sized nation to accomplish. Dunno how old you are, but I learned computing with an 8-bit machine, and the era of 64-bit computing and massive memory and storage totals is a near-constant marvel. I have a machine running at 4.4GHz, with 16 gigs of RAM, sitting on my desk, and I remember, not so long ago, running servers for hundreds of people that were maybe a tenth this fast. I really, really remember how hard stuff like this used to be.... dealing with data sizes of this magnitude was once extremely difficult. Heck, back in the days of computers with 64 or 128K of RAM, simply running a goddamn spell check was a difficult technical feat -- the programmer had to hold both your whole program and a big dictionary in nowhere near enough RAM for both. The algorithms they came up with were so very clever. And, on modern machines, so very obsolete.
I mean -- try to grok this if you can, it used to be basically impossible to hold a reasonable sized dictionary in RAM, much less a dictionary plus a document. Just couldn't do it. And, yet, the word-frequency corpus of a huge fraction of the total books ever published in English will fit in about 23 gigs of RAM -- and you can now, very cheaply, put 32 gigs of RAM on a home computer. We've gone, in my lifetime, from not being able to hold one dictionary, to being able to hold millions.
I suppose it's kind of like being constantly pleased with color television, but it's just .... this sort of little "authentic" detail, coloration for the world, as it were, was the sort of pie-in-the-sky bullshit that people like William Gibson used to invent. It was magic, mysterious, far away, and impossible, something that might happen in the Far Future. But, lo and behold, it's suddenly real.
For whatever reason, translation doesn't impress me nearly as much, whether or not that's actually justified. It's cool, but it's not the sort of thing that used to drive a sense of longing for more computer power.
posted by Malor at 11:01 PM on January 7, 2013 [4 favorites]
I smell an opportunity for found poetry.
The of, and to in a, is that,
For it, as was with be by on.
"Not he," i this are, or his from,
At which but have an had they?
You were their--
One--
All we can, her has there been!
If more, when will would who so no?
posted by not_on_display at 11:23 PM on January 7, 2013 [4 favorites]
OMFG are you friggin' KIDDING ME??? I LOVE THIS MAN. No, I seriously love this man. Peter Norvig, I LOVE YOU. What he's done is add this very small but PERFECT contribution to my PhD dissertation. I've been needing this exact piece. This exact piece. I LOVE THE FUTURE.
I just woke up. My day, week and month have been made. I didn't know that the bi-gram letter frequency of a billion google books could make me so happy, but there you go.
hexatron, I also love you for posting this.
posted by iamkimiam at 12:43 AM on January 8, 2013 [12 favorites]
Malor, he probably did it at his work, where they have a lot of computers.
posted by thelonius at 12:44 AM on January 8, 2013 [1 favorite]
thelonius, Norvig does work for Google, but he did this on his own PC.
Here's what we can do with today's computing power (using publicly available data and the processing power of my own personal computer; I'm not relying on access to corporate computing power)
posted by nangar at 1:27 AM on January 8, 2013
How many years before his comment "Don't try this on your phone" is obsolete?
posted by CheeseDigestsAll at 5:11 AM on January 8, 2013 [1 favorite]
We are living in a goddamn science fiction novel.
Meanwhile, in the B plot, revolutionaries among the oppressed and defenseless, whose resources and futures are being sucked dry to make it possible, are....but that would be giving it away.
posted by DU at 5:23 AM on January 8, 2013 [2 favorites]
The part where he gets into the most popular N-grams makes me wonder how representative of natural language (or, I guess, natural written language) the corpus is. The second most common 8-gram is "national"; #2 and #3 for the 9-grams are the first and last nine letters of "government." Is it possible that politics or current-events sources are overrepresented?
posted by psoas at 5:56 AM on January 8, 2013
I don't think Google claims their corpus is meant to be anything like a random sample.
posted by escabeche at 6:27 AM on January 8, 2013
There aren't a lot of 8+ letter words in common use to choose from.
posted by empath at 6:50 AM on January 8, 2013
There aren't a lot of 8+ letter words in common use to choose from.
Two-thirds of the comments in this thread contain at least one unique eight-plus-letter word, including this one.
posted by Etrigan at 6:55 AM on January 8, 2013
Here are all the 8-letter words in this thread (including comment 31 by Etrigan):
WORD COUNT
remember 4
computer 4
language 3
strategy 3
comments 2
probably 2
counting 2
fraction 2
possible 2
obsolete 2
hexatron 2
favorite 2
breaking 2
mnemonic 2
hundreds 1
publicly 1
Internet 1
peculiar 1
terrible 1
constant 1
supplied 1
research 1
moonmilk 1
accurate 1
national 1
anything 1
slippers 1
lifetime 1
bullshit 1
generate 1
thinking 1
machines 1
whatever 1
analyzed 1
location 1
document 1
dialogue 1
hwestiii 1
massaged 1
actually 1
suddenly 1
adequate 1
personal 1
millions 1
practice 1
reaction 1
SHRDLCUM 1
politics 1
posted by iamkimiam at 8:08 AM on January 8, 2013 [2 favorites]
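A tally like the one above can be reproduced with a few lines of Python. This is a sketch, not the method actually used for the list: the function name `count_words_of_length` and the sample text are assumptions, and it lowercases everything, so "Internet" and "SHRDLCUM" would appear as "internet" and "shrdlcum".

```python
import re
from collections import Counter


def count_words_of_length(text, length=8):
    """Count how often each word of exactly `length` letters appears in text."""
    words = re.findall(r"[A-Za-z]+", text)  # runs of ASCII letters only
    return Counter(w.lower() for w in words if len(w) == length)


# Stand-in text (an assumption); paste the thread here to repeat the real count.
sample = "I remember the computer. Remember the language of the computer strategy."
counts = count_words_of_length(sample, 8)

print("WORD COUNT")
for word, n in counts.most_common():
    print(word, n)
```

Passing a different `length` gives the 9-letter list and so on; words with apostrophes or accented letters are split at the non-ASCII character under this simple pattern.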
Pointing out that your list does not include "including" would be churlish, but then, I am rarely accused of insufficient churlishness.
posted by Etrigan at 9:43 AM on January 8, 2013
This is a little like when I discovered the KJV was a dreadfully inaccurate Bible translation — and then decided I didn't care and kept reading and quoting it anyway because it's still the one that sounds right.
Okay, fine, ETAOIN SHRDLU is no longer our best estimate for English letter frequency ranking. It's still got a damn sight more poetry and gravitas than this SHRDLCUM business. Or, uh, something like that.
posted by and so but then, we at 10:02 AM on January 8, 2013
"including" is in the list of 9-letter words, where it has a frequency count of one.
posted by iamkimiam at 12:03 PM on January 8, 2013 [2 favorites]
You are absolutely correct, and I apologize for my churlishness.
posted by Etrigan at 1:18 PM on January 8, 2013
I am amazed that so many people here have actually heard of the "etaion shrdlu" thing.
posted by jenfullmoon at 7:25 PM on January 8, 2013
Malor, he probably did it at his work, where they have a lot of computers.
posted by thelonius
I am a huge fan of the deadpan understatement.
You just won all the points forever.
posted by benito.strauss at 7:44 PM on January 8, 2013
There aren't a lot of 8+ letter words in common use to choose from.
thelonius: he probably did it at his work, where they have a lot of computers.
nangar: thelonius, Norvig does work for Google, but he did this on his own PC.
Probably millions machines generate counting research. Bullshit, personal computer actually analyzed language.
posted by benito.strauss at 8:05 PM on January 8, 2013 [3 favorites]
I am amazed that so many people here have actually heard of the "etaion shrdlu" thing.
posted by jenfullmoon at 10:25 PM on January 8
There seems to be a fair number of newspaper folks on Metafilter, though, to my surprise, a couple of ex-newsies that I work with now online hadn't heard of it. I think editors who worked in the old hot-type shops probably are most familiar with it.
posted by etaoin at 9:29 PM on January 19, 2013
I think it might be more a case of "nerds of a certain age", who read Martin Gardner's column in SciAm in the 70s, which is where I first came across it.
posted by benito.strauss at 10:54 PM on January 19, 2013 [1 favorite]
posted by Etrigan at 6:22 PM on January 7, 2013 [1 favorite]