ETAION SHRDLCUM
January 7, 2013 6:20 PM

New letter and word frequency counts: Peter Norvig has used Google Books data to generate new lists of letter frequency, the most common English words and their frequencies, and lots of other fun stuff (though I don't know if forschungsgemeinschaft is really an English word, unless it means forcing a mine shaft).
posted by hexatron (41 comments total) 22 users marked this as a favorite
 
Well, this totally fuxxors my Hangman strategy.
posted by Etrigan at 6:22 PM on January 7, 2013 [1 favorite]


Here's a better Hangman strategy. It's best to take it very seriously, as one should all games of life and death.
posted by Llama-Lime at 6:24 PM on January 7, 2013 [17 favorites]


Is he related to Russell Norvig?
posted by miyabo at 6:28 PM on January 7, 2013 [1 favorite]


All words are English words, eventually.
posted by pompomtom at 6:37 PM on January 7, 2013 [4 favorites]


Now I want an English word generator, the output of which matches these distributions.
posted by TwelveTwo at 6:40 PM on January 7, 2013
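
One simple reading of this wish is to draw existing words at random in proportion to their counts, rather than to invent new words. A minimal sketch, assuming a tiny hypothetical count table (the numbers are placeholders, not Norvig's figures) and the standard library's `random.choices`:

```python
import random

# Hypothetical word counts for illustration only; real counts would come
# from the published frequency tables.
WORD_COUNTS = {"the": 50, "of": 30, "and": 20, "to": 15, "in": 10}


def sample_words(counts, n, seed=None):
    """Draw n words with probability proportional to their counts."""
    rng = random.Random(seed)  # a seed makes the output repeatable
    words = list(counts)
    weights = [counts[w] for w in words]
    return rng.choices(words, weights=weights, k=n)


if __name__ == "__main__":
    print(" ".join(sample_words(WORD_COUNTS, 10, seed=0)))
```

Over a long enough run the sampled word frequencies approach the table's proportions, and the letter frequencies follow from them. Generating novel words would need a different model, such as one built on letter n-grams.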


miyabo: Either I don't know who Russell Norvig is or you're thinking of the AI book by Stuart Russell and Peter Norvig.
posted by jjwiseman at 6:42 PM on January 7, 2013


We recede ever further into the future with each passing month.

I mean, think about this a minute: this guy downloaded word counts for a very substantial fraction of every work published in the English language. Ever. For free. From some unspecified location somewhere on a global network. And he massaged and manipulated all this data with a desktop computer. It probably wasn't even that expensive, just a fairly routine, run of the mill, garden-variety home PC.

743 billion written words analyzed, and he could do it sitting in his house, sipping tea, with his slippers on, for the cost of several hours of heavy traffic on his Internet connection, and a few dollars in electricity.

We are living in a goddamn science fiction novel.
posted by Malor at 6:53 PM on January 7, 2013 [24 favorites]


This isn't much different than I recall reading in "Gödel, Escher, Bach" thirty years ago. Is this really breaking news?
posted by hwestiii at 6:55 PM on January 7, 2013


Etaoin Shrdl Cum on Feel the Noize
posted by jonp72 at 7:06 PM on January 7, 2013 [1 favorite]


I swear I remember reading a book when I was a kid that listed the nine most frequently used letters--in descending order--as ETAONIRSH. The book supplied, as a mnemonic, the following piece of dialogue: "Ever tickle an ostrich?" "No, I'd rather suck hummingbirds." I've remembered this now for over 25 years, and it has proven to come in very handy when playing hangman. I guess according to this new research I will need to revise my strategy and reverse a few letters in my list.

Seems rather peculiar, but I am pretty sure I am remembering this mnemonic correctly. The Googles do not return any enlightenment on its source, however. Anyone else remember this? I would love to know what book it was from.
posted by hurdy gurdy girl at 7:17 PM on January 7, 2013 [3 favorites]


This isn't much different than I recall reading in "Gödel, Escher, Bach" thirty years ago. Is this really breaking news?

Did you take a wrong turn on the way to the Huffington Post?
posted by empath at 7:23 PM on January 7, 2013


"Ever tickle an insane ostrich?" "No, somehow he really doesn't like contact." "...um."
posted by wanderingmind at 7:28 PM on January 7, 2013 [5 favorites]


run of the mill, garden-variety home PC

Waaait a minute, was the PC from the mill or the garden?
posted by moonmilk at 8:27 PM on January 7, 2013


No mention of Pogo yet?
posted by Chrysostom at 8:54 PM on January 7, 2013


We are living in a goddamn science fiction novel.

This sure does seem like a weird thing to have this reaction about. The amazing ability of Google Translate to provide, well, not really good or accurate, but adequate translations of natural text, using the vast storehouses of data Google has available? That is really kind of startling, and shows the power of having data.

This is counting to a bigger number than you could count to before.

I liked it a lot, though!

But I'm not going to start going around saying ETAOIN SRHDLC, that sounds terrible.
posted by escabeche at 9:03 PM on January 7, 2013


Eat Taters Accomplice? Oranges in Norwich sound reasonably haute demanding less cash.
posted by Brent Parker at 9:36 PM on January 7, 2013


To help your touch type speed, practice until you learn the most common 50 words on his list.
posted by JujuB at 10:53 PM on January 7, 2013


This is counting to a bigger number than you could count to before.

It's being able to do, at your desktop, what once would have taken the resources of a good sized nation to accomplish. Dunno how old you are, but I learned computing with an 8-bit machine, and the era of 64-bit computing and massive memory and storage totals is a near-constant marvel. I have a machine running at 4.4 GHz, with 16 gigs of RAM, sitting on my desk, and I remember, not so long ago, running servers for hundreds of people that were maybe a tenth this fast. I really, really remember how hard stuff like this used to be... dealing with data sizes of this magnitude was once extremely difficult. Heck, back in the days of computers with 64 or 128K of RAM, simply running a goddamn spell check was a difficult technical feat -- the programmer had to hold both your whole program and a big dictionary in nowhere near enough RAM for both. The algorithms they came up with were so very clever. And, on modern machines, so very obsolete.

I mean -- try to grok this if you can, it used to be basically impossible to hold a reasonable sized dictionary in RAM, much less a dictionary plus a document. Just couldn't do it. And, yet, the word-frequency corpus of a huge fraction of the total books ever published in English will fit in about 23 gigs of RAM -- and you can now, very cheaply, put 32 gigs of RAM on a home computer. We've gone, in my lifetime, from not being able to hold one dictionary, to being able to hold millions.

I suppose it's kind of like being constantly pleased with color television, but it's just .... this sort of little "authentic" detail, coloration for the world, as it were, was the sort of pie-in-the-sky bullshit that people like William Gibson used to invent. It was magic, mysterious, far away, and impossible, something that might happen in the Far Future. But, lo and behold, it's suddenly real.

For whatever reason, translation doesn't impress me nearly as much, whether or not that's actually justified. It's cool, but it's not the sort of thing that used to drive a sense of longing for more computer power.
posted by Malor at 11:01 PM on January 7, 2013 [4 favorites]


I smell an opportunity for found poetry.

The of, and to in a, is that,
For it, as was with be by on.
"Not he," i this are, or his from,
At which but have an had they?

You were their--
One--
All we can, her has there been!
If more, when will would who so no?
posted by not_on_display at 11:23 PM on January 7, 2013 [4 favorites]


OMFG are you friggin' KIDDING ME??? I LOVE THIS MAN. No, I seriously love this man. Peter Norvig, I LOVE YOU. What he's done is add this very small but PERFECT contribution to my PhD dissertation. I've been needing this exact piece. This exact piece. I LOVE THE FUTURE.

I just woke up. My day, week and month have been made. I didn't know that the bi-gram letter frequency of a billion google books could make me so happy, but there you go.

hexatron, I also love you for posting this.
posted by iamkimiam at 12:43 AM on January 8, 2013 [12 favorites]


Malor, he probably did it at his work, where they have a lot of computers.
posted by thelonius at 12:44 AM on January 8, 2013 [1 favorite]


thelonius, Norvig does work for Google, but he did this on his own PC.
Here's what we can do with today's computing power (using publicly available data and the processing power of my own personal computer; I'm not relying on access to corporate computing power)
posted by nangar at 1:27 AM on January 8, 2013


ah, I had not seen that - that is impressive
posted by thelonius at 1:47 AM on January 8, 2013


How many years before his comment "Don't try this on your phone" is obsolete?
posted by CheeseDigestsAll at 5:11 AM on January 8, 2013 [1 favorite]


We are living in a goddamn science fiction novel.

Meanwhile, in the B plot, revolutionaries among the oppressed and defenseless, whose resources and futures are being sucked dry to make it possible, are....but that would be giving it away.
posted by DU at 5:23 AM on January 8, 2013 [2 favorites]


Sorry just my little joke.
posted by miyabo at 5:49 AM on January 8, 2013


The part where he gets into the most popular N-grams makes me wonder how representative of natural language (or, I guess, natural written language) the corpus is. The second most common 8-gram is "national"; #2 and #3 for the 9-grams are the first and last nine letters of "government." Is it possible that politics or current-events sources are overrepresented?
posted by psoas at 5:56 AM on January 8, 2013
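
Part of this effect comes from how letter n-grams are tallied: a ten-letter word such as "government" contains two overlapping 9-grams, so one frequent long word can fill several top slots. A minimal sketch, assuming n-grams are counted inside single words and weighted by word count (the counts below are hypothetical, and this is not necessarily how the linked analysis computed them):

```python
from collections import Counter


def letter_ngrams(word_counts, n):
    """Count letter n-grams inside words, weighting each word by its count."""
    totals = Counter()
    for word, count in word_counts.items():
        w = word.lower()
        # Slide a window of width n across the word; words shorter than n
        # contribute nothing.
        for i in range(len(w) - n + 1):
            totals[w[i:i + n]] += count
    return totals


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    sample = {"government": 3, "national": 5, "nation": 2}
    print(letter_ngrams(sample, 9).most_common())
```

With these placeholder counts, the only 9-grams are "governmen" and "overnment", each with the full count of "government".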


I don't think Google claims their corpus is meant to be anything like a random sample.
posted by escabeche at 6:27 AM on January 8, 2013


There aren't a lot of 8+ letter words in common use to choose from.
posted by empath at 6:50 AM on January 8, 2013


There aren't a lot of 8+ letter words in common use to choose from.

Two-thirds of the comments in this thread contain at least one unique eight-plus-letter word, including this one.
posted by Etrigan at 6:55 AM on January 8, 2013


Here are all the 8-letter words in this thread (including comment 31 by Etrigan):
WORD COUNT
remember 4
computer 4
language 3
strategy 3
comments 2
probably 2
counting 2
fraction 2
possible 2
obsolete 2
hexatron 2
favorite 2
breaking 2
mnemonic 2
hundreds 1
publicly 1
Internet 1
peculiar 1
terrible 1
constant 1
supplied 1
research 1
moonmilk 1
accurate 1
national 1
anything 1
slippers 1
lifetime 1
bullshit 1
generate 1
thinking 1
machines 1
whatever 1
analyzed 1
location 1
document 1
dialogue 1
hwestiii 1
massaged 1
actually 1
suddenly 1
adequate 1
personal 1
millions 1
practice 1
reaction 1
SHRDLCUM 1
politics 1
posted by iamkimiam at 8:08 AM on January 8, 2013 [2 favorites]
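
A tally like this can be reproduced with a few lines of code. A minimal sketch, assuming words are maximal runs of ASCII letters and that case is ignored (both are assumptions, so the output may differ slightly from the list above, which keeps the original capitalization):

```python
import re
from collections import Counter


def words_of_length(text, length):
    """Count case-insensitive alphabetic words of exactly `length` letters."""
    # Splitting on anything that is not a letter means hyphenated terms and
    # contractions are broken into their parts.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(w.lower() for w in words if len(w) == length)


if __name__ == "__main__":
    text = "I remember my computer. Remember the strategy? A computer!"
    for word, count in words_of_length(text, 8).most_common():
        print(word, count)
```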


I clearly do not know how to do a tab in HTML.
posted by iamkimiam at 8:09 AM on January 8, 2013


Pointing out that your list does not include "including" would be churlish, but then, I am rarely accused of insufficient churlishness.
posted by Etrigan at 9:43 AM on January 8, 2013


This is a little like when I discovered the KJV was a dreadfully inaccurate Bible translation — and then decided I didn't care and kept reading and quoting it anyway because it's still the one that sounds right.

Okay, fine, ETAOIN SHRDLU is no longer our best estimate for English letter frequency ranking. It's still got a damn sight more poetry and gravitas than this SHRDLCUM business. Or, uh, something like that.
posted by and so but then, we at 10:02 AM on January 8, 2013


"including" is in the list of 9-letter words, where it has a frequency count of one.
posted by iamkimiam at 12:03 PM on January 8, 2013 [2 favorites]


You are absolutely correct, and I apologize for my churlishness.
posted by Etrigan at 1:18 PM on January 8, 2013


I am amazed that so many people here have actually heard of the "etaoin shrdlu" thing.
posted by jenfullmoon at 7:25 PM on January 8, 2013


Malor, he probably did it at his work, where they have a lot of computers.
posted by thelonius


I am a huge fan of the deadpan understatement.

You just won all the points forever.
posted by benito.strauss at 7:44 PM on January 8, 2013


There aren't a lot of 8+ letter words in common use to choose from.
thelonius: he probably did it at his work, where they have a lot of computers.
nangar: thelonius, Norvig does work for Google, but he did this on his own PC.
Probably
millions
machines
generate
counting
research.

Bullshit,
personal
computer
actually
analyzed
language.
posted by benito.strauss at 8:05 PM on January 8, 2013 [3 favorites]


I am amazed that so many people here have actually heard of the "etaoin shrdlu" thing.
posted by jenfullmoon at 10:25 PM on January 8

There seems to be a fair number of newspaper folks on Metafilter, though, to my surprise, a couple of ex-newsies that I work with now online hadn't heard of it. I think editors who worked in the old hot-type shops probably are most familiar with it.
posted by etaoin at 9:29 PM on January 19, 2013


I think it might be more a case of "nerds of a certain age", who read Martin Gardner's column in SciAm in the 70s, which is where I first came across it.
posted by benito.strauss at 10:54 PM on January 19, 2013 [1 favorite]

