That's what they said
November 21, 2009 10:14 AM

The Michigan Corpus of Academic Spoken English is a searchable collection of almost 2 million words of transcribed spoken English from the University of Michigan, including student study groups, office hours, dissertation defenses, and campus tours. Researchers use the Michigan corpus to investigate questions about usage, like "less or fewer?" (cf. this contentious Ask Meta thread) and more general topics, like "Vague Language in Academia." Browse or search MICASE yourself.
posted by escabeche (20 comments total) 19 users marked this as a favorite
 
From the campus tour:

"guys move in come on come on come on. this is ridiculous. i'm not gonna shoot you guys i promise i'm not gonna shoot anybody. now, couple quick things, first thing that the c- campus tour is my favorite part of the entire orientation thing. this is where i get off the most, i absolutely get a big kick out of this because i think i give a pretty decent tour, and i've been doing tours actually for the last two years. so anyway, basically i have a really good time giving a tour. if you do not want to be here, i don't want you to be here and the reason is, because, i don't want you to cramp my style i don't want you to cramp anybody else that wants to actually be here, so if you don't want to be here that's fine."

"i can tell you why i have a water gun. now the reason i have a water gun is because not so that i can shoot you guys [SU-f: no no no ] i promise see this is this is definitely group love this is spade love okay? definitely group love all the other groups pretty much suck okay? family at U-of-M talked about that a lot this morning but all the other groups they suck. now okay, we're divided, okay, so we're gonna be yelling a lot of, semi-obscene things at the other groups, we're gonna be shooting them with water "

Also, "fuck" is mostly said in lab.
posted by escabeche at 10:15 AM on November 21, 2009 [4 favorites]


[this is awesome]

"fuck" is mostly said in lab.

I can corroborate this from my own personal experience.
posted by grouse at 10:24 AM on November 21, 2009 [1 favorite]


Some of these conversations make the people sound totally stupid:

"mhm i wro- i w- i'm interested in the um, international aspect, [S1: uhuh, uhuh ] more, of a um, of a, program or whatnot so, like the international, business i was gonna do, it's a really, you know open field, you know like all that stuff but i don't, think that that's what i wanna do anymore, so"

(And that was someone in an "Honors Advising" session.)
posted by autoclavicle at 10:25 AM on November 21, 2009


Well you are less likely to encounter small hydrogen explosions and unintentional high amperage 10,000 volt discharges outside of labs.
posted by Zalzidrax at 10:29 AM on November 21, 2009


This is super interesting! I only wish other campuses were creating large tagged speech corpora like this – it'd be neat to compare. Especially where dialects are concerned. I couldn't find it, but was hoping that in their statistical/demographic breakdown page there was a way to sort by Michigan resident vs. nonresident speaker. There's gotta be some cool stuff that happens with younger individuals over time in academic settings with lots of dialect mixture, or where the geographic area has a well-defined or recognizable regional dialect in place.
posted by iamkimiam at 10:37 AM on November 21, 2009


Michigan was one of the first and best universities to make linguistics a serious subject and discipline ...
posted by Postroad at 10:37 AM on November 21, 2009


For autoclavicle.
posted by dhartung at 10:38 AM on November 21, 2009


This is awesome. Go Wolverines! (Though their football team is maximally sucking today ..... )
posted by blucevalo at 10:41 AM on November 21, 2009


I majored in anthro with a focus on linguistics for a while... it made me neurotic as hell.
posted by autodidact at 10:41 AM on November 21, 2009


I only wish other campuses were creating large tagged speech corpora like this – it'd be neat to compare.

I would like to compare with the corpus of speech at MetaFilter meetups in different geographical locations.
posted by grouse at 10:43 AM on November 21, 2009




Or perhaps I should have said it's <EVENT DESC="STROKES BEARD">kind of neat.
posted by Rhomboid at 10:53 AM on November 21, 2009


One of the things that's interesting to me about transcribed speech corpora is the lack of 'cleanup' compared to transcription for things like quotes in newspapers. (Actually, I recall reading arguments around transcriptions of politicians' speech: filler being retained to make people seem less articulate, etc.) Hesitations, false starts, 'um's, etc. are all potentially interesting for linguistic research, but I always find them kind of hilarious to read in print. Granted,

"i'm interested in the international aspect, more of a program like the international business i was gonna do. it's a really open field, like all that stuff but i don't think that that's what i want to do anymore, so..."

is still kind of word salad, but not quite as terrible as it appears with all the hesitation/revision noise left in.
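
For what it's worth, here is a minimal sketch of that kind of cleanup in Python. Everything in it is an assumption made for illustration: the `clean_transcript` name, the filler list, the guess that bracketed spans are other speakers' backchannel, and the guess that a trailing hyphen marks a cut-off word. It is not how MICASE or any newspaper actually does it, and it leaves repeated words like "of a of a" alone.

```python
import re

# Hypothetical filler list; a real project would tune this to its own data.
FILLERS = ("you know", "uhuh", "mhm", "um", "uh")

# Assumed conventions: [S1: uhuh ] style spans are interjections from
# another speaker, and a word ending in "-" (e.g. "wro-") is a false start.
_BRACKETED = re.compile(r"\[[^\]]*\]")
_TRUNCATED = re.compile(r"\b\w+-(?!\w)")
_FILLER = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?",
    re.IGNORECASE,
)


def clean_transcript(text: str) -> str:
    """Strip bracketed spans, cut-off words, and fillers, then tidy spacing."""
    text = _BRACKETED.sub(" ", text)
    text = _TRUNCATED.sub(" ", text)
    text = _FILLER.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    text = re.sub(r"\s+,", ",", text)          # no space before a comma
    text = re.sub(r",(?:\s*,)+", ",", text)    # merge doubled commas
    return text


if __name__ == "__main__":
    sample = "the international aspect, [S1: uhuh, uhuh ] more, of a um, of a, program"
    print(clean_transcript(sample))
```

Hyphenated compounds such as "medium-sized" survive because the hyphen is followed by another letter. Anything smarter, like collapsing "i i i'm" into "i'm", needs more than regular expressions.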
posted by heyforfour at 11:12 AM on November 21, 2009 [1 favorite]


also, I would be really interested in research on dialect mixture like iamkimiam describes!
posted by heyforfour at 11:13 AM on November 21, 2009


With a little bit of data massaging, I expect one could use this for training text to speech engines. If one fancied abiding by the terms:

An Individual License grants one named person permission to use the MICASE audio files and accompanying transcripts for non-commercial purposes...
Neither the audio files nor the transcripts may be redistributed to non-licensed users or used in commercial applications, research, materials, or publications without the express permission of the University of Michigan English Language Institute.


500 dollars and you can't share it with a coworker or use it for research without their permission. This is what academic fair use is for, but do we really need to mention it to a research institution? In contrast, I can download pretty much anything genomic related without charge. Example. Perhaps we need to expand the public access mandate?
posted by pwnguin at 11:29 AM on November 21, 2009 [1 favorite]


Rhomboid: "Or perhaps I should have said it's <EVENT DESC="STROKES BEARD">kind of neat."

<EVENT DESC="STICKS ITS OWN HEAD UP ITS ASS" WHO=PAYPAL>
posted by idiopath at 12:29 PM on November 21, 2009


I'm in love with that tour guide, man. I wanna be his buddy.
posted by lauranesson at 12:37 PM on November 21, 2009


My god. It's like a big David Foster Wallace novel.
posted by zer0render at 1:26 PM on November 21, 2009


Don't be absurd; it's at best like a medium-sized David Foster Wallace novel.
posted by escabeche at 4:02 PM on November 21, 2009


Having done a very small bit of voice recognition work, this would be tremendously handy, would that its license were more permissive.

There are no open source speech corpora available. Some people are trying to make their own, such as VoxForge; however, it is slow going, as VoxForge is totally dependent on volunteers. This data set could literally revolutionize how human beings interact with computers. (Give the Free Software folks enough data to accurately transcribe speech on the fly.)

It's a damn shame that a public university keeps this data behind a restrictive license.
posted by Freen at 4:56 PM on November 21, 2009 [2 favorites]



