A treasure trove for researchers
February 21, 2012 7:01 PM
Polltopia (625 kb zip file) "It's a treasure trove for researchers that I'm sure is unmatched in the world of modern polling: [Daily Kos has] assembled all the raw data for every single Daily Kos/SEIU poll conducted in 2011 into a single file. That's 46 polls, including questionnaires ... in a nifty 623 KB package. No one else releases information this granular, so if you've ever wanted to take a deep, deep dive into raw polling data, this is your chance."
posted by crunchland at 7:11 PM on February 21, 2012
Wouldn't KOS/SEIU polls be about as skewed (but in the opposite direction of) as polls conducted by FOX News/NRA?
posted by buggzzee23 at 7:12 PM on February 21, 2012
Yeah, but the skew is more granular.
posted by twoleftfeet at 7:15 PM on February 21, 2012 [3 favorites]
You're assuming that the people they poll are people who visit their sites. Here's some information about their polling methodology.
posted by crunchland at 7:23 PM on February 21, 2012 [4 favorites]
From what little I've seen so far, Kos valued accuracy over partisan results, so it shouldn't be as skewed as Fox/Rasmussen.
posted by honestcoyote at 7:24 PM on February 21, 2012
Time to crack open a can of R.
posted by TwelveTwo at 7:24 PM on February 21, 2012 [4 favorites]
Wouldn't KOS/SEIU polls be about as skewed (but in the opposite direction of) as polls conducted by FOX News/NRA?
Depends. It's not like ideological belief lies on one side of a spectrum, and polling accuracy lies on the other. The issue is sample selection, question selection, and the statistical models used to transform raw data into publicly-published numbers.
Which is to say, releasing the raw underlying data is interesting. It doesn't mean that Kos' polls are more accurate, but it makes it much easier for third parties with relevant experience to judge the accuracy.
posted by verb at 7:25 PM on February 21, 2012 [2 favorites]
These polls are carried out by PPP, looks like, which is thought of as Democratic-leaning but which does fine in Nate Silver's pollster accuracy ratings; about as well as the generally-thought-of-as-Republican Rasmussen Reports.
posted by escabeche at 8:09 PM on February 21, 2012
"Wouldn't KOS/SEIU polls be about as skewed (but in the opposite direction of) as polls conducted by FOX News/NRA?"
Not really. PPP is a reliable agency with stronger ratings than Rasmussen. A couple years ago, the guys purportedly running Kos polls got caught essentially making up all their data; now it's pretty solid.
Where you'll find bias is generally in question selection and order, though (IIRC) PPP does a good job of varying the ordering to minimize interference. Question wording is always dodgy, but the methodology is open for you to check.
posted by klangklangston at 8:36 PM on February 21, 2012
Q2: Do you approve or disapprove of Barack Obamaâs job performance?
Holy Mojibake, Batman! U+2019 (RIGHT SINGLE QUOTATION MARK) is encoded in UTF-8 as 0xE2 0x80 0x99. If you feed a file containing that sequence into a program that thinks its input is ISO-8859-1, it will see it as three characters. If that program subsequently outputs UTF-8, those three characters will be encoded as the three code points U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX), U+0080 (<control>), and U+0099 (<control>), resulting in the byte sequence 0xC3 0xA2 0xC2 0x80 0xC2 0x99, which will be displayed as "â". Some programs will show missing character boxes for the control characters, others won't.
Anyway, that's why the text of every question that contains an apostrophe looks mangled. The error in the workflow was due to a program thinking it was reading a file encoded with ISO-8859-1 (aka "Latin-1") when it was really encoded with UTF-8. Everything would have worked fine if this program was told the proper encoding of the data it was manipulating, but somebody forgot and it used a default. There is no such thing as "plain text". Everything has an encoding, and it's your responsibility to tell the software you use how the data it's processing is encoded. When you don't know or rely on defaults you create Mojibake.
(The problem is exacerbated by Microsoft products and blogging software that automatically turn the standard apostrophe (U+0027) into the right single quotation mark (U+2019). However, they cannot bear the full responsibility; the real issue is a workflow where programs are not told the correct encoding of the data they're processing. That this mistake happens to cause no harm when the data encoding is ASCII does not mean it's not an issue.)
posted by Rhomboid at 8:43 PM on February 21, 2012 [2 favorites]
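For anyone curious, the round trip Rhomboid describes is easy to reproduce. This is just an illustrative Python 3 sketch, not anything from the Daily Kos workflow:

```python
# Reproduce the mojibake chain: UTF-8 bytes misread as Latin-1, then re-encoded as UTF-8.
original = "Obama\u2019s"                 # contains U+2019, RIGHT SINGLE QUOTATION MARK

utf8_bytes = original.encode("utf-8")     # the apostrophe becomes the bytes 0xE2 0x80 0x99
misread = utf8_bytes.decode("latin-1")    # a Latin-1 reader sees three characters: U+00E2, U+0080, U+0099
mangled = misread.encode("utf-8")         # re-encoding those yields 0xC3 0xA2 0xC2 0x80 0xC2 0x99

print(mangled)                            # b'Obama\xc3\xa2\xc2\x80\xc2\x99s'
print(mangled.decode("utf-8"))            # shows "Obamaâ", two invisible control characters, then "s"

# The damage is reversible as long as nothing has stripped the control characters:
repaired = mangled.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)                           # Obama's, with the right single quotation mark restored
```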
This would be pretty awesome, except... did they just release raw numerical data without any value labels in either the data or the questionnaires?
posted by aaronetc at 8:47 PM on February 21, 2012
I don't get this --- I'm looking at 2011-01-06. The csv file has headings Q1 through Q21, but there are only 20 questions in the question_key_2011-01-06.txt file. And Q1 is "Do you have a favorable or unfavorable opinion of Barack Obama" but the numbers under Q1 range from 1 to 3 with one 4; what do these numbers mean?
posted by escabeche at 9:08 PM on February 21, 2012
Q21 in 1/6 is a constant, which leads me to believe it's some kind of interview disposition variable (e.g., "Completed Interview"). Doesn't explain the lack of labels, though. Other datasets are also screwy -- 7/28 has an extra six variables only measured in a subsample, which aren't listed in the questionnaire (one is obviously ZIP code).
posted by aaronetc at 9:29 PM on February 21, 2012
12-15 looks much cleaner. I think the numbers refer to the order in which the possible responses are listed.
posted by escabeche at 9:42 PM on February 21, 2012
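If escabeche's reading is right, recoding the numbers is straightforward. A rough sketch in Python; the file name and the labels below are my assumptions based on the 1/6 questionnaire, not anything documented in the release:

```python
# Map Q1's numeric codes to labels, assuming 1 = first listed response, 2 = second, and so on.
import pandas as pd

df = pd.read_csv("dk_seiu_2011-01-06.csv")   # hypothetical name for one poll's CSV

q1_labels = {1: "Favorable", 2: "Unfavorable", 3: "Not sure"}   # assumed response order
df["Q1_label"] = df["Q1"].map(q1_labels)     # unexpected codes (like the stray 4) become NaN

print(df["Q1_label"].value_counts(dropna=False))
```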
Isn't Daily Kos the outfit that bans folks for suggesting that Democratic politicians be subjected to primary challenges from the left?
posted by hamida2242 at 11:19 PM on February 21, 2012
I couldn't find any evidence of that in a google search for "daily kos banned." It does look like people do get banned from posting on Daily Kos, though, for various reasons, just like people get banned from Metafilter for various reasons. But I don't see what that has to do with this release of data. That would be like questioning the validity of the infodump because Metafilter banned some people.
posted by crunchland at 12:12 AM on February 22, 2012
Wouldn't KOS/SEIU polls be about as skewed (but in the opposite direction of) as polls conducted by FOX News/NRA?
Liberals have a reality bias.
posted by DU at 6:56 AM on February 22, 2012
crunchland: "That's 46 polls, including questionnaires ... in a nifty 623 KB package."
It's quaint how excited polisci nerds get over such a trivial amount of information.
posted by pwnguin at 8:55 AM on February 22, 2012 [2 favorites]
Row. 1000 respondents, broken down by week, excluding holiday weeks, for all of 2011.
posted by crunchland at 6:35 PM on February 22, 2012
I don't know if anybody's still reading this, but here's a link to a cleaned up version I made. It's a combined file, with all of the surveys (except one that had bad data) in one CSV file, and all of the questions harmonized (wordings changed). It includes all of the demographic components (including some extra bonus area code related stuff, like state name) as well as the 23 political questions in the survey that were asked in all, 1/2 or 1/4 of the surveys. (There's a hell of a lot of one-off questions; I dropped all of those.)
A data dictionary is included, with my interpretation of what the codes are. Let me know if you have any questions; feel free to pass it along.
posted by Homeboy Trouble at 10:42 PM on February 22, 2012 [2 favorites]
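For anyone who would rather roll their own combined file, the basic shape of that merge looks something like this. A sketch only: the file names and the column rename are placeholders for illustration, not Homeboy Trouble's actual script:

```python
# Stack the per-poll CSVs into one frame, tagging each row with its poll date
# and harmonizing question names along the way.
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob("dk_seiu_2011-*.csv")):          # hypothetical per-poll file names
    df = pd.read_csv(path)
    df["poll_date"] = path[-14:-4]                            # e.g. "2011-01-06" taken from the file name
    df = df.rename(columns={"Q1": "obama_favorability"})      # example rename; repeat per harmonized question
    frames.append(df)

combined = pd.concat(frames, ignore_index=True, sort=False)   # questions a poll skipped become NaN
combined.to_csv("dk_seiu_2011_combined.csv", index=False)
```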
If anyone's still reading, here's my first go at playing around with this data.
posted by escabeche at 2:55 PM on February 27, 2012