No Hypothesis
December 19, 2010 5:36 PM
MeFi's own Elizabeth Pisani, of The Wisdom of Whores, on Big Data and the End of the Scientific Method (PDF).
"...end of the scientific method." I can't see straight because my eyes are rolling so hard.
posted by smcameron at 5:54 PM on December 19, 2010 [1 favorite]
posted by smcameron at 5:54 PM on December 19, 2010 [1 favorite]
ya know... 5 comments in nearly three years does not make her "MeFi's own"...
(and, I don't think we even have a clear title to her...)
posted by HuronBob at 6:09 PM on December 19, 2010 [3 favorites]
ya know... 5 comments in nearly three years does not make her "MeFi's own"...
+1
posted by spock at 6:11 PM on December 19, 2010 [2 favorites]
Systems biology is the kind of research I do. It does not replace the scientific method. Basically all it does is generate hypotheses, which are then tested and validated just as they were before, and that's not going to change. Yes, we can now find out all kinds of things we couldn't before, and data-driven analysis is awesome for giving us leads we would not have thought of on our own. But research doesn't stop there; the new ideas still need to be tested and followed up, just like all science, and the scientific method is as necessary as ever.
Data mining is part of how we do things now but it's only part, and I find some of the people quoted in here pretty naive with their understanding of how research works as a whole.
posted by shelleycat at 6:16 PM on December 19, 2010 [13 favorites]
It's called induction, and it has always been part of the scientific method.
posted by anthill at 6:18 PM on December 19, 2010 [1 favorite]
It's called induction, and it has always been part of the scientific method.
Sherlock Holmes would use deduction.
posted by ovvl at 6:20 PM on December 19, 2010
Processing power is not measured in gigabytes!
It'll be light years before people stop getting that wrong.
posted by justsomebodythatyouusedtoknow at 6:20 PM on December 19, 2010 [21 favorites]
Sherlock Holmes would use deduction.
Nah. That was just the cocaine talking.
posted by Dumsnill at 6:24 PM on December 19, 2010 [5 favorites]
It'll be light years before people stop getting that wrong.
Nah. Just about 12 parsecs.
posted by The Bellman at 6:25 PM on December 19, 2010 [2 favorites]
Systems biology ... does not replace the scientific method. Basically all it does is generate hypotheses, which are then tested and validated
That's more or less my take too. Data mining helps in the inductive, intuitive hypothesis construction step, but it doesn't replace hypothesis testing. It's a pretty big deal, as coming up with (good) hypotheses is really hard. Someone still needs to do the field or lab work though.
posted by bonehead at 6:31 PM on December 19, 2010
"You say "big data"; I hear the Spinal Tap song "Big Bottom". "My data fits me like a flesh tuxedo; I mapreduce it with my pink torpedo." - Bryan O'Sullivan
posted by mhoye at 6:56 PM on December 19, 2010 [1 favorite]
ya know... 5 comments in nearly three years does not make her "MeFi's own"...
Mefi's sorta.
posted by special-k at 7:01 PM on December 19, 2010 [2 favorites]
I'm wondering if she actually believes what she wrote or if this is how she's selling it to a popular magazine.
posted by mandymanwasregistered at 7:52 PM on December 19, 2010 [2 favorites]
I actually think "Mefi's Own" is now fully ironicised - what it seems to signify is "subject of a thread who paid their five dollars to thank people for interest in their work or to correct inaccuracies and was never heard from again". By which terms Charlie Stross isn't technically "Mefi's Own", and Graham Linehan is borderline.
posted by Grangousier at 8:08 PM on December 19, 2010 [1 favorite]
I kind of like how "ironicised" is sort of the opposite of "ironclad".
posted by smcameron at 8:11 PM on December 19, 2010
I thought this article from The New Yorker two weeks ago about how lack of rigor in applying the scientific method leads to unreliable results was pretty good. No connection to MeFi as far as I can tell though ...
posted by TheShadowKnows at 8:22 PM on December 19, 2010 [1 favorite]
One thing about signifying notables who joined to comment on one specific thread is THEY KNOW WHO WE ARE. And once they know who we are and acknowledge us, WE GOT 'EM. Yes, they are "Mefi's Own"... or maybe "Mefi's Pwn"
posted by oneswellfoop at 8:24 PM on December 19, 2010
Back before I had to forget everything I knew about Epidemiology in order to make room for the names of the little arteries that run along the intestines, I was super interested in how the field of Epidemiology would change as more and more health records get digitized. Imagine being able to comb through data on every interaction that every US citizen has had with the healthcare system.
It's like the prodigal child of the Framingham Heart Study and the Dartmouth Atlas all rolled into one, and with an n of over 300,000,000 it would have the statistical power to find things that nobody's ever been able to prove. This is to say nothing of its ability to answer some questions about comparative pharmaceutical effectiveness that could shake up the industry something awful. Of course, the Europeans and their centralized healthcare systems have been on this since forever. America lags behind with its ramshackle amalgamation of insurers and providers, and I sit around and shudder to think that I'm going to have to be a part of it.
posted by The White Hat at 8:24 PM on December 19, 2010 [1 favorite]
Late to the party, but yes, "Big Data" is a huge boon to the scientific mind; it allows perception of unexpected trends (OMG) that can generate hypotheses (WTF), which can be tested (BBQ).
posted by Mister_A at 8:26 PM on December 19, 2010
Big Data the scientist, allow me to introduce you to my friend NP; he's a traveling salesman. That's odd, Big Data just tabbed away. NP, what do you think... I see, you'll get back to me sometime after the death of the sun. Fuck you, NP.
posted by humanfont at 8:29 PM on December 19, 2010 [2 favorites]
I thought this article from The New Yorker two weeks ago about how lack of rigor in applying the scientific method leads to unreliable results was pretty good.
I thought it was infuriating and I muttered aloud at it irritably while reading it.
The author seems to be utterly transfixed by a sort of literal take on the researcher's description of how it feels to be disappointed by non-replicable results. He keeps talking about the data as if it's some mutable thing that is actually changing in meaning.
However, his sources are describing how flaws in study design and statistical analysis plus unacknowledged biases have created a weak spot in our present system of scientific communication.
posted by desuetude at 9:16 PM on December 19, 2010
Another circular article that sets up a straw man and then sets it on fire.
posted by benzenedream at 11:57 PM on December 19, 2010
This thread will mean the death of snark on MetaFilter!
posted by Mister_A at 5:14 AM on December 20, 2010
Sometimes you test your hypothesis by mining for contrary data. Or astronomy is like stamp collecting. One or the other.
posted by Kid Charlemagne at 5:48 AM on December 20, 2010 [1 favorite]
A few years ago, a Danish team working in Guinea Bissau discovered that a new fashion for giving Vitamin A at birth appears to be good for boys and bad for girls. The findings were dismissed as the result of an “unintended experiment” and thus to be ignored. Baby girls may die as a result, but no policy change will be recommended until a trial has been conducted on the specific question of gender difference and Vitamin A supplements.
I think this example is very telling. Vitamin A supplementation is considered one of the great public health triumphs of the recent era: among the most cost-effective and beneficial interventions, two doses of Vitamin A before the first birthday have been shown to reduce mortality by huge percentages (33%, 50% - huge, meaningful differences). Absence of Vitamin A in children can lead to blindness. There is a fairly large corpus of data supporting the hypothesis that Vitamin A supplementation is beneficial and generally A Good Thing.
But no - because this one study, which shows a different result from previous studies, needs to be replicated, obviously that's a sign that scientists don't care if baby girls die, they just want to do these cold, pointless trials.
Well, what if not supplementing baby girls is even more harmful? What if the analysis had hidden confounding variables? What if the study design wasn't (gasp!) perfect? Why should this one study be exempt from proving its data isn't crap? Most individual studies have some sort of flaw (small sample size, limited follow-up time, etc.), which is why we must repeat our results, especially when they are new.
The Vitamin A hypothesis ('the new fashion') arose from precisely the type of data mining she is advocating. Alfred Sommer was an ophthalmologist studying the effects of Vitamin A on certain types of childhood blindness in Nepal. Going through his longitudinal data set, he noticed that children with a subclinical Vitamin A deficiency or worse were simply not showing up in subsequent rounds of follow-up. From this he developed, tested, and re-tested his hypothesis, and he is now credited as the only ophthalmologist ever to save a billion lives. (Summary here.)
Man, I am just getting a wee bit tired of all these articles about how, with infinite data, we no longer need to state our biases, figure out the direction of causality, or try to disprove our treasured ideas.
posted by palindromic at 8:48 AM on December 20, 2010 [2 favorites]
eep, one million, not one billion lives.
And another example of an article that says data is the future of science, this time starring Sergey from Google.
posted by palindromic at 8:59 AM on December 20, 2010
If your image of Popper's hypothesis generation was of a scientist sitting quietly at a desk, waiting for the light bulb to go off -- yup, that was all a big myth. And if that's your image of hypothesis generation, it makes sense that you'd wonder exactly what all of this useful information you get from data dredging is. (Hint: it's a hypothesis.)
It's no secret that hypotheses are generated from data. You look at a chart and say, "Hey, look at this!" And then you test it.
If you're Google, and you want to find out whether you can find flu outbreaks quickly and reliably, you have to listen to the folks who ask, "Well, you can predict them just as well from the Oscars." It's not hard. You compare your algorithm to the Oscars algorithm. And of course, one would hope that the CDC had already beaten the Oscars algorithm, so if you beat the CDC, you've beaten the Oscars.
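Here's a toy sketch of that kind of bake-off -- every number below is invented, and none of the "predictors" is anyone's real model; the point is just that you score each candidate against the same observed series and the lower error wins:

```python
import numpy as np

rng = np.random.default_rng(42)
weeks = 52
truth = rng.poisson(100, size=weeks).astype(float)  # invented weekly flu counts

# Three hypothetical predictors: each is just the truth plus noise,
# noisier meaning worse. These stand in for real models.
predictors = {
    "CDC":    truth + rng.normal(0, 5, size=weeks),
    "Google": truth + rng.normal(0, 3, size=weeks),
    "Oscars": truth + rng.normal(0, 30, size=weeks),  # the silly baseline
}

for name, pred in predictors.items():
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    print(f"{name}: RMSE = {rmse:.1f}")
# If Google's error is below the CDC's, it is certainly below the Oscars'.
```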
Pisani writes,
“Fishing” is only a problem if the datasets are too small or the sampling design too weak to support the results.
But that is just. not. true.
Every time you ask a question, there's a one-in-twenty chance that you're going to get a spurious result from your data set. It doesn't matter if your n is seven billion! (and hopefully you'd have a pretty representative sample at that size :))
Do you want to know what happens when you can ask as many questions as you want, free from the burden of generating hypotheses? You get studies that claim that eating breakfast cereal makes you likelier to give birth to a boy.
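If you want to watch that happen, here's a minimal simulation -- pure noise, no real data: ask twenty "questions" of pairs of groups drawn from the same distribution, and about one comes back "significant" no matter how big n gets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, questions = 100_000, 20  # a huge n per group buys you nothing here

false_positives = 0
for _ in range(questions):
    # Both groups come from the SAME distribution, so any "effect" is noise.
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(f"{false_positives} of {questions} null questions hit p < 0.05")
# Expectation: about 1 in 20, regardless of n.
```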
White Hat wrote,
It's like the prodigal child of the Framingham Heart Study and the Dartmouth Atlas all rolled into one, and with an n of over 300,000,000 it would have the statistical power to find things that nobody's ever been able to prove. This is to say nothing of its ability to answer some questions about comparative pharmaceutical effectiveness that could shake up the industry something awful.
Let's say that you took this data set and asked it twenty questions. You get one hit! Obviously, you publish on the positive finding, not on the nulls.
Guess what: you just published one spurious finding. And you know how research is in medicine: if it affects someone, it's going to be acted on, and it's going to be decades until anybody even bothers to see if it can be replicated.
More likely: you find two hits. One spurious, the other not spurious. Which is which? Who knows-- you're going to have to do a real, hypothesis driven study. Yeah, it can be retrospective, so long as its data wasn't used to form your hypothesis.
Of course, since your hypothesis-generating data set was 300 million, I don't think there's anybody left to sample in the population.... (Whole of US, right?)
And the reality is nothing like this. The reality is asking hundreds and thousands of questions of your data set, and not paying any attention to how many questions you've even asked. The ease with which we can poll this "big data" ironically makes it less likely that we are going to find meaningful correlations from dredging it.
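The back-of-the-envelope math, assuming the questions are independent (real dredging is messier): the chance of at least one spurious hit among k questions tested at p < 0.05 is 1 - 0.95^k, and it races toward certainty.

```python
# Probability of at least one false positive among k independent
# questions, each tested at alpha = 0.05. Independence is a
# simplification; questions asked of one data set rarely are.
for k in (1, 20, 100, 1000):
    print(f"k = {k:4d}: P(at least one spurious hit) = {1 - 0.95 ** k:.3f}")
# k = 1 -> 0.050, k = 20 -> 0.642, k = 100 -> 0.994, k = 1000 -> 1.000
```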
Look, data sets are super rad. I'm into them. But they in no way free anyone from the responsibility of doing hypothesis-driven research (preferably RCTs).
posted by nathan v at 8:15 PM on December 20, 2010 [2 favorites]