BASP+NHSTP=0
March 19, 2015 7:45 AM

Academic journal bans p-value significance test An editorial published in the academic journal Basic and Applied Social Psychology (BASP) has declared that the null hypothesis significance testing procedure (NHSTP) is 'invalid' and has banned it from future papers submitted to the journal.

Statslife.org has opinions from Peter Diggle, Stephen Senn, Andrew Gelman, Geoff Cumming and Robert Grant.

Choice quotes:
"The editorial is kinder to Bayesian inference, but not by much, stating that ‘The usual problem with Bayesian procedures is that they... generate numbers where none exist.’ The journal’s preferred approach is to rely on descriptive statistics. This begs the admittedly difficult question of what, in any particular circumstance, is the correct way to convert a description into a conclusion. "- Diggle

"The day this came out, I received 10 emails about it, mostly from people I don’t even know." -Gelman

More here and here.

(editorial is here but website is down)
posted by MisantropicPainforest (58 comments total) 28 users marked this as a favorite
 


Hypothesis testing should not be the measure of publishable significance, probably.

But that doesn't really justify banning it from all papers in your journal, IMO.
posted by grobstein at 7:58 AM on March 19, 2015 [1 favorite]


Portal 2 was a red herring. The lemons want only you and your car, the house is safe. For the moment.
posted by Slackermagee at 8:10 AM on March 19, 2015


The "null hypothesis significance testing procedure" (NHSTP) has been declared invalid. From now on we will accept only papers that apply the "nil hypothesis test scenario principle" (NHTSP).
posted by sfenders at 8:10 AM on March 19, 2015 [3 favorites]


Huh...when I was in grad school there was a group of people arguing that we shouldn't put stars (i.e. indicate p-values) on our tables, essentially because p-values are so often misinterpreted. I, and others I knew, found this kind of insulting: If I misinterpret p-values, then absolutely ding me for that, but don't tell me "you can't possibly be trusted with p-values!".

The post and the first article suggested this was about some other problem with p-values, but I don't see it. I see a whole bunch of arguments about the ways people misinterpret p-values. I assume exogenous' point in linking that article was also about misinterpretation of p-values. The editorial doesn't actually give a reason.

So why not just stop publishing articles that misinterpret p-values? Or is there some other argument we're not seeing here?

Also, this business about how, if you take a finding with a p-value of .01 and replicate it, you don't get a p-value of .01 again seems silly. Why would you expect to get a p-value of .01 if you replicated the study? One article says you would expect to get that p-value 99% of the time? Huh? Why? That makes no sense at all.
posted by If only I had a penguin... at 8:15 AM on March 19, 2015 [12 favorites]


Or that have to do with roadways (NHTSA).
posted by dorque at 8:15 AM on March 19, 2015


Interesting. I think too many studies are hyperfocused on "p<0.05 means we found something exciting", but in the absence of the p-value, will articles get harder or easier for non-statisticians to read and understand?
posted by demiurge at 8:15 AM on March 19, 2015


US highway fatality rate versus the tonnage of fresh lemons imported from Mexico

I think this is a great example of the dangers of collinearity, as I suspect that both US highway fatalities and fresh lemons imported from Mexico are positively correlated with time. Proper statistical analysis would remove that correlation first anyway.
posted by dialetheia at 8:18 AM on March 19, 2015 [4 favorites]
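
A minimal sketch of the shared-trend point above (not from the thread; the series names and numbers below are invented): two unrelated variables that each follow a time trend look strongly correlated until the trend is regressed out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical yearly data: both series follow a time trend plus noise,
# but are otherwise unrelated to each other.
years = np.arange(1996, 2016)
t = years - years.min()
fatality_rate = 20.0 - 0.4 * t + rng.normal(0, 0.5, t.size)     # declining trend
lemon_imports = 200.0 + 15.0 * t + rng.normal(0, 10.0, t.size)  # rising trend

# The raw correlation is large in magnitude purely because of the shared trend.
print("raw correlation:", np.corrcoef(fatality_rate, lemon_imports)[0, 1])

# Detrend each series by regressing it on time and keeping the residuals.
def detrend(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

resid_corr = np.corrcoef(detrend(fatality_rate, t), detrend(lemon_imports, t))[0, 1]
print("correlation after detrending:", resid_corr)  # near zero
```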


The null hypothesis (that null hypothesis testing is valid) cannot be rejected until they provide a p-value for the result. And find a journal that accepts their approach to publish the results.
posted by vorpal bunny at 8:20 AM on March 19, 2015 [9 favorites]


So why not just stop publishing articles that misinterpret p-values? Or is there some other argument we're not seeing here?


I suspect the problem is that the p-values are a problem for the readers, not the authors. Regardless of its accuracy, an argument that confuses 75% of the audience is a really bad argument.

Add on top of it that it's easy for the authors to get it wrong and you've got a recipe for a predictable rathole every time any paper is published.

It's nice and well to say that this would best be solved by everyone sitting down to wrap their heads around p-values once and for all, but history shows that ain't gonna happen. So we'll leave a minor bit of analysis out of the paper and everyone can move along.
posted by Tell Me No Lies at 8:31 AM on March 19, 2015 [6 favorites]


I think this is a great example of the dangers of collinearity

How so? Collinearity is a problem only during some MCMC sampling or if it's perfect. As they say, collinearity is really micronumerosity.
posted by MisantropicPainforest at 8:39 AM on March 19, 2015




collinearity is really micronumerosity sporting a snazzy smock
posted by quonsar II: smock fishpants and the temple of foon at 8:47 AM on March 19, 2015 [3 favorites]


It's nice and well to say that this would best be solved by everyone sitting down to wrap their heads around p-values once and for all,

Everyone who would be reading an academic journal DID sit down and wrap their heads around p-values, once and for all. They did it in undergraduate statistics.

The problem isn't that this is too complicated for your run-of-the-mill PhD to get or that nobody ever teaches this, it's that people get sloppy. Journal editors removing sloppiness is no harder than going through an article and hunting down anything based on a standard error. AND, since the people reading the articles are the same people publishing the articles, a journal editor reminding an author that your p-value of .000 does not mean you found a really big effect or a very important variable is also reminding readers of journal articles of the same thing.
posted by If only I had a penguin... at 8:50 AM on March 19, 2015 [2 favorites]
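
A hedged illustration of the "p = .000 does not mean a big effect" point (a made-up simulation, assuming SciPy is available): with a large enough sample, a trivially small difference yields a vanishingly small p-value, which is why the effect size has to be reported and read on its own terms.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical example: a tiny true difference between two groups (0.02 SD),
# but an enormous sample size.
n = 200_000
group_a = rng.normal(0.00, 1.0, n)
group_b = rng.normal(0.02, 1.0, n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
cohens_d = (group_b.mean() - group_a.mean()) / np.sqrt(
    (group_a.var(ddof=1) + group_b.var(ddof=1)) / 2
)

# The p-value can be vanishingly small while the effect remains trivial:
# statistical significance says "probably not zero", not "big" or "important".
print(f"p = {p_value:.2e}, Cohen's d = {cohens_d:.3f}")
```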


Sure, p-values can lead to misinterpretation, but I'm not sure if the alternative, no p-values, is going to lead to less misinterpretation. In fact, the whole reason that they are widely used was to allow some sort of standard of evidence to say things like "population A has greater quantity X than population B by Y amount (p < arbitrary limit)". Now will that still be said, except without the p-value statement, leading to a less informative statement? If authors write "A is greater than B" we are left without any standard of evidence or much insight into the statement or its potentially valid and true interpretations. The other direction to go would be to lessen the amount of quantitative argument in a paper, which seems like a way to get even more misinterpretation.

P-values are pretty terrible ways to communicate, but I think they're far better than the alternatives.
posted by Llama-Lime at 8:53 AM on March 19, 2015 [5 favorites]


When this happened I was trying to dig into it a little bit. One (or both) of the editors really has a problem with, like, the epistemological basis for inference based on the null hypothesis. As far as I can tell, they have written a couple of papers on the subject, basically arguing that (again, this is based on my cursory examination of their stuff, so I welcome corrections or clarifications) whether or not you reject the null does not tell you anything about whether the null is actually true. I haven't gone and read the articles he's written explaining this in more detail, just the editorial he wrote when he became editor a year before this new practice was introduced.

There are problems with the interpretation of p-values and probably a tendency to hunt around for specifications that give you significance. Confidence intervals/standard errors of estimates are probably "better" for the person trying to interpret things, and reporting effect size is important. (In economics, people often talk about the economic significance of these results, in terms of benchmarks people understand - differences in dollars, or levels of GDP, or welfare based on common benchmarks researchers are familiar with.) This appears to be part of what the editors want instead, which is probably not bad; perhaps it's a little more fraught in psychology. But, yeah, I don't think 'p-values are easily misinterpreted' is the reason the original ban happened, it seems deeper than that.
posted by dismas at 8:53 AM on March 19, 2015 [4 favorites]


The linked article seems to say that the journal is disallowing p-values because a good p-value isn't necessarily proof of the significance of a discovery.

So (and this is a serious question), what standard will they use instead? "The graph looks good"?
posted by amtho at 9:04 AM on March 19, 2015 [3 favorites]


whether or not you reject the null does not tell you anything about whether the null is actually true.

But that's just another misinterpretation of the p-value. It's not meant to tell you if the null is actually true. And of course you would report effect sizes. The effect sizes are the actual findings. And in my field at least standard errors are in every table. Each set of numbers serves its own purpose. There's no reason for having to choose between p-values and standard errors and effect sizes. They should all be in there, and each interpreted based on what it can actually say.
posted by If only I had a penguin... at 9:06 AM on March 19, 2015 [5 favorites]


"The graph looks good?"

It's psychology so it's more about whether the graph feels good.
posted by walrus at 9:06 AM on March 19, 2015 [2 favorites]


They mention Bayesian inference as an alternative, but their editorial clearly suggests that they don't know what it is or how to do it.
posted by MisantropicPainforest at 9:09 AM on March 19, 2015 [1 favorite]


So (and this is a serious question), what standard will they use instead? "The graph looks good"?

The people I went to grad school with were proponents of confidence intervals instead of p-values. Confidence intervals tell you how precisely you've estimated the effect sizes and you can still see statistical significance in them because if the confidence interval excludes the null value it's significant at the same confidence level as the confidence interval. They wouldn't want you to notice that, though. Confidence intervals are wonderful and useful and I have no issue with them.

However, this journal seems opposed to confidence intervals as well, presumably because they are fruit of the same poisonous standard error tree. So I think they basically want an eyeball test of effect sizes, combined with larger samples (i.e. smaller standard errors = smaller confidence intervals = more precise estimates, but let's not actually estimate any of those values).

I don't think they're so much suggesting Bayesian inference as an alternative as sighing and agreeing to put up with it, if you must.
posted by If only I had a penguin... at 9:14 AM on March 19, 2015 [3 favorites]
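
A small sketch of the confidence-interval duality described above (invented data, assuming SciPy): the 95% interval excludes the null value exactly when the two-sided test rejects at the 5% level, while also showing the size and precision of the estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical one-sample setup: does the mean differ from 0?
x = rng.normal(0.3, 1.0, 40)
n, mean, sem = x.size, x.mean(), stats.sem(x)

# 95% confidence interval for the mean (Student t, n-1 degrees of freedom).
lo, hi = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

# Two-sided one-sample t-test against 0.
t_stat, p_value = stats.ttest_1samp(x, 0.0)

print(f"95% CI: ({lo:.3f}, {hi:.3f}), p = {p_value:.4f}")
# The interval excludes 0 exactly when p < 0.05: the same information,
# but the interval also displays the estimated effect and its precision.
print("CI excludes 0:", lo > 0 or hi < 0, "| p < 0.05:", p_value < 0.05)
```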


…I suspect that both US highway fatalities and fresh lemons imported from Mexico are positively correlated with time.

They might also be correlated with tequila. Just sayin...
posted by TedW at 9:16 AM on March 19, 2015 [1 favorite]


But I just learned about p-values! Don't take away the nails, I just got my hammer!
posted by ThePinkSuperhero at 9:24 AM on March 19, 2015 [6 favorites]


Oh, don't worry, people have been banging the shit out of screws, light switches, and thumbs with that hammer for decades.
posted by escabeche at 9:38 AM on March 19, 2015 [20 favorites]


...BASP authors David Trafimow and Michael Marks of Mexico State University...

My hypothesis is that Mexico and New Mexico are distinguishable even from the UK, provided you have an editor who gives a rat's ass.
posted by Killick at 9:54 AM on March 19, 2015 [2 favorites]


But that's just another misinterpretation of the p-value. It's not meant to tell you if the null is actually true. And of course you would report effect sizes. The effect sizes are the actual findings. And in my field at least standard errors are in every table. Each set of numbers serves its own purpose. There's no reason for having to choose between p-values and standard errors and effect sizes. They should all be in there, and each interpreted based on what it can actually say.

Right, I don't disagree! Which is why I was sort of hoping someone would swoop in and say "You're misreading his argument, he actually means _____."
posted by dismas at 10:08 AM on March 19, 2015


I'm totally for banning NHST, but I feel like the decision is pretty flawed. Jaynes does a pretty solid job demolishing the arguments from the latter half of the editors' note in 3.81 of his book.

As for all the FUD about Laplace and priors, I dunno. It looks like he's in some kind of slap fight with Wagenmakers that, as mentioned above, doesn't really jibe with the theory or practice of Bayesian inference.

Essentially, we should be arguing about the predictive properties of our theories, not statistical properties of our data.
posted by ethansr at 10:18 AM on March 19, 2015 [1 favorite]


He received 10 emails. The horror. I'll betcha that's not even a statistically significant increase.
posted by dances_with_sneetches at 10:23 AM on March 19, 2015 [1 favorite]


Everyone who would be reading an academic journal DID sit down and wrap their heads around p-values, once and for all. They did it in undergraduate statistics.

I'm certain they wrapped their heads around the technical details. It is not clear to me, however, that everyone walked away understanding the reason for their existence and the full whys and whens of how to use them.
posted by Tell Me No Lies at 12:32 PM on March 19, 2015 [4 favorites]


This has been doing the rounds in my academic circles, and I was wondering if it was going to show up here.

I'll admit I initially had concerns. Judging by the way the editorial seems to have mutilated Laplace's principle of insufficient reason, I'm not sure that Trafimow and Marks are entirely on top of all the statistical theory involved. But there's a methodological crisis in psychology at the moment, and I don't think it's wise to let perfect be the enemy of good. Right now Trafimow and Marks are on the side of the angels, and some of the hurf-durf-psychologists-are-dumb commentary coming from statisticians, like this gem from Wasserman, is actively harmful. Here's his helpful little contribution in full:
This is ridiculous.
Should we outlaw alcohol just because some people mis-use it?

Any statistical method can be used properly or poorly.
That’s true of p-values, confidence intervals (which they also banned)
and Bayesian method.

The journal should judge each paper on its merits.
A blind ban on p-values and confidence intervals is insane.

Larry Wasserman
Professor of Statistics
Carnegie Mellon University

I'm more than a little annoyed at this kind of comment, and I've seen quite a number of variations on it recently. Yes, every statistical method can be abused in some fashion. But not all tools have equal potential for abuse: some are worse than others. In fact, the tools that psychologists have been handed by statisticians who don't know a damn thing about the specifics of our discipline tend to be among the worst. Oh, the number of times I have seen talented experimental psychologists try to jam a perfectly sensible hypothesis into an awkward and highly inappropriate ANOVA. Ugh.

And for once I actually do know that of which I speak, having been an associate editor at several top-tier psychology journals for many years. I've been involved with broad methodological journals, discipline-specific journals, and super-technical quantitative journals. In this time I've seen all manner of abuses unintentionally perpetrated by psychologists who were just doing what their introductory statistics textbook told them to, and felt powerless to fix it without rejecting 80% of the papers that cross my desk. The situation sucks. I'm sure it makes smart folks like Wasserman feel good to rubbish on scientists who don't know their Neyman-Pearson lemma from their Cramer-Rao lower bound, but the fact of the matter is that one of the major contributors to this mess is that psychologists were naive enough to actually listen to the dumbed-down condescending shit spoon-fed to us by statisticians who seem to act as if the answers to every scientific question can be found within the GLM.

At this point I'm not willing to listen to their opinions any more. Fortunately, I don't have to: there is no shortage of statistically competent psychologists who I trust. From the cognitive science perspective, I trust E.J. Wagenmakers and Michael Lee. I trust Jeff Rouder and Rich Morey. I trust John Kruschke. Not one of them could match Wasserman as a pure statistician (hell, I'm not sure they're any better than me, and frankly I'm not that great), but every single one of them made their bones as an experimental cognitive psychologist. Every one of them has been burned trying to do what the introductory textbooks say we should do. Every one of them has a deep grasp of how data analysis in psychology actually works in the wild. I'm yet to meet a "proper statistician" who does.

Given the scale of the problem faced by psychology at this point, I have some advice for any statistician thinking of (a) contributing to the ruin of a scientific discipline by defending a badly broken status quo or (b) getting up on their high horse to defend p-values on technicalities. I would strongly advise pursuing option (c): shut the fuck up and listen to what psychologists are saying.

Wasserman's twee bit of advice that "the journal should judge each paper on its merits" fundamentally misjudges the grip that the "blind obedience to the p-value" dogma has on the hearts and minds of the reviewer pool at many, many journals. I cannot count the number of times I have seen smart scientists overrule their own good judgment because their data fall on the wrong side of the magic "p<.05" line. From an editorial point of view it's a nightmare because this problem affects not just the authors of manuscripts, but the majority of the reviewer pool too. Sure, if Wasserman wants to volunteer to review the statistics for every submission himself, he's welcome to do so and it would solve my problem. In fact, I've got several tedious jobs sitting in my editorial queue that I'll be very happy to pass off to him right now if he wants to volunteer.

In the meantime, since my inbox remains curiously devoid of helpful offers from celebrity statisticians and my editorial budget remains oddly fixed at $0, I find that I have a lot of sympathy for Trafimow's position: burn it all down, and throw the baby out with the bathwater if necessary. Individual authors need to be able to convince reviewers and readers using common sense arguments constructed from simple descriptive statistics and nothing else. It's depressingly low tech, but at least it stops good scientists from overruling common sense solely because they see the magic "p<.05".

I really don't mean to pick on Wasserman. He's a very, very good statistician, and his comments aren't really all that atypical. And look, I'll even take the moment to plug All of Statistics. It's a gorgeous little book, and one I often give to my grad students. I think he's way smarter than me, but he's not a practicing scientist, and it shows. He doesn't understand the scale of the problems that psychology faces, and he's not shown any willingness to actually help us out. So he has no business offering an opinion on this topic, none at all.

In the end I think the "alcohol abuse" analogy is a very good one. But I don't think it means what Wasserman thinks it means. In the real world, alcohol can ruin communities. If not used safely, it endangers people's health and livelihoods. Because of this, in many communities where systematic abuse exists, the collective action taken by the group is to impose a blanket ban on alcohol. When a community does this, we call it responsible public policy. Anyone acting like their personal right to drink alcohol trumps the broader needs of the community is being a dick.

I feel the same way about statisticians telling psychologists that we're not allowed to ban inferential procedures that have a solid track record of breaking when used by psychologists in the wild. Either offer up some of your time to help, or shut the fuck up.

posted by mixing at 2:06 PM on March 19, 2015 [18 favorites]


Please keep in mind that this is asked with all sincerity.

If it's the case that:

"but the fact of the matter is one of the major contributors to this mess is the fact that psychologists were naive enough to actually listen to the dumbed-down condescending shit spoon-fed to us by statisticians who seem to act as if that the answers to every scientific question can be found within the GLM. "

Why is psychology doing so much worse than other disciplines in this respect?
posted by MisantropicPainforest at 2:45 PM on March 19, 2015


Why is psychology doing so much worse than other disciplines in this respect?

I doubt that it's a special problem for psychology. It's the social sciences; the complexity of the phenomena being studied poses some unique problems.

Psychology is just huge and widely discussed.
posted by Kutsuwamushi at 4:07 PM on March 19, 2015


Why is psychology doing so much worse than other disciplines in this respect?

It's a good question, and I wish I had a satisfying answer. I don't know enough about other disciplines to know what counts as typical practice elsewhere, but I can point to a number of things I feel are wrong in my own area.

In part I think there's an education problem that goes all the way back to undergraduate classes. A typical psychology undergraduate at my institution will get two semester-long classes covering introductory statistics. The purpose of these classes is to get the students to the point that they can do basic data analysis. I've taught several of these classes, and they're very hard to do well. The vast majority of these students are terrified of statistics. Teaching this class well is generally an exercise in calming down the students, and getting them to the point where they feel comfortable. Most of the time this means that statistical theory gets shortchanged. Most institutions (in Australia at least) will focus entirely on classical test theory, but will present it in a very simplified form. For instance, at no point will someone stop and talk about the importance of stopping rules for determining the admissibility of the p-value. So none of my students will know why the t-test breaks if you see a p-value of .08 and decide to use that as a reason to continue running the experiment. It's unfortunate though: in the real world, data often arrive sequentially and researchers have to make decisions about the data on the fly. But the training we give to our students does not discuss sequential hypothesis testing methods at all.

In an ideal world, by the time (some of) these students make it into academia, they'd have been given more advanced training. But unless you do your graduate work in a strong quantitative department -- somewhere where people read Psychometrika or the Journal of Mathematical Psychology for fun -- that just doesn't seem to happen. I look around my department and there are maybe 2-3 people who would be able to explain to me why the t-test breaks when applied in a sequential hypothesis testing context. There would only be a couple of people who really understand the difference between Type I, II and III tests in ANOVA. These are really smart people and experts in their field, of course, they just don't seem to have had the statistical training. So most of us are fumbling around in the dark with tools we don't really understand, relying almost entirely on introductory textbooks that don't discuss the realities of the scientific process. That's even more dangerous than just relying on your gut.

But what's worse is that this gets combined with a very heavy rhetorical emphasis on "statistical training" that gives people the illusion of competence. Most people I know actually do think they understand statistical theory. Yet I get people confidently asserting that it's impossible to accrue evidence for a null hypothesis under any inferential procedure. I get people confidently offering Bayesian (mis)interpretations of the p-value. I get people completely unable to tell me the difference between a fixed effect and a random effect, much less why it matters. I try to explain these things to people (despite not feeling super-confident myself) and yet for some reason they never lose faith in their own ability to analyse data, or to offer weird statistical comments in review processes.

My intuition is that this over-reliance on introductory methods causes people to assume that all data analysis must be framed in terms of these tools. So you end up with scientists who don't even try to build their own model for the data based on their substantive knowledge of the topic at hand, but instead try to find a way to force the science to fit the statistics. They do this not because they are stupid, but because this is what they were trained to do and they think this is what they are supposed to do.

I know this is turning into another absurdly long comment (sorry!) but I think a concrete example helps give a sense for just how strong the "introductory statistics tramples the science" mentality can be... A couple of months ago I got a revise & resubmit decision on one of my papers, from a pretty good journal. In this paper we'd employed a pretty standard factorial design: two "treatments" each with several levels, all fully crossed. We had a strong a priori theory that led us to expect a very specific ordering across the cell means. Obviously, our design lends itself naturally to the method of planned comparisons, since we're only interested in one specific contrast. Trying to force it into an ANOVA doesn't make a lot of sense: neither main effect is interesting, and there's no value in testing for a generic "interaction", because our theory is consistent only with one specific pattern. So of course we didn't run an ANOVA. Why would we? It makes no sense. Yet, when the reviews came back, the editor expressed concern about our analysis, asking "where is the ANOVA that one would traditionally expect?" or words to that effect. It would be really easy to look down at this guy for failing to understand the design, but I think that's a mistake. I know the editor by reputation: he's very smart. But the practice of "doing an ANOVA" has such a strong hold over everyone's mind that its mere presence is soothing, and its absence marks a paper as potentially flawed.

In this context the ANOVA has become a ritual, and a mindless one at that. Everyone sort of knows this. But it's so hard to shake off that feeling -- that fear, even -- that if you don't follow "the rules", then something bad will happen. After all, the voice in the back of your head says, the statisticians said that ANOVA is the tool for this problem, didn't they? And they're the experts, right? Right? Unless you happen to have a statistician in your back pocket, it's pretty hard to make that nagging fear vanish.

And so that's how it goes. The standard procedures in the field don't match up to what a good real-world statistician might suggest; they faithfully follow the procedures laid down in an introductory textbook, no matter how inappropriate they are in the wild.
posted by mixing at 4:52 PM on March 19, 2015 [7 favorites]
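
A rough simulation of the stopping-rule problem mentioned above (all parameters invented, SciPy assumed): if you test after every batch of subjects and stop as soon as p < .05, the long-run false-positive rate under a true null climbs well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def optional_stopping_false_positive(n_start=20, n_max=200, step=10, alpha=0.05):
    """Simulate one null experiment with peeking: keep adding subjects
    until p < alpha or the sample-size budget runs out."""
    x = list(rng.normal(0.0, 1.0, n_start))  # the null is true: the mean really is 0
    while True:
        p = stats.ttest_1samp(x, 0.0).pvalue
        if p < alpha:
            return True               # a "significant" result under the null
        if len(x) >= n_max:
            return False
        x.extend(rng.normal(0.0, 1.0, step))

n_sims = 2000
hits = sum(optional_stopping_false_positive() for _ in range(n_sims))
# The nominal rate is 5%; with peeking every 10 subjects it lands well above that.
print(f"false-positive rate with optional stopping: {hits / n_sims:.3f}")
```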


It's depressingly low tech, but at least it stops good scientists from overruling common sense solely because they see the magic "p<0.05"

Aside from the other difficulties you describe, if 0.05 is treated as a magic number that makes people discard all reason, I think you've got some kind of deeper problem when there's work to be done that's intrinsically reliant on statistics. It's not that high a standard. I know next to nothing about statistics, but it doesn't take a whole lot of fancy mathematical training to understand what p=0.05 is like; anyone who's played games that involve 20-sided dice gets it.

Coincidentally, just this morning I was reading a blog post in which p=0.09 is taken as suggestive of something interesting going on.
posted by sfenders at 5:07 PM on March 19, 2015


Thanks for your reply, mixing. I wonder if there simply haven't been the career incentives for psychologists to do methods, and that could explain it.

it's not that high a standard.

Depends on the n.

Also, many papers have robustness checks, which make it harder to get a p-value below the alpha level by chance.
posted by MisantropicPainforest at 5:21 PM on March 19, 2015


My first thought was oh gods, please don't make me redo the last third of my data analysis class prep.

I need a damn vacation.
posted by joycehealy at 5:22 PM on March 19, 2015


Mixing: Thanks for the insight from a psychologist. You did address what I was thinking when I read your first comment: "Why would a professional psychologist be doing the methods they learned in an intro to quant methods course in undergrad?" Surely by the time you're done grad school you've learned that those methods are only useful in very specific circumstances and there are all sorts of other methods (most parametric and thus with significance tests available) for other kinds of data or questions.

Second question that I don't think you addressed: Aren't there psychologist statisticians? That is, you set this up as psychologists taking advice from outside the discipline, but aren't there psychologists developing their own new methods suited for the kinds of questions psychologists ask, or trained also as statisticians and able to read the statistics journals and adapt new methods, writing their own software and algorithms etc.? This does exist in other social sciences.

Third question: Given the problem you outline, wouldn't it be better to ban the ANOVA unless there is clear justification for use of the method in the paper?
posted by If only I had a penguin... at 5:25 PM on March 19, 2015


I'm not mixing, but I am a PhD student in psychology, albeit in cognitive psychology and not social psychology, so I can address some of your questions.

Aren't there psychologist statisticians?

Yes, there are. They are called quantitative psychologists, and there are usually one or two per department. Because most (graduate) students in psychology are not interested in math, quantitative psychology is the only sub-discipline of psychology where there are more jobs available than graduates.

Third question: Given the problem you outline, wouldn't it be better to ban the ANOVA unless there is clear justification for use of the method in the paper?

Yes, but as mixing has said, people are not typically trained that way. People are trained, from undergraduate, to always use ANOVAs -- and it simply doesn't occur to people that there is such a thing as model comparison.

Even at a graduate level, the first course that you take is usually on ANOVA -- simply because it's what people use the most. You don't start learning about models until you take a course on multiple regression -- and that's not required for most graduate students.

I'm lucky in that, in my subarea of psychology (perceptual/cognitive psychology), and in my department in particular, people are mathematically inclined (you can't get a degree without doing at least one course on linear systems theory). When I saw that they were banning the p-value, I was actually okay with that, because I think confidence intervals are much better for describing the stability of a measurement. But when I saw that they were banning confidence intervals also, I realized that they were being stupid. How are you supposed to indicate the reliability of the data? And why don't they think Bayesian methods make for a good alternative?
posted by tickingclock at 5:43 PM on March 19, 2015 [1 favorite]


Why is psychology doing so much worse than other disciplines in this respect?

Perhaps they're just more willing to deal with their own shortcomings?
posted by Tell Me No Lies at 5:47 PM on March 19, 2015


Also, it's very interesting to me that, depending on the subarea you are in, the quantitative methods that you use will vary greatly. Even though I'm considered to be one of the more mathematical students in my department, I have no idea what Type I, II, and III ANOVA tests are. I'm sure I could understand them if I studied them, but they've just never come up for me, because I almost never use ANOVA in my research. On the other hand, understanding signal detection theory and ideal observer models is critical.
posted by tickingclock at 5:48 PM on March 19, 2015


Also, many papers have robustness checks, which make it harder to get a p-value below the alpha level by chance.

Well okay I'll take your word for it, but I seem to remember that the p-value is precisely a measure of the probability of getting a result by chance, and if it comes out to 0.05 that's still equal to 1/20 no matter how much work you had to do to get there. No?
posted by sfenders at 5:56 PM on March 19, 2015


Aren't there psychologist statisticians?

Yes. Anyone affiliated with either the Psychometric Society or the Society for Mathematical Psychology would qualify as a quantitative psychologist, and most psychology departments like to have at least one of them (us? I guess I qualify) on staff.

I think the issue is that there aren't enough quants to go around, and we have our own research careers to think about. I didn't get tenure by acting as an unpaid stats consultant to other academics, and I would never advise my students to try it either.

What makes me a little sad is that there are actually quite a few graduates from good quantitative psychology programs that would enjoy a career as a statistical consultant to psychological research projects. But the money doesn't seem to be there. Those people I know who went down that career path soon found that most of the people holding onto big enough grants to be able to afford proper stats consulting aren't in psychology, so they've ended up specialising in medical or defence related projects. I feel especially sad for social psychologists because they have some very tough data analysis problems that deserve to be tackled by good quantitative psychologists, but they usually don't have the funds to hire the people they need.

Given the problem you outline, wouldn't it be better to ban the ANOVA unless there is clear justification for use of the method in the paper?

Ideally, yes. I've never yet been an editor-in-chief, so thankfully that one remains above my pay grade, but my personal preference would be to avoid wielding the banhammer as crudely as Trafimow has. But I'm sympathetic to his position. The thorny problem is: who decides when the ANOVA has a clear justification?

It can't be the associate editor handling the paper, because in most cases the AE is recruited to the board based on topic area expertise not statistical training. If there were enough quantitative psychologists around that you could guarantee that every paper would be reviewed by at least one quant, then the answer would be "the quantitative psychologist decides". But having been in the AE position myself I'm acutely aware of the fact that these people are so overloaded that you have to beg and plead to get them to review (they're not jerks, they're just expected to handle all the complicated technical manuscripts and also get asked to comment on "regular" papers too, and they all know that reviewing is a service job that doesn't get you grants or tenure).

So what's an editor-in-chief to do, policy-wise? If they allow a procedure on a "case by case" basis they're reliant on expertise that is very thin on the ground in order to make the individual adjudications. If they ban the procedure outright, they look silly and unsophisticated and there's no guarantee it will even solve the problem. All choices seem to suck. I'm kind of glad I've never had to make that call myself, and I still don't know what I would do if I ever got the top job at a journal.
posted by mixing at 6:36 PM on March 19, 2015 [2 favorites]


Sorry for the repeat commenting, but:

I doubt that it's a special problem for psychology. It's the social sciences; the complexity of the phenomena being studied poses some unique problems.

Actually, this is a really good point that I think is often missed. Psychology is usually classified as a "soft" science, yet it is one that typically poses very hard data analysis problems. Consider one of the simplest experimental measurements that we use: the two-alternative forced choice task. The 2AFC task is exactly what it sounds like: you give people two response options and ask them to choose which one is the "right" answer for some question. For every such decision you obtain two dependent measures: the correctness of the choice (a binary outcome) and the response time, RT (a continuous variable lower bounded at zero). These two variables are not independent of each other, and they trade off against each other in a fashion that is uncontrolled by the experimenter: the participant makes that trade-off themselves based on their own cautiousness. So neither the accuracy nor the RT is a pure measure of the difficulty of the decision. The response time is of course highly skewed, and empirically it usually turns out that the distribution of RTs on correct decisions has a different mean to the error RT distribution.

I've looked through quite a few statistics textbooks over the years, both introductory and advanced, and oddly enough I haven't found many that actually discuss this as a data analysis problem. Yet psychologists do have tools for this, though much of the mathematics was borrowed from statistics and physics. The typical solution is to assume that participant choices arise from an intuitive form of sequential analysis: accruing evidence over time until a decision threshold is reached. Formally, you can treat the evidence accrual process as a stochastic process and compute the first passage time (giving you the RT) to one of two absorbing boundaries (giving you the choice itself). In the typical case we assume that the stochastic process is a Wiener process and the resulting "drift-diffusion model" allows the researcher to make an appropriate transformation from a messy "accuracy + response time" measurement to a more psychologically meaningful "stimulus discriminability + response caution" representation of the data. It's slightly tedious though, because the expressions for the first passage time distribution take the form of an infinite sum: the series is convergent but the terms oscillate, so even though the tools have been around since the 1970s, it's only been in the last 10 years or so that they have become tractable enough to use as general purpose data analysis tools.

Of course that's a simple experimental task. It does get a bit worse.
posted by mixing at 10:56 PM on March 19, 2015 [5 favorites]
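
A toy, discretized version of the drift-diffusion account sketched above (parameter values are invented, and this Euler-style random walk only approximates the continuous Wiener process): a single noisy accumulator generates both the choice and a right-skewed response-time distribution at once.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_ddm_trial(drift=1.0, bound=1.0, noise=1.0, dt=0.001, max_t=5.0):
    """One 2AFC trial: evidence accumulates noisily until it hits one of two
    absorbing bounds. Returns (correct, response_time)."""
    evidence, t = 0.0, 0.0
    while abs(evidence) < bound and t < max_t:
        evidence += drift * dt + noise * np.sqrt(dt) * rng.normal()
        t += dt
    return evidence >= bound, t  # upper bound taken as the correct response

trials = [simulate_ddm_trial() for _ in range(2000)]
correct = np.array([c for c, _ in trials])
rts = np.array([rt for _, rt in trials])

# One process yields both dependent measures: accuracy and a skewed RT
# distribution, with the speed-accuracy trade-off governed by `bound`.
print(f"accuracy: {correct.mean():.2f}")
print(f"mean RT (correct): {rts[correct].mean():.2f}s, "
      f"mean RT (error): {rts[~correct].mean():.2f}s")
```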


I think you guys just need a Gary King ;)
posted by MisantropicPainforest at 6:24 AM on March 20, 2015


I seem to remember that the p-value is precisely a measure of the probability of getting a result by chance

Not quite. Precisely, it's the probability that, if some specifically stated null hypothesis were actually the data-generating process, you would see data that give a result at least as extreme as the one you actually observe, simply from whatever kind of sampling error is implied by the null hypothesis. That is, it's worth pointing out that it's a probability of seeing data and that "by chance" is actually some specific null data generating process (and that you can choose the wrong one).

0.05 is in a big global sense not a very strong test, yes. But this does have to be balanced with the difficulty or impossibility of getting more data. Particle physicists at CERN insist on 5 or 7 sigma results, which push their results to 1 in 3,486,914 or 1 in 7.7e11. Which is fine, but my sense (which could be entirely wrong) is that if they can get data at all, it's relatively trivial for them to get very high N.

In contrast, I do work on state legislatures or state governments. So I have an N of at most 50 for a cross-section, and that imposes restraints on how powerful my observed results can reasonably be. If you want to study postwar presidents, there are only 12 of them (and even if you look within, you still only have 12-whatever DF at the president level). In my work, even if I do TSCS instead of cross-sectional work I'd have to wait decades for the data-generating process to give me enough biennia to meaningfully work with, or for hundreds of years to generate substantial variation in internal legislative rule-sets. Likewise, for psychologists the costs of running a nonsurvey experiment with, say, 100,000 subjects are prohibitive.

That doesn't mean you can't see strong results in social science. It's pretty trivial to generate t-statistics well over 25 with normal-sized survey data, and I can generate t-statistics between 7 and 25 with Ns of 20-100. But those results are sort of banal (party ID influences people's votes; legislators' votes are well-predicted by the positions they publicly take), and you certainly can't count on getting results like that for things that are, well, more interesting. At least not unless social sciences receive CERN-level money to do our thing, which will never ever happen.
posted by ROU_Xenophobe at 8:21 AM on March 20, 2015
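
A brute-force sketch of that definition (made-up data; the null data-generating process here is simply a unit-variance normal with mean zero): simulate the null process many times and count how often it produces a test statistic at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up observed data: 25 measurements; the null hypothesis is mean = 0.
observed = rng.normal(0.4, 1.0, 25)
observed_t = abs(observed.mean() / (observed.std(ddof=1) / np.sqrt(observed.size)))

# The stated null data-generating process: same n, mean exactly 0, same noise.
def null_t_statistic(n=25):
    x = rng.normal(0.0, 1.0, n)
    return abs(x.mean() / (x.std(ddof=1) / np.sqrt(n)))

# p-value: the probability, under the null process, of a statistic at least as
# extreme as the one actually observed. Estimated here by brute-force simulation.
n_sims = 100_000
null_ts = np.array([null_t_statistic() for _ in range(n_sims)])
p_value = (null_ts >= observed_t).mean()
print(f"observed |t| = {observed_t:.2f}, simulated p = {p_value:.4f}")
```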


The basic problem with a p value is simply that it is uninterpretable. A p value is the probability, assuming the null hypothesis is true, of getting a more extreme result. It isn't clear how this relates to any epistemic question that scientists are interested in ("Is the hypothesis unlikely?"). It is assumed to, but it doesn't. There are other approaches that address epistemic questions directly (namely, Bayesian statistics) and there are approaches that attempt to do away with epistemic questions altogether (namely, frequentist statistics) but the p value is a vestigial part of statistics that was developed before the more principled approaches of (modern) Bayesianism and frequentism were developed.

People largely continue to use p values for three reasons. First, because that's what they were taught. Second, because they are easy to compute. Third, because they have no idea what they are. If scientists understood the complete disconnect between p values and what they use them for, they would look for other solutions.

Some of my own work is relevant here: Why Hypothesis Tests Are Essential for Psychological Science [preprint pdf]
posted by Philosopher Dirtbike at 12:08 PM on March 20, 2015 [1 favorite]
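
In symbols, the definition being used here (written for a statistic T where larger values count as more extreme) is:

```latex
% The p-value: the probability, computed under the null hypothesis H_0, of a
% test statistic at least as extreme as the value actually observed.
p = \Pr\bigl( T(X) \ge T(x_{\mathrm{obs}}) \mid H_0 \bigr)
```

It is a probability about data generated under H_0, not about the probability of H_0 itself, which is exactly the disconnect the comment describes.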


Well, in defense of the p-value, a lot of times frequentist and Bayesian methods come to the same conclusion, so while it's not meaningful directly, the p-value does suggest something.
posted by MisantropicPainforest at 1:27 PM on March 20, 2015


Well, in defense of the p-value, a lot of times frequentist and Bayesian methods come to the same conclusion, so while it's not meaningful directly, the p-value does suggest something.

Bayesian and frequentist methods cannot come to the same conclusions. Neyman denied the very idea of a "conclusion" as we would understand it, and hence frequentist methods reject epistemology. Bayesian conclusions are epistemic by nature. A Bayesian posterior probability is a completely different thing than a frequentist behavioristic decision.

Consider Bayesian and frequentist intervals. A Bayesian credible interval allows the interpretation "With this prior and model, the posterior probability of the parameter being in this interval is 95%". The frequentist confidence interval, on the other hand, is procedure that allows no conclusion at all (see Neyman, 1941, on confidence interval theory).
posted by Philosopher Dirtbike at 3:07 PM on March 20, 2015
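
To make the contrast concrete, here is a hedged sketch (made-up data; the flat-prior setup is an assumption of the example, not anything from the comment): for a normal mean under the standard noninformative prior, the 95% credible interval and the 95% confidence interval are numerically identical, yet only the Bayesian reading licenses a statement about where the parameter probably is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Made-up data: 30 measurements of some quantity with unknown mean.
x = rng.normal(10.0, 2.0, 30)
n, mean, sem = x.size, x.mean(), stats.sem(x)

# Frequentist 95% confidence interval (Student t): a procedure that covers the
# true mean in 95% of repeated samples; it asserts nothing about this sample.
ci = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

# Bayesian 95% credible interval with a flat prior on the mean and the usual
# noninformative prior on the variance: the marginal posterior for the mean is
# the same Student t, so the numbers coincide -- but "95%" is now a statement
# about where the parameter probably is, given these data and this prior.
credible = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

print("confidence interval:", np.round(ci, 3))
print("credible interval:  ", np.round(credible, 3))
```

Same arithmetic, different licence on the conclusion, which is the gap Neyman himself insisted on.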


Bayesian and frequentist methods cannot come to the same conclusions.

I don't disagree with the core point you're making: epistemic probability and aleatory probability are very different things, and to the extent that people interpret a Neyman confidence interval as if it had the same meaning as the corresponding credible interval, they are making an error. I'd not read the Neyman (1941) paper, but it's more or less what I would have expected. The passage that seems most telling to me comes when, discussing the connection with Jeffreys, he remarks that
it seems essential to be clear that any probability calculated from (1), with any function p(θ1, ..., θs) not implied by the actual problem, need not and, generally, will not have any relation to relative frequencies. It will not be the probability in the classical sense of the word and, therefore, persons who would like to deal only with classical probabilities, having their counterparts in the really observable frequencies, are forced to look for a solution of the problem of estimation other than by means of the theorem of Bayes
In this I wouldn't disagree with Neyman at all. The confidence interval does not describe a range of plausible beliefs about an unknown parameter, it is a random variable computed from data that has good frequentist properties and can be given an aleatory interpretation.

The difference is that I simply don't care about that. I am not interested in knowing that 95% of the intervals I construct will include the unknown parameter (except perhaps as a basic sanity check). This is not a statement that tells me anything meaningful about my experiment. Statements about aleatory probability, for all their supposed "objectivity", seem to bear no resemblance to any scientific question I care about. My goal is always to answer the epistemic question: what should I believe about the world? To me, Neyman seems to be shooting himself in the foot here, by openly acknowledging that the confidence interval does not correspond to any sensible epistemic claim.

The closest I can come to a defence of confidence intervals is that they convey information about the data. If you are a Bayesian, then the confidence interval does not describe your beliefs, but you can reasonably treat it as a source of evidence. It is not a sufficient statistic in the classical sense (not that I care), but it is informative. The way I think of it, if someone discloses a Neyman confidence interval but not the raw data, I can use it to revise my beliefs about the relevant parameters. The credible interval would of course be far more useful, but confidence intervals are not without evidentiary value.

Admittedly, this is a very weak defence of confidence intervals, insofar as it reduces their role to nothing more than a descriptive statistic. But that's the strongest defence of confidence intervals I'm willing to mount, because I think they're rubbish.

... Actually, reading that Neyman paper made me angry all over again. The thing that really annoys me about the ideology of classical inference is that they want to make statistics a kind of "idealised physics", in which individual experiments are exactly as uninteresting as individual marbles in a bag. The statisticians who invent this stuff don't care about the interpretation of any specific experiment; their sole concern is that they can "properly" count up the number of correct and incorrect decisions that their procedures make, across a series of experiments. As the scientist who might have conducted one of the experiments in question, why should I care about that? I don't give a stuff about those other experiments. I care about my experiment. The level of disregard for my actual research question is staggering. My experiment is now merely one marble in the jar whose properties the statistician is trying to quantify. Bugger that. The Bayesian approach, on the other hand, is a form of "idealised psychology": it tells me what to believe about my data. Whatever its other flaws might be, the Bayesian approach is actually trying to engage with my specific data set, and at least attempts to answer my research question.
posted by mixing at 9:13 PM on March 20, 2015 [1 favorite]


The credible interval would of course be far more useful, but confidence intervals are not without evidentiary value.

Only if you actually know which confidence interval was constructed; then you can work backward to some meaningful statistics (in the case of a Student CI, you can work backward to Xbar and s^2). But if which CI was used is left unsaid, then the fact that "this interval is a 95% CI" is of no evidentiary value at all. It's a really strange procedure.

... Actually, reading that Neyman paper made me angry all over again.

It actually increased my respect for Neyman. He had a set of principles, and was clear about them. He didn't try to weasel out of the difficulties in his philosophy. He was a very clear thinker. But if you appreciated Neyman (1941), you should read some of the additional CI material in Neyman's 1952 book (here) especially page 211-215, which has a bonus hypothetical dialog reminding frequentists who may encounter resistance about the "glory" of being burned at the stake for the truth. I think this is a rare bit of Neyman's sense of humor coming through :) At least, I hope.
posted by Philosopher Dirtbike at 11:52 PM on March 20, 2015


Only if you actually know which confidence interval was constructed; then you can work backward to some meaningful statistics (in the case of Student CI, you can work backward to Xbar and s^2). But if which CI is left unsaid, then the fact that "this interval is a 95% CI" is of no evidentiary value at all. It's a really strange procedure.

True, but since I usually know what software they used, I can usually figure out something sensible. That said, I am reminded of a friend who ran a study attempting to elicit 80% confidence intervals from a bunch of engineers: they were asked to give interval estimates such that 8 of 10 would contain the true value. About 1/3 of them arrived at the same procedure: provide blatantly incorrect point estimates for 2 questions, and give ridiculously wide intervals for the other 8.

Neyman's 1952 book

Sigh. Neyman, as usual, confounds me. I do like him, but he seemed to have an uncanny ability to cleanly define and then carefully solve the wrong problem:
We contemplate a situation in which the practical statistician is interested in the value of the parameter θ1 that appears in the probability density function P(x1, x2, ... , xn | θ1, θ2, ..., θs ) of n observable random variables X1, X2, ..., Xn. The analytical form of this probability density function is known to the statistician, but the values of the parameters θ1, θ2, ..., θs are unknown, except that they are contained in some specified intervals, say Ai < θi, < Bi (i = 1, 2, ... , s), finite or infinite. The practical statistician is faced with the necessity of taking an action which should be adjusted to the value of the parameter θi
As usual I wonder how it is that this practical statistician has come by such curious knowledge. The model for the data is known so precisely that he or she knows exactly the form of P(x | θ), and the statistician is allowed to make use of this knowledge when arriving at decisions. Yet it remains inadmissible to declare that anything is known about θ, so much so that P(θ) is not even allowable as a probability density. Where does the frequentist statistician come by this knowledge about P(x | θ), I ask myself? If beliefs and professional experience are not an acceptable basis for specifying a probability distribution -- as they must not be if the Bayesian approach is so beyond the pale -- then what is the basis for the assertion that the practical statistician has such excellent knowledge of the likelihood but no knowledge that would be admissible as a prior? Have they perhaps run an infinite number of replications of my experiment such that, for every value of θ, they have discovered the limiting frequencies that P(x | θ) must entail in order to count as a probability density? Or have they just selected GLM from the SPSS menu on the blind faith that the data generating mechanism is always somewhere within the GLM family?
All human actions are subject to error and the actions of the practical statistician cannot be an exception to the general rule. Thus the practical statistician must be aware that, whatever function [for generating confidence intervals] he selects, his assertions about the value of θ will be erroneous from time to time. The best he can hope to arrange is that the errors of estimation do not occur too frequently.
And as usual, I wonder why the scientist on the receiving end of these assertions should care. This description is perhaps a fair representation of the problem facing a statistical consultant, whose job is to provide advice on an endless sequence of experiments, each with its own idiosyncrasies that the statistician does not understand nor has time to learn about. From the statistician's perspective, one might as well view these data sets as a sequence of random events: whether the data come from a psychology lab or a pharmaceutical trial is irrelevant. In order to not get fired, the statistician needs a decision procedure that controls the error rate across the whole sequence of consultations.

But the actual scientist who comes to the statistician is not interested in those other experiments. From their perspective, it does not matter how many correct decisions the statistician makes on other occasions. Only their experiment matters. The scientist brings a lot of knowledge about what possible models P(x|θ) and parameters θ make sense, and would like this knowledge to be incorporated into any advice or decisions that the statistician provides. Yet it is precisely this context specific knowledge that the frequentist seems so keen to discard.

Neyman's approach seems unconcerned with any of this. Indeed -- to give him his credit -- he explicitly asserts that he is unconcerned with it (in that awesome little dialogue on p214). But with this assertion he seems to have conceded that his theory is irrelevant to the problems of science. The confidence interval is a procedure designed to provide comfort to the statistician and ambiguity to the scientist.
posted by mixing at 3:44 AM on March 21, 2015


Statements about aleatory probability, for all their supposed "objectivity", seem to bear no resemblance to any scientific question I care about. My goal is always to answer the epistemic question: what should I believe about the world?

OK, that's your goal. It's not Neyman's goal. As I understand it (though I'm always leery about assigning definite philosophical stances to early statisticians, they lived a long time and changed a lot) Neyman was not trying to generate knowledge, he was trying to generate decision rules. It's totally fine not to care about decision rules, but obviously some people care about them, and they have good reason to. So I don't think Neyman is trying to answer the wrong question; I think he's trying to answer his question, which isn't your question.
posted by escabeche at 2:25 PM on March 21, 2015


I tend to think about Neyman's approach by analogy with the US court system, whose goal is to provide a fair trial, not to figure out as accurately as possible whether the defendant committed the crime or not.
posted by escabeche at 2:26 PM on March 21, 2015


So I don't think Neyman is trying to answer the wrong question; I think he's trying to answer his question, which isn't your question.

That's totally fair as far as it goes, but you make it sound as if this were a pure intellectual debate where we can all just go our own way and agree to disagree. For a professional statistician or mathematician, that might be what this feels like. That is not what is happening here.

When I publish a paper using Bayesian analyses I have to fight tooth and nail with reviewers and editors to get the statistics through. Even today, now that Bayesian methods are starting to get accepted, I get people demanding that I publish p-values in parallel with Bayes factors. In the applied domain the frequentists are not playing fair, and I've given up trying to do so either. I have been fighting this for 15 years now, and I am tired and angry.

It is not merely in the research domain either. The stranglehold that orthodox methods hold over our teaching programs is insane. In Australia, for instance, there is a legislative requirement that all psychology programs must be professionally accredited or else our graduates cannot get jobs as practitioners. The accreditation criteria include methodological requirements, and while no-one ever says "thou shalt not disagree with Neyman" I can absolutely guarantee that if I were to try to teach my undergraduate class from a Bayesian perspective I would be immediately vetoed, not because people think I am wrong, but because the faculty would fear losing accreditation. This is why I get annoyed at statisticians sometimes: their ideological disputes have practical consequences for us. I am, in effect, contractually obligated to teach Neyman's opinions as truth.

So, I do take your point, and I know I'm being a little unfair on Neyman. But please understand that my anger doesn't come from nowhere. His ideas are being badly misused, and they're messing up my research and my teaching from beyond the grave.
posted by mixing at 3:04 PM on March 21, 2015 [1 favorite]


For a professional statistician or mathematician, that might be what this feels like.

Yes. Sorry. I am a mathematician and in my life this is a deeply interesting philosophical difference, but that's not at all the way it manifests for working scientists. I am annoyed as you are when people say "0 is within the confidence interval which means we should have a high degree of belief there's no effect." I think Neyman would be annoyed about this too, but that's cold comfort in your situation.
posted by escabeche at 7:05 PM on March 21, 2015


Understood. I normally enjoy talking idly about the philosophy of inference, because it's really interesting stuff. I suspect you're entirely right: Neyman would be horrified at how his work gets misused in applied contexts. And I am sorry for getting angry in this thread, especially to the extent that I've directed it at you and folks like you (I'm something of a fan, in fact). I've been allowing my professional frustrations to spill over into this thread, and I shouldn't.
posted by mixing at 7:40 PM on March 21, 2015


FWIW mixing I've enjoyed your contributions here! this has been a Good Thread (tm)

In economics, it appears to me that Bayesian methods (often misused or misunderstood - I am an amateur Bayesian at best, and since we all got trained in frequentist methods first, it's hard to shake) are becoming more dominant in structural macroeconomics and time series (Chris Sims, one of the winners of the Econ Nobel in 2011, is a big proponent and very influential). My applied microeconomics friends don't seem to be as interested. I think it's beginning to be that while you can do frequentist sorts of applied work in macro and get published, time series theory appears to be totally dominated by Bayesians and it's hard to get a job if you're not doing Bayesian stuff.

But econometrics is its own subfield and has developed its own sort of...intellectual character, I guess, based on the particular kinds of problems economists face and questions they're interested in (I've gathered from this discussion psychometrics is similar but there are many fewer psychometricians out there). And in particular, econometricians seem to have a different view of how and when you can think about causality relative to statisticians generally.
posted by dismas at 7:32 AM on March 22, 2015 [1 favorite]


So I don't think Neyman is trying to answer the wrong question; I think he's trying to answer his question, which isn't your question.

To the extent to which he was offering tools to scientists, he was answering the wrong question. Fisher was correct about this; Neyman missed the point of science, and incalculable damage has been done by those who, not understanding statistical theory, assumed that Neyman's goals were the same as theirs.
posted by Philosopher Dirtbike at 2:47 PM on March 22, 2015



