Fake spiders weave tangled webs
February 8, 2020 8:24 AM

Jonathan Pruitt was a rising star in behavioral ecology, lured from UCSB to McMaster University by the Canada 150 program, with fascinating research about personality and social behavior in spiders. Then one of his co-authors got an email with a question about a dataset of his that she had used in a publication, forcing her to ask serious questions of her own about all of the data he had provided her and leading to retraction of their publications together.

Now, journal editors (including Dan Bolnick who blogs at Eco-Evo Evo-Eco and Jeremy Fox who blogs at Dynamic Ecology), former students and post-docs, and just curious onlookers are going through all of his papers, clearing some, and repeatedly finding evidence suggesting widespread data fabrication in others.

Also covered in Nature and Science

posted by hydropsyche (37 comments total) 25 users marked this as a favorite

These kinds of stories always make me feel terrible for the co-authors. The damage this sort of thing does to everyone who has ever touched his research makes this reprehensible just on an interpersonal level, let alone scientific.
posted by BungaDunga at 8:53 AM on February 8, 2020 [16 favorites]

Editorial note -- McMaster, singular.

Source: I am an alum and also they occasionally give me a paycheque.
posted by ricochet biscuit at 8:54 AM on February 8, 2020 [5 favorites]

Apologies. I was tricked by the headline of the article (he is McMaster's Canada 150 Chair) and by having gone to a college that does have an apostrophe (all the saints own their colleges).
posted by hydropsyche at 9:04 AM on February 8, 2020 [5 favorites]

Oof. Yeah it would be rough for the co-authors, that’s why they (should) become the most ardent truth-seekers when this sort of stuff comes to light.

Fox sometimes publishes in my area and I suspect we’ve reviewed each other’s work on occasion. I will say he sets a high standard for thoroughness and rigor, and he’s exactly the sort I’d implicitly trust to give a yay or nay on work that he was involved in.
posted by SaltySalticid at 9:20 AM on February 8, 2020 [3 favorites]

I have an acquaintance (I know him from outside of academia) in another field who burned his co-authors this way. When he did that, I pretty much cut ties. It's such an incredible breach of trust.
posted by dismas at 9:29 AM on February 8, 2020 [8 favorites]

Updated the post to reflect the singularness of McMaster.
posted by LobsterMitten (staff) at 9:43 AM on February 8, 2020 [6 favorites]

This case has caused a lot of discussion in spider (and behaviour) labs around the world. It's changed my opinion about the necessity of having raw data available publicly along with the paper. In my lab, we are now incorporating raw data storage as part of the whole paper writing process. What was really interesting to me was that normal exploratory data analysis didn't throw up any red flags, which means that there is no technical solution to data fraud, but it will probably be easier to spot these kinds of manipulations through more social means. If the data look too good to be true, it warrants extra interrogation.
posted by dhruva at 9:51 AM on February 8, 2020 [7 favorites]

In more than ten years of applied machine learning research I’ve developed an acute spidey-sense for too-good-to-be true results. It has saved me from considerable embarrassment. In one memorable case I objected to including a new result in a camera-ready that I felt couldn’t be possible. The paper was submitted without the result and I was able to find the bug in my coauthor’s code (which was an honest mistake). When really good results turn up it’s important to compartmentalize the emotional response you might have and greet the numbers with a strong and healthy dose of skepticism.
posted by simra at 10:11 AM on February 8, 2020 [7 favorites]

It's sickening to think that Jonathan Pruitt's academic fraud helped him win a Canada 150 chair that could have been awarded to a real scientist.
posted by heatherlogan at 10:44 AM on February 8, 2020 [16 favorites]

From the Nature coverage [emphasis mine]:

[S]even papers have been retracted or are in the process of being retracted; five further retractions have been requested by Pruitt’s co-authors; and researchers have flagged at least five more studies as containing possible data anomalies. [...]

More than 20 scientists — co-authors, peers and other interested observers in the field — mobilized to pore through the data in almost 150 papers on which Pruitt is a co-author, looking for evidence of manipulated or fabricated numbers. They found similar signs of copy-and-paste duplications. In at least one instance, researchers identified formulae inserted into a published excel file, designed to add or subtract from a pasted value and create new data points.

Several have stated publicly and privately that they believe this to be clear evidence of fraud. Dingemanse says that his mind was made up by the “avalanche of retractions” in progress, as well as the mounting piles of irregular data. “It is hard to believe these data are not fabricated,” he says.

posted by heatherlogan at 10:52 AM on February 8, 2020 [9 favorites]

That emphasis on a formula in excel files is weird. I often use a formula in Excel to use as the test variable (for example the midpoint between two measured coordinates), and all subsequent analysis is done with the midpoint. Surely that belongs in the raw data? Also, leaving the formula in is a great way to troubleshoot odd-looking data later. In the case above, people have found formulae that add a fixed amount to a certain column such that it reflects the hypothesis, which is a more clear-cut indication of data manipulation.
posted by dhruva at 10:59 AM on February 8, 2020

dhruva, have you read the blog post by Laskowski? The implication is that the formula here is being used to manipulate pasted values to make it harder to spot that they have been copy-pasted to create new data points.
posted by heatherlogan at 11:05 AM on February 8, 2020 [1 favorite]
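
For readers wondering how a "paste, then add or subtract a fixed amount" pattern could be searched for, here is a minimal sketch in Python. It is not the method used by Laskowski or the journal editors; the function name, the run length, and the sample numbers are all made up for illustration. The idea is only that adding a constant to a pasted run leaves the differences between neighbouring values unchanged, so matching difference patterns are worth a second look.

```python
from collections import defaultdict


def shifted_repeats(values, run_length=4):
    """Return sorted (i, j) start-index pairs, i < j, where the run of
    `run_length` values starting at j equals the run starting at i plus
    some constant offset (an offset of 0 is a plain copy).

    Hypothetical helper for illustration only.
    """
    seen = defaultdict(list)
    for start in range(len(values) - run_length + 1):
        run = values[start:start + run_length]
        # Differences between neighbours do not change when a constant
        # is added to every value in the run.
        diffs = tuple(round(b - a, 6) for a, b in zip(run, run[1:]))
        seen[diffs].append(start)

    pairs = []
    for starts in seen.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                # Ignore runs that overlap each other.
                if starts[b] - starts[a] >= run_length:
                    pairs.append((starts[a], starts[b]))
    return sorted(pairs)


# Invented numbers: positions 6-10 repeat positions 0-4 with 3 added.
values = [12, 47, 3, 88, 25, 61, 15, 50, 6, 91, 28, 70]
print(shifted_repeats(values))  # [(0, 6), (1, 7)]
```

A hit from something like this is only a prompt to look closer. Real measurements that are coarse, constant, or steadily trending can produce matching runs by chance, and, as noted elsewhere in this thread, fabricated data with added noise would pass straight through.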

Ah ok, I understand.
posted by dhruva at 11:15 AM on February 8, 2020

Pruitt looks to have been incredibly sloppy (besides being a fraudster), which is why the bad data was identified. It really makes me wonder how much is out there, produced by "researchers" who do more sophisticated things than cut&paste.
posted by Joe in Australia at 12:27 PM on February 8, 2020 [6 favorites]

One of my co-authors on the SSB study from a few months ago did a post-doc with Pruitt. She's currently in the process of going through all her work with him and double-checking all of the data.

I also know Dan Bolnick well; he was the grad student chair in my department up until a year or two ago when he moved to Connecticut. Dan is good people and I would trust him with my career without batting an eye; it doesn't surprise me in the least to see him doing so much good, careful, and thoughtful work with this.

What a damn mess. I met Pruitt a few times; he gave a talk at my department once, and I think my strongest impression was how can a single human being talk that fast?! I wouldn't have suspected anything, but then I had never given him much thought. I do remember a comment at the time from a friend working in the same system that she had never been able to replicate his findings, but--it's so easy to assume that's bad luck, right?

What a fucking mess.
posted by sciatrix at 1:27 PM on February 8, 2020 [29 favorites]

Hello, scientists, I hope you see this. Some of the dates in the far right column follow the US date format (MM/DD/YY) and some follow the more widely accepted format of DD/MM/YY (at least row 109 does, but there might be others). Thank you for doing the work to make sure science is less of a liar, sometimes!
posted by FirstMateKate at 1:58 PM on February 8, 2020

So many aspects in the production of scientific knowledge are based on an honor system that folds under the pressures to pay bills, publish, & profit. The need for scientists to sell themselves erodes the claim to certain knowledge.
posted by dmh at 4:24 PM on February 8, 2020 [7 favorites]

Also, this kind of stuff just reeks of inexcusably superficial -- which is understandable in human terms, but reducing the human factor is precisely why we have science in the first place -- engagement with the data:

Apparently I had never clicked on Sheet 2 before because I had never noticed these numbers before (when I got the data, I pretty quickly saved it as a .csv file which is easier to manipulate in R, but only saves the first sheet).
posted by dmh at 4:31 PM on February 8, 2020

Apparently I had never clicked on Sheet 2 before because I had never noticed these numbers before (when I got the data, I pretty quickly saved it as a .csv file which is easier to manipulate in R, but only saves the first sheet).

I would do exactly the same thing if a collaborator sent me an .xlsx file, because I don't trust .xlsx not to manipulate my data. I also never, ever see data presented in unlabeled tabs even for people who are using Excel to record datasheets. Most of the time Sheet 2 is totally empty, but you still get the prompt about deleting data in the other sheets when you save it as a .csv, and so you learn to tune the prompt out.

There is nothing particularly sloppy about the way that Laskowski handled that data.
posted by sciatrix at 4:36 PM on February 8, 2020 [26 favorites]

Science works because we all go through our lives assuming everyone is working hard and doing their best and presenting us real data that they collected like they say they did. There is nothing sloppy or weird about not doubting one's collaborators. There's nothing sloppy or weird about not clicking through multiple tabs and instead just assuming that the final data on the first tab is what it claims to be and importing it into R and analyzing it. That's normal.
posted by hydropsyche at 5:07 PM on February 8, 2020 [10 favorites]

Also, this kind of stuff just reeks of inexcusably superficial

For someone on their first collaboration with someone a decade ahead of them, career-wise, who has their own lab? Did you read and think about the blog posts, or were you just being inexcusably superficial? The assumption is that there's one dataset, so why go past Sheet1?
posted by ambrosen at 5:07 PM on February 8, 2020 [9 favorites]

The data passed the statistical tests that I would have thought of if someone had asked me to look at them. In Bayesian terms, it's fair to say that these tests would increase my confidence that the data was authentic.

In contrast, Pruitt's failure to clean up his work was especially unexpected. I expect most fraudsters would be more careful than that, so the fact that a spreadsheet is "clean" only provides a tiny bit of support for the hypothesis that the figures were faked. It's like, does a murder suspect subscribe to Serial Killers' Monthly? That would certainly change our hypothesis, but it's hardly ever going to be a useful inquiry.
posted by Joe in Australia at 6:08 PM on February 8, 2020 [1 favorite]

I’ve also learned that I’m apparently odd or OCD insofar as I delete empty Sheet2 (and any other empty SheetN+1 that may exist) as a force of habit. Doesn’t it bother y’all to do otherwise? I hate when they are created automagically in the first place....
posted by RolandOfEld at 6:16 PM on February 8, 2020 [5 favorites]

Once again, Excel is part of the problem. Oh, how I loathe thee, Excel.
posted by papineau at 6:31 PM on February 8, 2020 [5 favorites]

Joe in Australia: In contrast, Pruitt's failure to clean up his work was especially unexpected.

He was involved in a LOT of papers. Could be just overconfidence that the manipulated data would be unnoticed.

In Laskowski's blog post, she mentions that Pruitt responded to a question about the extra Excel sheet with a comment that "Why Sheet 2 exists is an interesting question". I thought that comment was strangely reminiscent of, say, criminal types who view the damning evidence with a strange attention.
posted by dhruva at 6:43 PM on February 8, 2020 [6 favorites]

Some of the dates in the far right column follow the US date...some follow the...

For the record a bunch of us* use Julian day and day of year and closely related derivatives precisely to avoid the problems you’re seeing.

*By ‘us’ I mostly mean colleagues, not me. I rarely use real dates. But when I do, I use DOY; months are just an annoying, counterproductive and artificial convention to a lot of scientists.
posted by SaltySalticid at 7:02 PM on February 8, 2020 [3 favorites]
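
To make the date-format point above concrete, here is a small Python sketch. The helper names and the sample strings are invented for illustration and have nothing to do with the actual spreadsheet; it simply shows why a string like 03/04/19 cannot be resolved without knowing the convention, and how a year plus day of year (DOY) sidesteps the question.

```python
from datetime import date, datetime

# The two conventions mentioned in the thread.
FORMATS = (("MM/DD/YY", "%m/%d/%y"), ("DD/MM/YY", "%d/%m/%y"))


def interpretations(text: str) -> dict:
    """Return every date `text` could mean, keyed by convention.

    Hypothetical helper for illustration only.
    """
    found = {}
    for label, fmt in FORMATS:
        try:
            found[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # not a valid date under this convention
    return found


def is_ambiguous(text: str) -> bool:
    """True when the two conventions give different valid dates."""
    return len(set(interpretations(text).values())) > 1


def day_of_year(d: date) -> int:
    """Day of year, counting 1 January as day 1."""
    return d.timetuple().tm_yday


print(is_ambiguous("03/04/19"))  # True: 4 March or 3 April 2019
print(is_ambiguous("25/03/19"))  # False: there is no month 25
print(day_of_year(date(2019, 3, 4)))  # 63
print(day_of_year(date(2019, 4, 3)))  # 93
```

A row like the second one is the useful kind: it can only be DD/MM/YY, which is how mixed conventions in a single column tend to get noticed.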

Geez, if Jonathan had thrown any noise into his faked results this would have been a lot harder/impossible to detect. That's the really scary thing, there must be a lot of this sort of thing out there that wouldn't be so "easily" detected.

One thing that will no doubt come out of this is that people will start running the number faking and data duplicate detection scripts against their fake data to make sure they pass. The smart cheaters will become harder to catch in their duplicity.

That said, the number I find most shocking is 150 publications by a 29-year-old scientist.
posted by srboisvert at 4:31 AM on February 9, 2020 [2 favorites]

Once again, Excel is part of the problem. Oh, how I loathe thee, Excel.

I remember when I learned that Excel has two date systems, the 1900 date system and the 1904 date system.

The same serial number -- say, 43870 -- can represent 9 February 2020 (in the 1900 date system) or 10 February 2024 (in the 1904 date system).

It's not as much of an issue as it used to be, I think, because it was a Windows vs. Mac issue and newer versions of Excel on Mac also use the Windows Excel date system (the 1900 system), but when I learned this, I immediately became less surprised by that article that talked about the huge number of Excel errors in research.

(I only figured it out, fortunately, in a non-work context when I was copying some dates from an Excel file that I used to keep track of the books that I read, and started to wonder how I was logging finish dates in the future.)
posted by andrewesque at 5:05 AM on February 9, 2020 [2 favorites]
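
The arithmetic in the comment above can be checked with a few lines of Python. This is a sketch, not how Excel is implemented: the function name is made up, and the epoch constants rest on the documented behaviour that the 1900 system counts 1 January 1900 as day 1 (and also counts a nonexistent 29 February 1900), while the 1904 system counts 1 January 1904 as day 0.

```python
from datetime import date, timedelta

# Because the 1900 system counts a 29 February 1900 that never existed,
# serials from 61 (1 March 1900) onward line up with a day zero of
# 30 December 1899.
EPOCH_1900 = date(1899, 12, 30)
# The 1904 system starts counting at 0 on 1 January 1904.
EPOCH_1904 = date(1904, 1, 1)


def serial_to_date(serial: int, system: int = 1900) -> date:
    """Convert a whole-number Excel date serial to a calendar date.

    Hypothetical helper for illustration only.
    """
    if system == 1900:
        if serial < 61:
            # Earlier serials are off by one because of the phantom
            # leap day; this sketch does not handle them.
            raise ValueError("serials below 61 are not handled here")
        return EPOCH_1900 + timedelta(days=serial)
    if system == 1904:
        return EPOCH_1904 + timedelta(days=serial)
    raise ValueError("system must be 1900 or 1904")


print(serial_to_date(43870, 1900))  # 2020-02-09
print(serial_to_date(43870, 1904))  # 2024-02-10
```

The two results differ by 1,462 days, which is four years and one day: exactly the kind of shift that produces finish dates in the future.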

One of my co-authors on the SSB study from a few months ago did a post-doc with Pruitt. She's currently in the process of going through all her work with him and double-checking all of the data.

I'm so sorry for your friend, sciatrix. The huge tragedy in this is all the people who Pruitt has hurt by damaging their reputations, too, and by directly hurting them by making their CVs look thinner as papers are removed.

One of the papers likely to be retracted incorporated data collected by undergraduates. It's a huge deal to get a publication out of your undergrad work--to then find out it will have to be retracted because your first research mentor was a fraud is just devastating.
posted by hydropsyche at 6:53 AM on February 9, 2020 [10 favorites]

I'm so sorry for your friend, sciatrix. The huge tragedy in this is all the people who Pruitt has hurt by damaging their reputations, too, and by directly hurting them by making their CVs look thinner as papers are removed.

I'm only a little sorry for the people caught up in this now.

Data fakery has been an openly known big problem in psychology for a long time now. I can remember talking with some of my wife's colleagues who were affronted when I said they had a personal responsibility for the data in their publications whether they collected it or not. That was at least 8 years ago when this particular fraudulent researcher would have been 20 or 21 years old and probably still an undergrad maybe doing or about to do their honors thesis.

Diederik Stapel was caught and ultimately fired in 2011, which pretty much birthed the open science movement in psychology and has led to the revelation of at least a couple of star researchers a year getting caught with fraud, irreproducible results, or other hanky panky.

At this point, in the field of psychology, if you don't verify the data you put your name on, including its collection, you're simply not a competent researcher and you've earned some of the tarnish that splatters from frauds onto naive collaborators.

There are of course go along to get along pressures....but part of being an adult is resisting those pressures.
posted by srboisvert at 9:39 AM on February 9, 2020

I will note that behavioral ecology is probably the field of animal behavior most distantly related to psychology imaginable and that Pruitt often worked with people and teams that had almost entirely ecological backgrounds. I have a Bachelor's in Psychology, but this is somewhat unusual among my colleagues, especially those who work more closely with evolution and ultimate mechanisms and less closely with neuroscience and proximate mechanisms. That disparity is markedly more common among behavioral ecologists working on nonmammals, especially invertebrates which aren't D. melanogaster, because most of Psychology tackles mechanisms of behavior which are specific to mammals and humans to varying degrees. Behavioral ecology is often taught through ecology without even referencing psychology.

I will also note that you literally just chose to say that you had minimal sympathy for undergrads taken in by Pruitt because they should have known better, and that you minimize the pressure on junior scientists and researchers quite blithely. I imagine it has been quite a while since you were a graduate student or undergraduate who was questioning their own expertise and relatively isolated from outside perspectives to serve as reality checks from your PI. I wonder at the confidence you assert that you could never be deceived, because you do your due diligence with every collaboration you participate in. Do you ever trust your collaborators? Pruitt doesn't seem to have faked all his data, after all, and he exploited the trust that he built up with others as he built it. When do you trust your collaborators not to lie to you? When your students can't replicate a colleague's results, do you trust the colleague or the student? How much effort on verification do you expend so that you can never be taken in by a liar?
posted by sciatrix at 10:25 AM on February 9, 2020 [16 favorites]

Because this is the internet and because I found this interesting at about 1am, I looked him up further. Interesting interview (about "Before they were scientists") where he says (at the end) that he'd like to be a villain....I guess it worked.
posted by bquarters at 11:13 AM on February 9, 2020 [1 favorite]

How much effort on verification do you expend so that you can never be taken in by a liar?

My wife is the researcher (I dropped out after leaving a lab that engaged in data fuckery I was uncomfortable with and finished my masters in a different lab and left). She won't work with someone if they don't share all their materials, subject logs and raw data files. She won't sign off unless she can reproduce the analysis from scratch.

How much verification do you think is reasonable?

How much responsibility do you think people have for things they put their names on?

I am frankly gobsmacked that you make these arguments.
posted by srboisvert at 11:24 AM on February 9, 2020

I'm saying that being taken in by someone like this and making a mistake is reasonable, and that the more junior someone is, the more reasonable the mistake is. I'm saying that personally, I have watched PIs routinely blame students in that scenario with data that didn't agree, even when collaborators turned out to be wrong. I'm saying that cross-verification and training on the value of cross-verification takes effort and time and trust, and someone has to be willing to pay for that work. More to the point, I am saying that asserting undergrads and grad students should magically know this stuff without being taught or advised otherwise is unreasonable and rather cruel.

I'm sure your wife does all due diligence on this with her students and her collaborations. More PIs should. Frankly, I do -- I'm working right now on a raw dataset I didn't collect myself, and I know it right down to the metadata. But I see an awful lot of people setting up junior scientists to fail because experienced scientists didn't check, because this field right here (not Psychology, evolution and ecology) hasn't had a scandal of this magnitude, because not enough data has been publicly available. And I see a lot of people who aren't doing this job now blithely say junior scientists "should" do this or that to make science better, without bothering to consider the costs of doing anything or whether support is there to actually make it feasible for any individual person. I'm saying that this was a systemic failure, not an individual failure, and I'm saying it because it happened to 200 separate people in this field who collaborated with Pruitt.

The result of this scandal in this field should be to create systemic ways to prevent this from happening again, not to shame the largely junior people who were misled and whose careers have not been tarnished. Fortunately, this is how the conversation in EEB is largely handling it: discussions about the role of journals in demanding that raw datasets be published alongside each manuscript, so future malfeasance can be caught by anyone looking carefully at the data going forward. Increased transparency about data collection and open data storage. Higher standards from academic journals about data checking.

For Christ's sake, I'm not saying that no one has any responsibility to catch liars like Pruitt. Quite the contrary: I'm saying (and Bolnick and Fox, who are both very senior editors of important journals within the field, are saying) that we all have a responsibility to change the system in ways that help increase the likelihood of catching this shit before it can be published in the first place. Dedicate resources to working on this problem to people who have more resources and experience to draw on, and don't mistake the stochastic effects of predator encounter rate for intrinsic effects of individual gullibility.
posted by sciatrix at 11:55 AM on February 9, 2020 [14 favorites]

We need a big push for all raw data to be available with the publication of a paper. While the print version can't do that, it's easy to set up a repository online that includes whatever was used to collect the data: scans of field notes, checklists, questionnaires (with names redacted, most likely), pictures of nests, whatever. They don't need to be well-organized and neatly sorted, but anyone who questions the data should be able to go into the source material and figure out where most of it came from, even if a few pieces aren't available.

Nobody's going to take photos of every single reaction in a test tube before they write down "34.7 ml after combining the two liquids." But they could do it once or twice, and they could hand over whatever they used to collect the numbers, whether that's hand-written notes on paper or a digital spreadsheet with messily labeled columns.

For this particular spiders test, in addition to the data spreadsheets, there could be data about how the spiders were identified (stickers on their backs? paint splotches? something else?), pics of the environments used, invoices for whatever tools were used to move them around between colonies, notes about the spiders that died mid-experiment and whose data had to be removed, etc.

It's damned annoying that a grad student might need to say to their collaborator, "okay, prove to me that you actually had five hundred spiders," but once that kind of request is standard, the verification part should be easy--if there actually were five hundred spiders.
posted by ErisLordFreedom at 2:14 PM on February 9, 2020 [1 favorite]

Amusing follow-up: I was chatting with a colleague who received his PhD pretty recently in animal behavior stuff. Although his previous work was with vertebrates, we are a public commuter college with no research funds so he was telling me how he is planning a new research project with a student on spider behavior.

I said, "Wait, do you know Jonathan Pruitt?" and he said, "We've been reading some of his papers. I have some questions about some of his results." And I said, "He's a total fraud. Lots of those papers are being retracted."

It turns out he had read Pruitt's recent Nature paper and was pretty skeptical of the stats and conclusions, so he thought it would be interesting to play around with similar questions with a student, having no idea at all about what has happened with Pruitt over the past 2 weeks. His response when I told him the whole story was basically, "I knew it."
posted by hydropsyche at 3:10 PM on February 12, 2020 [3 favorites]

This thread has been archived and is closed to new comments

MetaFilter