Reproducibility in Science

© fouroaks | stockfresh.com

In 2011 a University of Virginia psychologist named Brian Nosek began the Reproducibility Project. He simply wanted to see if the reported problem with reproducing the scientific findings of published research studies in psychology was true. Nosek and his team recruited 250 research psychologists, many of whom volunteered their time to double-check what they considered to be the important works in their field. They identified 100 studies published in 2008 and rigorously repeated the experiments while in close consultation with the original authors. There was no evidence of fraud or falsification, but “the evidence for most published findings was not nearly as strong as originally claimed.”

Their results were published in the journal Science: “Estimating the Reproducibility of Psychological Science.” In a New York Times article about the study, Brian Nosek said: “We see this as a call to action, both to the research community to do more replication, and to funders and journals to address the dysfunctional incentives.” The authors of the journal article said they conducted the project because they care deeply about the health of psychology and believe it has the potential to accumulate knowledge about human behavior that can advance the quality of human life. Reproducibility is central to that aim. “Accumulating evidence is the scientific community’s method of self-correction and is the best available option for achieving that ultimate goal: truth.”

The present results suggest that there is room to improve reproducibility in psychology. Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should. Hypotheses abound that the present culture in science may be negatively affecting the reproducibility of findings. An ideological response would discount the arguments, discredit the sources, and proceed merrily along. The scientific process is not ideological. Science does not always provide comfort for what we wish to be; it confronts us with what is.

The editor in chief of Science said: “I caution that this study should not be regarded as the last word on reproducibility but rather a beginning.” Reproducibility and replication of scientific studies have been a growing concern. John Ioannidis of Stanford has been particularly vocal on this issue. His best-known paper on the subject, “Why Most Published Research Findings Are False,” was published in 2005. One of his latest works is “Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature.” In it, Szucs and Ioannidis concluded that false report probability was likely to exceed 50% for the whole literature. “In light of our findings the recently reported low replication success in psychology is realistic and worse performance may be expected for cognitive neuroscience.”

A recent survey conducted by the journal Nature found that more than 70% of researchers have tried and failed to reproduce another scientist’s experiments. More than half failed to reproduce their own experiments.  In response to the question, “Is there a reproducibility crisis?” 52% said there was a significant crisis; another 38% said there was a slight crisis. More than 60% of respondents thought that two factors always or often contributed to problems with reproducibility—pressure to publish and selective reporting. More than half also pointed to poor oversight, low statistical power and insufficient replication in the lab. See the Nature article for additional factors.

There were several suggestions for improving reproducibility in science. The three rated most likely to help were: a better understanding of statistics, better mentoring/supervision, and a more robust experimental design. Almost 90% thought these three factors would improve reproducibility. But even the lowest-ranked item had a 69% endorsement. See the Nature article for additional approaches for improving reproducibility.

In “What does research reproducibility mean?” John Ioannidis and his coauthors pointed out that one of the problems with examining and enhancing the reliability of research is that its basic terms (reproducibility, replicability, reliability, robustness and generalizability) aren’t standardized. Rather than suggesting new technical meanings for these nearly identical terms, they suggested using the term reproducibility with qualifying descriptions for the underlying construct. The three terms they suggested were: methods reproducibility, results reproducibility, and inferential reproducibility.

Methods reproducibility is meant to capture the original meaning of reproducibility, that is, the ability to implement, as exactly as possible, the experimental and computational procedures, with the same data and tools, to obtain the same results. Results reproducibility refers to what was previously described as “replication,” that is, the production of corroborating results in a new study, having followed the same experimental methods. Inferential reproducibility, not often recognized as a separate concept, is the making of knowledge claims of similar strength from a study replication or reanalysis. This is not identical to results reproducibility, because not all investigators will draw the same conclusions from the same results, or they might make different analytical choices that lead to different inferences from the same data.

They said what was clear is that none of these types of reproducibility can be assessed without a complete reporting of all relevant aspects of scientific design.

Such transparency will allow scientists to evaluate the weight of evidence provided by any given study more quickly and reliably and design a higher proportion of future studies to address actual knowledge gaps or to effectively strengthen cumulative evidence, rather than explore blind alleys suggested by research inadequately conducted or reported.

In “Estimating the Reproducibility of Psychological Science,” Nosek and his coauthors said it is too easy to conclude that successful replication means the original theoretical understanding is correct. “Direct replication mainly provides evidence for the reliability of a result.” Alternative explanations of the original finding may also account for the replication. Understanding comes from multiple, diverse investigations giving converging support for a certain explanation, while ruling out others.

It is also too easy to conclude that a failure to replicate means the original evidence was a false positive. “Replications can fail if the replication methodology differs from the original in ways that interfere with observing the data.” Unanticipated factors in the sample, setting, or procedure could alter the observed effect. So we return to the need for multiple, diverse investigations.

Nosek et al. concluded that their results suggested there was room for improvement with reproducibility in psychology. Yet the Reproducibility Project demonstrates “science behaving as it should.” It doesn’t always confirm what we wish it to be; “it confronts us with what is.”

For more on reproducibility in science, also look at: “The Reproducibility Problem” and “’Political’ Science?” on this website.


Clinical Trial Sleight-of-Hand

© ljupco | 123rf.com

In 2005 a researcher named John Ioannidis published a seminal paper on publication bias in medical research, “Why Most Published Research Findings Are False.” When Julia Belluz interviewed Ioannidis for Vox ten years later, she reported that as much as 30% of the most influential medical research papers turn out to be wrong or exaggerated. She said an estimated $200 billion, the equivalent of 85% of the global spending on research, is wasted on poorly designed and redundant studies. Ioannidis indicated that preclinical research on drug targets had received a lot of attention since then. “There are papers showing that, if you look at a large number of these studies, only about 10 to 25 percent of them could be reproduced by other investigators.”

Ioannidis noted that even with randomized controlled trials, there is empirical evidence indicating only a modest percentage can be replicated. Among those trials that are published, only about half of the initial outcomes of the study are actually reported. In the published trials, 50% or more of the results are inappropriately interpreted, or given a spin that favors the sponsor of the research. “If you multiply these levels of loss or distortion, even for randomized trials, it’s only a modest fraction of the evidence that is going to be credible.”
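The “multiply these levels” remark can be made concrete with a little arithmetic. The sketch below is only illustrative: the two one-half figures come from the paragraph above, while `replicable_share` is a hypothetical placeholder, since Ioannidis says only “a modest percentage.” It also assumes the three losses are independent of one another, which may not hold in practice.

```python
# Rough arithmetic behind "multiply these levels of loss or distortion".
# All names here are illustrative and not taken from Ioannidis.

outcomes_reported = 0.50    # "about half" of initial outcomes are reported
fairly_interpreted = 0.50   # an upper bound, since "50% or more" are spun
replicable_share = 0.50     # ASSUMPTION: placeholder for "a modest percentage"


def credible_fraction(*shares: float) -> float:
    """Multiply the successive shares of evidence that survive each filter."""
    result = 1.0
    for share in shares:
        result *= share
    return result


credible = credible_fraction(replicable_share, outcomes_reported, fairly_interpreted)
print(f"Credible share of the evidence: {credible:.1%}")
```

Under these assumed inputs, only about one-eighth of the evidence survives all three filters. A lower replication rate would shrink that share further.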

One of the changes that Ioannidis’s 2005 paper seemed to produce was the introduction of mandatory clinical trial registration guidelines by the International Committee of Medical Journal Editors (ICMJE). Member journals were supposed to require prospective registration of trials before patient enrollment as a condition of publication. The purpose is that registering a clinical trial ahead of time publicly describes the methodology that should be followed during the trial. If the published report of the trial afterwards differs from its clinical trial registration, you have evidence that the researchers massaged or spun their research data when it didn’t meet the originally proposed outcome measures. In other words, when they didn’t “win,” they didn’t play by the rules they had announced ahead of time.

Julia Rucklidge and two others looked at whether five psychiatric journals (American Journal of Psychiatry, Archives of General Psychiatry/JAMA Psychiatry, Biological Psychiatry, Journal of the American Academy of Child and Adolescent Psychiatry, and the Journal of Clinical Psychiatry) were actually following the guidelines they said they would follow. They found that fewer than 15% of psychiatry trials were prospectively registered with no changes in their primary outcome measures (POMs). Most trials were either not prospectively registered, had their POMs or timeframes changed sometime after registration, or had their participant numbers changed.

In an article for Mad in America, Rucklidge said they submitted their research for review and publication in various journals, including two of the five they investigated. Six medical or psychiatric journals refused to publish Rucklidge et al.’s findings. PLoS One, a peer-reviewed open access journal, did accept and publish them. She said that while the researchers in their study could have changed their outcome measures or failed to preregister their trials for benign reasons, “History suggests that when left unchecked, researchers have been known to change their data.”

For example, an initial clinical trial for an antidepressant could be projected to last for 24 weeks. The 24-week time frame would be one of the initial primary outcome measures: will the antidepressant be more effective than a placebo after 24 weeks? After gathering all the data, the researchers find that the antidepressant was not more effective than placebo at 24 weeks. But let’s say it was more effective than placebo at 18 weeks. What gets reported is the result after 18 weeks; the original 24-week time frame may disappear altogether when the research results are published.

People glorify their positive results and minimize or neglect reporting on negative results. . . . At worst, our findings mean that the trials published over the last decade cannot be fully trusted. And given that health decisions and funding are based on these published findings, we should be very concerned.
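A small simulation shows why switching the time point after the fact matters, as in the 18-week versus 24-week example above. This is a sketch under stated assumptions, not a model of any real trial: the drug is given no true effect, each time point is treated as an independent look at the data (real time points are correlated, so the inflation would be smaller), and the function name and parameters are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_trials(n_trials=20000, n_timepoints=4, seed=1):
    """Simulate placebo-controlled trials in which the drug has NO true effect.

    Returns (prespecified_rate, any_rate): the share of trials that look
    significant at the registered final time point, versus at whichever
    time point happens to look best.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, about 1.96
    hits_prespecified = 0
    hits_any = 0
    for _ in range(n_trials):
        # Under the null, the standardized drug-minus-placebo difference at
        # each time point is roughly standard normal.
        # ASSUMPTION: time points are independent of one another.
        z_scores = [rng.gauss(0, 1) for _ in range(n_timepoints)]
        if abs(z_scores[-1]) > z_crit:              # registered endpoint only
            hits_prespecified += 1
        if any(abs(z) > z_crit for z in z_scores):  # report whatever "worked"
            hits_any += 1
    return hits_prespecified / n_trials, hits_any / n_trials


prespecified_rate, any_rate = simulate_trials()
print(f"False positives, registered time point: {prespecified_rate:.1%}")
print(f"False positives, best of any time point: {any_rate:.1%}")
```

With four independent looks, the chance of at least one spurious “win” is about 1 - 0.95^4, or roughly 19%, compared with the nominal 5% when the registered time point is honored.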

Looking ahead, Rucklidge had several suggestions for improving the situation with clinical trials.

1) Member journals of the ICMJE should have a dedicated person checking trial registries, trials should simply not be published if they haven’t been prospectively registered as determined by the ICMJE or the journals should state clearly and transparently reasons why studies might be published without adhering to ICMJE guidelines.
2) If authors do change POMs or participant numbers or retrospectively register their trials, the reasons should be clearly outlined in the methods section of the publication.
3) To further improve transparency, authors could upload the full clinical trial protocol, including all amendments, to the registry website and provide the raw data from a clinical trial in a format accessible to the research community.
4) Greater effort needs to be made to ensure authors are aware of the importance of prospectively registering trials, by improving guidelines for submission (3) and when applying for ethical approval.
5) Finally, reviewers should not make decisions about the acceptability of a study for publication based on whether the findings are positive or negative as this may be implicitly encouraging authors to be selective in reporting results.

Rucklidge also mentioned another study by Mathieu, Chan and Ravaud that examined whether peer reviewers actually looked at clinical trial registrations. The Mathieu et al. survey found that only one-third of the peer reviewers looked at registered trial information and then reported any discrepancies to journal editors. “When discrepancies were identified, most respondents (88.8%) mentioned them in their review comments, and 19.8% advised editors not to accept the manuscript.” The respondents who did not look at the trial registry information said the main reason they failed to do so was the difficulty or inconvenience of accessing the registry record.

One improvement suggested by Mathieu, Chan and Ravaud was for journals to provide peer reviewers with the clinical trial registration number and a direct Web link to the registry record, or to provide the registered information with the manuscript to be reviewed.

The actions of researchers who fail to accurately and completely register their clinical trials, alter POMs, change participant numbers, or make other adjustments to their research methodology and analysis without clearly noting the changes are akin to the sleight-of-hand practiced by illusionists. And sometimes the effect is radical enough to make an ineffective drug seem to be a new miracle cure.


Open Access Could ‘KO’ Publication Bias

© Antonio Abrignani | 123rf.com

A crisis of sorts has been brewing in academic research circles. Daniele Fanelli found that the odds of reporting a positive result were 5 times higher among published papers in Psychology and Psychiatry than in Space Science. “Space Science had the lowest percentage of positive results (70.2%) and Psychology and Psychiatry the highest (91.5%).”
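For readers who want to see how two percentages become an “odds” comparison, here is a back-of-envelope check. It computes a crude odds ratio from the two quoted percentages only; Fanelli’s figure of 5 presumably comes from the paper’s own statistical model, which is an assumption on my part, so the two numbers need not match exactly.

```python
def odds(p: float) -> float:
    """Convert a proportion into odds (p to 1 - p)."""
    return p / (1 - p)


psych = 0.915  # share of positive results, Psychology and Psychiatry
space = 0.702  # share of positive results, Space Science

odds_ratio = odds(psych) / odds(space)
print(f"odds (psych) = {odds(psych):.2f}, odds (space) = {odds(space):.2f}")
print(f"crude odds ratio = {odds_ratio:.2f}")
```

The crude ratio comes out a little above 4.5, in the same neighborhood as the reported “5 times higher.” Note how a gap of about 21 percentage points turns into a several-fold difference once it is expressed as odds.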

Three Austrian researchers, Kühberger et al., randomly sampled 1,000 published articles from all areas of psychological research. They calculated p values, effect sizes and sample sizes for all the empirical papers and investigated the distribution of p values. They found a negative correlation between effect size and sample size. There was also “an inordinately high number of p values just passing the boundary of significance.” This pattern could not be explained by implicit or explicit power analysis. “The negative correlation between effect size and samples size, and the biased distribution of p values indicate pervasive publication bias in the entire field of psychology.” According to Kühberger et al., publication bias is present when studies with significant results have a better chance of being published.
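Why would publication bias produce a negative correlation between effect size and sample size? A toy simulation makes the logic visible. This is a sketch, not Kühberger et al.’s method: the true effect of 0.2, the range of sample sizes, the normal approximation for the standard error, and the “publish only positive, significant results” rule are all assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_studies(n_studies=5000, true_d=0.2, seed=7):
    """Return (all_studies, published) as lists of (n_per_group, observed_d)."""
    rng = random.Random(seed)
    all_studies, published = [], []
    for _ in range(n_studies):
        n = rng.randint(10, 200)       # participants per group (assumed range)
        se = (2 / n) ** 0.5            # approximate standard error of d
        d_obs = rng.gauss(true_d, se)  # observed effect = truth + sampling noise
        all_studies.append((n, d_obs))
        if d_obs / se > 1.96:          # ASSUMPTION: only positive, significant
            published.append((n, d_obs))  # results make it into print
    return all_studies, published


all_studies, published = simulate_studies()
r_all = pearson(*zip(*all_studies))
r_published = pearson(*zip(*published))
print(f"All studies:       r(n, d) = {r_all:+.2f}")
print(f"Published studies: r(n, d) = {r_published:+.2f}")
```

Across all simulated studies, sample size and observed effect are essentially unrelated. Among the “published” ones, small studies can only clear the significance bar by overestimating the effect, so a clear negative correlation appears even though the true effect never changes.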

Publication bias can occur at any stage of the publication process where a decision is made: in the researcher’s decision to write up a manuscript; in the decision to submit the manuscript to a journal; in the decision of journal editors to send a paper out for review; in the reviewer’s recommendations of acceptance or rejection, and in the final decision whether to accept the paper. Anticipation of publication bias may make researchers conduct studies and analyze results in ways that increase the probability of getting a significant result, and to minimize the danger of non-significant results.

Charles Seife, in his December 2011 talk for Authors@Google, said it succinctly: “Most published research findings are wrong.” Seife was speaking on the topic of his book, Proofiness: The Dark Arts of Mathematical Deception. But Seife is not alone in this opinion, as several others have made the same claim. In 2005, John Ioannidis published “Why Most Published Research Findings Are False” in PLOS Medicine. He said that for many scientific fields, claimed research findings may often be just measures of the prevailing bias of the field. “For most study designs and settings, it is more likely for a research claim to be false than true.” Uri Simonsohn has been called “The Data Detective” for his efforts in identifying and exposing cases of wrongdoing in psychology research.

These and other “misrepresentations” received a lot of attention at the NIH, with a series of meetings to discuss the nature of the problem and the best solutions to address it. Both the NIH and the NIMH published principles and guidelines for reporting research. The NIH guidelines were developed with input from journal editors from over 30 basic/preclinical journals in which NIH-funded investigators have most often published their results. Thomas Insel, the director of NIMH, noted how the guidelines aimed “to improve the rigor of experimental design, with the intention of improving the odds that results” could be replicated.

Insel thought it was easy to misunderstand the so-called “reproducibility problem” (see “The Reproducibility Problem”). Acknowledging that science is not immune to fraudulent behavior, he said the vast majority of the time the reproducibility problem could be explained by other factors, which don’t involve intentional misrepresentation or fraud. He indicated that the new NIH guidelines were intended to address the problems with flawed experimental design. Insel guessed that misuse of statistics (think intentional fraud, as in “The Data Detective”) was only a small part of the problem. Nevertheless, flawed analysis (like p-hacking) needed more attention.

An important step towards fixing it is transparent and complete reporting of methods and data analysis, including any data collection or analysis that diverged from what was planned. One could also argue that this is a call to improve the teaching of experimental design and statistics for the next generation of researchers.

From reading several articles and critiques on this issue, my impression is that Insel may be minimizing the problem (see “How to Lie About Research”). Let’s return to Ioannidis, who said that several methodologists have pointed out that the high rates of nonreplication of research discoveries were a consequence of “claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically a p-value less than 0.05.” As Charles Seife commented: “Probabilities are only meaningful when placed in the proper context.”

Steven Goodman noted that p-values were widely used as a measure of statistical evidence in medical research papers. Yet they are extraordinarily difficult to interpret. “As a result, the P value’s inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s.” Goodman then reviewed twelve common misconceptions about p-values. He also pointed out the possible consequences of these misunderstandings or misrepresentations of p-values. See Goodman’s article for a discussion of the problems.

After his examination of the key factors contributing to the inaccuracy of published research findings, Ioannidis suggested there were several corollaries that followed. Among them were: 1) the smaller the studies conducted, the less likely the research findings will be true; 2) the smaller the effect sizes, the less likely the research findings will be true; 3) the greater the financial and other interests and prejudices in a scientific field, the less likely the research findings will be true; and 4) the hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true. The first two could fit within Insel’s issue of flawed experimental design, but not the third and fourth corollaries.

Moonesinghe et al. noted that while they agreed with Ioannidis that most research findings are false, they were able to show that “replication of research findings enhances the positive predictive value of research findings being true.” However, their analysis did not consider the possibility of bias in the research. They noted that Ioannidis had shown that even a modest bias could decrease the positive predictive value of research dramatically. “Therefore if replication is to work in genuinely increasing the PPV of research claims, it should be coupled with full transparency and non-selective reporting of research results.”
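The positive predictive value (PPV) argument running through the last two paragraphs can be sketched in a few lines. The `ppv` function follows the formula in Ioannidis’s 2005 paper as I understand it, where R is the pre-study odds that a tested relationship is true and bias is the share of would-be negative results that end up reported as positive. The `ppv_replicated` function is a simplified form in the spirit of Moonesinghe et al., assuming n independent studies that all reach significance and no bias. The function names and the example inputs (R = 0.25, power of 0.8 versus 0.2, bias of 0.3) are my own illustrative assumptions.

```python
def ppv(R, power, alpha=0.05, bias=0.0):
    """Chance that a claimed (significant) finding is actually true.

    R     : pre-study odds that the tested relationship is real
    power : 1 - beta, the chance of detecting a real effect
    bias  : share of would-be negative results reported as positive
    """
    beta = 1 - power
    true_positives = power * R + bias * beta * R
    all_positives = R + alpha - beta * R + bias - bias * alpha + bias * beta * R
    return true_positives / all_positives


def ppv_replicated(R, power, n, alpha=0.05):
    """PPV when n independent studies all reach significance (no bias).

    ASSUMPTION: a simplified replication model, not the full analysis
    of Moonesinghe et al.
    """
    return R * power**n / (R * power**n + alpha**n)


R = 0.25  # ASSUMPTION: one in five tested hypotheses is true (odds of 1 to 4)

print(f"Well-powered study (power 0.8):  PPV = {ppv(R, 0.8):.2f}")
print(f"Small study (power 0.2):         PPV = {ppv(R, 0.2):.2f}")
print(f"Well-powered, bias of 0.3:       PPV = {ppv(R, 0.8, bias=0.3):.2f}")
print(f"Small study replicated once:     PPV = {ppv_replicated(R, 0.2, 2):.2f}")
```

Under these assumed inputs, dropping power from 0.8 to 0.2 cuts the PPV from 0.80 to 0.50, which is the first corollary in miniature. A second significant replication of the small study brings it back to 0.80, while a bias of 0.3 drags even the well-powered study below 0.40. That is why replication has to be coupled with non-selective reporting.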

While not everyone is supportive of the idea, open access within peer-reviewed scholarly research would go a long way to correcting many of these problems. Starting in January of 2017, the Bill & Melinda Gates Foundation will require all of its research to be published in an open access manner. Susannah Locke on Vox cited a chart from a 2012 UNESCO report showing that scholarly publications in clinical medicine and biomedicine have typically been less available for open access than those in other scientific fields. Access to psychology research began a downward trend in 2003 and was no better than the average for all fields by 2006.

There is a growing movement to widen Open Access (OA) to peer-reviewed scholarly research. The Budapest Open Access statement said that by ‘open access’ they meant “its free availability on the internet, permitting any user to read, download, copy, distribute, print, search, or link to the full texts of these articles.” Francis Collins, in an embedded video on the Wikipedia page for “Open Access,” noted the NIH’s support for open access. Effective on May 8, 2013, President Obama signed an Executive Order to make government-held data more accessible to the public.

Many of the concerns with academic research discussed here could be quickly and effectively dealt with through open access. The discussion of the “First Blood Test Able to Diagnose Depression in Adults,” looked at in “The Reproducibility Problem,” is an example of the benefit and power it brings to the scientific process. There is a better future for academic research through open access. It may even knock publication bias out of the academic journals.