Reproducibility in Science

© fouroaks | stockfresh.com

In 2011 a University of Virginia psychologist named Brian Nosek began the Reproducibility Project. He simply wanted to see if the reported problem with reproducing the scientific findings of published research studies in psychology was true. Nosek and his team recruited 250 research psychologists, many of whom volunteered their time to double-check what they considered to be the important works in their field. They identified 100 studies published in 2008 and rigorously repeated the experiments while in close consultation with the original authors. There was no evidence of fraud or falsification, but “the evidence for most published findings was not nearly as strong as originally claimed.”

Their results were published in the journal Science: “Estimating the Reproducibility of Psychological Science.” In a New York Times article about the study, Brian Nosek said: “We see this as a call to action, both to the research community to do more replication, and to funders and journals to address the dysfunctional incentives.” The authors of the journal article said they conducted the project because they care deeply about the health of psychology and believe it has the potential to accumulate knowledge about human behavior that can advance the quality of human life. The reproducibility of studies that further that goal is central to it. “Accumulating evidence is the scientific community’s method of self-correction and is the best available option for achieving that ultimate goal: truth.”

The present results suggest that there is room to improve reproducibility in psychology. Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should. Hypotheses abound that the present culture in science may be negatively affecting the reproducibility of findings. An ideological response would discount the arguments, discredit the sources, and proceed merrily along. The scientific process is not ideological. Science does not always provide comfort for what we wish to be; it confronts us with what is.

The editor in chief of Science said: “I caution that this study should not be regarded as the last word on reproducibility but rather a beginning.” Reproducibility and replication of scientific studies have been a growing concern. John Ioannidis of Stanford has been particularly vocal on this issue. His best-known paper on the subject, “Why Most Published Research Findings Are False,” was published in 2005. One of his latest works, coauthored with Szucs, is “Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature.” Szucs and Ioannidis concluded that false report probability was likely to exceed 50% for the whole literature. “In light of our findings the recently reported low replication success in psychology is realistic and worse performance may be expected for cognitive neuroscience.”

A recent survey conducted by the journal Nature found that more than 70% of researchers have tried and failed to reproduce another scientist’s experiments. More than half failed to reproduce their own experiments.  In response to the question, “Is there a reproducibility crisis?” 52% said there was a significant crisis; another 38% said there was a slight crisis. More than 60% of respondents thought that two factors always or often contributed to problems with reproducibility—pressure to publish and selective reporting. More than half also pointed to poor oversight, low statistical power and insufficient replication in the lab. See the Nature article for additional factors.

There were several suggestions for improving reproducibility in science. The three most likely were: a better understanding of statistics, better mentoring/supervision, and a more robust experimental design. Almost 90% thought these three factors would improve reproducibility. But even the lowest-ranked item had a 69% endorsement. See the Nature article for additional approaches for improving reproducibility.

In “What does research reproducibility mean?” John Ioannidis and his coauthors pointed out that one of the problems with examining and enhancing the reliability of research is that its basic terms (reproducibility, replicability, reliability, robustness and generalizability) aren’t standardized. Rather than suggesting new technical meanings for these nearly identical terms, they suggested using the term reproducibility with qualifying descriptions for the underlying construct. The three terms they suggested were: methods reproducibility, results reproducibility, and inferential reproducibility.

Methods reproducibility is meant to capture the original meaning of reproducibility, that is, the ability to implement, as exactly as possible, the experimental and computational procedures, with the same data and tools, to obtain the same results. Results reproducibility refers to what was previously described as “replication,” that is, the production of corroborating results in a new study, having followed the same experimental methods. Inferential reproducibility, not often recognized as a separate concept, is the making of knowledge claims of similar strength from a study replication or reanalysis. This is not identical to results reproducibility, because not all investigators will draw the same conclusions from the same results, or they might make different analytical choices that lead to different inferences from the same data.

They said what was clear is that none of these types of reproducibility can be assessed without a complete reporting of all relevant aspects of scientific design.

Such transparency will allow scientists to evaluate the weight of evidence provided by any given study more quickly and reliably and design a higher proportion of future studies to address actual knowledge gaps or to effectively strengthen cumulative evidence, rather than explore blind alleys suggested by research inadequately conducted or reported.

In “Estimating the Reproducibility of Psychological Science,” Nosek and his coauthors said it is too easy to conclude that successful replication means the original theoretical understanding is correct. “Direct replication mainly provides evidence for the reliability of a result.” Alternative explanations of the original finding may also account for the replication. Understanding comes from multiple, diverse investigations giving converging support for a certain explanation, while ruling out others.

It is also too easy to conclude a failure to replicate means the original evidence was a false positive. “Replications can fail if the replication methodology differs from the original in ways that interfere with observing the data.” Unanticipated factors in the sample, setting, or procedure could alter the observed effect. So we return to the need for multiple, diverse investigations.

Nosek et al. concluded that their results suggested there was room for improvement with reproducibility in psychology. Yet the Reproducibility Project demonstrates “science behaving as it should.” It doesn’t always confirm what we wish it to be; “it confronts us with what is.”

For more on reproducibility in science, also look at: “The Reproducibility Problem” and “’Political’ Science?” on this website.


Open Access Could ‘KO’ Publication Bias

© : Antonio Abrignani 123rf.com


A crisis of sorts has been brewing in academic research circles. Daniele Fanelli found that the odds of reporting a positive result were 5 times higher among published papers in Psychology and Psychiatry than in Space Science. “Space Science had the lowest percentage of positive results (70.2%) and Psychology and Psychiatry the highest (91.5%).”
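The “5 times higher” figure is a statement about odds, not about percentages, and the two are easy to confuse. As a rough check, here is a minimal sketch that computes the odds ratio from the two percentages quoted above. It assumes those two raw percentages are the only inputs; Fanelli's own estimate may come from a fitted model, so the result need not match the published figure exactly. The function names are mine, chosen for illustration.

```python
def odds(p):
    """Convert a proportion (0 < p < 1) into odds: p / (1 - p)."""
    return p / (1.0 - p)


def odds_ratio(p_a, p_b):
    """Odds of a positive result in field A relative to field B."""
    return odds(p_a) / odds(p_b)


# Proportions of papers reporting positive results, as quoted in the text.
psychology_psychiatry = 0.915
space_science = 0.702

ratio = odds_ratio(psychology_psychiatry, space_science)
print(f"odds (Psychology and Psychiatry): {odds(psychology_psychiatry):.2f}")
print(f"odds (Space Science):             {odds(space_science):.2f}")
print(f"odds ratio:                       {ratio:.2f}")
```

On these raw numbers the ratio comes out between four and five, in the neighborhood of the quoted figure, even though the percentages themselves differ by only about 21 points.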

Three Austrian researchers, Kühberger et al., randomly sampled 1,000 published articles from all areas of psychological research. They calculated p values, effect sizes and sample sizes for all the empirical papers and investigated the distribution of p values. They found a negative correlation between effect size and sample size. There was also “an inordinately high number of p values just passing the boundary of significance.” This pattern could not be explained by implicit or explicit power analysis. “The negative correlation between effect size and samples size, and the biased distribution of p values indicate pervasive publication bias in the entire field of psychology.” According to Kühberger et al., publication bias was present if there was a better chance of publication if there were significant results in the analysis.
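Why would publication bias produce a negative correlation between effect size and sample size? The following is a toy simulation of my own, not Kühberger et al.'s analysis. It assumes a single true standardized effect (0.2), two equal groups per study with known unit variance, a z-test, and a crude rule that only results with p < 0.05 get published. All of these values are illustrative assumptions.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_studies=5000, true_d=0.2, seed=1):
    """Return (all studies, 'published' studies) as lists of (n, observed d)."""
    rng = random.Random(seed)
    all_studies, published = [], []
    for _ in range(n_studies):
        n = rng.randint(10, 200)       # per-group sample size (assumed range)
        se = math.sqrt(2.0 / n)        # standard error of d with unit variance
        d = rng.gauss(true_d, se)      # observed effect = truth + sampling noise
        all_studies.append((n, d))
        if two_sided_p(d / se) < 0.05: # hypothetical filter: significant only
            published.append((n, d))
    return all_studies, published


all_studies, published = simulate()
r_all = pearson_r([n for n, _ in all_studies], [d for _, d in all_studies])
r_pub = pearson_r([n for n, _ in published], [d for _, d in published])
print(f"all studies:       r = {r_all:+.2f} (k = {len(all_studies)})")
print(f"published studies: r = {r_pub:+.2f} (k = {len(published)})")
```

In the full set of simulated studies, sample size and observed effect are essentially unrelated, because every study estimates the same true effect. Once only significant results survive, small studies can get through only by overestimating the effect, so the published subset shows a clearly negative correlation.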

Publication bias can occur at any stage of the publication process where a decision is made: in the researcher’s decision to write up a manuscript; in the decision to submit the manuscript to a journal; in the decision of journal editors to send a paper out for review; in the reviewer’s recommendations of acceptance or rejection, and in the final decision whether to accept the paper. Anticipation of publication bias may make researchers conduct studies and analyze results in ways that increase the probability of getting a significant result, and to minimize the danger of non-significant results.

Charles Seife, in his December 2011 talk for Authors@Google, said it succinctly: “Most published research findings are wrong.” Seife was speaking on the topic of his book, Proofiness: The Dark Arts of Mathematical Deception. But Seife is not alone in this opinion, as several others have made the same claim. In 2005, John Ioannidis published “Why Most Published Research Findings Are False” in PLOS Medicine. He said that for many scientific fields, claimed research findings may often be just measures of the prevailing bias of the field. “For most study designs and settings, it is more likely for a research claim to be false than true.” Uri Simonsohn has been called “The Data Detective” for his efforts in identifying and exposing cases of wrongdoing in psychology research.

These and other “misrepresentations” received a lot of attention at the NIH, with a series of meetings to discuss the nature of the problem and the best solutions to address it. Both the NIH (NIH guidelines) and the NIMH (NIMH guidelines) published principles and guidelines for reporting research. The NIH guidelines were developed with input from journal editors from over 30 basic/preclinical journals in which NIH-funded investigators have most often published their results. Thomas Insel, the director of NIMH, noted how the guidelines aimed “to improve the rigor of experimental design, with the intention of improving the odds that results” could be replicated.

Insel thought it was easy to misunderstand the so-called “reproducibility problem” (see “The Reproducibility Problem”). Acknowledging that science is not immune to fraudulent behavior, he said the vast majority of the time the reproducibility problem could be explained by other factors, which don’t involve intentional misrepresentation or fraud. He indicated that the new NIH guidelines were intended to address the problems with flawed experimental design. Insel guessed that misuse of statistics (think intentional fraud, as in “The Data Detective”) was only a small part of the problem. Nevertheless, flawed analysis (like p-hacking) needed more attention.

An important step towards fixing it is transparent and complete reporting of methods and data analysis, including any data collection or analysis that diverged from what was planned. One could also argue that this is a call to improve the teaching of experimental design and statistics for the next generation of researchers.

From reading several articles and critiques on this issue, my impression is that Insel may be minimizing the problem (see “How to Lie About Research”). Let’s return to Ioannidis, who said that several methodologists have pointed out that the high rates of nonreplication of research discoveries were a consequence of “claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically a p-value less than 0.05.” As Charles Seife commented: “Probabilities are only meaningful when placed in the proper context.”

Steven Goodman noted that p-values were widely used as a measure of statistical evidence in medical research papers. Yet they are extraordinarily difficult to interpret. “As a result, the P value’s inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s.” Goodman then reviewed twelve common misconceptions about p-values. He also pointed out the possible consequences of these misunderstandings or misrepresentations of p-values. See Goodman’s article for a discussion of the problems.

After his examination of the key factors contributing to the inaccuracy of published research findings, Ioannidis suggested there were several corollaries that followed. Among them were: 1) the smaller the studies conducted, the less likely the research findings are to be true; 2) the smaller the effect sizes, the less likely the research findings are to be true; 3) the greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true; and 4) the hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true. The first two could fit within Insel’s issue of flawed experimental design, but not the third and fourth corollaries.

Moonesinghe et al. noted that while they agreed with Ioannidis that most research is false, they were able to show that “replication of research findings enhances the positive predictive value of research findings being true.” However, their analysis did not consider the possibility of bias in the research. They commented how Ioannidis showed that even a modest bias could decrease the positive predictive value of research dramatically. “Therefore if replication is to work in genuinely increasing the PPV of research claims, it should be coupled with full transparency and non-selective reporting of research results.”
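The positive predictive value (PPV) argument can be made concrete with a little arithmetic. The sketch below uses the PPV expression from Ioannidis's 2005 paper, where R is the pre-study odds that a tested relationship is true, alpha is the significance level, beta is one minus power, and u is the proportion of analyses affected by bias. The replication function is a plain Bayes' rule calculation for n independent, unbiased studies that all reach significance, which is how I read the Moonesinghe et al. result. The input values are illustrative assumptions of mine, not figures from either paper.

```python
def ppv(R, alpha=0.05, power=0.8, bias=0.0):
    """Probability that a 'significant' finding is true.

    R     : pre-study odds that the tested relationship is true
    alpha : type I error rate
    power : 1 - beta, the chance of detecting a true relationship
    bias  : u, share of would-be non-findings reported as findings anyway
    """
    beta = 1.0 - power
    true_positives = (1.0 - beta) * R + bias * beta * R
    false_positives = alpha + bias * (1.0 - alpha)
    return true_positives / (true_positives + false_positives)


def ppv_after_replication(R, n, alpha=0.05, power=0.8):
    """PPV when n independent, unbiased studies are all significant."""
    posterior_odds = R * (power / alpha) ** n
    return posterior_odds / (1.0 + posterior_odds)


# Illustrative scenario: 1 true hypothesis per 10 false ones, low power.
R, power = 0.1, 0.2
single = ppv(R, power=power)
biased = ppv(R, power=power, bias=0.2)
replicated = ppv_after_replication(R, n=3, power=power)

print(f"single study, no bias:      PPV = {single:.2f}")
print(f"single study, bias u = 0.2: PPV = {biased:.2f}")
print(f"three significant studies:  PPV = {replicated:.2f}")
```

Under these assumed inputs a single significant finding is more likely false than true, a modest amount of bias makes matters worse, and agreement across several independent studies raises the PPV substantially. That last step holds only if the studies are unbiased, which is why Moonesinghe et al. tie replication to transparency and non-selective reporting.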

While not everyone is supportive of the idea, open access within peer-reviewed scholarly research would go a long way toward correcting many of these problems. Starting in January of 2017, the Bill & Melinda Gates Foundation will require all of its research to be published in an open access manner. Susannah Locke on Vox cited a chart from a 2012 UNESCO report that showed that scholarly publications in clinical medicine and biomedicine have typically been less available for open access than those in other scientific fields. Access to psychology research began a downward trend in 2003 and was no better than the average for all fields by 2006.

There is a growing movement to widen Open Access (OA) to peer-reviewed scholarly research. The Budapest Open Access statement said by ‘open access’ they meant “its free availability on the internet, permitting any user to read, download, copy, distribute, print, search, or link to the full texts of these articles.” Francis Collins, in an embedded video on the Wikipedia page for “Open Access” noted the NIH’s support for open access. Effective on May 8, 2013, President Obama signed an Executive Order to make government-held data more accessible to the public.

Many of the concerns with academic research discussed here could be quickly and effectively dealt with through open access. The discussion of the “First Blood Test Able to Diagnose Depression in Adults,” looked at in “The Reproducibility Problem,” is an example of the benefit and power it brings to the scientific process. There is a better future for academic research through open access. It may even knock publication bias out of the academic journals.


The Reproducibility Problem

Copyright : Fernando Gregory (Follow)


In January of 2014, a Japanese stem cell scientist published what looked like groundbreaking research in the journal Nature that suggested stem cells could be made quickly and easily. But as James Gallagher of the BBC noted, “the findings were too good to be true.” Her work was investigated by the center where she conducted her research amid concern within the scientific community that the results had been fabricated. In July, the Riken Institute wrote a retraction of the original article, noting the presence of “multiple errors.” The scientist was later found guilty of misconduct. In December of 2014, Riken announced that their attempts to reproduce the results had failed. Dr. Obokata resigned saying, “I even can’t find the words for an apology.”

The ability to repeat or replicate someone’s research is the way scientists can weed out nonsense, stupidity and pseudo-science from legitimate science. In Scientific Literacy and the Scientific Method, Henry Bauer described a ‘knowledge filter’ that illustrated this process. The first stage of this process was research or frontier science. The research is then presented to editors and referees of scientific journals for review, in hopes of being published. It may also be presented to other interested parties in seminars or at conferences. If the research successfully passes through this first filter, it will be published in the primary literature of the respective scientific field—and pass into the second stage of the scientific knowledge filter.

The second filter consists of others trying to replicate the initial research or apply some modification or extension of the original research. This is where the reproducibility problem occurs. The majority of these replications fail. But if the results of the initial research can be replicated, these results are also published as review articles or monographs (the third stage). After being successfully replicated, the original research is seen as “mostly reliable,” according to Bauer.

So while the stem cell research of Dr. Obokata made it through the first filter to the second stage, it seems that it shouldn’t have. The implication is that Nature didn’t do a very good job reviewing the data submitted to it for publication. However, when the second filtering process began, it detected the errors that should have been caught by the first filter and kept what was poor science from being accepted as reliable science.

A third filter occurs where the concordance of the research results with other fields of science is explored. There is also continued research by others who again confirm, modify and extend the original findings. When the original research successfully comes through this filter, it is “mostly very reliable,” and will get included into scientific textbooks.

Francis Collins and Lawrence Tabak of the National Institutes of Health (NIH) commented that: “Science has long been regarded as ‘self-correcting’, given that it is founded on the replication of earlier work.” But they noted how the checks and balances built into the process of doing science, which once helped to ensure its trustworthiness, have been compromised. This has led to the inability of researchers to reproduce the initial research findings. Think here of how Obokata’s stem cell research was approved for publication in Nature, one of the most prestigious science journals.

The reproducibility problem has become a serious concern within research conducted into psychiatric disorders. Thomas Insel, the Director of the National Institute of Mental Health (NIMH), wrote a November 14, 2014 article in his blog on the “reproducibility problem” in scientific publications. He said that “as much as 80 percent of the science from academic labs, even science published in the best journals, cannot be replicated.” Insel said this failure was not always because of fraud or the fabrication of results. Perhaps his comment was made with the above discussion of Dr. Obokata’s research in mind. Then again, maybe it was made in regard to the following study.

On September 16, 2014, the journal Translational Psychiatry published a study done at Northwestern University that claimed it was the “First Blood Test Able to Diagnose Depression in Adults.” Eva Redei, a co-author of the study, said: “This clearly indicates that you can have a blood-based laboratory test for depression, providing a scientific diagnosis in the same way someone is diagnosed with high blood pressure or high cholesterol.” A surprise finding of the study was that the blood test also predicted who would benefit from cognitive behavioral therapy. The study was supported by grants from the NIMH and the NIH.

The Redei et al. study received a good bit of positive attention in the news media. It was even called a “game changing” test for depression. WebMD, Newsweek, Huffington Post, US News and World Report, Time and others published articles on the research, all on the Translational Psychiatry publication date of September 16th. Then James Coyne, PhD published a critique of the press coverage and the study in his “Quick Thoughts” blog. Coyne systematically critiqued the claims of the Redei et al. study. Responding to Dr. Redei’s quote in the above paragraph, he said: “Maybe someday we will have a blood-based laboratory test for depression, but by themselves, these data do not increase the probability.”

He wondered why these mental health professionals would make such “misleading, premature, and potentially harmful claims.” In part, he thought it was because it was fashionable and newsworthy to claim progress in an objective blood test for depression. “Indeed, Thomas Insel, the director of NIMH is now insisting that even grant applications for psychotherapy research include examining potential biomarkers.” Coyne ended with quotes that indicated that Redei et al. were hoping to monetize their blood test. In an article for Genomeweb.com, Coyne quoted them as saying: “Now, the group is looking to develop this test into a commercial product, and seeking investment and partners.”

Coyne then posted a more thorough critique of the study, which he said would allow readers to “learn to critically examine the credibility of such claims that will inevitably arise in the future.” He noted how the small sample size contributed to its strong results, which are unlikely to be replicated in other samples. He also cited much larger studies looking for biomarkers for depression that failed to find evidence for them. His critique of the Redei et al. study was devastating. The comments from others seemed to agree. But how could these researchers be so blind?

Redei et al. apparently believed unquestioningly that there is a biological cause for depression. As a result, their commitment to this belief affected how they did their research to the extent that they were blind to the problems pointed to by Coyne. Listen to the video embedded in the link “First Blood Test Able to Diagnose Depression in Adults” to hear Dr. Redei acknowledge she believes that depression is a disease like any other disease. Otherwise, why attempt to find a blood test for depression?

Attempts to replicate the Redei et al. study, if they are done, will raise further questions and (probably) refute what Coyne said was a study with a “modest sample size and voodoo statistics.” Before we go chasing down another dead end in the labyrinth of failed efforts to find a biochemical cause for depression, let’s stop and be clear about whether this “game changer” is really what it claims to be.