Estimating the reproducibility of psychological science
Replications of 100 studies were performed; only about a third reproduced the original results.
Only a third of results replicated (by subjective assessment). In many cases the results were not statistically significant on replication, and in some cases the effect was in the opposite direction to the one originally observed!
Is this the same in other fields? Presumably maths and physics are mostly safe. There is another paper to read on bio which is apparently similar in its conclusions.
Insights, lessons learnt:
Can be attributed to publication bias (only publishing experiments that worked makes it more likely that “lucky” results get published). Very important that ALL results be recorded, and that ALL experiments have a publicly recorded protocol before they start. Reproduction of past studies should be regarded more highly and done more often. The foundation of science is made rotten by incorrect results, since new science builds on old. Very scary paper.
Important results should be reproduced by a different team, to try to make this less likely.
“Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.”
“We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available”
“The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline”
“Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results;”
“More generally, there are indications of cultural practices in scientific communication that may be responsible for the observed results. Low-power research designs combined with publication bias favoring positive results together produce a literature with upwardly biased effect sizes (14, 16, 33, 34). This anticipates that replication effect sizes would be smaller than original studies on a routine basis, not because of differences in implementation but because the original study effect sizes are affected by publication and reporting bias, and the replications are not.”
When a study doesn’t turn up an interesting result, it is not published, which increases the chance of non-reproducible “chance” results ending up published. Everything should be published, and protocols should be pre-registered to prevent unconscious bias and conscious hacking (trying another study, another protocol, until something works).
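The mechanism is easy to demonstrate numerically. The sketch below (my own illustration, not from the paper; the true effect size, sample size, and study count are assumed values) simulates many underpowered two-group studies and then "publishes" only the significant ones, showing how the published effect sizes come out inflated relative to the true effect:

```python
# Hypothetical simulation of publication bias (illustrative assumptions,
# not the paper's data): low-powered studies + publishing only p < .05
# results yields inflated published effect sizes.
import math
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.2   # assumed true standardized effect
N = 30              # per-group sample size (deliberately underpowered)
STUDIES = 2000      # number of simulated studies

def one_study():
    """Simulate one two-group study; return (observed effect, significant?)."""
    treatment = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    control = [random.gauss(0, 1) for _ in range(N)]
    d = statistics.mean(treatment) - statistics.mean(control)
    se = math.sqrt(2 / N)                    # approx. standard error of d
    return d, abs(d / se) > 1.96             # two-sided z-test at p < .05

results = [one_study() for _ in range(STUDIES)]
all_effects = [d for d, _ in results]
published = [d for d, sig in results if sig]  # publication bias: keep sig only

print(f"true effect:            {TRUE_EFFECT}")
print(f"mean of all studies:    {statistics.mean(all_effects):.3f}")
print(f"mean of published only: {statistics.mean(published):.3f}")
```

The unfiltered mean hovers near the true effect, while the "published" mean is substantially larger, mirroring the paper's point that replications (which report everything) should routinely find smaller effects than the biased original literature.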
“This suggests publication, selection, and reporting biases as plausible explanations for the difference between original and replication effects. The replication studies significantly reduced these biases because replication preregistration and pre-analysis plans ensured confirmatory tests and reporting of all results.”
“Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication (20). If nothing else, this project demonstrates that it is possible to conduct a large-scale examination of reproducibility despite the incentive barriers”
“Progress occurs when existing expectations are violated and a surprising result spurs a new investigation. Replication can increase certainty when findings are reproduced and promote innovation when they are not”
“After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science, even if it is not appreciated in daily practice”