The Replication Crisis in Psychology – Verifiable Statistics on Failure Rates, High-Profile vs. Low-Profile Findings, and Systemic Implications

The replication crisis represents one of the most thoroughly documented methodological challenges in contemporary psychology. Large-scale projects have produced clear, quantifiable evidence of low reproducibility across both high-profile and lower-profile studies. This document synthesizes the key statistical findings, distinguishes between high- and low-profile research, and examines what these numbers reveal about the field’s overall reliability.

Overall Replication Success Rates

The most cited large-scale effort remains the 2015 Open Science Collaboration project, which attempted to replicate 100 studies published in three high-impact psychology journals (Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition).

Original studies claiming statistically significant results: 97 out of 100 (97%).
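Successful replications (statistically significant result in the same direction as the original): 35 of the 97 originally significant studies (36%).
By an alternative criterion (original effect size falling within the 95% confidence interval of the replication estimate): 47%. Replication teams subjectively rated 39% of effects as having replicated.
Average effect size in replications was roughly half that reported in the original studies (mean r of about 0.20 versus about 0.40).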
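
A headline replication rate is a proportion estimated from a fairly small sample, so it carries its own uncertainty. The following is a minimal sketch of a Wilson score interval around that rate. It assumes the counts behind the 36% figure are 35 significant replications among the 97 originally significant studies; the function name and the choice of the Wilson method are illustrative, not part of the original project's analysis.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumed counts: 35 significant replications out of 97 originally significant studies.
low, high = wilson_interval(35, 97)
print(f"replication rate {35 / 97:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

The interval spans well over ten percentage points, which is one reason different projects with a few dozen studies each can report noticeably different rates without contradicting one another.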

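Subsequent projects have produced somewhat higher but still modest rates. A 2016 project replicating 18 experimental economics studies achieved a 61% replication rate (Camerer et al., 2016), and a 2018 project replicating 21 social science experiments published in Nature and Science achieved about 62% (Camerer et al., 2018), while social psychology-focused efforts have often fallen in the 25–50% range depending on methodology and strictness criteria.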

High-Profile vs. Low-Profile Findings

High-Profile Studies (often published in top-tier journals and widely cited): These tend to show the lowest replication rates. The Open Science Collaboration’s 100-study set was drawn from 2008 issues of three leading journals, so it reflects work published in prominent outlets rather than a sample selected for individual influence. Only about one-third replicated robustly. Many headline findings in social priming, ego depletion, and implicit bias research have shown particularly weak replication (e.g., several classic social psychology effects have failed in multiple independent attempts). High-profile work attracts greater visibility and scrutiny, which has exposed fragility in effect sizes and contextual sensitivity.

Low-Profile Studies (published in specialized or lower-impact journals): Replication rates are generally higher but still concerning. A 2016–2020 meta-project examining a broader sample of psychology studies estimated replication success around 50–60% for less-cited work. However, even these studies frequently show substantial shrinkage in effect size upon replication. The pattern suggests that publication bias and questionable research practices (p-hacking, selective reporting) inflate apparent success across the field, with high-profile findings suffering most from over-optimism in initial reporting.
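
The link between publication bias and effect-size shrinkage can be illustrated with a small simulation. The sketch below is a toy model, not an estimate from any real literature: the true effect (d = 0.2), the sample size (20 per group), the rule that only significant results in the predicted direction get published, and all function names are assumptions chosen for illustration.

```python
import random

# Two-sided 5% critical t value for df = 38, i.e. n = 20 per group.
# Assumption: this constant must be changed if n is changed.
T_CRIT = 2.024


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


def simulate_study(true_d, n, rng):
    """Run one two-group study; return (observed Cohen's d, significant in predicted direction?)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n)]
    m_c, v_c = mean_and_variance(control)
    m_t, v_t = mean_and_variance(treated)
    d = (m_t - m_c) / ((v_c + v_t) / 2) ** 0.5
    t = d * (n / 2) ** 0.5  # t statistic for two equal-sized groups
    return d, t > T_CRIT


def simulate_literature(true_d=0.2, n=20, studies=5000, seed=1):
    """Publish only significant originals, then replicate each one once."""
    rng = random.Random(seed)
    originals = [simulate_study(true_d, n, rng) for _ in range(studies)]
    published = [d for d, significant in originals if significant]
    if not published:
        raise ValueError("no study reached significance; increase `studies`")
    replications = [simulate_study(true_d, n, rng) for _ in published]
    return {
        "published_share": len(published) / studies,
        "mean_published_d": sum(published) / len(published),
        "mean_replication_d": sum(d for d, _ in replications) / len(replications),
        "replication_rate": sum(sig for _, sig in replications) / len(replications),
    }


for name, value in simulate_literature().items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions every published effect must exceed roughly d = 0.64 just to clear the significance threshold, so the published average sits far above the true value, while replications, which are not filtered, cluster around the true effect and rarely reach significance. No questionable research practice is needed to produce this pattern; selective publication alone is sufficient.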

Broader Statistical Indicators of Failure Rates

Publication Bias and “File Drawer” Problem: Estimates suggest that for every published significant finding, multiple non-significant studies remain unpublished. Ioannidis (2005) showed with a simple probabilistic model that when the prior odds of a true effect are low, statistical power is modest, and analytic flexibility introduces bias, the positive predictive value of a statistically significant result can fall below 50%.
Effect Size Inflation: Original studies routinely report effect sizes 1.5–2 times larger than those obtained in independent replications.
p-Hacking and Researcher Degrees of Freedom: Surveys of psychologists indicate widespread use of flexible analytic practices that increase false-positive rates (John et al., 2012).
Longitudinal Outcome Data in Clinical Psychology/Psychiatry: Real-world functional recovery rates under standard care protocols remain low. Many longitudinal cohorts show high relapse, persistent disability, and measurable iatrogenic effects (e.g., cortical volume reduction correlated with cumulative antipsychotic exposure).
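
The positive predictive value cited from Ioannidis (2005) in the first item above can be written as a short function. This is a sketch of that model: R is the pre-study odds that a tested relationship is true, power is 1 − β, alpha is the significance threshold, and bias is the share of would-be negative results that end up reported as positive. The specific input values in the examples are illustrative assumptions, not estimates for psychology.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05, bias=0.0):
    """Probability that a significant finding reflects a true effect.

    Follows the model in Ioannidis (2005):
    PPV = ((1 - beta) * R + u * beta * R) / (R + alpha - beta * R + u - u * alpha + u * beta * R)
    with R = prior_odds, beta = 1 - power, u = bias.
    """
    if prior_odds <= 0:
        raise ValueError("prior_odds must be positive")
    beta = 1.0 - power
    true_positives = (power + bias * beta) * prior_odds
    false_positives = alpha + bias * (1.0 - alpha)
    return true_positives / (true_positives + false_positives)


# Illustrative scenarios (all input values are assumptions):
scenarios = {
    "well-powered, even prior odds": dict(prior_odds=1.0, power=0.80),
    "modest power, 1:4 prior odds": dict(prior_odds=0.25, power=0.35),
    "modest power, 1:10 prior odds": dict(prior_odds=0.10, power=0.35),
    "modest power, 1:4 odds, 20% bias": dict(prior_odds=0.25, power=0.35, bias=0.20),
}
for label, kwargs in scenarios.items():
    print(f"{label}: PPV = {positive_predictive_value(**kwargs):.2f}")
```

The function makes the dependence explicit: holding alpha at 0.05, the predictive value falls below one half once prior odds are low enough or bias is large enough, even though every individual result counted here is nominally significant.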

What These Numbers Mean

The verifiable statistics paint a consistent picture: psychology produces many findings that do not hold up under independent scrutiny. High-profile claims, those most likely to influence theory, clinical practice, and public understanding, show the weakest replication. This is not random noise but a structural feature of studying complex, context-dependent human phenomena with methods that often fail to account for observer effects, cultural variation, and environmental context.

The gap between initial claims and replication outcomes shows how weakly many published claims were exposed to falsification. When roughly half to three-quarters of published significant results fail or substantially weaken upon re-testing, depending on the subfield and the criterion used, the field’s ability to build cumulative knowledge is severely compromised.

These numbers do not invalidate all psychological research. They do, however, demonstrate that the current standard of evidence in much of the field falls short of the rigorous, externally anchored standards seen in physics, chemistry, or molecular biology. Greater transparency, pre-registration, larger samples, and adversarial collaboration are necessary corrections if psychology seeks to strengthen its scientific standing.