The Design of Systematic Replication Studies

Since the start of the War on Poverty in the 1960s, social scientists have developed and refined experimental and quasi-experimental methods for evaluating and understanding the way public policies affect people’s lives. The overarching mission of many social scientists is to understand “what works” in social policy for reducing inequality, improving educational outcomes, and mitigating the harms of early-life disadvantage. This is a laudable goal. However, mounting evidence suggests that the results from many studies are fragile and hard to replicate. The so-called “replication crisis” has important implications for evidence-based analysis of social programs and policies. At the same time, there is intense debate about what constitutes a successful replication and why certain types of replication rates are so low. A crucial set of questions for evidence-based policy research concerns external validity and replicability: we need to understand the contexts and conditions under which interventions produce similar outcomes.

To address these concerns, Peter Steiner and I presented a new framework that provides a clear definition of replication and highlights the conditions under which results are likely to replicate (Wong & Steiner, 2018; Steiner, Wong, & Anglin, 2019). This work introduces the Causal Replication Framework (CRF), which defines replication as a research design that tests whether two or more studies produce the same causal effect within the limits of sampling error. The CRF formalizes the conditions under which replication success can be expected. The core of the framework is based on potential outcomes notation (Rubin, 1974), which has the advantage of identifying clear causal estimands of interest and demonstrating the research design assumptions needed for the direct replication of results. Here, a causal estimand is defined as the causal effect of a well-defined treatment-control contrast for a clear target population. In this blog post, I discuss the key assumptions needed for direct replication of results, describe how the CRF may be used for planning systematic replication studies, and explain how the CRF is informing our development of a “crowdsourcing platform” for SERA.
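
To fix ideas, here is a minimal sketch of the potential outcomes notation behind this definition. The symbols below (τ_j for the estimand of study j, Y_j(1) and Y_j(0) for potential outcomes, P_j for the target population) are my shorthand for illustration and are not quoted from the cited papers.

```latex
% Minimal sketch (illustrative notation): for study j with target population P_j,
% the causal estimand is the average treatment-control contrast
\[
  \tau_j \;=\; \mathbb{E}\bigl[\, Y_j(1) - Y_j(0) \,\big|\, P_j \,\bigr] .
\]
% Replication is a design that tests whether two or more studies produce the
% same causal effect within the limits of sampling error, i.e., whether
\[
  \hat{\tau}_j - \hat{\tau}_{j'} \approx 0 \quad \text{(up to sampling error)}
  \quad \text{for all pairs of studies } j, j' .
\]
```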

Under the CRF, five assumptions are required for the direct replication of results across multiple studies. These assumptions may be understood broadly as replication design requirements (R1-R2 in Table 1) and individual study design requirements (A1-A3 in Table 1). Replication design assumptions include treatment and outcome stability (R1) and equivalence in causal estimands (R2). Combined, these two assumptions ensure that the same causal estimand for a well-defined treatment and target population is produced across all studies. Individual study design assumptions include unbiased identification of causal estimands (A1), unbiased estimation of causal estimands (A2), and correct reporting of estimands, estimators, and estimates (A3). These assumptions ensure that a valid research design is used for identifying effects, that unbiased analysis approaches are used for estimating effects, and that effects are correctly reported – standard assumptions in most individual causal studies. Replication failure occurs when one or more of the replication and/or individual study design assumptions are not met. Table 1 summarizes the assumptions needed for the direct replication of results.


Table 1. Design Assumptions for Replication of Effects (Steiner & Wong, in press; Wong & Steiner, 2018)

The same design assumptions apply to each study, from Study 1 through Study k:

Replication design assumptions (R1-R2)
R1. Treatment and outcome stability
R2. Equivalence in causal estimand

Individual study design assumptions (A1-A3)
A1. Unbiased identification of effects
A2. Unbiased estimation of effects
A3. Correct reporting of estimators, estimands, and estimates
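
Read together, the assumptions in Table 1 can be summarized with a short, hedged bit of notation (my own symbols for exposition, not the exact formalization in the cited papers): R1-R2 guarantee a common estimand across studies, and A1-A3 guarantee that each study recovers its estimand without bias.

```latex
% Illustrative summary of Table 1 (notation is mine, for exposition only):
% R1-R2 (replication design): all k studies target the same causal estimand,
\[
  \tau_1 = \tau_2 = \cdots = \tau_k \equiv \tau .
\]
% A1-A3 (individual study design): each study's reported estimate is unbiased
% for its own estimand,
\[
  \mathbb{E}\bigl[\hat{\tau}_j\bigr] = \tau_j , \qquad j = 1, \ldots, k .
\]
% When R1-R2 and A1-A3 all hold, reported estimates differ only by sampling
% error, so direct replication of results is expected; a failure to replicate
% points to a violation of at least one assumption.
```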

Deriving Research Designs for Systematic Replication Studies

A key advantage of the CRF is that it is straightforward to derive research designs for replication that identify sources of effect heterogeneity. For example, direct replications examine whether two or more studies with the same well-defined causal estimand yield the same effect (akin to the definition of verification tests in Clemens, 2017). To implement this type of design, the researcher may introduce potential violations to any of the individual study design assumptions (A1-A3), such as using different research design (A1) or estimation (A2) approaches for producing effects, or asking an independent investigator to reproduce the effect using the same data and code (A3). Design replications (LaLonde, 1986) and reanalysis or reproducibility studies (Chang & Li, 2015) are examples of direct replication approaches. Conceptual replications examine whether two or more studies with potentially different causal estimands yield the same effect (akin to the definition of robustness tests in Clemens, 2017). Here, the researcher may introduce systematic violations to the replication assumptions (R1-R2), such as multi-site designs where there are systematic differences in participant and setting characteristics across sites (R2), or multi-arm treatment designs where different dosage levels of an intervention are assigned (R1). The framework demonstrates that while the assumptions for direct replication of results are stringent, researchers often have multiple research design options for evaluating the replicability of effects and identifying sources of effect variation. However, until recently, these approaches have not been recognized as systematic replication designs. Wong, Anglin, and Steiner (2021) describe research designs for conceptual replication studies, as well as methods for assessing the extent to which replication assumptions are met in field settings.
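
As a concrete illustration of what “the same effect within the limits of sampling error” can mean operationally, here is a short, hypothetical Python sketch that compares two studies’ estimated effects with a difference-in-estimates z-test. The numbers, function name, and choice of test are illustrative assumptions on my part, not the specific procedures used in the cited papers.

```python
# Hypothetical sketch: testing whether two studies yield the "same" effect
# within the limits of sampling error, using a difference-in-estimates z-test.
# Effect estimates and standard errors below are made up for illustration.
import math

def replication_z_test(est1, se1, est2, se2):
    """Two-sided z-test of H0: the two studies share the same causal effect.

    Assumes the two estimates are independent and approximately normal.
    """
    diff = est1 - est2
    se_diff = math.sqrt(se1**2 + se2**2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return diff, z, p_value

# Study 1: an RCT estimate; Study 2: a replication at a second site (fabricated numbers).
diff, z, p = replication_z_test(est1=0.25, se1=0.08, est2=0.18, se2=0.09)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p:.3f}")
# A small |z| (large p) is consistent with direct replication; a large |z|
# signals replication failure, prompting a check of assumptions R1-R2 and A1-A3.
```

Some replication researchers prefer an equivalence-style test (for example, two one-sided tests against a pre-specified tolerance), since failing to reject a null of no difference is not, by itself, evidence of replication; either framing can be used to operationalize “the same effect within sampling error.”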

Using Systematic Replication Designs for Understanding Effect Heterogeneity

Wong and Steiner (2018) show that replication failure is not inherently “bad” for science – as long as the source of the replication failure can be identified. Prospective research designs or research design features may be used to address replication assumptions in field settings. The multi-arm treatment design, where units are randomly assigned to different treatment dosages, is one example of this approach. A switching replication design, where two or more groups are randomly assigned to receive treatment at different time intervals in an alternating sequence, is another example (Shadish, Cook, & Campbell, 2002). Here, the timing and context under which the intervention is delivered vary, but all other study features (including the composition of sample participants) remain the same across both intervention intervals. In systematic replication studies, a researcher introduces planned variation by relaxing one or more replication assumptions while trying to meet all others. If replication failure is then observed, and all other assumptions are met, the researcher may infer that the relaxed assumption was violated and is the source of the treatment effect heterogeneity.
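
To make the logic of planned variation concrete, below is a hypothetical simulation sketch of a multi-arm dosage design in which only treatment stability (R1) is deliberately relaxed: units are randomized to two dosage levels, every other design feature is held fixed by construction, and any systematic gap between the two arm-specific effects is then attributed to dosage. All parameters and names are invented for illustration and are not taken from the cited studies.

```python
# Hypothetical simulation of a planned violation of treatment stability (R1):
# two dosage arms, identical population and design, so any gap in effects
# is attributable to dosage. All parameters below are invented for illustration.
import numpy as np

rng = np.random.default_rng(seed=2024)
n_per_arm = 500

def simulate_arm(effect, n, rng):
    """Simulate one randomized arm and return its estimated effect and SE."""
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    treated = rng.normal(loc=effect, scale=1.0, size=n)
    est = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
    return est, se

# Low-dosage arm has a true effect of 0.15 SD; high-dosage arm, 0.35 SD.
est_low, se_low = simulate_arm(effect=0.15, n=n_per_arm, rng=rng)
est_high, se_high = simulate_arm(effect=0.35, n=n_per_arm, rng=rng)

# Compare the two arm-specific effects: because only dosage (R1) was varied,
# a difference beyond sampling error is interpreted as dosage-driven heterogeneity.
gap = est_high - est_low
se_gap = np.sqrt(se_low**2 + se_high**2)
print(f"low-dose effect  = {est_low:.3f} (SE {se_low:.3f})")
print(f"high-dose effect = {est_high:.3f} (SE {se_high:.3f})")
print(f"gap = {gap:.3f}, z = {gap / se_gap:.2f}")
```

The same template extends, for instance, to a switching replication design by varying when (rather than at what dosage) groups receive treatment, while holding sample composition fixed.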

Using the CRF to Plan Crowdsourced Systematic Replication Studies

In crowdsourcing efforts such as SERA, teams of independent investigators collaborate to conduct a series of conceptual replication studies using the same study protocols. These are conceptual replication studies because although sites may have the same participant eligibility criteria, the characteristics of recruited sample members, the context under which the intervention is delivered, and the settings under which the study is conducted will likely vary across independent research teams. A key challenge in all conceptual replication studies – including those in SERA – is identifying why replication failure occurred when multiple design assumptions may have been violated simultaneously.

To address these concerns, SERA is developing infrastructure and methodological supports for evaluating the replicability of effects and for identifying sources of effect heterogeneity when replication failure is observed. These tools are based on the Causal Replication Framework and focus on three interrelated strands of work. First, we are working with the UVA School of Data Science (Brian Wright) to create a crowdsourcing platform that will ensure transparency and openness in our methods, procedures, and data. A crowdsourcing platform that allows for efficient and multi-modal data collection is critical for helping us evaluate the extent to which replication assumptions are met in field settings. Second, we are working on data science methods for assessing the fidelity and replicability of treatment conditions across sites in ways that are valid, efficient, and scalable. This work will provide information about the extent to which the “treatment stability” assumption (R1 in Table 1) is met in field settings. Finally, we are examining statistical approaches for evaluating the replicability of effects, especially when there is evidence of effect heterogeneity. Over the next year, we look forward to sharing results from these efforts and what our team has learned about the “science” of conducting replication studies in field settings.

The Replication Quantitative Methods Team includes Vivian C. Wong (UVA), Peter M. Steiner (Maryland), Brian Wright (UVA SDS), Anandita Krishnamachari (Research Scientist), and Christina Taylor (Replication Specialist).


Vivian C. Wong, Ph.D.

Research Faculty

Vivian C. Wong, Ph.D., is an Associate Professor in Research, Statistics, and Evaluation at UVA. Dr. Wong’s expertise is in improving the design, implementation, and analysis of experimental and quasi-experimental approaches. Her scholarship has focused recently on the design and analysis of replication research. Dr. Wong has authored numerous articles on research methodology in journals such as Journal of Educational and Behavioral Statistics, Journal of Policy Analysis and Management, and Psychological Methods. Wong is primarily responsible for developing the methodological infrastructure and supports for SERA, assisting with methodological components of the pilot study, and conducting exploratory analyses of SERA.