The Design of Systematic Replication Studies

Since the start of the War on Poverty in the 1960s, social scientists have developed and refined experimental and quasi-experimental methods for evaluating and understanding the way public policies affect people’s lives. The overarching mission of many social scientists is to understand “what works” in social policy for reducing inequality, improving educational outcomes, and mitigating harms of early life disadvantage. This is a laudable goal. However, mounting evidence suggests that the results from many studies are fragile and hard to replicate. The so-called “replication crisis” has important implications for evidence-based analysis of social programs and policies. At the same time, there is intense debate about what constitutes a successful replication and why certain types of replication rates are so low. A crucial set of questions for evidence-based policy research concerns external validity and replicability. We need to understand the contexts and conditions under which interventions produce similar outcomes.

To address these concerns, Peter Steiner and I presented a new framework that provides a clear definition of replication and highlights the conditions under which results are likely to replicate (Wong & Steiner, 2018; Steiner, Wong, & Anglin, 2019). This work introduces the Causal Replication Framework (CRF), which defines replication as a research design that tests whether two or more studies produce the same causal effect within the limits of sampling error. The CRF formalizes the conditions under which replication success can be expected. The core of the replication framework is based on potential outcomes notation (Rubin, 1974), which has the advantage of identifying clear causal estimands of interest and demonstrating research design assumptions needed for the direct replication of results. Here, a causal estimand is defined as the causal effect of a well-defined treatment-control contrast for a clearly defined target population. In this blog post, I discuss key assumptions needed for direct replication of results, describe how the CRF may be used for planning systematic replication studies, and explain how the CRF is informing our development of a “crowdsourcing platform” for SERA.
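
To make the idea of “the same causal effect within the limits of sampling error” concrete, here is a minimal Python sketch of one common check: a two-sided z-test on the difference between two independent effect estimates. It is an illustration only, not a procedure prescribed by the CRF. The function name `difference_test` and all numbers are hypothetical, and the test assumes independent, approximately normal estimates with known standard errors.

```python
import math


def difference_test(effect_1, se_1, effect_2, se_2):
    """Two-sided z-test for the difference between two independent effect estimates.

    Assumes (for illustration) that both estimates are approximately normal
    and that their standard errors are known.
    """
    diff = effect_1 - effect_2
    # Standard error of a difference between independent estimates
    se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2)
    z = diff / se_diff
    # Standard normal CDF evaluated at |z|, computed with the error function
    cdf_abs_z = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p_value = 2.0 * (1.0 - cdf_abs_z)
    return diff, se_diff, z, p_value


# Hypothetical estimates: Study 1 finds 0.30 (SE 0.10), Study 2 finds 0.10 (SE 0.10)
diff, se_diff, z, p_value = difference_test(0.30, 0.10, 0.10, 0.10)
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p_value:.3f}")
```

A large p-value here does not establish that the two effects are equal; with imprecise estimates, the test may simply lack the power to detect a real difference.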

Under the CRF, five assumptions are required for the direct replication of results across multiple replication studies. These assumptions may be understood broadly as replication design requirements (R1-R2 in Table 1), and individual study design requirements (A1-A3 in Table 1). Replication design assumptions include treatment and outcome stability (R1) and equivalence in causal estimands (R2). Combined, these two assumptions ensure that the same causal estimand for a well-defined treatment and target population is produced across all studies. Individual study design assumptions include unbiased identification of causal estimands (A1), unbiased estimation of causal estimands (A2), and correct reporting of estimands, estimators, and estimates (A3). These assumptions ensure that a valid research design is used for identifying effects, unbiased analysis approaches are used for estimating effects, and that effects are correctly reported – standard assumptions in most individual causal studies. Replication failure occurs when one or more of the replication and/or individual study design assumptions are not met. Table 1 summarizes the assumptions needed for the direct replication of results.


Table 1. Design Assumptions for Replication of Effects (Steiner & Wong, in press; Wong & Steiner, 2018)

Design assumptions for Study 1 through Study k

Replication design assumptions (R1-R2)
  R1. Treatment and outcome stability
  R2. Equivalence in causal estimand

Individual study design assumptions (A1-A3)
  A1. Unbiased identification of effects
  A2. Unbiased estimation of effects
  A3. Correct reporting of estimators, estimands, and estimates

Deriving Research Designs for Systematic Replication Studies

A key advantage of the CRF is that it is straightforward to derive research designs for replication to identify sources of effect heterogeneity. For example, direct replications examine whether two or more studies with the same well-defined causal estimand yield the same effect (akin to the definition of verification tests in Clemens, 2017). To implement this type of design, the researcher may introduce potential violations to any individual study design assumptions (A1-A3), such as using different research designs (A1) or estimation approaches (A2) for producing effects, or asking an independent investigator to reproduce the effect using the same data and code (A3). Design-replications (LaLonde, 1986) and reanalysis or reproducibility studies (Chang & Li, 2015) are examples of direct replication approaches. Conceptual replications examine whether two or more studies with potentially different causal estimands yield the same effect (akin to the definition of robustness tests in Clemens, 2017). Here, the researcher may introduce systematic violations to replication assumptions (R1-R2), such as multi-site designs where there are systematic differences in participant and setting characteristics across sites (R2), and multi-arm treatment designs where different dosage levels of an intervention are assigned (R1). The framework demonstrates that while the assumptions for direct replication of results are stringent, researchers often have multiple research design options for evaluating the replicability of effects and identifying sources of effect variation. However, until recently, these approaches have not been recognized as systematic replication designs. Wong, Anglin, and Steiner (2021) describe research designs for conceptual replication studies, and methods for assessing the extent to which replication assumptions are met in field settings.

Using Systematic Replication Designs for Understanding Effect Heterogeneity

Wong and Steiner (2018) show that replication failure is not inherently “bad” for science – as long as the source of the replication failure can be identified. Prospective research designs or research design features may be used to address replication assumptions in field settings. The multi-arm treatment design, where units are randomly assigned to different treatment dosages, is one example of this approach. A switching replication design, where two or more groups are randomly assigned to receive treatment at different time intervals in an alternating sequence, is another example (Shadish, Cook, & Campbell, 2002). Here, the timing and context under which the intervention is delivered vary, but all other study features (including composition of sample participants) remain the same across both intervention intervals. In systematic replication studies, a researcher introduces planned variation by relaxing one or more replication assumptions while trying to meet all others. If replication failure is observed, and all other assumptions are met, then the researcher may infer that the relaxed assumption was violated and that this violation produced the treatment effect heterogeneity.

Using the CRF to Plan Crowdsourced Systematic Replication Studies

In crowdsourcing efforts such as SERA, teams of independent investigators collaborate to conduct a series of conceptual replication studies using the same study protocols. These are conceptual replication studies because although sites may have the same participant eligibility criteria, the characteristics of recruited sample members, the context under which the intervention is delivered, and the settings under which the study is conducted will likely vary across independent research teams. A key challenge in all conceptual replication studies – including those in SERA – is identifying why replication failure occurred when multiple design assumptions may have been violated simultaneously.

To address these concerns, SERA is developing infrastructure and methodological supports for evaluating the replicability of effects, and for identifying sources of effect heterogeneity when replication failure is observed. These tools are based on the Causal Replication Framework and focus on three inter-related strands of work. First, we are working with the UVA School of Data Science (Brian Wright) to create a crowdsourcing platform that will ensure transparency and openness in our methods, procedures, and data. A crowdsourcing platform that allows for efficient and multi-modal data collection is critical for helping us evaluate the extent to which replication assumptions are met in field settings. Second, we are working on data science methods for assessing the fidelity and replicability of treatment conditions across sites in ways that are valid, efficient, and scalable. This will provide information about the extent to which the “treatment stability” assumption (R1 in Table 1) is met in field settings. Finally, we are examining statistical approaches for evaluating the replicability of effects, especially when there is evidence of effect heterogeneity. Over the next year, we look forward to sharing results from these efforts and what our team has learned about the “science” of conducting replication studies in field settings.

The Replication Quantitative Methods Team includes Vivian C. Wong (UVA), Peter M. Steiner (Maryland), Brian Wright (UVA SDS), Anandita Krishnamachari (Research Scientist), and Christina Taylor (Replication Specialist).


Vivian C. Wong, Ph.D.

Co-Principal Investigator

Vivian C. Wong, Ph.D., is an Associate Professor in Research, Statistics, and Evaluation at UVA. Dr. Wong’s expertise is in improving the design, implementation, and analysis of experimental and quasi-experimental approaches. Her scholarship has focused recently on the design and analysis of replication research. Dr. Wong has authored numerous articles on research methodology in journals such as Journal of Educational and Behavioral Statistics, Journal of Policy Analysis and Management, and Psychological Methods. Wong is primarily responsible for developing the methodological infrastructure and supports for SERA, assisting with methodological components of the pilot study, and conducting exploratory analyses of SERA.

Crowdsourcing & Open Science

Educators strive to improve and maximize the learning outcomes of their students by applying effective instructional practices. Although no instructional approach is universally effective, some teaching practices are more effective than others. Therefore, it is important to reliably identify and prioritize the most effective instructional practices for populations of learners. Rigorous experimental research is generally agreed to be the most reliable approach for identifying “what works” in education. However, it is important to recognize that scientific research has important limitations and does not always generate valid findings.

In large-scale replication projects in psychology and other fields, researchers often failed to replicate the findings of previously conducted studies (e.g., Klein et al., 2018; Open Science Collaboration, 2015), casting doubt on the validity of research findings. Researchers, including those in education, have also reported using a variety of questionable research practices such as p-hacking (exploring different ways to analyze data until desired results are obtained), selective-outcomes reporting (reporting only analyses with desired results), and hypothesizing after results are known (HARKing; e.g., Fraser et al., 2018; John et al., 2012; Makel et al., 2019), all of which increase the likelihood of false positive findings (Simmons et al., 2011). Moreover, research studies in education often involve relatively small and underpowered samples that do not adequately represent the population being studied, which further threatens the validity of study findings. Finally, most published research lies behind a paywall, inaccessible to many practitioners and policymakers (Piwowar et al., 2018), which reduces the potential application and impact of research. Open science and crowdsourcing are two related developments in research that aim to address these issues and improve the validity and impact of research.

Open science is an umbrella term that includes a variety of practices aiming to open and make transparent all aspects of research with the goal of increasing its validity and impact (Cook et al., 2018). For example, preregistration involves making one’s research plans transparent before conducting a study in order to discourage questionable research practices such as p-hacking and HARKing by making them easily discoverable. Data sharing is another key open-science practice, which allows the research community to verify analyses reported in an article, as well as analyze data sets in other ways to examine the robustness of reported findings, thereby serving as a protection against p-hacking. Materials sharing involves openly sharing materials used in a study (e.g., intervention protocols, intervention checklists) to enable other researchers to replicate the study as faithfully as possible. Finally, open access publishing and preprints provide free access to research to anyone with internet access, thereby democratizing the benefits and impact of scientific research beyond those with institutional subscriptions to the publishers of academic journals.

Crowdsourcing in research involves large-scale collaboration on various elements of the research process and can take many forms (Makel et al., 2019). For example, multiple research teams might conduct independent studies examining the same issues, as the Open Science Collaboration (2015) did when they conducted 100 replication studies to examine reproducibility of findings in psychology. Crowdsourcing has also taken the form of multiple analysts analyzing the same data set to answer the same research question in order to examine the effects of different analytic decisions on study outcomes (Silberzahn et al., 2018). The most frequent application of crowdsourcing in research, which we have adopted in the Special Education Research Accelerator, is to involve many research teams in the collection of data, thereby increasing the size and diversity of the study sample – which serves to improve the study’s power and external validity. For example, Jones et al. (2018) involved more than 200 researchers collecting data from more than 11,000 participants in 41 countries to evaluate facial perceptions. Regardless of how it is employed, “crowdsourcing flips research planning from ‘what is the best we can do with the resources we have to investigate our question?’ to ‘what is the best way to investigate our question, so that we can decide what resources to recruit?’” (Uhlmann et al., 2019, p. 713).
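
The link between pooled sample size and statistical power can be illustrated with a rough calculation. The Python sketch below uses a normal approximation for a two-arm comparison of means. The function name `approx_power`, the effect size, and the team and sample counts are hypothetical, and the calculation ignores site-level clustering and effect heterogeneity, both of which would lower power in a real multi-site study.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-arm test of a standardized mean difference.

    Uses a normal approximation and counts only rejections in the direction
    of the true effect. Illustrative only.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected z-statistic when the true standardized effect is effect_size
    expected_z = effect_size * (n_per_arm / 2) ** 0.5
    return std_normal.cdf(expected_z - z_crit)


# Hypothetical scenario: a standardized effect of 0.3 and 20 students per arm per team
single_team = approx_power(0.3, n_per_arm=20)
# Twenty teams pooling their data: 400 students per arm
pooled = approx_power(0.3, n_per_arm=20 * 20)
print(f"single team: {single_team:.2f}, pooled across 20 teams: {pooled:.2f}")
```

Under these simplified assumptions, a single small study would rarely detect the effect, while the pooled sample would detect it almost always.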

Although open science and crowdsourcing are independent constructs (i.e., research that is open is not necessarily crowdsourced, and crowdsourced studies are not necessarily open), the two approaches are closely aligned and complementary. The ultimate goal of both open science and crowdsourcing is to improve the validity and impact of research. Although they use different means to achieve this goal, crowdsourcing facilitates making research open, and open science facilitates crowdsourcing. For example, in a crowdsourced study in which data are collected by many research teams, study procedures have to be determined and disseminated to collaborating researchers before the study begins. Because study procedures are determined and documented prior to conducting the study, they can readily be posted as a preregistration. Additionally, if data are being collected across many researchers in a crowdsourced study, the project will need a clear data-management plan to enable data to be collected and entered reliably across researchers. Such well-organized data-management plans that include metadata not only facilitate the integration of data across researchers on the project, but also make data more readily usable by other researchers when shared and reduce the burden of creating supporting metadata when sharing. Moreover, because data are collected across many different researchers in many crowdsourced studies, individual researchers may be less likely to feel that they “own” the data and therefore may feel less reluctant to share them. Similarly, materials used in crowdsourced studies must have been developed and shared with the many researchers collecting data, so – like data – materials from crowdsourced research are ready to be uploaded and shared.

Most broadly, crowdsourcing and open science share an ethos of collaboration and sharing for the betterment of science, and we conjecture that most researchers involved in crowdsourced studies will want to make their research as open as possible. As such, we look forward to making our crowdsourced research through the Special Education Research Accelerator as open as possible.


Crowdsourcing Research Primer

In crowdsourced research, resources are combined across researchers to conduct studies that no individual researcher could accomplish alone.

“Crowdsourcing flips research planning from ‘what is the best we can do with the resources we have to investigate our question,’ to ‘what is the best way to investigate our question, so that we can decide what resources to recruit’” (Uhlmann et al., 2019, p. 713).

Crowdsourced research is new to special education, but it has been around for decades in other fields. Many aspects of research can be crowdsourced, such as deciding what ideas to research, analyzing data, and conducting peer review (Uhlmann et al., 2019). The most common use of crowdsourcing is in data collection, in which many researchers collect data, resulting in much larger and more diverse samples of study participants. Examples of crowdsourced data collection outside of special education include:

  • Citizen data collectors: Fields such as astronomy and geology utilize volunteers to collect data that researchers would never be able to collect on their own. Many of these projects and opportunities to become data collectors are compiled on the federal government citizen science website.
  • Psychological Science Accelerator: The Psych Accelerator is a network of over 1,000 research labs, across more than 70 countries, conducting large-scale studies on topics such as gendered prejudice and stereotype threat.

Crowdsourced research in special education, especially crowdsourced data collection, has many potential uses, such as:

  • Implementing large scale observational studies so we can better understand how students with disabilities are provided services across the country
  • Validating evidence-based practices using nationally representative student samples
  • Enabling adequately powered group studies with low-incidence populations to be conducted
  • Dramatically increasing the number of direct and conceptual replications so we can efficiently determine which interventions work for whom, under what conditions (Coyne et al., 2016).

Ultimately, we envision crowdsourcing will democratize the research enterprise by enabling more and diverse researchers to be involved in large-scale research studies involving large and diverse participant samples that ask and answer critical research questions aimed at improving services for children with disabilities. By facilitating crowdsourcing of data collection across many researchers, we hope the Special Education Research Accelerator (SERA) will play a core role in democratizing the research enterprise in our chosen field — special education.


Bill Therrien

Co-Principal Investigator

William J. Therrien, Ph.D., BCBA, is a Professor in Special Education at UVA. He is an expert in designing and evaluating academic programming for students with disabilities, particularly in science and reading. Along with co-directing CASPER, Dr. Therrien co-edits Exceptional Children, the flagship journal in special education, and is Research in Practice Director for UVA’s Supporting Transformative Autism Research (STAR) initiative. Therrien assists Cook with developing infrastructure and supports for SERA, conducting the pilot study and assessing the usability and feasibility of using SERA to conduct future replication pilot studies.

Welcome to the Special Education Research Accelerator

Bryan G. Cook, William J. Therrien, Vivian C. Wong, Christina Taylor

We are pleased to welcome you to the Special Education Research Accelerator (SERA). SERA is a platform for crowdsourcing data collection in special education research across multiple research teams. When we read about the Psychological Science Accelerator, which crowdsources data collection for massive studies in the field of psychology conducted throughout the world, we began to think about the potential benefits of crowdsourcing in special education research. In essence, instead of a single research team conducting a study, crowdsourcing of data collection involves a network of research teams collecting data. Crowdsourcing allows researchers to flip “research planning from ‘what is the best we can do with the resources we have to investigate our question,’ to ‘what is the best way to investigate our question, so that we can decide what resources to recruit’” (Uhlmann et al., 2019, p. 713).

Given that there are relatively few students with disabilities, especially low-incidence disabilities, in schools, it is often difficult for special education researchers to obtain large, representative samples for their studies. Moreover, given limited grant funding, relatively few researchers in the field have the resources to conduct studies with large, representative samples on their own. Crowdsourcing data collection across many research teams seemed to us to be well-suited to address these and other challenges faced in special education research.

However, implementing crowdsourcing in special education research presents many challenges. Can interventions be conducted with fidelity across many different research teams? How will data be managed? How will implementation fidelity be measured? How will IRB issues be handled across multiple institutions? Fortunately, the National Center for Special Education Research (NCSER) funded an unsolicited grant for us to develop and pilot a platform for crowdsourcing research in special education (i.e., SERA) to examine these and other issues. By involving many different research teams in data collection for studies, SERA can generate large and representative study samples, involve diverse sets of researchers, and examine whether and how study findings vary across researchers and research sites, in ways that studies conducted by a single research team could not.

This website has two primary functions: (a) a public-facing site to provide information and resources related to SERA and crowdsourcing research in special education, and (b) a hub for research partners to access resources and interact with SERA staff related to ongoing SERA projects. We hope all of you explore the website to learn more about SERA, the team behind SERA, and our SERA research partners. Please check back periodically as we provide updates. For our research partners, please be on the lookout for an email over the next few weeks which will contain your login credentials, as well as provide additional information on navigating pilot study resources and materials.

We are currently preparing to conduct a crowdsourced randomized controlled trial that conceptually replicates Scruggs et al.’s (1994) study of acquisition of science facts. We will be examining the effect of instructor-provided elaborations and student-generated elaborations on science-fact acquisition for elementary students with high-functioning autism across more than 20 research partners and sites throughout the US.

We couldn’t be more excited about SERA, this website, and our upcoming pilot study. We plan on developing and refining SERA for future use in many different crowdsourced studies in special education. If you’re interested in potentially being involved in future studies, please send us a message using the contact form here. We hope that you’ll find the site interesting and helpful, and that (if you’re a special education researcher) you’ll consider being involved in a crowdsourced SERA study.