Establishing a Baseline

In this study, Classworks users (the treatment group) and students with no Classworks instruction (the comparison group) took the Star Early Literacy assessment in the fall. This is our baseline score because neither group had been exposed to Classworks instruction before taking the fall test.

For a study like this to be considered valid, the baseline test scores must be equivalent across the groups. If they aren’t, we can’t attribute differences in the post-test scores to the intervention.

In this study, the average baseline test score was 610.25 for the treatment group and 610.37 for the comparison group. The What Works Clearinghouse sets standards for baseline equivalence: the baseline effect size can be no larger than .25. In this study it is .002. Because this study established baseline equivalence, we proceeded with the research and applied the treatment.
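
The equivalence check described above can be sketched in a few lines of Python. This is an illustration, not the study's actual analysis. The function names are made up for this example. The pooled standard deviation of 60 is an assumed value (the article does not report it), chosen only because it reproduces the reported .002. The .05 cutoff reflects our reading of the What Works Clearinghouse standards, under which differences between .05 and .25 are acceptable only with statistical adjustment.

```python
def baseline_effect_size(mean_treatment, mean_comparison, pooled_sd):
    """Absolute baseline gap between the groups, in standard deviation units."""
    return abs(mean_treatment - mean_comparison) / pooled_sd


def wwc_baseline_category(effect_size):
    """Classify a baseline effect size (thresholds as we understand the WWC standards)."""
    if effect_size <= 0.05:
        return "equivalent"
    if effect_size <= 0.25:
        return "equivalent only with statistical adjustment"
    return "not equivalent"


# Means are from the article; the pooled SD of 60 is an ASSUMED value.
es = baseline_effect_size(610.25, 610.37, pooled_sd=60)
print(round(es, 3), wwc_baseline_category(es))
```

With these inputs the baseline gap is a tiny fraction of a standard deviation, far below either threshold.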

After a semester of exposure to Classworks instruction, both groups took the Star Early Literacy test again. The treatment group, the students who used Classworks, outscored the comparison group by 28 points. At a glance, this seems like a meaningful difference. But how can we be sure?

Understanding the Research Method

This is where it helps to look at the statistical analysis. Since we know the treatment and comparison groups had equivalent scores in the fall, we can argue that the difference in winter scores is at least partially attributable to Classworks instruction. But we still don’t know how meaningful that difference actually is.

To judge whether a difference is real, researchers typically conduct a statistical test (like a t-test), which produces a p-value. More recently, education researchers have begun reporting the effect size alongside the p-value.

Effect size is a simple way to quantify the magnitude of the difference between two groups. Traditional statistical tests incorporate sample size into the analysis, which can work against smaller studies: a real effect may fail to reach significance simply because few students took part.

This also means that the more participants a study includes, the more likely it is that even a small effect will produce a “significant” p-value. So be wary of large-scale studies that don’t present an effect size. Effect size expresses the magnitude of the difference between groups in standard deviations, independent of sample size and arbitrary thresholds. In this study, Classworks users scored 28 points higher than non-users on the post-test, which corresponds to an effect size of .27 standard deviations. For educational experiments, this is a rather large effect size. The What Works Clearinghouse considers an effect size greater than .25 to be substantively important.
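
A short Python sketch can make both points concrete: effect size is just the score gap divided by the standard deviation, while the p-value for the same effect shrinks as the sample grows. The assumptions here are ours, not the study's. The pooled standard deviation of 104 is an assumed value (the article does not report it), picked so the arithmetic lands near the reported .27. The p-value function assumes equal group sizes and uses a large-sample normal approximation instead of an exact t-test. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def effect_size(mean_difference, pooled_sd):
    """Standardized mean difference: the score gap in standard deviation units."""
    return mean_difference / pooled_sd


def approx_p_value(d, n_per_group):
    """Two-sided p-value for effect size d with equal group sizes.

    Uses a normal approximation to the two-sample t-test, which is
    reasonable only for fairly large groups.
    """
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 28 points is from the article; the pooled SD of 104 is an ASSUMED value.
print(round(effect_size(28, 104), 2))

# The same tiny, hypothetical effect (0.05 SD) at three sample sizes.
for n in (50, 500, 5000):
    print(n, round(approx_p_value(0.05, n), 4))
```

In this sketch the effect size never changes, yet the .05 effect moves from clearly non-significant with 50 students per group to below the .05 threshold with 5,000 per group.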

P-values Show Confidence

Often when researchers see an effect size, they will try to better understand the results by looking at a confidence interval and p-value.

Confidence intervals provide an upper and lower bound for the effect size. Our 95% confidence interval was .12 to .53. This means that if we ran this experiment 100 times and calculated a confidence interval each time, we would expect about 95 of those intervals to contain the true effect size. When a confidence interval lies entirely above zero, as ours does, the data support a positive effect: the treatment group’s advantage is unlikely to be zero.

The next thing they will look at is the p-value. While judging the entire efficacy of a study by only looking at a p-value can be problematic, incorporating it into the analysis, with effect size, adds another layer of evidence. A p-value tells us how likely it would be to see a difference at least as large as the one observed if the treatment had no effect at all. The threshold for significance is traditionally any p-value less than .05. In this study, the p-value was less than .0001, which means that if Classworks had no effect, there would be less than a one-in-ten-thousand chance of seeing a difference this large through luck alone.
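
What a 95% confidence interval promises can be checked with a small simulation. The sketch below is a toy model, not a reanalysis of the study: it draws simulated scores from normal distributions, assumes a true effect of .27 and 100 students per group (both chosen for illustration, since the article does not report sample sizes), and uses a common large-sample approximation for the standard error of the effect size. All function names are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def cohens_d(treatment, comparison):
    """Effect size: mean difference divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(comparison)
    pooled_var = ((n_t - 1) * stdev(treatment) ** 2 +
                  (n_c - 1) * stdev(comparison) ** 2) / (n_t + n_c - 2)
    return (mean(treatment) - mean(comparison)) / math.sqrt(pooled_var)


def d_confidence_interval(d, n_t, n_c, z=1.96):
    """Approximate 95% interval for d using a large-sample standard error."""
    se = math.sqrt((n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c)))
    return d - z * se, d + z * se


def coverage(true_d=0.27, n=100, trials=500, seed=0):
    """Share of simulated experiments whose interval contains the true effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Simulated scores in SD units; the treatment group is shifted by true_d.
        treatment = [rng.gauss(true_d, 1.0) for _ in range(n)]
        comparison = [rng.gauss(0.0, 1.0) for _ in range(n)]
        low, high = d_confidence_interval(cohens_d(treatment, comparison), n, n)
        hits += low <= true_d <= high
    return hits / trials


print(coverage())
```

The printed share should come out close to .95: the guarantee is about how often intervals built this way capture the true effect, not about where the effect from any single repeat will land.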

Understanding the Results

Taking into account the factors we’ve just discussed, we can confidently say this study is well designed. It established baseline equivalence. After the pretest, the treatment was applied. Then a post-test was administered and differences were observed between the groups. In this study, Classworks users scored 28 points higher than the comparison group on the winter test. That is highly significant! It has an effect size of .27, with a 95% confidence interval of [.12, .53], and, after conducting a statistical test, a p-value of less than .0001.

Once you’ve considered research design and outcomes, what other things should you consider when looking at a product’s research portfolio?

1. Is the company committed to ongoing research? Are they continually adding new studies to their portfolio?

2. Do their studies encompass their entire product offering, multiple grades and subjects, or are they focused on one area?

3. Do all of their studies use external measures to validate their analysis, like NWEA MAP Growth or Renaissance Star, or are the treatment and the measure published by the same provider?

4. Do the majority of their efficacy studies use a minimum dosage for the treatment group? In other words, are they selectively choosing usage and mastery thresholds and then comparing those students to students who did not use the program at all? While quite popular, studies like these are not included in the What Works Clearinghouse.
