Does online learning work?

Some variation of “Does it work?” is the most common question I get from policymakers and reporters about online learning, blended learning, and other innovative instructional strategies that use technology.

In some cases, the question is easy to answer. These are cases in which, to borrow (and alter) a phrase from the Christensen Institute, online learning is filling a non-consumption gap. When a student takes an online Advanced Placement course that was not available at her school and earns a five on the exam, that course has clearly “worked.” Similarly, when a student at a blended early college high school such as Innovations in Salt Lake City or Oasis in California graduates having accumulated several college credits, that is another fairly clear instance of a successful outcome.

But when we’re talking about improving student performance in mainstream schools, it is in fact quite difficult to draw conclusions. Even when a press release headline claims impact, digging deeper reveals considerable complexity and ambiguity.

I’m going to use an example from Public Impact’s Opportunity Culture (OC) work to dig deeper into this issue, but I want to be completely clear that this is not a dig at Public Impact or its OC work. Although I don’t know the organization very well, my impression of it and its work is positive. In addition, one of the school districts in the study, Charlotte-Mecklenburg, was the subject of a case study that Evergreen developed, and we were impressed with the district’s use of technology in classrooms.

Still, it’s worth looking at the press release and the full study to examine more closely how policymakers and practitioners can assess whether an instructional change has “worked.” From the opening of the press release:

“Students in classrooms of team teachers led by Opportunity Culture “multi-classroom leaders” showed sizeable academic gains, according to a new study from the American Institutes for Research and the Brookings Institution. The team teachers were, on average, at the 50th percentile in the student learning gains they produced before joining a team led by a multi-classroom leader. After joining the teams, they produced learning gains equivalent to those of teachers in the top quartile in math and nearly that in reading, said the report, released on January 11, 2018, through the CALDER Center.”  (emphasis in original)

That sounds fairly straightforward, right?

But when we dig into the study to determine how the researchers came to this conclusion, we find that the process is anything but straightforward. I’m going to pull some paragraphs from the study, and if they make your eyes glaze over, well, that’s the point:

“Our baseline analysis measures the difference in achievement of students in classrooms exposed to OC models and comparison students who are in other classrooms. Our approach follows similar studies of measuring the classroom achievement of students exposed to certain types of teachers, such as Teach For America and the New York City Teaching Fellows program (Boyd, Lankford, Loeb, & Wyckoff, 2006; Hansen, Backes, Brady, & Xu, 2014; Kane, Rockoff, & Staiger, 2008). We estimate the following equation:

                                    y_ist = β0 + β1·y_is,t−1 + β2·X_i + β3·OC_i + ε_ist ,  (1)

where y_ist indicates the score on a math or reading exam (with separate regressions for each) for student i in school s in year t, y_is,t−1 is a vector of cubic functions of prior-year test scores in math and reading, OC_i is an indicator for whether student i was taught by a teacher in an OC role in that subject, and X_i contains a vector of student i’s characteristics (only available in CMS), including race, gender, and eligibility for FRPL. In addition, ε_ist represents a randomly distributed error term. In all analyses where we observe school-level indicators (i.e., when not including CCS), standard errors are clustered at the school-cohort level to allow for arbitrary within-school clustering of the error terms (Chetty et al., 2014a).”

This level of statistical analysis is necessary because the researchers are trying to measure growth among students who were not randomly assigned to the OC work. Without random assignment, those students can’t be straightforwardly compared to the full population of students. This situation is very common when researchers try to measure the impact of a program whose first priority is improving outcomes rather than answering a research question. In the real world, when organizations experiment with new approaches, they usually have to work with the teachers and schools willing to try them, which means they are unlikely to get a random sample of teachers or students.

Once you move away from a random sample, researchers have to perform all sorts of statistical gyrations to assess outcomes. These calculations are a metaphorical black box in the sense that almost nobody outside the research field can verify that they are accurate. That problem, in turn, is addressed by relying on highly reputable researchers, in this case from AIR and Brookings. But that doesn’t change the fact that the large majority of people who will be asked to evaluate these findings do not fully understand them.
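To make the quoted specification a little more concrete, here is a minimal sketch of this kind of value-added regression on simulated data. Every variable name, sample size, and effect size below is an illustrative assumption of mine, not a value from the AIR/Brookings study:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated students; all values here are illustrative assumptions.
prior = rng.normal(0.0, 1.0, n)          # prior-year test score, y_{is,t-1}
frpl = rng.integers(0, 2, n)             # FRPL eligibility (one element of X_i)
oc = rng.integers(0, 2, n)               # exposure to a teacher in an OC role
true_oc_effect = 0.10                    # assumed "true" effect, in score SDs
score = 0.6 * prior - 0.1 * frpl + true_oc_effect * oc + rng.normal(0.0, 0.5, n)

# Design matrix mirroring equation (1): intercept, cubic in prior score,
# student characteristics, and the OC exposure indicator.
X = np.column_stack([np.ones(n), prior, prior**2, prior**3, frpl, oc])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
oc_estimate = beta[-1]
print(f"estimated OC coefficient: {oc_estimate:.3f}")
```

With a large simulated sample and no selection problem, the recovered coefficient lands near the assumed 0.10. The study’s difficulty is precisely that real OC classrooms were not assigned at random, so the simple version of this regression can be biased.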

Even with those statistical gyrations, the researchers point out another issue (which doesn’t make it into the press release):

In a section above, we discussed how the general rise in test scores in OC treatment schools could lead to an upward bias in OC estimates if this rise is not fully caused by OC. To test the extent of this potential bias, we randomly select non-OC teachers from within treated schools in 2014-15 and 2015-16 and call them “placebo” OC teachers. We then re-run our models from Tables 5 and 6 with an additional role of “placebo” OC teacher. Because these teachers are not real OC teachers and represent no genuine treatment, we expect the point estimates on these placebo OC teachers to be statistically indistinguishable from zero.

Results are shown in Table 9. For both math and reading, the point estimates for placebo OC teachers are strikingly similar to the point estimates for team teachers, with the exception of the school-by-year fixed effects model in Column 6. In math, Column 6’s point estimate for the placebo teachers is -0.02 and not statistically significant, while for reading it is 0.05 and statistically significant. Because the Column 6 specification passes the placebo test in math and fails it in reading to a lesser degree than the other specifications, we take these as our preferred estimates.

Put together, the results from Table 9 suggest that many of the positive OC point estimates are due to schoolwide improvements that are not necessarily due to OC, though as we discuss further below, it is also possible to interpret these as positive spillovers from the implementation of OC models. (emphasis added)

So the researchers can’t fully explain the apparent schoolwide improvement, and it’s unclear whether those gains can be attributed to the work of Public Impact.
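The placebo logic is easy to illustrate. In the hypothetical simulation below (my own construction, not the study’s data), treated schools improve schoolwide but OC classrooms themselves add nothing; a naive regression then attributes a positive “effect” to real OC classrooms and to randomly chosen placebo classrooms alike:

```python
import numpy as np

rng = np.random.default_rng(1)
n_schools, per_school = 40, 200
school = np.repeat(np.arange(n_schools), per_school)
treated_school = school < n_schools // 2               # half the schools adopt OC

# Hypothetical setup: a 0.10 SD schoolwide rise in treated schools, but
# *zero* additional effect from the OC classrooms themselves.
oc = treated_school & (rng.random(school.size) < 0.5)  # real OC classrooms
prior = rng.normal(0.0, 1.0, school.size)
score = 0.6 * prior + 0.10 * treated_school + rng.normal(0.0, 0.5, school.size)

def naive_effect(indicator):
    """OLS coefficient on an exposure indicator, controlling only for prior score."""
    X = np.column_stack([np.ones(school.size), prior, indicator.astype(float)])
    beta, *_ = np.linalg.lstsq(X, score, rcond=None)
    return beta[-1]

# Placebo: random non-OC classrooms inside the same treated schools.
placebo = treated_school & ~oc & (rng.random(school.size) < 0.5)

oc_est = naive_effect(oc)
placebo_est = naive_effect(placebo)
print(f"OC estimate:      {oc_est:.3f}")       # positive despite zero OC effect
print(f"placebo estimate: {placebo_est:.3f}")  # also positive: schoolwide bias
```

When both estimates come out positive and similar, as they do here by construction, the regression is picking up the schoolwide rise rather than anything specific to OC classrooms, which is exactly the ambiguity the researchers flag.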

Where does this lead us? First, to reiterate my earlier point, these observations are not meant as a negative comment on Public Impact. In fact, the organization should be commended for undertaking this research and for the transparency of the public report (although the press release might be considered misleading).

Second, and more importantly, it suggests that requiring strong evidence of impact from these types of research studies before moving ahead with innovative methods is simply untenable. Such a requirement would make it far too easy for people who want to slow or stop new approaches to say “we need more evidence.”

(For any readers who believe that won’t happen, I suggest a quick Internet search for the way that companies and politicians sowed confusion about the impacts of cigarettes and acid rain, using the argument that “more evidence is needed” long past the time when the evidence was abundantly clear.)

If we’re not using that type of evidence, however, how should decisions about new instructional strategies be made?

In my view, the understanding that formal academic research has an important but limited role in evaluating innovative methods means that we have to trust school and district leaders to make the best decisions for their students and schools. Yes, this approach has issues and is not perfect, but I believe it is the best option.

In light of that view, a statement I recently heard from a highly experienced leader in Clark County, Nevada, is instructive. This was in a meeting and I didn’t write it down, but, paraphrasing, it was essentially:

“Our graduation rates have risen and I am completely sure that is because of our use of blended learning, including our digital learning credit recovery programs. I don’t have the research to show that, because we don't have the time and resources to invest in the studies. So I can’t show you a document that says this. But I know it to be the case.”

Is that satisfying? I suppose it depends on what you believe. If you believe that new programs should always be supported by the type of research done by AIR and Brookings that I referenced at the start of this post, then the quote may not seem strong enough. But I suspect that most educators, who see the need for evidence but also recognize the shortcomings of that type of research, would support this view.

There may be a viable in-between approach as well. Gretchen Morgan, former Associate Commissioner of Innovation at the Colorado Department of Education, has written about the need for short-cycle innovation to prototype and rapidly test ideas in classrooms. This topic has been discussed elsewhere (e.g., Creating a Culture of Innovation in Education Week). Although these efforts tend to focus on individual teachers, and my sense is that sustainable and scalable efforts have to be made at the school level, they may point a way forward that strikes a balance between unrealistic calls for more proof of impact and the concern that investments are being made with little regard for outcomes and ROI.