Experimental vs. non-experimental evidence on early childhood programs

A recent evaluation of Tennessee’s pre-K program, conducted by Vanderbilt researchers Mark Lipsey, Kerry Hofer, Nianbo Dong, Dale Farran, and Carol Bilbrey, found very mixed results. The program’s initial effects on academic achievement at the end of pre-K seemed to mostly disappear within a year or two. There was, however, some evidence that the program reduced the percentage of students retained in kindergarten. Because grade retention reflects behavior as well as academic achievement, and independently predicts future success, these kindergarten retention effects could plausibly have long-lasting consequences for pre-K participants as they age.

Opponents of pre-K have already seized on this report. Tennessee state representative Bill Dunn argued that it shows the claimed benefits of pre-K are “hype”. According to Rep. Dunn,

“If you do a cost-benefit analysis on this extremely expensive program, you will come to the conclusion that it is like paying $1,000 for a McDonald’s hamburger. It may make an initial dent on your hunger, but it doesn’t last long and you soon realize you could have done a lot more with the money spent.”

Steve Barnett, Director of the National Institute for Early Education Research (NIEER), has already provided some useful analysis of the Tennessee study. As Professor Barnett points out, Tennessee’s program spends far less than would be needed to ensure adequate quality. Program spending per child for a full-day, school-year program is $5,814. NIEER estimates, based on studies by the Institute for Women’s Policy Research (IWPR), that a quality full-day pre-K program in Tennessee would cost $8,059 per child, about 39% more than Tennessee currently spends. (For the IWPR research, see “Meaningful Investments in Pre-K”; for NIEER’s estimates, see Table 7 in the Executive Summary of the latest NIEER report on state pre-K programs.)
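The arithmetic behind that 39% figure is simply the gap between the two spending levels relative to current spending:

$$\frac{\$8{,}059 - \$5{,}814}{\$5{,}814} = \frac{\$2{,}245}{\$5{,}814} \approx 0.39$$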

As Barnett points out, even the initial estimated effects of Tennessee’s pre-K program were smaller than those of some other programs with higher spending per child. If initial effects are small, they can more easily be offset by children’s K-12 experiences. In particular, we expect that teachers will intervene with kindergartners and first-graders to offset disadvantages, which will tend to help children who are behind, whether due to a lack of pre-K or other causes, catch up to other students. These offsetting interventions tend to reduce the measured effect of pre-K on student achievement. They also obscure one benefit of pre-K: it may reduce the need for such remedial intervention, which inevitably comes at some opportunity cost in teachers’ time and attention.

The main point I would add to Barnett’s analysis is that the Tennessee report illustrates the enormous difficulties of conducting a true “gold standard” random assignment experiment. One aspect of the report that will no doubt be cited by opponents is that the Tennessee study involved random assignment to treatment and control groups. Random assignment is argued to give more reliable results than other types of evaluation. And it is true that if there are no problems with data collection and implementation, random assignment experiments will reveal the true causal effects of a program: the treatment and control groups are expected, on average, to be identical in both observed and unobserved characteristics, and the larger the sample, the more closely that expectation is realized. Any remaining differences between the groups can therefore be attributed to the program intervention.
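To make that logic concrete, here is a minimal simulation sketch, using entirely made-up numbers rather than anything from the Tennessee data. With clean random assignment and full follow-up, even an unobserved family characteristic ends up balanced across the two groups, and a simple difference in mean outcomes recovers the true program effect:

```python
# Minimal sketch with made-up numbers (nothing here comes from the Tennessee study):
# under clean random assignment with full follow-up, an unobserved characteristic is
# balanced across groups, and the raw difference in means recovers the true effect.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                              # a large applicant pool
motivation = rng.normal(0, 1, n)         # an unobserved family characteristic
treat = rng.random(n) < 0.5              # random assignment to pre-K

true_effect = 5.0
score = 100 + 10 * motivation + true_effect * treat + rng.normal(0, 15, n)

# Randomization balances the unobserved characteristic (both means near zero)...
print(motivation[treat].mean(), motivation[~treat].mean())
# ...so the raw difference in mean scores is an unbiased estimate of the true effect
print(score[treat].mean() - score[~treat].mean())   # close to 5.0
```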

In the real world, it is quite difficult to carry out a perfect random assignment study. Inevitably, problems crop up. For the Tennessee study, the academic achievement results so far are based on a relatively modest sub-sample of the treatment and control groups for which parents agreed to answer questions and have tests administered to their children. In the first cohort, only 46% of pre-K participants agreed to such data collection, and a much smaller 32% of non-participants did so. These low consent rates are troubling, and so is the differential between the treatment and control groups. Consent rates increased in the second cohort to 74% for participants and 68% for non-participants, which is still lower, and more differential, than we would prefer.

The problem is that even if the original sample is randomly divided into treatment and control groups, so that the two groups are expected on average to be similar in unobserved characteristics, there is no reason to think that this remains true of the smaller sample that agreed to data collection. The “gold standard” nature of the evidence is weakened, or perhaps even eliminated.
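Here is a sketch of that problem, with hypothetical numbers chosen only to loosely echo the first-cohort consent rates. If consent to testing depends on an unobserved family characteristic, and consent rates differ between the two groups, the consenting subsample is no longer balanced and the simple treatment-control comparison drifts away from the true effect:

```python
# Self-contained sketch with made-up numbers of how selective, differential consent
# can undo the balance that random assignment created.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
motivation = rng.normal(0, 1, n)          # unobserved family characteristic
treat = rng.random(n) < 0.5               # random assignment
score = 100 + 10 * motivation + 5.0 * treat + rng.normal(0, 15, n)

# Suppose consent to testing rises with the unobserved characteristic and has a lower
# baseline in the control group (loosely echoing the 46% vs. 32% first-cohort rates).
base = np.where(treat, -0.2, -0.8)
consented = rng.random(n) < 1 / (1 + np.exp(-(base + motivation)))

t, c = treat & consented, ~treat & consented
print(consented[treat].mean(), consented[~treat].mean())  # consent rates differ by group
print(motivation[t].mean(), motivation[c].mean())         # consenters no longer balanced
print(score[t].mean() - score[c].mean())                  # no longer close to 5.0
```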

The Vanderbilt researchers do what they can to match treatment and control group members on observable characteristics. This is what is commonly done in any non-experimental study. But such matching can only control for observable differences between the treatment and control groups; unobservable differences may still remain. The Tennessee study probably suffers from this issue at least as much as the average non-experimental study.
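A small illustration of that limit, again with hypothetical data: after matching on an observed characteristic, the two groups look identical on that characteristic, yet they can still differ on an unobserved one that influenced who ended up in the analyzed sample:

```python
# Sketch with hypothetical data of the limit of matching on observables: the matched
# groups look alike on the observed covariate but can still differ on an unobserved one.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.integers(0, 4, n)                 # observed: e.g., a mother's-education category
u = rng.normal(0, 1, n)                   # unobserved: e.g., family motivation
# Membership in the analyzed "participant" sample depends on both x and u
participant = rng.random(n) < 1 / (1 + np.exp(-(0.5 * x - 1 + u)))

# Exact matching on x: for each participant, draw a non-participant with the same x
nonpart_by_x = {v: np.flatnonzero(~participant & (x == v)) for v in range(4)}
matched = np.concatenate([
    rng.choice(nonpart_by_x[v], size=int(np.sum(participant & (x == v))), replace=True)
    for v in range(4)
])

print(x[participant].mean(), x[matched].mean())   # observed covariate: now balanced
print(u[participant].mean(), u[matched].mean())   # unobserved covariate: still differs
```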

Other problems with this experiment stem from crossover between assigned status and the pre-K services actually received. According to the Vanderbilt research report, about 19% of the sample did not follow their random assignment: either they were assigned to pre-K but ended up not participating, or they were assigned to the control group but somehow succeeded in enrolling in the state pre-K program.

Most of the Tennessee report’s analysis is based on comparing students who actually participated in pre-K versus those who did not participate. Again, this is no longer a pure random assignment experiment, as there are many reasons why the 19% of the sample who broke with their assignment might differ from other families.
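To illustrate, here is a stylized sketch with made-up numbers and a made-up story about who crosses over (it is not how the Vanderbilt team modeled their data). Once participation depends partly on unobserved family characteristics, the comparison by original assignment is diluted by crossover, and the comparison by actual participation mixes in selection, so neither needs to equal the true program effect:

```python
# Stylized sketch, made-up numbers and a made-up crossover story (NOT the Vanderbilt
# team's model): with crossover tied to an unobservable, neither the by-assignment
# comparison nor the by-participation comparison equals the true effect.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
motivation = rng.normal(0, 1, n)                 # unobserved
assigned = rng.random(n) < 0.5                   # original random assignment
true_effect = 5.0

# Less-motivated assignees fail to enroll; more-motivated control families enroll
# anyway (roughly 19% of the sample crosses over in total).
drop_out = assigned & (rng.random(n) < 1 / (1 + np.exp(motivation + 1.7)))
cross_in = ~assigned & (rng.random(n) < 1 / (1 + np.exp(-motivation + 1.7)))
participated = (assigned & ~drop_out) | cross_in

score = 100 + 10 * motivation + true_effect * participated + rng.normal(0, 15, n)

print(np.mean(assigned != participated))                         # share crossing over, ~0.19
print(score[assigned].mean() - score[~assigned].mean())          # by assignment: diluted below 5
print(score[participated].mean() - score[~participated].mean())  # by participation: selection bias
```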

Finally, a complexity of the Tennessee study is that it is not really a single experiment. Rather, it is an average of separate experiments in 58 different pre-K programs across the state. In each of these programs, applicants exceeded available slots, so a random assignment lottery was run to determine enrollment. But as a result, the relative sizes of the treatment and control groups vary systematically across pre-K programs. The study corrects for this problem by weighting the treatment and control groups separately, so that the weighted population for each group has the same probability of coming from each of the local pre-K programs. This is one way to deal with the problem, but other approaches might yield different results. (For example, one could estimate treatment effects separately by pre-K program and then aggregate them using various weighting schemes.)
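Here is one plausible reading of that reweighting, sketched with hypothetical numbers (five stand-in sites rather than 58, and invented sample sizes): give each group's members weights such that each group's weighted distribution across sites matches the same target distribution:

```python
# Sketch with hypothetical numbers of one plausible reading of the reweighting described:
# weight the treatment and control groups separately so each group's weighted site
# distribution matches a common target distribution.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
rows = []
for s in [f"site_{i}" for i in range(5)]:       # stand-ins for the 58 local programs
    n_t = int(rng.integers(30, 120))            # treated group size varies by site...
    n_c = int(rng.integers(10, 60))             # ...and so does the control group size
    rows += [(s, 1)] * n_t + [(s, 0)] * n_c
df = pd.DataFrame(rows, columns=["site", "treat"])

# Target: each site's share of the overall (treatment + control) sample
target_share = df["site"].value_counts(normalize=True)

# Within each group, weight members so the group's weighted site shares hit the target
group_share = df.groupby(["treat", "site"]).size().div(df.groupby("treat").size(), level="treat")
df["weight"] = [target_share[s] / group_share[(t, s)] for t, s in zip(df["treat"], df["site"])]

# Check: the weighted site distributions now match across the two groups
for t in (0, 1):
    g = df[df["treat"] == t]
    print(t, (g.groupby("site")["weight"].sum() / g["weight"].sum()).round(3).to_dict())
```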

None of this is to say that the Tennessee results provide no useful information. The study’s researchers worked hard to extract what they could despite data limitations that reduced the methodological advantages of the initial random assignment. I regard the results as useful evidence that increases my belief that the initial academic test score effects of Tennessee’s pre-K program are probably, to a large extent, offset in the early elementary grades by the K-12 system’s interventions. The jury is still out on whether the program’s more behavioral effects on kindergarten retention will translate into large effects on high school graduation rates or other long-term outcomes.

But I don’t think that the Tennessee study’s results should be given greater weight than a well-done study with a non-randomly chosen control group. For example, I suspect that the Head Start studies that compare siblings who do and do not participate in Head Start use at least as reliable a method for determining the causal effects of pre-K as the Tennessee study. The final sample analyzed in the Tennessee study is very far from one in which we can be confident that the treatment and control groups are the same in unobserved characteristics. Therefore, it is unclear whether the estimated differences between the treatment and control groups represent the true effects of the program.

There is no magic pre-K evaluation methodology that trumps all others. In the real world, determining the true effect of a program is difficult. There are better and worse methodologies for doing so, and better and worse datasets. Random assignment studies are rarely even close to perfect, and other methodologies can also do a good job of providing comparison groups that are truly comparable. Our evaluation of the effects of pre-K programs should be based on the totality of the evidence from multiple studies with diverse data and methodologies, rather than solely on any one study that happens to incorporate random assignment.
