One key point about evaluation

The National Institute for Early Education Research (NIEER) recently published a useful report on evaluation of early education programs, “Improving Early Education Programs through Data-based Decision Making,” by Shannon Riley-Ayers, Ellen Frede, Steve Barnett, and Kimberly Brenneman.

The report describes a range of evaluation designs, including randomized experiments, regression discontinuity, and nonequivalent comparison groups, and discusses the pros and cons of each. It also discusses the costs of different types of evaluation, and an appendix reviews student tests that could be used in an evaluation.

Over the years, I have written extensively on how to evaluate business incentives and other economic development programs. My new book, Investing in Kids, discusses at length the evaluation of both early childhood programs and business incentive programs.

Out of all this verbiage in reports and books, what is the most important point? What is it most important for policymakers and early childhood advocates to understand about evaluation?

My choice for a key point is this:  not all evaluations are created equal. In fact, evaluation quality is extremely unequal. One evaluation with a good design trumps 50 evaluations with a mediocre design.

More specifically, one evaluation with a truly comparable comparison group is far more reliable than 50 evaluations with the usual comparison groups. A very common design is to compare what happens to participants in an early childhood program with what happens to non-participants. This is usually a mediocre design. Much better are designs that yield truly comparable groups through random assignment or through techniques such as regression discontinuity.

For early childhood programs, the problem with the usual comparison groups goes by the jargon term “selection bias.” Children end up participating for a reason: their parents choose to enroll them, or the programs select them. Frequently, programs select children they judge to be “needier.” These participants will not be comparable to children who do not participate, and they will remain non-comparable even if the research study statistically controls for as many characteristics of the children and their families as possible.

The consequence of this non-comparability is that estimates of program effects will be biased. If program participants are needier, then over time they will tend to do worse than non-participants. This creates a negative bias in the estimated effects of the early childhood program: the estimates will fall short of any true positive effects, and may even be negative. The bias may also grow over time. If participants are needier, the negative effects of their social environment will cumulate, so the initial estimated positive effects will appear to fade even if there is no true fading of program effects.
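To make this logic concrete, here is a minimal simulation sketch (my own illustration, not from the NIEER report) in which a program with a constant true effect is evaluated by naively comparing participants to non-participants. All of the numbers are made up; the point is only the pattern of bias.

```python
# Illustrative simulation of selection bias in a naive participant vs.
# non-participant comparison. Effect sizes, selection strength, and the
# fade rate are assumptions chosen only to show the pattern.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# "Neediness" is an unobserved family/child factor that lowers outcomes.
neediness = rng.normal(size=n)

# Needier children are more likely to be selected into the program.
participates = rng.random(n) < 1 / (1 + np.exp(-1.5 * neediness))

TRUE_EFFECT = 5.0  # true, constant program effect in test-score points

def observed_gap(years_after_program):
    """Naive participant-minus-nonparticipant gap a given number of years out."""
    # Neediness drags outcomes down more as its effects cumulate over time.
    outcome = (TRUE_EFFECT * participates
               - 3.0 * (1 + years_after_program) * neediness
               + rng.normal(scale=10, size=n))
    return outcome[participates].mean() - outcome[~participates].mean()

for t in range(4):
    print(f"{t} years out: naive estimate = {observed_gap(t):5.1f} "
          f"(true effect = {TRUE_EFFECT})")
# The naive estimates fall well below the true effect and drift further
# downward over time -- the apparent "fade-out" is an artifact of the
# non-comparable comparison group.
```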

For evaluations to be reliable, the comparison group has to be truly comparable to the program participants. This requires that the comparison group’s non-participation be due to some factor that has no relationship to expected child outcomes. Random assignment is one such factor: whether a child participates or not is determined by a randomly generated number.

Regression discontinuity is another method that yields good comparison groups. In the case of pre-k programs, we compare children who are just entering the pre-k program with children who participated the previous year and are now entering kindergarten. Selection bias is not an issue, because all of these children were selected by their families and the program for participation in pre-k. Whether a child is just entering the program or has completed it and is entering kindergarten depends solely on where the child’s birthday falls relative to the program’s age cutoff. Age by itself also affects student outcomes, but we can control for age’s gradual effects on performance. Any abrupt shift in performance at the cutoff is then most plausibly due to the fact that children slightly older than the cutoff a year ago got a year of the program, whereas children slightly younger had to wait until this year. Appendix C of the recent NIEER report provides a useful diagram illustrating these ideas.
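For readers who want to see the mechanics, here is a minimal regression discontinuity sketch on simulated data. The variable names, the data, and the effect size are my own illustrative assumptions, not numbers from the Oklahoma studies or the NIEER report; each simulated child is given the same fall test, either at pre-k entrance or at kindergarten entrance.

```python
# Illustrative regression discontinuity sketch. `age_rel_cutoff` is age in
# months relative to the pre-k entry cutoff; children above the cutoff a year
# ago ("treated") have completed a year of pre-k and are entering kindergarten,
# while children below it are just entering pre-k.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

age_rel_cutoff = rng.uniform(-12, 12, size=n)    # months from the cutoff
treated = (age_rel_cutoff >= 0).astype(float)    # completed one year of pre-k
TRUE_JUMP = 4.0                                  # assumed program effect

# Test scores rise smoothly with age, plus a jump at the cutoff for treated kids.
score = 50 + 0.8 * age_rel_cutoff + TRUE_JUMP * treated + rng.normal(scale=8, size=n)

# Local linear RD: regress score on treatment, age, and treatment x age so the
# age slope can differ on each side of the cutoff; the treatment coefficient
# is the estimated discontinuity at the cutoff.
X = np.column_stack([np.ones(n), treated, age_rel_cutoff, treated * age_rel_cutoff])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
print(f"Estimated jump at the cutoff: {coef[1]:.2f} (true value: {TRUE_JUMP})")
```

The same regression, run each year on newly collected entrance-test data, is what would allow the kind of ongoing program monitoring described below.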

Regression discontinuity is relatively straightforward to do. It does require policymakers to do one thing that no doubt seems a little weird: give the same tests at pre-k entrance that are given at kindergarten entrance. To allow for the comparison between the two groups, the outcome measures must be the same.

One big advantage of regression discontinuity is that it can be done on an ongoing basis for program evaluation of pre-k programs. A state or local area that wants to monitor performance of a pre-k program simply has to regularly collect data on the same tests at pre-k entrance and kindergarten entrance for participants in the pre-k program. This allows the effects of the pre-k program on kindergarten readiness to be measured. Policymakers can compare the effectiveness of different pre-k program designs or curricula to see which works better with different groups of children.

I should emphasize that such evaluations do not need to look solely at “hard skills” (e.g., the skills measured by literacy and math tests). A regression discontinuity evaluation can also look at program effects on “soft skills” (e.g., social and character skills).

There are other ways of getting truly comparable comparison groups. Comparing siblings who participated in an early childhood program with siblings who did not controls for unobserved family factors that may affect child outcomes. (See, for example, papers on Head Start by David Deming, and by Janet Currie and Duncan Thomas.) Sometimes differences in program participation across geographic areas can also be used, if there are good reasons to think that the geographic differences would not have independent effects upon child outcomes.
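A small sketch of the sibling-comparison idea (my own illustration, not drawn from the Deming or Currie-Thomas papers): an unobserved family factor affects all siblings, and needier families are more likely to enroll a child, so the naive comparison is biased, while differencing between siblings within the same family removes the family factor. All numbers are made up.

```python
# Illustrative sibling comparison: within-family differences remove an
# unobserved family factor that biases the naive across-family comparison.
import numpy as np

rng = np.random.default_rng(2)
n_fam = 50_000

family_factor = rng.normal(size=n_fam)   # unobserved, shared by both siblings
enrolls = rng.random(n_fam) < 1 / (1 + np.exp(2.0 * family_factor))  # needier families enroll
TRUE_EFFECT = 3.0

# Each family has two siblings; in enrolling families, only the younger sibling
# attended the program (e.g., it did not yet exist for the older one).
y_older = 4.0 * family_factor + rng.normal(size=n_fam)
y_younger = 4.0 * family_factor + TRUE_EFFECT * enrolls + rng.normal(size=n_fam)

# Naive estimate: compare program children to all other children.
y_all = np.concatenate([y_older, y_younger])
in_program = np.concatenate([np.zeros(n_fam, bool), enrolls])
naive = y_all[in_program].mean() - y_all[~in_program].mean()

# Sibling estimate: within-family difference among enrolling families.
sibling = (y_younger - y_older)[enrolls].mean()

print(f"Naive estimate:   {naive:5.2f}")
print(f"Sibling estimate: {sibling:5.2f}  (true effect = {TRUE_EFFECT})")
```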

However, the key point is this: some evaluations are much more reliable than others. In reaching conclusions about the effects of early childhood programs, we should rely far more heavily on rigorous evaluations than on non-rigorous ones. And in designing evaluations for ongoing program improvement, we should plan for designs that will give us reliable conclusions.

The reliability of the evaluations is why researchers on early childhood programs place so much emphasis on the long-term results from the Perry Preschool Program, the Abecedarian Program, the Nurse-Family Partnership, and the Chicago Child-Parent Center Program: these evaluations had truly comparable comparison groups. It is also why researchers place so much emphasis on the regression discontinuity results for pre-k programs in Oklahoma and other states. Even if people are tired of hearing about these studies, their results are so much more reliable than most other evaluations that they still deserve special emphasis.

About Tim Bartik

Tim Bartik is a senior economist at the Upjohn Institute for Employment Research, a non-profit and non-partisan research organization in Kalamazoo, Michigan. His research specializes in state and local economic development policies and local labor markets.