A recent opinion column (March 1, 2013) in the *Wall Street Journal*, by Shikha Dalmia and Lisa Snell of the Reason Foundation, criticized proposals for universal preschool programs on the basis of the experience of Georgia and Oklahoma, which have been among the leaders in expanding access to preschool.

Their main argument is that if universal preschool is so great, why haven’t Oklahoma and Georgia done better on various social and educational indicators? Citing statistics for various years, they argue that

“…Neither state program has demonstrated major social benefits…. A … realistic report card for the two states:

Lowering teen births: Oklahoma, Fail; Georgia, C.

Raising graduation rates: Oklahoma, Fail; Georgia, Fail.

Raising fourth-grade NAEP reading scores: Oklahoma, Fail; Georgia, C.

Closing the minority achievement gap: Oklahoma, Fail; Georgia, C.”

What this analysis fails to reckon with are the many other social and economic forces that frequently shift state educational and social indicators. Once one accounts for the statistical uncertainty due to such forces, it is very difficult from case studies of one or two states to determine the effects of even major educational and social interventions with the needed degree of precision.

In making this argument, I am going against ordinary human intuition. We human beings think in terms of specific examples and anecdotes. We love to generalize from a particular individual to everyone, or from a particular state to the world. Both liberals and conservatives do this. Politicians do it when they argue for re-election on the basis that the national or state economy is doing well. Voters do it when they reward politicians on that basis.

But from a social science perspective, case study analysis is quite difficult to do with sufficiently good statistical precision. If we’re trying to estimate whether any one policy in one state (or one metro area, or one school district) has made a “statistically significant” difference to the state, detecting such a difference requires both very large effects and statistically sophisticated procedures.

Let me take as an example the case of Oklahoma preschool and detecting its effects on 4th-grade test scores on the National Assessment of Educational Progress (NAEP). Prior to 1998, Oklahoma had a targeted preschool program that typically enrolled 11% or less of 4-year-olds in pre-K. In 1997-98, Oklahoma enrolled about 5% of all 4-year-olds in state-supported pre-K. In 1998-99, this jumped to 38% in the state pre-K program. The program then expanded more gradually until today, when Oklahoma enrolls about 74% of all 4-year-olds in state preschool. But the big jump occurred between 1997-98 and 1998-99.

Given all the other things affecting 4th-grade test scores, if we want to detect the aggregate effect of this preschool expansion, we need data on 4th-grade test scores for children who were age 4 “before” this 1998-99 “big jump” in preschool enrollment, and also 4th-grade test scores for children who were age 4 “after” this “big jump.” To help control for other factors affecting 4th-grade test scores, it would probably be best to consider the closest observations we can get “before and after” the big jump in preschool enrollment. If two observations are closer in time, we can hope that “other factors” affecting test scores will not have changed as much.

In the case of the NAEP, we happen to have reading and math results for 4th-graders in the winters of 2003 and 2005, which correspond to children who were 4-year-olds in the 1997-98 and 1999-2000 school years. Over that two-year period, Oklahoma preschool enrollment went up from about 5% of all 4-year-olds to around 51%, a jump of 46 percentage points.

What would we expect to happen to NAEP 4th-grade test scores due to an increase in Oklahoma preschool enrollment of 46 percentage points? According to the NIEER study of five state pre-K programs, Oklahoma’s program raises both literacy and math test scores at kindergarten entrance of pre-K participants by an “effect size” of about 0.35. (The literacy effect averages results across a vocabulary test and a “print awareness” test; the math effect comes from a single test.) An effect size expresses the test score change as a proportion of the standard deviation of test scores across different students. This corresponds to an increase of between 13 and 14 “percentile points”, which certainly seems “large” in that most parents would care about such an improvement for their child. (We’ll see later how this translates into benefit-cost analysis of the program.)
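As a quick illustration (my own sketch, not the author's calculation), the conversion from an effect size to percentile points follows from the standard normal CDF, assuming test scores are roughly normally distributed:

```python
# Convert an effect size (in standard deviation units) to percentile points,
# assuming test scores are approximately normally distributed.
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# A median (50th percentile) student moved up by 0.35 standard deviations:
new_percentile = 100 * normal_cdf(0.35)
print(round(new_percentile, 1))  # ~63.7, a gain of roughly 13-14 percentile points
```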

But this increase only occurs for the additional pre-K participants, who are 46% of the 4-year-old population. So we would expect kindergarten entrance scores over ALL 4-year-olds in Oklahoma to go up by only 46% times 0.35, or an increase in “effect size units” of about 0.16.

But the NAEP data are for fourth grade. The evidence suggests that we should expect to see some considerable depreciation of initial cognitive effects of preschool between kindergarten and fourth grade. This is true even for programs, such as Perry, Head Start, and the Chicago Child-Parent Center program, that show a considerable “bounceback” in program effects in adulthood.

It is certainly plausible that the initial effects of Oklahoma pre-K could depreciate by half by 4th grade. If so, the average “effect size” we would expect to see at 4th grade would be an increase in math and reading test scores by an effect size of 0.08.

On the NAEP, an effect size of 0.08 corresponds to an increase in the NAEP score of a little less than 3 points. (The NAEP standard deviation is around 35 points, and 0.08 times 35 is 2.8.)
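The chain of arithmetic above can be sketched as a back-of-envelope calculation. The inputs are simply the figures quoted in the text; the 0.5 fade-out factor is the "plausible" assumption, not an estimate:

```python
# Expected effect of Oklahoma's pre-K expansion on 4th-grade NAEP scores,
# using the figures quoted in the text.
effect_size_on_participants = 0.35  # NIEER estimate at kindergarten entrance
share_newly_enrolled = 0.46         # enrollment rose from ~5% to ~51% of 4-year-olds
fade_out_factor = 0.5               # "plausible" depreciation by 4th grade (assumption)
naep_sd = 35                        # approximate NAEP standard deviation, in points

cohort_effect_at_k = effect_size_on_participants * share_newly_enrolled
effect_at_4th_grade = cohort_effect_at_k * fade_out_factor
expected_naep_points = effect_at_4th_grade * naep_sd

print(round(cohort_effect_at_k, 2))    # ~0.16 effect size across the whole cohort
print(round(effect_at_4th_grade, 2))   # ~0.08 after fade-out
print(round(expected_naep_points, 1))  # ~2.8 NAEP points
```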

What do we actually observe for Oklahoma? From 2003 to 2005, Oklahoma’s 4th-grade reading NAEP test score increased by 0.5 points LESS than the U.S. score. But over the same time period, Oklahoma’s 4th-grade NAEP math test score increased by 1.9 points MORE than the nation’s as a whole.

But there are two sources of uncertainty in interpreting these estimates as showing the effect of Oklahoma’s preschool program. The first is that we don’t have an infinite sample size of students taking the NAEP test in Oklahoma and the U.S. Because we have a finite sample, there is some uncertainty about whether the Oklahoma and U.S. results accurately represent the population of students in Oklahoma and the U.S. in 2003 and 2005.

This sampling uncertainty by itself is sufficient to make it hard to tell whether the NAEP results deviate from what we would expect. The “95% confidence interval” for the Oklahoma reading test score decline of 0.5 points relative to the U.S. is plus or minus 3.3 points. Therefore, 19 times out of 20, the true number for the population as a whole would be somewhere in the range from minus 3.8 points to plus 2.8 points.

Similarly, for the test score gain in Oklahoma relative to the nation of 1.9 points from 2003 to 2005, the 95% confidence interval is plus or minus 2.7 points. If we had an infinitely sized sample, the probability is 95% that Oklahoma’s advantage over this time period would be somewhere between plus 4.6 points and minus 0.8 points.
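These sampling intervals can be reproduced directly from the point estimates and half-widths quoted above (a sketch of the arithmetic, not NAEP's own computation):

```python
# 95% confidence intervals for Oklahoma's 2003-2005 NAEP change
# relative to the U.S., from the quoted estimates and CI half-widths.
results = {
    "reading": (-0.5, 3.3),  # (Oklahoma minus U.S. change, CI half-width)
    "math":    ( 1.9, 2.7),
}

intervals = {s: (est - hw, est + hw) for s, (est, hw) in results.items()}
for subject, (lo, hi) in intervals.items():
    print(f"{subject}: {lo:+.1f} to {hi:+.1f} points")
# reading: -3.8 to +2.8 points
# math:    -0.8 to +4.6 points
```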

But this is only one source of uncertainty about the estimates. Even if we had an infinite sample for both Oklahoma and the U.S., we know that there could be a wide variety of influences (changes in demographic mix, changes in school funding, changes in cultural trends, etc.) that might cause test scores to change. The same is true of other educational and social trends such as high school graduation rates and teen pregnancy.

We can gauge the possible size of this additional uncertainty, due to unobserved variables that cause test scores or other state outcomes to fluctuate, by observing how much test scores or other social outcomes in various states fluctuate from year to year, above and beyond what we would expect due to limited sample sizes. When economists have looked at these fluctuations across states, they frequently find that such fluctuations increase uncertainty sufficiently to widen confidence intervals from two-fold to five-fold. That is what is found in Conley and Taber’s paper on this topic, which looks at influences on state college attendance. And it is found in Fitzpatrick’s paper on Georgia, which looks at how preschool influences 4th-grade NAEP test scores.

Thus, the true confidence intervals are not plus or minus 3 points. They are probably at least plus or minus 6 points, and possibly much larger.

We can intuitively understand this by noticing that test scores in states such as Oklahoma fluctuate by many points over relatively short time periods, even when preschool access is not changing. For example, from 2000 to 2003, Oklahoma’s 4th-grade math NAEP test scores increased by 5 points. From 1998 to 2002, Oklahoma’s 4th-grade reading NAEP test scores dropped by 6 points. These periods correspond to periods five years earlier, when these 4th-graders were 4 years old, during which preschool access in Oklahoma was not changing significantly. Apparently there are many other educational, demographic, and social trends that can cause quite dramatic short-run fluctuations in test scores.

The bottom-line statistical conclusion from this discussion is that we cannot tell whether the jump in preschool access from 1997-98 to 1999-2000 had the expected effect of increasing Oklahoma’s NAEP scores relative to the nation by 3 points. The estimated differences between Oklahoma and the U.S. changes in test scores probably have confidence intervals of plus or minus 6 points at least, and the changes we actually observe are therefore quite consistent with the expected effect of Oklahoma’s preschool program over this time period. Unfortunately, the confidence intervals are so large that we cannot tell whether the test score gains due to preschool match expectations, or are zero, or are larger than expected.
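This bottom line can be checked numerically. Using the ±6-point figure given above, neither the expected ~2.8-point effect nor a zero effect can be ruled out for either subject:

```python
# With intervals widened to at least +/- 6 points, the observed Oklahoma
# changes are consistent both with the expected effect and with zero.
expected_effect = 2.8   # expected NAEP gain from the preschool expansion
half_width = 6.0        # "at least plus or minus 6 points"

for subject, estimate in [("reading", -0.5), ("math", 1.9)]:
    lo, hi = estimate - half_width, estimate + half_width
    print(subject,
          "| expected effect inside CI:", lo <= expected_effect <= hi,
          "| zero inside CI:", lo <= 0.0 <= hi)
# Both conditions print True for both subjects.
```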

But, one could argue, if we can’t tell in Oklahoma’s aggregate statistics that preschool is making a difference, isn’t it the case that preschool is having too minor an effect for it to be a worthwhile investment? No, that does not follow. For example, suppose that the NIEER estimates are right that preschool increases kindergarten entrance test scores by about 13 or 14 percentile points. Not only is this large from the perspective of parents, but it would be predicted, based on Chetty et al.’s results, to yield large percentage effects in earnings. Chetty et al.’s results imply that such a kindergarten test score increase would be expected to increase adult earnings by about 7%. I think most people would regard this as a large effect.

From a benefit-cost standpoint, it certainly is a large effect. In my paper with Gormley and Adelstein, we used Chetty et al.’s numbers to calculate that a 1 percentile increase in kindergarten test scores would increase the present value of future earnings by about $1,500 (in 2005-06 prices). A 13 percentile increase would increase the present value of adult earnings by almost $20,000. This is for a preschool program that in Tulsa had total costs of $4,400 for a half-day program and $8,800 for a full-day program.
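The benefit-cost arithmetic above can be sketched as follows; the benefit-cost ratios at the end are my own illustrative division of the quoted figures, not the paper's exact calculation:

```python
# Rough benefit-cost sketch using the figures quoted in the text.
value_per_percentile = 1500  # present value of earnings per percentile point, 2005-06 $
percentile_gain = 13         # kindergarten test-score gain from pre-K
cost_half_day = 4400         # Tulsa half-day program cost
cost_full_day = 8800         # Tulsa full-day program cost

benefits = value_per_percentile * percentile_gain
print(benefits)                            # 19500, i.e. "almost $20,000"
print(round(benefits / cost_half_day, 1))  # ~4.4 benefit-cost ratio, half-day
print(round(benefits / cost_full_day, 1))  # ~2.2 benefit-cost ratio, full-day
```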

Why aren’t such large earnings effects easier to detect in 4th-grade test scores? In sum, these effects are hard to detect because: the effects are spread across an entire cohort of children, many of whom do not experience increased access to preschool, which lessens the average effect on the cohort; test score effects commonly fade but then re-emerge in improved adult outcomes; and many other forces cause test scores to fluctuate.

Does this mean it is impossible to tell whether preschool works? No, it is possible to detect preschool’s effects, but to do so we need better statistical evidence than is provided by a case study of one state’s aggregate data. We can learn a lot more if we have a program group and good comparison groups in the same state, which lessens the problem of other factors driving test scores. Or, we can learn a lot more if we have observations on many states that have dramatically expanded high-quality preschool during different time periods, which makes it easier to disentangle preschool’s influence from other forces that affect test scores.

Does this mean that preschool does not provide a miracle solution that immediately solves all social ills? Yes, it does mean that preschool is not a miracle solution, but that is quite different from saying that preschool does not pass a benefit-cost test. If preschool had effects on participants at kindergarten entrance that were, for example, five times as great as were estimated by NIEER for Oklahoma, say an effect size of 1.75 at kindergarten entrance, then even when these effects are dissipated by being measured over an entire cohort, and even if the test scores depreciated over time, and even if other factors affected test scores, then we could still readily detect preschool’s effects in aggregate state test score trends at 4th grade. But then we would be talking about preschool moving students from the 4th percentile to the 50th percentile, or from the 50th percentile to the 96th percentile, which are extraordinarily large effects, well beyond what it is reasonable to expect. But we don’t need effects anywhere near that large for preschool to pass a benefit-cost test, given that the cost of a quality half-day preschool program is only around $5,000 per year for one student.
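The percentile claims in this paragraph can be checked with the standard normal CDF (a sketch assuming normally distributed scores):

```python
# An effect size of 1.75 standard deviations, mapped to percentiles.
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(100 * normal_cdf(1.75)))   # ~96: a median student rises to the 96th percentile
print(round(100 * normal_cdf(-1.75)))  # ~4: the percentile that 1.75 SDs lifts to the median
```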

In sum, the key problem with the Dalmia/Snell article is that it does not provide strong evidence from a social science perspective. A case study of one state’s aggregate data will rarely provide convincing evidence for or against any social intervention, even in cases where that social intervention has a high benefit-cost ratio. Case studies of one or two states may be a persuasive political argument, but evaluating the benefits and costs of any policy usually requires other, better evidence that can more accurately detect policy-relevant effects.
