Grading the Pre-K Evidence

Russ Whitehurst of Brookings has a new blog post that outlines his views on pre-K research in more detail. The title is “Does Pre-K Work? It Depends How Picky You Are”.

Whitehurst reaches the following conclusion:

“I conclude that the best available evidence raises serious doubts that a large public investment in the expansion of pre-k for four-year-olds will have the long-term effects that advocates tout.

This doesn’t mean that we ought not to spend public money to help families with limited financial resources access good childcare for their young children. After all, we spend tax dollars on national parks, symphony orchestras, and Amtrak because they make the lives of those who use them better today. Why not childcare?

It does mean that we need public debate that recognizes the mixed nature of the research findings rather than a rush to judgment based on one-sided and misleading appeals to the preponderance of the evidence.”

Therefore, it is fair to say that Whitehurst is marketing doubt. Maybe pre-K doesn’t work. Maybe we shouldn’t move forward with large-scale programs, and instead should undertake more limited measures or do more research.

He admits that opponents of his position, who believe that pre-K does work, are also basing their position on scientific research, and wonders: “how is that different individuals could look at the same research and come to such different conclusions?”

His framing of the issue is that he is just more “picky” about what research he believes. In his view, his opponents, when claiming that the “preponderance” of evidence supports pre-K, are relying on weak research, whereas he is relying on the strongest research in saying that pre-K does not work.

In his view, the strongest research, to which he gives straight “As” for quality, is the recent Head Start randomized control trial (RCT) and the recent Tennessee RCT. All the other evidence for the effectiveness of pre-K, in his view, is inferior in research rigor (“internal validity”) and/or less policy relevant to today’s policy choices (“external validity”).

Let me make some summary comments upfront before getting into the details of Whitehurst’s research review.

First, I think all researchers seek to be “picky” in reviewing research, in trying to assess the rigor of the research, and its relevance to the policy question at hand. However, even researchers who are equally “picky” can disagree about what the strengths and weaknesses are of various studies.

Second, in my view, Whitehurst significantly overstates the quality and relevance of the Tennessee RCT, and the relevance of the Head Start RCT. He’s not “picky” enough!

Third, Whitehurst underplays the findings and understates the research strengths and relevance of many other research studies. He also omits recent relevant research.

Fourth, Whitehurst never grapples with a fundamental issue in pre-K research: it does not take much of a pre-K impact on test scores for pre-K’s predicted earnings benefits over an entire career to justify considerable costs. Effects he characterizes as “small” are in many cases more than sufficient for programs to pass a benefit-cost test.

Fifth, Whitehurst never discusses another fundamental issue in pre-K research: test score effects often fade as children go through the K-12 system, but then effects on adult outcomes such as educational attainment or earnings re-emerge despite the fading. The faded test score effects are often poorer predictors of adult outcomes than the initial post-program test score effects. This means that studies with good evidence on adult outcomes, and studies with good evidence on immediate post-program outcomes, both gain importance relative to studies that only follow children through elementary school. The elementary school test data add some evidence, but not as much as might at first appear.

Sixth, if Whitehurst believes in the usefulness of child care services, the most logically consistent position is that he should back expanding programs such as Educare (full-time child care and pre-K from birth to age 5) to all low-income children. In my book Investing in Kids, I argued that the research evidence on child care and on the Abecedarian program, which was very similar to today’s Educare program, suggested that a program such as Educare would have earnings benefits for parents that significantly exceeded program costs.

So why not expand Educare, which would help low-income parents increase their work and their educational attainment, leading to significant boosts to parents’ short-run and long-run earnings? If Educare also helps improve the children’s long-run prospects, so much the better. (And in fact Whitehurst seems to like the Infant Health and Development Program research, which suggests that an Educare-style program would have such benefits for low-income children.)

I estimate that an Educare program for all families below the poverty line would cost around $70 billion per year, but would have parental earnings benefits significantly greater than that. This proposal would be consistent with a previous proposal made by colleagues of Whitehurst at Brookings. Such a proposal goes far beyond the cost of any preschool proposal made by the Obama Administration. But I think it would be a logically consistent proposal for Whitehurst to make. Whitehurst should be arguing that the Obama Administration preschool proposal is underfunded, not sufficiently comprehensive in its birth-to-five services, and insufficiently targeted on low-income families. (Note: this is not my position; for example, I’m in favor of universal pre-K. What I am describing is the position that is most consistent with Whitehurst’s own review of the research evidence.)

Before I get into the details, one more important headline issue: why should policymakers or journalists or other policy “influencers” believe my position, that the best evidence supports pre-K’s effectiveness, rather than Whitehurst’s position, that the research is more uncertain? The best way is to simply look at the research studies on your own, and make up your own mind, but how is one supposed to do this without an extensive background in statistics and research methodology?

Whitehurst’s position of doubt has a structural advantage in the debate in the public square. Some researchers argue that pre-K works, others say it may not: the headline news to an outside observer is that doubt wins the debate as long as the side that is promoting doubt has a consistent position that cites evidence. It’s easier to spread doubt than to assuage doubt.

Therefore, I would also make the following argument: many other researchers familiar with the pre-K research evidence disagree with Whitehurst, and agree that pre-K can work. Among pre-K researchers, Whitehurst’s weighting of the evidence is a distinct minority position.

Consider a recent research summary, “Investing in Our Future: The Evidence Base on Preschool Education”, which was authored by 10 prominent researchers on pre-K from a variety of disciplines and universities. This study concluded the following:

“Recent meta-analyses drawing together the evidence across decades of evaluation research now permit us to say with confidence that preschool programs can have a substantial impact on early learning and development….

While there is clear evidence that preschool education boosts early learning for children from a range of backgrounds, we also see a convergence of test scores during the elementary school grades so that there are diminishing differences over time on tests of academic achievement between children who did and did not attend preschool. Yet the most recent research is showing an accumulation of evidence that even when the difference in test scores declines to zero, children who have attended preschool go on to show positive effects on important adolescent and young adult outcomes, such as high school graduation, reduced teen pregnancy, years of education completed, earnings, and reduced crime…

Although random assignment of children or parents to program and comparison groups is the “gold standard” for program evaluation, sometimes this is not possible. One of the most frequently used alternative methods…is called a Regression-Discontinuity Design. …Comparing kindergarten entry achievement scores for children who have completed a year in Pre-K with the scores measured at the same time for children who just missed the birthday cutoff and are about to enter Pre-K can be a strong indicator of program impacts…Other methods used in recent nonexperimental preschool studies include propensity score weighting, individual, sibling or state fixed-effects, and instrumental variables analysis…Evaluations that select comparison groups in other ways should be approached with healthy skepticism.”

Therefore, it is clear that other researchers weight the evidence quite differently from Whitehurst. This is in part because other researchers, while noting that RCTs are the “gold standard”, view other studies as having sufficiently good comparison groups that they provide good “silver standard” evidence. Other researchers are also aware that few RCTs are so perfect that they are pure “gold standard”; in practice, we find that the gold is almost always alloyed with some less precious metal.

Now, onto the details. To do this, I’ll regrade the various studies that Whitehurst examines, along with adding one recent study that he omits. I’ll use his criteria of looking at “internal validity” (intuitively, the reliability of the research in identifying some causal effect of some program), and “external validity” (intuitively, the study’s relevance to current policy debates). I’ll also add my own take on what the reported impact shows.

Programs from the 1960s and 1970s

Program/Research	Reported impact after initial year (Whitehurst)	Reported impact after initial year (Bartik)	Internal validity (Whitehurst grade)	Internal validity (Bartik grade)	External validity (Whitehurst grade)	External validity (Bartik grade)
Perry	+	+	A-	A-	C	B
Abecedarian	+	+	B+	B+	C	B
Chicago CPC	+	+	C	B-	B	B+
Head Start in 60s	+ (for mortality)	+ (for mortality and ed attainment)	B	B	C	C

For Perry and Abecedarian, I would upgrade their “external validity/policy relevance” from C to B because I think that these program designs and the studies’ results are quite relevant to what we are doing today. First, the design of these two programs is similar to what we are doing today. Abecedarian is quite similar to today’s Educare program. Perry is similar in many respects to today’s pre-K programs. Class sizes in Perry were smaller than in most of today’s programs, and the program ran for two years, versus one year for most programs today; both differences would tend to reduce the impact of today’s programs below Perry’s estimated adult earnings impact of 19%. However, most studies suggest modest impacts of class size on pre-K outcomes and that two years of pre-K does not double benefits, so we would not expect Perry to have effects far beyond those of current pre-K programs. And many of today’s pre-K programs are full-day, which has been shown to have larger impacts than half-day.

Another aspect of Perry and Abecedarian that increases their relevance is that they contain direct evidence on effects on adult outcomes. Because effects on test scores often fade, and these faded test score effects may not reflect adult outcomes, this makes these studies more important.

In addition, it is true that Perry and Abecedarian are experiments that mostly compare pre-K with no pre-K, whereas today any evaluation of a new pre-K program compares it with a mix of no pre-K and other subsidized pre-K programs that some control group members attend. But this merely complicates the analysis of the impact of a new pre-K program and puts a premium on the program being as high-quality as existing programs. A real world benefit-cost analysis of a new pre-K program will adjust the benefits and costs downwards for substitution of the new program for existing programs. For example, the RAND Corporation did this in its analysis of the effects of a universal pre-K program. Because the options in the pre-K market are always changing, even today, impact analyses of a new pre-K program will have to make adjustments for changes in the options in the pre-K market. There are some scientific advantages to having “clean” estimates of the impact of pre-K versus no pre-K, which Perry and Abecedarian provide.

For Chicago CPC, Whitehurst fails to note that much of the variation in pre-K use was due to neighborhood of residence, and hence can be viewed as being in part a “natural experiment”. Furthermore, the CPC researchers have gone to great lengths to try to correct for any remaining selection bias in the estimates, and have found that a variety of methods for doing so yield similar results. Therefore, I think the internal validity of CPC is higher than Whitehurst’s grade of C. CPC is in-between a “B grade” study based on a natural experiment, and a “C grade” study that simply controls for observable characteristics.

In addition, the external validity and policy relevance of CPC is quite high, as the program was run by Chicago Public Schools and is quite similar to pre-K programs run in many state and local areas. So Whitehurst’s grade there also seems too low. The study also includes direct evidence on adult earnings effects and adult educational attainment effects.

As for Head Start, Whitehurst’s table says that the Ludwig-Miller study he cites only finds long-term impacts on mortality. But the study also finds some long-term impacts on educational attainment, which he does not note.

Programs from the 1980s

Program/Research	Reported impact after initial year (Whitehurst)	Reported impact after initial year (Bartik)	Internal validity (Whitehurst grade)	Internal validity (Bartik grade)	External validity (Whitehurst grade)	External validity (Bartik grade)
Head Start in 1980s	+	+	C	B-	A	C
Infant Health and Development	+ (impacts only for disadvantaged children with close to normal birth weights)	+ (impacts only for disadvantaged children with close to normal birth weights)	A	A	B	B-

For the Head Start sibling studies from the 1980s, Whitehurst argues that the sibling comparison will be biased towards finding effects of Head Start. However, as discussed in the research, there are reasons to think that the bias in which sibling gets into Head Start could go in either direction. Also, research such as Deming’s tries to look very closely at pre-existing characteristics to see whether there is a significant bias, and does not find strong signs of bias sufficient to overturn the main results.

As for policy relevance/external validity, I regard many of the pre-K programs that we are pursuing today at the state level, and considering trying to encourage via federal policy, as much more educationally focused than was traditionally the case for Head Start, although this may be changing for Head Start in recent years. Therefore, it is unclear to me whether the effects of Head Start are as relevant as Whitehurst thinks to current state and local pre-K programs.

On IHDP, it is true that the results are only significant for disadvantaged children of close to normal birth weight. However, the near-normal birth weight group is the group most relevant to current debates about early childhood education. As for policy relevance/external validity, IHDP is really a test of an early child care program, at ages 1 and 2. This is relevant to evaluating a program such as Educare, but not to evaluating most current proposals for pre-K at age 4.

Recent Programs

Program/Research	Reported impact after initial year (Whitehurst)	Reported impact after initial year (Bartik)	Internal validity (Whitehurst grade)	Internal validity (Bartik grade)	External validity (Whitehurst grade)	External validity (Bartik grade)
Head Start RCT	None	None statistically signif., but point estimates consistent with important effects	A	A	A	C
District programs, e.g., Tulsa	Unknown (research design doesn’t allow follow-up after pre-K)	Unknown, but would predict sizable adult earnings effects based on test scores	B	B	B	B+
Georgia & OK Universal	+ (very small at best)	+ (large enough to have high benefit-cost ratio)	B	B-	A	B+
Tennessee Pre-K	–	–	A	C	A	B+
North Carolina More at Four	Not included	+	NA	B	NA	B+

The recent Head Start RCT does not show much in the way of statistically significant effects from kindergarten on, but some of the point estimates are consistent with effects that might be important. For example, the point estimates on the cognitive tests that are consistently given over time in the experiment show effects at third grade that would predict about a 1% increase in adult earnings, which adds up to a lot of money over an entire career. Given the uncertainty, the true effect could be 2 or 3%, or could be zero or negative – we just can’t tell.

As mentioned before, one issue that Whitehurst does not grapple with is that even quite small test score effects would predict adult earnings effects that might be very important in a benefit-cost analysis. This makes research more difficult because it is hard to rule out test score effects that could be large in that they might make the program pay off. Even with a relatively large sample, such as in the Head Start RCT, the studies are “underpowered” for detecting some effects that might be relevant.

This problem is exacerbated because studies frequently find that test score effects at third grade underpredict the long-run earnings effects of pre-K. Test score effects often fade but then re-emerge in better adult outcomes than would have been predicted at third grade. This increases the uncertainty about the results beyond what is described by the Head Start RCT’s standard errors.

The other issue with the Head Start RCT is its relevance to current policy debates. First, as noted before, many of the state and local pre-K programs being debated are more educationally focused than has traditionally been the case for Head Start, which raises the issue of whether the Head Start results are generalizable to these state and local pre-K programs.

Second, the Head Start RCT is not comparing Head Start with no Head Start. Only 80% of the treatment group enrolled in Head Start. About half of the control group attended some pre-K program, including 14% in Head Start and 35% in some other pre-K program. If some of these other pre-K programs were more educationally focused than Head Start, this would reduce the net impact of a “More Head Start” treatment group versus an “Other Pre-K” control group. The issue would still remain of whether Head Start’s generally higher costs per child are justified by stronger results. But the Head Start RCT does not do a great job of answering the question “do educationally focused pre-K programs work?”
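The enrollment figures above can be turned into a quick check of how far the experiment is from a clean “Head Start versus nothing” comparison. This is only a sketch using the percentages quoted in this post; the variable names are mine, and nothing here comes from the study’s own analysis.

```python
# Enrollment shares (percent) in the Head Start RCT, as quoted in this post.
TREATMENT_IN_HEAD_START = 80
CONTROL_IN_HEAD_START = 14
CONTROL_IN_OTHER_PRE_K = 35

# Gap in Head Start enrollment that the experiment actually created.
head_start_contrast = TREATMENT_IN_HEAD_START - CONTROL_IN_HEAD_START

# Share of the control group that got some pre-K anyway, and the share that got none.
control_any_pre_k = CONTROL_IN_HEAD_START + CONTROL_IN_OTHER_PRE_K
control_no_pre_k = 100 - control_any_pre_k

print(f"Head Start enrollment gap: {head_start_contrast} percentage points")
print(f"Control group in some pre-K: {control_any_pre_k}%")
print(f"Control group in no pre-K: {control_no_pre_k}%")
```

Random assignment moved Head Start enrollment by 66 percentage points, not 100, and 49% of the control group was in some pre-K program, so the comparison is a diluted one.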

The various regression discontinuity studies of state and local pre-K programs, as Whitehurst notes, by their design cannot detect long-term effects. However, based on other studies, early post-program test scores frequently are better predictors of a program’s long-run adult earnings effects than are later test scores. Therefore, the early test score information is more valuable than might at first appear.

Whitehurst’s grade and mine on the internal validity of the state and local RDD studies are the same, at B. However, Whitehurst’s text makes some disparaging remarks about RDD studies, which I have dealt with in previous blog posts.

Whitehurst’s problems with RDD seem to lead him to downgrade these studies’ external validity, which seems like the wrong place to downgrade the studies for any perceived issues with RDD. It seems to me that current studies of state and local pre-K programs are about as relevant as one can get to whether expanding such programs today is a good idea. I only give a grade of B+ because it is always the case that just because Location X’s pre-K program works, this doesn’t always mean that Location Y’s pre-K program works – there might be quality differences in the pre-K programs between Location X and Location Y.

For Georgia and Oklahoma’s universal pre-K programs, I think Whitehurst mistakes the magnitude of these results. He states that these studies find less than a one point difference on fourth-grade NAEP scaled scores. But the Cascio and Schanzenbach study he cites finds effects on fourth-grade NAEP scores, in the estimates they regard as preferred, of about 3 points. They also state that it would only take a NAEP score effect of 1.0 to 1.4 points for these programs to pass a benefit-cost test. “Small” and “large” are fuzzy terms. I would define “large” as being large enough for the program to plausibly pass a benefit-cost test.
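To make the “large enough” standard concrete, here is a minimal sketch that compares the roughly 3-point preferred estimate with the 1.0 to 1.4 point break-even range. The numbers are the ones quoted in this post; the function name is just for illustration.

```python
# NAEP figures as quoted in this post (fourth-grade scaled score points).
PREFERRED_ESTIMATE = 3.0
BREAK_EVEN_RANGE = (1.0, 1.4)  # effect needed to pass a benefit-cost test


def multiples_of_break_even(effect, break_even_range):
    """Return (lowest, highest) ratio of the effect to the break-even threshold."""
    low_threshold, high_threshold = break_even_range
    return effect / high_threshold, effect / low_threshold


low_multiple, high_multiple = multiples_of_break_even(PREFERRED_ESTIMATE, BREAK_EVEN_RANGE)
print(f"Estimate is {low_multiple:.1f}x to {high_multiple:.1f}x the break-even effect")
```

On these figures the point estimate is about two to three times the effect needed to break even, although this says nothing about the statistical uncertainty around the estimate.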

The estimates in Cascio and Schanzenbach for Georgia and Oklahoma are statistically insignificant when the most rigorous corrections for statistical noise are made. This in part reflects an inherent problem in studies of aggregate data on one or two states – there’s so much noise in individual state test score trends that it is difficult for any intervention, even one with large effects, to show statistically significant effects.

For this reason, I downgrade the internal validity of such studies of universal programs, because standard errors tend to both be large, and to be difficult to get right, in studies with only one or two geographic units in the treatment group that are being compared with all other geographic units. Often estimates are more imprecise than indicated by standard statistical software packages.

As for external validity, I see no basis for giving a stronger or weaker external validity grade to the Georgia and Oklahoma studies over studies of Kalamazoo, Tulsa, Boston, Michigan, New Jersey, South Carolina, West Virginia, Oklahoma, New Mexico, and Arkansas, which are all examined in the RDD research. These are all studies of state and local pre-K programs, and are generalizable to other state and local pre-K programs if these other programs are of similar quality.

For Tennessee Pre-K, as I have noted in previous blog posts, although the original design of this study was a randomized control trial, problems with attrition mean that the study falls short of the gold standard, by quite a bit. For example, in the first cohort of children, the study was only able to get test score data from 46% of the pre-K participants versus 32% of the control group. The original treatment and control groups were randomly chosen, but this is not true of the children for whom we actually have test score data.

Furthermore, there is some evidence that this attrition leads to bias, in that the full sample shows a reduction in kindergarten retention from 8% to 4%, and the smaller sample with test score data only shows a reduction from 8% to 6%. In addition, these “retention” effects suggest that the program must be doing something to student achievement that is not fully reflected in test scores, otherwise why would retention be cut in half, as it is in the full sample?
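The attrition and retention figures can be laid side by side with a few lines of arithmetic. This is a sketch using only the percentages quoted above for the first cohort; the names are illustrative.

```python
# Share of each group with test score data (percent), first cohort, as quoted above.
TEST_DATA_RATE = {"pre-K participants": 46, "control group": 32}

# Kindergarten retention rates (percent): (without pre-K, with pre-K).
RETENTION = {"full sample": (8, 4), "test score sample": (8, 6)}


def relative_reduction(before, after):
    """Fraction by which a rate falls when moving from `before` to `after`."""
    return (before - after) / before


# Gap in data availability between the two groups, in percentage points.
response_gap = TEST_DATA_RATE["pre-K participants"] - TEST_DATA_RATE["control group"]
print(f"Gap in test score coverage: {response_gap} percentage points")

for sample, (before, after) in RETENTION.items():
    print(f"{sample}: retention falls by {relative_reduction(before, after):.0%}")
```

Retention is cut by 50% in the full sample but by only 25% among children with test score data, which is the pattern one would expect if the tested subsample understates the program’s effect.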

For all these reasons, I regard the Tennessee study as meeting not a gold standard, or a silver standard, but a bronze standard. It is similar to the many other studies that are NOT discussed by Whitehurst that try to evaluate pre-K by controlling for observable characteristics of students, an approach that cannot correct for selection bias.

As for external validity, the Tennessee study is definitely relevant to other state pre-K programs, but it is most relevant to the pre-K programs of states that are not spending enough per child on pre-K. According to the National Institute for Early Education Research, Tennessee’s program has spending per child that is over $2,000 per child less than what is judged to be desirable for high-quality pre-K. So Tennessee’s program may be relevant to some proposed state and local pre-K programs, but perhaps not so much to more fully-funded pre-K programs.

Finally, there is the recent study of North Carolina’s “More at Four” program, which I reviewed in a recent blog post. Whitehurst does not mention this study. This is a good silver standard study because it relies on a “natural experiment”: the More at Four program was gradually rolled out over time in different counties. It is hard to see why a county’s test scores the appropriate number of years later in third grade would be correlated with More at Four spending, except for a true effect of the program. And as with other studies of state and local pre-K programs, the study is highly relevant, as long as one remembers that each state’s program is different.

Overall, for the 10 studies/groups of studies that are graded by both Whitehurst and me, Whitehurst’s average grade is 3.15, or between a B and a B+, and my average grade is 2.95, slightly less than B. Whitehurst isn’t quite as tough a grader as me, and in that sense he is not quite as “picky” as me.
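For readers who want to check these averages, here is a minimal sketch that recomputes them from the three tables. The grade-point scale (A = 4.0, A- = 3.7, B+ = 3.3, B = 3.0, B- = 2.7, C = 2.0) is an assumption on my part for this illustration; it is a conventional scale, and it does reproduce the averages stated above.

```python
# Assumed grade-point scale; the tables themselves only give letter grades.
POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C": 2.0}

# (internal validity, external validity) for the 10 entries graded by both reviewers.
WHITEHURST = {
    "Perry": ("A-", "C"), "Abecedarian": ("B+", "C"), "Chicago CPC": ("C", "B"),
    "Head Start in 60s": ("B", "C"), "Head Start in 1980s": ("C", "A"),
    "IHDP": ("A", "B"), "Head Start RCT": ("A", "A"),
    "District programs": ("B", "B"), "Georgia & OK Universal": ("B", "A"),
    "Tennessee Pre-K": ("A", "A"),
}
BARTIK = {
    "Perry": ("A-", "B"), "Abecedarian": ("B+", "B"), "Chicago CPC": ("B-", "B+"),
    "Head Start in 60s": ("B", "C"), "Head Start in 1980s": ("B-", "C"),
    "IHDP": ("A", "B-"), "Head Start RCT": ("A", "C"),
    "District programs": ("B", "B+"), "Georgia & OK Universal": ("B-", "B+"),
    "Tennessee Pre-K": ("C", "B+"),
}


def average_grade(grades):
    """Mean grade points over every internal and external validity grade."""
    points = [POINTS[g] for pair in grades.values() for g in pair]
    return sum(points) / len(points)


def study_average(grades, study):
    """Mean of one study's internal and external validity grade points."""
    return sum(POINTS[g] for g in grades[study]) / len(grades[study])


print(f"Whitehurst average: {average_grade(WHITEHURST):.2f}")  # 3.15
print(f"Bartik average: {average_grade(BARTIK):.2f}")  # 2.95
```

The same helper gives per-study averages, for example `study_average(BARTIK, "Head Start RCT")` for the mean of my two grades on the Head Start RCT.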

Where we differ most obviously is over the Head Start RCT and Tennessee studies, versus the older studies and the other more recent studies. He gives grades of straight A’s to the Head Start RCT and Tennessee study, so these two studies clearly dominate all the other studies from his perspective. In contrast, I give an average grade of B to the Head Start RCT, and B- to the Tennessee study. I give grades higher than B to Perry, Abecedarian, IHDP, the state/local RDD studies, and the North Carolina study, and give B grade averages to CPC and the OK/GA studies. So, in my view, this other evidence dominates – there’s a preponderance of evidence of equal or greater quality that suggests that pre-K can work.

Of course, the Head Start RCT and Tennessee evidence still matters – this evidence suggests that there are some doubts as to whether Head Start as of 2002 was as educationally effective as some state pre-K programs, and the Tennessee evidence raises some doubts about that state’s pre-K program. But there is no way in which I view this evidence as trumping all the other evidence, which seems to be Whitehurst’s view.

Whitehurst is unusual among researchers in privileging the Head Start RCT and Tennessee studies over all the other evidence. That doesn’t mean he’s more picky, it simply means he has a different approach than most researchers to thinking about the strengths and weaknesses of different research approaches.

(Note: Steve Barnett of the National Institute for Early Education Research has independently provided some reactions to Whitehurst’s blog post. I wrote the first draft of this blog post prior to seeing Barnett’s reaction, and did not significantly revise the post to avoid overlap – there’s some overlap, but considerable independent information.)
