Whitehurst’s latest comments on pre-K

Russ Whitehurst has new comments on pre-K, this time arguing against a recent study of Georgia pre-K. That study found pre-K effects on cognitive skills with an average "effect size" across all tests of 0.69. This is quite high.

("Effect size" is education research jargon for scaling a policy's effect on test scores by dividing the effect by the "standard deviation" of the test score across students. This is an attempt to control for the arbitrariness of test score metrics by measuring the effect relative to how much that particular test score varies in the sample.)
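In symbols, this is the standardized mean difference familiar from the statistics literature (the notation here is mine, not the study's): if $\bar{Y}_T$ is the mean test score of pre-K attendees, $\bar{Y}_C$ is the mean score of the comparison group, and $\sigma_Y$ is the standard deviation of scores across students, then

$$\text{effect size} = \frac{\bar{Y}_T - \bar{Y}_C}{\sigma_Y}.$$

So the Georgia study's 0.69 means pre-K attendees scored, on average, 0.69 student-level standard deviations higher than the comparison group.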

Whitehurst mainly argues against this study's validity for two reasons, one a weak argument and the other a stronger one. First, he argues that there's a problem in all regression discontinuity studies because some pre-K graduates inevitably disappear from the sample when they're followed up on at the beginning of kindergarten. Although this sample attrition could bias program estimates in either direction, in practice careful studies find that the bias is small. For example, the Boston regression discontinuity study ran numerous tests for possible biases and found no sign of them. The Kalamazoo study ran some estimates restricted to the same children observed both before and after pre-K, and found no significant difference in the estimates.

A second and more valid concern is that the Georgia study has much larger sample attrition, due to problems in obtaining consent from the families and schools of pre-K graduates entering kindergarten. Furthermore, there are some signs that this differential attrition made the entering-kindergarten sample somewhat more advantaged. Over-representing more advantaged children among program graduates would bias the study towards over-estimating program effects. I'm sure these issues will be discussed as this report is submitted to academic journals and is evaluated and re-estimated during the refereeing process.

Whitehurst also expresses some doubt about the large size of the estimated effects. The effects are large, although Whitehurst exaggerates the differentials from other research. The average effect size from previous studies is 0.35 in a meta-analysis by Duncan and Magnuson, and 0.31 in a meta-analysis by the Washington State Institute for Public Policy. These average effect sizes tend to be lower for more recent studies, and lower for Head Start than for state and local pre-K programs.

The regression discontinuity studies tend to find somewhat higher effect sizes. For example, the average effect size in the regression discontinuity study of Boston pre-K was 0.54.

But, as I have discussed previously, and as Whitehurst has alluded to previously, regression discontinuity studies of pre-K estimate something a little different from other pre-K impact studies. Regression discontinuity studies measure effects of pre-K for program graduates relative to what would have occurred if they had just missed the age cut-off for pre-K entrance and had not attended this subsidized pre-K until a year later. This means that regression discontinuity pre-K studies are in many cases comparing pre-K with no pre-K, as parents are less likely to enroll children in pre-K if the children will not be attending kindergarten the next year. In contrast, other pre-K impact studies measure the effects of some public pre-K program relative to a comparison group that will be attending kindergarten the next year, and that is therefore more likely to attend some pre-K. The fact that the comparison group is more likely to attend pre-K probably reduces the net impact estimates for these other pre-K studies.
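For readers who want to see the mechanics, here is a minimal sketch of the regression discontinuity idea in Python. All numbers are invented for illustration (a true 0.5 SD effect and a smooth age gradient); the point is only how the jump at the cutoff identifies the pre-K effect:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 4_000

# Hypothetical data: children at or above the age cutoff attended pre-K
# last year; scores rise smoothly with age, plus a 0.5 SD jump for attendance.
age = rng.uniform(-1, 1, n)            # years relative to the entry cutoff
attended = (age >= 0).astype(float)
score = 0.8 * age + 0.5 * attended + rng.normal(0, 1, n)

# Fit separate linear trends on each side of the cutoff; the coefficient
# on `attended` is the estimated jump in scores at age = 0.
X = sm.add_constant(np.column_stack([age, attended, age * attended]))
fit = sm.OLS(score, X).fit()
print(f"estimated effect at the cutoff: {fit.params[2]:.2f}")  # near 0.5
```

In a real pre-K RD study the running variable is the child's birthdate relative to the enrollment cutoff, and both groups are tested at the same calendar time at kindergarten entry.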

Which type of estimate is more useful? I think they’re both useful. The regression discontinuity results tell us something about the effects of pre-K versus no pre-K. This is useful for comparison with the gross costs of pre-K. The RD estimates are closer to what a labor economist would call “structural estimates” of the effects of pre-K, which can be useful for modeling the effects of other pre-K programs.

On the other hand, other pre-K estimates tell you the effects of a particular pre-K program versus whatever other pre-K programs are currently available in that particular marketplace. This is useful if the only policy we are considering is whether or not to adopt this particular program in this particular market. In that case, a benefit-cost analysis would have to compare the net benefits of this program versus the extra net social costs of substituting the new program for existing programs. In other words, the new program's costs may be reduced considerably because it may save on the costs of existing pre-K programs, which means it doesn't take as big an effect size for the program to pass a benefit-cost test.

For both of these types of estimates, extrapolating the estimates to some other pre-K program in some other state or local area requires some assumptions. In general, introducing a new high-quality free pre-K program in any particular local area will result in some increases in pre-K enrollment in this program, and some reductions in enrollment in other programs, with the exact pattern depending on the program being introduced and what is currently available in that market. Neither the RD estimates, nor the estimated effects of some other pre-K program in some other market, will tell you the net benefits of a new pre-K program in a new market without further assumptions about take-up of the new program versus the old programs, and about their relative quality.

In sum, I think the Georgia estimates are only suggestive, because of the problem of differential attrition in the treatment and control groups due to survey non-consent. The estimates may be correct, but demonstrating this would require further analyses showing that the survey non-consent problem does not significantly bias them. Because of this problem, I would currently give this study an "internal validity" (or "research reliability") grade of C, although this grade might move up with further estimates by the authors examining the issue.

However, the Georgia estimates are not representative of most of the regression discontinuity studies, which have done further analyses which suggest that the estimates are not biased by problems with attrition.

Whitehurst also updates his analysis of research to downgrade slightly, from A to A-, his "internal validity" grade (intuitively, research reliability) for the recent Tennessee study, which found quick fade-out of pre-K test score effects. But he does not note the factors that lead me to give the Tennessee study an "internal validity" grade of C: specifically, there was differential attrition due to problems of family consent in the control group in this study, and the few estimates that did not suffer from this attrition bias suggest that the Tennessee program may have had greater effects than the main estimates find.

In other words, the Tennessee study actually has stronger evidence of biased estimates than is true of this recent Georgia study. However, for the Tennessee study, the bias appears to be leading the pre-K effects to be under-estimated. There certainly is no good reason to give the Tennessee study a higher grade for research reliability than the Georgia study.


The importance of education, and a pre-K experiment to watch

Two articles recently came to my attention that are of considerable relevance to early childhood education.

First, New York Times reporter Eduardo Porter has an article and interview with economist Thomas Piketty on growing economic inequality. Piketty is the author of a new book on inequality that is getting a lot of attention.

One quotation from Piketty in the interview struck me as particularly relevant to early childhood education, and indeed education in general:

“Historically, the main equalizing force — both between and within countries — has been the diffusion of knowledge and skills.”

I think this summarizes what many economists believe about the role of education. But it is important that one of our leading scholars on economic inequality across the world over the last century agrees with that conclusion.

The policy implication is that if one thinks that inequality is one of the leading social issues of our time, it is imperative to go to great lengths to broaden educational opportunities. Early childhood education is one of the most cost-effective ways to do so, although it should be accompanied by other policies as well.

Second, New York Times reporter Kate Taylor had an article reporting on an experiment testing the “Building Blocks” math curriculum in pre-K. (I thank a tweet from the Human Capital Research Collaborative for drawing this article to my attention.)

One point of note in this article is that this particular curriculum is used in Boston’s pre-K program. As noted in a previous blog post, an article by Weiland and Yoshikawa found extremely high test score effects of Boston’s program. I estimate that this program would increase kindergarten readiness among low-income students sufficiently to increase adult earnings by 15%, which is a huge effect for a one-year program.

An important issue is why Boston’s program is so effective. Perhaps this experiment will tell us whether the math curriculum is key. More time will tell.


Reducing inequality may sometimes increase economic growth – and a specific example is early childhood education

Nobel prize-winning economist Paul Krugman devoted his column this morning to recent empirical evidence, from the International Monetary Fund, which indicates that reducing income inequality need not reduce economic growth. This goes against a tradition among economists of seeing an inherent tradeoff, in which reduced income inequality can only be pursued at a cost in reduced economic output or growth.

A prime example of a public policy that both reduces inequality and promotes economic growth is increasing access to high-quality early childhood education, such as pre-K programs and high-quality child care.

As I’ve mentioned in previous posts, high-quality pre-K can increase the adult earnings of children from low-income families by 10% or more.  High-quality child care and pre-K from birth to age 5 can increase the adult earnings of children from low-income families by over 25%.

Yet these policies will also increase economic growth. The evidence suggests that these extra skills and earnings for children from low-income families will provide spillover economic benefits for the rest of society.

These spillover benefits occur because my earnings depend in part on the skills of my fellow workers, in my firm, and elsewhere in my local economy. Firms are better able to introduce new technologies when a higher percentage of all workers are skilled, so my firm may be more competitive when my fellow workers get more skills. Firms’ competitiveness also depends on the skills of local suppliers, so my wages may depend on the skills of those suppliers’ workers. Firms may also be more innovative if they are able to get ideas and skilled workers from other local firms.

How do these spillover benefits occur? They occur by firms investing more and creating more local jobs when a local economy increases its overall skills. Expanded pre-K and other early childhood education programs can expand the local skills base. A worker can benefit from such expansion of early childhood education even if his or her skills would have been fine even without the expansion – the increased skills of other workers will boost job creation and boost worker productivity for all local workers.

Early childhood education is a prime example of a case where all workers share in the economic fortunes of an economy, which depend in part on everyone's skills. Investing in "other people's children" is not only a moral issue, but also a matter of enlightened self-interest.


Grading the Pre-K Evidence

Russ Whitehurst of Brookings has a new blog post that outlines his views on pre-K research in more detail.  The title is “Does Pre-K Work? It Depends How Picky You Are”.

Whitehurst reaches the following conclusion:

“I conclude that the best available evidence raises serious doubts that a large public investment in the expansion of pre-k for four-year-olds will have the long-term effects that advocates tout. 

This doesn’t mean that we ought not to spend public money to help families with limited financial resources access good childcare for their young children.  After all, we spend tax dollars on national parks, symphony orchestras, and Amtrak because they make the lives of those who use them better today.  Why not childcare? 

It does mean that we need public debate that recognizes the mixed nature of the research findings rather than a rush to judgment based on one-sided and misleading appeals to the preponderance of the evidence.”

Therefore, it is fair to say that Whitehurst is marketing doubt. Maybe pre-K doesn’t work. Maybe we shouldn’t move forward with large-scale programs, and instead should undertake more limited measures or do more research.

He admits that opponents of his position, who believe that pre-K does work, are also basing their position on scientific research, and wonders: "how is it that different individuals could look at the same research and come to such different conclusions?"

His framing of the issue is that he is just more “picky” about what research he believes. In his view, his opponents, when claiming that the “preponderance” of evidence supports pre-K, are relying on weak research, whereas he is relying on the strongest research in saying that pre-K does not work.

In his view, the strongest research, to which he gives straight “As” for quality, is the recent Head Start randomized control trial (RCT) and the recent Tennessee RCT.  All the other evidence for the effectiveness of pre-K, in his view, is inferior in research rigor (“internal validity”) and/or less policy relevant to today’s policy choices (“external validity”).

Let me make some summary comments upfront before getting into the details of Whitehurst’s research review.

First, I think all researchers seek to be “picky” in reviewing research, in trying to assess the rigor of the research, and its relevance to the policy question at hand. However, even researchers who are equally “picky” can disagree about what the strengths and weaknesses are of various studies.

Second, in my view, Whitehurst significantly overstates the quality and relevance of the Tennessee RCT, and the relevance of the Head Start RCT.  He’s not “picky” enough!

Third, Whitehurst underplays the findings and understates the research strengths and relevance of many other research studies.  He also omits recent relevant research.

Fourth, Whitehurst never grapples with a fundamental issue in pre-K research: it does not take much of a pre-K impact on test scores for pre-K’s predicted earnings benefits over an entire career to justify considerable costs. Effects he characterizes as “small” are in many cases more than sufficient for programs to pass a benefit-cost test.

Fifth, Whitehurst never discusses another fundamental issue in pre-K research: test score effects often fade as children go through the K-12 system, but then effects on adult outcomes such as educational attainment or earnings re-emerge despite the fading. The faded test score effects are often poorer predictors of adult outcomes than the initial post-program test score effects. This means that studies with good evidence on adult outcomes, and studies with good evidence on immediate post-program outcomes, both gain importance relative to studies that only go through elementary school. The elementary school test data add some evidence, but not as much as might at first appear.

Sixth, if Whitehurst believes in the usefulness of child care services, the most logically consistent position is that he should back expanding programs such as Educare (full-time child care and pre-K from birth to age 5) to all low-income children. In my book Investing in Kids, I argued that the research evidence on child care and on the Abecedarian program, which was very similar to today’s Educare program,  suggested that a program such as Educare would have earnings benefits for parents that significantly exceeded program costs.

So why not expand Educare, which would help low-income parents increase their work and their educational attainment, leading to significant boosts to parents' short-run and long-run earnings? If Educare also helps improve the children's long-run prospects, so much the better. (And in fact Whitehurst seems to like the Infant Health and Development Program research, which suggests that an Educare-style program would have such benefits for low-income children.)

I estimate that an Educare program for all families below the poverty line would cost around $70 billion per year, but would have parental earnings benefits significantly greater than that. This proposal would be consistent with a previous proposal made by colleagues of Whitehurst at Brookings.  Such a proposal goes far beyond the cost of any preschool proposal made by the Obama Administration. But I think it would be a logically consistent proposal for Whitehurst to make. Whitehurst should be arguing that the Obama Administration preschool proposal is underfunded, not sufficiently comprehensive in its birth-to-five services, and insufficiently targeted on low-income families. (Note: this is not my position; for example, I’m in favor of universal pre-K. What I am describing is the position that is most consistent with Whitehurst’s own review of the research evidence.)  

Before I get into the details, one more important headline issue: why should policymakers or journalists or other policy “influencers” believe my position, that the best evidence supports pre-K’s effectiveness, rather than Whitehurst’s position, that the research is more uncertain? The best way is to simply look at the research studies on your own, and make up your own mind, but how is one supposed to do this without an extensive background in statistics and research methodology?

Whitehurst's position of doubt has a structural advantage in the public debate. Some researchers argue that pre-K works; others say it may not. To an outside observer, doubt wins the debate as long as the side promoting doubt maintains a consistent position that cites evidence. It's easier to spread doubt than to assuage it.

Therefore, I would also make the following argument: many other researchers familiar with the pre-K research evidence disagree with Whitehurst, and agree that pre-K can work.  Among pre-K researchers, Whitehurst’s weighting of the evidence is a distinct minority position.

Consider a recent research summary, “Investing in Our Future: The Evidence Base on Preschool Education”, which was authored by 10 prominent researchers on pre-K from a variety of disciplines and universities.  This study concluded the following:

“Recent meta-analyses drawing together the evidence across decades of evaluation research now permit us to say with confidence that preschool programs can have a substantial impact on early learning and development….

While there is clear evidence that preschool education boosts early learning for children from a range of backgrounds, we also see a convergence of test scores during the elementary school grades so that there are diminishing differences over time on tests of academic achievement between children who did and did not attend preschool. Yet the most recent research is showing an accumulation of evidence that even when the difference in test scores declines to zero, children who have attended preschool go on to show positive effects on important adolescent and young adult outcomes, such as high school graduation, reduced teen pregnancy, years of education completed, earnings, and reduced crime…

"Although random assignment of children or parents to program and comparison groups is the "gold standard" for program evaluation, sometimes this is not possible. One of the most frequently used alternative methods…is called a Regression-Discontinuity Design. …Comparing kindergarten entry achievement scores for children who have completed a year in Pre-K with the scores measured at the same time for children who just missed the birthday cutoff and are about to enter Pre-K can be a strong indicator of program impacts…Other methods used in recent nonexperimental preschool studies include propensity score weighting, individual, sibling or state fixed-effects, and instrumental variables analysis…Evaluations that select comparison groups in other ways should be approached with healthy skepticism."

Therefore, it is clear that other researchers weight the evidence quite differently from Whitehurst. This is in part because other researchers, while noting that RCTs are the “gold standard”, view other studies as having sufficiently good comparison groups that they provide good “silver standard” evidence.  Other researchers are also aware that few RCTs are so perfect that they are pure “gold standard”; in practice, we find that the gold is almost always alloyed with some less precious metal.

Now, onto the details. To do this, I’ll regrade the various studies that Whitehurst examines, along with adding one recent study that he omits. I’ll use his criteria of looking at “internal validity” (intuitively, the reliability of the research in identifying some causal effect of some program), and “external validity” (intuitively, the study’s relevance to current policy debates). I’ll also add my own take on what the reported impact shows.

Programs from the 1960s and 1970s

| Program/Research | Reported impact (after initial year), Whitehurst | Reported impact (after initial year), Bartik | Internal validity, Whitehurst | Internal validity, Bartik | External validity, Whitehurst | External validity, Bartik |
|---|---|---|---|---|---|---|
| Perry | + | + | A- | A- | C | B |
| Abecedarian | + | + | B+ | B+ | C | B |
| Chicago CPC | + | + | C | B- | B | B+ |
| Head Start in 60s | + (for mortality) | + (for mortality and ed attainment) | B | B | C | C |

For Perry and Abecedarian, I would upgrade their "external validity/policy relevance" from C to B because these program designs and results are quite relevant to what we are doing today. First, the design of these two programs is similar to what we are doing today. Abecedarian is quite similar to today's Educare program. Perry is similar in many respects to today's pre-K programs. Class sizes in Perry were smaller than in most of today's programs, and the program ran for two years versus one year for most programs today, so today's programs would be expected to fall somewhat short of Perry's estimated adult earnings impact of 19%. However, most studies suggest modest impacts of class size on pre-K outcomes, and that two years of pre-K does not double benefits, so we would not expect Perry's effects to be far beyond those of current pre-K programs. And many of today's pre-K programs are full-day, which has been shown to have larger impacts than half-day.

Another aspect of Perry and Abecedarian that increases their relevance is that they contain direct evidence on effects on adult outcomes. Because effects on test scores often fade, and these faded test score effects may not reflect adult outcomes, this makes these studies more important.

In addition, it is true that Perry and Abecedarian are experiments that mostly compare pre-K with no pre-K, whereas today any new pre-K program is compared both with no pre-K and with the other subsidized pre-K programs that some control group members attend. But this merely complicates the analysis of the impact of a new pre-K program, and puts a premium on the new program being at least as high-quality as existing programs. A real-world benefit-cost analysis of a new pre-K program will adjust the benefits and costs downwards for substitution of the new program for existing programs. For example, the RAND Corporation did this in its analysis of the effects of a universal pre-K program. Because the options in the pre-K market are always changing, even today, impact analyses of a new pre-K program will have to make adjustments for changes in those options. There are some scientific advantages to having "clean" estimates of the impact of pre-K versus no pre-K, which Perry and Abecedarian provide.

For Chicago CPC, Whitehurst fails to note that much of the variation in pre-K use was due to neighborhood of residence, and hence can be viewed as in part a "natural experiment". Furthermore, the CPC researchers have gone to great lengths to try to correct for any remaining selection bias in the estimates, and have found that a variety of methods for doing so yield similar results. Therefore, I think the internal validity of CPC is higher than Whitehurst's grade of C. CPC is in between a "B grade" study based on a natural experiment and a "C grade" study that simply controls for observable characteristics.

In addition, the external validity and policy relevance of CPC is quite high, as the program was run by Chicago Public Schools and is quite similar to pre-K programs run in many state and local areas.  So Whitehurst’s grade there also seems too low.  The study also includes direct evidence on adult earnings effects and adult educational attainment effects.

As for Head Start, Whitehurst’s table says that the Ludwig-Miller study he cites only finds long-term impacts on mortality. But the study also finds some long-term impacts on educational attainment, which he does not note.

Programs from the 1980s

| Program/Research | Reported impact (after initial year), Whitehurst | Reported impact (after initial year), Bartik | Internal validity, Whitehurst | Internal validity, Bartik | External validity, Whitehurst | External validity, Bartik |
|---|---|---|---|---|---|---|
| Head Start in 1980s | + | + | C | B- | A | C |
| Infant Health and Development | + (impacts only for disadvantaged children with close to normal birth weights) | + (impacts only for disadvantaged children with close to normal birth weights) | A | A | B | B- |

For the Head Start sibling studies from the 1980s, Whitehurst argues that the sibling comparison will be biased towards finding effects of Head Start. However, as discussed in the research, there are reasons to think that the bias in which sibling gets into Head Start could go in either direction. Also, research such as Deming’s tries to look very closely at pre-existing characteristics to see whether there is a significant bias, and does not find strong signs of bias sufficient to overturn the main results.

As for policy relevance/external validity, I regard many of the pre-K programs we are pursuing today at the state level, and considering encouraging via federal policy, as much more educationally focused than was traditionally the case for Head Start, although this may be changing for Head Start in recent years. Therefore, it is unclear to me whether the effects of Head Start are as relevant as Whitehurst thinks to current state and local pre-K programs.

On IHDP, it is true that the results are only significant for disadvantaged children of close to normal birth weight. However, the near-normal birth weight group is the group most relevant to current debates about early childhood education. As for policy relevance/external validity, IHDP is really a test of an early child care program, at ages 1 and 2.  This is relevant to evaluating a program such as Educare, but not to evaluating most current proposals for pre-K at age 4.

Recent Programs

| Program/Research | Reported impact (after initial year), Whitehurst | Reported impact (after initial year), Bartik | Internal validity, Whitehurst | Internal validity, Bartik | External validity, Whitehurst | External validity, Bartik |
|---|---|---|---|---|---|---|
| Head Start RCT | None | None statistically significant, but point estimates consistent with important effects | A | A | A | C |
| District programs, e.g., Tulsa | Unknown (research design doesn't allow follow-up after pre-K) | Unknown, but would predict sizable adult earnings effects based on test scores | B | B | B | B+ |
| Georgia & OK Universal | + (very small at best) | + (large enough to have high benefit-cost ratio) | B | B- | A | B+ |
| Tennessee Pre-K | - | - | A | C | A | B+ |
| North Carolina More at Four | Not included | + | NA | B | NA | B+ |

The recent Head Start RCT does not show much in the way of statistically significant effects from kindergarten on, but some of the point estimates are consistent with effects that might be important. For example, the point estimates on the cognitive tests that are consistently given over time in the experiment show effects at third grade that would predict about a 1% increase in adult earnings, which adds up to a lot  of money over an entire career. Given the uncertainty, the true effect could be 2 or 3%, or could be zero or negative – we just can’t tell.

As mentioned before, one issue that Whitehurst does not grapple with is that even quite small test score effects would predict adult earnings effects that might be very important in a benefit-cost analysis. This makes research more difficult, because it is hard to rule out test score effects that are small by conventional standards but still large enough for the program to pay off. Even with a relatively large sample, such as in the Head Start RCT, the studies are "underpowered" for detecting some effects that might be relevant.
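A back-of-the-envelope power calculation shows why. The sketch below uses the standard minimum-detectable-effect formula with hypothetical sample sizes of 2,000 children per arm (my illustration, not the Head Start RCT's actual design):

```python
from scipy.stats import norm

def min_detectable_effect(n_treat, n_control, alpha=0.05, power=0.80):
    """Smallest true effect size (in SD units) a two-group comparison of
    means can reliably detect, assuming unit variance in both groups."""
    z_alpha = norm.ppf(1 - alpha / 2)          # two-sided test
    z_power = norm.ppf(power)
    se = (1 / n_treat + 1 / n_control) ** 0.5  # SE of the effect size
    return (z_alpha + z_power) * se

print(f"{min_detectable_effect(2000, 2000):.3f} SD")  # about 0.089 SD
```

Even with 2,000 children per group, effects below roughly 0.09 SD cannot reliably be distinguished from zero – yet, on the argument above, effects well below that threshold could still predict earnings gains that matter in a benefit-cost analysis.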

This problem is exacerbated because studies frequently find that test score effects at third grade underpredict the long-run earnings effects of pre-K. Test score effects often fade but then re-emerge in better adult outcomes than would have been predicted at 3rd grade.  This increases the uncertainty about the results beyond what is described by the Head Start RCT’s standard errors.  

The other issue with the Head Start RCT is its relevance to current policy debates. First, as noted before, many of the state and local pre-K programs being debated are more educationally focused than has traditionally been the case for Head Start, which raises the issue of whether the Head Start results are generalizable to these state and local pre-K programs.

Second, the Head Start RCT is not comparing Head Start with no Head Start. Only 80% of the treatment group enrolled in Head Start. About half of the control group attended some pre-K program, including 14% in Head Start and 35% in some other pre-K program. If some of these other pre-K programs were more educationally focused than Head Start, this would reduce the net impact of a “More Head Start” treatment group versus an “Other Pre-K” control group. The issue would still remain of whether Head Start’s generally higher costs per child are justified by stronger results. But the Head Start RCT does not do a great job of answering the question “do educationally focused pre-K programs work?”

The various regression discontinuity studies of state and local pre-K programs, as Whitehurst notes, by their design cannot detect long-term effects. However, based on other studies, early post-program test scores frequently are better predictors of a program’s long-run adult earnings effects than are later test scores. Therefore, the early test score information is more valuable than might at first appear.

Whitehurst's grade and mine on the internal validity of the state and local RDD studies are the same, a B. However, Whitehurst's text makes some disparaging remarks about RDD studies, which I have dealt with in previous blog posts.

Whitehurst's problems with RDD seem to lead him to downgrade these studies' external validity, which seems like the wrong place to downgrade the studies for any perceived issues with RDD. It seems to me that current studies of state and local pre-K programs are about as relevant as one can get to whether expanding such programs today is a good idea. I only give a grade of B+ because the fact that Location X's pre-K program works does not always mean that Location Y's pre-K program works – there might be quality differences between the two programs.

For Georgia and Oklahoma's universal pre-K programs, I think Whitehurst understates the magnitude of the results. He states that these studies find less than a one-point difference on fourth-grade NAEP scaled scores. But the Cascio and Schanzenbach study he cites finds fourth-grade NAEP effects, in the estimates they regard as preferred, of about 3 points. They also state that it would only take a NAEP score effect of 1.0 to 1.4 points for these programs to pass a benefit-cost test. "Small" and "large" are fuzzy terms. I would define "large" as large enough for the program to plausibly pass a benefit-cost test.

The estimates in Cascio and Schanzenbach for Georgia and Oklahoma are statistically insignificant when the most rigorous corrections for statistical noise are made. This in part reflects an inherent problem in studies of aggregate data on one or two states – there’s so much noise in individual state test score trends that it is difficult for any intervention, even one with large effects, to show statistically significant effects.

For this reason, I downgrade the internal validity of such studies of universal programs: standard errors tend both to be large and to be difficult to get right in studies where only one or two geographic units in the treatment group are compared with all other geographic units. Estimates are often more imprecise than standard statistical software packages indicate.
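One way to see the problem is a placebo exercise, which I sketch below with made-up data: simulate 50 states with idiosyncratic test score trends and no true policy effect anywhere, "treat" one state, and compare its difference-in-differences estimate to the spread of placebo estimates for the untreated states:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_years = 50, 10
years = np.arange(n_years)

# Invented data: state scores drift on idiosyncratic trends; there is
# NO true policy effect anywhere.
trends = rng.normal(0, 0.5, n_states)
scores = trends[:, None] * years + rng.normal(0, 1.0, (n_states, n_years))

def diff_in_diff(s):
    """State s vs. all other states, later five years vs. first five."""
    others = np.delete(scores, s, axis=0)
    return (scores[s, 5:].mean() - scores[s, :5].mean()) - \
           (others[:, 5:].mean() - others[:, :5].mean())

estimate = diff_in_diff(0)                       # the one "treated" state
placebos = [diff_in_diff(s) for s in range(1, n_states)]
print(f"treated-state estimate: {estimate:.2f}")
print(f"placebo spread (SD):    {np.std(placebos):.2f}")
```

Because the placebo spread is driven by state-level trend noise, a single treated state can produce a seemingly large estimate even when the true effect is zero – and conventional standard errors will often understate this.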

As for external validity, I see no basis for giving a stronger or weaker external validity grade to the Georgia and Oklahoma studies than to the studies of Kalamazoo, Tulsa, Boston, Michigan, New Jersey, South Carolina, West Virginia, Oklahoma, New Mexico, and Arkansas, which are all examined in the RDD research. These are all studies of state and local pre-K programs, and are generalizable to other state and local pre-K programs if those other programs are of similar quality.

For Tennessee Pre-K, as I have noted in previous blog posts, although the original design of this study was a randomized control trial, problems with attrition mean that the study falls well short of the gold standard. For example, in the first cohort of children, the study was only able to get test score data from 46% of the pre-K participants versus 32% of the control group. The original treatment and control groups were randomly chosen, but this is not true of the children for whom we actually have test score data.

Furthermore, there is some evidence that this attrition leads to bias, in that the full sample shows a reduction in kindergarten retention from 8% to 4%, and the smaller sample with test score data only shows a reduction from 8% to 6%.  In addition, these “retention” effects suggest that the program must be doing something to student achievement that is not fully reflected in test scores, otherwise why would retention be cut in half, as it is in the full sample?

For all these reasons, I regard the Tennessee study as meeting not a gold standard, or a silver standard, but a bronze standard. It is similar to the many other studies NOT discussed by Whitehurst that try to evaluate pre-K by controlling for observable characteristics of students, an approach that cannot correct for selection on unobserved characteristics.

As for external validity, the Tennessee study is definitely relevant to other state pre-K programs, but it is most relevant to programs in states that are not spending enough per child on pre-K. According to the National Institute for Early Education Research, Tennessee's spending is over $2,000 per child less than what is judged desirable for high-quality pre-K. So Tennessee's program may be relevant to some proposed state and local pre-K programs, but perhaps not so much to more fully-funded programs.

Finally, there is the recent study of North Carolina's "More at Four" program, which I reviewed in a recent blog post. Whitehurst does not mention this study. This is a good silver standard study because it relies on a "natural experiment": the More at Four program was gradually rolled out over time in different counties. It is hard to see why a county's 3rd-grade test scores, measured the appropriate number of years later, would be correlated with More at Four spending except through a true effect of the program. And as with other studies of state and local pre-K programs, the study is highly relevant, as long as one remembers that each state's program is different.

Overall, for the 10 studies/groups of studies that are graded by both Whitehurst and me, Whitehurst's average grade is 3.15, between a B and a B+, and my average grade is 2.95, slightly below a B. Whitehurst isn't quite as tough a grader as I am, and in that sense he is not quite as "picky".

Where we differ most obviously is over the Head Start RCT and Tennessee studies, versus the older studies and the other more recent studies. He gives straight A's to the Head Start RCT and the Tennessee study, so these two studies clearly dominate all the others from his perspective. In contrast, I give an average grade of B to the Head Start RCT, and B- to the Tennessee study. I give grades higher than B to Perry, Abecedarian, IHDP, the state/local RDD studies, and the North Carolina study, and B grade averages to CPC and the OK/GA studies. So, in my view, this other evidence dominates – there's a preponderance of evidence of equal or greater quality suggesting that pre-K can work.

Of course, the Head Start RCT and Tennessee evidence still matters – this evidence suggests that there are some doubts as to whether Head Start as of 2002 was as educationally effective as some state pre-K programs, and the Tennessee evidence raises some doubts about that state’s pre-K program. But there is no way in which I view this evidence as trumping all the other evidence, which seems to be Whitehurst’s view.  

Whitehurst is unusual among researchers in privileging the Head Start RCT and Tennessee studies over all the other evidence. That doesn't mean he's more picky; it simply means he has a different approach from most researchers in thinking about the strengths and weaknesses of different research designs.

(Note: Steve Barnett of the National Institute for Early Education Research has independently provided some reactions to Whitehurst’s blog post.  I wrote the first draft of this blog post prior to seeing Barnett’s reaction, and did not significantly revise the post to avoid overlap – there’s some overlap, but considerable independent information. )


The appeal of universal programs rests in part on simplicity

A summary of my paper with my colleague Marta Lachowska on the Kalamazoo Promise was recently published in Education Next. (The summary even received a tweet from Arne Duncan!) The Kalamazoo Promise is a program begun in 2005, under which anonymous private donors promised to provide all graduates of Kalamazoo Public Schools with up to 4 years of free tuition at Michigan's public colleges and universities.

Our paper relied on one aspect of the Kalamazoo Promise that provides a “natural experiment”. Promise eligibility requires that students be continuously enrolled in Kalamazoo Public Schools since the beginning of 9th grade. Our paper compared the behavior and academic achievement of high school students who were “Promise eligible”, versus high school students who were “Promise ineligible”, based on length of enrollment in KPS, from before to after the Promise announcement in 2005.

We found statistically significant and large effects of the Promise on improving the behavior of all students, and on improving high school GPA for African-American students.  The point estimates of effects on all students’ GPA were positive, but insignificantly different from zero. Some of the estimated effects are large. For example, in the 2007-2008 school year, the estimated Promise effect on the GPA of African-American students is an increase of 0.7 points, on a four-point scale.

I think several points from these findings might be relevant to early childhood education advocates.

First, this study, like many other studies, shows that there are definitely many interventions after early childhood that can make a difference in educational attainment and life prospects. I think it is both a political mistake and substantively wrong to argue that early childhood education inherently has a higher rate of return than later interventions. There are many later interventions with high rates of return, for example some high school tutoring and counseling programs, and demand-oriented adult job training programs. The argument for early childhood education is that it has a high benefit/cost ratio or a high rate of return, not that other interventions don’t also have a high rate of return.

Second, I do think it is true that many later interventions are more complicated to implement than early childhood education. In early childhood education, we are essentially adding learning time. The research evidence suggests that if this is implemented reasonably well, by a typical government agency, we get long-term benefits that significantly exceed costs.

For many later interventions, implementation is more complex, and more politically and substantively difficult. For example, improving teacher quality and school quality in K-12 education is a huge challenge.

The Kalamazoo Promise is an exception to this general pattern for later interventions. The program is simple: graduate from high school and get into a college, and you get a scholarship that pays the tuition. The form required to get the Promise is one page long.

Third, I think the Promise points out one of the virtues of universal programs: simplicity. The Promise would be much more complicated to implement if eligibility depended on family income, student performance in high school, etc. And a more complicated program would be more difficult to explain to parents and students, which makes it less likely to affect attitudes and behavior.

Targeted programs are more complicated to administer, and hence more costly. They are harder to explain to those who are eligible, which restricts participation. Targeting also imposes an implicit tax on earnings, which may discourage labor force participation.

Having said that, targeting is obviously justified if the benefits are demonstrably greater for the targeted group, and if we can organize the targeting so that it is as simple to administer as possible and minimizes the implicit tax on earnings. But the administrative issues with targeting should not be underestimated. Any discussion of universal versus targeted programs in early childhood education needs to consider these practical implementation issues.


Dealing with uncertainty in research on pre-K

Jason Richwine, in a recent blog post at “The Corner” blog of National Review, expressed surprise at my interpretation of the estimated effects in the Head Start randomized control trial.

I had pointed out that the impact estimates, while not statistically significantly different from zero, are also not statistically significantly different from predicting a 2 to 3% increase in adult earnings, which would probably be sufficient for Head Start to pass a benefit-cost test from earnings effects alone.

Richwine argues that the estimates and their confidence intervals also can’t rule out that Head Start has negative effects. He interprets my comments as arguing that the Head Start impact estimates are “large”. He concludes by arguing the following:

“Such analysis reverses the traditional burden of proof: Rather than showing that government preschool works, advocates now demand proof that it doesn’t work.”

These comments raise some interesting issues about how policymakers should make policy when given research that inevitably has some uncertainty about its estimates.

In making policy decisions, concepts such as the “burden of proof” are more confusing than helpful. The “burden of proof” is a legal concept used in court cases. In making policy, what we have are estimates with some uncertainty, and we have to decide what policy rules are likely over the long-haul to maximize net social benefits.

If the only evidence on public pre-K were the Head Start experiment, policymakers would face a difficult policy decision with considerably uncertain evidence. The point estimates of test score effects at the end of 3rd grade suggest a little more than a 1% increase in adult earnings. This is a modest-sized effect, in my opinion, not a "large effect", although what is "large" or "modest" is a highly subjective judgment, not a rigorous scientific one. But because adult earnings are so large over an entire career, even a 1% gain would sum to many thousands of dollars. The present value of this earnings gain would probably exceed $5,000. Head Start costs more than that, but then Head Start also clearly has benefits in the value of the child care services it provides to parents. So the point estimate implies a close call on net benefits.
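To show why even a 1% earnings effect can be worth thousands of dollars, here is a rough present-value sketch. Every input is an illustrative assumption of mine, not a figure from the Head Start study: average earnings of $40,000 a year, a 45-year career from age 18 to 62, a 3% real discount rate, and discounting back to age 4:

```python
# Back-of-the-envelope present value of a 1% adult earnings gain.
avg_earnings = 40_000            # assumed average annual earnings
gain = 0.01 * avg_earnings       # $400 per year
r = 0.03                         # assumed real discount rate
age_at_program = 4

pv = sum(gain / (1 + r) ** (age - age_at_program) for age in range(18, 63))
print(f"present value at age {age_at_program}: ${pv:,.0f}")  # about $6,700
```

Under these assumptions the gain is worth roughly $6,700 at the time of the program, consistent with the "would probably exceed $5,000" figure above.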

Furthermore, there is significant uncertainty in these estimates.  The confidence interval includes zero and negative effects, as well as positive effects two or three times as large. How should policymakers deal with such uncertainty?

One approach is to take a skeptical attitude, and assume effects are zero until proven otherwise. But this skeptical approach would not be a particularly good policy rule to adopt if one were faced with many policy decisions over a long period of time. If a policymaker were simply trying to maximize the expected present value of net benefits over thousands of policy decisions, each with evidence from only one experiment, then the optimal decision rule would be to use each experiment’s point estimate to guide decisions, regardless of the confidence intervals. If we use the point estimates, which represent the mean expected impact of each intervention, then over time we will maximize net social benefits by following this rule.
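A stylized simulation makes the decision-rule point concrete. The distributions below are invented: each of 100,000 hypothetical policies has a true net benefit drawn from a standard normal, and each experiment measures it with standard normal noise, so many estimates are statistically insignificant:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

true_effects = rng.normal(0, 1, n)              # true net benefits
estimates = true_effects + rng.normal(0, 1, n)  # noisy point estimates (SE = 1)

adopt_point = estimates > 0      # adopt whenever the point estimate is positive
adopt_signif = estimates > 1.96  # adopt only when significantly positive

print(f"point-estimate rule: {true_effects[adopt_point].sum():12,.0f}")
print(f"significance rule:   {true_effects[adopt_signif].sum():12,.0f}")
```

In this setup the point-estimate rule accumulates roughly two and a half times the total net benefits of the significance rule, because demanding significance forgoes many policies whose expected benefits are positive.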

In other words, the legal “burden of proof” principle is not a particularly good guide to making policy decisions over time. The legal rule that we should convict someone of a crime only if they are guilty “beyond a reasonable doubt” is ultimately based on the judgment that we find it socially abhorrent to deprive someone of their life or liberty based on any lesser standard. The huge social cost of convicting an innocent is not really relevant to deciding whether to spend a little more or less on some social or educational program. The costs of mistakenly expanding a social or educational program are not as great as the cost of locking someone up because the probability is 51% that they are guilty.

Another important point is that the Head Start experiment is NOT the only good evidence on the effects of pre-K. We have good evidence from two randomized experiments, Perry and Abecedarian, that pre-K can have large long-run effects. For example, long-run earnings effects are 19% in Perry. We also have good evidence from some natural experiments of long-run earnings effects, for example 8% in the Chicago Child-Parent Center study and 11% in Deming’s study of Head Start.  Finally, we have some good natural experiments, for example in Tulsa and Boston, that show short-run test score effects of pre-K that are larger than found in the Head Start experiment.

In social science, or for that matter natural science, how we interpret any new experiment is influenced by what we already know. If we have substantial reasons from prior research to believe that variable X affects outcome Y, then in considering new evidence, our prior belief is not that X has no effect on Y. In interpreting the new research, we would ask whether the estimated effects in the new research are consistent not only with a null hypothesis of zero effects, but also with a null hypothesis of the estimated effects implied by prior research. Both of these null hypotheses are interesting to explore.  If the new research shows lower effects of X on Y than implied by prior research, this should influence us going forward towards believing that X has lower effects.

In the case of the Head Start experiment, the modest effects found should influence researchers towards believing that at least some pre-K programs have considerably smaller effects than found by Perry or the Chicago Child-Parent Center study or the Tulsa or Boston studies. It should also influence us towards wondering whether Head Start as of the 2002 experiment might have lower effects than it did in the past. And it might influence us towards desiring to reform Head Start to increase its effectiveness, in part by imitating the practices of pre-K programs that have larger estimated effects.  As Barnett has pointed out, there is some evidence that Head Start has increased its educational effectiveness since the time of the 2002 experiment.

The Head Start experiment by itself is not strong evidence in favor of public pre-K. But it is not the only evidence, and it is not necessarily inconsistent with this other evidence. On the whole, the weight of the evidence, as suggested by a number of reviews of the research, is that high-quality pre-K programs can make a significant difference in improving the opportunities of children.  The estimated benefits in the bulk of the research are sufficient to be significantly greater than program costs.


More on weighing the evidence on pre-K

Andrew Coulson of the Cato Institute has a blog post commenting on the debate between me and Russ Whitehurst over what evidence to believe about the effects of pre-K programs.

Coulson’s argument is that the only reliable evidence for ascertaining the effects of “large-scale” public pre-K programs is the randomized control trials for Head Start and Tennessee pre-K, and that these studies reach a consensus: “program effects fade out by the elementary school years…” In Coulson’s view, this is the “evidence that matters when discussing proposals for expanding government pre-K”.

What evidence doesn’t matter, in Coulson’s view? First, he doesn’t regard the Perry Preschool program and Abecedarian “randomized control trial” studies as mattering, because these were “tiny programs”, and it is “difficult to massively replicate any service without compromising its quality”.  Second, he doesn’t regard any non-experimental study as mattering.

The main thing that Coulson’s blog post overlooks is that not all randomized control trials provide evidence of equal quality, and that not all non-experimental studies are of equal quality.  Some non-experimental studies provide evidence on the true causal effects of public policies that is superior to that of some randomized control trial studies.

As mentioned in a previous blog post, the key problem that all pre-K studies are trying to address is “selection bias”. We are concerned that children and families participating in pre-K may differ from those who do not participate. We can control for observed characteristics of the children and families, but unobserved characteristics of children and families may differ between the treatment and comparison groups. These unobserved differences could be causing different outcomes, which would bias the estimated effects of pre-K by some amount that is potentially large and of unknown sign.

In theory, a perfectly run randomized control trial addresses this selection bias issue. Because participation is determined randomly, the treatment and comparison groups would be expected to on average be similar in unobserved characteristics.

But this advantage fully holds only for the original sample. If there is large enough attrition from the randomly chosen groups, the final samples of treatment and comparison households could easily differ greatly in unobserved characteristics. The final observed sample would then no longer really be randomly chosen.
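A small simulation illustrates how differential attrition can undo randomization. The numbers are invented (a true 0.2 SD effect; the 46%/32% retention rates echo the Tennessee study), with control-group retention assumed to rise with family advantage:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Invented RCT: true pre-K effect of 0.2 SD; outcomes also depend on
# family advantage; assignment is random.
advantage = rng.normal(0, 1, n)
treated = rng.random(n) < 0.5
outcome = 0.2 * treated + 0.5 * advantage + rng.normal(0, 1, n)

# Differential attrition: treated families are retained at 46%; control
# retention averages 32% but rises with family advantage.
keep_prob = np.where(treated, 0.46, np.clip(0.32 + 0.05 * advantage, 0, 1))
kept = rng.random(n) < keep_prob

full = outcome[treated].mean() - outcome[~treated].mean()
retained = outcome[treated & kept].mean() - outcome[~treated & kept].mean()
print(f"full randomized sample: {full:.3f}")     # close to the true 0.2
print(f"retained sample only:   {retained:.3f}") # biased downward here
```

Flip the sign of the advantage term and the bias flips too – which is why the direction of attrition bias has to be checked case by case.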

In the case of pre-K, as I pointed out in a previous blog post, the Tennessee pre-K study has some serious problems with large and differential attrition. Therefore, I don't think it is at all appropriate to cite it as meeting some "gold standard" that provides evidence more reliable than any study that makes a serious effort to control for observable characteristics. The evidence from the Tennessee pre-K study is suggestive but not definitive, as is true for all studies that can only control for observable characteristics.

On the other hand, non-experimental studies can persuasively deal with selection bias if they rely on a "natural experiment", in which access to the pre-K program varies due to geography or age or some other factor that is plausibly unrelated to unobserved child and family characteristics. Not all "non-experimental studies" are created equal. Some can only control for observable characteristics. Others have variation in access that, while not randomly assigned, may be almost as good as random for identifying the causal effects of the pre-K program. One such study may be an outlier, but multiple studies increase confidence that we are identifying true causal effects of pre-K on outcomes.

We have many “natural experiments” that show large effects of pre-K. These include: Ladd et al.’s study of North Carolina pre-K; Ludwig et al., Currie et al., and Deming’s studies of Head Start; the various studies of the Chicago Child-Parent Center program; the various “regression discontinuity” studies of state and local pre-K programs. This evidence matters.

The Head Start and Tennessee evidence also matters. The Tennessee evidence is only suggestive, but it is not particularly supportive of the effectiveness of that state's program, which may have too little funding per child to be effective. The Head Start RCT evidence also suggests that Head Start during that period may not have been particularly effective relative to the other pre-K alternatives available to the control group, which included state pre-K programs. Changes over time in Head Start quality and in Head Start alternatives are the most straightforward way to reconcile the fading effects in the Head Start RCT with the previous natural experiments suggesting that Head Start has long-term effects. Recent Head Start reforms may have improved its quality relative to the time of the RCT. Head Start reforms to improve quality should continue.

Two other points. First, I would challenge the argument that Perry and Abecedarian don't matter. These are programs with characteristics and delivery models that are well understood. Today's Educare program is quite similar to the Abecedarian program. Perry is essentially a smaller-class-size, two-year version of today's state and local pre-K programs. I think the Perry and Abecedarian evidence, in conjunction with the natural experiments, suggests that high-quality pre-K can make a difference.

Second, fading test-score effects are found in a variety of pre-K programs, including Perry, Abecedarian, and the Chicago CPC program. Such fading test-score effects appear to be consistent with large long-term effects on adult outcomes.  One theory to account for this fading and re-emergence is the importance of so-called “soft skills”, such as social skills and character skills. Such soft skills are important in determining educational attainment and adult earnings.

A broad view of the research evidence on pre-K suggests that a variety of programs have strong effects, although this should not be interpreted as meaning that every program always works. There is a lot of variation in program quality and program effects over space and time. For other recent research reviews that also take a broader view of the research evidence, and conclude that pre-K can work, see the review by Yoshikawa et al., and the review by Kay and Pennucci (Report 14-01-2201) for the Washington State Institute for Public Policy.
