Reducing inequality may sometimes increase economic growth – and a specific example is early childhood education

Nobel prize-winning economist Paul Krugman devoted his column this morning to recent empirical evidence from the International Monetary Fund indicating that reducing income inequality need not reduce economic growth. This goes against a tradition among economists of seeing an inherent tradeoff, in which reduced income inequality can be pursued only at the cost of reduced economic output or growth.

A prime example of a public policy that both reduces inequality and promotes economic growth is increasing access to high-quality early childhood education, such as pre-K programs and high-quality child care.

As I’ve mentioned in previous posts, high-quality pre-K can increase the adult earnings of children from low-income families by 10% or more.  High-quality child care and pre-K from birth to age 5 can increase the adult earnings of children from low-income families by over 25%.

Yet these policies will also increase economic growth. The evidence suggests that these extra skills and earnings for children from low-income families will provide spillover economic benefits for the rest of society.

These spillover benefits occur because my earnings depend in part on the skills of my fellow workers, in my firm, and elsewhere in my local economy. Firms are better able to introduce new technologies when a higher percentage of all workers are skilled, so my firm may be more competitive when my fellow workers get more skills. Firms’ competitiveness also depends on the skills of local suppliers, so my wages may depend on the skills of those suppliers’ workers. Firms may also be more innovative if they are able to get ideas and skilled workers from other local firms.

How do these spillover benefits occur? They occur because firms invest more and create more local jobs when a local economy increases its overall skills. Expanded pre-K and other early childhood education programs can expand the local skills base. A worker can benefit from such an expansion of early childhood education even if his or her own skills would have been fine without it – the increased skills of other workers will boost job creation and productivity for all local workers.

Early childhood education is a prime example of a case where all workers share in the economic fortunes of an economy, which depend in part on everyone’s skills. Investing in “other people’s children” not only is a moral issue, but also an issue of enlightened self-interest.

Posted in Distribution of benefits

Grading the Pre-K Evidence

Russ Whitehurst of Brookings has a new blog post that outlines his views on pre-K research in more detail.  The title is “Does Pre-K Work? It Depends How Picky You Are”.

Whitehurst reaches the following conclusion:

“I conclude that the best available evidence raises serious doubts that a large public investment in the expansion of pre-k for four-year-olds will have the long-term effects that advocates tout. 

This doesn’t mean that we ought not to spend public money to help families with limited financial resources access good childcare for their young children.  After all, we spend tax dollars on national parks, symphony orchestras, and Amtrak because they make the lives of those who use them better today.  Why not childcare? 

It does mean that we need public debate that recognizes the mixed nature of the research findings rather than a rush to judgment based on one-sided and misleading appeals to the preponderance of the evidence.”

Therefore, it is fair to say that Whitehurst is marketing doubt. Maybe pre-K doesn’t work. Maybe we shouldn’t move forward with large-scale programs, and instead should undertake more limited measures or do more research.

He admits that opponents of his position, who believe that pre-K does work, are also basing their position on scientific research, and wonders: “how is it that different individuals could look at the same research and come to such different conclusions?”

His framing of the issue is that he is just more “picky” about what research he believes. In his view, his opponents, when claiming that the “preponderance” of evidence supports pre-K, are relying on weak research, whereas he is relying on the strongest research in saying that pre-K does not work.

In his view, the strongest research, to which he gives straight “As” for quality, is the recent Head Start randomized control trial (RCT) and the recent Tennessee RCT.  All the other evidence for the effectiveness of pre-K, in his view, is inferior in research rigor (“internal validity”) and/or less policy relevant to today’s policy choices (“external validity”).

Let me make some summary comments upfront before getting into the details of Whitehurst’s research review.

First, I think all researchers seek to be “picky” in reviewing research, trying to assess both its rigor and its relevance to the policy question at hand. However, even researchers who are equally “picky” can disagree about the strengths and weaknesses of various studies.

Second, in my view, Whitehurst significantly overstates the quality and relevance of the Tennessee RCT, and the relevance of the Head Start RCT.  He’s not “picky” enough!

Third, Whitehurst underplays the findings and understates the research strengths and relevance of many other research studies.  He also omits recent relevant research.

Fourth, Whitehurst never grapples with a fundamental issue in pre-K research: it does not take much of a pre-K impact on test scores for pre-K’s predicted earnings benefits over an entire career to justify considerable costs. Effects he characterizes as “small” are in many cases more than sufficient for programs to pass a benefit-cost test.

Fifth, Whitehurst never discusses another fundamental issue in pre-K research: test score effects often fade as children go through the K-12 system, but effects on adult outcomes such as educational attainment or earnings re-emerge despite the fading. The faded test score effects are often poorer predictors of adult outcomes than the initial post-program test score effects. This means that studies with good evidence on adult outcomes, and studies with good evidence on immediate post-program outcomes, gain in importance relative to studies that only go through elementary school. The elementary school test data add some evidence, but not as much as might at first appear.

Sixth, if Whitehurst believes in the usefulness of child care services, the most logically consistent position is that he should back expanding programs such as Educare (full-time child care and pre-K from birth to age 5) to all low-income children. In my book Investing in Kids, I argued that the research evidence on child care and on the Abecedarian program, which was very similar to today’s Educare program,  suggested that a program such as Educare would have earnings benefits for parents that significantly exceeded program costs.

So why not expand Educare, which would help low-income parents increase their work and their educational attainment, leading to significant boosts to parents’ short-run and long-run earnings? If Educare also helps improve the children’s long-run prospects, so much the better. (And in fact Whitehurst seems to like the Infant Health and Development Program research that supports that there would be such benefits for low-income children from an Educare-style program.)  

I estimate that an Educare program for all families below the poverty line would cost around $70 billion per year, but would have parental earnings benefits significantly greater than that. This proposal would be consistent with a previous proposal made by colleagues of Whitehurst at Brookings. Such a proposal goes far beyond the cost of any preschool proposal made by the Obama Administration. But I think it would be a logically consistent proposal for Whitehurst to make. Whitehurst should be arguing that the Obama Administration preschool proposal is underfunded, insufficiently comprehensive in its birth-to-five services, and insufficiently targeted on low-income families. (Note: this is not my position; for example, I’m in favor of universal pre-K. What I am describing is the position that is most consistent with Whitehurst’s own review of the research evidence.)

Before I get into the details, one more important headline issue: why should policymakers, journalists, or other policy “influencers” believe my position, that the best evidence supports pre-K’s effectiveness, rather than Whitehurst’s position, that the research is more uncertain? The best way is simply to read the research studies and make up your own mind, but how is one supposed to do that without an extensive background in statistics and research methodology?

Whitehurst’s position of doubt has a structural advantage in the debate in the public square.  Some researchers argue that pre-K works, others say it may not: the headline news to an outside observer is that doubt wins the debate as long as the side that is promoting doubt has a consistent position that cites evidence.  It’s easier to spread doubt than to assuage doubt.

Therefore, I would also make the following argument: many other researchers familiar with the pre-K research evidence disagree with Whitehurst, and agree that pre-K can work.  Among pre-K researchers, Whitehurst’s weighting of the evidence is a distinct minority position.

Consider a recent research summary, “Investing in Our Future: The Evidence Base on Preschool Education”, which was authored by 10 prominent researchers on pre-K from a variety of disciplines and universities.  This study concluded the following:

“Recent meta-analyses drawing together the evidence across decades of evaluation research now permit us to say with confidence that preschool programs can have a substantial impact on early learning and development….

While there is clear evidence that preschool education boosts early learning for children from a range of backgrounds, we also see a convergence of test scores during the elementary school grades so that there are diminishing differences over time on tests of academic achievement between children who did and did not attend preschool. Yet the most recent research is showing an accumulation of evidence that even when the difference in test scores declines to zero, children who have attended preschool go on to show positive effects on important adolescent and young adult outcomes, such as high school graduation, reduced teen pregnancy, years of education completed, earnings, and reduced crime…

“Although random assignment of children or parents to program and comparison groups is the “gold standard” for program evaluation, sometimes this is not possible. One of the most frequently used alternative methods…is called a Regression-Discontinuity Design. …Comparing kindergarten entry achievement scores for children who have completed a year in Pre-K with the scores measured at the same time for children who just missed the birthday cutoff and are about to enter Pre-K can be a strong indicator of program impacts…Other methods used in recent nonexperimental preschool studies include propensity score weighting, individual, sibling or state fixed-effects, and instrumental variables analysis…Evaluations that select comparison groups in other ways should be approached with healthy skepticism.”

Therefore, it is clear that other researchers weight the evidence quite differently from Whitehurst. This is in part because other researchers, while noting that RCTs are the “gold standard”, view other studies as having sufficiently good comparison groups that they provide good “silver standard” evidence.  Other researchers are also aware that few RCTs are so perfect that they are pure “gold standard”; in practice, we find that the gold is almost always alloyed with some less precious metal.

Now, onto the details. To do this, I’ll regrade the various studies that Whitehurst examines, along with adding one recent study that he omits. I’ll use his criteria of looking at “internal validity” (intuitively, the reliability of the research in identifying some causal effect of some program), and “external validity” (intuitively, the study’s relevance to current policy debates). I’ll also add my own take on what the reported impact shows.

Programs from the 1960s and 1970s

Program/Research | Reported impact after initial year (Whitehurst / Bartik) | Internal validity (Whitehurst / Bartik) | External validity (Whitehurst / Bartik)
Perry | + / + | A- / A- | C / B
Abecedarian | + / + | B+ / B+ | C / B
Chicago CPC | + / + | C / B- | B / B+
Head Start in 60s | + (for mortality) / + (for mortality and ed attainment) | B / B | C / C

For Perry and Abecedarian, I would upgrade their “external validity/policy relevance” grade from C to B because these programs’ designs and results are quite relevant to what we are doing today. Abecedarian is quite similar to today’s Educare program. Perry is similar in many respects to today’s pre-K programs. Class sizes in Perry were smaller than in most of today’s programs, and the program ran for two years versus one year for most programs today, both of which would tend to make today’s programs’ impacts somewhat smaller than Perry’s estimated adult earnings impact of 19%. However, most studies suggest modest impacts of class size on pre-K outcomes, and that two years of pre-K does not double benefits, so we would not expect Perry’s effects to far exceed those of current pre-K programs. And many of today’s pre-K programs are full-day, which has been shown to have larger impacts than half-day.

Another aspect of Perry and Abecedarian that increases their relevance is that they contain direct evidence on effects on adult outcomes. Because effects on test scores often fade, and these faded test score effects may not reflect adult outcomes, this makes these studies more important.

In addition, it is true that Perry and Abecedarian are experiments that mostly compare pre-K with no pre-K, whereas today any new pre-K program is comparing the new pre-K program with both no pre-K and with some control group members going to some other subsidized pre-K program. But this merely complicates the analysis of the impact of a new pre-K program and puts a premium on the program being as high-quality as existing programs. A real world benefit-cost analysis of a new pre-K program will adjust the benefits and costs downwards for substitution of the new program for existing programs.  For example, the Rand Corporation did this in its analysis of the effects of a universal pre-K program. Because the options in the pre-K market are always changing, even today, impact analyses of a new pre-K program will have to make adjustments for changes in the options in the pre-K market.  There are some scientific advantages to having “clean” estimates of the impact of pre-K versus no pre-K, which Perry and Abecedarian provide.

For Chicago CPC, Whitehurst fails to note that much of the variation in pre-K use was due to neighborhood of residence, and hence the study can be viewed as in part a “natural experiment”. Furthermore, the CPC researchers have gone to great lengths to correct for any remaining selection bias in the estimates, and have found that a variety of methods for doing so yield similar results. Therefore, I think the internal validity of CPC is higher than Whitehurst’s grade of C. CPC is in between a “B grade” study based on a natural experiment and a “C grade” study that simply controls for observable characteristics.

In addition, the external validity and policy relevance of CPC is quite high, as the program was run by Chicago Public Schools and is quite similar to pre-K programs run in many state and local areas.  So Whitehurst’s grade there also seems too low.  The study also includes direct evidence on adult earnings effects and adult educational attainment effects.

As for Head Start, Whitehurst’s table says that the Ludwig-Miller study he cites only finds long-term impacts on mortality. But the study also finds some long-term impacts on educational attainment, which he does not note.

Programs from the 1980s


Program/Research | Reported impact after initial year (Whitehurst / Bartik) | Internal validity (Whitehurst / Bartik) | External validity (Whitehurst / Bartik)
Head Start in 1980s | + / + | C / B- | A / C
Infant Health and Development | + / + (both graders: impacts only for disadvantaged children with close to normal birth weights) | A / A | B / B-

For the Head Start sibling studies from the 1980s, Whitehurst argues that the sibling comparison will be biased towards finding effects of Head Start. However, as discussed in the research, there are reasons to think that the bias in which sibling gets into Head Start could go in either direction. Also, research such as Deming’s tries to look very closely at pre-existing characteristics to see whether there is a significant bias, and does not find strong signs of bias sufficient to overturn the main results.

As for policy relevance/external validity, I regard many of the pre-K programs that we are pursuing today at the state level, and considering trying to encourage via federal policy, as much more educationally focused than was traditionally the case for Head Start, although this may be changing for Head Start in recent years. Therefore, it is unclear to me whether the effects of Head Start are as relevant as Whitehurst thinks to current state and local pre-K programs.

On IHDP, it is true that the results are only significant for disadvantaged children of close to normal birth weight. However, the near-normal birth weight group is the group most relevant to current debates about early childhood education. As for policy relevance/external validity, IHDP is really a test of an early child care program, at ages 1 and 2.  This is relevant to evaluating a program such as Educare, but not to evaluating most current proposals for pre-K at age 4.

Recent Programs


Program/Research | Reported impact after initial year (Whitehurst / Bartik) | Internal validity (Whitehurst / Bartik) | External validity (Whitehurst / Bartik)
Head Start RCT | None / None statistically significant, but point estimates consistent with important effects | A / A | A / C
District programs, e.g., Tulsa | Unknown (research design doesn’t allow follow-up after pre-K) / Unknown, but would predict sizable adult earnings effects based on test scores | B / B | B / B+
Georgia & OK Universal | + (very small at best) / + (large enough to have high benefit-cost ratio) | B / B- | A / B+
Tennessee Pre-K | - / - | A / C | A / B+
North Carolina More at Four | Not included / + | NA / B | NA / B+

The recent Head Start RCT does not show much in the way of statistically significant effects from kindergarten on, but some of the point estimates are consistent with effects that might be important. For example, the point estimates on the cognitive tests that are consistently given over time in the experiment show effects at third grade that would predict about a 1% increase in adult earnings, which adds up to a lot of money over an entire career. Given the uncertainty, the true effect could be 2 or 3%, or could be zero or negative – we just can’t tell.
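To make concrete why a predicted 1% earnings increase “adds up”, here is a rough present-value sketch. Every parameter below (average salary, discount rate, career span) is an illustrative assumption of mine, not a figure from the Head Start study:

```python
# Present value, per child, of a 1% lifetime earnings gain.
# All parameters are illustrative assumptions, not study estimates.
AVG_SALARY = 40_000            # assumed average annual earnings
EFFECT = 0.01                  # a 1% earnings effect
DISCOUNT = 0.03                # assumed annual discount rate
WORK_START, WORK_END = 20, 65  # assumed career span (ages 20 through 64)
PROGRAM_AGE = 4                # discount back to the age of the program

pv = sum(
    (AVG_SALARY * EFFECT) / (1 + DISCOUNT) ** (age - PROGRAM_AGE)
    for age in range(WORK_START, WORK_END)
)
print(f"Present value at age {PROGRAM_AGE}: ${pv:,.0f}")
```

Under these assumptions the gain is worth several thousand dollars per child in present value, which is the right order of magnitude to compare against per-child program costs.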

As mentioned before, one issue that Whitehurst does not grapple with is that even quite small test score effects would predict adult earnings effects that might be very important in a benefit-cost analysis. This makes research more difficult: it is hard to statistically rule out test score effects that, while small in absolute terms, are large enough to make the program pay off. Even with a relatively large sample, such as in the Head Start RCT, the studies are “underpowered” for detecting some effects that might be relevant.
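The “underpowered” point can be made concrete with a standard minimum-detectable-effect calculation for a two-group comparison of means. The sample size below is an assumed round number, not the Head Start RCT’s actual analytic sample:

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Smallest true effect (in standard-deviation units) that a
    two-group comparison of means can detect with the given
    significance level and statistical power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, ~1.96
    z_beta = z.inv_cdf(power)           # power term, ~0.84
    return (z_alpha + z_beta) * sqrt(2 / n_per_group)

# With an assumed 2,000 children per arm, true effects below about
# 0.09 standard deviations are more likely than not to be missed.
print(round(min_detectable_effect(2000), 3))  # → 0.089
```

Because even test-score effects well below 0.09 standard deviations could predict earnings gains that matter in a benefit-cost analysis, null results from a sample of this size cannot by themselves rule out a program that pays off.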

This problem is exacerbated because studies frequently find that test score effects at third grade underpredict the long-run earnings effects of pre-K. Test score effects often fade but then re-emerge in better adult outcomes than would have been predicted at 3rd grade.  This increases the uncertainty about the results beyond what is described by the Head Start RCT’s standard errors.  

The other issue with the Head Start RCT is its relevance to current policy debates. First, as noted before, many of the state and local pre-K programs being debated are more educationally focused than has traditionally been the case for Head Start, which raises the issue of whether the Head Start results are generalizable to these state and local pre-K programs.

Second, the Head Start RCT is not comparing Head Start with no Head Start. Only 80% of the treatment group enrolled in Head Start. About half of the control group attended some pre-K program, including 14% in Head Start and 35% in some other pre-K program. If some of these other pre-K programs were more educationally focused than Head Start, this would reduce the net impact of a “More Head Start” treatment group versus an “Other Pre-K” control group. The issue would still remain of whether Head Start’s generally higher costs per child are justified by stronger results. But the Head Start RCT does not do a great job of answering the question “do educationally focused pre-K programs work?”

The various regression discontinuity studies of state and local pre-K programs, as Whitehurst notes, by their design cannot detect long-term effects. However, based on other studies, early post-program test scores frequently are better predictors of a program’s long-run adult earnings effects than are later test scores. Therefore, the early test score information is more valuable than might at first appear.
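For readers who want to see the mechanics, here is a minimal simulation of the birthday-cutoff RDD design these studies use. The data are fabricated, and the specification (a local linear fit with separate slopes on each side of the cutoff) is one common choice, not the exact specification of any of the studies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated data: children on one side of the birthday cutoff were old
# enough to attend pre-K and are tested at kindergarten entry; children
# who just missed the cutoff are tested at the same time, entering pre-K.
n = 4000
days = rng.uniform(-180, 180, n)     # birthday relative to the cutoff
treated = (days >= 0).astype(float)  # old enough to have attended pre-K
true_effect = 0.25                   # effect in test-score SD units
score = 0.002 * days + true_effect * treated + rng.normal(0, 0.5, n)

# Regress score on the treatment indicator and the running variable,
# allowing a different slope on each side of the cutoff; the coefficient
# on `treated` estimates the jump in scores at the cutoff.
X = np.column_stack([np.ones(n), treated, days, treated * days])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
print(f"estimated jump at cutoff: {beta[1]:.2f} SD")
```

The design’s strength is that children just on either side of the cutoff should be nearly identical except for pre-K attendance, so the jump at the cutoff is plausibly causal.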

Whitehurst’s grade and mine on the internal validity of the state and local RDD studies are the same, at B. However, Whitehurst’s text makes some disparaging remarks about RDD studies, which I have dealt with in previous blog posts.

Whitehurst’s problems with RDD seem to lead him to downgrade these studies’ external validity, which seems like the wrong place to mark down the studies for any perceived issues with the RDD method. It seems to me that current studies of state and local pre-K programs are about as relevant as one can get to whether expanding such programs today is a good idea. I only give a grade of B+ because the fact that Location X’s pre-K program works does not necessarily mean that Location Y’s program will work – there might be quality differences between the two programs.

For Georgia and Oklahoma’s universal pre-K programs, I think Whitehurst mistakes the magnitude of these results. He states that these studies find less than a one-point difference on fourth-grade NAEP scaled scores. But the Cascio and Schanzenbach study he cites finds fourth-grade NAEP score effects of about 3 points in the estimates they regard as preferred. They also state that it would only take a NAEP score effect of 1.0 to 1.4 points for these programs to pass a benefit-cost test. “Small” and “large” are fuzzy terms; I would define “large” as large enough for the program to plausibly pass a benefit-cost test.

The estimates in Cascio and Schanzenbach for Georgia and Oklahoma are statistically insignificant when the most rigorous corrections for statistical noise are made. This in part reflects an inherent problem in studies of aggregate data on one or two states – there’s so much noise in individual state test score trends that it is difficult for any intervention, even one with large effects, to show statistically significant effects.

For this reason, I downgrade the internal validity of such studies of universal programs: standard errors tend both to be large and to be difficult to estimate correctly in studies where one or two geographic units in the treatment group are compared with all other geographic units. Estimates are often more imprecise than standard statistical software packages indicate.

As for external validity, I see no basis for giving a stronger or weaker external validity grade to the Georgia and Oklahoma studies than to studies of Kalamazoo, Tulsa, Boston, Michigan, New Jersey, South Carolina, West Virginia, Oklahoma, New Mexico, and Arkansas, which are all examined in the RDD research. These are all studies of state and local pre-K programs, and are generalizable to other state and local pre-K programs if those other programs are of similar quality.

For Tennessee Pre-K, as I have noted in previous blog posts, although the original design of this study was a randomized control trial, problems with attrition mean that the study falls well short of the gold standard. For example, in the first cohort of children, the study was only able to get test score data from 46% of the pre-K participants versus 32% of the control group. The original treatment and control groups were randomly chosen, but this is not true of the children for whom we actually have test score data.

Furthermore, there is some evidence that this attrition leads to bias, in that the full sample shows a reduction in kindergarten retention from 8% to 4%, and the smaller sample with test score data only shows a reduction from 8% to 6%.  In addition, these “retention” effects suggest that the program must be doing something to student achievement that is not fully reflected in test scores, otherwise why would retention be cut in half, as it is in the full sample?

For all these reasons, I regard the Tennessee study as meeting not a gold standard, or a silver standard, but a bronze standard. It is similar to the many other studies that are NOT discussed by Whitehurst that try to evaluate pre-K by controlling for observable characteristics of students, an approach that cannot correct for selection bias.

As for external validity, the Tennessee study is definitely relevant to other state pre-K programs, but it is most relevant to the programs of states that are not spending enough per child on pre-K. According to the National Institute for Early Education Research, Tennessee’s spending is over $2,000 per child less than what is judged to be desirable for high-quality pre-K. So Tennessee’s program may be relevant to some proposed state and local pre-K programs, but perhaps not so much to more fully funded pre-K programs.

Finally, there is the recent study of North Carolina’s “More at Four” program, which I reviewed in a recent blog post. Whitehurst does not mention this study. This is a good silver-standard study because it relies on a “natural experiment”: the More at Four program was gradually rolled out over time in different counties. It is hard to see why a county’s 3rd-grade test scores, measured the appropriate number of years later, would be correlated with earlier More at Four spending except through a true effect of the program. And as with other studies of state and local pre-K programs, the study is highly relevant, as long as one remembers that each state’s program is different.

Overall, for the 10 studies/groups of studies that are graded by both Whitehurst and me, Whitehurst’s average grade is 3.15, or between a B and a B+, and my average grade is 2.95, slightly less than a B. Whitehurst isn’t quite as tough a grader as I am, and in that sense he is not quite as “picky” as I am.
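As a transparency check, both averages can be reproduced from the grades in the tables above using a conventional 4.0 grade-point mapping. The mapping itself is my assumption, but it matches both reported averages:

```python
SCALE = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C": 2.0}

# (internal validity, external validity) grades for the 10 studies, in
# table order: Perry, Abecedarian, CPC, 1960s Head Start, 1980s Head
# Start, IHDP, Head Start RCT, district RDD studies, GA/OK, Tennessee.
whitehurst = [("A-", "C"), ("B+", "C"), ("C", "B"), ("B", "C"), ("C", "A"),
              ("A", "B"), ("A", "A"), ("B", "B"), ("B", "A"), ("A", "A")]
bartik = [("A-", "B"), ("B+", "B"), ("B-", "B+"), ("B", "C"), ("B-", "C"),
          ("A", "B-"), ("A", "C"), ("B", "B+"), ("B-", "B+"), ("C", "B+")]

def average(grades):
    points = [SCALE[g] for pair in grades for g in pair]
    return sum(points) / len(points)

print(round(average(whitehurst), 2), round(average(bartik), 2))  # 3.15 2.95
```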

Where we differ most obviously is over the Head Start RCT and Tennessee studies, versus the older studies and the other more recent studies. He gives straight A’s to the Head Start RCT and Tennessee study, so these two studies clearly dominate all the other studies from his perspective. In contrast, I give an average grade of B to the Head Start RCT, and B- to the Tennessee study. I give grades higher than B to Perry, Abecedarian, IHDP, the state/local RDD studies, and the North Carolina study, and B grade averages to CPC and the OK/GA studies. So, in my view, this other evidence dominates – there’s a preponderance of evidence of equal or greater quality suggesting that pre-K can work.

Of course, the Head Start RCT and Tennessee evidence still matters – this evidence suggests that there are some doubts as to whether Head Start as of 2002 was as educationally effective as some state pre-K programs, and the Tennessee evidence raises some doubts about that state’s pre-K program. But there is no way in which I view this evidence as trumping all the other evidence, which seems to be Whitehurst’s view.  

Whitehurst is unusual among researchers in privileging the Head Start RCT and Tennessee studies over all the other evidence. That doesn’t mean he’s more picky, it simply means he has a different approach than most researchers to thinking about the strengths and weaknesses of different research approaches.   

(Note: Steve Barnett of the National Institute for Early Education Research has independently provided some reactions to Whitehurst’s blog post.  I wrote the first draft of this blog post prior to seeing Barnett’s reaction, and did not significantly revise the post to avoid overlap – there’s some overlap, but considerable independent information. )

Posted in Early childhood program design issues, Early childhood programs

The appeal of universal programs rests in part on simplicity

A summary of my paper with my colleague Marta Lachowska on the Kalamazoo Promise was recently published in Education Next. (The summary even received a tweet from Arne Duncan!) The Kalamazoo Promise is a program begun in 2005, under which anonymous private donors promised to provide all graduates of Kalamazoo Public Schools with up to 4 years of free tuition at Michigan’s public colleges and universities.

Our paper relied on one aspect of the Kalamazoo Promise that provides a “natural experiment”. Promise eligibility requires that students be continuously enrolled in Kalamazoo Public Schools since the beginning of 9th grade. Our paper compared the behavior and academic achievement of high school students who were “Promise eligible”, versus high school students who were “Promise ineligible”, based on length of enrollment in KPS, from before to after the Promise announcement in 2005.
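The comparison in our design is essentially a difference-in-differences. A toy version with made-up GPA numbers (not the paper’s estimates) shows the logic:

```python
# Difference-in-differences sketch of the Promise "natural experiment".
# All GPA numbers below are made up for illustration only.
gpa = {
    ("eligible", "before"): 2.40,
    ("eligible", "after"): 2.65,
    ("ineligible", "before"): 2.35,
    ("ineligible", "after"): 2.40,
}

# The DiD estimate: the eligible group's before-to-after change minus
# the ineligible group's change, netting out trends common to both.
did = (gpa[("eligible", "after")] - gpa[("eligible", "before")]) - (
    gpa[("ineligible", "after")] - gpa[("ineligible", "before")]
)
print(f"Estimated Promise effect on GPA: {did:+.2f}")  # +0.20
```

Because both groups sit in the same schools and face the same districtwide changes, the subtraction removes shocks common to all students, isolating the Promise’s effect under the usual parallel-trends assumption.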

We found statistically significant and large effects of the Promise on improving the behavior of all students, and on improving high school GPA for African-American students.  The point estimates of effects on all students’ GPA were positive, but insignificantly different from zero. Some of the estimated effects are large. For example, in the 2007-2008 school year, the estimated Promise effect on the GPA of African-American students is an increase of 0.7 points, on a four-point scale.

I think several points from these findings might be relevant to early childhood education advocates.

First, this study, like many other studies, shows that there are definitely many interventions after early childhood that can make a difference in educational attainment and life prospects. I think it is both a political mistake and substantively wrong to argue that early childhood education inherently has a higher rate of return than later interventions. There are many later interventions with high rates of return, for example some high school tutoring and counseling programs, and demand-oriented adult job training programs. The argument for early childhood education is that it has a high benefit/cost ratio or a high rate of return, not that other interventions don’t also have a high rate of return.

Second, I do think it is true that many later interventions are more complicated to implement than early childhood education. In early childhood education, we are essentially adding learning time. The research evidence suggests that if this is implemented reasonably well, by a typical government agency, we get long-term benefits that significantly exceed costs.

For many later interventions, implementation is more complex, and more politically and substantively difficult. For example, improving teacher quality and school quality in K-12 education is a huge challenge.

The Kalamazoo Promise is an exception to this general pattern for later interventions. The program is simple: graduate from high school and get into a college, and you get a scholarship that pays the tuition. The form required to get the Promise is one page long.

Third, I think the Promise points out one of the virtues of universal programs: simplicity. The Promise would be much more complicated to implement if eligibility depended on family income, student performance in high school, etc. And a more complicated program would be more difficult to explain to parents and students, which makes it less likely to affect attitudes and behavior.

Targeted programs are more complicated to administer, and hence more costly. They are harder to explain to those who are eligible, which restricts participation. Targeting also imposes an implicit tax on earnings, which may discourage labor force participation.

Having said that, targeting is obviously justified if the benefits are demonstrably greater for the targeted group, and if we can organize the targeting so that it is as simple to administer as possible, and so that it minimizes the implicit tax on earnings.  But I think the administrative issues with targeting should not be under-emphasized. Any discussion of universality versus targeted programs in early childhood education needs to consider these practical implementation issues.

Posted in Distribution of benefits, Early childhood program design issues, Early childhood programs

Dealing with uncertainty in research on pre-K

Jason Richwine, in a recent blog post at “The Corner” blog of National Review, expressed surprise at my interpretation of the estimated effects in the Head Start randomized control trial.

I had pointed out that the impact estimates, while not statistically significantly different from zero, are also not statistically significantly different from predicting a 2 to 3% increase in adult earnings, which would probably be sufficient for Head Start to pass a benefit-cost test from earnings effects alone.
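
To see how both statements can be true at once, here is a small sketch. The point estimate and standard error below are hypothetical placeholders of mine, not the actual Head Start experiment numbers; the point is only that a confidence interval containing zero can also contain economically meaningful effects.

```python
# Hypothetical numbers for illustration only -- not the actual
# Head Start experiment estimates.
point_estimate = 0.01  # implied adult earnings effect (1%)
std_error = 0.012

lo = point_estimate - 1.96 * std_error
hi = point_estimate + 1.96 * std_error
print(f"95% confidence interval: [{lo:.3f}, {hi:.3f}]")

# The interval includes zero, so the estimate is "insignificant"...
assert lo < 0 < hi
# ...but it also includes a 2-3% earnings gain, so the data cannot
# rule out effects large enough to pass a benefit-cost test.
assert lo < 0.02 < hi and lo < 0.03 < hi
```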

Richwine argues that the estimates and their confidence intervals also can’t rule out that Head Start has negative effects. He interprets my comments as arguing that the Head Start impact estimates are “large”. He concludes by arguing the following:

“Such analysis reverses the traditional burden of proof: Rather than showing that government preschool works, advocates now demand proof that it doesn’t work.”

These comments raise some interesting issues about how policymakers should make policy when given research that inevitably has some uncertainty about its estimates.

In making policy decisions, concepts such as the “burden of proof” are more confusing than helpful. The “burden of proof” is a legal concept used in court cases. In making policy, what we have are estimates with some uncertainty, and we have to decide what policy rules are likely, over the long haul, to maximize net social benefits.

If the only evidence on public pre-K were the Head Start experiment, policymakers would face a difficult policy decision with considerably uncertain evidence. The point estimates of test score effects at the end of 3rd grade suggest a little more than a 1% increase in adult earnings. This is a modest-sized effect, in my opinion, not a “large effect”, although what is “large” or “modest” is a highly subjective judgment, not a rigorous scientific one.  But because adult earnings are so large over an entire career, even a 1% gain sums to many thousands of dollars. The present value of this earnings gain would probably exceed $5,000.  Head Start costs more than that, but Head Start also clearly has benefits in the value of the child care services it provides to parents. So the point estimate implies a close call on net benefits.
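
As a rough illustration of that present-value arithmetic, the sketch below discounts a small earnings gain over a hypothetical career. Every parameter here (earnings level, discount rate, career timing) is my own assumption, not a figure taken from the Head Start experiment or any benefit-cost study.

```python
# Back-of-the-envelope present value of a small earnings gain.
# All parameters are hypothetical placeholders, not study figures.
effect = 0.012          # earnings gain of "a little more than 1%"
avg_earnings = 35_000   # assumed average annual earnings
discount = 0.03         # assumed real discount rate
career_start = 18       # years from pre-K until earnings begin
career_years = 40       # assumed length of working career

pv = sum(
    effect * avg_earnings / (1 + discount) ** (career_start + t)
    for t in range(career_years)
)
print(f"present value of earnings gain: ${pv:,.0f}")
# Even a ~1% gain, discounted from far in the future, sums to
# several thousand dollars under these assumptions.
```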

Furthermore, there is significant uncertainty in these estimates.  The confidence interval includes zero and negative effects, as well as positive effects two or three times as large. How should policymakers deal with such uncertainty?

One approach is to take a skeptical attitude, and assume effects are zero until proven otherwise. But this skeptical approach would not be a particularly good policy rule to adopt if one were faced with many policy decisions over a long period of time. If a policymaker were simply trying to maximize the expected present value of net benefits over thousands of policy decisions, each with evidence from only one experiment, then the optimal decision rule would be to use each experiment’s point estimate to guide decisions, regardless of the confidence intervals. If we use the point estimates, which represent the mean expected impact of each intervention, then over time we will maximize net social benefits by following this rule.
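
A stylized simulation can illustrate this point. The sketch below invents thousands of policy decisions, each judged by one noisy experiment, and compares a rule that adopts on a positive point estimate against a rule that demands statistical significance. All distributions and thresholds are hypothetical.

```python
import random

random.seed(0)

N = 20_000  # number of hypothetical policy decisions
total_point, total_significant = 0.0, 0.0
for _ in range(N):
    true_benefit = random.gauss(0, 1)             # true net benefit of the policy
    estimate = true_benefit + random.gauss(0, 1)  # one noisy experimental estimate
    # Rule 1: adopt whenever the point estimate is positive.
    if estimate > 0:
        total_point += true_benefit
    # Rule 2: adopt only if the estimate is "significant" (about 2 SEs above zero).
    if estimate > 1.96:
        total_significant += true_benefit

print(f"point-estimate rule:    {total_point:,.0f}")
print(f"significance-test rule: {total_significant:,.0f}")
# The point-estimate rule accumulates larger total net benefits: the
# significance rule passes up many policies that are, on average, beneficial.
```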

In other words, the legal “burden of proof” principle is not a particularly good guide to making policy decisions over time. The legal rule that we should convict someone of a crime only if they are guilty “beyond a reasonable doubt” is ultimately based on the judgment that we find it socially abhorrent to deprive someone of life or liberty on any lesser standard. The huge social cost of convicting an innocent person is not really relevant to deciding whether to spend a little more or less on some social or educational program. The costs of mistakenly expanding a social or educational program are not as great as the cost of locking someone up because the probability is 51% that they are guilty.

Another important point is that the Head Start experiment is NOT the only good evidence on the effects of pre-K. We have good evidence from two randomized experiments, Perry and Abecedarian, that pre-K can have large long-run effects. For example, long-run earnings effects are 19% in Perry. We also have good evidence from some natural experiments of long-run earnings effects, for example 8% in the Chicago Child-Parent Center study and 11% in Deming’s study of Head Start.  Finally, we have some good natural experiments, for example in Tulsa and Boston, that show short-run test score effects of pre-K that are larger than found in the Head Start experiment.

In social science, or for that matter natural science, how we interpret any new experiment is influenced by what we already know. If we have substantial reasons from prior research to believe that variable X affects outcome Y, then in considering new evidence, our prior belief is not that X has no effect on Y. In interpreting the new research, we would ask whether the estimated effects in the new research are consistent not only with a null hypothesis of zero effects, but also with a null hypothesis of the estimated effects implied by prior research. Both of these null hypotheses are interesting to explore.  If the new research shows lower effects of X on Y than implied by prior research, this should influence us going forward towards believing that X has lower effects.

In the case of the Head Start experiment, the modest effects found should influence researchers towards believing that at least some pre-K programs have considerably smaller effects than found by Perry or the Chicago Child-Parent Center study or the Tulsa or Boston studies. It should also influence us towards wondering whether Head Start as of the 2002 experiment might have lower effects than it did in the past. And it might influence us towards desiring to reform Head Start to increase its effectiveness, in part by imitating the practices of pre-K programs that have larger estimated effects.  As Barnett has pointed out, there is some evidence that Head Start has increased its educational effectiveness since the time of the 2002 experiment.

The Head Start experiment by itself is not strong evidence in favor of public pre-K. But it is not the only evidence, and it is not necessarily inconsistent with this other evidence. On the whole, the weight of the evidence, as suggested by a number of reviews of the research, is that high-quality pre-K programs can make a significant difference in improving the opportunities of children.  The estimated benefits in the bulk of the research are sufficient to be significantly greater than program costs.

Posted in Early childhood programs

More on weighing the evidence on pre-K

Andrew Coulson of the Cato Institute has a blog post commenting on the debate between me and Russ Whitehurst over what evidence to believe about the effects of pre-K programs.

Coulson’s argument is that the only reliable evidence for ascertaining the effects of “large-scale” public pre-K programs is the randomized control trials for Head Start and Tennessee pre-K, and that these studies reach a consensus: “program effects fade out by the elementary school years…” In Coulson’s view, this is the “evidence that matters when discussing proposals for expanding government pre-K”.

What evidence doesn’t matter, in Coulson’s view? First, he doesn’t regard the Perry Preschool program and Abecedarian “randomized control trial” studies as mattering, because these were “tiny programs”, and it is “difficult to massively replicate any service without compromising its quality”.  Second, he doesn’t regard any non-experimental study as mattering.

The main thing that Coulson’s blog post overlooks is that not all randomized control trials provide evidence of equal quality, nor are all non-experimental studies of equal quality.  Some non-experimental studies provide evidence on the true causal effects of public policies that is superior to that of some randomized control trials.

As mentioned in a previous blog post, the key problem that all pre-K studies are trying to address is “selection bias”. We are concerned that children and families participating in pre-K may differ from those who do not participate. We can control for observed characteristics of the children and families, but unobserved characteristics of children and families may differ between the treatment and comparison groups. These unobserved differences could be causing different outcomes, which would bias the estimated effects of pre-K by some amount that is potentially large and of unknown sign.

In theory, a perfectly run randomized control trial addresses this selection bias issue. Because participation is determined randomly, the treatment and comparison groups would be expected to on average be similar in unobserved characteristics.

But this advantage holds fully only for the full original sample. If there is “large enough” attrition from the randomly chosen groups, the final sample of treatment and comparison households could easily differ greatly in unobserved characteristics.  The final observed sample in that case would no longer really be randomly chosen.
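
A small simulation, with made-up numbers, can show how this plays out. Here low-ability control families are assumed likelier to leave the study, so the surviving “randomized” sample is no longer comparable:

```python
import random

random.seed(2)

true_effect = 1.0  # hypothetical true program effect
treated, control = [], []
for _ in range(10_000):
    ability = random.gauss(0, 1)        # unobserved family characteristic
    is_treated = random.random() < 0.5  # random assignment
    outcome = ability + (true_effect if is_treated else 0.0)
    if is_treated:
        treated.append(outcome)
    # Differential attrition: low-ability control families tend to drop out.
    elif ability > -0.5 or random.random() < 0.3:
        control.append(outcome)

estimate = sum(treated) / len(treated) - sum(control) / len(control)
print(f"true effect: {true_effect:.2f}, estimated effect: {estimate:.2f}")
# The surviving control group is higher-ability on average, so the
# estimate understates the true effect.
```

Depending on who drops out, the bias could just as easily run the other way; the point is that attrition reintroduces exactly the selection problem that randomization was supposed to solve.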

In the case of pre-K, as I pointed out in a previous blog post, the Tennessee pre-K study has some serious problems with large and differential attrition. Therefore, I don’t think it at all appropriate to cite it as meeting some “gold standard” of providing evidence that is more reliable than any study that makes a serious effort to control for observable characteristics. The evidence from the Tennessee pre-K study is suggestive but not definitive, as is true for all studies that can only control for observable characteristics.

On the other hand, non-experimental studies can persuasively deal with selection bias if they rely on a “natural experiment”, in which access to the pre-K program varies due to geography or age or some other factor that is plausibly unrelated to unobserved child and family characteristics. Not all “non-experimental studies” are created equal. Some can only control for observable characteristics. Others have variations in access that, while not randomly assigned, may be almost as good as random for identifying the causal effects of the pre-K program.  Any one such study may be an outlier, but multiple studies increase confidence that we are identifying true causal effects of pre-K on outcomes.

We have many “natural experiments” that show large effects of pre-K. These include: Ladd et al.’s study of North Carolina pre-K; Ludwig et al., Currie et al., and Deming’s studies of Head Start; the various studies of the Chicago Child-Parent Center program; the various “regression discontinuity” studies of state and local pre-K programs. This evidence matters.

The Head Start and Tennessee evidence also matters. The Tennessee evidence is only suggestive, but it is not particularly supportive of the effectiveness of that state’s program, which may have too little funding per child to be effective.  The Head Start RCT evidence also suggests that Head Start during that period may not have been particularly effective relative to the other pre-K alternatives available to the control group, which included state pre-K programs. Changes over time in Head Start quality and in Head Start alternatives are the most straightforward way to reconcile the fading effects in the Head Start RCT with the earlier natural experiments suggesting that Head Start has long-term effects. Recent Head Start reforms may have improved its quality relative to Head Start’s quality at the time of the RCT. Head Start reforms to improve quality should continue.

Two other points. First, I would challenge the argument that Perry and Abecedarian don’t matter. These are programs with characteristics and delivery models that are well-understood. Today’s Educare program is quite similar to the Abecedarian program. Perry is essentially a smaller class size two-year version of today’s state and local pre-K programs. I think the Perry and Abecedarian evidence, in conjunction with the natural experiments, suggests that high-quality pre-K can make a difference.

Second, fading test-score effects are found in a variety of pre-K programs, including Perry, Abecedarian, and the Chicago CPC program. In those programs, fading test-score effects proved consistent with large long-term effects on adult outcomes.  One theory to account for this fade-out and later re-emergence of effects is the importance of so-called “soft skills”, such as social skills and character skills. Such soft skills are important in determining educational attainment and adult earnings.

A broad view of the research evidence on pre-K suggests that a variety of programs have strong effects, although this should not be interpreted as meaning that every program always works. There is a lot of variation in program quality and program effects over space and time. For other recent research reviews that also take a broader view of the research evidence, and conclude that pre-K can work, see the review by Yoshikawa et al., and the review by Kay and Pennucci (Report 14-01-2201) for the Washington State Institute for Public Policy.

Posted in Early childhood programs

Weighing the preschool research evidence

Professor Bruce Fuller had an op-ed on preschool in the Washington Post on February 9. Professor Fuller’s interpretations of preschool research omit some important research.

Specifically, Professor Fuller argues that “youngsters from middle-class and well-off homes benefit little from preschool”.  He goes on to say that “young children attending quality half-day programs display the same learning gains as those attending full-day programs”.  Therefore, “we must avoid squandering scarce dollars on full-day programs for children who gain little from preschool”.

Professor Fuller cites some studies that support his arguments. But he fails to mention other studies that go against his arguments.

For example, Professor Fuller does not mention the research studies in Tulsa and Boston that find that universal preschool produces benefits for middle-class children that are only slightly less than the benefits for low-income children. Professor Fuller also does not mention a research study from New Jersey that finds significantly greater benefits from full-day preschool compared to half-day preschool.

An obvious and important question is: which studies should you believe? Should we believe the studies that Professor Fuller cites, or the studies that I cite? Or should we just say that the evidence is mixed and uncertain, which can be interpreted as an argument for inaction until more research is done?

The key problem in any preschool research is what social scientists call “selection bias”. The families that choose preschool differ from those who do not choose preschool, due to both family characteristics that we can observe, and family characteristics that we can’t observe. In addition, programs may choose to select preschool participants due to both observed and unobserved family characteristics.

For example, perhaps families that are more ambitious choose preschool. Or perhaps some preschool programs try to choose children who are easier to manage. Either source of selection would tend to mean that preschool participants will tend to do better than non-participants because of pre-existing family and child characteristics, above and beyond the true effect of the preschool program. Selection bias in estimating program effects would be positive.

Alternatively, perhaps families that are having more trouble with their children tend to try to put their children in preschool. Or perhaps preschool programs with a social mission try to choose needier children. These sources of selection will tend to produce a negative selection bias in estimating the true effects of preschool.

How can this selection bias be dealt with? Given unlimited resources and time, the ideal method is a large and perfectly run randomized control trial. Preschool applicants would be randomly divided into a treatment and control group. As a result, we would expect average observed and unobserved characteristics in the treatment and control groups to be similar, and as the sample size gets larger, that expectation is increasingly likely to be realized.

But randomized trials are expensive and difficult to run, particularly on a large scale. Therefore, an alternative is to rely on natural experiments, in which some aspect of the world has resulted in different children having differing access to preschool, for reasons that have nothing to do with unobserved characteristics of the child and his or her family.  The treatment and comparison groups, with different access to preschool, will differ in preschool participation, but not observed and unobserved characteristics, and therefore we can interpret the outcome differences as being due to preschool, not pre-existing differences between the two groups.

A third method of trying to control for selection bias is to control for observed characteristics of the child and family.  Such controls help, but by their very nature cannot control for unobserved pre-existing differences between the treatment and comparison groups. Hence, such estimates may be subject to selection biases of unknown size and sign.

The Tulsa and Boston evidence that I am citing on middle-class benefits is based on natural experiments. Access to preschool and to kindergarten is based on an age cutoff.  The essence of the methodology used in these two studies is to compare the test scores of children who just missed the kindergarten age cut-off and are therefore just entering preschool, with test scores of similar children who just made the kindergarten age cut-off, who are just entering kindergarten, and who participated in preschool the preceding year.  These two groups are arguably similar in unobserved as well as observed characteristics because they were similarly selected into the same preschool program. The timing of their preschool access was based on age, and a few days of age in either direction should not make a big direct difference in test scores. The “jump” in test scores that is observed for the slightly older group in such studies is therefore reasonably attributable to the preschool participation the preceding year.
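
The same age-cutoff logic can be sketched with synthetic data. Every number below (the effect size, score scale, and window width) is invented; the sketch only shows why the “jump” at the cutoff identifies the preschool effect rather than an age effect:

```python
import random

random.seed(1)

window = 30        # days of birthdate on either side of the cutoff
true_effect = 5.0  # hypothetical test-score boost from a year of preschool

older, younger = [], []  # made the kindergarten cutoff vs. just missed it
for _ in range(5_000):
    days = random.uniform(-window, window)         # birthdate relative to cutoff
    score = 50 + 0.02 * days + random.gauss(0, 8)  # age itself matters very little
    if days >= 0:
        score += true_effect  # made the cutoff, so attended preschool last year
        older.append(score)
    else:
        younger.append(score)

jump = sum(older) / len(older) - sum(younger) / len(younger)
print(f"estimated preschool effect at the cutoff: {jump:.1f}")
```

In real studies a regression controls for the smooth age trend rather than simply differencing means in a narrow window, but the identifying idea is the same.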

The New Jersey evidence I am citing on full-day versus half-day preschool is based on a randomized control trial. Excess applicants for a full-day preschool opportunity were randomly assigned to either receive full-day preschool, or only receive half-day preschool. The results showed significantly greater test score effects of full-day preschool. In Bartik (2011), I used these estimates to predict that full-day preschool produces 56% greater earnings benefits than half-day preschool.  Therefore, there are some diminishing returns to preschool time (benefits are not doubled), but there are benefits to full-day preschool over half-day preschool.

Most of the evidence that Professor Fuller cites is from the third category of studies, which only can control for observable child and family characteristics. These studies may be biased upwards or downwards by selection bias. Therefore, I would not weigh these studies as heavily.

In my view, the research studies that should receive the greatest weight use randomized or natural experiments to examine the causal effects of preschool, which avoids problems due to selection bias. The research studies that use such evidence support middle-class benefits of preschool, and support greater benefits for full-day programs.

Posted in Distribution of benefits, Early childhood program design issues, Early childhood programs | 4 Comments

What the available evidence shows about middle-class benefits of early childhood education

At the recent Education Writers Association conference on early childhood education, Russ Whitehurst of the Brookings Institution cited Tulsa and Boston studies as evidence that the benefits of early childhood education are much greater for low-income children than for middle-class children.

This is incorrect. The Tulsa and Boston studies actually provide evidence that the benefits of early childhood education are only modestly less for middle-class children than for lower-income children. In Tulsa, the research, in a paper on which I was a co-author, suggests that the test score boost from full-day pre-K for middle-class children is about 88% of the boost for lower-income children. In Boston, Weiland and Yoshikawa’s research suggests that the test score benefits of Boston’s full-day pre-K program for middle-class children are 71% of the benefits for lower-income children.

These test score benefits for middle-class children are sufficient to predict adult earnings gains that will be many multiples of costs. The Tulsa study calculates that the ratio of the present value of future adult earnings benefits to program costs for full-day pre-K is 2.82 for middle-class children, which is only modestly less than the 3.09 ratio for children eligible for a free lunch.  For Boston, my analysis of Weiland and Yoshikawa’s findings suggests that the ratio of the present value of future adult earnings benefits to costs for Boston’s full-day pre-K program is 2.30 for middle-class children, versus 3.22 for children eligible for a subsidized lunch.

A key point in both findings is that the ratio of predicted future adult earnings benefits for middle class children to program costs is much greater than one. Providing free, high-quality pre-K to middle class children can be rationalized because economic benefits exceed costs. Universal pre-K may also win middle-class votes and support, but universal pre-K can be rationalized on its economic merits rather than just on political expediency.

I know of no other evidence that allows a direct comparison of the relative benefits of pre-K for middle-class and lower-income children. There is one study of pre-K for middle-class children in Utah that shows some benefits.

There might be various reasons why the social benefits of pre-K for lower-income children are much greater than for middle class children, even if the dollar earnings benefits are similar. Lower-income children would be predicted to have baseline adult earnings that are lower, so a similar dollar benefit will be a larger percentage boost to adult earnings.  In Tulsa, our study predicts that the percentage boost to adult earnings for children eligible for a free lunch is over 10%, whereas the percentage boost for middle-class children is between 5 and 6%. We might judge that providing extra dollars to lower income children is more valuable because it has a more dramatic impact on their future well-being.  In addition, it is a plausible hypothesis that pre-K may have greater benefits in reducing crime and welfare usage for children from lower-income families than for middle-class children, although I know of no empirical evidence for or against such greater relative benefits.

For child care programs, Duncan and Sojourner’s study of the Infant Health and Development program suggests that this program only boosts test scores for lower-income children. For parenting programs, studies of the Nurse Family Partnership suggest that NFP only works for lower-income families, not middle-class families.  Pre-K may be different from child care and parenting programs because pre-K may provide social and cognitive learning in a group setting that is hard for many middle class families to duplicate on their own.

The evidence is sparse on the absolute and relative benefits of early childhood education for middle-class children. This evidence is always likely to be sparse because there is not great interest from government or the philanthropic community in sponsoring extensive research on how early childhood education affects the middle class.  But the available evidence provides some economic support for universality in pre-K programs, while the pattern of benefits for children would argue for targeting child care and parenting programs on lower-income families.  Considering how programs benefit parents might alter these calculations for relative benefits and costs for different income groups, and is an important topic for future research.

Posted in Distribution of benefits | 1 Comment