Where is the weight of the evidence, and the burden of proof, for targeted vs. universal pre-K?

The Hamilton Project has released a useful e-book that presents evidence on selected anti-poverty policies. It includes a chapter on pre-K programs by Elizabeth Cascio and Diane Whitmore Schanzenbach.

The Cascio/Schanzenbach chapter argues for expansion of high-quality targeted pre-K.  My own view, as I have stated previously, is in favor of high-quality universal pre-K. What is the evidence for and against each position?

Part of the issue is political. I simply do not think there will ever be sufficient political support for targeted pre-K programs to enable large-scale access of the poor and near-poor to high-quality pre-K. Therefore, from a political perspective, Cascio/Schanzenbach are supporting a policy proposal that will never be fully implemented.

But I also think there are good economic reasons to support universal pre-K over targeted pre-K. Part of the issue is what evidence one finds more convincing.

Cascio/Schanzenbach rely on their own evidence, from a prior paper analyzing test score effects in Georgia and Oklahoma, to argue that universal pre-K mainly benefits the disadvantaged. The problem with this evidence is that it relies on comparisons between what happens in two states and what happens in other states. There are many unobservables that affect test scores in any two states. These unobservables can bias estimates, and they blow up standard errors when one fully allows for them. As they acknowledge in their own prior paper, estimation procedures that fully allow for the many unobserved “shocks” that affect state test scores find that most of their estimated effects have very large standard errors, so it is hard to reach definitive conclusions.

It is a tempting strategy to use states as “laboratories of democracy”. The estimation problem is that if one only has two “test subjects” (states) in the “treatment group”, the resulting confidence intervals tend to be very wide.

In contrast, I would point to what I would argue is stronger evidence, from “regression discontinuity” studies of Boston and Tulsa. (Cascio/Schanzenbach have a footnote referring to the Boston study, but they don’t mention the Tulsa study.) This evidence indicates that these universal pre-K programs have effects on kindergarten entry test scores for middle-class children that are 70% (Boston) to 90% (Tulsa) of effects for lower-income children.

In my opinion, regression discontinuity studies provide “silver standard” evidence of effects of pre-K. These studies are NOT random assignment experiments, which would provide stronger, “gold standard” evidence. However, these regression discontinuity studies seem unlikely to be biased by unobservable differences between the “treatment group” (the children who have completed pre-K, and who are at kindergarten entrance) and the “control group” (the children just entering the pre-K program). Both groups of students either are just entering or have recently finished the same pre-K program, so the two groups do not differ in factors causing families to select or not select a pre-K program, or in factors that might affect how program procedures might lead to the programs selecting certain types of families.

The regression discontinuity procedure is essentially to look at the differences between the two groups of students in scores on the same tests, and then to look at how scores vary with age in order to control for the fact that the treatment group is on average one year older. We expect scores to rise with age and, if pre-K has an effect, to show a “jump” (a discontinuity) at the age cut-off. Children just below the cut-off are a little too young for kindergarten and are entering pre-K; children a few days older are just old enough for kindergarten and completed pre-K the previous year. (See my paper with Gormley and Adelstein for a more extended discussion of the regression discontinuity model applied to pre-K.)
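To illustrate the mechanics, here is a minimal simulated sketch (all numbers are hypothetical, not estimates from any actual pre-K study): scores rise smoothly with age, completing pre-K adds a jump at the cut-off, and regressing scores on age plus a treatment indicator recovers that jump.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Age in years relative to the kindergarten entry cut-off
age = rng.uniform(-1.0, 1.0, n)
completed_prek = (age >= 0).astype(float)  # older children finished pre-K last year

true_jump = 5.0  # hypothetical pre-K effect, in test score points
score = 50 + 10 * age + true_jump * completed_prek + rng.normal(0, 4, n)

# RD estimate: the coefficient on the treatment indicator, controlling
# for the smooth age trend, is the discontinuity at the cut-off.
X = np.column_stack([np.ones(n), age, completed_prek])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
rd_estimate = beta[2]
print(round(rd_estimate, 1))  # close to the true jump of 5.0
```

Real RD studies of pre-K add refinements (local bandwidths, flexible age trends, balance tests), but the identifying idea is this jump at the cut-off.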

In addition to this “silver standard” evidence, we have one random assignment study, from Utah, that provides “gold standard” evidence that pre-K has positive effects for middle-class children.

I would regard the Boston, Tulsa, and Utah studies as providing stronger evidence than comparisons between Georgia/Oklahoma and other states, largely because of the many unobservable variables that can cause test scores in a state to fluctuate. Having said that, is the evidence that pre-K benefits middle-class children as strong as the evidence that pre-K benefits lower-income children? No, it is not, largely because pre-K for lower-income children has been the subject of many high-quality studies, both random assignment studies and others with good comparison groups, which show that pre-K benefits lower-income children. In contrast, there are fewer studies of pre-K services for middle-class children.

Here is where we get into the highly subjective issue of where the burden of proof should be. One position is to say that if there is no strong and overwhelming evidence that pre-K benefits middle-class children, but there is such evidence for lower-income children, we should only favor targeted services. We shouldn’t favor a universal program unless there is very strong evidence for benefits to middle-class children.

The other position is to say, given that there is very strong evidence of pre-K benefits for lower-income children, and some good evidence for its benefits for middle-class children, it is more likely than not that pre-K programs benefit middle-class children. And one could argue that this is more than enough evidence to move ahead.

Let’s think through the benefits and costs of moving ahead with universal programs. There is some evidence that this will benefit middle-class children, although not “proof”. There is some evidence that this will increase political support for the programs. At the very least such programs provide some help to working-class voters facing various economic issues. And pre-K programs including middle-class children may have more favorable peer effects that will increase program quality for lower-income children. In my view, the benefits of moving ahead with universal programs outweigh the risks.

In contrast, if we just move ahead with targeted pre-K for lower-income children, we may be missing an opportunity to have more significant effects on overall labor force quality by also assisting middle-class children. We lose any opportunity for positive peer effects from having income-mixed classrooms. And we are setting up a program that is doomed to difficulties in eliciting political support, particularly at the state and local level. There are risks involved in supporting targeted programs over universal programs.

Targeted vs. universal pre-K programs is a debate in which we must go beyond a narrow economics debate over what effects are statistically significant, to a debate over the preponderance of the evidence, and debate over what policy approach is best suited for making progress in a real-world political environment.

Posted in Distribution of benefits, Early childhood program design issues, Early childhood programs

Achievement gaps at kindergarten entry, income inequality, universal pre-K, and more-intensive early childhood education

Milagros Nores and Steve Barnett have written a recently-released report on how kindergarten readiness and preschool enrollment vary across different groups, including different income groups. What they document is that at kindergarten entrance, children in disadvantaged groups are far behind children from advantaged groups in school readiness, as measured by cognitive test scores. These kindergarten readiness gaps are not offset by pre-K attendance at age 4, as attendance in quality pre-K appears to be higher for children from more advantaged backgrounds.

For example, they find that children from families with incomes below twice the poverty line have kindergarten entry test scores that average between 0.6 and 0.7 standard deviations lower than the test scores of children from families with incomes above twice the poverty line. To put this in terms that might be more intuitive, average kindergarten entry test scores for low-income children are at about the 40th percentile of all children, whereas average kindergarten entry test scores for higher-income children are at about the 65th percentile of all children, a gap of 25 percentiles. On the other hand, they find that while only about 25% of higher-income 4-year-olds are enrolled in pre-K programs that are high-quality, this percentage is still lower for low-income children, at around 18%. Quality pre-K enrollment patterns are reinforcing rather than alleviating kindergarten readiness gaps.
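A quick way to see that the standard-deviation and percentile versions of this gap are consistent: under a normal approximation to the test score distribution, group means at the 40th and 65th percentiles are about 0.64 standard deviations apart, within the 0.6–0.7 range. A sketch using Python's standard library:

```python
from statistics import NormalDist

nd = NormalDist()            # standard normal approximation of the score distribution
z_low = nd.inv_cdf(0.40)     # low-income group mean: ~40th percentile
z_high = nd.inv_cdf(0.65)    # higher-income group mean: ~65th percentile

gap_in_sd = z_high - z_low
print(round(gap_in_sd, 2))   # ~0.64 standard deviations
```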

In this blog post, I want to relate these kindergarten readiness gaps to the adult earnings distribution, and see what various designs of early childhood education interventions can do to address these achievement and earnings gaps.

Eliminating the kindergarten readiness gap would do a great deal to help the future prospects of children from low-income families, but would not come close to eliminating all earnings gaps. Based on the estimated effects of kindergarten test scores on adult earnings, if we eliminated these kindergarten readiness gaps by increasing kindergarten entry test scores of lower-income children by 25 percentiles, we would predict that this would increase the adult earnings of these children by about 17%. This is certainly a significant lifetime earnings boost. But it falls far short of the amount by which their expected earnings trail those of their more affluent peers. Children from low-income families would be expected to have future adult earnings of 55% of those of children from higher-income families. A 17% earnings boost would increase this ratio from 55% to 64% (64% = 55% × 1.17). Thus, eliminating the kindergarten readiness gap in cognitive test scores would only close about one-fifth of the adult earnings gap (a 9 percentage point improvement in the initial 45 percentage point gap).
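The arithmetic in this paragraph can be checked in a few lines; the 55% base ratio and 17% boost are the post's own projections:

```python
base_ratio = 0.55  # expected adult earnings of low-income children / higher-income children
boost = 1.17       # projected earnings gain from closing the kindergarten readiness gap

new_ratio = base_ratio * boost
share_of_gap_closed = (new_ratio - base_ratio) / (1.0 - base_ratio)

print(round(new_ratio, 2))            # 0.64
print(round(share_of_gap_closed, 2))  # ~0.21, roughly one-fifth of the gap
```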

(Notes on calculations: these calculations of earnings ratios rely on the relative earnings of the parents of these children and on estimated intergenerational correlations of earnings. In addition, the effects of raising kindergarten test scores are estimated based on research by Chetty et al. Bartik, Gormley, and Adelstein use similar methods, as does my forthcoming book, From Preschool to Prosperity.)

One can view the glass as either surprisingly half-full or surprisingly half-empty. On the one hand, it is amazing that just increasing kindergarten entry test scores by 25 percentiles can have such profound effects on adult earnings, increasing earnings by over one-sixth. This is probably attributable to “skills begetting skills” (Heckman). Students who enter kindergarten with stronger test scores will tend to learn more in kindergarten, and so on.

On the other hand, why doesn’t eliminating the starting cognitive test score gap fully equalize earnings? For several reasons. First, eliminating the kindergarten entry gap does not eliminate subsequent gaps in the quality of K-12 education, or in access to quality post-secondary education, which will affect adult cognitive skills. Second, there is more to skills affecting earnings than cognitive skills, so eliminating cognitive skill gaps does not necessarily eliminate gaps in “soft skills”. Third, adult earnings are affected not just by skills, but by access to networks and wealth that can help a person get a better job.

What can early childhood education do to reduce these achievement gaps, and to reduce income inequality? Quite a bit, but early childhood education cannot fully solve either the achievement gap problem or the earnings gap problem. Based on studies in Tulsa and Boston, universal pre-K would help both lower-income children and middle-class children to improve their kindergarten entry test scores, and by similar percentiles. Realistic projections suggest that kindergarten entrance scores might increase due to universal pre-K by 15 percentiles for lower-income children, and by 12 percentiles for higher-income children. This would only slightly lower the test score gap, from 25 percentiles to 22 percentiles. These test score increases would be predicted to significantly boost earnings of both groups. The expected future earnings of children from lower-income families would increase by about 10 percent, and those of children from upper-income families by about 5 percent. Although these earnings increases are significant for both groups, they would only be predicted to slightly reduce the earnings gap – the predicted earnings of children from lower-income families would increase from 55% to 58% of the predicted future earnings of children from higher-income families.

However, these projections understate the potential income redistribution from early childhood education for at least two reasons. First, universal pre-K by design is helping all children to improve their future prosperity. It could do more to improve income distribution by not helping middle-class children, but this hardly makes sense if the program is benefitting these children. However, as discussed further below, other early childhood education programs are better designed as targeted programs.

Second, these test score projections may understate the future earnings impact of universal pre-K and other early childhood education. There is considerable evidence that early test score impacts of pre-K programs may tend to understate long-term earnings impacts. For example, this is true for the Perry Preschool Program. (See my forthcoming book for more discussion of this point. In that book, I project that Perry’s early test score impacts would predict an adult earnings impact of 12%, which is over one-third below the estimated impact based on actual adult outcomes of 19%.) This understatement is probably because cognitive test score impacts do not capture the benefits of pre-K programs for improving non-cognitive skills.

Some early childhood education programs, such as the Abecedarian/Educare model of full-time child care from birth to age 5, seem to work far better for lower-income children than for other groups. These programs should therefore be designed as targeted programs for lower-income families.

Estimates suggest that an Abecedarian/Educare program can increase future earnings of former child participants from lower-income families by 26%. (Again, my forthcoming book has more supporting information behind this calculation.) This is a huge increase in future living standards. However, again, we should remember that it only closes a portion of the earnings gap. If Abecedarian/Educare were implemented for all children from lower-income backgrounds, it would increase their expected future earnings from 55% to 69% of the expected future earnings of children from upper-income families (69% = 1.26 × 55%). This cuts the future earnings gap (from 45 to 31 percentage points) by about three-tenths.
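The ratio arithmetic for both program designs follows the same pattern, so it can be sketched with one small function; the 55% base ratio and the 5–26% earnings boosts are the projections from the text, not new estimates:

```python
def earnings_ratio(base, low_boost, high_boost=1.0):
    """Low-income children's expected adult earnings relative to higher-income
    children's, after each group's projected percentage earnings boost."""
    return base * low_boost / high_boost

base = 0.55  # projected baseline earnings ratio from the text

# Universal pre-K: +10% for lower-income children, +5% for higher-income children
universal = earnings_ratio(base, 1.10, 1.05)
# Targeted Abecedarian/Educare: +26% for lower-income children only
abecedarian = earnings_ratio(base, 1.26)

print(round(universal, 2))    # ~0.58
print(round(abecedarian, 2))  # ~0.69
```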

We could again debate whether the glass is half-empty or half-full. On the one hand, early childhood education programs do not “solve” the income distribution problem in that they do not make the income distribution match some utopian scheme for perfect earnings equality. On the other hand, perfect earnings equality is probably unachievable. It is amazing what can be done to reduce earnings inequality with just a few years of high-quality targeted intervention in early childhood.

Early childhood education programs have the great advantage of being an economic intervention that we know how to do, and one that will promote greater economic growth and greater economic equity at the same time. Early childhood education does not achieve utopian economic justice, but what single program can, in the real world?

Posted in Distribution of benefits

What do we know about pre-K peer effects?

A recent opinion piece by David Kirp in the New York Times argued that it makes no sense to put low-income children in income-segregated pre-K programs, as we do in the Head Start program, because of the importance of classroom peer effects. If low-income children learn more if more of their peers are from a variety of backgrounds, then pre-K programs will be more effective in closing achievement gaps if they are income-integrated programs. This does not necessarily require universal free pre-K (one could imagine some sort of voucher program for low-income children, or some sort of sliding scale fees for a universal pre-K program), but it is one argument in favor of real-world universal programs such as the program in Oklahoma, and against common targeted programs in which low-income children are restricted to their own pre-K classrooms.

(David Kirp is a former newspaper editor and current public policy professor at Berkeley who has written two very good books that directly address early childhood education, The Sandbox Investment, and Kids First. His most recent book, Improbable Scholars, looks at an inner-city school district that he argues has achieved success through a variety of policies and practices that include early childhood education. )

Kirp cites one study of Connecticut preschools to support his opinion piece, by Schechter and Bye (2007). To my knowledge, there are four other preschool studies that look at peer effects, by Henry and Rickman (2007), Mashburn et al. (2009),  Justice et al. (2011), and Reid and Ready (2013). What do we find from these studies?

(1)    Only Schechter/Bye and Reid/Ready directly look at peer income effects. The other studies look at effects of peer pre-existing skills on student learning during pre-K. All of these studies show effects of peers in raising learning during pre-K.

(2)    Henry and Rickman’s results suggest that the magnitude of this effect averages about a 20% spillover: if my peers at the beginning of pre-K have 50% higher skills, we would expect my test scores to be 10% higher at the end of pre-K (10% = 20% of 50%), holding all other pre-K characteristics constant.

(3)    The studies have mixed evidence on whether greater integration of pre-K of children with different characteristics will increase overall pre-K effects for the entire population.
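The spillover arithmetic in point (2) is just a multiplication, but spelling it out makes the interpretation clear (the figures are this post's reading of Henry and Rickman, used purely for illustration):

```python
spillover_rate = 0.20        # ~20% of peers' skill advantage spills over to me
peer_skill_advantage = 0.50  # my peers enter pre-K with 50% higher skills

my_expected_gain = spillover_rate * peer_skill_advantage
print(f"{my_expected_gain:.0%}")  # 10% higher scores at the end of pre-K
```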


Let me elaborate on that last point. Peer effects at least potentially go in both directions: students with stronger initial skills may influence their initially less-skilled peers, and students with weaker initial skills may influence their initially more-skilled peers. If these peer effects are completely the same for all types of children, in all types of classrooms, then greater income or skills integration would not increase overall pre-K performance.

For example, suppose we consider two alternatives: completely segregated classrooms by income and skill level, and completely mixed classrooms by income and skill level. If we go from the segregated to the mixed classroom situation, the lower income or lower-skill students benefit from the influence of their higher income or higher-skill peers. But the upper income or higher-skill students might lose the same amount from the influence of their lower income or lower-skill peers.

The case for income and skills integration is that these peer effects are ASYMMETRICAL; that is, they differ either across different types of students or at different levels of integration. For example, suppose that lower-income or lower-skill students on average are very influenced by their peers, but upper-income or higher-skill students on average are not so influenced by their peers. This might well be plausible. One could imagine that learning depends upon the richness of language one hears at school, at home, and at play. If middle-class or middle-skill children already have a higher likelihood of having been exposed to such rich language outside of school, perhaps they are less dependent on hearing such language at school. But lower-skill or low-income children might be more dependent, on average, on hearing such language at school.

(A word might be appropriate about the dangers of generalizing about a group. I fully recognize that there is great diversity of children within any group we might define based on income or some test. The peer effect patterns I am referring to might be tendencies for the average child in the group, and may not be at all true of any individual child.)

As another example, peer effects might differ across classrooms.  For example, perhaps an income or skills-integrated classroom is far better in its “peer effects” than a classroom with 100% low-income or initially lower-skilled children, but perhaps the well-integrated classroom is similar in its peer effects to a classroom with 100% higher-income or initially higher-skilled children. This also seems plausible as a hypothesis. In other words, there might be some threshold or tipping effects of different levels of income and skills integration.

When I say the evidence from the current research is mixed, what I mean is that only two studies specifically look at this asymmetry, and they find different results. Mashburn et al. (2009) find some evidence that peer effects are larger for children with higher initial skills. On the other hand, Justice et al. (2011) find some evidence that peer effects are larger in going from 100% low-skill classrooms to integrated classrooms than they are in going from integrated classrooms to 100% high-skill classrooms.

However, there is other evidence that also bears on this issue. Both Tulsa and Boston run pre-K programs that include some middle-class children as well as low-income children. For both these cities’ pre-K programs, although there are plenty of middle-class children, they are in a minority: 25% of the Tulsa pre-K children are ineligible for a free or reduced price lunch, and 31% of the Boston pre-K children are ineligible for a free or reduced price lunch.

The evidence from both Tulsa and Boston suggests that whatever negative peer effects MIGHT occur for middle-class children from being in a pre-K program that includes a substantial majority of children eligible for a free or reduced price lunch, these peer effects do not prevent middle-class children from gaining substantially from income-integrated pre-K programs. The gains from pre-K for middle-class children in test score percentiles in Tulsa are about 90% of the test score gains for lower-income children; the gains for middle-class children in Boston are about 70% of the gains for lower-income children. In both cases, the predicted dollar effects of these pre-K programs on future adult earnings for middle-class children are at least double program costs.

In other words, whatever academic uncertainty there is about the nature of peer effects, universal income-integrated pre-K programs, if run in a high-quality fashion, appear to be able to achieve substantial benefits for both the poor and the middle class.

Posted in Distribution of benefits, Early childhood program design issues

Whitehurst’s latest comments on pre-K

Russ Whitehurst has some more recent comments on pre-K, this time arguing against a recent study of Georgia pre-K. This study found pre-K effects on cognitive skills which, averaged across all tests used, had an average “effect size” of 0.69. This is quite high.

(“Effect size” is education research jargon for scaling the effects of some policy on test scores by dividing the effect by the “standard deviation” of the test score across students. This is an attempt to control for the arbitrariness of test score metrics by measuring the test score effect relative to how much this particular test score seems to vary in the sample.)
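To make the definition concrete, here is a toy calculation with made-up scores for eight treated and eight comparison children (the numbers are purely illustrative; real studies typically scale by the control group's standard deviation or a pooled within-group measure):

```python
import statistics

# Hypothetical end-of-year test scores (illustrative only)
treated = [85, 90, 78, 88, 83, 92, 80, 86]
control = [82, 87, 76, 85, 81, 89, 78, 84]

raw_effect = statistics.mean(treated) - statistics.mean(control)  # 2.5 points
sd = statistics.stdev(control)  # spread of scores across comparison students

effect_size = raw_effect / sd
print(round(effect_size, 2))    # ~0.57
```

Because the raw points on any given test are arbitrary, dividing by the spread across students puts effects from different tests on a roughly comparable scale.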

Whitehurst mainly argues against this study’s validity for two reasons, one of which is a weak argument, and the other of which is a stronger argument. First, he argues that there’s a problem in all regression discontinuity studies because some pre-K graduates inevitably disappear from the sample when they’re followed up on at the beginning of kindergarten. Although this sample attrition could cause bias in program estimates, a bias which could go in either direction, in practice careful studies find that this bias is small. For example, the Boston regression discontinuity study did numerous tests for possible biases and found no sign of them. The Kalamazoo study did some estimates that restricted the sample to only the same children observed prior to pre-K and after pre-K, and found no significant difference in the estimates.

A second and more valid concern is that the Georgia study has much larger sample attrition due to problems in obtaining consent from the families and schools of pre-K graduates entering kindergarten. Furthermore, there are some signs that this differential sample attrition led to the entering kindergarten sample being somewhat more advantaged.  This differential in family consent rates could have led to more advantaged children being over-represented in program graduates, which might bias the study towards over-estimating program effects. I’m sure these issues will be discussed as this report is submitted to academic journals, and is evaluated and re-estimated during the academic refereeing process.

Whitehurst also expresses some doubt about the large size of the estimated effects. The effects are large, although Whitehurst exaggerates the differentials from other research. The average effect size from previous studies in a meta-analysis by Duncan and Magnuson is 0.35, and in a meta-analysis by the Washington State Institute for Public Policy is 0.31.  These average effect sizes tend to be lower for more recent studies, and for Head Start than for state and local pre-K programs.

The regression discontinuity studies tend to find somewhat higher effect sizes. For example, the average effect size for the regression discontinuity study of Boston pre-K was 0.54.

But, as I have discussed previously, and as Whitehurst has alluded to previously, regression discontinuity studies of pre-K are estimating something a little bit different than other pre-K impact studies. Regression discontinuity studies are studying effects of pre-K for program graduates relative to what would have occurred if they had just missed the age cut-off for pre-K entrance and had not attended this subsidized pre-K until a year later. This means that regression discontinuity pre-K studies are in many cases comparing pre-K with no pre-K, as parents are less likely to enroll children in pre-K if they will not be attending kindergarten the next year. In contrast, other pre-K impact studies are measuring the effects of some public pre-K program relative to a comparison group which will be attending kindergarten the next year, and therefore the comparison group is more likely to attend pre-K. The fact that the comparison group is more likely to attend pre-K probably reduces the net impact estimates for these other pre-K studies.

Which type of estimate is more useful? I think they’re both useful. The regression discontinuity results tell us something about the effects of pre-K versus no pre-K. This is useful for comparison with the gross costs of pre-K. The RD estimates are closer to what a labor economist would call “structural estimates” of the effects of pre-K, which can be useful for modeling the effects of other pre-K programs.

On the other hand, other pre-K estimates tell you the effects of this particular pre-K program versus whatever other pre-K programs are currently available in that particular pre-K marketplace. This is useful if the only policy we are considering is whether or not to adopt this particular pre-K program in this particular pre-K market.  In that case, a benefit cost analysis would have to compare the net benefits of this program versus the extra net social costs of substituting this new program for existing programs. In other words, the new program’s costs may be reduced considerably because it may save in costs on existing pre-K programs, which means it doesn’t take as big an effect size for the program to pass a benefit-cost test.

For both of these types of estimates, extrapolating the estimates to some other pre-K program in some other state or local area requires some assumptions. In general, introducing a new high-quality free pre-K program in any particular local area will result in some increases in pre-K enrollment in this program, and some reductions in enrollment in other programs, with the exact pattern depending on the program being introduced and what is currently available in that market.  Neither the RD estimates, nor the estimated effects of some other pre-K program in some other market, will tell you the net benefits of a new pre-K program in a new market without further assumptions about program take-up of the new program versus the old programs, and without some assumptions about the relative quality of the new program versus the old programs.

In sum, I think the Georgia estimates are only suggestive, because of the problem of differential attrition in the treatment and control groups due to survey non-consent. The estimates may be correct, but this would require further analyses to demonstrate that the survey non-consent problem does not significantly bias the estimates.  Because of this problem with survey non-consent, I would currently give this study a grade of “internal validity” (or “research reliability”) of C, although this grade might be moved up by further estimates by the authors to examine this issue.

However, the Georgia estimates are not representative of most of the regression discontinuity studies, which have done further analyses which suggest that the estimates are not biased by problems with attrition.

Whitehurst also updates his analysis of research to slightly downgrade his grade of “internal validity” (intuitively, research reliability) for the recent Tennessee study, which found quick fade-out of pre-K test score effects in Tennessee, from A to A-. But he does not note the factors that lead me to give the Tennessee study an “internal validity” grade of C: specifically, there was differential attrition due to problems of family consent in the control group in this study, and the few estimates that did not suffer from this attrition bias suggest that the Tennessee program may have had greater effects than are found in the main estimates.

In other words, the Tennessee study actually has stronger evidence of biased estimates than is true of this recent Georgia study. However, for the Tennessee study, the bias appears to be leading the pre-K effects to be under-estimated. There certainly is no good reason to give the Tennessee study a higher grade for research reliability than the Georgia study.

Posted in Early childhood programs

The importance of education, and a pre-K experiment to watch

Two articles recently came to my attention that are of considerable relevance to early childhood education.

First, New York Times reporter Eduardo Porter has an article and interview with economist Thomas Piketty on growing economic inequality. Piketty is the author of a new book on inequality that is getting a lot of attention.

One quotation from Piketty in the interview struck me as particularly relevant to early childhood education, and indeed education in general:

“Historically, the main equalizing force — both between and within countries — has been the diffusion of knowledge and skills.”

I think this summarizes what many economists believe about the role of education. But it is important that one of our leading scholars on economic inequality across the world over the last century agrees with that conclusion.

The policy implication is that if one thinks that inequality is one of the leading social issues of our time, it is imperative to go to great lengths to broaden educational opportunities. Early childhood education is one of the most cost-effective ways to do so, although it should be accompanied by other policies as well.

Second, New York Times reporter Kate Taylor had an article reporting on an experiment testing the “Building Blocks” math curriculum in pre-K. (I thank a tweet from the Human Capital Research Collaborative for drawing this article to my attention.)

One point of note in this article is that this particular curriculum is used in Boston’s pre-K program. As noted in a previous blog post, an article by Weiland and Yoshikawa found extremely high test score effects of Boston’s program. I estimate that this program would increase kindergarten readiness among low-income students sufficiently to increase adult earnings by 15%, which is a huge effect for a one-year program.

An important issue is why Boston’s program is so effective. Perhaps this experiment will tell us whether the math curriculum is key. Time will tell.

Posted in Distribution of benefits, Early childhood program design issues

Reducing inequality may sometimes increase economic growth – and a specific example is early childhood education

Nobel prize-winning economist Paul Krugman devoted his column this morning to recent empirical evidence, from the International Monetary Fund, which indicates that reducing income inequality need not reduce economic growth. This goes against a tradition among economists of seeing an inherent tradeoff, in which reduced income inequality can only be pursued at a cost in reduced economic output or growth.

A prime example of a public policy that both reduces inequality and promotes economic growth is increasing access to high-quality early childhood education, such as pre-K programs and high-quality child care.

As I’ve mentioned in previous posts, high-quality pre-K can increase the adult earnings of children from low-income families by 10% or more.  High-quality child care and pre-K from birth to age 5 can increase the adult earnings of children from low-income families by over 25%.

Yet these policies will also increase economic growth. The evidence suggests that these extra skills and earnings for children from low-income families will provide spillover economic benefits for the rest of society.

These spillover benefits occur because my earnings depend in part on the skills of my fellow workers, in my firm, and elsewhere in my local economy. Firms are better able to introduce new technologies when a higher percentage of all workers are skilled, so my firm may be more competitive when my fellow workers get more skills. Firms’ competitiveness also depends on the skills of local suppliers, so my wages may depend on the skills of those suppliers’ workers. Firms may also be more innovative if they are able to get ideas and skilled workers from other local firms.

How do these spillover benefits occur? They occur by firms investing more and creating more local jobs when a local economy increases its overall skills. Expanded pre-K and other early childhood education programs can expand the local skills base. A worker can benefit from such expansion of early childhood education even if his or her skills would have been fine even without the expansion – the increased skills of other workers will boost job creation and boost worker productivity for all local workers.

Early childhood education is a prime example of a case where all workers share in the economic fortunes of an economy, which depend in part on everyone’s skills. Investing in “other people’s children” not only is a moral issue, but also an issue of enlightened self-interest.

Posted in Distribution of benefits

Grading the Pre-K Evidence

Russ Whitehurst of Brookings has a new blog post that outlines his views on pre-K research in more detail.  The title is “Does Pre-K Work? It Depends How Picky You Are”.

Whitehurst reaches the following conclusion:

“I conclude that the best available evidence raises serious doubts that a large public investment in the expansion of pre-k for four-year-olds will have the long-term effects that advocates tout. 

This doesn’t mean that we ought not to spend public money to help families with limited financial resources access good childcare for their young children.  After all, we spend tax dollars on national parks, symphony orchestras, and Amtrak because they make the lives of those who use them better today.  Why not childcare? 

It does mean that we need public debate that recognizes the mixed nature of the research findings rather than a rush to judgment based on one-sided and misleading appeals to the preponderance of the evidence.”

Therefore, it is fair to say that Whitehurst is marketing doubt. Maybe pre-K doesn’t work. Maybe we shouldn’t move forward with large-scale programs, and instead should undertake more limited measures or do more research.

He admits that opponents of his position, who believe that pre-K does work, are also basing their position on scientific research, and wonders: “how is it that different individuals could look at the same research and come to such different conclusions?”

His framing of the issue is that he is just more “picky” about what research he believes. In his view, his opponents, when claiming that the “preponderance” of evidence supports pre-K, are relying on weak research, whereas he is relying on the strongest research in saying that pre-K does not work.

In his view, the strongest research, to which he gives straight “As” for quality, is the recent Head Start randomized control trial (RCT) and the recent Tennessee RCT.  All the other evidence for the effectiveness of pre-K, in his view, is inferior in research rigor (“internal validity”) and/or less policy relevant to today’s policy choices (“external validity”).

Let me make some summary comments upfront before getting into the details of Whitehurst’s research review.

First, I think all researchers seek to be “picky” in reviewing research, in trying to assess the rigor of the research, and its relevance to the policy question at hand. However, even researchers who are equally “picky” can disagree about what the strengths and weaknesses are of various studies.

Second, in my view, Whitehurst significantly overstates the quality and relevance of the Tennessee RCT, and the relevance of the Head Start RCT.  He’s not “picky” enough!

Third, Whitehurst underplays the findings and understates the research strengths and relevance of many other research studies.  He also omits recent relevant research.

Fourth, Whitehurst never grapples with a fundamental issue in pre-K research: it does not take much of a pre-K impact on test scores for pre-K’s predicted earnings benefits over an entire career to justify considerable costs. Effects he characterizes as “small” are in many cases more than sufficient for programs to pass a benefit-cost test.

Fifth, Whitehurst never discusses another fundamental issue in pre-K research: test score effects often fade as children go through the K-12 system, but effects on adult outcomes such as educational attainment or earnings then re-emerge despite the fading. The faded test score effects are often poorer predictors of adult outcomes than the initial post-program test score effects. This means that studies with good evidence on adult outcomes, and studies with good evidence on immediate post-program outcomes, both gain importance relative to studies that only go through elementary school.  The elementary school test data adds some evidence, but not as much as might at first appear.

Sixth, if Whitehurst believes in the usefulness of child care services, the most logically consistent position is that he should back expanding programs such as Educare (full-time child care and pre-K from birth to age 5) to all low-income children. In my book Investing in Kids, I argued that the research evidence on child care and on the Abecedarian program, which was very similar to today’s Educare program,  suggested that a program such as Educare would have earnings benefits for parents that significantly exceeded program costs.

So why not expand Educare, which would help low-income parents increase their work and their educational attainment, leading to significant boosts to parents’ short-run and long-run earnings? If Educare also helps improve the children’s long-run prospects, so much the better. (And in fact Whitehurst seems to like the Infant Health and Development Program research, which suggests that there would be such benefits for low-income children from an Educare-style program.)

I estimate that an Educare program for all families below the poverty line would cost around $70 billion per year, but would have parental earnings benefits significantly greater than that. This proposal would be consistent with a previous proposal made by colleagues of Whitehurst at Brookings.  Such a proposal goes far beyond the cost of any preschool proposal made by the Obama Administration. But I think it would be a logically consistent proposal for Whitehurst to make. Whitehurst should be arguing that the Obama Administration preschool proposal is underfunded, not sufficiently comprehensive in its birth-to-five services, and insufficiently targeted on low-income families. (Note: this is not my position; for example, I’m in favor of universal pre-K. What I am describing is the position that is most consistent with Whitehurst’s own review of the research evidence.)  

Before I get into the details, one more important headline issue: why should policymakers or journalists or other policy “influencers” believe my position, that the best evidence supports pre-K’s effectiveness, rather than Whitehurst’s position, that the research is more uncertain? The best way is to simply look at the research studies on your own, and make up your own mind, but how is one supposed to do this without an extensive background in statistics and research methodology?

Whitehurst’s position of doubt has a structural advantage in the debate in the public square.  Some researchers argue that pre-K works, others say it may not: the headline news to an outside observer is that doubt wins the debate as long as the side that is promoting doubt has a consistent position that cites evidence.  It’s easier to spread doubt than to assuage doubt.

Therefore, I would also make the following argument: many other researchers familiar with the pre-K research evidence disagree with Whitehurst, and agree that pre-K can work.  Among pre-K researchers, Whitehurst’s weighting of the evidence is a distinct minority position.

Consider a recent research summary, “Investing in Our Future: The Evidence Base on Preschool Education”, which was authored by 10 prominent researchers on pre-K from a variety of disciplines and universities.  This study concluded the following:

“Recent meta-analyses drawing together the evidence across decades of evaluation research now permit us to say with confidence that preschool programs can have a substantial impact on early learning and development….

While there is clear evidence that preschool education boosts early learning for children from a range of backgrounds, we also see a convergence of test scores during the elementary school grades so that there are diminishing differences over time on tests of academic achievement between children who did and did not attend preschool. Yet the most recent research is showing an accumulation of evidence that even when the difference in test scores declines to zero, children who have attended preschool go on to show positive effects on important adolescent and young adult outcomes, such as high school graduation, reduced teen pregnancy, years of education completed, earnings, and reduced crime…

“Although random assignment of children or parents to program and comparison groups is the “gold standard” for program evaluation, sometimes this is not possible. One of the most frequently used alternative methods…is called a Regression-Discontinuity Design. …Comparing kindergarten entry achievement scores for children who have completed a year in Pre-K with the scores measured at the same time for children who just missed the birthday cutoff and are about to enter Pre-K can be a strong indicator of program impacts…Other methods used in recent nonexperimental preschool studies include propensity score weighting, individual, sibling or state fixed-effects, and instrumental variables analysis…Evaluations that select comparison groups in other ways should be approached with healthy skepticism.”

Therefore, it is clear that other researchers weight the evidence quite differently from Whitehurst. This is in part because other researchers, while noting that RCTs are the “gold standard”, view other studies as having sufficiently good comparison groups that they provide good “silver standard” evidence.  Other researchers are also aware that few RCTs are so perfect that they are pure “gold standard”; in practice, we find that the gold is almost always alloyed with some less precious metal.

Now, onto the details. To do this, I’ll regrade the various studies that Whitehurst examines, along with adding one recent study that he omits. I’ll use his criteria of looking at “internal validity” (intuitively, the reliability of the research in identifying some causal effect of some program), and “external validity” (intuitively, the study’s relevance to current policy debates). I’ll also add my own take on what the reported impact shows.

Programs from the 1960s and 1970s

Program/Research | Reported impact (after initial year): Whitehurst / Bartik | Internal validity: Whitehurst / Bartik | External validity: Whitehurst / Bartik
Perry | + / + | A- / A- | C / B
Abecedarian | + / + | B+ / B+ | C / B
Chicago CPC | + / + | C / B- | B / B+
Head Start in the 1960s | + (for mortality) / + (for mortality and educational attainment) | B / B | C / C

For Perry and Abecedarian, I would upgrade their “external validity/policy relevance” from C to B because I think that these program designs and these studies’ results are quite relevant to what we are doing today. First, the design of these two programs is similar to what we are doing today. Abecedarian is quite similar to today’s Educare program. Perry is similar in many respects to today’s pre-K programs. Class sizes in Perry were smaller than in most of today’s programs, and the program ran for two years, versus one year for most programs today, which would tend to reduce current programs’ impact below Perry’s estimated adult earnings impact of 19%. However, most studies suggest modest impacts of class size on pre-K outcomes, and that two years of pre-K does not double benefits, so we would not expect Perry to have effects far beyond current pre-K programs. And many of today’s pre-K programs are full-day, which has been shown to have larger impacts than half-day.

Another aspect of Perry and Abecedarian that increases their relevance is that they contain direct evidence on effects on adult outcomes. Because effects on test scores often fade, and these faded test score effects may not reflect adult outcomes, this makes these studies more important.

In addition, it is true that Perry and Abecedarian are experiments that mostly compare pre-K with no pre-K, whereas today any new pre-K program is comparing the new pre-K program with both no pre-K and with some control group members going to some other subsidized pre-K program. But this merely complicates the analysis of the impact of a new pre-K program and puts a premium on the program being as high-quality as existing programs. A real world benefit-cost analysis of a new pre-K program will adjust the benefits and costs downwards for substitution of the new program for existing programs.  For example, the Rand Corporation did this in its analysis of the effects of a universal pre-K program. Because the options in the pre-K market are always changing, even today, impact analyses of a new pre-K program will have to make adjustments for changes in the options in the pre-K market.  There are some scientific advantages to having “clean” estimates of the impact of pre-K versus no pre-K, which Perry and Abecedarian provide.

For Chicago CPC, Whitehurst fails to note that much of the variation in pre-K use was due to neighborhood of residence, and hence can be viewed in part as a “natural experiment”.  Furthermore, the CPC researchers have gone to great lengths to try to correct for any remaining selection bias in the estimates, and have found that a variety of methods for doing so yield similar results. Therefore, I think the internal validity of CPC is higher than Whitehurst’s grade of C. CPC is in-between a “B grade” study based on a natural experiment and a “C grade” study that simply controls for observable characteristics.

In addition, the external validity and policy relevance of CPC is quite high, as the program was run by Chicago Public Schools and is quite similar to pre-K programs run in many state and local areas.  So Whitehurst’s grade there also seems too low.  The study also includes direct evidence on adult earnings effects and adult educational attainment effects.

As for Head Start, Whitehurst’s table says that the Ludwig-Miller study he cites only finds long-term impacts on mortality. But the study also finds some long-term impacts on educational attainment, which he does not note.

Programs from the 1980s


Program/Research | Reported impact (after initial year): Whitehurst / Bartik | Internal validity: Whitehurst / Bartik | External validity: Whitehurst / Bartik
Head Start in the 1980s | + / + | C / B- | A / C
Infant Health and Development | + / + (both: impacts only for disadvantaged children with close to normal birth weights) | A / A | B / B-

For the Head Start sibling studies from the 1980s, Whitehurst argues that the sibling comparison will be biased towards finding effects of Head Start. However, as discussed in the research, there are reasons to think that the bias in which sibling gets into Head Start could go in either direction. Also, research such as Deming’s tries to look very closely at pre-existing characteristics to see whether there is a significant bias, and does not find strong signs of bias sufficient to overturn the main results.

As for policy relevance/external validity, I regard many of the pre-K programs that we are pursuing today at the state level, and considering trying to encourage via federal policy, as much more educationally focused than was traditionally the case for Head Start, although this may be changing for Head Start in recent years. Therefore, it is unclear to me whether the effects of Head Start are as relevant as Whitehurst thinks to current state and local pre-K programs.

On IHDP, it is true that the results are only significant for disadvantaged children of close to normal birth weight. However, the near-normal birth weight group is the group most relevant to current debates about early childhood education. As for policy relevance/external validity, IHDP is really a test of an early child care program, at ages 1 and 2.  This is relevant to evaluating a program such as Educare, but not to evaluating most current proposals for pre-K at age 4.

Recent Programs


Program/Research | Reported impact (after initial year): Whitehurst / Bartik | Internal validity: Whitehurst / Bartik | External validity: Whitehurst / Bartik
Head Start RCT | None / None statistically significant, but point estimates consistent with important effects | A / A | A / C
District programs, e.g., Tulsa | Unknown (research design doesn’t allow follow-up after pre-K) / Unknown, but would predict sizable adult earnings effects based on test scores | B / B | B / B+
Georgia & Oklahoma Universal | + (very small at best) / + (large enough to have a high benefit-cost ratio) | B / B- | A / B+
Tennessee Pre-K | - / - | A / C | A / B+
North Carolina More at Four | Not included / + | NA / B | NA / B+

The recent Head Start RCT does not show much in the way of statistically significant effects from kindergarten on, but some of the point estimates are consistent with effects that might be important. For example, the point estimates on the cognitive tests that are consistently given over time in the experiment show effects at third grade that would predict about a 1% increase in adult earnings, which adds up to a lot of money over an entire career. Given the uncertainty, the true effect could be 2 or 3%, or could be zero or negative – we just can’t tell.
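To see why even a 1% earnings effect can matter, here is a back-of-envelope sketch. All of the parameter values – average annual earnings, career length, starting age, and the discount rate – are hypothetical illustration values of my own, not figures from the Head Start study:

```python
# Back-of-envelope present value of a small permanent earnings gain.
# All parameter values below (average earnings, career length, ages,
# discount rate) are hypothetical illustration values, not study figures.

def pv_earnings_gain(pct_gain, annual_earnings=40_000, career_years=45,
                     start_age=20, pre_k_age=4, discount_rate=0.03):
    """Present value, discounted back to the pre-K year, of a permanent
    pct_gain boost to annual earnings over a full career."""
    pv = 0.0
    for age in range(start_age, start_age + career_years):
        years_out = age - pre_k_age
        pv += pct_gain * annual_earnings / (1 + discount_rate) ** years_out
    return pv

# A "small" 1% earnings effect is worth thousands of dollars per child.
print(round(pv_earnings_gain(0.01)))
```

Under these assumptions, a permanent 1% earnings boost is worth several thousand dollars per child in present value at the time of pre-K, which is why effects that look “small” on tests can still be economically important relative to program costs.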

As mentioned before, one issue that Whitehurst does not grapple with is that even quite small test score effects would predict adult earnings effects that might be very important in a benefit-cost analysis. This makes research more difficult because it is hard to rule out test score effects that could be large in that they might make the program pay off.  Even with a relatively large sample, such as in the Head Start RCT, the studies are “underpowered” for detecting some effects that might be relevant.

This problem is exacerbated because studies frequently find that test score effects at third grade underpredict the long-run earnings effects of pre-K. Test score effects often fade but then re-emerge in better adult outcomes than would have been predicted at 3rd grade.  This increases the uncertainty about the results beyond what is described by the Head Start RCT’s standard errors.  

The other issue with the Head Start RCT is its relevance to current policy debates. First, as noted before, many of the state and local pre-K programs being debated are more educationally focused than has traditionally been the case for Head Start, which raises the issue of whether the Head Start results are generalizable to these state and local pre-K programs.

Second, the Head Start RCT is not comparing Head Start with no Head Start. Only 80% of the treatment group enrolled in Head Start. About half of the control group attended some pre-K program, including 14% in Head Start and 35% in some other pre-K program. If some of these other pre-K programs were more educationally focused than Head Start, this would reduce the net impact of a “More Head Start” treatment group versus an “Other Pre-K” control group. The issue would still remain of whether Head Start’s generally higher costs per child are justified by stronger results. But the Head Start RCT does not do a great job of answering the question “do educationally focused pre-K programs work?”
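The dilution from crossover can be quantified with standard intent-to-treat versus treatment-on-the-treated logic. The take-up rates below are those reported above; the intent-to-treat effect is a purely hypothetical number for illustration:

```python
# The Head Start RCT's experimental contrast is diluted by crossover:
# 80% of the treatment group enrolled in Head Start, while 14% of the
# control group also enrolled (and 35% attended some other pre-K).
# Under standard instrumental-variables assumptions, the effect on actual
# Head Start participants (TOT) equals the intent-to-treat (ITT) estimate
# divided by the difference in take-up rates. The ITT value here is a
# purely hypothetical number for illustration.

takeup_treatment = 0.80   # share of treatment group enrolled in Head Start
takeup_control = 0.14     # share of control group enrolled in Head Start

itt_effect = 0.05         # hypothetical ITT effect, in test-score units
takeup_gap = takeup_treatment - takeup_control
tot_effect = itt_effect / takeup_gap

print(round(takeup_gap, 2))    # effective contrast in Head Start enrollment
print(round(tot_effect, 3))    # implied effect per actual participant
```

With a 0.66 enrollment gap, any intent-to-treat estimate understates the effect on actual Head Start participants by roughly a third – and this adjustment still ignores the 35% of controls attending other pre-K programs, which is the deeper comparison-group problem discussed above.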

The various regression discontinuity studies of state and local pre-K programs, as Whitehurst notes, by their design cannot detect long-term effects. However, based on other studies, early post-program test scores frequently are better predictors of a program’s long-run adult earnings effects than are later test scores. Therefore, the early test score information is more valuable than might at first appear.

Whitehurst and I give the same grade, B, for the internal validity of the state and local RDD studies. However, Whitehurst’s text makes some disparaging remarks about RDD studies, which I have dealt with in previous blog posts.

Whitehurst’s problems with RDD seem to lead him to downgrade these studies’ external validity, which seems like the wrong place to downgrade the studies for any perceived issues with RDD. It seems to me that current studies of state and local pre-K programs are about as relevant as one can get to whether expanding such programs today is a good idea. I only give a grade of B+ because just because Location X’s pre-K program works, this doesn’t mean that Location Y’s pre-K program works – there might be quality differences between the two locations’ programs.

For Georgia and Oklahoma’s universal pre-K programs, I think Whitehurst misstates the magnitude of these results.  He states that these studies find less than a one-point difference on fourth-grade NAEP scaled scores. But the Cascio and Schanzenbach study he cites finds fourth-grade NAEP score effects, in the estimates they regard as preferred, of about 3 points. They also state that it would only take a NAEP score effect of 1.0 to 1.4 points for these programs to pass a benefit-cost test. “Small” and “large” are fuzzy terms. I would define “large” as being large enough for the program to plausibly pass a benefit-cost test.

The estimates in Cascio and Schanzenbach for Georgia and Oklahoma are statistically insignificant when the most rigorous corrections for statistical noise are made. This in part reflects an inherent problem in studies of aggregate data on one or two states – there’s so much noise in individual state test score trends that it is difficult for any intervention, even one with large effects, to show statistically significant effects.

For this reason, I downgrade the internal validity of such studies of universal programs: standard errors tend both to be large and to be difficult to estimate correctly in studies in which only one or two geographic units in the treatment group are being compared with all other geographic units. Estimates are often more imprecise than indicated by standard statistical software packages.
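A small placebo simulation illustrates the point. Assume (purely for illustration) that each state’s test scores receive an idiosyncratic state-level shock with a standard deviation of one score unit, and that the policy has zero true effect; with only two treated states, the naive difference-in-means estimator is still very noisy:

```python
# Placebo simulation of the "two treated states" problem. Assume (purely
# for illustration) that each state's test scores get an idiosyncratic
# state-level shock with standard deviation 1, and that the policy has
# zero true effect. How big do estimated "effects" look anyway?

import random
import statistics

random.seed(0)

def placebo_estimate(n_states=50, n_treated=2, shock_sd=1.0):
    shocks = [random.gauss(0.0, shock_sd) for _ in range(n_states)]
    treated, controls = shocks[:n_treated], shocks[n_treated:]
    return statistics.mean(treated) - statistics.mean(controls)

estimates = [placebo_estimate() for _ in range(10_000)]

# The spread of placebo estimates is roughly 0.7 score units even though
# the true effect is exactly zero.
print(round(statistics.stdev(estimates), 2))
```

The spread of these placebo estimates stays near 1/sqrt(2) of the state-shock standard deviation no matter how many control states are added – adding controls cannot rescue a two-state treatment group.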

As for external validity, I see no basis for giving a stronger or weaker external validity grade to the Georgia and Oklahoma studies than to studies of Kalamazoo, Tulsa, Boston, Michigan, New Jersey, South Carolina, West Virginia, Oklahoma, New Mexico, and Arkansas, which are all examined in the RDD research. These are all studies of state and local pre-K programs, and are generalizable to other state and local pre-K programs if these other programs are of similar quality.

For Tennessee Pre-K, as I have noted in previous blog posts, although the original design of this study was a randomized control trial, problems with attrition mean that the study falls short of the gold standard, by quite a bit.  For example, in the first cohort of children, the study was only able to get test score data from 46% of the pre-K participants versus 32% of the control group. The original treatment and control group are randomly chosen, but this is not true of the children for whom we actually have test score data.

Furthermore, there is some evidence that this attrition leads to bias, in that the full sample shows a reduction in kindergarten retention from 8% to 4%, and the smaller sample with test score data only shows a reduction from 8% to 6%.  In addition, these “retention” effects suggest that the program must be doing something to student achievement that is not fully reflected in test scores, otherwise why would retention be cut in half, as it is in the full sample?
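A small simulation can illustrate this attrition mechanism. This is not the Tennessee data: only the 46% and 32% follow-up rates are taken from the study, the strength of the consent-outcome relationship in each arm is a made-up assumption, and the true program effect is set to zero:

```python
# Illustrative simulation (NOT the Tennessee data) of how differential
# attrition can bias an RCT. The 46% and 32% follow-up rates are from the
# study; the strength of the consent-outcome relationship in each arm is
# a made-up assumption, and the true program effect is set to zero.

import random
import statistics

random.seed(1)

def observed_gap(n=5000):
    treat_scores, control_scores = [], []
    for _ in range(n):
        # Latent outcome with no treatment effect.
        score = random.gauss(0.0, 1.0)
        # Hypothetical: consent/testing is weakly score-related in one arm...
        if random.random() < 0.46 + 0.05 * score:
            treat_scores.append(score)
        score = random.gauss(0.0, 1.0)
        # ...and more strongly score-related in the other.
        if random.random() < 0.32 + 0.15 * score:
            control_scores.append(score)
    return statistics.mean(treat_scores) - statistics.mean(control_scores)

# The true effect is zero, yet the observed difference is clearly nonzero.
print(round(observed_gap(), 2))
```

Here the true program effect is zero, yet the observed comparison is biased simply because consent is more strongly related to children’s latent outcomes in one arm than in the other. Depending on the direction of that relationship, attrition can make a program look worse, or better, than it really is.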

For all these reasons, I regard the Tennessee study as meeting not a gold standard, or a silver standard, but a bronze standard. It is similar to the many other studies NOT discussed by Whitehurst that try to evaluate pre-K by controlling for observable characteristics of students, an approach that cannot correct for selection on unobservable characteristics.

As for external validity, the Tennessee study is definitely relevant to other state pre-K programs, but it is most relevant to the pre-K programs of states that are not spending enough per child on pre-K. According to the National Institute for Early Education Research, Tennessee’s spending per child is over $2,000 less than what is judged desirable for high-quality pre-K.  So Tennessee’s program may be relevant to some proposed state and local pre-K programs, but perhaps not so much to more fully-funded pre-K programs.

Finally, there is the recent study of North Carolina’s “More at Four” program, which I reviewed in a recent blog post. Whitehurst does not mention this study. This is a good silver-standard study because it relies on a “natural experiment”: the More at Four program was gradually rolled out over time in different counties. It is hard to see why a county’s 3rd-grade test scores, the appropriate number of years later, would be correlated with More at Four spending except through a true effect of the program.  And as with other studies of state and local pre-K programs, the study is highly relevant, as long as one remembers that each state’s program is different.

Overall, for the 10 studies/groups of studies that are graded by both Whitehurst and me, Whitehurst’s average grade is 3.15, between a B and a B+, and my average grade is 2.95, slightly less than a B. Whitehurst isn’t quite as tough a grader as I am, and in that sense he is not quite as “picky” as I am.
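For the record, these averages can be reproduced with a conventional 4.0 grade-point mapping. The letter-to-number scale is my assumption (standard GPA conventions); the grades themselves are the internal- and external-validity grades from the tables above:

```python
# Reproducing the average grades with a conventional 4.0-point mapping.
# The letter-to-number scale is an assumption (standard GPA conventions);
# the (internal validity, external validity) grade pairs are from the
# tables above, one pair per study/group, for each grader.

POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C": 2.0}

whitehurst = [("A-", "C"), ("B+", "C"), ("C", "B"), ("B", "C"), ("C", "A"),
              ("A", "B"), ("A", "A"), ("B", "B"), ("B", "A"), ("A", "A")]
bartik = [("A-", "B"), ("B+", "B"), ("B-", "B+"), ("B", "C"), ("B-", "C"),
          ("A", "B-"), ("A", "C"), ("B", "B+"), ("B-", "B+"), ("C", "B+")]

def average(grade_pairs):
    values = [POINTS[g] for pair in grade_pairs for g in pair]
    return sum(values) / len(values)

print(round(average(whitehurst), 2))  # 3.15
print(round(average(bartik), 2))      # 2.95
```

Averaging each grader’s internal- and external-validity grades over the 10 studies/groups yields 3.15 for Whitehurst and 2.95 for me under this mapping.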

Where we differ most obviously is over the Head Start RCT and Tennessee studies versus the older studies and the other more recent studies. He gives straight A’s to the Head Start RCT and the Tennessee study, so these two studies clearly dominate all the other studies from his perspective. In contrast, I give an average grade of B to the Head Start RCT, and B- to the Tennessee study. I give grades higher than B to Perry, Abecedarian, IHDP, the state/local RDD studies, and the North Carolina study, and give B-grade averages to CPC and the OK/GA studies. So, in my view, this other evidence dominates – there is a preponderance of evidence of equal or greater quality suggesting that pre-K can work.

Of course, the Head Start RCT and Tennessee evidence still matters – this evidence suggests that there are some doubts as to whether Head Start as of 2002 was as educationally effective as some state pre-K programs, and the Tennessee evidence raises some doubts about that state’s pre-K program. But there is no way in which I view this evidence as trumping all the other evidence, which seems to be Whitehurst’s view.  

Whitehurst is unusual among researchers in privileging the Head Start RCT and Tennessee studies over all the other evidence. That doesn’t mean he’s more picky, it simply means he has a different approach than most researchers to thinking about the strengths and weaknesses of different research approaches.   

(Note: Steve Barnett of the National Institute for Early Education Research has independently provided some reactions to Whitehurst’s blog post.  I wrote the first draft of this blog post prior to seeing Barnett’s reaction, and did not significantly revise the post to avoid overlap – there’s some overlap, but considerable independent information. )

Posted in Early childhood program design issues, Early childhood programs