Using test scores to evaluate early childhood programs does not imply that they should be used for accountability purposes for individual program centers or teachers

In some of my blog posts and published articles, I have used effects of early childhood programs on early test scores to evaluate programs.  For example, in my Tulsa study with Gormley and Adelstein, we estimated the effects of Tulsa’s pre-K program on kindergarten entrance test scores.  We then used these kindergarten test scores to simulate the likely impact of Tulsa’s pre-K program on adult earnings.  This simulation was based on estimates from Chetty et al. on the likely relationship between kindergarten test scores and adult earnings. (Chetty et al. “How Does Your Kindergarten Classroom Affect Your Earnings?”) With adult earnings effects, and the costs of the program, we then estimated a partial benefit-cost ratio: the ratio of the present value of the earnings benefits of the program to the costs of the program. This benefit-cost ratio is partial because it does not include other benefits such as lower crime, but adult earnings benefits are a major proportion of the likely long-run benefits of pre-K programs for former child participants and for society.

I think the evidence suggests that early test scores are a sufficiently good predictor of adult earnings effects, at least on average, that this is a legitimate procedure. Long-run studies of how early childhood programs affect adult outcomes are difficult and costly to do, because of the long time period involved and the problems in identifying causal relationships. It is therefore important to find short-run indicators that give us a reasonably good idea of the long-run effects of these programs. This allows us to draw on many more studies to see how benefit-cost ratios differ across types of programs. We can use short-run studies to examine the relative effectiveness of different program designs for children from different groups. Such short-run studies are cheaper to conduct and deliver results in a timelier manner. Therefore, they are more relevant to real-world policy decisions.

However, I do not think that this necessarily implies that test score effects of early childhood programs should play any substantial role in holding individual program centers or individual teachers accountable for “results”. Why might such an accountability role for test scores be a bad idea, even if test scores are useful in evaluating some overall program design?

The key problem is that once you use test scores of individual program centers or teachers to reward centers or teachers in some way, you distort the behavior of centers and teachers to focus much more attention on maximizing test scores. This would be fine if test score effects of a program, center, or teacher were all that mattered for determining program benefits. But even though test score effects of a program that is NOT being rewarded directly based on test scores are a good predictor of adult earnings, this does not mean that all that matters about the program are its effects on test scores. There is much else in program effects that probably matters. Test score effects may help proxy for these other program effects if the program’s operations are not artificially distorted by accountability incentives, but this is likely to be much less true if centers or teachers are simply “teaching to the test”.

For example, it is commonly believed in pre-K research that much of pre-K’s long-run benefit for adult earnings and reduced crime is due to the program’s effects on “soft skills”: social skills and character skills. These better soft skills at kindergarten entrance help students get along better with teachers and classmates in kindergarten, and therefore learn more in both hard skills and soft skills in kindergarten. And so on through the K-12 system and on into adulthood. In adulthood, these better soft skills are very important in determining success in the labor market. A person who can get along better with supervisors, co-workers, and customers is likely to retain their job longer, and be promoted to higher-paying jobs.

Why then might it be OK to use effects on kindergarten tests to estimate the effects of a program on adult earnings? After all, these tests mostly measure “hard skills”: various literacy and math skills. Soft skills are hard to measure.

The reason that using test scores is OK for predicting adult earnings is that effects on the hard skills that are measured by these tests are often a reasonable proxy for broader effects of the program on both hard skills and soft skills.  For example, pre-K programs that have more-experienced and better-trained teachers with smaller class sizes and a good curriculum are likely to have stronger effects on both hard skills and soft skills.  We can measure the effects on hard skills, and this is a good indicator of these broader effects.

But effects on test scores may no longer be a good indicator for soft-skill effects and long-run benefits if the behavior of individual centers and teachers is distorted by an undue emphasis on rewarding individual centers and teachers for increasing test scores. This reward for higher test scores may encourage individual centers and teachers to distort what they do to unduly emphasize how many letters or numbers the child knows at kindergarten entrance, rather than the child’s creativity, planning ability, and ability to be both a leader and a team player.

(In theory one could argue that if an entire statewide pre-K program knows it might be evaluated by some outside evaluator periodically based on test scores, this puts pressure on the overall system to overweight tests. But if these test score evaluations are combined with some other evaluation measures of the program, and a conscious decision is made that state program officials should be discouraged from pushing teaching to the test, then this danger is minimal. In contrast, if we set up some accountability system on autopilot with test score measures directly responsible for rewarding individual centers and teachers, there is a much greater danger of distorting behavior.)

How, then, can we hold individual centers and teachers accountable for providing high-quality services? Answer: by using that magical thing called human judgment, ideally disciplined by reasonably objective standards and drawing on multiple sources for such judgments.

For example, a recent article in Science found some evidence that the Classroom Assessment Scoring System (CLASS) gave measures of quality of pre-K classrooms that were strongly correlated with the study’s measure of average test score effects.  These CLASS measures are based on what trained classroom observers see about the quality of teacher-child interactions in the classroom. The CLASS measures, if done with properly trained personnel, seem to be fairly consistent across different observers.

CLASS measures seem likely to be harder to game than test score measures. To improve your CLASS score, you have to improve the quality of teacher-child interactions.  For accountability measures, CLASS measures also have some other advantages compared to test score measures. CLASS measures tell you something about classroom quality now, rather than waiting until the beginning of kindergarten or the end of pre-K to see what the test score effects are. CLASS measures also naturally lead to conversations about how to improve the quality of teacher-child interactions, which seems to be at the core of what quality means in pre-K.

Using test score effects for accountability in pre-K is certainly worth experimenting with. Maybe it can work if the test score measures are not over-emphasized, and if other quality measures, such as CLASS, are also used. But I think there are clearly dangers in over-emphasizing test scores in accountability measures. I don’t think we know enough yet for the federal government, state governments, or even local governments to mandate that test scores make up some percentage of an accountability metric for individual pre-K centers or teachers, unless such a system is part of an experiment or demonstration project. I suspect that as we learn more about the best accountability system for pre-K quality for individual centers or teachers, the accountability systems that work best will emphasize measures such as CLASS much more than test scores.

About timbartik

Tim Bartik is a senior economist at the Upjohn Institute for Employment Research, a non-profit and non-partisan research organization in Kalamazoo, Michigan. His research specializes in state and local economic development policies and local labor markets.

3 Responses to Using test scores to evaluate early childhood programs does not imply that they should be used for accountability purposes for individual program centers or teachers

  1. Sandra says:

    agree completely. but the CLASS is very labor intensive and expensive. is it realistic to expect early childhood programs to implement it as a quality measure in regular practice (as opposed to in the context of a research study)?

    • timbartik says:

      Very good comment, Sandra. I’m an economist, so I naturally think that costs — and benefits — should also be an issue.

      Do you have cost estimates for doing a CLASS evaluation? I think that’s the first issue. And then there’s the issue of whether it’s worth the costs, in terms of benefits provided.

      If the evaluation is useful for influencing pre-K quality by any significant percentage, then even a very slight improvement in quality would be worth it. For example, in a pre-K classroom, suppose there are 15 kids. A high-quality pre-K program might raise the present value of future adult earnings for each kid by somewhere between $15,000 and $30,000. Summed over all 15 kids, that’s a collective increase in the present value of earnings of between $225,000 and $450,000. Even a 1% improvement in pre-K effectiveness would therefore bring a return of between $2,250 and $4,500 per cohort. Therefore, it is worth investing a considerable amount in quality improvement measures even if they only slightly improve quality. If something such as CLASS helps better inform and guide such quality improvement measures, even a large cost would be worth it.
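The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the class size, the per-child earnings range, and the 1% improvement come from the paragraph above, while the function and variable names are made up for illustration.

```python
# Back-of-envelope check of the classroom figures quoted above.
# All names here are illustrative; the numbers come from the comment itself.

CLASS_SIZE = 15                        # children per pre-K classroom
GAIN_PER_CHILD = (15_000, 30_000)      # low/high present value of added adult earnings, in dollars
IMPROVEMENT_PCT = 1                    # a 1% improvement in pre-K effectiveness


def classroom_gain(gain_per_child, class_size=CLASS_SIZE):
    """Present value of added earnings summed over one classroom cohort."""
    return gain_per_child * class_size


def improvement_value(gain_per_child, improvement_pct=IMPROVEMENT_PCT,
                      class_size=CLASS_SIZE):
    """Dollar value, per cohort, of improving effectiveness by improvement_pct percent."""
    return classroom_gain(gain_per_child, class_size) * improvement_pct / 100


for label, per_child in zip(("low", "high"), GAIN_PER_CHILD):
    total = classroom_gain(per_child)
    value = improvement_value(per_child)
    print(f"{label}: cohort gain ${total:,.0f}, value of 1% improvement ${value:,.0f}")
```

Read this way, the value of a 1% improvement is roughly the most one could spend per classroom cohort on a quality measure that delivers that improvement and still break even on earnings benefits alone.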

      As I’ve emphasized before on this blog, the present value of future earnings is so large that even very slight improvements in pre-K quality are worth considerable up-front investment costs.
