During the week of Feb. 24, the New York City Education Department released data estimating the performance (based on a value-added testing evaluation metric) of over 12,000 teachers in the city's public schools. The individual data sets were requested by journalists, and these requests were at first resisted by many advocates for education "reform." Evaluating teachers based on student performance on standardized tests has become increasingly chic within the "reform" community. In part to receive funding from President Obama's "Race to the Top" initiative, New York state has recently put in place new measures that will tie a teacher's overall evaluation to his or her value-added testing rating. New York's "value-added" ratings attempt to estimate how much a teacher "improved" students' performance on standardized tests while adjusting for students' past performance, demographics, and other factors.
Teach For America alum and current Stuyvesant High School math teacher Gary Rubinstein recently dug into the raw teacher data -- and his findings have potentially devastating implications for New York's testing regime.
Rubinstein found a few startling results:
- There is relatively little correlation between how a teacher does one year and how that same teacher does the next year.
- There is next to no correlation between a teacher's success in teaching a group of students one subject and teaching that same group of students another subject.
- There is very little correlation between a teacher's success in teaching a subject at one grade level and his or her success in teaching that same subject at a different grade level in the same year.
Rubinstein makes many other intriguing points, but those three conclusions are striking. They raise serious questions about New York's value-added metrics.
The first result suggests little connection between a teacher's performance one year and his or her performance the next. Rubinstein notes:
"I found that 50% of the teachers had a 21 point 'swing' one way or the other. There were even teachers who had gone up or down as much as 80 points. The average change was 25 points. I also noticed that 49% of the teachers got lower value-added in 2010 than they did in 2009, contrary to my experience that most teachers improve from year to year."
That 25-point swing is the difference between a teacher performing significantly above average (say, with a score of 65) and significantly below average (say, with a score of 40). Are teachers really that uneven from one year to the next? Perhaps a teacher's scores fluctuate within a certain range in the short term while staying within that range over the long term. But many education reform programs would evaluate teachers on a year-to-year basis, and swings of this size could undermine such evaluations. Moreover, swings this large suggest a huge margin of error in the model's results. (Other analyses have also pointed to this lack of year-to-year correlation and to the wide margin of error in individual teacher evaluations based on these value-added metrics: According to the New York Times, the math rating for a teacher could be off by as much as 35 percent while the English rating could be off by as much as 53 percent.)
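For readers who want to check figures like these against the released data themselves, here is a minimal sketch of how a year-to-year correlation and an average "swing" could be computed from paired scores. The six score pairs are invented for illustration and are not Rubinstein's data; the function names `pearson` and `swing_summary` are hypothetical, and this is not necessarily the method Rubinstein used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def swing_summary(scores_a, scores_b, threshold):
    """Summarize how much each teacher's score moved between two ratings."""
    pairs = list(zip(scores_a, scores_b))
    swings = [abs(b - a) for a, b in pairs]
    n = len(pairs)
    return {
        "correlation": pearson(scores_a, scores_b),
        "mean_swing": sum(swings) / n,
        # Share of teachers whose score moved by at least `threshold` points.
        "share_at_or_above": sum(s >= threshold for s in swings) / n,
        # Share of teachers whose second score is lower than the first.
        "share_declined": sum(b < a for a, b in pairs) / n,
    }

# Made-up scores for six hypothetical teachers (NOT the real NYC data).
scores_2009 = [65, 40, 80, 22, 55, 91]
scores_2010 = [40, 61, 35, 70, 58, 12]

summary = swing_summary(scores_2009, scores_2010, threshold=21)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A correlation near 1 would mean that teachers keep roughly the same standing from one year to the next; a correlation near 0 would mean that one year's score says almost nothing about the next year's.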
Rubinstein suggests various reasons why we should be skeptical of a data set that shows very little correlation between a teacher's success in teaching one subject and his or her success teaching another subject. But perhaps the result that stands out the most is Rubinstein's finding that, according to the NYC testing metrics, there is little correlation between a teacher's success in teaching one subject at one grade level and his or her success teaching the same subject (in the same year) at a different grade level:
"Out of 665 teachers who taught two different grade levels of the same subject in 2010, the average difference between the two scores was nearly 30 points [out of 100 points]. One out of four teachers, or approximately 28 percent, had a difference of 40 or more points. Ten percent of the teachers had differences of 60 points or more, and a full five percent had differences of 70 points or more."
So a teacher could score a 70 in teaching math at the sixth-grade level while scoring only a 40 in teaching math at the seventh-grade level during the same year -- and that's only the average swing between the two scores. Rubinstein notes that one teacher he looked at scored a 97 teaching sixth-grade math while scoring a 6 -- yes, a 6 -- teaching seventh-grade math in the same year. Is it likely that one person could be truly exceptional at teaching math to sixth graders while also being utterly abysmal at teaching math to seventh graders? I guess it's possible, but that's a very far-out possibility.
Even if one sets those outliers aside (though such outliers could still lose their jobs over these test results), the average gap between the two scores is huge. NYC's testing metric thus implies that there is almost no connection between a teacher's success in teaching one grade level and his or her success teaching another, an assertion that flies in the face of observable experience and common sense. There might not be a perfect correlation between success at one level and success at another, but for there to be almost none is a very odd result. Common sense is not always correct, but highly complicated technical instruments can be mistaken, too.
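One way to see how a noisy instrument alone could produce this pattern is a small simulation. The sketch below assumes, purely for illustration, that each teacher has a stable underlying quality and that every rating adds independent random error to it. The numbers (a mean of 50, a quality spread of 10 points, noise of 5 or 30 points) are assumptions chosen to make the point, not estimates from the NYC data, and the simulated scores are not percentiles. The function name `simulated_correlation` is hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def simulated_correlation(noise_sd, n_teachers=5000, skill_sd=10.0, seed=0):
    """Correlate two noisy ratings of the same, unchanging teachers."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    first, second = [], []
    for _ in range(n_teachers):
        # Assumption: a teacher's true quality is identical in both ratings.
        skill = rng.gauss(50.0, skill_sd)
        # Each rating is the true quality plus independent measurement error.
        first.append(skill + rng.gauss(0.0, noise_sd))
        second.append(skill + rng.gauss(0.0, noise_sd))
    return pearson(first, second)

low_noise = simulated_correlation(noise_sd=5.0)
high_noise = simulated_correlation(noise_sd=30.0)
print(f"correlation with small measurement error: {low_noise:.2f}")
print(f"correlation with large measurement error: {high_noise:.2f}")
```

In this toy model the teachers never change, yet the two ratings barely agree once the measurement error dwarfs the real differences between teachers. That does not prove this is what happened in New York, but it shows that a near-zero correlation is what one would expect from a very imprecise instrument.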
We can broaden these points. One of the dangers of elaborate technocratic schemes is that they may produce results that are utterly unconnected to real needs and may create organizational imperatives that have no connection with realities. That's one of the reasons why traditional conservatives have been skeptical about radically centrally planned economies: the metrics of top-down bureaucrats, however sophisticated, will not always accord with reality.
As I have written before, it is unfortunate that some conservatives have forgotten the limits of technocratic instruments when it comes to education reform. Instead, many Republicans and supposed conservatives are doubling down on testing-driven educational reform, making standardized tests the central focus for student, teacher, and school evaluations. Yet Rubinstein's analysis suggests that the value-added metrics of NYC, however sophisticated, implicitly lead to results that seem to have little, if any, basis in reality. And if this testing regime has led to such results, we may have little reason to take any of its results seriously; all results, and all conclusions drawn from them, may be utterly poisoned by a false methodology.
Unfortunately, political power often has only a passing acquaintance with reason, and many teachers will have to take these results quite seriously indeed: their future employment may depend upon them. But the prejudices of the powerful should not deter a forthright use of reason. So we should ask: If this testing instrument, devised by the largest and one of the most sophisticated school districts in the United States, leads to conclusions that seem so utterly divorced from reality, why should this instrument be used to decide the fate of teachers? Maintaining real standards for education and encouraging excellence are good things, but it is not yet clear that this testing model helps achieve, or measures the achievement of, either goal. The fact that a rabbit hole is bipartisan (as testing-driven "reform" is) does not mean that it leads to a world that is any less fantastic.