12/27/2012 01:50 pm ET Updated Feb 26, 2013

A Possible Compromise on Test-Driven Accountability

I have long argued that value-added might become reliable enough to complement or supplement human observations, but that it never should be used to drive evaluations. We can never allow a statistical model, in the hands of management alone, to indict teachers as ineffective, so that they then have to prove their innocence.

Even so, I am open to Douglas Harris' proposal in "Creating a Valid Process for Using Value-Added Measures" that value-added play a part in a screening process to identify potential low-performers. Those teachers could face further scrutiny under multiple measures. I do not fully understand Harris' logic on one point and I would demand a couple of changes, but I would pursue his proposal as a way of ending the most destructive battle in our educational civil war.

Harris seeks a school equivalent of the medical practice of screening with a more primitive test and using a more expensive and reliable test to confirm potential problems. Presumably, Harris is aware of the medical downsides of too much screening. (Excessive testing, it has been estimated, may drive up medical costs by 25 percent or more.) He also identifies one of the worst problems with value-added as a part of a multiple measures evaluation regime. Evaluations are always a political process, and in many systems being indicted as ineffective is tantamount to a conviction. Harris agrees that evaluators who know a teacher's value-added may let that information color their observations. Usually the bias will be subtle, but there will be observers who will think, "I already know this teacher is not very good so I will give her a low score."

Harris addresses many of the worst practices that have been encouraged by the Race to the Top and other "reforms." These hurried policies threaten to wreck the careers of good teachers in schools where it is harder to raise test scores. Harris approaches value-added "not as a measure, but rather as a process." He also addresses the original sin of "reformers" who believe that "teacher quality" can be the silver bullet for overcoming poverty. Unlike many (or most?) accountability hawks, Harris wants "clear steps, checks and balances, and opportunities to identify and fix evaluation mistakes."

Harris also says that value-added should be used as a screener of teacher evaluators. He writes, "Two observers can look at the same classroom and see different things. That problem is more likely when the observers vary in their training."

My initial reaction to that sort of suggestion has always been consistent: it is a bad idea to do unto principals or other evaluators what we wouldn't want done to us. But if some key principles were nailed down, I could see this as a major breakthrough.

In our rush to evaluate, all sides seem to have bought into a common assumption. The cornerstone of the teacher quality edifice (at least for persons who have never set foot in an inner-city classroom) is that good teaching is good teaching and that it looks the same in high- and low-poverty schools. In other words, they assume that it is no more difficult to teach effectively in ineffective schools than it is to do so in effective schools. I doubt many inner-city teachers would accept this dogma, but teachers who challenge it are said to have low expectations. And districts like Washington, D.C., brag about training their observers to believe that good teaching always overcomes the baggage that students bring from home.

So, I would demand that value-added control for whether students attend neighborhood or magnet schools. If the data then show that it is no more difficult, systemically, to raise scores for students coming from intense concentrations of extreme poverty than it is to increase test scores for poor kids in selective schools, we practitioners will stand corrected. If, however, our professional judgments are confirmed and peer effects make it more difficult not only to raise scores but also to demonstrate the same repertoire of best practices in schools ravaged by disorder, truancy, and violence, then those value-added findings should be used to retrain the evaluator trainers.

Some (or many) districts may refuse to control in such a manner, claiming that it would give educators an "excuse" for giving up on difficult students. In that case, the union should demand the raw data so that it could run the numbers on its own dime. If (or when) systems refuse that request, teachers should walk away from the table and the social scientists who design value-added models should do the same. They could then become expert witnesses in a legal campaign to force districts to respect the scientific method when imposing high stakes on schools and educators.

I do not fully understand why Harris would claim that value-added screening, per se, would remove the incentive to expand bubble-in testing into all grades, but maybe I am misreading this sentence: "If teachers failed on either measure [value-added or observation], then that would be a reason for collecting additional information." If he wants both measures to be applied to all teachers, so that greater scrutiny is applied to those who do not pass both, then he is just doubling the hurdles that teachers must clear. On the other hand, reducing the relative importance of value-added could reduce the motivation to distort educational practice in order to stay out of the woodshed, so maybe Harris has a point.

I read Harris as also presenting a truth-in-advertising proposal that could help break our cycle of teacher-bashing, which creates more failure, which in turn prompts more scapegoating of teachers. Legislators who believed the inflated claims of the data-driven crowd were willing to foot the bill for a metric which supposedly would remove ineffective teachers, reward the top performers, and guide the improvement of the rest. If they believed the promises of Bill Gates, Arne Duncan, Michelle Rhee, and Jeb Bush, the silver bullet known as "teacher quality" would seem to be a bargain even if it required the quadrupling of standardized tests.

Harris, however, wants a screening approach that creates a "feedback loop." It would take as many resources, or more, to design a viable system which "certainly wouldn't solve all the problems with the new teacher evaluation systems." That is a far cry from the panacea that the accountability hawks have foisted on federal, state, and local governments. It is less likely that political leaders would foot the bill for the computer systems, the training, and the implementation of a screening system which incorporates value-added as only one part of a complex process.

So, I would use Harris' proposal as a step towards better evaluations and ending the discord that high-stakes testing has sown. I would ask social scientists to reciprocate, however.

Perhaps they could organize their own TFA -- Testify for Accuracy. This and other recommendations by Harris and others could be compiled into an educational version of the medical Standards of Care rubrics. It could be declared unethical to continue to participate in developing a high-stakes value-added system that did not live up to scientific standards. When districts ignore the professional advice of the designers of statistical systems, the ethical response would be to testify about the system's problems. Just as attorneys -- as officers of the court -- have an ethical obligation to defendants who cannot afford counsel, a panel of statisticians should stand ready to testify in termination cases where their models are used in a way that offends their profession's sense of propriety.