Huffpost Education

Todd Farley

Lies, Damn Lies, and Statistics, or What's Really Up With Automated Essay Scoring


A couple years back I wrote a vitriolic exposé of the greedy and incompetent for-profit standardized testing industry, so I didn't expect to be impressed by the automated essay scoring study released in April (Contrasting State-of-the-Art Automated Scoring of Essays: Analysis, authored by Mark Shermis of the University of Akron). As an anti-testing guy, and a writer, I never thought I'd be wowed by a technology that claims to be able to assess student writing without even being able to, you know, read. In the end, however, I couldn't help but be impressed by the shiny statistics provided in support of the study's own findings. There was page after page of tables filled with complicated numbers, graph after graph filled with colorful lines, plus the Pearson r, all purporting to show the viability (if not superiority) of automated essay scoring engines when compared to human readers.

While I'd never before heard of the Pearson r (pronounced, I believe, "ca-ching"), I did hope it was affiliated with the mammoth British conglomerate of the same name: After all, what statistic could be more trustworthy today than a number invented by Pearson Education in support of products made and sold by Pearson Education?


As any astute reader but no automated essay scoring program might have gleaned by now, I actually do have my doubts about the automated essay scoring study. I have my doubts because I worked in the test-scoring business for the better part of fifteen years (1994-2008), and most of that job entailed making statistics dance: I saw the industry fix distribution statistics when they might have shown different results than a state wanted; I saw it fudge reliability numbers when those showed human readers weren't scoring in enough of a standardized way; and I saw it fake qualifying scores to ensure enough temporary employees were kept on projects to complete them on time even when those temporary employees were actually not qualified for the job.

Given my experience in the duplicitous world of standardized test-scoring, I couldn't help but have my doubts about the statistics provided in support of the automated essay scoring study -- and, unfortunately, that study lost me with its title alone. "Contrasting State-of-the-Art Automated Scoring of Essays: Analysis," it is named, with p. 5 reemphasizing exactly what the study is supposed to be focused on: "Phase I examines the machine scoring capabilities for extended-response essays." A quick perusal of Table 3, however, on page 33, suggests that the "essays" scored in the study are barely essays at all: "Essays" tested in five of the eight sets of student responses averaged only about a hundred and fifty words.

Although Mr. Shermis disputed that claim during a radio interview by stating that, overall, the average length of the study's student responses was about 250 words, the numbers on Table 3 reveal something else. While Test Sets 1 and 2 included essays averaging 360-370 words and Test Set 8 included essays averaging 640, the mean lengths of the remaining five Test Sets were ridiculously short:

Test Set 3--113 words

Test Set 4--98 words

Test Set 5--127 words

Test Set 6--152 words

Test Set 7--173 words

The first paragraph of this article is 146 words, meaning it's as long as most of the "essays" included in Test Sets 3-7. While perhaps it's only semantics to argue what the term "essay" means exactly, I can't help but be disappointed. To claim this study is focused on automated scoring of essays when the "essays" in five of the eight test sets are really no more than paragraph length reeks of disingenuousness to me -- it may not be the flat-out duplicity and deceit that long characterized my time in the test-scoring industry, but neither does it come across as the height of honesty.

Anything else about the study concern me? From my initial perusal of its results, the automated scoring engines seemed to have done a considerably poorer job than the human readers on Test Set 1, which was especially troubling because that set was very similar to many of the state and national writing assessments I'd scored over the years (a persuasive topic, 8th grade holistic scoring with a six-point rubric, roughly 375-word responses -- all that sounds exactly like the writing portion of the NAEP test, considered the "gold standard" of standardized testing).

From my reading of Table 8, on p. 38, for Test Set 1 the human readers seemed to be agreeing with each other when scoring those essays 64% of the time, but the automated scoring systems were getting agreement numbers down in the forties, or even the thirties. While that human reliability number of 64% would be an acceptable stat in the test-scoring world for a six-point rubric (even for the NAEP test), all those 30% and 40% stats produced by the automated scoring engines would be absolutely and irretrievably unacceptable: Those stats would engender a human trainer being replaced, the human scorers being replaced, and all those essays having to be rescored.

When I questioned Mr. Shermis about those numbers in an e-mail, he kindly responded by telling me I was reading the stats on Table 8 correctly but that I "unfortunately came to the wrong conclusion." He pointed out to me the asterisk attached to the Test Set 1 stats ("*") and explained to me that ... yadda yadda yadda (complicated statistical stuff) ... the end result being that "the machines had suffered as a consequence." So, while I admit Mr. Shermis' linguistic gymnastics/defense of the Test Set 1 stats were a bit hard to follow, the fact he said the "machines suffered" did seem a tacit admission that the automated scoring systems hadn't done that great a job on the only set of essays that looked like a normal writing assessment to me.

During this e-mail exchange, Mr. Shermis advised me to look at the scoring stats for Test Set 5 instead of Test Set 1, where he said I would see that "all the machines but one" had better agreement numbers than did the human scorers. While I scoffed at the "essays" of Test Set 5 to begin with (all 127 words of them), in looking at those stats I again came to an opposite conclusion from the study's author: To me it looked like the two human readers were getting agreement stats when scoring those "essays" of between 77% and 79%, but the automated scoring systems seemed to be producing numbers in a range from 47% (ouch) up to 71%, with most of the systems earning stats in the sixties. When I asked Mr. Shermis for an explanation of that, he called me a "nitwit."

Of course he really said no such thing, instead explaining in great detail that the problem with my interpretation was that "since each [human] rater determines the resolved score (and thereby creates a part-whole relationship), the agreement coefficient is artificially raised to the higher levels that you mention." Naturally, that statistical obfuscation stunned me into incomprehension, and I had no idea what Mr. Shermis was trying to say other than that I was wrong. At that point it finally dawned on me that there would be no argument I could make about the results of his study that Mr. Shermis wouldn't have an answer for -- such is the "fun with numbers" that such data allow, a fact I should have recalled from my own career of making statistics dance.

I'll be the first to admit I might be wrong about all of this (I'm not exactly a statistician and I've been out of the test-scoring business, mercifully, for almost five years), so perhaps my doubts about the validity of the automated essay scoring study are completely unfounded. Maybe all those stats are legit, and maybe all those teeny-weeny writing samples really can be called "essays." Perhaps the fact it's possible to dupe those automated scoring systems into increasing an essay's score by adding almost anything to it -- whether random strings of words, unique vocabulary, incoherent thoughts, unusual transitions, or plagiarized paragraphs -- really isn't that big a deal, as Tom Vander Ark (the study's patron) seems to be claiming in the "Comments" section of pretty much every website on the Internet.

So what? Even if all that is true, what is it about the wonders of automated essay scoring that this study really is claiming? Come to find out, the study asserts very little. Perhaps most importantly, the study makes no claim about those automated scoring engines being able to read, which they emphatically cannot do. That means that even though every single one of those automated scoring engines was able to pass judgment on every single one of the essays painstakingly composed by American students, not even one of those scoring engines understood even one word from all those kids.

Provocative thoughts in those essays? The automated scoring programs failed to recognize them. Factual inaccuracies? The scoring engines didn't realize they were there. Witty asides? Over the scoring engines' heads they flew. Clichés on top of clichés? Unbothered by them the scoring systems were. A catchy turn-of-phrase? Not caught. A joke about a nitwit? Not laughed at. Irony or subtlety? Not seen. Emotion or repetition, depth or simplicity, sentiment or stupidity? Nope, the automated essay scoring engines missed 'em all. Humanity? Please.

In fact, on p. 2 the study's major finding states only that "the results demonstrated that overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items." A paragraph on p. 21 reiterates the same thing: "By and large, the scoring engines did a good [job] of replicating the mean scores for all of the data sets." In other words, all this hoopla about a study Tom Vander Ark calls "groundbreaking" is based on a final conclusion saying only that automated essay scoring engines are able to spew out a number that "by and large" might be "similar" to what a bored, over-worked, under-paid, possibly-underqualified, temporarily-employed human scorer skimming through an essay every two minutes might also spew out. I ask you, has there ever been a lower bar?

The fallibility of human test-scoring is confirmed in the automated essay scoring study itself, which notes on p. 19 that "trained human raters, as with subject-matter or writing experts, can read the same paper and assign the same or different scores for different reasons."

In the test-scoring business, the primary statistic used to establish how valid a job is being done when humans score essays is "exact agreement" (meaning the percentage of times a second person reading an essay gives the same score as a first). As noted above, however, to be considered acceptable those "reliability numbers" only have to be about 60%, meaning it's expected that 40% of the time two human readers will give different scores to the same essay. Perhaps you can understand why in my book I called test-scoring an "inexact science."

A secondary statistic used to establish how valid a job is being done when humans score essays is "exact-adjacent agreement," meaning the percentage of times a second person reading an essay gives the same score as the first reader, or the score next to that! For instance, if a first reader gives an essay a 4 on a 6-point scale, the second reader would have to give that same essay a 3, 4, or 5 to be considered "a match." Giving a score of 3, 4, or 5 on a six-point scale is half of the rubric, a bar so low it should be buried underground -- it's akin to saying two teachers agreed about a student's work when one teacher awarded it a B but the next awarded it an A. Or a C.
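For the numerically inclined, both statistics can be sketched in a few lines of Python. This is only an illustration, not the study's or the industry's actual calculation: the function names and the ten pairs of scores are invented for the example.

```python
def exact_agreement(first, second):
    """Share of essays where both readers gave the identical score."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(1 for a, b in zip(first, second) if a == b)
    return matches / len(first)


def exact_adjacent_agreement(first, second):
    """Share of essays where the two scores are identical or one point apart."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(1 for a, b in zip(first, second) if abs(a - b) <= 1)
    return matches / len(first)


# Hypothetical scores from two readers on a six-point rubric (invented data,
# not taken from the study).
reader_1 = [4, 3, 5, 2, 4, 6, 3, 1, 4, 5]
reader_2 = [4, 4, 5, 2, 3, 4, 3, 2, 4, 5]

print(f"exact agreement: {exact_agreement(reader_1, reader_2):.0%}")
print(f"exact-adjacent agreement: {exact_adjacent_agreement(reader_1, reader_2):.0%}")
```

With these made-up scores, the two readers match exactly on six of ten essays but count as "agreeing" on nine of ten once adjacent scores are allowed, which shows how much the second statistic flatters the same pair of readers.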

In other words, this study confirms the fact humans don't do that great a job when assessing essays but also wants to celebrate the success of automated scoring engines by saying that they do "similar" work, "by and large." Unfortunately, that means the study's final conclusion is really no more than a lame claim that automated scoring engines are able to give scores to student essays that are in the ballpark of the scores human readers give, even though those human scores are probably only in the ballpark of what the student writers really deserve.

I tell you, it's a home run!

None of this would really matter if automated scoring engines were only used in the classroom for "formative assessment or the instruction of writing," as Mr. Shermis wrote to me. But when someone like Tom Vander Ark is all over the place touting the merits of automated essay scoring as being "fast" (true), "cost-effective" (true), and "accurate" (wink-wink, nudge-nudge), that can lead people to believe this "groundbreaking" study is something else entirely.

At the time of the study's release, Barbara Chow, the Education Program Director at the Hewlett Foundation, said "This demonstration of rapid and accurate automated essay scoring will encourage states to include more writing in their state assessments. And, the more we can use essays to assess what students have learned, the greater the likelihood they'll master important academic content, critical thinking, and effective communication."

State assessments, she says? Academic content? Critical thinking? Effective communication? Really? From a technology that has absolutely zero idea what any student has written?

I am not anti-technology, and I know the day will come when computers will read and understand writing, will assess writing, will teach writing. The automated scoring engines that currently exist will probably prove fundamental to those later developments, too. But until the time comes that automated scoring engines can read and understand, the benefits of those systems seem quite limited to me, not to mention negligible when it comes to high-stakes assessment. It seems to me that claiming otherwise -- claiming automated essay scoring will change assessment as we know it or will lead to miraculously robust Common Core tests -- is to be more concerned with expediency (and, surely, profit) than anything at all to do with the education and enrichment of this country's students and writers.

Maybe a technology that purports to be able to assess a piece of writing without having so much as the teensiest inkling as to what has been said is good enough for your country, your city, your school, or your child. I'll tell you what though: Ain't good enough for mine.