Robots Are Grading Your Papers!

By Marc Bousquet

“Insufficient number of supporting examples. C-minus. Meep.” (Photo by Flickr/CC user geishaboy500)
A just-released report confirms earlier studies showing that machines score many short essays about the same as human graders. Once again, panic ensues: We can’t let robots grade our students’ writing! That would be so, uh, mechanical. Admittedly, this panic isn’t about Scantron grading of multiple-choice tests, but about an ideological, market- and foundation-driven effort to automate assessment of that exquisite brew of rhetoric, logic, and creativity called student writing. Without question, the study was performed by folks with huge financial stakes in the results, driven by motives other than education. But the real question isn’t whether the machines deliver similar scores; it’s why they do.
It seems possible that what really troubles us about the success of machine assessment of simple writing forms isn’t the scoring, but the writing itself: forms of writing that don’t exist anywhere in the world except school. It’s reasonable to say that the forms of writing successfully scored by machines are already mechanized forms: writing designed to be mechanically produced by students, mechanically reviewed by parents and teachers, and then, once transmuted into grades and sorting of the workforce, quickly recycled. As Evan Watkins has long pointed out, the grades generated in relation to this writing stick around, but the writing itself is made to disappear. Like magic? Or like concealing the evidence of a crime?
The Pen Is Advanced Technology
Of course all machines, from guitars to atom bombs, have no capacity to achieve any goals on their own. Nonetheless, detractors of machine grading point out the obvious: that machines don’t possess human judgment, as if they possessed some other, alien form of reasoning. Computers can’t actually read the papers, they insist. Computers aren’t driven by selfless emotions, such as caring about students. Faced with proof that human test graders don’t always meaningfully read the papers or care about students, machine-grading detractors pull the blankets over their heads and howl: But they’re not human, damn it!
But the evidence keeps piling up. Machines successfully replicate human mass-scoring practices of simple essay forms, including the “source-based” genre. After reading reports released on the topic for nearly twenty years now, most working teachers of student writing grumble for a while, then return to the stack of papers at their elbow, and grade them mechanically.
The fact is: Machines can reproduce human essay-grading so well because human essay-grading practices are already mechanical.
To be sure, these results are usually derived from extremely limited kinds of writing in mass-scoring situations. They are easily defeated by carefully constructed “bad faith” responses. Since machines don’t read, they don’t comprehend the content, and cannot give feedback on rhetorical choices and many aspects of style. They can, and do, give feedback on surface features and what is sometimes called, more appropriately than ever, mechanical correctness. They cannot assess holistically, but can provide a probabilistic portrait by assembling numerous proxies, usually the same ones that human teachers use to substantiate holistic judgments, such as complexity of word choice and variety of sentence construction. Automated scoring can also detect some rhetorical dimensions of an essay, such as the presence of evidence and the syntax used in simple argument.
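To make that “probabilistic portrait by proxies” concrete, here is a minimal sketch, not any vendor’s actual algorithm: count a handful of surface features that tend to track human holistic scores and combine them with weights. Every feature, marker phrase, and weight below is a hypothetical placeholder; a real scorer would fit such weights against a large corpus of human-scored essays.

```python
# Minimal sketch of proxy-based essay scoring. Illustrative only; features and
# weights are assumptions, not any commercial system's actual method.
import re
import statistics


def surface_features(essay: str) -> dict:
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    sentence_lengths = [len(s.split()) for s in sentences]
    return {
        "length": len(words),                      # sheer length
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),          # word "complexity"
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1), # vocabulary variety
        "sentence_variety": statistics.pstdev(sentence_lengths) if len(sentence_lengths) > 1 else 0.0,
        # Presence of "evidence," detected by marker phrases rather than comprehension.
        "evidence_markers": sum(essay.lower().count(m) for m in ("for example", "according to", "studies show")),
    }


def score(essay: str) -> float:
    features = surface_features(essay)
    # Hypothetical weights; in practice these would be regressed against human scores.
    weights = {
        "length": 0.004,
        "avg_word_len": 0.5,
        "type_token_ratio": 1.5,
        "sentence_variety": 0.1,
        "evidence_markers": 0.5,
    }
    raw = sum(weights[name] * value for name, value in features.items())
    return max(1.0, min(6.0, raw))  # clamp to a 1-6 holistic scale


if __name__ == "__main__":
    sample = ("For example, according to several studies, standardized essays reward "
              "length and vocabulary. Sentence variety also raises scores. Comprehension does not.")
    print(round(score(sample), 2))
```

The point of the sketch is how little it needs to “know”: it rewards length, longer words, varied sentences, and the mere presence of evidence-marker phrases without comprehending any of them, which is roughly the critique the rest of this piece levels at mass human scoring as well.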
Humans Acting Badly
Developers of these programs generally admit these limitations, primarily offering automation as an alternative to human graders in mass-assessment circumstances. When performed by humans, large-scale scoring of simple writing is commonly outsourced to poorly paid, under-qualified, overworked temps managed by incompetent greed-merchants in the scandal-ridden standardized testing industry.
Like the machines that replicate their efforts so well, the humans working in mass writing assessment are working to cookie-cutter specifications. They are not providing meaningful feedback on content. Spending a minute or two on a few hundred words, they are generally not “reading,” but scanning for many of the same characteristics that machine scorers are programmed to detect. Like factory workers, they are providing results as quickly and cheaply as possible in order to line their employers’ pockets. Routinized, working to narrow formula, scanning superficially for prescribed characteristics at high speed, often incompetently managed and administered, most mass graders perform robotically.
Reading like a confessional “I was an economic hit man” for managed instruction, Making the Grades by Todd Farley chronicles one temp essay-scorer’s rise to high living at the pinnacle of mass testing’s profit-accumulation scheme. Riding in hired cars through burned-out public school districts to eat exotic meals prepared by celebrity chefs, Farley details how the for-profit scheme of high-stakes testing forces public-school teachers, students, and parents onto a faux-learning assembly line featuring teaching as test-prep drill instruction with 60 students in a class.
But Are Robots Also Teaching?
Teaching and test-scoring are very different circumstances. The fact that test scorers act mechanically doesn’t mean that teachers do. Except that most teachers are under very similar pressures: too many students, too little time, intense bureaucratic control, insufficient training, insufficient rewards to recruit and retain talent, and pedagogically unsound working conditions.
Just like teachers of other subjects, high school writing teachers are expected to “teach to the test,” usually following a rigid curriculum tailored to produce essays that do well in the universe of mechanical scoring, whether that mechanical scoring is provided by machines or degraded humans. Because of the high stakes involved, including teacher pay and continuing employment, the assessment drives the rest of the process. There are plenty of teachers who have the ability to teach non-mechanical forms of writing, but few are allowed to do so.
This managed, often legislated, pedagogy generally fails. Mechanical writing instruction in mechanical writing forms produces mechanical writers who experience two kinds of dead end: the dead end of not passing the mechanical assessment of their junk-instructed writing, and the dead end of passing the mechanical assessment, but not being able to overcome the junk instruction and actually learn to write.
As bad as this pedagogy’s failures are its successes. Familiar to most college faculty is the first-year writing student who is absolutely certain of her writing ability. She believes good writing is encompassed by surface correctness, a thesis statement, and assiduous quote-farming that represents “support” for an argument ramified into “three main points.”
In reality, these five-paragraph essays are near-useless hothouse productions. They bear the same relationship to future academic or professional writing as picking out “Chopsticks” bears to actually playing music at any level. Which is to say, close to none.
But students, particularly “good” students, nonetheless have terrific confidence in these efforts because they’ve been mechanically assessed by caring human beings who are, reasonably enough, helping them through the gates represented by test after test that looks for these things.
Not everything that teachers do is mechanical, but the forces of standardization, bureaucratic control, and high-stakes assessment are steadily shrinking the zone in which free teaching and learning can take place. Increasingly, time spent actually teaching is stolen from the arid waste of compulsory test preparation, in writing instruction as much as in every other subject. In this, teachers resemble police officers, nurses, and other over-managed workers, who have to steal time from their personal lives and from management in order to actually do law enforcement or patient care, as The Wire points out.
What Would Be Better?
Rebecca Moore Howard is a researcher in one of the nation’s flagship doctoral institutions in writing studies, the program in Composition and Cultural Rhetorics at Syracuse University. Howard’s Citation Project explores the relationship of college writers to source material. The first major findings of the 20-researcher project, conducted at 16 campuses? Even academically successful students generally don’t understand the source material on which they draw in their school writing.
Howard employs the term “patchwriting” to describe one common result of what I have long called the “smash and grab” approach that students employ to produce what we encourage them to pass off as “researched writing”: Scan a list of abstracts like a jewelry store window. Punch through the plate glass to grab two or three arguments or items of evidence. Run off. Re-arrange at leisure. With patchwriting, students take borrowed language and make modest alterations, usually a failed attempt at paraphrase. Together with successful paraphrase and verbatim copying, patchwriting characterizes 90 percent of the research citations in the nearly 2,000 instances Howard’s team studied at a diverse sampling of institutions. Less than 10 percent represented summary of the sense of three or more sentences taken together.
My own take on this research is that it strongly suggests the need for a different writing pedagogy. These students aren’t plagiarists. Nor are most of them intrinsically bad writers, whatever that might mean. Instead, I believe they’ve been poorly served by ill-conceived mass instruction, itself a dog wagged by the tail of mass assessment.
Like most of the students I’ve seen in two decades of teaching at every level including doctoral study, they have no flipping idea of the purpose of academic and professional writing, which is generally to make a modest original contribution to a long-running, complicated conversation.
To that end, the indispensable core attribute of academic writing is the review of relevant scholarly literature embedded within it. An actual academic writer’s original contribution might be analytical (an original reading of a tapestry or poem). Or it might be the acquisition or sorting of data (interviews, coding text generated in social media, counting mutations in an insect population). It might be a combination of both. In all of these cases, however, an actual academic writer includes at least a representative survey of the existing literature on the question.
That literature review in many circumstances will be comprehensive rather than merely representative. It functions as a warrant of originality in both professional and funding decisions (“We spent $5 million to study changes in two proteins that no other cancer researcher has studied,” or “No one else has satisfactorily explained Melville’s obsession with whale genitalia”). It offers a kind of professional bona fides (“I know what I’m talking about”). It maps the contribution in relation to other scholars. It describes the kind of contribution being made by the author.
Typically, actual academic writers attempt to partly resolve an active debate between others, or to answer a question that hasn’t been asked yet, which I describe to my students as “addressing either a bright spot of conflict in the map of the discourse, or a blank spot that’s been underexplored.”
In many professional writing contexts, such as legal briefing, literature review is both high-stakes and the major substance of the writing.
So why don’t we teach that relationship to scholarly discourse, the kind represented by the skill of summary in Howard’s research? Why don’t we teach students to compose a representative review of scholarship on a question? On the sound basis of a lit review, we could then facilitate an attempt at a modest original contribution to a question, whether it was gathering data or offering new insight.
The fact is, I rarely run into students at the B.A. or M.A. level who have been taught the relationship to source material represented by compiling a representative literature review. Few even recognize the term. When I do run into one, they have most commonly not been taught this relationship in a writing class, but in a small class in an academic discipline led by a practicing researcher who took the trouble to teach field conventions to her students.
Quote-Farming: So Easy a Journalist Can Do It
I personally have a lot of respect for journalists, and sympathize with their current economic plight, which is so similar to that of teachers and college faculty. They too do intellectual work under intense bureaucratic management and increasingly naked capitalist imperatives. So there are reasons why their intellectual product is often so stunted and deformed that the country turns to Jon Stewart’s parody of their work for information as well as critical perspective.
Though not always due to the flaws of journalists themselves: If there are real-world models for the poor ways we teach students to write, they’re drawn from newspaper editorials and television issue reporting. In editorials, “sources” are commonly authorities quoted in support of one’s views or antagonists to be debunked. In much television issue reporting, frequently composed in minutes on a deadline, quick quotes are cobbled together, usually in a false binary map of she’s-for-it and he’s-against-it. (NPR made headlines this year when it formally abandoned the fraudulent practice of simulating balance through “he said, she said” reporting, which presents differing views, usually two, as if they held equal merit, when in reality there can as easily be 13 sides, or just one, each with very different validity.)
Of course journalism can do better and often does, but it is some of journalism’s most hackneyed practices that have shaped traditional pedagogy for academic writing: quote-farming, argument from authority, false binarism, fake objectivity.
Those practices are intrinsically unappealing, but the real problem is the mismatch.
Academic writing bears a very different relationship to academic “sources” than journalism does. For journalists in many kinds of reporting, academic sources are experts, hauled onto stage to speak their piece and shoved off again, perhaps never to be met with again.
It’s this sort of smash-and-grab, whether from the journalist’s Rolodex or smartphone, from a scholarly database, or from the unfairly blamed Google (as if this practice were invented by internet search!), that we teach to our students by requiring them to make thesis statements and arguments “supported by sources.”
For practicing academic and professional writers, other professional sources are rarely cited as authorities, except as representative of general agreement on a question. Most other citations are to the work of peer writers, flawed, earnest, well-meaning persons who have nonetheless overlooked an interesting point or two.
Surveying fully and fairly what these peers tried to do, and then offering some data or some insight to resolve an argument that some of them are having, or pointing to an area they haven’t thought about: that is what we do. The substance of the originality in most academic and professional writing is a very modestly framed contribution carefully interjected into a lacuna or a debate among persons you will continue to interact with professionally for decades. In almost every respect it little resembles the outsized ambitions (let’s resolve reproductive rights in 600 words!) and modest discursive context (a news “peg”) of mass-mediated opinion.
Sure, no question, “everything’s an argument,” but argument or generic notions of persuasion used in the mass media aren’t always the best model for academic and professional discourse. (And I say this as someone who’s not afraid to argue.)
A big reason for the success of They Say/I Say, a popular composition handbook by Cathy Birkenstein and Jerry Graff, is its effort to provide an introduction to the actual “moves that matter in academic writing,” moves which generally involve relating one’s position to a complicated existing conversation.
Teaching & Grading Academic Writing by Persons Who Don’t Do It
What Becky Howard has in common with Birkenstein & Graff is valuing the ability to represent that complicated existing conversation. What is particularly useful to all of us is that they grasp that this is a problem that can’t be harrumphed out of existence: “Well, if those kids would actually read!” Let’s leave out the fact that most of the persons enrolled in higher ed aren’t kids, and that they do read, and write, a lot. Let’s leave out the whole package of dysfunctional pedagogies we impose on students and the contradictory narratives we tell about them: Large lecture classes are fine, but video capture of large lectures is bad! (Right, grandpa: it’s much better to deny me access to discussing the material with an experienced faculty member actively researching in her field because you’ve scaled her up with an auditorium sound system rather than a video camera. That makes total sense. Defend the lecture hall!) As David Noble and I and others have pointed out many times, the reason current technologies don’t, won’t, and can’t eliminate the labor of actual teaching is the same reason that earlier technologies, like the book, the post office, television, and radio, did not: Actual teaching is dialogic and occurs in the exchange between faculty and students. The more exchange, the more learning. (Of course, much of what is certified as learning isn’t anything of the kind.)
Our writing pedagogy is the main problem here: what we ask faculty and teachers to do, whom we ask to do it, and the ways we enable and disable them through bureaucracy and greed, whether that greed is for-profit accumulation or the harvesting of tuition dollars for in-house spending on a biochemist’s lab. (As I’ve previously insisted, the for-profits can accumulate capital with sleazy cheap teaching because the nonprofits do the same thing, except that they accumulate their capital as buildings and grounds, etc.)
One of the reasons students don’t learn to read academic articles and compose literature reviews in writing classes is that they are taught by persons who don’t do it themselves: nontenurable faculty, many without the Ph.D., or graduate students newly studying for it, many of whom don’t get an education in the practice themselves until they begin their own comprehensive lit review in preparation for a thesis. Often they are highly managed faculty, working like high-school teachers (except with much less training) to a scripted curriculum with mass syllabi, identical assignments that are easy to produce mechanically and grade mechanically, in a routinized “teaching” factory that is easy to assess mechanically, train mechanically, and supervise mechanically.
Unsurprisingly: No reliable computerized assessment can tell whether a review of scholarly literature is an accurate representation of the state of knowledge in a field. Nor can it adjudge whether a proposed intervention into a conflict or neglected area in that field is worthy of the effort, or help a student to refine that proposed experiment or line of analysis. Of course, many of the persons we presently entrust with writing instruction also lack the ability, training, or academic freedom to do these things.
If we are to do more with writing classes and writing assignments, we need to put aside the hysteria about machine grading and devote our attention to the mechanical teaching and learning environment in which we daily, all but universally, immerse our writing faculty. We need to change the kind of writing we ask them to teach. We need to enable writing faculty to actually do the kind of academic writing they should be teaching, which means changing our assumptions about how they’re appointed, supported, evaluated, and rewarded. You want to be a machine-breaker and fix writing pedagogy? Great. Start with your professional responsibility to address the working circumstances of your colleagues serving on teaching-only and teaching-intensive appointments.