Can Artificial Intelligence aid in the assessment of primary writing?

Daisy Christodoulou

Of those 3,825 human decisions, our AI agreed with 3,118 and disagreed with 707. That’s an 82% agreement rate, which is similar to the typical human-human agreement across our projects.
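To make the arithmetic concrete, here is a minimal sketch of the agreement-rate calculation. The counts are the figures quoted above; the variable names are ours, not from the judging system.

```python
# Human-AI agreement rate across paired judging decisions.
# Counts are the figures quoted above; names are illustrative.
agreements = 3_118
disagreements = 707
total_decisions = agreements + disagreements  # 3,825 human decisions

agreement_rate = agreements / total_decisions
print(f"Human-AI agreement: {agreement_rate:.1%}")  # 81.5%, i.e. roughly 82%
```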

What about those 18% of disagreements? This is really important. It's easier than you might think to get 82% human-AI agreement; what really matters is how far apart the two judgements are when they disagree. Reassuringly, most of the human-AI disagreements were small. Of the 707 judgements where the human and AI disagreed, 50% were under 21 points, 90% were under 60 points, and 97% were under 80 points. (Our scale is fine-grained and runs from about 300 to 700.)
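Those thresholds are cumulative shares of the gap distribution. A minimal sketch of how such shares could be computed, assuming the 707 disagreements are available as absolute score gaps; the `gaps` list here is a hypothetical stand-in, not our data:

```python
# Cumulative share of disagreements falling under each score-gap threshold.
# `gaps` is a hypothetical placeholder for the 707 absolute score differences.
gaps = [14, 35, 8, 62, 19, 77, 41, 90, 5, 28]  # illustrative values only

for threshold in (21, 60, 80):
    share = sum(g < threshold for g in gaps) / len(gaps)
    print(f"Under {threshold} points: {share:.0%}")
```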

The remaining 3%, 21 judgements in total, were 80 points or more apart. That is 3% of the disagreements, but just 0.5% of the total number of human judgements.
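Those two percentages describe the same 21 judgements against different denominators, as a quick check shows (counts from the figures above; names are illustrative):

```python
# The same 21 large disagreements, expressed against two denominators.
large_disagreements = 21
all_disagreements = 707
all_judgements = 3_825

print(f"{large_disagreements / all_disagreements:.0%} of disagreements")  # ~3%
print(f"{large_disagreements / all_judgements:.1%} of all judgements")    # ~0.5%
```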

Some element of disagreement is always going to exist with assessments of extended writing, whoever is judging it. This is a very low rate of serious disagreement, and one that we think is acceptable. 

