“That was definitely the ‘oh, crap’ moment,” Bowman recalled, using a more colorful interjection. “The general reaction in the field was incredulity. BERT was getting numbers on many of the tasks that were close to what we thought would be the limit of how well you could do.” Indeed, GLUE didn’t even bother to include human baseline scores before BERT; by the time Bowman and one of his Ph.D. students added them to GLUE in February 2019, they lasted just a few months before a BERT-based system from Microsoft beat them.
As of this writing, nearly every position on the GLUE leaderboard is occupied by a system that incorporates, extends or optimizes BERT. Five of these systems outrank human performance.
But is AI actually starting to understand our language — or is it just getting better at gaming our systems? As BERT-based neural networks have taken benchmarks like GLUE by storm, new evaluation methods have emerged that seem to paint these powerful NLP systems as computational versions of Clever Hans, the early 20th-century horse who seemed smart enough to do arithmetic, but who was actually just following unconscious cues from his trainer.
“We know we’re somewhere in the gray area between solving language in a very boring, narrow sense, and solving AI,” Bowman said. “The general reaction of the field was: Why did this happen? What does this mean? What do we do now?”