Large Language Models and Diversity

Tyler Cowen:

Putting aside the political issues, do Large Language Models too often give “the correct answer” when a more diverse sequence of answers might be more useful and more representative?  Peter S. Park, Philipp Schoenegger, and Chongyang Zhu have a new paper on-line devoted to this question.  Note the work is done with GPT3.5.

Here is one simple example.  If you ask (non-deterministic) GPT 100 times in a row if you should prefer $50 or a kiss from a movie star, 100 times it will say you should prefer the kiss, at least in the trial runs of the authors.  Of course some of you might be thinking — which movie star!?  Others might be wondering about Fed policy for the next quarter.  Either way, it does not seem the answer should be so clear.  (GPT4 by the way refuses to give me a straightforward recommendation.)
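To make the setup concrete, here is a minimal sketch of how such a repeated-query trial could be run, using the OpenAI Python SDK. The model name, prompt wording, temperature, and trial count are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch (not the authors' code): ask the same question many times
# and tally the raw responses to see how much the answers vary.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed prompt wording for illustration.
PROMPT = "Would you prefer $50 or a kiss from a movie star? Answer with one choice."


def sample_answers(n_trials: int = 100) -> Counter:
    """Ask the same question n_trials times and count each distinct response."""
    answers = Counter()
    for _ in range(n_trials):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed; the paper's experiments used GPT3.5
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # default, non-deterministic sampling
        )
        answers[response.choices[0].message.content.strip()] += 1
    return answers


if __name__ == "__main__":
    for answer, count in sample_answers().most_common():
        print(f"{count:3d}  {answer}")
```

The same loop, with the prompt swapped for a trolley-problem vignette, is the kind of tally behind the percentages reported below.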

Interestingly, when you pose GPT3.5 some standard trolley problems, the answers you get may vary a fair amount; for instance, in one set of trials it gave the utilitarian answer 36% of the time.

I found this result especially interesting (pp.21-22):