Authors: Yuzhi Pan and Chris Baber
Abstract We presented probability problems to two Large Language Models (LLMs) and asked human judges to evaluate the correctness of the outputs. Neither LLM answered all of the questions correctly, but participants did not always spot the errors they made. Two types of human error were identified: i. the LLM answer was correct, but the participant judged it to be wrong (especially with the smaller LLM); ii. the LLM answer was wrong, but the participant judged it to be correct (especially with the larger LLM). Participants tended to trust the LLM when they were unsure how to answer a question and the LLM provided an answer that seemed reasonable and coherent (even if it was actually wrong).