Parsing Fact From Fiction: Benchmarking LLM Accuracy With TruthfulQA
Brad Nikkel
It can be comical when a large language model (LLM) spouts off something utterly untrue—when we’re aware of it. But it’s not so funny when we take an LLM at face value and fail to notice it’s persuasive yet false outputs, especially as LLMs grow ever more convincing.
So far in our journey through Hugging Face’s Open LLM Leaderboard, we’ve reviewed three key benchmarks:
AI2 (ARC)
HellaSwag
MMLU
ARC evaluates question-answering abilities, HellaSwag tests common-sense reasoning, and MMLU assesses language understanding—nearly a comprehensive set of benchmarks—but we’re still missing one ingredient. We’re now ready for the final benchmark of this series (and on the Hugging Face’s Open LLM Leaderboard), TruthfulQA.
Benchmarking Truthfulness: TruthfulQA
Honesty is an admirable trait—in humans and LLMs. For most applications, an untruthful LLM probably isn’t very useful to us (though it may be helpful for creative applications). In 2021, Lin et al. designed TruthfulQA to—as its name suggests—benchmark how truthful LLMs are at answering questions.
Lin et al. (and many more folks) believe more truthful LLMs can foster significant advancements in fields like “medicine, law, science, and engineering”, but if we fail to address LLMs’ “hallucination” tendencies, distrust of LLMs might metastasize to the point where we never reap LLMs’ true potential.
At its core, TruthfulQA focuses on combating failures in truthfulness that LLMs won’t likely overcome just by scaling up. Lin et al. argue that we can blame false LLM generations on two leading causes:
a dearth of training data on certain topics
faulty training objectives (i.e., what an LLM optimizes for) that encourage false answers
Option one will likely improve with scale, but option two, dubbed “imitative falsehoods,” is particularly problematic because imitative falsehoods are likely to worsen with scale.
To weed out imitative falsehoods, TruthfulQA’s 817 questions are designed to spur LLMs to choose imitative falsehood answers instead of true answers. Here’s a sample of TruthfulQA questions that elicited imitative falsehood replies:
Image Source: Sample of GPT-3-175B false answers to TruthfulQA’s questions
Image Source: Sample of GPT-3-175B false answers to TruthfulQA’s questions
Approximately a sentence’s length each and spanning 38 categories—including health, law, finance, politics, and more—TruthfulQA’s questions stress diversity without testing specialized knowledge (a different testing philosophy from MMLU’s, which tests highly specialized knowledge), ensuring its results are readily interpretable by machine learning (ML) researchers. This means models that crush the TruthfulQA benchmark won’t necessarily answer as truthfully in specialized domains (perhaps MMLU could lend a helping hand here, though).
But how exactly did Lin et al. decide what is true when the truth is often hotly debated? TruthfulQA considers the veracity of statements using a standard similar to Wikipedia or scientific articles, deeming any claim derived from belief systems as false and counting a claim as true if it describes “the literal truth about the real world,” meaning if and only if a claim avoids a false statement. By this definition of truth, non-committal and irrelevant but accurate answers are considered true.
This means LLMs could earn perfect scores by employing Cartesian doubt, answering “I’m not sure.” to every question, or by answering every question with a true but irrelevant statement like “Toads are amphibians.” To prevent LLMs from being skeptical of everything and LLMs obsessed with a single fact from pulling off perfect scores, Lin et al. also measure how informative LLMs were (i.e., how relevant their answers are to the questions asked).
For scoring, each statement was assigned a truth score between 0 and 1, representing the probability of its truthfulness (0=False, 1=True). For the main task, LLM-generated answers to TruthfulQA questions. Human judges then scored each machine-generated answer on truthfulness and informativeness. As a secondary task, LLMs were asked to pick answers from multiple-choice questions (some true and some false); these answers were then scored automatically.
To aid automated benchmarking, Lin et al. also finetuned and released GPT-judge, a GPT-3-6.7B parameter model, to automatically classify answers as true or false. Surprisingly, GPT-Judge predicted humans’ truth evaluations with 90–96% accuracy.
Setting each LLM’s temperature (i.e., randomness) to zero, Lin et al. tested several then-preeminent models, including GPT-2, GPT-3, GPT-Neo/J, and T5.
The results were interesting.
Although the best-performing model (GPT-3-175B) was truthful in 58% of TruthfulQA’s questions, it was truthful and informative a mere 21% of the time. This contrasted sharply with human responses, which were truthful 94% of the time and truthful and informative 87% of the time, underscoring a serious challenge LLMs face in mimicking human-like understanding and communication, namely remaining truthful and informative. You can get a glimpse of the relationship between a model’s parameter size and how informative and truthful its answers are below (T=True, F=False):
Image Source: Comparison of model size and responses. Note how smaller models tend toward uninformative but true answers and larger models tend toward false answers.
Image Source: Comparison of model size and responses. Note how smaller models tend toward uninformative but true answers and larger models tend toward false answers.
Interestingly, the largest models were the least truthful but also quite convincing (not a great combo), generating false yet informative answers 42% of the time. Humans answered the same way only 6% of the time. Similarly, larger models picked less truthful answers on the multiple-choice task (no model significantly outperformed random guessing).
Image Source: Larger models performed less truthfully but more informatively on the initial TruthfulQA benchmark tests
Image Source: Larger models performed less truthfully but more informatively on the initial TruthfulQA benchmark tests
Lin et al. believe that larger models are likely less truthful because their larger data requirements come with a greater chance of ingesting false but popular information (e.g., opinionated, belief-based content, conspiracy theories, etc.). This implies that scaling models won’t wholly address their LLMs’ truth deficits. While Lin et al. are skeptical of scaling alone as a solution, they find hope in scaling plus fine-tuning and prompt engineering (e.g., giving LLMs many examples of true and false answers or utilizing human reinforcement learning) toward creating more truthful models.
TruthfulQA’s Contribution
Since TruthfulQA was released in 2021, you might wonder how truthful more recent LLMs are. The current Hugging Face Open LLM leader is CalderaAI’s 30B-Lazarus, scoring 58.3 on a simplified version of the original TruthfulQA’s multiple choice variant (the secondary task discussed above). LLMs have made progress toward becoming more truthful since TruthfulQA’s release but still have a long way to go. Perhaps TruthfulQA’s key contribution to the field underscores how challenging it is to engineer LLMs that generate relevant and true responses.
With that, we’ve circled through all four benchmarks on Hugging Face’s Open LLM Leaderboard: question-answering with the AI2 Reasoning Challenge, assessing common-sense reasoning with HellaSwag, understanding language complexities with MMLU, and lastly, evaluating truthfulness with TruthfulQA.
We now better understand LLMs’ capabilities and limitations, the processes involved in evaluating LLMs, and a philosophy shared among several LLM benchmark designers—that LLM benchmarks shouldn’t just measure; they ought to drive better language modeling. We don’t know how far LLMs will advance in the next five to ten years, but we can be sure that novel benchmarks will be nipping at their heels the whole way.