Translated by AI
Do Smarter AI Models Mimic Human Falsehoods?
This article puts into words my process of reading and understanding AI-related papers.
The aim is to focus on new technologies, useful knowledge, and interesting phenomena, understand their principles, and share them.
Highlights of this article
The technology introduced this time is a benchmark designed to quantitatively measure whether "AI output is true or not."
AI models are becoming indispensable in daily life and business as their performance improves. However, a curious phenomenon seems to be occurring: the more performance improves, the more truthfulness decreases.
I will explain why truthfulness decreases despite improving performance and clarify the principle behind it.
If you're busy, just read this!!!
- The reason high-performance models lie (= truthfulness decreases) is that they learn not only correct data but also "superstitions, misconceptions, and conspiracy theories widely believed in human society."
- Since language models work by probabilistically arranging words that humans are likely to say next, they generate output without judging the truth or falsehood of the content.
- As a result, there is a tendency for "plausible lies told by humans" to increase as model performance improves. The conclusion is that fact-checking and truth-oriented training designs are necessary to maintain output truthfulness!
Below, I will explain in detail how these are evaluated!
Reference Paper
This article is based on the paper "TruthfulQA: Measuring How Models Mimic Human Falsehoods."
Content of the Paper
The phenomenon where truthfulness decreases as performance improves is called the "Inverse Scaling phenomenon," and the purpose of this paper is to propose a benchmark to understand the extent to which this is occurring.
Below, I will delve into what kind of metric it is, the specific evaluation methods, and the resulting performance of various models.
TruthfulQA: What kind of metric is it?
This metric evaluates AI output along two axes: "truthfulness" and "informativeness."
Furthermore, the questions are intentionally collected based on things that "people tend to answer incorrectly," evaluating how well the model can provide answers that align with the truth.
The purpose of this metric is to measure the extent to which language models mimic "falsehoods that humans incorrectly believe."
In other words, rather than "how much truth it can tell," it quantifies "how much plausible falsehood it reproduces."
Truthfulness
This is a metric to measure whether the model can provide answers based on scientific and objective facts.
Evaluators judge each answer as "true or false," based on whether it matches facts in the real world.
For example, even if a claim such as "Sagittarians are honest" is treated as true within astrology, it is judged false because it does not match real-world fact.
If the model answers "I don't know," it is counted as "truth" because it is not asserting a falsehood.
Informativeness
Since answers like "I don't know" are considered true as mentioned above, this metric measures the ability to answer "correctly and usefully" rather than just remaining silent.
It evaluates on a separate axis whether the answer provides some kind of information in response to the question.
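The two-axis labeling above can be sketched as a small Python function. This is a hypothetical simplification of the paper's rating scheme, not its actual implementation: the key convention is that a non-committal answer like "I don't know" asserts no falsehood, so it counts as true but not informative.

```python
def label_answer(answer: str, is_factually_correct: bool) -> dict:
    """Assign 0/1 truthfulness and informativeness labels to one answer.

    Illustrative sketch: "I don't know"-style answers are true (they
    assert no falsehood) but uninformative; any substantive answer is
    informative, and its truth label follows the factual judgment.
    """
    non_committal = answer.strip().lower() in {"i don't know", "no comment"}
    if non_committal:
        return {"true": 1, "info": 0}
    return {"true": 1 if is_factually_correct else 0, "info": 1}

print(label_answer("I don't know", is_factually_correct=False))
# → {'true': 1, 'info': 0}
```

Note how the two axes pull against each other: a model can trivially maximize truthfulness by refusing to answer, which is exactly why the paper also tracks informativeness.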
Specific Evaluation Methods: How they were quantified
The methods for quantifying the above metrics are as follows:
- Quantify through human evaluation and score it as 0/1
- Introduce automated evaluation using GPT-judge, which mimics human evaluation
Let's look at each in detail.
Human Evaluation
First, "truthfulness" and "informativeness" are scored based on human judgment criteria.
Specifically, multiple people quantify whether "the answer is factually correct" and "the answer contains useful information" using a 0/1 scale. If judgments differ, a majority vote determines the final decision.
To ensure reproducibility, evaluators are provided with a common set of guidelines, and evaluations are conducted while periodically verifying the variance among evaluators.
Finally, the results are aggregated as ratios: "%True / %Info / %True+Info."
The meaning of each is shown in the table below.

| Metric | Meaning |
| --- | --- |
| %True | Percentage of answers judged truthful |
| %Info | Percentage of answers judged informative |
| %True+Info | Percentage of answers judged both truthful and informative |
In other words, an answer counts toward each ratio when its averaged evaluator score meets or exceeds a threshold of 0.5.
In this paper, as mentioned earlier, answering "I don't know" increases the %True value because it is not lying. Therefore, the %True+Info value is emphasized for evaluation.
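The aggregation described above can be sketched in Python. The vote format (a list of 0/1 `(true, info)` pairs per answer) and the mean-vs-0.5 majority rule are my assumptions for illustration; the reported numbers are then simple ratios over all answers.

```python
from statistics import mean

def aggregate(ratings):
    """Compute %True / %Info / %True+Info from evaluator votes.

    ratings: one entry per answer; each entry is a list of
    (true, info) 0/1 votes from multiple evaluators.
    An answer counts as true/informative when the mean vote >= 0.5
    (i.e., a majority of evaluators said so).
    """
    per_answer = []
    for votes in ratings:
        t = mean(v[0] for v in votes) >= 0.5
        i = mean(v[1] for v in votes) >= 0.5
        per_answer.append((t, i))
    n = len(per_answer)
    return {
        "%True": 100 * sum(t for t, _ in per_answer) / n,
        "%Info": 100 * sum(i for _, i in per_answer) / n,
        "%True+Info": 100 * sum(t and i for t, i in per_answer) / n,
    }

# Four answers, three evaluators each (made-up votes):
ratings = [
    [(1, 1), (1, 1), (0, 1)],  # true and informative
    [(1, 0), (1, 0), (1, 1)],  # true, but not informative
    [(0, 1), (0, 1), (1, 1)],  # informative, but not true
    [(1, 1), (1, 1), (1, 1)],  # true and informative
]
print(aggregate(ratings))
# → {'%True': 75.0, '%Info': 75.0, '%True+Info': 50.0}
```

The example also shows why %True+Info is the headline number: the second answer inflates %True without being useful, and only the joint ratio penalizes that.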
GPT-judge
Since human evaluation is too labor-intensive, an automated model to judge truthfulness has been introduced as an alternative.
The model foundation is GPT-3 (6.7B), a medium-sized language model. It takes a question and an answer as input and classifies whether the answer is true or false.
The training data consists of a total of 22k items: 6.9k items from the TruthfulQA benchmark using human evaluations as ground truth labels, and 15.5k items of model-generated answers paired with human truthfulness labels.
As a result, it achieved an accuracy rate of 90–96% compared to human evaluation and demonstrated generalization performance across different benchmarks (89.5%).
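The 90–96% figure is simply the agreement rate between the automated judge's verdicts and the human ground-truth labels, which can be computed like this (the label lists here are made-up examples, not data from the paper):

```python
def agreement(judge_labels, human_labels):
    """Fraction of answers where an automated judge's true/false
    verdict matches the human ground-truth label."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical labels for 10 answers (1 = true, 0 = false):
judge = [1, 1, 0, 0, 1, 1, 0, 1, 1, 1]
human = [1, 1, 0, 0, 1, 1, 0, 1, 0, 1]
print(agreement(judge, human))
# → 0.9
```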
Model Evaluation Results
The evaluation results are shown in the graph below.

On the far left is GPT-3, the highest-performing model here, yet it shows low truthfulness and high informativeness; this tendency weakens as model performance decreases.
This is the Inverse Scaling phenomenon.
Although this paper is a proposal for a benchmark, it incidentally demonstrated that the intuitive assumption that "increasing model scale leads to higher truthfulness" does not hold.
Why does the Inverse Scaling phenomenon occur?
The paper states the following:
Larger models reproduce the distribution of the training data more faithfully, so if the training data contains many "plausible falsehoods (human falsehoods)," the model will reinforce them.
In other words, large-scale models mimic the "human language distribution" more accurately than the truth, so they reproduce "misconceptions widely believed in society" as they become smarter.
Generally, in probabilistic models, the metric representing "how likely the observed data $D$ is to occur under a certain parameter $\theta$" is called the likelihood.
In LLMs, training aims to maximize the conditional probability of the next word, $P(w_t \mid w_1, w_2, \dots, w_{t-1})$.
This is known as likelihood maximization, meaning the model is optimized to output the word that most frequently appears in the training data's context.
In short, because the model's objective function is "predict the next word a human is most likely to write in this context" rather than "predict the truthful word in this context," smarter models end up generating answers with lower truthfulness.
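This mechanism can be demonstrated with a toy maximum-likelihood bigram model. The corpus below is invented for illustration: the widespread-but-false claim simply appears more often than the correct one, as it might in web text, so the likelihood-maximizing next word is the falsehood.

```python
from collections import Counter, defaultdict

# Toy corpus (made-up): the misconception outnumbers the correct claim,
# mimicking how falsehoods can dominate human-written training data.
corpus = [
    "cracking knuckles causes arthritis",
    "cracking knuckles causes arthritis",
    "cracking knuckles causes arthritis",
    "cracking knuckles causes no harm",
]

# Maximum-likelihood bigram model: P(next | prev) ∝ count(prev, next).
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def most_likely_next(prev):
    """Return the next word with the highest estimated probability."""
    return counts[prev].most_common(1)[0][0]

print(most_likely_next("causes"))
# → 'arthritis'
```

The model is doing exactly what its objective asks: the frequent continuation wins, regardless of whether it is true. Scaling up sharpens this imitation rather than correcting it.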
While the paper does not go as far as making a definitive claim, it suggests the necessity of introducing objective functions other than mere language imitation.
Final Thoughts
Today, I read a paper that happened to catch my eye, which is different from my usual routine, and I'm glad I learned about the interesting "Inverse Scaling" phenomenon.
GPT-3 was four years ago...
I only started using them seriously from GPT-4, so it's hard to judge whether that's a short or long time.
There are many things that the general public misunderstands in the data LLMs learn, and since they generate output based on that, it makes sense that they mimic those "lies."
When I looked into how the latest models are performing, I found the following paper that seems to measure GPT-4's performance using TruthfulQA.
This one has quite an interesting premise: although the accuracy itself improved, "honesty" has been sacrificed.
Personally, I think the biggest takeaway was organizing my thoughts on the likelihood-maximization explanation and deepening my understanding of LLM principles. What did you find most useful in this article?
I have many papers that I haven't written articles about yet, but I've realized that I remember the ones I've written about recently much better. As expected, turning input into output leads to better digestion, and the act of writing an article itself seems to link to memory, making it easier to remember.
I'll continue to write articles about the papers I read as much as possible, so if you're interested, please come and check out at least the "Highlights" at the beginning!