Overturning conventional wisdom: large AI models are unreliable, and the larger they are, the less reliable they become?!

Do the answers generated by an artificial intelligence (AI) model become more accurate and trustworthy as its parameter count grows?

Not necessarily!

Recently, a study published in the scientific journal Nature showed that, compared with small-parameter models, large-parameter models are less willing to admit their "ignorance" and are more likely to generate wrong answers.

Notably, people are not very good at spotting these errors.

The research comes from a team at the Polytechnic University of Valencia and its collaborators. After studying the GPT, LLaMA and BLOOM families of large language models (LLMs), they found the following:

Although LLMs with larger parameter counts do generate more accurate answers, as expected from scaling and fine-tuning methods such as RLHF, especially on complex tasks, their overall reliability is lower.

Among responses that are not accurate, the proportion of outright wrong answers increased, and more low-level errors appeared even on simple tasks. For example, GPT-4 made 15% more errors than some smaller models when dealing with simple addition and crossword puzzles. This is because the model is less likely to avoid answering a question, for example by admitting that it does not know or by changing the subject.

These results suggest that large-parameter models may be at risk of overfitting or misestimation on simple tasks, making them less reliable.

Model scaling brings a "capability contrast"

In this work, the researchers examined how three intertwined factors, namely difficulty consistency, task avoidance, and prompt stability, affect the reliability of LLMs from the perspective of how human users interact with them.

Professor José Hernández-Orallo, corresponding author of the study, said: "The reliability of language models does not match human perceptions of task difficulty. A model can solve PhD-level math problems, yet at the same time make mistakes on simple addition."

The research team compared the performance of three major model families, GPT, LLaMA, and BLOOM, on different tasks, including numerical calculation, word games, geographical knowledge, basic and advanced science questions, and information transformation. By analyzing accuracy, error rates, and avoidance behavior on these tasks, they revealed the capability contrast brought about by model scaling.

1. The difficulty paradox: "the simpler it is, the more mistakes you make"?

A surprising key finding is that while the models' performance on complex tasks improved significantly, their error rate on simple tasks also rose markedly. This phenomenon is called "difficulty inconsistency": as models are scaled up, their accuracy on complex tasks gradually improves, yet they remain prone to errors on simple tasks.

Take the addition task as an example: although the models can solve complex multi-digit additions, they frequently make mistakes on simple two-digit additions. The accuracy of all LLaMA models on the simplest additions does not exceed 60%, while on some harder tasks they perform comparatively well.

The phenomenon is particularly prominent in the GPT family: on simple addition and word-puzzle tasks, the optimized models are more likely to give wrong answers. The research team pointed out that this suggests current model scaling may focus too heavily on complex tasks while neglecting simple ones.
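To make this kind of measurement concrete, below is a minimal sketch of how one might probe a model with additions of increasing operand length and track accuracy per difficulty level. The `ask_model` function is a hypothetical stand-in for whatever chat or completion API is under test; it is not part of the study's published code.

```python
import random

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM under test."""
    raise NotImplementedError("plug in your model API here")

def addition_accuracy(digits: int, n_trials: int = 50) -> float:
    """Accuracy on random additions of two `digits`-digit numbers,
    using operand length as a rough proxy for task difficulty."""
    correct = 0
    for _ in range(n_trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = ask_model(f"What is {a} + {b}? Answer with the number only.")
        # Tolerate thousands separators before comparing with the true sum.
        if reply.strip().replace(",", "") == str(a + b):
            correct += 1
    return correct / n_trials

# Compare easy vs. hard instances, e.g.:
# for d in (2, 5, 10, 20):
#     print(d, addition_accuracy(d))
```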

Figure | Key indicators of GPT, LLaMA and BLOOM models

This result challenges the conventional view of LLMs: scaling a model up does not always bring across-the-board improvements, and it raises questions about reliability in practical applications.

2. Error rate and avoidance behavior - "overconfidence"

In addition to the difficulty inconsistency phenomenon, the study also revealed a subtle relationship between avoidance behavior and error rate in the optimized models.

Avoidance behavior refers to the model choosing not to answer, or giving a response that does not address the question, when it cannot answer correctly.

In non-optimized models, avoidance behavior was common: when unsure of the answer, the models would often decline to answer or give vague responses. After scaling and optimization, however, the models avoided far less and instead gave more answers that look "reasonable" but are actually wrong.
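As a rough illustration of how responses can be split into the three categories the study tracks (correct, incorrect, avoidant), here is a minimal sketch. The keyword heuristics are simplified assumptions for illustration, not the study's actual grading protocol.

```python
# A simplified three-way grader: correct / avoidant / incorrect.
# The avoidance markers below are illustrative assumptions, not the
# study's actual annotation scheme.
AVOIDANCE_MARKERS = (
    "i don't know", "i do not know", "i'm not sure",
    "cannot answer", "as an ai",
)

def grade_response(response: str, gold_answer: str) -> str:
    text = response.strip().lower()
    if gold_answer.lower() in text:
        return "correct"
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"   # the model declined or hedged
    return "incorrect"      # a confident-sounding but wrong answer

# Reliability hinges on how large the share of "incorrect" is among
# all responses that are not "correct".
```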

This means that while some optimization methods make models more "confident" and reduce avoidance, the error rate rises. The phenomenon is particularly evident in models such as GPT-4 and GPT-3.5-turbo, where scaling has not brought the expected stability. The trend is less pronounced in the LLaMA and BLOOM models, but it exists there as well.

Figure | Performance of GPT and LLaMA models as difficulty increases

The research team said that this phenomenon is closely related to the excessive trust that users have in the model, especially when users are faced with seemingly simple tasks.

"This can lead to frustration for users who initially rely too heavily on the model," said Lexin Zhou, the paper's first author. "Moreover, unlike humans, the models' tendency to avoid providing answers does not increase with difficulty. Humans, by contrast, tend to avoid answering questions that are beyond their abilities. This puts the onus on users to discover errors during their interactions with the model."

3. Do prompts bring stability, or traps?

The study also analyzed the models' sensitivity to prompt wording, specifically whether there are "safe zones" in which certain prompts can be relied upon.

The results show that as model size increases, models handle different natural-language phrasings better and cope better with small changes in wording. However, even after scaling and optimization, their performance remains inconsistent across tasks of different difficulty levels, and the accuracy of their answers still fluctuates across different phrasings.

The study also found that model failures do not line up with people's perceptions of difficulty. Yael Moros Daval, one of the authors of the paper, said: "Do the models fail where we expect them to? We found that the models tend to be less accurate on tasks that humans consider difficult, but even on simple tasks they are not 100% accurate. This means there is no 'safe zone' in which you can trust the model to work perfectly."

Specifically, the non-optimized GPT and LLaMA models were highly sensitive to the choice of prompt, especially on simple tasks: a well-chosen prompt noticeably improved performance. The optimized models are less sensitive to prompt wording and perform more stably, but a certain degree of variability remains.

The optimized models are more stable under prompt changes and more accurate than the raw models, but they are less consistent with human judgments of difficulty and less cautious.
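To illustrate what "prompt stability" means in practice, the sketch below asks the same underlying question under several paraphrased prompts and reports how much accuracy fluctuates. The phrasings are illustrative only, and it reuses the hypothetical `ask_model` and `grade_response` helpers from the earlier sketches rather than the templates used in the paper.

```python
from statistics import mean, pstdev

def prompt_stability(paraphrases: list[str], gold_answer: str,
                     n_trials: int = 20) -> tuple[float, float]:
    """Ask the same question under several phrasings and report the
    mean accuracy and its spread across phrasings."""
    per_prompt_accuracy = []
    for prompt in paraphrases:
        graded = [grade_response(ask_model(prompt), gold_answer)
                  for _ in range(n_trials)]
        per_prompt_accuracy.append(graded.count("correct") / n_trials)
    # A large spread means the answer depends heavily on wording,
    # i.e. there is no prompt "safe zone" for this item.
    return mean(per_prompt_accuracy), pstdev(per_prompt_accuracy)

# Illustrative call:
# prompt_stability(["What is 37 + 48?",
#                   "Compute 37 plus 48.",
#                   "Please add 37 and 48; give only the result."],
#                  gold_answer="85")
```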

Figure | Scaling analysis of the LLaMA and BLOOM series and non-instruction-tuned GPT models

The study found that when a model's errors do not match users' expectations of difficulty, especially on simple tasks, users are more likely to miss incorrect outputs, and human supervision cannot compensate for these problems.

In short: although human expectations of task difficulty can serve as a predictor of model correctness, the models still make errors on simple tasks; scaling and optimization reduce avoidance behavior but increase the error rate, and avoidance is not tied to task difficulty; and even for scaled-up, optimized models, prompt engineering is still needed, and gains from better prompts do not increase monotonically with difficulty.

This research not only reveals the key blind spots in the expansion of large models, but also provides a new direction for the future development of AI - finding the best balance between model size and task difficulty may be the real key to the evolution of intelligence.

"Ultimately, LLMs become increasingly unreliable from a human perspective, and user supervision to correct errors is not a solution because we tend to rely too much on the model to identify incorrect results at different levels of difficulty," said Wout Schellaert, one of the paper's authors. "Thus, fundamental changes are needed in the design and development of general artificial intelligence (AGI) , especially for high-stakes applications where predicting the performance of language models and detecting their errors is critical."

Shortcomings and Prospects

Although this study has produced important results on the prompt sensitivity of LLMs and the impact of scaling and optimization on performance, it still has some limitations.

First, the participants in this study were mostly non-experts, so the calibrated difficulty values should be interpreted with caution. For some benchmark datasets, non-experts may be unable to solve a large share of the problems; the study's aim, however, was to capture the difficulty expected by the general population so that analyses are comparable across all datasets.

Furthermore, the "natural" prompts used in the study were collected from diverse sources, but no data were obtained on how frequently these prompts appear in real-world use.

In addition, the study covers only some model families; in particular, it does not include models that rely on external tools or complex reasoning techniques. This limits the understanding of how LLMs behave in more complex scenarios and means the potential and problems of other model families cannot be fully evaluated.

The researchers said they will further expand the datasets on human difficulty expectations and output supervision, with the aim of bringing these higher-quality data into model training and of using AI to train supervisors, thereby improving the model optimization process.

In high-stakes areas such as healthcare, models could improve their ability to abstain by being given an explicit refusal option or by being paired with an external AI supervisor, ultimately enabling LLMs to show reliability and consistency more in line with human expectations.
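As a sketch of what a refusal option combined with an external supervisor could look like, the wrapper below abstains whenever a separate verifier reports low confidence in the drafted answer. The `verifier_score` helper, the 0.8 threshold, and the reuse of the hypothetical `ask_model` call are assumptions for illustration, not a mechanism described in the paper.

```python
REFUSAL = "I am not confident enough to answer this question."

def verifier_score(question: str, answer: str) -> float:
    """Hypothetical external supervisor returning a confidence
    estimate in [0, 1] for the proposed answer."""
    raise NotImplementedError("plug in a separate verifier model here")

def answer_with_refusal(question: str, threshold: float = 0.8) -> str:
    """Answer only when the external supervisor is sufficiently
    confident; otherwise abstain rather than guess."""
    draft = ask_model(question)   # hypothetical call to the base LLM
    if verifier_score(question, draft) < threshold:
        return REFUSAL            # explicit avoidance instead of a likely-wrong answer
    return draft
```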

Author: Tian Xiaoting
