The World Health Organization's (WHO) artificial-intelligence health assistant SARAH listed fake names and addresses of non-existent clinics in San Francisco. Meta's short-lived science chatbot Galactica fabricated academic papers and generated Wikipedia articles on the history of space bears. In February, Air Canada was ordered to comply with a refund policy fabricated by its customer-service chatbot. Last year, a lawyer was fined for submitting court documents filled with false judicial opinions and legal citations made up by ChatGPT.

Examples of large language models (LLMs) making things up are now commonplace. The problem is that they do so with a straight face: most of the fabricated content looks like the truth, making it hard to tell real from fake. Sometimes this can be laughed off, but once it touches professional fields such as law and medicine, the consequences can be very serious.

How to detect hallucinations in large models quickly and effectively has therefore become a hot research topic that technology companies and research institutions around the world are racing to solve.

Now, a new method proposed by a team at the University of Oxford can help detect hallucinations in large models quickly: they attempt to quantify how likely an LLM is to be hallucinating when it generates a given answer, and therefore how much that answer can be trusted, improving the accuracy of its question answering.

The research team says the method can identify "confabulations" in LLM-generated biographies and in answers on topics such as trivia, general knowledge and the life sciences.

The work matters because it provides a general way to detect LLM hallucinations without human supervision or domain-specific knowledge. This helps users understand the limitations of LLMs and encourages their application across fields.

The paper, titled "Detecting Hallucinations in Large Language Models Using Semantic Entropy", has been published in the journal Nature.

In a News & Views article published alongside the paper, Karin Verspoor, Dean of the School of Computing Technologies at RMIT University, noted that having the task completed by one LLM and evaluated by a third LLM amounts to "fighting fire with fire". She also cautioned that "using an LLM to evaluate an LLM-based method seems to be circular and may be biased". The authors, however, argue that their method can help users understand when LLM answers should be treated with caution, which in turn means LLMs can be trusted in a wider range of applications.

How to quantify the degree of hallucination in an LLM?

Let's first understand how large-model hallucinations arise. LLMs are designed to generate new content. When you ask a chatbot a question, its answer is not simply looked up in a database; it has to be produced through a great deal of numerical computation. These models generate text by predicting the next word in a sentence. Inside the model are hundreds of millions of numbers, like a giant spreadsheet, recording the probability of one word following another. During training, these values are continually adjusted so that the model's predictions match the language patterns found in the vast amounts of text on the Internet.
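As a rough illustration of that prediction-and-sampling step, here is a toy sketch in Python. It is not the internals of any real model: the prompt, the tiny vocabulary and the probability table are all invented for the example, and a real LLM computes its distribution with a neural network over tens of thousands of tokens.

```python
import random

# Toy sketch (not any real model's internals): an LLM produces text by
# repeatedly sampling the next word from a probability distribution that
# depends on the text so far. The table below is invented for illustration.
NEXT_WORD_PROBS = {
    "The capital of France is": {"Paris": 0.90, "Lyon": 0.06, "Berlin": 0.04},
}

def sample_next_word(prompt: str, temperature: float = 1.0) -> str:
    probs = NEXT_WORD_PROBS[prompt]
    # Temperature reshapes the distribution: values below 1 sharpen it
    # (the most likely word dominates), values above 1 flatten it.
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights, k=1)[0]

# The same prompt can yield different words on different runs --
# the "statistical slot machine" in action.
for _ in range(5):
    print(sample_next_word("The capital of France is"))
```

Because each word is drawn from a probability distribution, running the same prompt several times can produce different continuations, which is exactly the slot-machine behaviour described next.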
A large language model is therefore a kind of "statistical slot machine" that generates text according to statistical probability: pull the lever, and out comes a word.

Most existing methods for detecting LLM hallucinations rely on supervised learning, which requires large amounts of labelled data and generalises poorly to new domains. In this study, the research team instead used semantic entropy, which needs no labelled data and performs well across multiple datasets and tasks.

Semantic entropy measures the uncertainty in the meaning of text generated by a language model. It assesses the reliability of the model's output by considering how the meaning of words and sentences can vary across differently worded responses. The method detects "confabulation", a subcategory of "hallucination" that refers to inaccurate and arbitrary content, typically produced when the LLM lacks the relevant knowledge. It takes into account the subtleties of language: the same answer can be expressed in many different ways, and different wordings may or may not share the same meaning.

Figure | A brief introduction to semantic entropy and confabulation detection.

As the figure shows, a traditional entropy-based uncertainty measure treats differently worded answers as different answers: it counts "Paris", "This is Paris" and "Paris, the capital of France" as three distinct responses even though they mean the same thing, so it is a poor fit for language tasks. Semantic entropy instead clusters answers with the same meaning before computing the entropy. Low semantic entropy means the large language model is highly certain about the meaning of what it generates.

The semantic-entropy approach can also detect confabulations in long passages. The team first decomposes a long generated answer into small factual units. For each factoid, an LLM generates a series of questions to which that factoid could be an answer, and the original LLM then produces M candidate answers to each question. The team then computes the semantic entropy of these answers, including the original factoid itself: a high average semantic entropy suggests that the questions related to the factoid attract confabulated answers. In the example shown, the sampled answers convey the same meaning even though their wording differs considerably, so semantic entropy correctly classifies Fact 1 as non-confabulated, something a traditional entropy measure might miss.

The research team compared semantic entropy with other detection methods in two main settings.

1. Detecting confabulations in question answering and math problems

Figure | Detecting confabulations in sentence-length generations.

Semantic entropy outperforms all baseline methods on both AUROC (area under the receiver operating characteristic curve) and AURAC (area under the rejection accuracy curve), indicating that it predicts LLM errors more accurately and improves the model's accuracy when it is allowed to refuse to answer.

2. Detecting confabulations in biographies

Figure | Detecting GPT-4 confabulations in paragraph-length biographies.

The discrete variant of the semantic-entropy estimator outperforms the baseline methods on both AUROC and AURAC (scores on the y-axis), with both scores significantly higher than those of the baselines. Semantic entropy is more accurate whenever the model answers more than 80% of the questions; the P(True) baseline achieves better accuracy only when the 20% of answers judged most likely to be confabulated are rejected.
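To make the core clustering idea concrete, here is a minimal sketch of a semantic-entropy calculation. It is not the authors' implementation: in the paper the "same meaning" test is a bidirectional-entailment check performed with a natural-language-inference model, and cluster probabilities can come from the model's token likelihoods; the `means_the_same` function below is a crude, hypothetical string-normalisation rule, and cluster probabilities are estimated from sample frequencies (roughly in the spirit of the discrete variant), purely so the example is self-contained and runnable.

```python
import math
from collections import Counter

def means_the_same(a: str, b: str) -> bool:
    """Stand-in for the paper's bidirectional-entailment check (normally
    done with a natural-language-inference model). Here: a crude
    normalisation rule, just to keep the sketch self-contained."""
    def norm(s: str) -> str:
        return (s.lower()
                 .replace("this is ", "")
                 .replace(", the capital of france", "")
                 .strip(". "))
    return norm(a) == norm(b)

def semantic_entropy(answers: list[str]) -> float:
    """Cluster sampled answers by meaning, then compute the entropy of
    the clusters' empirical probabilities."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if means_the_same(ans, cluster[0]):
                cluster.append(ans)
                break
        else:                       # no existing cluster matched
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

def naive_entropy(answers: list[str]) -> float:
    """Baseline: treat every distinct string as a different answer."""
    probs = [c / len(answers) for c in Counter(answers).values()]
    return -sum(p * math.log(p) for p in probs)

# Ten sampled answers, worded differently but all meaning the same thing.
samples = ["Paris", "This is Paris", "Paris, the capital of France",
           "Paris", "Paris", "This is Paris", "Paris",
           "Paris, the capital of France", "Paris", "This is Paris"]

print(f"naive entropy:    {naive_entropy(samples):.2f}")    # high: three distinct strings
print(f"semantic entropy: {semantic_entropy(samples):.2f}")  # zero: a single meaning cluster
```

The naive entropy is high because the three wordings are counted as different answers, while the semantic entropy is zero because they all fall into one meaning cluster; a high semantic entropy over the sampled answers is what flags a likely confabulation, and for paragraph-length text the same calculation is applied to each decomposed factoid via the automatically generated questions.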
Shortcomings and Prospects

The probabilistic approach proposed by the research team takes semantic equivalence into account and successfully identifies a key class of hallucinations: those that arise from a lack of knowledge in the LLM. Such hallucinations are at the core of many current failures and will remain a problem even as models improve, because humans cannot supervise every context and case. Confabulation is especially prominent in question answering, but it occurs in other domains as well.

Notably, the semantic-entropy method does not rely on domain-specific knowledge, which suggests that similar progress can be made in further applications such as abstractive summarisation. Extending the method to other input variants, such as restatements or counterfactual scenarios, would not only allow cross-checking but also enable scalable oversight in the form of debate, pointing to the method's broad applicability and flexibility.

The success of semantic entropy at detecting errors further confirms that LLMs have some capacity to "know what they don't know", perhaps more than previous studies have suggested.

However, the method mainly targets hallucinations caused by gaps in the LLM's knowledge, such as inventing facts out of thin air or attributing something to the wrong person. It may work less well for other kinds of hallucination, such as those caused by erroneous training data or flaws in model design. In addition, the semantic clustering step relies on natural-language-inference tools, whose accuracy in turn affects the estimate of semantic entropy.

In future work, the researchers hope to apply the semantic-entropy method in more fields and to combine it with other approaches to improve the reliability and trustworthiness of LLMs. For example, it could be combined with techniques such as adversarial training and reinforcement learning to further improve LLM performance, or with other indicators to evaluate LLM credibility more comprehensively.

It is important to remember, though, that as long as LLMs are based on probability, there will always be some randomness in what they generate. Roll 100 dice and you get one pattern; roll them again and you get another. Even if the dice are weighted so that certain patterns appear more often, as they are in LLMs, you will not get exactly the same result every time. And even if a model is wrong only once in every thousand or hundred thousand uses, that adds up to a great many errors given how often this technology is used every day. The more accurate these models become, the easier it is to let our guard down.

What do you think about hallucinations in large models?

References:
https://www.nature.com/articles/s41586-024-07421-0
https://www.technologyreview.com/2023/12/19/1084505/generative-ai-artificial-intelligence-bias-jobs-copyright-misinformation/