Large language models are useful, but they cannot solve fundamental problems in AI, including reasoning.

Written by Wang Pei (Department of Computer Science, Temple University, USA)

In the two years since ChatGPT appeared, successive large language models have repeatedly reshaped public expectations, to the point where "general artificial intelligence is coming" has gone from sounding like a madman's raving to being a cliché, no longer striking enough to serve as a headline. Even those of us accustomed to this field's rapid turns can hardly help feeling that the world has changed overnight. This year, two Nobel Prizes honored the artificial neural network technology behind these models, a sign of how spectacularly the field is flourishing. At the same time, the skeptical voices that have always existed within the academic community have grown louder. A recent cover article in AI Magazine (the membership magazine of AAAI, the world's largest artificial intelligence society) states bluntly that research on "explainable artificial intelligence" has bogged down [1], and difficulty of explanation has long been a standing criticism of deep neural networks. A recent report by Apple researchers went further, claiming that large language models cannot perform logical reasoning at all [2], which caused an uproar.

Arguments from both sides

The debate over whether deep neural networks can reason has been going on for several years. The Apple study [2] evaluated the mathematical reasoning ability of large language models. The material was a set of "math word problems" that the models had already been tuned to solve well. Starting from elementary-school problems familiar to everyone, the researchers modified them by (1) replacing proper nouns (for example, changing a question about "Xiao Hong" into one about "Xiao Ming"), (2) changing the numbers (for example, "3.5 hours" into "2.8 hours"), and (3) adding irrelevant information (for example, inserting a sentence about "Xiao Ming fishing" into a question about "Xiao Hong climbing a mountain"). Although these modifications did not touch the logical structure of the problems, they caused a significant drop in answer accuracy. The article concludes that large language models neither understand the mathematical concepts in these problems nor perform logical reasoning on them; instead, they simply match the problem at hand against problems in the training data. Even the correct answers therefore reflect the system's memorization and matching ability rather than any logical reasoning ability.

When I reviewed ChatGPT last year [3], I argued that it lacks genuine logical reasoning ability, because the quality of its conclusions depends on the amount of relevant training data; it can only be regarded as a distillation of many people's reasoning processes: "nothing special, it just comes with practice." The evaluation results in [2] confirm this. But these results are not enough to settle the debate. Those who believe that large language models can reason generally argue as follows: "Certain problems are solved by people through reasoning, so solving them requires reasoning ability. Large language models now solve these problems, so they can reason." On this view, large language models have shown reasoning ability far beyond that of ordinary people on many problems.
To single out some wrong conclusions and declare on that basis that the models cannot reason looks, to this camp, like nitpicking and over-generalizing from failures; with the rapid progress of the technology, how do we know the next version will not close these gaps? Has OpenAI not already named "reasoning" as its main current direction?

In the debate so far, the main form of evidence on both sides has been collecting cases in which large language models succeed or fail at reasoning. The advantage of this approach is that the evidence is concrete and verifiable, but it always risks mistaking a partial view for the whole picture: to what extent do these successes and failures reveal the system's general reasoning ability, and how many of the current defects can be overcome by future research and development?

What is "reasoning"?

It has often been observed that many disputes stem from different understandings of basic concepts, which is exactly why many of my previous articles begin with conceptual analysis: not because I like to quibble over words, but because without it one cannot get to the heart of the dispute. "Reasoning" is usually described as "the process of deriving a new judgment (the conclusion) from known judgments (the premises)," but if "deriving" is not further constrained, this is obviously too broad; reading the premises backwards is certainly not reasoning. "Deriving" here of course means "deriving correctly," and that is where the problem lies: by what standard do we judge correctness?

There are two distinct scholarly traditions in the study of reasoning. Logic and mathematics study normative theories and models of reasoning, whose goal is to ground the correctness (or "validity") of reasoning in a universal standard that embodies rationality. The traditional standard of validity is truth-preservation: guaranteeing that true conclusions are drawn from true premises. A logical system is composed of inference rules that meet this standard. These rules are abstract: they concern only the form of the premises and conclusions, not their content. For example, as mentioned in [3], the validity of inferring the conclusion "A is C" from the premises "A is B" and "B is C" does not depend on what the letters stand for (a small sketch below makes this concrete). Psychology, by contrast, studies descriptive theories and models of reasoning, whose goal is to summarize the regularities actually followed in human reasoning. "Correctness" here means what it means in other empirical sciences: theoretical predictions agree with actual observations. Although the two kinds of theory reach some similar conclusions (it would be troubling if they were completely different), their differences have long been well known; a classic example is the Wason selection task, which I introduced in [4] and will not repeat here.

Both traditions are reflected in artificial intelligence research. From the beginning, reasoning research in artificial intelligence was based on normative theories represented by mathematical logic; to bring it closer to actual human thinking, various "corrections" were attempted, with some success, although on the whole it remained too idealized to handle complex practical problems.
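To make the "form, not content" point concrete, here is a minimal Python sketch of a content-blind deduction rule. The terms used in the examples are placeholders chosen only for illustration; the point is that the rule inspects the shape of the premises and nothing else.

```python
# A minimal sketch of a form-based deduction rule: it looks only at the
# shape of the premises ("X is Y"), never at what the terms mean.

def deduce(premise1, premise2):
    """From 'A is B' and 'B is C', derive 'A is C'; otherwise derive nothing."""
    a, b1 = premise1          # premise1 encodes "A is B"
    b2, c = premise2          # premise2 encodes "B is C"
    if b1 == b2:              # the middle term must match
        return (a, c)         # conclusion: "A is C"
    return None

# The rule applies identically regardless of content:
print(deduce(("Socrates", "human"), ("human", "mortal")))   # ('Socrates', 'mortal')
print(deduce(("whale", "mammal"), ("mammal", "animal")))     # ('whale', 'animal')
```

The rule is truth-preserving by design; whether a particular conclusion is actually true depends entirely on whether the premises fed into it are true.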
By contrast, reasoning in deep learning (including large language models) can be said to borrow the spirit of the descriptive tradition: the model's behavior is shaped by people's actual reasoning practice rather than by abstract rational principles. Even so, its concrete approach is completely different from psychology's. Like logic, psychology treats a reasoning process as a sequence of reasoning steps, each following identifiable regularities that can be studied; those regularities show up in people's actual behavior and need not be unified under some abstract principle (such as truth-preservation). The reasoning behavior of a neural network model, however, is trained "end to end" from the premises and conclusions people produce when solving practical problems (in large language models, often just from the order of sentences), skipping the intermediate steps. Its standard of correctness is "whether people would draw the same conclusion from the given premises," with little concern for how those conclusions are generated step by step.

Backed by the processing power of modern computers and massive training data, this model of reasoning has achieved remarkable success, but it is also criticized on several counts: (1) end-to-end training gives up control over the intermediate steps, making both the process and the result hard to understand; (2) reliance on training data leads to problems such as bias and overfitting when conclusions are generalized; (3) when training data are insufficient, guessing answers from statistical similarity to past samples makes accuracy hard to guarantee. Because these problems follow from the very nature of the neural network model, they cannot be fully resolved by technical patches. The recently popular "chain of thought," for example, is an effort to fill in intermediate steps, but most of the "links" in the chain are themselves reasoning processes that could be further decomposed, rather than basic reasoning steps, and their standard of correctness is still set by the training data, so it is not universal (domain-independent). The Apple evaluation, in effect, asks this descriptive model to solve reasoning problems belonging to a normative theory (mathematics), so it is no surprise that it performs poorly.

Nature and nurture

Behind the surface differences between normative and descriptive models of reasoning lie different views of the innate and acquired components of intelligence (or "cognition," "thinking," and so on). Everyone agrees that both are indispensable, but opinions differ on their respective roles. In a normative model the reasoning rules are essentially fixed in advance (though the premises they operate on can be acquired later), whereas in a descriptive model the reasoning rules themselves can come from training (though the algorithm the training follows is given in advance). In the neural network model specifically, "reasoning" is treated as a mapping between a problem's "givens" and its "conclusion," with no constraint on the process that leads from one to the other (the toy sketch below illustrates the contrast).
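The following toy sketch contrasts the two treatments. Everything in it is invented for illustration: real models do not store an explicit answer table but interpolate over learned parameters, and the "stepwise" solver stands in for any procedure with identifiable intermediate steps. In the spirit of the number-changing perturbations described earlier, the query uses numbers the mapping has not seen.

```python
# Two toy "solvers" for "someone walks x km/h for y hours; how far?"
# The mapping-based solver only covers cases it has "seen"; the stepwise
# solver derives the answer through explicit intermediate steps.
# All data here is invented.

memorized = {          # end-to-end: (speed, hours) -> answer, no intermediate steps
    (3, 2): 6,
    (5, 4): 20,
}

def solve_by_matching(speed, hours):
    """Return a stored answer for a familiar case, or the 'most similar' one."""
    if (speed, hours) in memorized:
        return memorized[(speed, hours)]
    nearest = min(memorized, key=lambda k: abs(k[0] - speed) + abs(k[1] - hours))
    return memorized[nearest]

def solve_by_steps(speed, hours):
    """Derive the answer through an explicit intermediate step."""
    distance_per_hour = speed          # step 1: identify the rate
    total = distance_per_hour * hours  # step 2: apply rate x time
    return total

print(solve_by_matching(4, 2), solve_by_steps(4, 2))   # 6 (a stale match) vs 8 (derived)
```

The point is not that large language models literally store a lookup table, but that their correctness criterion is agreement with previously seen question-answer pairs rather than the soundness of intermediate steps.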
Treating reasoning as such a mapping greatly simplifies the construction and use of the model (one only needs to supply training data, without explaining how the problems are solved), which is an important reason for its success, but it is also the root of the problems listed above.

This different weighting of innate and acquired factors appears not only in reasoning models but also in language models. In the study of natural language understanding, the initially dominant "rule school" (the Chomskyan school) held that language ability, especially grammatical structure, is basically innate, with learning merely triggering that innate potential, while the currently dominant "statistical school" (with neural network models as its main implementation) holds that "everything can be learned," the only required innate component being the ability to generalize from training data (embodied in the learning algorithm).

Traced further back, this differing emphasis on the innate and the acquired, in these and other fields, reflects the rationalist and empiricist traditions in philosophy, and the relationship between the two is neither a simple matter of who is right and who is wrong, nor something to be waved away with a phrase like "organic unity." For a designer of AI systems, one of the most important decisions is which mechanisms and contents to build in and which to leave to training and education. Purely rationalist systems tend to be too rigid to cope with a complex environment, while purely empiricist systems tend to be trapped in fragments of past experience and struggle to ensure the generality of their judgments. To use kinds of reasoning as an analogy: the former is like solving every problem by deduction, accurate and reliable (truth-preserving) but helpless outside the scope of its preset premises; the latter is like solving every problem by analogy, flexible (if one does not mind strained comparisons, anything can be likened to anything) but prone to contradicting itself.

When the comparison is with human intelligence, my view is that the (innate) design of an artificial intelligence system should follow rational principles close to those of humans, while its concrete behavior should be grounded in its own (acquired) experience rather than an attempt to copy human behavior wholesale. In the reasoning system NARS that I designed (see my earlier columns), the design embodies reasoning rules abstracted from human reasoning behavior, without expecting the system to learn them itself; on the other hand, the system's beliefs, desires, and concepts come entirely from its own experience (both sensorimotor experience and linguistic communication), rather than from pre-implanted "truths" or "facts." Put simply, the design of NARS is an attempt to achieve intelligence by using a set of reasoning rules resembling humans' innate logic as its meta-logic. I am not claiming that a set of symbolic inference rules sits inside the human brain, only that our natural reasoning has regularities, and that these regularities can be organized into symbolic inference rules without losing their basic character. Here "logic" in the general sense must be distinguished from any specific "logic system."
Since its inception, logic has studied universally valid norms of reasoning and argument, and this is also what we mean in everyday life when we judge whether a statement is "logical." Defining the validity of reasoning as truth-preservation, and spelling it out as a rule system in a symbolic language, is one particular understanding of those norms. Even if every existing logical system is unsatisfactory, it does not follow that "human reasoning has no rules at all"; if that were true, how could we still understand and, to a considerable extent, accept the reasoning processes and conclusions of other people, including the ancients and foreigners?

Based on the conviction that the reasoning of an intelligent system follows universal rules, the reasoning mechanism of NARS is designed as a normative model: the correctness of its conclusions is determined by the rational principles the system is built on, not by popular human opinion. Unlike traditional normative models, however, the design of NARS presupposes that the system must adapt to its environment under the condition of insufficient knowledge and resources, so the basis for judging the correctness of a particular conclusion is the system's past experience, not objective facts or future experience. In this sense NARS is also a descriptive model with respect to the content of its knowledge, except that it summarizes its own experience rather than human experience. As a result, NARS resembles each of the traditional reasoning models in some respects while differing from all of them fundamentally.

Compared with large language models, the reasoning rules of NARS are fixed at design time and are independent of the system's experience and of any application domain (a rough sketch appears below). Because these rules derive from the need to "adapt to the environment under insufficient knowledge and resources," and the human reasoning mechanism evolved to meet the same need, the reasoning processes and results of NARS have much in common with human ones and are therefore explainable in principle (although for complex problems this will not be easy). Because its conclusions come from the system's experience, the limitations of that experience will certainly produce bias and misjudgment, but such defects in the content of knowledge do not amount to defects in the system's reasoning ability. The "innate logic" that NARS follows (called "non-axiomatic logic"; see [5]) differs from mathematical logic and does not contain mathematics, so the system still has to learn mathematical theories, and it learns them using its innate logic, which is entirely different from the training of artificial neural networks. If NARS were to do math word problems after learning the corresponding material, it too might make various mistakes, but those mistakes should look more like the mistakes of elementary school students than like those of large language models. Since the development of NARS has not yet reached the point where this can be tested, this should be taken as a prediction awaiting verification.
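As a rough illustration of rules that are fixed by design while conclusions are graded by the system's own experience, here is a simplified sketch of a NAL-style deduction step with (frequency, confidence) truth values. It is an approximation based on published descriptions of non-axiomatic logic [5], not code from the actual NARS implementation, and the example statements and truth values are invented.

```python
# A simplified, illustrative NAL-style deduction step: from "S is a kind of M"
# and "M is a kind of P", each graded by a (frequency, confidence) pair that
# summarizes the system's own evidence, derive "S is a kind of P" with a
# weaker grade. The truth functions follow commonly published descriptions of
# the NAL deduction rule, but should be read as an approximation, not as the
# definitive NARS specification.

def deduction(f1, c1, f2, c2):
    """Combine two graded premises into a graded conclusion."""
    f = f1 * f2              # the conclusion's frequency
    c = f1 * f2 * c1 * c2    # its confidence is lower than either premise's
    return f, c

# Invented example beliefs, each grounded in (hypothetical) past experience:
# "robin -> bird" <f=1.0, c=0.9> and "bird -> flyer" <f=0.9, c=0.8>
f, c = deduction(1.0, 0.9, 0.9, 0.8)
print(f"robin -> flyer <f={f:.2f}, c={c:.2f}>")   # robin -> flyer <f=0.90, c=0.65>
```

The rule itself never changes with experience; what changes is the evidence behind each statement, and hence the truth values the rule combines.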
The source of reasoning ability

By the above analysis, a large language model can be regarded as a special kind of descriptive reasoning model, one that completes certain reasoning tasks by summarizing the corresponding human behavior. It is not entirely wrong to call this ability "reasoning," but the verdict that such models "do not reason, they only match patterns" is the more accurate one, because they do treat a task that humans complete through step-by-step reasoning as an end-to-end mapping (a function from input to output) and complete it by matching against known mappings. The two processes overlap considerably in the range of problems they can solve, but their differences should not be ignored. If one insists on stretching the word "reasoning" to cover this, one should add that large language models "can reason, but follow no logic." Some argue that artificial intelligence has a logic different from the human one, but to establish that, its inference rules would have to be grounded in more basic rational principles (such as truth-preservation or adaptation), and I have yet to see such an argument.

Not every problem-solving process counts as "reasoning." Intuitively, one must "infer" step by step, and each step must be "reasonable." This gloss is of course not a definition, but solving problems purely by recitation or lookup is certainly not reasoning, even if the answers being recited were originally obtained by someone else through reasoning. Large language models are certainly not as simple as recitation or lookup, but they are even further from the traditional understanding of reasoning as "generating the answer from the givens, step by step, according to reasonable rules or patterns," which is precisely why they are hard to explain, and why they are said to "only match patterns." For practical purposes their "reasoning ability" is sufficient for some needs and insufficient for others; in particular, it cannot be regarded as a realization of the "reasoning" function of an intelligent system. Even reasoning research in psychology cannot be carried out entirely in the manner of a large language model, let alone logic and mathematics. Large language models are still useful in those disciplines, but for other purposes (such as summarizing existing results).

This is not to say that large language models cannot learn logical and mathematical knowledge. The "knowledge" in an information system usually exists at two levels, commonly called "object-level knowledge" and "meta-level knowledge." In a traditional reasoning system, the knowledge serving as premises and conclusions belongs to the former: it usually exists as statements and can be added, deleted, and revised while the system runs. The knowledge embodied in the inference rules belongs to the latter: it usually exists as program code and stays fixed while the system runs. In a large language model, the parameters adjusted during training correspond to object-level knowledge, and the algorithm that performs the adjustment corresponds to meta-level knowledge. In terms of the earlier discussion, meta-level knowledge is essentially innate while object-level knowledge is acquired (the sketch below illustrates the distinction). The two kinds of knowledge can influence each other, and to some extent substitute for or transform into each other.
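A deliberately simple sketch of the two levels follows; the class and rule below are invented for illustration and not drawn from any actual system. Object-level knowledge is content the system can revise while running; meta-level knowledge is the fixed machinery that does the revising.

```python
# Object-level knowledge: statements the system can add, delete, or revise at run time.
# Meta-level knowledge: the fixed rule (hard-coded in a method) that operates on them.
# This toy class is invented for illustration only.

class TinyReasoner:
    def __init__(self):
        self.beliefs = set()                 # object level: changeable content

    def tell(self, a, b):
        self.beliefs.add((a, b))             # learn "A is B" from experience

    def derive(self):
        # Meta level: a fixed deduction rule the system applies but cannot rewrite.
        new = {(a, c)
               for (a, b1) in self.beliefs
               for (b2, c) in self.beliefs
               if b1 == b2}
        self.beliefs |= new

r = TinyReasoner()
r.tell("whale", "mammal")
r.tell("mammal", "animal")
r.derive()
print(("whale", "animal") in r.beliefs)      # True: the beliefs grew; the rule did not
```

In a large language model, the adjustable parameters play the role of self.beliefs here, while the training algorithm plays the role of derive(): the former changes with experience, the latter does not.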
We can learn a logic and reason according to it, but such an acquired logic cannot completely replace our innate "meta-logic," that is, the regularities that human reasoning naturally follows. Even people who have never received any logic education still, by and large, follow this innate logic; conversely, even logicians and mathematicians cannot use their theoretical knowledge (say, first-order predicate logic or probability theory) to fully regulate their everyday reasoning. We can certainly teach a large language model any particular logic, including the one followed by NARS, but for the model this is only "object-level knowledge": it can answer queries on that basis, yet it cannot make that logic fully govern its own reasoning, just as a person may be able to recite a theory by heart without always acting on it. Our experience can influence our thinking but cannot determine all of its processes, chiefly because our control over "meta-level knowledge" never reaches the level of our control over "object-level knowledge." Likewise, we can teach a large language model a different learning algorithm through training, but we cannot thereby replace its built-in one.

Granted that we cannot manipulate the laws of our own thinking, why not erase the distinction between "object-level knowledge" and "meta-level knowledge" in the computer systems we design? Could an artificial neural network adjust its own learning algorithm, or NARS revise its own inference rules in light of experience? This is indeed possible to a degree, but it may not be a good idea (for instance, it can undermine the system's internal consistency), and it cannot be done completely (for instance, modifying "meta-knowledge" requires "meta-meta-knowledge"). Since this topic goes beyond the focus of this article, I will not pursue it further.

If the "innate logic" of an intelligent system cannot be summarized from its own experience, where does this human meta-knowledge come from? Although I believe intelligent systems can be designed, this does not mean I think human intelligence is itself the product of design. On the contrary, the view of reasoning embodied in NARS (reasoning is substitution among concepts, and concepts are abstractions of fragments of experience, so an adaptive system can use reasoning to bring past experience to bear on the current situation) can also be found in animal intelligence. The meta-knowledge of an intelligent system may therefore come either from design or from evolution, though I do not think obtaining artificial intelligence through evolution is more feasible than designing it (it remains worth considering as a supplementary means). This issue is discussed in [6], so I will say no more here.

In summary, my basic assessment of large language models remains the same as in [3]: they are useful, but they cannot solve the fundamental problems of artificial intelligence, including reasoning.

References
[1] Rosina O. Weber et al., "XAI is in trouble," AI Magazine, 45:300-316, Fall 2024.
[2] Iman Mirzadeh et al., "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," arXiv:2410.05229v1, Oct. 2024.
[3] Wang Pei, "In-depth analysis: Will ChatGPT and its successors become general artificial intelligence?" Fanpu, March 15, 2023.
[4] Wang Pei, "AI is rational and humans are irrational: is that really so?" Fanpu, July 14, 2021.
[5] Wang Pei, "What kind of logic is this?" Science and Technology Review, August 10, 2016.
[6] Wang Pei, Outline of Intelligence Theory, Shanghai Science and Technology Education Press, September 2022.
Go to the "Featured Column" at the bottom of the menu of the "Fanpu" WeChat public account to read a series of popular science articles on different topics. 2. Fanpu provides a function to search articles by month. Follow the official account and reply with the four-digit year + month, such as "1903", to get the article index for March 2019, and so on. Copyright statement: Personal forwarding is welcome. Any form of media or organization is not allowed to reprint or excerpt without authorization. For reprint authorization, please contact the backstage of the "Fanpu" WeChat public account. |