AI Fun Facts | Are large models' cognitive abilities inferior to those of the elderly?

The BMJ, a leading medical journal, recently published an intriguing study in which the research team took the test questions normally used to assess cognitive ability and early dementia symptoms in the elderly and administered them to AI instead. The result: several top AI models showed symptoms resembling mild cognitive impairment in humans. Moreover, older versions of these models, like aging humans, performed worse on the tests and even showed "forgetfulness". The finding prompted deeper reflection by the research team.

Written by | Ren

With the rapid development of AI technology, its progress seems to redefine what we thought possible almost daily. Many people are wondering: will AI replace human doctors in the near future?

However, an interesting study recently published in The BMJ delivered an unexpected discovery: AI, it turns out, can exhibit symptoms similar to mild cognitive impairment in humans.

Screenshot of the paper | Source: The BMJ

This discovery can't help but make people smile, but it also prompts deeper reflection on what AI can actually do.

In this study, led by a research team from Hadassah Medical Center in Israel, researchers used the Montreal Cognitive Assessment (MoCA) along with several supplementary neuropsychological tests to evaluate the cognitive abilities of five widely used large language models: OpenAI's ChatGPT-4 and ChatGPT-4o, Google's Gemini 1.0 and 1.5, and Anthropic's Claude 3.5 Sonnet.

MoCA scores of the AI models | Source: Paper

The Montreal Cognitive Assessment is commonly used to assess cognitive ability and early dementia symptoms in the elderly. The maximum score is 30, and 26 points or more is considered normal. The research team gave the AI models exactly the same test instructions as human patients receive, all scoring strictly followed the official guidelines, and the results were evaluated by a practicing neurologist.

First, the conclusion: among all the models tested, ChatGPT-4o performed best, but it only just reached the normal threshold of 26 points. ChatGPT-4 and Claude 3.5 Sonnet followed closely behind, each with 25 points. Most surprisingly, Google's Gemini 1.0 scored only 16.
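To make the cutoff concrete, here is a minimal sketch in Python that applies the standard MoCA threshold to the scores reported above (the scores and the 26-point cutoff come from the study; the code itself is just an illustration):

```python
# Minimal sketch: applying the standard MoCA cutoff (>= 26 is "normal",
# out of a maximum of 30) to the scores reported in the article.
MOCA_CUTOFF = 26

scores = {
    "ChatGPT-4o": 26,
    "ChatGPT-4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}

for model, score in scores.items():
    verdict = "normal" if score >= MOCA_CUTOFF else "suggests mild cognitive impairment"
    print(f"{model}: {score}/30 -> {verdict}")
```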

Judged by the scoring criteria, all the models except ChatGPT-4o, the strongest performer, were equivalent to humans with mild cognitive impairment. Interestingly, the study also found that older versions of these models (such as Gemini 1.0), like aging humans, performed worse on the test, a pattern that prompted deeper reflection from the research team.

Test scores of the AI models, which generally performed poorly on visuospatial tests | Source: Paper

A closer look at the results shows that the large language models had clear strengths and weaknesses across the assessment items. They did well on tasks involving naming, attention, language, and abstraction. On tests of visuospatial and executive function, however, they showed impairments resembling mild cognitive impairment in humans.

For example, they performed poorly on the trail-making test (which requires connecting circled numbers and letters in alternating sequence) and the clock-drawing test (drawing a clock face showing a specified time), and some of the error patterns they produced were strikingly similar to those of patients with certain types of cognitive impairment.

In the trail-making test and the cube-drawing test, A and F are the correct answers, B and G are answers given by humans, and the rest are answers given by the AI models. | Source: Paper
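For reference, the correct trail in the MoCA trail-making item alternates between ascending numbers and letters. A trivial sketch spelling out the expected path (the alternation rule is the standard MoCA instruction; the code is only illustrative):

```python
# The MoCA trail-making item asks the subject to connect circled
# symbols in an alternating ascending sequence: 1 -> A -> 2 -> B -> ...
numbers = ["1", "2", "3", "4", "5"]
letters = ["A", "B", "C", "D", "E"]
trail = [step for pair in zip(numbers, letters) for step in pair]
print(" -> ".join(trail))  # 1 -> A -> 2 -> B -> 3 -> C -> 4 -> D -> 5 -> E
```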

Clock-drawing test, in which subjects are asked to draw a clock showing ten past eleven. A is an answer given by a human, B was drawn by an Alzheimer's patient, and the rest are the AI models' answers. The closest, G and H, came from GPT-4 and GPT-4o, but in both drawings the hands point to the wrong time. | Source: Paper

More interesting still, the Gemini models also showed "forgetfulness" on the memory test: in the delayed-recall task, they were completely unable to reproduce the sequence of five words given earlier. This is strikingly similar to the performance of human patients with early cognitive impairment, and may be related to the absence of any mechanism resembling human working memory in these models.
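For readers curious what such a probe looks like, here is a minimal sketch of a MoCA-style delayed-recall test. The five words are from the standard English MoCA form; `ask_model` is a hypothetical stand-in for whatever chat interface is used, since the article does not describe the study's actual test harness:

```python
# Minimal sketch of a MoCA-style delayed-recall probe for a chat model.
# `ask_model` is a hypothetical placeholder, not the study's real harness.
WORDS = ["face", "velvet", "church", "daisy", "red"]  # standard MoCA word list

def delayed_recall_probe(ask_model):
    # 1. Present the words and ask the model to remember them.
    ask_model(f"Remember these five words: {', '.join(WORDS)}.")
    # 2. Run intervening tasks (attention, abstraction, ...) so the
    #    recall is genuinely "delayed".
    ask_model("Count backward from 100 by sevens.")
    ask_model("What do a train and a bicycle have in common?")
    # 3. Ask for recall with no reminder of the words.
    answer = ask_model("Earlier I asked you to remember five words. What were they?")
    recalled = [w for w in WORDS if w in answer.lower()]
    return len(recalled)  # MoCA awards 1 point per word, 5 maximum

# Example with a trivial fake model that "remembers" nothing:
print(delayed_recall_probe(lambda prompt: "I don't recall."))  # -> 0
```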

In further visuospatial tests, when presented with materials such as the Navon figure, the cookie-theft scene, and the Poppelreuter figure, the AI models performed poorly at integrating local and global information, identifying objects in complex scenes, and picking up emotional cues.

For example, in the Navon figure test, most models could identify only the local elements and struggled to grasp the global structure, reflecting their deficits in abstraction and information integration.

Navon figure test. In the upper pair, the large H and large S are built from matching small H and small S elements; in the lower pair, each large letter is built from the opposite small letter. The test assesses global versus local processing in visual perception and attention. | Source: Paper
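A Navon stimulus is easy to construct: a large letter drawn out of many copies of a smaller, different letter. The sketch below renders a global H made of local S elements as plain text (the study used image stimuli; this text version only illustrates the global/local conflict):

```python
# Render a Navon figure as text: a global letter built from local letters.
# "X" cells in the template are drawn with the local letter.
GLOBAL_SHAPES = {
    "H": ["X...X",
          "X...X",
          "XXXXX",
          "X...X",
          "X...X"],
    "S": ["XXXXX",
          "X....",
          "XXXXX",
          "....X",
          "XXXXX"],
}

def navon(global_letter, local_letter):
    rows = GLOBAL_SHAPES[global_letter]
    return "\n".join(
        "".join(local_letter if cell == "X" else " " for cell in row)
        for row in rows
    )

print(navon("H", "S"))  # a big H made of small S's: which do you "see"?
```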

Similarly, in the cookie-theft picture test (taken from the Boston Diagnostic Aphasia Examination, BDAE), all the models could partially describe what was happening in the scene, but none of them mentioned that the little boy in the picture was about to fall. In tests with human subjects, missing this detail is often read as a sign of emotional flatness and loss of empathy, one of the symptoms of frontotemporal dementia (FTD).

Cookie-theft picture test | Source: Paper

However, the researchers also pointed out that although the AI models struggled with tasks requiring visuospatial execution, they performed very well on tasks requiring text analysis and verbal abstraction (such as the similarities test).

From the standpoint of technical principles, large language models are built on complex neural network architectures and imitate human language behavior by learning from massive amounts of data. This architecture, however, has clear shortcomings on cognitive tasks that demand deep understanding and flexible processing.

Part of this divergence has to do with how we train AI models: the training data currently used focuses mainly on language and symbol processing, while understanding spatial relationships and planning multi-step tasks are comparatively under-trained.

The difficulty AI models face with visuospatial problems also stems from the way they extract features and recognize patterns from data, which does not capture spatial relationships and object properties as accurately as the human brain does.

Finally, in the classic Stroop test, only GPT-4o succeeded at the more complex second stage; all the other models failed.

The test measures how interference affects subjects' reaction times by pairing a color name with the ink color it is printed in. In the second stage, each color word is displayed in an ink other than the color it names, for example the word "red" printed in blue. Compared with trials where the word and its ink color match, subjects take longer to identify the ink color and are more likely to make mistakes.

In the second stage of the Stroop test, the color name and the ink color do not match. | Source: Paper
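To make the two conditions concrete, here is a minimal sketch of how congruent and incongruent Stroop stimuli can be generated (the word/ink pairing rule is as described above; the particular color set and representation are illustrative assumptions, not the study's stimulus set):

```python
import random

# Minimal sketch: generating Stroop stimuli. A trial pairs a color word
# with an ink color; in the incongruent (second-stage) condition the
# ink never matches the word, e.g. the word "red" printed in blue.
COLORS = ["red", "blue", "green", "yellow"]

def stroop_trial(congruent):
    word = random.choice(COLORS)
    if congruent:
        ink = word
    else:
        ink = random.choice([c for c in COLORS if c != word])
    return {"word": word, "ink": ink, "correct_answer": ink}

print(stroop_trial(congruent=True))   # e.g. {'word': 'red', 'ink': 'red', ...}
print(stroop_trial(congruent=False))  # e.g. {'word': 'red', 'ink': 'blue', ...}
```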

It is also worth noting that the study found a large language model's "age" to be related to its cognitive performance. "Age" here does not mean the passage of time in the literal sense, but the iteration of the model version: earlier releases count as "older".

Taking ChatGPT-4 and ChatGPT-4o as examples, the older ChatGPT-4 scored slightly lower than the newer version on the MoCA. There is also a significant score difference between Gemini 1.0 and Gemini 1.5, again with the older version scoring lower.

This may suggest that cognitive ability improves as models are updated and developed, but the shape of that trend and the mechanism behind it remain unclear.

The findings are thought-provoking. Ever since ChatGPT was first opened to the public in 2022, the performance of AI models in the medical field has been a hot topic.

Many earlier studies have shown AI models outperforming human doctors on professional medical exams, including the European Exam in Core Cardiology (EECC), the Israeli residency exam, the Turkish thoracic surgery theory exam, and the German obstetrics and gynecology exam. Even on the professional exam for neurologists, AI models have shown they can surpass humans, which has made many specialists anxious.

However, the cognitive defects revealed by this latest study show AI's practical limitations. Medicine is not only a technology but also an art that requires humanistic care and empathy; its methods and practice are deeply rooted in human experience and compassion, not just a series of cold technical operations.

Even as the technology advances, some fundamental limitations of AI models may persist. For example, AI's weakness in visual abstraction matters because that ability is critical when interacting with patients during clinical assessments. As the research team put it: "Not only are neurologists unlikely to be replaced by AI in the short term; on the contrary, they may soon face a new type of 'patient': an AI model presenting with cognitive impairment."

The results also sound an alarm for applying AI models in medicine. Faced with AI systems that may have cognitive defects, patients will inevitably have doubts, especially in critical scenarios involving complex diagnoses and treatment decisions. Patients are more inclined to rely on the experience and judgment of human doctors, treating AI as an auxiliary tool rather than a decision maker.

At the same time, from the perspective of diagnostic accuracy, the models' deficits in visuospatial processing and abstract reasoning could skew their interpretation of medical images and clinical data, raising the risk of misdiagnosis or delayed treatment.

However, the researchers also acknowledge that the human brain and AI models differ in essential ways, so this kind of comparison has its limits. Whether it is reasonable or accurate to apply cognitive tests designed for humans to AI is likewise open to question; we may need new methods better suited to evaluating AI systems. Still, it is undeniable that the models tested generally performed poorly on visual abstraction and executive function.

Understanding the cognitive deficiencies of AI models is critical to developing responsible AI strategies. While promoting technological progress, we need to stay clear-eyed about what AI can do and keep our expectations reasonable.

Looking ahead, improving AI models' empathy and situational understanding may become a focus of future research and development. Rather than AI completely replacing human doctors or other professions, the more likely future is a new pattern in which human and artificial intelligence complement each other.

After all, in an era when even AI can show "cognitive impairment", what makes humans distinct deserves more recognition. While embracing technological progress, we should not forget the uniqueness of human cognition and emotion.

