Thoroughly beaten by human doctors! AI clinical decision-making is hasty and unsafe, with diagnostic accuracy as low as 13%

Will human doctors be laid off because of large models such as ChatGPT?

This worry is not unfounded. After all, Google's large model (Med-PaLM 2) has easily passed the US Medical Licensing Examination and reached the level of medical experts.

However, a recent study shows that in clinical decision-making, human doctors remain far superior to current artificial intelligence (AI) models, so there is little reason for physicians to worry about being replaced just yet.

The related research paper, titled "Evaluation and mitigation of the limitations of large language models in clinical decision-making", was recently published in the scientific journal Nature Medicine.


The study found that even the most advanced large language models (LLMs) cannot make accurate diagnoses for all patients and perform significantly worse than human doctors.

The doctors' diagnoses were correct 89% of the time, while the LLMs' were correct only 73% of the time. In one extreme case (cholecystitis), the LLM was correct only 13% of the time.

Even more surprising, as LLMs receive more information about a case, their diagnostic accuracy decreases, and they sometimes even order tests that could pose serious health risks to the patient.

How does an LLM perform as an emergency physician?

Although LLMs can easily pass the US Medical Licensing Examination, licensing exams and clinical case challenges test only a candidate's general medical knowledge, which is far less demanding than the complex clinical decision-making of everyday practice.

Clinical decision-making is a multistep process that requires collecting and integrating data from different sources and continually re-evaluating the facts to reach evidence-based diagnostic and treatment decisions for the patient.

To further study the potential of LLMs in clinical diagnosis, a research team from the Technical University of Munich and its collaborators built a dataset of 2,400 real patient cases covering four common abdominal diseases (appendicitis, pancreatitis, cholecystitis, and diverticulitis), drawn from the Medical Information Mart for Intensive Care (MIMIC-IV) database. The benchmark simulates a realistic clinical environment, reproducing the pathway from the emergency department to treatment, in order to evaluate LLMs as clinical decision makers.

Figure | Dataset source and evaluation framework. The dataset is derived from real cases in the MIMIC-IV database and contains comprehensive electronic health record data recorded during hospitalization. The evaluation framework reflects a realistic clinical environment and assesses LLMs against multiple criteria, including diagnostic accuracy, adherence to diagnostic and treatment guidelines, consistency in following instructions, ability to interpret laboratory results, and robustness to changes in instructions, amount of information, and information order. ICD, International Classification of Diseases; CT, computed tomography; US, ultrasound; MRCP, magnetic resonance cholangiopancreatography.
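To make this evaluation setup more concrete, below is a minimal Python sketch of how such a stepwise clinical-decision benchmark could be scored. The helper names (PatientCase, query_llm, evaluate_case) and the scoring rule are hypothetical illustrations assumed for this sketch; they are not the authors' released code or the exact MIMIC-CDM pipeline.

```python
from dataclasses import dataclass, field

# Hypothetical illustration of a stepwise clinical-decision evaluation loop.
# The case structure, helper names, and scoring are simplified sketches,
# not the code released with the paper.

@dataclass
class PatientCase:
    history: str                                  # history of present illness
    exams: dict = field(default_factory=dict)     # e.g. {"physical_exam": ..., "labs": ..., "imaging": ...}
    true_diagnosis: str = ""                      # gold label, e.g. derived from ICD/discharge diagnosis


def query_llm(prompt: str) -> str:
    """Placeholder for a call to the model under test (e.g., Llama 2 Chat)."""
    raise NotImplementedError


def evaluate_case(case: PatientCase, max_steps: int = 5) -> bool:
    """Reveal information stepwise, let the model request tests, then score its final diagnosis."""
    context = f"History of present illness:\n{case.history}\n"
    remaining = dict(case.exams)                  # copy so the case object is not mutated
    for _ in range(max_steps):
        answer = query_llm(
            context
            + "\nRequest ONE additional test (physical_exam, labs, imaging) "
              "or state your final diagnosis."
        )
        requested = answer.strip().lower()
        if requested in remaining:                # the model asked for information it has not yet seen
            context += f"\n{requested} results:\n{remaining.pop(requested)}\n"
        else:                                     # otherwise treat the reply as the final diagnosis
            return case.true_diagnosis.lower() in answer.lower()
    return False                                  # never committed to a diagnosis within the step limit


def diagnostic_accuracy(cases: list[PatientCase]) -> float:
    """Fraction of cases in which the model's final diagnosis matches the gold label."""
    return sum(evaluate_case(c) for c in cases) / len(cases)
```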

The research team tested Llama 2 and its derivatives, including general-purpose versions (such as Llama 2 Chat, Open Assistant, WizardLM) and models aligned to the medical domain (such as Clinical Camel and Meditron).

Because of privacy restrictions and the MIMIC data-use agreement, the data cannot be sent to external APIs such as those from OpenAI or Google, so ChatGPT, GPT-4, and Med-PaLM could not be tested. It is worth noting that Llama 2, Clinical Camel, and Meditron have matched or exceeded ChatGPT's performance on medical licensing exams and biomedical question-answering benchmarks.

The control group consisted of four physicians from two countries with differing amounts of emergency department experience (2, 3, 4, and 29 years, respectively). The results showed that the LLMs performed far worse than the human doctors in clinical diagnosis.

1. The diagnostic performance of LLMs is significantly lower than that of clinical doctors

The results showed that current LLMs were significantly inferior to the doctors in overall performance across all diseases (P < 0.001), with a 16%-25% gap in diagnostic accuracy. Although the models performed well on straightforward appendicitis cases, they performed poorly on other pathologies such as cholecystitis. The Meditron model in particular failed at diagnosing cholecystitis, often diagnosing patients with "gallstones" instead.

The medically specialized LLMs did not perform significantly better than the other models overall, and their performance dropped further when the models had to gather all of the information on their own.

Figure | Diagnostic accuracy under full information provision. Data are based on a subset of MIMIC-CDM-FI (n=80), with the mean diagnostic accuracy shown above each bar and the vertical line indicating the standard deviation. Mean LLM performance was significantly worse than that of the physicians (P < 0.001), especially for cholecystitis (P < 0.001) and diverticulitis (P < 0.001).

Figure | Diagnostic accuracy in autonomous clinical decision-making scenarios. Compared with the full-information scenario, overall diagnostic accuracy dropped significantly. The LLMs performed best at diagnosing appendicitis but poorly on the other three pathologies: cholecystitis, diverticulitis, and pancreatitis.

2. LLMs' clinical decision-making is hasty and unsafe

The research team found that the LLMs performed poorly at following diagnostic guidelines and were prone to missing important physical examination findings. They were also inconsistent about ordering the necessary laboratory tests and showed significant shortcomings in interpreting laboratory results. In other words, they made hasty diagnoses without fully understanding the patient's case, posing a serious risk to the patient's health.

Figure | Evaluation of LLM-recommended treatments. The expected treatment was determined from clinical guidelines and the treatment actually received by patients in the dataset. Of the 808 patients, Llama 2 Chat correctly diagnosed 603; for these 603 patients, it correctly recommended appendectomy in 97.5% of cases.

3. LLMs still require extensive clinical supervision from doctors

Additionally, all of the current LLMs performed poorly at following basic medical guidance, making errors in 2-4 cases and fabricating nonexistent guidance in 2-5 cases.

Figure | Performance of LLMs with different amounts of data. The study compared each model's performance when given all diagnostic information versus only a single diagnostic test plus the history of present illness. For almost all diseases, providing all of the information did not produce the best performance on the MIMIC-CDM-FI dataset. This suggests that LLMs cannot focus on the key facts and that their performance degrades when too much information is provided.

The study also showed that the information order yielding the best performance differed by model and by pathology, which further complicates subsequent optimization of the models. Without extensive physician supervision and prior evaluation, they cannot be deployed reliably. Overall, the models showed flaws in following instructions, in the order in which they processed information, and in handling relevant information, so substantial clinical supervision is required to ensure they operate correctly.

Although the study found various problems with LLMs in clinical diagnosis, LLMs still hold great promise in medicine and are likely best suited to making diagnoses from a medical history and test results. The research team believes this work can be developed further in two directions:

Model validation and testing: Further research should focus on more comprehensive validation and testing of LLMs to ensure their validity in real clinical settings.

Multidisciplinary collaboration: It is recommended that AI experts work closely with clinicians to jointly develop and optimize LLMs that are applicable to clinical practice and solve problems in practical applications.

How is AI disrupting healthcare?

The problem is not limited to the study above: a team from the National Institutes of Health (NIH) and its collaborators found similar issues. When answering 207 image-challenge questions, GPT-4V scored well at selecting the correct diagnosis, but it often made mistakes when describing the medical images and explaining the reasoning behind the diagnosis.

Although AI currently remains far inferior to professional human doctors, its research and application in healthcare has long been an important "battlefield" on which technology companies and research universities around the world compete.

For example, the medical AI model Med-PaLM 2 released by Google has strong diagnostic and treatment capabilities and was the first large model to reach "expert" level on the MedQA test set.

The " Agent Hospital " proposed by the Tsinghua University research team can simulate the entire process of treating diseases. Its core goal is to allow doctor agents to learn how to treat diseases in a simulated environment, and even continuously accumulate experience from successful and failed cases to achieve self-evolution.

Harvard Medical School has led the development of PathChat, a general-purpose vision-language AI assistant for human pathology that correctly identifies diseases from biopsy sections in nearly 90% of cases, outperforming general AI models and specialized medical models currently on the market, such as GPT-4V.

Figure | Instruction fine-tuning dataset and PathChat construction

Recently, OpenAI CEO Sam Altman participated in the establishment of a new company, Thrive AI Health, which aims to use AI technology to help people improve their daily habits and reduce the mortality rate of chronic diseases.

They claim that hyper-personalized AI can effectively improve people's daily habits, thereby preventing and managing chronic diseases, reducing the economic burden of healthcare, and improving overall health.

Today, the application of AI in the medical industry has gradually transitioned from the initial experimental stage to the practical application stage, but it may still have a long way to go before it can help clinicians enhance their capabilities, improve clinical decision-making, or even directly replace them.
