GPT-4, Google Gemini Fall Short in Breast Imaging Classification

Use of publicly available large language models (LLMs) resulted in changes in breast imaging reports classification that could have a negative effect on patient management, according to a new international study published today in the journal Radiology, a journal of the Radiological Society of North America (RSNA). The study findings underscore the need to regulate these LLMs in scenarios that require high-level medical reasoning, researchers said.

LLMs are a type of artificial intelligence (AI) widely used today for a variety of purposes. In radiology, LLMs have already been tested in a wide variety of clinical tasks, from processing radiology request forms to providing imaging recommendations and diagnosis support.

Publicly available generic LLMs like ChatGPT (GPT 3.5 and GPT-4) and Google Gemini (formerly Bard) have shown promising results in some tasks. Importantly, however, they are less successful at more complex tasks requiring a higher level of reasoning and deeper clinical knowledge, such as providing imaging recommendations. Users seeking medical advice may not always understand the limitations of these untrained programs.

"Evaluating the abilities of generic LLMs remains important as these tools are the most readily available and may unjustifiably be used by both patients and non-radiologist physicians seeking a second opinion," said study co-lead author Andrea Cozzi, M.D., Ph.D., radiology resident and post-doctoral research fellow at the Imaging Institute of Southern Switzerland, Ente Ospedaliero Cantonale, in Lugano, Switzerland.

Dr. Cozzi and colleagues set out to test the generic LLMs on a task that pertains to daily clinical routine but where the depth of medical reasoning is high and where the use of languages other than English would further stress LLMs capabilities. They focused on the agreement between human readers and LLMs for the assignment of Breast Imaging Reporting and Data System (BI-RADS) categories, a widely used system to describe and classify breast lesions.

The Swiss researchers partnered with an American team from Memorial Sloan Kettering Cancer Center in New York City and a Dutch team at the Netherlands Cancer Institute in Amsterdam.

The study included BI-RADS classifications of 2,400 breast imaging reports written in English, Italian and Dutch. Three LLMs - GPT-3.5, GPT-4 and Google Bard (now renamed Google Gemini) - assigned BI-RADS categories using only the findings described by the original radiologists. The researchers then compared the performance of the LLMs with that of board-certified breast radiologists.

The agreement for BI-RADS category assignments between human readers was almost perfect. However, the agreement between humans and the LLMs was only moderate. Most importantly, the researchers also observed a high percentage of discordant category assignments that would result in negative changes in patient management. This raises several concerns about the potential consequences of placing too much reliance on these widely available LLMs.

According to Dr. Cozzi, the results highlight the need for regulation of LLMs when there is a highly likely possibility that users may ask them health-care-related questions of varying depth and complexity.

"The results of this study add to the growing body of evidence that reminds us of the need to carefully understand and highlight the pros and cons of LLM use in health care," he said. "These programs can be a wonderful tool for many tasks but should be used wisely. Patients need to be aware of the intrinsic shortcomings of these tools, and that they may receive incomplete or even utterly wrong replies to complex questions."

Cozzi A, Pinker K, Hidber A, Zhang T, Bonomo L, Lo Gullo R, Christianson B, Curti M, Rizzo S, Del Grande F, Mann RM, Schiaffino S.
BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study.
Radiology. 2024 Apr;311(1):e232133. doi: 10.1148/radiol.232133

Most Popular Now

European Artificial Intelligence Act Com…

The European Artificial Intelligence Act (AI Act), the world's first comprehensive regulation on artificial intelligence, enters into force. The AI Act is designed to ensure that AI developed and used...

Patient Safety must be Central to the De…

An EPR system brings together different patient information in one place, making it easier to access for healthcare professionals. This information can include patients' own notes, test results, observations by...

Generative AI can Not yet Reliably Read …

It may someday be possible to use Large Language Models (LLM) to automatically read clinical notes in medical records and reliably and efficiently extract relevant information to support patient care...

ChatGPT Shows Promise in Answering Patie…

The groundbreaking ChatGPT chatbot shows potential as a time-saving tool for responding to patient questions sent to the urologist's office, suggests a study in the September issue of Urology Practice®...

Survey: Most Americans Comfortable with …

Artificial intelligence (AI) is all around us - from smart home devices to entertainment and social media algorithms. But is AI okay in healthcare? A new national survey commissioned by...

AI can Help Rule out Abnormal Pathology …

A commercial artificial intelligence (AI) tool used off-label was effective at excluding pathology and had equal or lower rates of critical misses on chest X-ray than radiologists, according to a...

What Does the EU's Recent AI Act Me…

The European Union's law on artificial intelligence came into force on 1 August. The new AI Act essentially regulates what artificial intelligence can and cannot do in the EU. A...

AI Spots Cancer and Viral Infections at …

Researchers at the Centre for Genomic Regulation (CRG), the University of the Basque Country (UPV/EHU), Donostia International Physics Center (DIPC) and the Fundación Biofisica Bizkaia (FBB, located in Biofisika Institute)...

Video Gaming Improves Mental Well-Being

A pioneering study titled "Causal effect of video gaming on mental well-being in Japan 2020-2022," published in Nature Human Behaviour, has conducted the most comprehensive investigation to date on the...

New Diabetes Research Links Blood Glucos…

As part of its ongoing exploration of vocal biomarkers and the role they can play in enhancing health outcomes, Klick Labs published a new study in Scientific Reports - confirming...

Machine learning helps identify rheumato…

A machine-learning tool created by Weill Cornell Medicine and Hospital for Special Surgery (HSS) investigators can help distinguish subtypes of rheumatoid arthritis (RA), which may help scientists find ways to...

New AI Software could Make Diagnosing De…

Although Alzheimer's is the most common cause of dementia - a catchall term for cognitive deficits that impact daily living, like the loss of memory or language - it's not...