Fine-Tuned LLMs Boost Error Detection in Radiology Reports

A type of artificial intelligence (AI) called fine-tuned large language models (LLMs) greatly enhances error detection in radiology reports, according to a new study published in Radiology, a journal of the Radiological Society of North America (RSNA). Researchers said the findings point to an important role for this technology in medical proofreading.

Radiology reports are crucial for optimal patient care. Their accuracy can be compromised by factors such as errors in speech recognition software, variability in perceptual and interpretive processes, and cognitive biases. These errors can lead to incorrect diagnoses or delayed treatments, which makes accurate reports an urgent need.

LLMs like ChatGPT are advanced generative AI models that are trained on vast amounts of text to generate human language. While they offer great potential in proofreading, their application in the medical field, particularly in detecting errors within radiology reports, remains underexplored.

To bridge this gap in knowledge, researchers evaluated fine-tuned LLMs for detecting errors in radiology reports during medical proofreading. A fine-tuned LLM is a pre-trained language model that is further trained on domain-specific data.

"Initially, LLMs are trained on large-scale public data to learn general language patterns and knowledge," said study senior author Yifan Peng, Ph.D., from the Department of Population Health Sciences at Weill Cornell Medicine in New York City. "Fine-tuning occurs as the next step, where the model undergoes additional training using smaller, targeted datasets relevant to particular tasks."

To test the model, Dr. Peng and colleagues built a dataset with two parts. The first consisted of 1,656 synthetic reports, including 828 error-free reports and 828 reports with errors. The second part comprised 614 reports, including 307 error-free reports from MIMIC-CXR, a large, publicly available database of chest X-rays, and 307 synthetic reports with errors.
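The dataset composition described above can be tallied in a few lines. This is only a bookkeeping sketch based on the counts quoted in this article; the `DatasetPart` class and its field names are hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetPart:
    """One part of the evaluation dataset (hypothetical helper, not from the study)."""
    name: str
    error_free: int
    with_errors: int

    @property
    def total(self) -> int:
        return self.error_free + self.with_errors


# Counts as reported in the article.
parts = [
    DatasetPart("synthetic reports", error_free=828, with_errors=828),
    DatasetPart("MIMIC-CXR reports plus synthetic error reports", error_free=307, with_errors=307),
]

total_reports = sum(p.total for p in parts)
total_error_free = sum(p.error_free for p in parts)
total_with_errors = sum(p.with_errors for p in parts)

for p in parts:
    print(f"{p.name}: {p.total} reports")
print(f"all parts: {total_reports} reports, "
      f"{total_error_free} error-free, {total_with_errors} with errors")
```

Both parts are evenly split between error-free reports and reports with errors, so the combined set of 2,270 reports is balanced as well.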

The researchers used the synthetic reports to boost the amount of training data and fulfill the data-hungry needs of LLM fine-tuning.

"Synthetic reports can also increase the coverage and diversity, balance out the cases and reduce the annotation costs," said the study's first author, Cong Sun, Ph.D., from Dr. Peng's lab. "In radiology, or more broadly, the clinical domain, synthetic reports allow safe data-sharing without compromising patient privacy."

The researchers found that the fine-tuned model outperformed both GPT-4 and BiomedBERT, a natural language processing tool for biomedical research.

"The LLM that was fine-tuned on both MIMIC-CXR and synthetic reports demonstrated strong performance in the error detection tasks," Dr. Sun said. "It meets our expectations and highlights the potential for developing lightweight, fine-tuned LLMs specifically for medical proofreading applications."

The study provided evidence that LLMs can assist in detecting various types of errors, including transcription errors and left/right errors, which refer to misidentification or misinterpretation of directions or sides in text or images.
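To make the left/right error type concrete, here is a minimal sketch of how such an error could be injected into a report sentence to produce a synthetic example. The article does not describe how the researchers generated their synthetic errors, so the function `inject_left_right_error` and its swap-the-first-term rule are assumptions made purely for illustration.

```python
import re

# Hypothetical illustration only: the study's actual error-generation
# procedure is not described in this article.
_SWAP = {"left": "right", "right": "left"}
_PATTERN = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def inject_left_right_error(report: str) -> str:
    """Swap the first laterality term in the report, preserving its capitalization."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = _SWAP[word.lower()]
        if word.isupper():
            return replacement.upper()
        if word[0].isupper():
            return replacement.capitalize()
        return replacement

    # count=1 changes only the first laterality term that appears.
    return _PATTERN.sub(swap, report, count=1)


original = "Small left pleural effusion. Right lung is clear."
corrupted = inject_left_right_error(original)
print(corrupted)
```

In this sketch the first sentence changes from "left" to "right" while the rest of the report stays intact, which is the kind of subtle side mix-up a proofreading model would need to flag.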

The use of synthetic data in AI model building has raised concerns about bias in the data. Dr. Peng and colleagues took steps to minimize this by using diverse and representative samples of real-world data to generate the synthetic data. However, they acknowledged that synthetic errors may not fully capture the complexity of real-world errors in radiology reports. Future work could include a systematic evaluation of how bias introduced by synthetic errors affects model performance.

The researchers hope to study fine-tuning's ability to reduce radiologists' cognitive load and enhance patient care. They also want to find out whether fine-tuning degrades the model's ability to generate reasoning explanations.

"We are excited to keep exploring innovative strategies to enhance the reasoning capabilities of fine-tuned LLMs in medical proofreading tasks," Dr. Peng said. "Our goal is to develop transparent and understandable models that radiologists can confidently trust and fully embrace."

Sun C, Teichman K, Zhou Y, Critelli B, Nauheim D, Keir G, Wang X, Zhong J, Flanders AE, Shih G, Peng Y.
Generative Large Language Models Trained for Detecting Errors in Radiology Reports.
Radiology. 2025 May;315(2):e242575. doi: 10.1148/radiol.242575
