Fine-Tuned LLMs Boost Error Detection in Radiology Reports

A type of artificial intelligence (AI) called fine-tuned large language models (LLMs) greatly enhances error detection in radiology reports, according to a new study published in Radiology, a journal of the Radiological Society of North America (RSNA). Researchers said the findings point to an important role for this technology in medical proofreading.

Radiology reports are crucial for optimal patient care. Their accuracy can be compromised by factors like errors in speech recognition software, variability in perceptual and interpretive processes and cognitive biases. These errors can lead to incorrect diagnoses or delayed treatments, making the need for accurate reports urgent.

LLMs like ChatGPT are advanced generative AI models that are trained on vast amounts of text to generate human language. While they offer great potential in proofreading, their application in the medical field, particularly in detecting errors within radiology reports, remains underexplored.

To bridge this gap in knowledge, researchers evaluated fine-tuned LLMs for detecting errors in radiology reports during medical proofreading. A fine-tuned LLM is a pre-trained language model that is further trained on domain-specific data.

"Initially, LLMs are trained on large-scale public data to learn general language patterns and knowledge," said study senior author Yifan Peng, Ph.D., from the Department of Population Health Sciences at Weill Cornell Medicine in New York City. "Fine-tuning occurs as the next step, where the model undergoes additional training using smaller, targeted datasets relevant to particular tasks."

To test the model, Dr. Peng and colleagues built a dataset with two parts. The first consisted of 1,656 synthetic reports, including 828 error-free reports and 828 reports with errors. The second part comprised 614 reports, including 307 error-free reports from MIMIC-CXR, a large, publicly available database of chest X-rays, and 307 synthetic reports with errors.

The researchers used the synthetic reports to boost the amount of training data and fulfill the data-hungry needs of LLM fine-tuning.

"Synthetic reports can also increase the coverage and diversity, balance out the cases and reduce the annotation costs," said the study's first author, Cong Sun, Ph.D., from Dr. Peng's lab. "In radiology, or more broadly, the clinical domain, synthetic reports allow safe data-sharing without compromising patient privacy."

The researchers found that the fine-tuned model outperformed both GPT-4 and BiomedBERT, a natural language processing tool for biomedical research.

"The LLM that was fine-tuned on both MIMIC-CXR and synthetic reports demonstrated strong performance in the error detection tasks," Dr. Sun said. "It meets our expectations and highlights the potential for developing lightweight, fine-tuned LLM specifically for medical proofreading applications."

The study provided evidence that LLMs can assist in detecting various types of errors, including transcription errors and left/right errors, which refer to misidentification or misinterpretation of directions or sides in text or images.

The use of synthetic data in AI model building has raised concerns of bias in the data. Dr. Peng and colleagues took steps to minimize this by using diverse and representative samples of real-world data to generate the synthetic data. However, they acknowledged that synthetic errors may not fully capture the complexity of real-world errors in radiology reports. Future work could include a systematic evaluation of how bias introduced by synthetic errors affects model performance.

The researchers hope to study fine-tuning's ability to reduce radiologists' cognitive load and enhance patient care and find out if fine-tuning would degrade the model's ability to generate reasoning explanations.

"We are excited to keep exploring innovative strategies to enhance the reasoning capabilities of fine-tuned LLMs in medical proofreading tasks," Dr. Peng said. "Our goal is to develop transparent and understandable models that radiologists can confidently trust and fully embrace."

Sun C, Teichman K, Zhou Y, Critelli B, Nauheim D, Keir G, Wang X, Zhong J, Flanders AE, Shih G, Peng Y.
Generative Large Language Models Trained for Detecting Errors in Radiology Reports.
Radiology. 2025 May;315(2):e242575. doi: 10.1148/radiol.242575

Most Popular Now

Study Finds One-Year Change on CT Scans …

Researchers at National Jewish Health have shown that subtle increases in lung scarring, detected by an artificial intelligence-based tool on CT scans taken one year apart, are associated with disease...

Yousif's Story with Sectra and The …

Embarking on healthcare technology career after leaving his home as a refugee during his teenage years, Yousif is passionate about making a difference. He reflects on an apprenticeship in which...

New AI Tools Help Scientists Track How D…

Artificial intelligence (AI) can solve problems at remarkable speed, but it’s the people developing the algorithms who are truly driving discovery. At The University of Texas at Arlington, data scientists...

AI Tool Offers Deep Insight into the Imm…

Researchers explore the human immune system by looking at the active components, namely the various genes and cells involved. But there is a broad range of these, and observations necessarily...

New Antibiotic Targets IBD - and AI Pred…

Researchers at McMaster University and the Massachusetts Institute of Technology (MIT) have made two scientific breakthroughs at once: they not only discovered a brand-new antibiotic that targets inflammatory bowel diseases...

Highland to Help Companies Seize 'N…

Health tech growth partner Highland has today revealed its new identity - reflecting a sharper focus as it helps health tech companies to find market opportunities, convince target audiences, and...