Stanford Medicine Study Suggests Physicians' Medical Decisions Benefit from Chatbot

Artificial intelligence-powered chatbots are getting pretty good at diagnosing some diseases, even when they are complex. But how do chatbots do when guiding treatment and care after the diagnosis? For example, how long before surgery should a patient stop taking prescribed blood thinners? Should a patient's treatment protocol change if they've had adverse reactions to similar drugs in the past? These sorts of questions don't have a textbook right or wrong answer - it's up to physicians to use their judgment.

Jonathan H. Chen, MD, PhD, assistant professor of medicine, and a team of researchers are exploring whether chatbots, a type of large language model, or LLM, can effectively answer such nuanced questions, and whether physicians supported by chatbots perform better.

The answers, it turns out, are yes and yes. The research team tested how a chatbot performed when faced with a variety of clinical crossroads. A chatbot on its own outperformed doctors who could access only an internet search and medical references, but armed with their own LLM, the doctors, from multiple regions and institutions across the United States, kept up with the chatbot.

"For years I've said that, when combined, human plus computer is going to do better than either one by itself," Chen said. "I think this study challenges us to think about that more critically and ask ourselves, 'What is a computer good at? What is a human good at?' We may need to rethink where we use and combine those skills and for which tasks we recruit AI."

A study detailing these results was published in Nature Medicine on Feb. 5. Chen and Adam Rodman, MD, assistant professor at Harvard University, are co-senior authors. Postdoctoral scholars Ethan Goh, MD, and Robert Gallo, MD, are co-lead authors.

In October 2024, Chen and Goh led a team that ran a study, published in JAMA Network Open, that tested how the chatbot performed when diagnosing diseases and that found its accuracy was higher than that of doctors, even if they were using a chatbot. The current paper digs into the squishier side of medicine, evaluating chatbot and physician performance on questions that fall into a category called "clinical management reasoning."

Goh explains the difference like this: Imagine you’re using a map app on your phone to guide you to a certain destination. Using an LLM to diagnose a disease is sort of like using the map to pinpoint the correct location. How you get there is the management reasoning part - do you take backroads because there’s traffic? Stay the course, bumper to bumper? Or wait and hope the roads clear up?

In a medical context, these decisions can get tricky. Say a doctor incidentally discovers a hospitalized patient has a sizeable mass in the upper part of the lung. What would the next steps be? The doctor (or chatbot) should recognize that a large nodule in the upper lobe of the lung statistically has a high chance of spreading throughout the body. The doctor could immediately take a biopsy of the mass, schedule the procedure for a later date or order imaging to try to learn more.

Determining which approach is best suited for the patient comes down to a host of details, starting with the patient’s known preferences. Are they reluctant to undergo an invasive procedure? Does the patient’s history show a lack of following up on appointments? Is the hospital’s health system reliable when organizing follow-up appointments? What about referrals? These types of contextual factors are crucial to consider, Chen said.

The team designed a trial to study clinical management reasoning performance in three groups: the chatbot alone, 46 doctors with chatbot support, and 46 doctors with access only to internet search and medical references. They selected five de-identified patient cases and gave them to the chatbot and to the doctors, all of whom provided a written response that detailed what they would do in each case, why and what they considered when making the decision.

In addition, the researchers tapped a group of board-certified doctors to create a rubric that would qualify a medical judgment or decision as appropriately assessed. The decisions were then scored against the rubric.

To the team's surprise, the chatbot outperformed the doctors who had access only to the internet and medical references, ticking more items on the rubric than the doctors did. But the doctors who were paired with a chatbot performed as well as the chatbot alone.

Exactly what gave the physician-chatbot collaboration a boost is up for debate. Does using the LLM force doctors to be more thoughtful about the case? Or is the LLM providing guidance that the doctors wouldn't have thought of on their own? It's a future direction of exploration, Chen said.

The positive outcomes for chatbots and physicians paired with chatbots raise an ever-popular question: Are AI doctors on their way?

"Perhaps it's a point in AI’s favor," Chen said. But rather than replacing physicians, the results suggest that doctors might want to welcome a chatbot assist. "This doesn't mean patients should skip the doctor and go straight to chatbots. Don't do that," he said. "There's a lot of good information out there, but there's also bad information. The skill we all have to develop is discerning what's credible and what's not right. That's more important now than ever."

Researchers from VA Palo Alto Health Care System, Beth Israel Deaconess Medical Center, Harvard University, University of Minnesota, University of Virginia, Microsoft and Kaiser contributed to this work.

The study was funded by the Gordon and Betty Moore Foundation, the Stanford Clinical Excellence Research Center and the VA Advanced Fellowship in Medical Informatics.

Stanford's Department of Medicine also supported the work.

Goh E, Gallo RJ, Strong E, Weng Y, Kerman H, Freed JA, Cool JA, Kanjee Z, Lane KP, Parsons AS, Ahuja N, Horvitz E, Yang D, Milstein A, Olson APJ, Hom J, Chen JH, Rodman A.
GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial.
Nat Med. 2025 Feb 5. doi: 10.1038/s41591-024-03456-y
