Human-AI Collectives Make the Most Accurate Medical Diagnoses

Diagnostic errors are among the most serious problems in everyday medical practice. AI systems - especially large language models (LLMs) like ChatGPT-4, Gemini, or Claude 3 - offer new ways to efficiently support medical diagnoses. Yet these systems also entail considerable risks - for example, they can "hallucinate" and generate false information. In addition, they reproduce existing social or medical biases and make mistakes that are often perplexing to humans.

An international research team, led by the Max Planck Institute for Human Development and in collaboration with partners from the Human Diagnosis Project (San Francisco) and the Institute of Cognitive Sciences and Technologies of the Italian National Research Council (CNR-ISTC Rome), investigated how humans and AI can best collaborate. The result: hybrid diagnostic collectives - groups consisting of human experts and AI systems - are significantly more accurate than collectives consisting solely of humans or AI. This holds particularly for complex, open-ended diagnostic questions with numerous possible solutions, rather than simple yes/no decisions. "Our results show that cooperation between humans and AI models has great potential to improve patient safety," says lead author Nikolas Zöller, postdoctoral researcher at the Center for Adaptive Rationality of the Max Planck Institute for Human Development.

The researchers used data from the Human Diagnosis Project, which provides clinical vignettes - short descriptions of medical case studies - along with the correct diagnoses. Using more than 2,100 of these vignettes, the study compared the diagnoses made by medical professionals with those of five leading AI models. In the central experiment, various diagnostic collectives were simulated: individuals, human collectives, AI models, and mixed human-AI collectives. In total, the researchers analyzed more than 40,000 diagnoses. Each was classified and evaluated according to international medical standards (SNOMED CT).

The study shows that combining multiple AI models improved diagnostic quality. On average, the AI collectives outperformed 85% of human diagnosticians. However, there were numerous cases in which humans performed better. Interestingly, when AI failed, humans often knew the correct diagnosis.

The biggest surprise was that combining both worlds led to a significant increase in accuracy. Even adding a single AI model to a group of human diagnosticians - or vice versa - substantially improved the result. The most reliable outcomes came from collective decisions involving multiple humans and multiple AIs. The explanation is that humans and AI make systematically different errors. When AI failed, a human professional could compensate for the mistake - and vice versa. This so-called error complementarity is what makes hybrid collectives so powerful. "It's not about replacing humans with machines. Rather, we should view artificial intelligence as a complementary tool that unfolds its full potential in collective decision-making," says co-author Stefan Herzog, Senior Research Scientist at the Max Planck Institute for Human Development.
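To make error complementarity concrete, here is a minimal Python sketch. It is an illustration only, not the study's method: the aggregation rule (a simple plurality vote), the function names, and the four toy cases are all assumptions invented for this example. In the toy data, the humans share one blind spot and the AI models share a different one, so each group is outvoted by the other on exactly the case where it errs.

```python
from collections import Counter

def plurality_vote(diagnoses):
    """Return the most frequently named diagnosis; ties go to the one named first."""
    counts = Counter(diagnoses)
    best = max(counts.values())
    for diagnosis in diagnoses:
        if counts[diagnosis] == best:
            return diagnosis

def accuracy(collective, truth):
    """Fraction of cases where the collective's plurality vote matches the truth."""
    hits = 0
    for case_index, correct in enumerate(truth):
        votes = [member[case_index] for member in collective]
        if plurality_vote(votes) == correct:
            hits += 1
    return hits / len(truth)

# Hypothetical toy data (not from the study): one diagnosis per member per case.
truth = ["pneumonia", "appendicitis", "migraine", "gout"]

humans = [  # two of three humans share the same error on case 0
    ["bronchitis", "appendicitis", "migraine", "gout"],
    ["bronchitis", "appendicitis", "migraine", "gout"],
    ["pneumonia", "appendicitis", "migraine", "gout"],
]
ais = [  # two of three AI models share the same error on case 2
    ["pneumonia", "appendicitis", "tension headache", "gout"],
    ["pneumonia", "appendicitis", "tension headache", "gout"],
    ["pneumonia", "appendicitis", "migraine", "gout"],
]

print("humans only:", accuracy(humans, truth))
print("AI only:    ", accuracy(ais, truth))
print("hybrid:     ", accuracy(humans + ais, truth))
```

In this constructed example each homogeneous group gets three of four cases right, while the mixed group gets all four, because the errors fall on different cases. Real diagnostic data are far noisier, and the size of any gain depends on how strongly the errors of the two groups overlap.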

However, the researchers also emphasize the limitations of their work. The study only considered text-based case vignettes - not actual patients in real clinical settings. Whether the results can be transferred directly to practice remains a question for future studies to address. Likewise, the study focused solely on diagnosis, not treatment, and a correct diagnosis does not necessarily guarantee an optimal treatment.

It also remains uncertain how AI-based support systems will be accepted in practice by medical staff and patients. The potential risks of bias and discrimination by both AI and humans, particularly in relation to ethnic, social, or gender differences, likewise require further research.

The study is part of the Hybrid Human Artificial Collective Intelligence in Open-Ended Decision Making (HACID) project, funded under Horizon Europe, which aims to promote the development of future clinical decision-support systems through the smart integration of human and machine intelligence. The researchers see particular potential in regions where access to medical care is limited. Hybrid human–AI collectives could make a crucial contribution to greater healthcare equity in such areas.

"The approach can also be transferred to other critical areas - such as the legal system, disaster response, or climate policy - anywhere that complex, high-risk decisions are needed. For example, the HACID project is also developing tools to enhance decision-making in climate adaptation," says Vito Trianni, co-author and coordinator of the HACID project.

In brief:

  • Hybrid diagnostic collectives consisting of humans and AI make significantly more accurate diagnoses than either medical professionals or AI systems alone - because humans and AI make systematically different errors that can offset one another.
  • The study analyzed over 40,000 diagnoses made by humans and machines in response to more than 2,100 realistic clinical vignettes.
  • Adding an AI model to a human collective - or vice versa - noticeably improved diagnostic quality; hybrid collective decisions made by several humans and machines achieved the best results.
  • These findings highlight the potential for greater patient safety and more equitable healthcare, especially in underserved regions. However, further research is needed on practical implementation and ethical considerations.

Zöller N, Berger J, Lin I, Fu N, Komarneni J, Barabucci G, Laskowski K, Shia V, Harack B, Chu EA, Trianni V, Kurvers RHJM, Herzog SM.
Human-AI collectives most accurately diagnose clinical vignettes.
Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2426153122. doi: 10.1073/pnas.2426153122
