Almost All Leading AI Chatbots Show Signs of Cognitive Decline

Almost all leading large language models or "chatbots" show signs of mild cognitive impairment in tests widely used to spot early signs of dementia, finds a study in the Christmas issue of The BMJ.

The results also show that "older" versions of chatbots, like older patients, tend to perform worse on the tests. The authors say these findings "challenge the assumption that artificial intelligence will soon replace human doctors."

Huge advances in the field of artificial intelligence have led to a flurry of excited and fearful speculation as to whether chatbots can surpass human physicians.

Several studies have shown large language models (LLMs) to be remarkably adept at a range of medical diagnostic tasks, but their susceptibility to human impairments such as cognitive decline has not yet been examined.

To fill this knowledge gap, researchers assessed the cognitive abilities of the leading, publicly available LLMs - ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 "Sonnet" (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet) - using the Montreal Cognitive Assessment (MoCA) test.

The MoCA test is widely used to detect cognitive impairment and early signs of dementia, usually in older adults. Through a number of short tasks and questions, it assesses abilities including attention, memory, language, visuospatial skills, and executive functions. The maximum score is 30 points, with a score of 26 or above generally considered normal.

The instructions given to the LLMs for each task were the same as those given to human patients. Scoring followed official guidelines and was evaluated by a practising neurologist.

ChatGPT 4o achieved the highest score on the MoCA test (26 out of 30), followed by ChatGPT 4 and Claude (25 out of 30), with Gemini 1.0 scoring lowest (16 out of 30).
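
To make the comparison with the cut-off concrete, the following is a minimal sketch, not the authors' method, that checks each reported score against the 26-point threshold described above. The function name `classify_moca_score` and the labels it returns are illustrative assumptions. Only the scores stated in this article are included; the Gemini 1.5 score is not reported here, so it is left out.

```python
# Illustrative sketch only: compares the MoCA scores reported in the article
# with the commonly used cut-off (26 or above is generally considered normal).
# The function name and the returned labels are assumptions for illustration.

MOCA_MAX = 30        # maximum MoCA score
NORMAL_CUTOFF = 26   # scores at or above this are generally considered normal

# Scores as reported in the article (Gemini 1.5 is not reported there).
reported_scores = {
    "ChatGPT 4o": 26,
    "ChatGPT 4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}


def classify_moca_score(score: int) -> str:
    """Return a coarse label for a MoCA total score."""
    if not 0 <= score <= MOCA_MAX:
        raise ValueError(f"MoCA score must be between 0 and {MOCA_MAX}")
    return "normal range" if score >= NORMAL_CUTOFF else "below cut-off"


for model, score in reported_scores.items():
    print(f"{model}: {score}/{MOCA_MAX} ({classify_moca_score(score)})")
```

Under this threshold, only ChatGPT 4o reaches the range generally considered normal; the other reported scores fall below it.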

All chatbots showed poor performance in visuospatial skills and executive tasks, such as the trail making task (connecting encircled numbers and letters in ascending order) and the clock drawing test (drawing a clock face showing a specific time). Gemini models failed at the delayed recall task (remembering a five word sequence).

Most other tasks, including naming, attention, language, and abstraction, were performed well by all chatbots.

But in further visuospatial tests, chatbots were unable to show empathy or accurately interpret complex visual scenes. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of colour names and font colours to measure how interference affects reaction time.

These are observational findings and the authors acknowledge the essential differences between the human brain and large language models.

However, they point out that the uniform failure of all large language models in tasks requiring visual abstraction and executive function highlights a significant area of weakness that could impede their use in clinical settings.

As such, they conclude: "Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients - artificial intelligence models presenting with cognitive impairment."

Dayan R, Uliel B, Koplewitz G.
Age against the machine - susceptibility of large language models to cognitive impairment: cross sectional analysis.
BMJ. 2024 Dec 19;387:e081948. doi: 10.1136/bmj-2024-081948
