Using approximately 10,000 dermatoscopic images, the study focused on three common skin diseases: melanoma, melanocytic nevi, and benign keratosis-like lesions. Both ChatGPT-4 and LLaVA outperformed most traditional deep learning models overall. However, ChatGPT-4 was fairer across demographic groups, whereas LLaVA exhibited significant sex-related biases.
Dr. Wan emphasized, "While large language models like ChatGPT-4 and LLaVA demonstrate clear potential in dermatology, we must address the observed biases, particularly across sex and age groups, to ensure these technologies are safe and effective for all patients."
The team plans further research that incorporates additional demographic variables, such as skin tone, to evaluate the fairness and reliability of AI models in clinical scenarios more comprehensively. This research provides critical guidance for developing more equitable and trustworthy medical AI systems.
Wan Z, Guo Y, Bao S, Wang Q, Malin BA. Evaluating Sex and Age Biases in Multimodal Large Language Models for Skin Disease Identification from Dermatoscopic Images. Health Data Sci. 2025 Apr 1;5:0256. doi: 10.34133/hds.0256