ABSTRACT
PURPOSE
This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions.
METHODS
This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. Correct answers were compared using McNemar’s test, and accuracy across question types was compared using the chi-squared test. Management scores were analyzed using the Kruskal–Wallis and Wilcoxon tests.
RESULTS
Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and the general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across question categories showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than the other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories for all multimodal LLMs except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly, and ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05).
CONCLUSION
Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists’ supervision to avoid misdiagnoses.
CLINICAL SIGNIFICANCE
This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.
Main points
• This study evaluated the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System categories and providing clinical management recommendations. The evaluation involved two steps: assessing LLMs on text-based multiple-choice questions (MCQs) and evaluating multimodal LLMs on breast ultrasound images.
• Claude 3.5 Sonnet and ChatGPT 4o achieved high accuracy rates of 90% and 89%, respectively, in text-based MCQs, outperforming general radiologists, who had accuracy rates of 78% and 76%. This demonstrates the strong potential of these advanced LLMs in supporting and enhancing the diagnostic accuracy of radiologists in text-based assessments.
• Multimodal LLMs showed lower accuracy in evaluating breast ultrasound images, with Claude 3.5 Sonnet achieving only 59% accuracy. This highlights a critical limitation in their current ability to handle visual diagnostic tasks effectively compared with text-based assessments.
• The study underscores the necessity for further development of multimodal LLMs to improve their visual diagnostic capabilities. Until these improvements are realized, the use of multimodal LLMs in clinical practice should be closely supervised by experienced radiologists to prevent potential misdiagnoses and ensure patient safety.
The emergence of large language models (LLMs) marks a transformative milestone in the development of artificial intelligence (AI). These models offer unprecedented potential for understanding and generating human-like text by leveraging extensive datasets. This technological advancement holds significant promise for application in medicine.1, 2 As radiology increasingly relies on the interpretation of complex imaging data, the integration of advanced AI tools, such as LLMs, becomes crucial to enhance diagnostic accuracy and streamline workflows. LLMs have demonstrated remarkable performance in various realms of radiology, including testing radiological knowledge in different board-style examinations, simplifying radiology reports, and providing patient information.3-7
Recent studies have also explored the potential of LLMs specifically in breast imaging, where their capabilities show particular promise.8-10 For instance, Rao et al.9 evaluated the performance of two well-known LLMs, ChatGPT 3.5 and ChatGPT 4, in adhering to the American College of Radiology (ACR) eligibility criteria for breast pain and breast cancer screening, revealing impressive accuracy rates of 88.9% and 98.4%, respectively. These findings highlight the potential of LLMs as supportive tools in breast imaging, which is especially relevant given the ongoing radiologist shortages and the increasing volume of imaging studies.11, 12 Despite these advancements, it is crucial to acknowledge the limitations and challenges associated with LLMs, including their susceptibility to generating plausible-sounding but incorrect answers (hallucinations).13
The Breast Imaging Reporting and Data System (BI-RADS) Atlas, released in its latest edition in 2013, has provided standardized nomenclature, report organization, assessment structure, and a classification system for mammography, ultrasound, and magnetic resonance imaging (MRI) of the breast.14 The BI-RADS Atlas is crucial for radiologists as it standardizes breast imaging terminology and reporting, ensuring clear communication and consistent, accurate patient management.15
While the BI-RADS Atlas offers a standardized approach to breast imaging, recent research has begun exploring how LLMs can further enhance radiological assessment and reporting accuracy. Haver et al.16 demonstrated that ChatGPT 4 accurately predicted the BI-RADS category in 73.6% of 250 fictitious breast imaging reports. Cozzi et al.17 evaluated the concordance between different LLMs (ChatGPT 3.5, ChatGPT 4, and Google Bard) and radiologists across 2,400 reports in three different languages, revealing a moderate agreement (Gwet’s agreement coefficient: 0.52–0.42). Despite the growing emphasis on the importance of LLMs in breast imaging, there is a significant gap in the literature regarding the evaluation of multimodal LLMs’ performance on breast ultrasound images. Additionally, no studies have compared LLMs’ knowledge of the BI-RADS Atlas with that of radiologists. Hence, the first aim of this study is to evaluate the performance of ten LLMs compared with breast and general radiologists on text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas, 5th edition. The second aim is to assess the capability of five multimodal LLMs in assigning BI-RADS categories and providing clinical management recommendations for breast ultrasound images.
Methods
Study design
This cross-sectional observational study had two steps. In the first step, ten LLMs, namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K, were compared with two general radiologists and a breast radiologist in answering MCQs regarding the 5th edition of the BI-RADS Atlas.
In the second step, the study compared different multimodal LLMs, namely ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro. This step focused on determining the correct BI-RADS category and clinical management by evaluating breast ultrasound images. An overview of the workflow is shown in Figure 1.
The study did not require ethics committee approval as it relied solely on fictional MCQs and a publicly available breast ultrasound dataset that had no identifiable patient information. Its design conformed to the principles articulated in the Standards for Reporting Diagnostic Accuracy Studies statement.18
Data collection for breast multiple-choice questions
The ACR published the 5th edition of the BI-RADS Atlas in 2013 to standardize terminology and reporting organization in breast radiology.14 General radiologist 3 (Y.C.G.) prepared 100 MCQs based on the information in this atlas, covering ultrasound, mammography, MRI, and general BI-RADS knowledge. Each question had four choices, with only one correct answer and three distractors. The distractors were carefully chosen to be reasonable and related to the question. Each question was formulated to be clear and focused on a single concept to assess breast radiology knowledge. The questions were categorized according to the BI-RADS Atlas sections as follows: 16 on breast ultrasound, 39 on mammography, 22 on breast MRI, and 23 on general BI-RADS knowledge. All created MCQs are listed in Supplementary Material 1.
Design of input–output procedures and performance evaluation for large language models
The input prompt was initiated as follows: “I am working on a breast radiology quiz and will provide you MCQs. Act like a radiology professor with 30 years of expertise in breast imaging. Please indicate the correct answer. There is only one correct answer.” This prompt was presented in April 2024 to eight LLMs through their respective platforms with default parameters: OpenAI’s ChatGPT 4 and 3.5 (https://chat.openai.com), Google Gemini 1.5 Pro and 1.0 (https://gemini.google.com/), Microsoft Copilot (https://copilot.microsoft.com) (Balanced), Perplexity (https://perplexity.ai), Claude 3 Opus (https://claude.ai), and Claude 3 Opus 200K (https://poe.com). The same prompt was presented to OpenAI’s ChatGPT 4o (https://chat.openai.com) in May 2024 and Claude 3.5 Sonnet (https://claude.ai) in July 2024 (Figure 2). Specific settings, such as temperature and randomness, were left at their default values unless specified otherwise by the platform.
The MCQs were sequentially added to the same chat session by copying and pasting from the MCQs list; a new chat tab was not opened for individual questions. General radiologist 3 presented all 100 questions to each LLM and recorded the responses. It is important to note that the LLMs were not pre-trained or fine-tuned with a specific prompt or question set for this study.
Radiologist 3 evaluated LLMs’ answers according to the correct answer list, marking them either correct (1) or incorrect (0).
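For readers who wish to reproduce a comparable text-based evaluation programmatically rather than through the chat interfaces used here, the following Python sketch illustrates one possible setup. It assumes API access (shown with the OpenAI Python client), a hypothetical local question file (birads_mcqs.json), and a naive answer-letter parser; it is an illustration of the single-session, sequential-question design, not the authors’ actual pipeline.

```python
# Illustrative sketch only: the study submitted questions manually through web chat
# interfaces. File name, model name, and the answer-letter parsing are assumptions.
import json
import re
from openai import OpenAI

SYSTEM_PROMPT = (
    "I am working on a breast radiology quiz and will provide you MCQs. "
    "Act like a radiology professor with 30 years of expertise in breast imaging. "
    "Please indicate the correct answer. There is only one correct answer."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical file: [{"question": "...", "answer": "B"}, ...]
with open("birads_mcqs.json") as f:
    mcqs = json.load(f)

messages = [{"role": "system", "content": SYSTEM_PROMPT}]
scores = []
for item in mcqs:
    messages.append({"role": "user", "content": item["question"]})
    response = client.chat.completions.create(
        model="gpt-4o",          # default parameters, mirroring the study's settings
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})  # keep the single-session context
    match = re.search(r"\b([A-D])\b", reply)                   # naive extraction of the option letter
    predicted = match.group(1) if match else None
    scores.append(1 if predicted == item["answer"] else 0)     # 1 = correct, 0 = incorrect

print(f"Accuracy: {sum(scores)}/{len(scores)}")
```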
Radiologists’ performance evaluation for breast multiple-choice questions
Two European Board of Radiology-certified junior general radiologists, radiologist 1 (T.C.) and radiologist 2 (E.Ç.), each with 6 years of experience, and a breast radiologist (L.G.K.) with 13 years of experience independently and blindly assessed the MCQs on their own computers. All three answered the questions in separate sessions. Upon completion of all questions, radiologist 3 evaluated their answers according to the correct answer list, marking them either correct (1) or incorrect (0).
Multimodal large language models and visual breast ultrasound questions
The publicly available Breast Ultrasound Images dataset was utilized to assess the performance of multimodal LLMs with breast ultrasound images.19 This dataset comprises 780 images classified as normal, benign, and malignant, sourced from 600 women aged 25–75 years. The images were acquired at Baheya Hospital in Cairo, Egypt, using the LOGIQ E9 and LOGIQ E9 Agile ultrasound systems (GE Healthcare, Wauwatosa, WI, USA) with an ML6-15-D Matrix linear probe (1–5 MHz transducer), and were stored in PNG format with dimensions of 500 × 500 pixels.19
The breast radiologist selected 20 images for each BI-RADS category from 1 to 5, resulting in a total of 100 images. These BI-RADS categories served as the reference standard. The images were presented to five multimodal LLMs: Claude 3.5 Sonnet, Claude 3 Opus, Google Gemini 1.5 Pro, ChatGPT 4o, and ChatGPT 4V.
For each image, the multimodal LLMs received the following prompt: “I am working on a breast radiology quiz and will provide you with breast ultrasound images. Please act as a radiology professor with 30 years of expertise in breast imaging. Evaluate the images and assign only one correct BI-RADS category from BI-RADS 1 to BI-RADS 5 according to the 5th edition of the BI-RADS Atlas. Lastly, provide clinical management recommendations for each category according to the same Atlas” (Figure 3).
This evaluation was conducted in July 2024, with each image presented along with the prompt, using the multimodal LLMs’ default parameters.
The management recommendations provided by the multimodal LLMs, based on the BI-RADS categories, were evaluated using a 3-point Likert scale defined as the Management Score:
• 3 points: Correct management recommendations according to the BI-RADS category
• 2 points: Partially correct management recommendations according to the BI-RADS category
• 1 point: Completely incorrect management recommendations according to the BI-RADS category
Radiologist 3 provided the images and prompts to the multimodal LLMs and recorded their responses. The breast radiologist then classified each BI-RADS category response as correct (1) or incorrect (0) and scored the clinical management recommendations using the Management Score.
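As an illustrative sketch only (not the authors’ analysis code), the recorded BI-RADS accuracies and Management Scores could be summarized as follows; the CSV file and column names ("model", "birads_reference", "correct", "management_score") are hypothetical.

```python
# Minimal sketch for summarizing the recorded scores under the assumptions above.
import pandas as pd

scores = pd.read_csv("multimodal_scores.csv")  # hypothetical file name

# Overall and per-BI-RADS-category accuracy for each multimodal LLM
overall_acc = scores.groupby("model")["correct"].mean()
category_acc = scores.pivot_table(index="model",
                                  columns="birads_reference",
                                  values="correct",
                                  aggfunc="mean")

# Mean Management Score (3-point Likert scale) per model
mgmt = scores.groupby("model")["management_score"].agg(["mean", "std"])

print(overall_acc, category_acc, mgmt, sep="\n\n")
```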
Statistical analysis
The distribution of variables was assessed using the Kolmogorov–Smirnov test. Descriptive statistics were presented as percentages. Owing to the non-normal distribution of the data, non-parametric tests were employed for quantitative comparisons. The Kruskal–Wallis test was used to compare quantitative data, with Tamhane’s T2 test for post-hoc multiple comparisons. McNemar’s test was used to compare the proportions of correct responses between respondents. The chi-squared test was used to compare correct answers by question type. The Wilcoxon test was used to compare the Management Scores of the multimodal LLMs. Statistical analyses were performed using SPSS 26.0 (IBM, USA), and statistical significance was set at P < 0.05.
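For transparency, the sketch below shows open-source equivalents (SciPy/statsmodels) of the named tests on placeholder data; the study itself used SPSS, and Tamhane’s T2 post-hoc test has no direct SciPy counterpart, so it is not reproduced here.

```python
# Hedged sketch of the named tests on placeholder (randomly generated) data.
import numpy as np
from scipy.stats import chi2_contingency, kruskal, wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Paired correct/incorrect (1/0) answers of two respondents on the same 100 MCQs
a = np.random.randint(0, 2, 100)   # placeholder data
b = np.random.randint(0, 2, 100)
table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
         [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
print(mcnemar(table, exact=True))               # paired comparison of correct proportions

# Chi-squared test of correct answers by question category (US/mammography/MRI/general)
contingency = [[14, 35, 20, 20],                # correct answers per category (placeholder counts)
               [2, 4, 2, 3]]                    # incorrect answers per category
print(chi2_contingency(contingency))

# Kruskal-Wallis across more than two groups and Wilcoxon for paired Management Scores
m1 = np.random.randint(1, 4, 100)               # placeholder Likert scores, model 1
m2 = np.random.randint(1, 4, 100)               # placeholder Likert scores, model 2
m3 = np.random.randint(1, 4, 100)               # placeholder Likert scores, model 3
print(kruskal(m1, m2, m3))
print(wilcoxon(m1, m2))
```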
Results
Accuracy of large language models on text-based breast multiple-choice questions
The highest accuracy among the LLMs was achieved by Claude 3.5 Sonnet (90%), followed by ChatGPT 4o (89%) and Claude 3 Opus 200K (84%). Claude 3 Opus had an accuracy rate of 82%, and ChatGPT 4 had an accuracy rate of 79%. The diagnostic accuracy of the breast radiologist was 82%, that of radiologist 1 was 78%, and that of radiologist 2 was 76%. Google Gemini 1.5 Pro had a 67% accuracy rate and Microsoft Copilot 65%, while both Google Gemini 1.0 and Perplexity scored 61%, and ChatGPT 3.5 scored 60% (Figure 4).
Claude 3.5 Sonnet achieved the highest accuracy rate among the evaluated LLMs, outperforming most models with a statistically significant difference (P < 0.05), except when compared with ChatGPT 4o and Claude 3 Opus. Both Claude 3.5 Sonnet and ChatGPT 4o also surpassed the accuracy of the general radiologists (P < 0.05), although their performance was comparable with that of the breast radiologist (P > 0.05). Additionally, no significant differences were observed between the breast radiologist and the general radiologists (P > 0.05).
When comparing the LLMs Claude 3 Opus 200K, Claude 3 Opus, and ChatGPT 4 with the radiologists, there were no statistically significant differences (P > 0.05); however, these models showed significant superiority over lower-performing LLMs, namely Google Gemini 1.5 Pro, Microsoft Copilot, and ChatGPT 3.5 (P < 0.001). No significant differences were found between the performances of the LLMs and radiologists across different question categories (P > 0.05). Detailed comparisons of the performance between radiologists and LLMs are shown in Table 1, while the performance across question categories is illustrated in Figure 5 and Table 2.
Accuracy of multimodal large language models on visual breast ultrasound questions
In a visual test consisting of 100 questions on breast ultrasound images, Claude 3.5 Sonnet achieved an accuracy rate of 59%, ChatGPT 4o 39%, Google Gemini 1.5 Pro 31%, ChatGPT 4V 20%, and Claude 3 Opus 19% (Figure 6). The performance of Claude 3.5 Sonnet was significantly higher than that of the other multimodal LLMs (P < 0.05). While there was no significant difference in performance between ChatGPT 4o and Google Gemini 1.5 Pro (P = 0.067), Claude 3 Opus and ChatGPT 4V had significantly lower performance (P < 0.05) (Table 3).
The accuracy rates of each model by BI-RADS categories were analyzed using the chi-squared test. The statistical analysis revealed that only Claude 3 Opus’s accuracy rate did not vary by BI-RADS categories (P = 0.992); for other models, accuracy rates showed significant variation by category (P < 0.05) (Table 4).
In post-hoc tests:
• Claude 3.5 Sonnet had a higher accuracy rate for BI-RADS 5 questions (85%) compared with other categories (P = 0.001), while its accuracy rate for BI-RADS 1 questions (35%) was lower compared with other categories (P = 0.001).
• Google Gemini 1.5 Pro’s accuracy rate for BI-RADS 5 questions (0%) was lower compared with other categories (P < 0.001).
• ChatGPT 4V had a higher accuracy rate for BI-RADS 5 questions (45%) compared with other categories (P = 0.001), but a lower accuracy rate for BI-RADS 1 questions (0%) (P = 0.012).
• ChatGPT 4o had a higher accuracy rate for BI-RADS 2 questions (65%) compared with other categories (P = 0.007) (Figure 7).
Accuracy of multimodal large language models on clinical management recommendations
The mean Management Score of Claude 3.5 Sonnet (mean: 2.12 ± 0.97) was significantly superior to that of all other multimodal LLMs (P < 0.05). The mean Management Score of ChatGPT 4o (mean: 1.78 ± 0.98) was not significantly different from Google Gemini 1.5 Pro (mean: 1.64 ± 0.93), but it outperformed ChatGPT 4V (mean: 1.40 ± 0.80) and Claude 3 Opus (mean: 1.42 ± 0.81) (P < 0.05). The details of the Management Score are given in Supplementary Material 2.
Discussion
This study aimed to evaluate the performance of LLMs and multimodal LLMs in breast radiology knowledge. The most striking finding of our study is that although LLMs excel at text-based questions, their performance in evaluating real-life case images is not as successful. Multimodal LLMs fall short compared with their text-based counterparts. Considering that real clinical cases are often complex and diagnoses are made through visual assessment by physicians, multimodal LLMs have not yet demonstrated sufficient performance to be used as clinical decision support systems in real-world settings.
Claude 3.5 Sonnet demonstrated the highest accuracy rate, achieving 90% in answering BI-RADS Atlas 5th edition questions. Following closely were ChatGPT 4o and Claude 3 Opus 200K with accuracy rates of 89% and 84%, respectively, while ChatGPT 4 achieved an accuracy rate of 79%. Among the radiologists, the breast radiologist exhibited the best performance with an accuracy rate of 82%, followed by general radiologist 1 with 78% and general radiologist 2 with 76%. Claude 3.5 Sonnet demonstrated superior performance compared with all other LLMs, except for ChatGPT 4o and the Claude 3 Opus models (P < 0.05). The performance of Claude 3.5 Sonnet and ChatGPT 4o did not differ significantly from that of the breast radiologist (P > 0.05), but they notably outperformed both general radiologists (P < 0.05).
No statistically significant differences were found between ChatGPT 4o, Claude 3 Opus 200K, Claude 3 Opus, and ChatGPT 4 (P > 0.05). These LLMs, along with both the breast and general radiologists, performed significantly better than ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, and Perplexity (P < 0.05).
While interpreting real-life breast ultrasound images, Claude 3.5 Sonnet achieved an accuracy rate of 59%, ChatGPT 4o 39%, Google Gemini 1.5 Pro 31%, ChatGPT 4V 20%, and Claude 3 Opus 19%. Claude 3.5 Sonnet outperformed all the other multimodal LLMs (P < 0.05). The diagnostic performance of the multimodal LLMs differed significantly across BI-RADS categories, except for Claude 3 Opus. Claude 3.5 Sonnet (85%) and ChatGPT 4V (45%) showed superior performance in the BI-RADS 5 category (P = 0.001), while ChatGPT 4o showed a higher accuracy rate (65%) for BI-RADS 2 questions (P = 0.007). Gemini 1.5 Pro did not correctly answer any questions in the BI-RADS 5 category, and ChatGPT 4V did not correctly answer any questions in the BI-RADS 1 category, making them the least accurate in these respective categories (P < 0.05).
In the Management Score, which compares the recommendations of multimodal LLMs according to BI-RADS categories, Claude 3.5 Sonnet (mean: 2.12 ± 0.97) outperformed all other multimodal LLMs (P < 0.05).
Notably, our study is the first to evaluate the diagnostic performance of multimodal LLMs on breast radiology visual cases. Moreover, it is the first to demonstrate the performance of the newly released Claude 3.5 Sonnet and ChatGPT 4o in breast radiology. Furthermore, no other studies have evaluated the proficiency of different LLMs in breast radiology MCQs, both in internal comparisons and when compared with radiologists.
Multimodal LLMs such as Claude 3.5 Sonnet and ChatGPT 4o may perform as well as or better than a breast radiologist on text-based questions, but they can make critical errors when questions involve images that impact clinical management. For example, Gemini 1.5 Pro failed to recognize any cases in the BI-RADS 5 category, and ChatGPT 4V could not identify any normal images in the BI-RADS 1 category. This finding suggests that using multimodal LLMs without an experienced radiologist in clinical practice could lead to misdiagnoses, either missing critical conditions or misinterpreting normal findings as pathological.
On the other hand, the superior performance of LLMs on text-based questions compared with general radiologists suggests that they could serve as a supportive tool, especially for junior radiologists. They can aid in the correct use of BI-RADS nomenclature and proper classification.
When multimodal LLMs correctly identify an image and assign an appropriate BI-RADS score, their management recommendations for patients closely align with the BI-RADS categories. Therefore, their success with text-based questions indicates that if they can visually determine the correct BI-RADS category, they are likely to provide accurate clinical management advice.
The variability in LLM text-based performance may be due to differences in training designs, such as different datasets, model architectures, and fine-tuning techniques.20 LLMs such as Microsoft Copilot, Google Gemini 1.0, and Perplexity, which have internet access, sometimes provide arbitrary answers based on non-scientific information they reference.21 This could explain their lower performance compared with other LLMs. ChatGPT and Claude 3 Opus models are trained on closed datasets, and it is unclear whether the BI-RADS Atlas was used in their training. Memorization may contribute to their high performance.
Several studies have explored the performance of LLMs on text-based radiology questions.22, 23 For instance, Almeida et al.22 found that ChatGPT 4 achieved a 76% accuracy rate on mammography questions during the Brazilian radiology board examination, compared with 65% for ChatGPT 3.5. Our study showed higher accuracy rates, with ChatGPT 4 at 79% and ChatGPT 4o at 89%, suggesting that the difference in question difficulty may account for this variance. Furthermore, ChatGPT 4 demonstrated a general accuracy rate of 58.5%, surpassing that of 2nd-year radiology residents (52.8%) but falling short of 3rd-year residents (61.9%) in the ACR Diagnostic Radiology In-Training (DXIT) examination.23 However, with only 10 breast radiology questions, the DXIT exam may not fully capture overall performance in this specialty. In contrast, our study’s focus on a comprehensive set of BI-RADS Atlas questions resulted in higher accuracy rates, underscoring that LLM performance is greatly influenced by both the specificity and quantity of the questions.
Rao et al.9 observed that ChatGPT 4 outperformed ChatGPT 3.5 on select-all-that-apply questions related to breast pain and cancer screening, with both models performing better on these MCQs than on open-ended ones. This aligns with our findings, where the use of MCQs with a single correct answer likely contributed to the elevated success rates of LLMs. In a different context, Haver et al.24 demonstrated ChatGPT’s ability to simplify responses to frequently asked questions about breast cancer prevention and screening, achieving a 92% simplification rate. Our study, which focused on more technical and specific questions, showed that ChatGPT 4 had an accuracy rate of 79%, while ChatGPT 4o performed even better, with an accuracy rate of 89%.
When comparing the performance and readability of different LLMs, Tepe and Emekli25 found that responses generated by Gemini 1.0 and Microsoft Copilot achieved higher readability scores (P < 0.001), whereas ChatGPT 4 demonstrated superior accuracy (P < 0.001). Our study confirmed these results, showing that ChatGPT 4 outperformed both Gemini 1.0 and Microsoft Copilot in terms of accuracy. Similarly, Griewing et al.26 noted a 58.8% concordance between breast tumor board decisions and those generated by ChatGPT 3.5 and 4, with Sorin et al.27 reporting a 70% agreement for ChatGPT 3.5. These findings suggest a partial alignment between LLMs and radiologists in clinical decision-making, though the variations in performance are likely due to differences in study designs and the prompts used. These studies collectively suggest that although LLMs show promise, their current performance may not yet be adequate for seamless integration into clinical decision support systems.
The challenges LLMs face in interpreting visual questions are evident in several studies.28-30 Horiuchi et al.30 conducted a study involving 106 musculoskeletal radiology cases, comparing the performance of ChatGPT 4 on text-based questions with ChatGPT 4V on visual questions. ChatGPT 4 correctly answered 46 out of 106 questions, significantly outperforming ChatGPT 4V, which correctly answered only 9 out of 106 (P < 0.001). Similarly, Dehdab et al.28 evaluated ChatGPT 4V’s performance on chest computed tomography slices across 60 different cases, including coronavirus disease-2019, non-small cell lung cancer, and control cases, finding an overall diagnostic accuracy of 56.76%, with variability depending on the case type.
In breast radiology, Haver et al.29 evaluated ChatGPT 4V’s performance on 151 mammography images from the ACR BI-RADS Atlas, reporting an accuracy rate of 28.5% (43/151). Although ChatGPT 4V correctly identified more than 50% of cases involving mass shape, architectural distortion, and associated features, it performed poorly on calcifications, intramammary lymph nodes, skin lesions, and solitary dilated ducts, with less than 15% correct responses.29 In our study, ChatGPT 4V similarly showed low performance, correctly answering only 20% of breast ultrasound questions. Notably, it had an accuracy rate of 45% (9/20) for BI-RADS 5 lesions but failed to correctly identify any BI-RADS 1 images (0/20), indicating a tendency to misinterpret normal parenchymal tissue as pathology.
Nonetheless, as LLMs and multimodal LLMs continue to rapidly evolve and newer, more advanced models emerge, they are poised to become supportive tools for radiologists in the future. However, ethical considerations, such as ensuring patient privacy and obtaining informed consent from patients involved in the integration of LLMs into clinical decision support systems, are paramount.31 Moreover, the lack of transparency in the decision-making mechanisms of LLMs during the diagnostic process is a significant concern.32 Therefore, it is imperative that LLMs and multimodal LLMs are utilized under the supervision of a responsible radiologist to ensure their contribution to the diagnostic process aligns with the highest standards of patient care and safety.
An intriguing finding of our study is the notable performance of the recently introduced Claude 3.5 Sonnet, which closely rivals ChatGPT 4o. This suggests that the Claude models hold promise in the medical domain as well. Furthermore, our study contributes significantly to the existing literature by evaluating the performance of various LLMs, including both free and paid versions, alongside radiologists in the realm of breast radiology.
While our study offers valuable insights into LLMs’ and multimodal LLMs’ understanding of the BI-RADS Atlas, it has limitations. First, the number of text-based questions was limited, and they were presented only in an MCQ format. Given LLMs’ capacity to handle open-ended questions, which better reflect real clinical scenarios, further research comparing LLM performance on open-ended questions and MCQs is warranted. Second, our evaluation of multimodal LLMs used only breast ultrasound images. Further research should include ultrasound, mammography, and MRI images to better understand the comprehensive capabilities of multimodal LLMs across imaging modalities. Finally, our study employed a single prompt to assess performance, highlighting the need for research into the impact of different prompts and prompt settings on LLMs’ performance in breast radiology.
In conclusion, although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show potential in supporting radiologists with text-based BI-RADS assessments, their current limitations in visual diagnostics suggest that these tools should be used with caution and under the supervision of experienced radiologists to avoid misdiagnoses.
Conflict of interest disclosure
The authors declared no conflicts of interest.