ABSTRACT
With the ongoing revolution of artificial intelligence (AI) in medicine, the impact of AI in radiology is more pronounced than ever. An increasing number of technical and clinical AI-focused studies are published each day. As these tools inevitably affect patient care and physician practices, it is crucial that radiologists become more familiar with the leading strategies and underlying principles of AI. Multimodal AI models can combine both imaging and clinical metadata and are quickly becoming a popular approach that is being integrated into the medical ecosystem. This narrative review covers major concepts of multimodal AI through the lens of recent literature. We discuss emerging frameworks, including graph neural networks, which allow for explicit learning from non-Euclidean relationships, and transformers, which allow for parallel computation that scales, highlighting existing literature and advocating for a focus on emerging architectures. We also identify key pitfalls in current studies, including issues with taxonomy, data scarcity, and bias. By informing radiologists and biomedical AI experts about existing practices and challenges, we hope to guide the next wave of imaging-based multimodal AI research.
Main points
• As multimodal artificial intelligence (AI) becomes increasingly integrated into the field of radiology, it is imperative that radiologists become familiar with the existing frameworks, applications, and analyses of such tools.
• Conventional approaches to multimodal AI integration have shown improvement over unimodal approaches in their ability to translate accurately to the clinic.
• Cutting-edge approaches for multimodal biomedical AI applications, such as transformers and graph neural networks, can integrate time series and non-Euclidean biomedical data.
• Key pitfalls of the multimodal biomedical AI landscape include inconsistent taxonomy, a lack of foundational models using varied large-scale representative data sources, and a mismatch between how data is stored across the healthcare arena and the curation that AI models require.
Artificial Intelligence (AI) is revolutionizing everyday life with its advanced capabilities in image processing, textual analysis, and more. Though this technology has only recently gained widespread public attention, its origins are not new. Research into neural networks began in the early to mid-20th century,1 yet mainstream models, such as ChatGPT, which are now frequently cited in scientific literature, have only recently captured public interest.2 Comparable to the emergence of computers in the 1940s, modern AI possesses a long-standing mathematical foundation but is still in its infancy.
The field of radiology is data-heavy, signal-rich, and technology-focused, making it a prime target for building AI applications. Thus, it is crucial that radiologists stay informed about methodological and clinical trends in AI. Radiologists routinely review large amounts of signal-rich data in a multimodal manner, making them well-suited to leverage AI and medical data to enhance diagnostic accuracy. At its core, AI is an extremely thorough pattern-detection system, capable of recognizing patterns beyond human capability for certain tasks. In medical imaging, where ever-growing examination volumes contribute to work overload for practicing radiologists, AI has the potential to be a robust support tool within the radiology ecosystem. However, the introduction of AI raises ethical dilemmas3 and security concerns,4 including data leakage, automated medical decisions, biased data, and clinical impact.
While there is a growing body of literature on biomedical AI, much remains unexplored, particularly in the translation to medical applications. There has been a noticeable shift towards multimodal algorithms that incorporate imaging data with at least one other modality. Nevertheless, literature leveraging multimodal imaging data and clinical covariates remains relatively sparse. For this reason, existing reviews on the topic have generally focused on 1) unimodal AI for imaging alone5-7 or 2) general multimodal deep learning, which is becoming an increasingly heterogeneous field.8-10 This review aims to explore multimodal AI in radiology comprehensively by examining both imaging and clinical variables. Throughout, we assess the methodology and clinical translation to inform future directions and organize approaches within the field.
Modern frameworks and multi-modality fusion techniques
The first focus of this review is the cutting-edge methodologies for multimodal AI. These frameworks are increasingly recognized as impactful approaches in advancing healthcare analytics due to their ability to interpret and integrate disparate forms of medical data, similar to the daily tasks of physicians. A glossary of key terms with definitions is provided (Table 1). Central frameworks aim to model the relationship between data and corresponding clinical outcomes. Transformer-based models and graph neural networks (GNNs) have demonstrated remarkable promise in combining clinical notes,11-13 imaging data,14-16 and genomic information,17-20 enhancing patient care through personalized and precise predictions and recommendations (Figure 1).
Transformers
Initially conceived for natural language processing, transformers have been adapted for other unimodal input data, such as imaging and genomics, and now, for multimodal tasks in healthcare. These models uniquely focus on different data components as needed and are adept at handling sequential data.21 They also employ self-attention mechanisms, allowing for the assignment of weighted importance to different parts of input data, regardless of order. This implementation is especially beneficial for free text or genomic sequencing data, where the significance of a feature greatly depends on its context. These mechanisms have been extended to consider temporal dependencies in electronic health records (EHRs), enabling the model to discern which historical medical events are most predictive of future outcomes.22
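To make the self-attention mechanism concrete, the following minimal sketch (in PyTorch, with hypothetical tensor shapes) shows how scaled dot-product attention assigns weighted importance across a sequence of embedded inputs, such as timestamped EHR events; it is an illustrative teaching example rather than code from any cited study.

```python
# Minimal sketch of scaled dot-product self-attention (PyTorch), assuming a
# sequence of already-embedded EHR events or image patches. Illustrative
# only; not code from any cited study.
import math
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        # Linear projections that produce queries, keys, and values
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, embed_dim); order-agnostic aside from
        # any positional encoding added upstream
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Attention scores: how strongly each element attends to every other
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = scores.softmax(dim=-1)
        # Each output is a weighted mixture of all value vectors
        return weights @ v

# Example: 2 patients, 10 timestamped EHR events, 64-dimensional embeddings
events = torch.randn(2, 10, 64)
attended = SimpleSelfAttention(64)(events)  # shape: (2, 10, 64)
```

In practice, transformer implementations stack many such layers with multiple attention heads and positional encodings, but the core weighting principle is the same.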
Unlike typical recurrent neural networks, transformers are particularly revolutionary in that they employ a parallelized approach, which allows for scalable computation.23 Recurrent neural networks, a popular class of models, handle information sequentially and cannot process it in parallel.24 This foundational difference has led to transformers becoming the basis for large language models, such as BERT25 and ChatGPT, but their application in medicine remains largely unexplored.25, 26 Literature using transformer-based multimodal predictions consistently finds that transformer models outperform typical recurrent or unimodal models.27-30
Despite the success of transformers, most literature features single-case applications, where a particular transformer architecture is optimized for a single clinical outcome.31 One impactful example is the work of Yu et al.,32 who present a framework that learns from imaging, clinical, and genetic information and sets a new benchmark for diagnosing Alzheimer’s disease (area under the receiver operating characteristic curve of 0.993). This work shows how transformers may be able to aid in unifying information across modalities for comprehensive learning in a specific disease space.
The literature on their broader optimization for various clinical or radiology tasks is limited. Khader et al.33 propose a transferrable large-scale transformer approach, showing that it outperforms existing multimodal approaches leveraging convolutional neural networks (CNNs). They attribute their improvement to a novel technical approach, which selectively limits interactions between data inputs. They demonstrate the generalizability of their model by showing improvement across various decisions, including heart failure and respiratory disease prediction, and domains, including fundoscopy images and chest radiographs paired with non-imaging data.33
With the increasing popularity of multimodal data and models, there is a need for technical approaches that are transferrable and widely applicable for clinical use.
Graph neural networks
Although transformer-based models excel at capturing dependencies in sequential data,34 their architecture does not inherently account for non-Euclidean structures present in multimodal healthcare data.23 This gap has led to significant interest in GNNs, which model the data in a graph-structured format. This is particularly relevant to multimodal imaging data, where the relationships and dependencies between data points, such as between an anatomical structure in imaging and a genetic marker or a clinical parameter, are not inherently grid-like and could be more accurately represented by graphical connections (Figure 2).
GNNs extend the concept of convolution from regular grids to graphs, with convolutional operations that aggregate feature information from a node’s neighbors.35 This approach captures global structural information. Unlike CNNs, where the same filter is applied uniformly across an image or matrix, GNNs adaptively learn how to weight the influence of neighboring nodes, making them adept at handling irregular data that does not conform to a fixed grid.36
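The following minimal sketch illustrates this neighbor aggregation with a single graph convolutional layer written in plain PyTorch using a symmetrically normalized adjacency matrix; the node semantics and dimensions are hypothetical and not drawn from any cited study.

```python
# Minimal sketch of a graph convolution layer (plain PyTorch), illustrating
# how node features are aggregated from neighbors. Hypothetical example.
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim) node features; adj: (num_nodes, num_nodes)
        # Add self-loops so each node also retains its own features
        a_hat = adj + torch.eye(adj.size(0))
        # Symmetric degree normalization: D^(-1/2) * A_hat * D^(-1/2)
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        # Aggregate (weighted-average) neighbor features, then transform
        return torch.relu(self.linear(norm @ x))

# Example: a star graph with one patient node linked to four feature nodes
x = torch.randn(5, 16)                 # 16-dimensional features per node
adj = torch.zeros(5, 5)
adj[0, 1:] = 1.0
adj[1:, 0] = 1.0
out = SimpleGraphConv(16, 8)(x, adj)   # shape: (5, 8)
```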
This novelty is rooted in the ability of GNNs to learn from non-Euclidean data, which is crucial for integrating different types of medical information.37 They can explicitly model the complex relationships between modalities, rather than forcing them into grid-like structures, as CNNs do, which may not fully take the underlying structure into account38 and could introduce biases related to artificial adjacency in grid formatting. Although exciting work has recently taken place in medical imaging with GNNs, the bulk of multimodal literature continues to focus on CNNs, requiring tabular fusion in many cases.39 There are several methodologies for fusing modalities.40 However, without a graphical approach, there is potential for misinterpretation of the relationships within the data when modalities are arbitrarily fused in a tabular format. For example, appending a clinical parameter to an image could falsely imply that the parameter is adjacent to the imaging features. In contrast, with a GNN, this relationship can be modeled via nodes in a graphical representation, rather than by appending.
Despite the potential and applicability of GNNs, literature leveraging them in the medical space is scarce, likely due to their novelty and the challenge posed by the varying custom methods for graph construction. One study in the oncologic radiology space used a GNN to predict regional lymph node metastasis in esophageal squamous cell carcinoma patients.41 In their work, Ding et al.41 constructed a graph by mapping learned embeddings across image features and clinical parameters into a feature space, treating them each as nodes. They then used a graphical attention mechanism to learn the weights of the edges connecting the nodes. In another study, Gao et al.20 used a completely different construction method to predict the survival of cancer patients using gene expression data. They constructed a graph by considering each patient’s primary modality encoding (which could be imaging, though they did not use imaging) as a node, with each gene also as a node. Edges connecting the gene nodes to the primary nodes were then weighted by each patient’s level of gene expression. In a third study, Lyu et al.42 demonstrate a successful GNN for predicting drug interactions by building graphs drawing edges between drugs and drug-related entities (such as targets or transporters). These three examples illustrate the complexity of graph construction and the custom nature of GNN methodology, which may explain the scarcity of literature on the topic despite its promise for relating multimodal data and encodings.
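To illustrate how such a custom construction might look in practice, the following simplified, hypothetical sketch builds a patient-level graph that treats imaging-feature embeddings and clinical parameters as separate nodes. It is loosely in the spirit of the constructions described above but is not the implementation used by Ding et al.,41 Gao et al.,20 or Lyu et al.42

```python
# Hypothetical sketch of one way to build a patient-level graph in which
# imaging-feature embeddings and clinical parameters are separate nodes.
# Simplified illustration; not the construction used in any cited study.
import torch

def build_patient_graph(image_embeddings: torch.Tensor,
                        clinical_values: torch.Tensor):
    """image_embeddings: (n_img, d) learned image-region features.
    clinical_values: (n_clin,) scalar clinical parameters (e.g., age, a lab value)."""
    d = image_embeddings.size(1)
    # Naively broadcast each scalar clinical value into the d-dimensional
    # feature space (in practice, a learned embedding/projection would be used)
    clinical_nodes = clinical_values.unsqueeze(1).repeat(1, d)
    nodes = torch.cat([image_embeddings, clinical_nodes], dim=0)

    n_img, n_clin = image_embeddings.size(0), clinical_values.size(0)
    adj = torch.zeros(n_img + n_clin, n_img + n_clin)
    # Connect every clinical node to every image node so that a downstream GNN
    # (e.g., with attention) can learn how strongly each parameter relates to
    # each imaging region
    adj[:n_img, n_img:] = 1.0
    adj[n_img:, :n_img] = 1.0
    return nodes, adj

nodes, adj = build_patient_graph(torch.randn(4, 16), torch.tensor([67.0, 1.2]))
# `nodes` and `adj` could then be passed to a GNN layer such as the one sketched above.
```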
Modality fusion techniques
Despite the emergence of architectures such as GNNs, which can more deliberately represent data interactions, almost all medical data, whether imaging, molecular, or other signals, can be tabulated. Thus, various fusion techniques (methods for concatenating signals or information) are far more commonly used in multimodal literature.9 Fusion techniques can broadly be categorized as early, intermediate/joint, or late fusion. In simple terms, early fusion means that the information is combined before learning via AI occurs, joint fusion means some learning happens before and after combining the two modalities, and late fusion means no learning happens after combining information. Therefore, it can be considered that late fusion aggregates learned information from the two modalities to make a prediction, whereas joint fusion allows for the modalities to interact, and for components of each to have complex relationships in making a prediction. More technically, early fusion generally involves concatenating input modalities into a single vector before feeding them into a model for training. These input modalities can be extracted features or raw data. Joint or intermediate fusion involves concatenating independently learned features prior to further learning. Late fusion generally refers to complete or almost complete learning occurring independently before concatenating vectors for a final activation and prediction. There has also been an emergence of “sketch” fusion, which is similar to early fusion, but rather than concatenation, modalities are translated to a common space. Schematics of early, joint, and late fusion pipelines are presented in Figure 3.
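The following brief sketch (in PyTorch, with hypothetical feature dimensions and layers) makes the three strategies concrete for an imaging feature vector combined with a tabular clinical vector; it is schematic rather than a recommended implementation.

```python
# Schematic sketch of early, joint, and late fusion for an imaging feature
# vector and a tabular clinical vector. Shapes and layers are hypothetical.
import torch
import torch.nn as nn

img = torch.randn(8, 128)   # e.g., extracted image features (batch of 8)
clin = torch.randn(8, 16)   # e.g., tabulated clinical covariates

# Early fusion: concatenate the inputs, then learn from the combined vector
early_model = nn.Sequential(nn.Linear(128 + 16, 64), nn.ReLU(), nn.Linear(64, 1))
early_pred = early_model(torch.cat([img, clin], dim=1))

# Joint (intermediate) fusion: encode each modality separately, then continue
# learning after the learned representations are concatenated
img_enc = nn.Sequential(nn.Linear(128, 32), nn.ReLU())
clin_enc = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
joint_head = nn.Sequential(nn.Linear(32 + 8, 16), nn.ReLU(), nn.Linear(16, 1))
joint_pred = joint_head(torch.cat([img_enc(img), clin_enc(clin)], dim=1))

# Late fusion: each modality is learned (almost) to completion independently;
# only the final outputs are combined for the prediction
img_branch = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1))
clin_branch = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
late_pred = torch.sigmoid(img_branch(img) + clin_branch(clin))
```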
There is a rich and growing base of multimodal models using fusion to combine tabulated free speech,43 genomic,44, 45 or clinical covariate data with images for diagnostics. Kumar et al.43 combined X-ray images with audio data consisting of respiratory sounds and coughs for the diagnosis of coronavirus disease 2019, showing that early detection is possible with 98.91% accuracy by fusing the chest X-ray and cough models. There is limited consensus on the optimal fusion technique, perhaps due to variations in dataset quality, interactions between data sources, or the learning architectures. With many variables at play, developing a comprehensive approach to machine learning fusion, even for a single data type or disease case, becomes challenging. Each fusion strategy may have advantages or disadvantages depending on the application, data set, and model architecture. Often, the best approach is to try all three and compare results. Conceptually, however, the pros and cons primarily depend on the concept of confounding variables. Consider the example of a hypothetical model for lung cancer outcome prediction where there are two modalities, one being clinical risk factors, such as cigarette consumption and obesity, and the second being genomic data. If these two modalities are believed to be additive and independent (non-confounding), it may be preferable for the AI to learn from them separately. In this case, late fusion may be appropriate. If it is believed there is significant crosstalk between the variables (the relationship between them is confounding), early or joint fusion may be more appropriate. Early fusion may be more appropriate when using smaller-scale genomic variant data that checks for a set of known variants that increase risk. Conversely, joint fusion may be more appropriate if the model is expected to learn variants of risk from a large amount of genomic sequencing data. Regardless, it is difficult to determine the optimal fusion strategy from the data alone, and it is often worth exploring multiple approaches.
Although early fusion appears to be the most common fusion type across a variety of fields using imaging or imaging features combined with other modalities,9, 46-51 there are also numerous studies using joint52-54 and late fusion.55 The optimal fusion technique likely depends on the data source, architecture, and other specifics, making consensus challenging. It is important that researchers explore multiple fusion options when designing a multimodal model because, unfortunately, there are no guidelines for multimodal data fusion at this point in the field’s development.
In addition to these common concatenation techniques, there are many other examples of statistical integration methods. When it comes to GNNs, these integration methods can be customized to the relationship between specific modalities and datasets, as previously discussed. There are also many more methods outside the scope of this review, particularly pertaining to other omics data types. For example, mixOmics is a popular package for the integration and analysis of multi-omics data.56 Other cutting-edge examples of multi-omics statistical integration frameworks include Data Integration Analysis for Biomarker discovery using Latent cOmponents (DIABLO) and xMWAS.57, 58
Current status of multimodal imaging work
The existing literature on multimodal AI contains numerous examples of successful multimodal integrations boasting impressive degrees of accuracy and proposed clinical translations.59-69 These publications are promising and show the potential for multimodal AI implementation to improve patient outcomes. As the field progresses, there is an increase in highly curated large-scale data sets, paving the way for foundational models.29, 70 Nevertheless, much of the work in this space and its ability to translate to the clinic is limited by its siloed application, inconsistent taxonomy, and data scarcity.
Multimodal taxonomy
In the broad field of oncology, it is common for physicians to utilize multiple imaging channels to visualize abnormalities and make decisions. It follows that AI models leveraging multiple imaging sequences may be useful for tasks such as detection or segmentation. This raises the question: should combining two images be considered multimodal? Here, attention is drawn to the terms multimodal and multichannel. These terms are used in different and overlapping contexts across multiple disease spaces. In prostate cancer imaging literature, for example, the detection and segmentation of clinically significant prostate cancer are common goals often labeled as “multimodal” when merely integrating multiple magnetic resonance imaging (MRI) sequences, without incorporating fundamentally different data types.71-75 Similar inconsistencies persist across the larger oncology field, including, but not limited to, brain cancer,76 lung cancer,77, 78 and breast cancer.79
The authors suggest that a multimodal model should combine conceptually different modes of information, whereas multichannel may be more appropriate for technically different (but categorically equivalent or similar) modes, as would be the case in fusing two radiologic images, such as multiple MRI sequences or computed tomography (CT) and MRI. Using this loose idea of “conceptually different images”, one may consider combining digital histopathology images with radiomics as multimodal,80 but the examples above (of fusing two radiologic images) would likely be considered multichannel and unimodal. In the authors’ work with deep learning in the prostate cancer space, these image fusion models have been referred to as multichannel rather than multimodal.81, 82 With this pattern being evident across disease spaces, there is a need to clarify the taxonomy as the term “multimodal” becomes increasingly imprecise.
Generalizable models with transferrable application
The multimodal AI space is rapidly expanding but remains highly application-specific, hindering the transition of findings into general practices. Building models that translate across regions and hospitals without bias may be better explored through foundational models that 1) apply to multiple disease spaces, 2) inform future methodological decision-making by outlining the evidence for engineering decisions or by demonstrating that a method is effective beyond a single isolated case, and 3) demonstrate multicenter validation for clinical use with resistance to bias.
This trend is becoming apparent as the unimodal clinical AI space becomes increasingly saturated, and the most impactful publications focus on foundational models through novel technical innovations, such as with DINO,83 DINOv2,84 and iBOT,85 increasingly large datasets, and self-supervised learning to leverage unannotated data.86 This generalizability has yet to become commonplace in multimodal AI, apart from a few key examples. For instance, Khader et al.29 provide a compelling case for multimodal transformers by analyzing 25 conditions using imaging and non-imaging patient data from the Medical Information Mart for Intensive Care (MIMIC), instead of evaluating a single disease case. This publication is an impressive example of using up-to-date methodologies (namely, transformers), baseline comparison to alternate approaches for the same dataset, and analysis of various conditions. They observed improvement through multimodal use across all disease cases and reported appropriate statistical evaluation. Unfortunately, it is not common practice for multimodal papers to present statistics compared with a baseline unimodal model or present evidence of value in including both modalities. Rather, such papers often present a means to an end. By contrast, Khader et al.29 provided a case for a specific method, informing how future researchers should proceed while highlighting multiple translational impacts.
Another study pushing towards generalizable multimodal approaches is that of Soenksen et al.,70 who propose and assess a framework for Holistic AI in Medicine (HAIM) to support the general development and testing of a variety of multimodal AI systems. Leveraging the MIMIC database, they demonstrate improvement in predicting various healthcare operations, including lung lesion detection, 48-hour mortality, and edema. They find that all multimodal inputs improve performance across all predictions. However, no statistical analysis is presented to indicate which of these tasks shows a statistically significant difference. This work pushes the medical field towards cutting-edge and generalizable multimodal research and emphasizes the need to develop a standard of comparison in the field.70
It is noteworthy, but not coincidental, that both models discussed above leverage the same MIMIC database. The MIMIC database is a publicly available repository of EHRs from the Beth Israel Deaconess Medical Center.87, 88 Though each publication attempts to draw data from multiple sources, this highlights the issue of database bias in designing multimodal algorithms.
Dataset curation
Database bias can manifest in various ways. Based on an analysis of the existing landscape, several examples of bias to which the field may be vulnerable are discussed. As other reviews8, 89 and even the original MIMIC-IV publication88 have stated, data in hospitals today is typically stored in systems not conducive to or able to support research, especially data science research. Because these systems are built for security and lag far behind modern standards for user interface design, storage, and ease of access, it is not uncommon to find scanned versions of electronic medical records stored as PDF files, equivalent information stored in various locations at different hospitals, and logging methods that vary between physicians. In other words, there is a significant mismatch between the data format resulting from existing data collection practices across healthcare facilities and the data format necessary for appropriate AI development. These mismatches make it quite challenging to curate datasets such as MIMIC, which require careful planning, financial investment, and an industry-wide shift in how medical data is collected and stored. As a result, models are at risk of being overtrained on the limited existing AI-friendly data.
By using a single center or focusing on training with the handful of carefully curated datasets available, models can “learn” to treat all patients as they would in those specific settings and time periods, regardless of the quality of care one receives at their own institution and the clinical environment of which they are a part. Clinical outcomes can vary significantly depending on the surgeon, environmental exposures, or technology available. For example, patients at the best hospitals in the country may have different outcomes from patients at average hospitals and therefore should be treated differently. Beyond social determinants of health, from a technical perspective, considering that MRI or CT scanners may differ across the country, a model may inadvertently learn that image quality is associated with outcomes or be unable to accurately assess certain images. As with comparing baseline unimodal models, there is a need for guidelines to assess and mitigate bias in AI as it becomes more widespread. Although there are examples of papers identifying or discussing bias,90-93 few propose analytical frameworks to address or measure bias in AI.94 Such publications are varied, and none have become standard practice in the field. Few clinical papers assess bias in clinically specific AI models. Though not multimodal, a machine learning approach was proposed by Chandran et al.95 to predict lung cancer risk using the cross-area under the receiver operating characteristic curve to measure disparities in performance by race and ethnicity. They identify key failures in the model’s ability to determine risk for Asian and Hispanic individuals compared with White and non-Hispanic individuals. The mismatch between the clinical environment and AI-friendly data storage requirements not only results in bias but also makes bias reduction challenging, as curating “representative” data from centers across the entire country is a huge undertaking. The more representative the training data is of the setting in which it is applied, the lower the risk of biased decisions. With evidence that multimodal AI may be more accurate for some applications27,59-69,80,93 and that multimodal data are more challenging to curate consistently across institutions, researchers and physicians face the decision of how to build and employ AI tools when multimodal approaches promise improved overall accuracy but their smaller sample sizes may increase the risk of bias.
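As a simple illustration of subgroup evaluation, the sketch below computes the area under the receiver operating characteristic curve separately for each demographic subgroup on synthetic placeholder data. It conveys the general idea of auditing for performance disparities rather than the specific cross-area-under-the-curve metric used by Chandran et al.95

```python
# Simple sketch of one bias check: comparing model discrimination (AUC)
# across demographic subgroups. Data and group labels are synthetic
# placeholders; this is not the cross-AUC metric from the cited study.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                      # outcome labels
y_score = np.clip(y_true * 0.3 + rng.random(1000), 0, 1)    # model risk scores
groups = rng.choice(["group_a", "group_b", "group_c"], size=1000)

for g in np.unique(groups):
    mask = groups == g
    auc = roc_auc_score(y_true[mask], y_score[mask])
    print(f"{g}: AUC = {auc:.3f} (n = {mask.sum()})")
# Large gaps between subgroup AUCs flag performance disparities that warrant
# further investigation before clinical deployment.
```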
This concern over reliance on a small pool of curated datasets is pressing and needs to be addressed. One systematic review of GNNs based on EHRs reported that out of 50 papers reviewed, 23 used MIMIC-III and 6 used MIMIC-IV.96 With the increasing prevalence of AI research and rapid translation of tools to the clinic, there is a need for a change in how data is stored and collected by healthcare providers across the country. Continuing to develop AI tools on the available pool of high-quality curated datasets, such as MIMIC,87, 88 the UK Biobank,97 EMBED,98 and the Scottish Medical Imaging Archive,99 is risky as tools may be carelessly applied to populations with differing clinical environments or health outcomes. Further, with the medical field being a rapidly changing ecosystem, models and datasets can quickly become less relevant to the current medical system.
Considering the dynamic medical environment and its quickly changing technology and guidelines, AI and the data on which it is trained will have to change as quickly as the clinic. One must be incredibly mindful of the dynamic nature of data when training an AI algorithm. Here, “dynamic” can take on a double meaning. Data can be dynamic in that its surrounding clinical environment changes as knowledge and technology develop. It can also be dynamic in that the information itself changes as a product of aging or biological changes. For example, given that genomic data are stable over time, it is unclear how significant their integration is with dynamic data, such as an imaging phenotype or proteomics, which can change over a person’s life. Imaging data or radiomics data have been integrated both with stable omics100 and dynamic omics for multimodal AI.101 Regardless of whether the biomedical data are dynamic, a change in how data are collected from, and how they impact, the clinic may be just as important to the creation of impactful AI as the data themselves.
AI tools have the potential to combat biases by providing evidence-based recommendations, but they can also exacerbate them. Radiologists and other physicians must understand emerging and existing methods in the field, as well as the importance of data set curation, as they are often the ones making final decisions about how these tools will be used and how they will impact the patient. Aware of the potential for AI to exacerbate biases, radiologists are relied upon to view these tools as exactly what they are: physician support tools. Even if a tool has a proven record of being more accurate than the average physician at, for example, detecting lesions on a certain type of scan, there will still be mistakes, and physicians will need to be able to use these AI tools without catering to their biases. It is difficult to predict exactly what role radiologists will play in the future use and development of AI, but the reality is that they will play one. The better these tools are understood, the easier it will be for physicians to interact with them in a way that improves health. On the flip side, a greater understanding among physicians will allow them to conduct their clinic in a way that is conducive to storing data for training strong bias-mitigated models.
Future directions
Multimodal AI will inevitably continue to develop and be explored through the methodologies, foundational models, and translational integrations discussed in this review and beyond. Although highly developed architectures, methods, and techniques in image processing AI, such as fusion models, transformers, and GNNs, have been explored, the medical field lags in using up-to-date AI innovations and struggles with consistency in taxonomy, evaluation metrics, and methodology, even within the same disease spaces.
The lack of common practices, which will develop and change as the field matures, severely limits progress and translation. It becomes difficult to generalize conclusions from one publication to the next and across methodologies. Standout publications in the multimodal AI space are characterized by their ability to generalize as foundational models with transferrable applications, incorporate physician perspectives with clear and broad clinical utility, and carefully evaluate baseline models using thorough and appropriate evaluation and statistics.
An even more pressing limitation in developing multimodal AI tools with biomedical applications is the lack of comprehensive, high-quality data. As discussed, most reviewed works either rely on a very small set of carefully curated data, which requires extensive time, resources, and funding for AI development, or draw from a select set of high-quality, open-access datasets. By repeatedly using these same high-quality curated datasets, the field is developing a suite of AI-based translational tools heavily biased toward the included locations, periods, and patient populations. With the clinical setting and its outcomes being a constantly changing ecosystem, it is risky to rely on the same datasets. Equitable, bias-free AI will require these systems to be dynamic, constantly updated with new data, and capable of adapting over time with fine-tuning. Technologists and clinicians may have to meet somewhere in the middle, such that technologists will have to build models using less-than-optimal data, and clinicians may have to incorporate certain practices into their data ecosystem to ensure AI models are up to date.
Our narrative review of multimodal AI, combining imaging and other clinical metadata, aims to propose clarifications for what constitutes “multimodal” AI in imaging, identify up-to-date frameworks with potential for enhanced results in future model research, comment on a shift toward generalizable foundational models, and identify trends and concerns in database curation. As the field progresses from theory to clinic, it is essential for radiologists to stay informed about the latest developments, methodologies, and ethical implications.
The current radiologic landscape is characterized by a transition toward multimodal fusion models, with increasing focus on transformers and GNNs. However, there is a considerable amount of work to be done in terms of scientific due diligence regarding gaps in methodology and model training bias. Moreover, the reliance on the few existing high-quality curated datasets highlights a major risk as AI tools become more common in the clinical setting. There is an urgent need to align the format of data required for training AI with that logged by physicians to curate comprehensive training databases.
In conclusion, while AI in radiology promises significant advancements in the field, successful and unbiased integration demands a multidisciplinary approach involving continuous education of physicians and AI developers alike. By informing radiologists, we hope to begin bridging the gap between technology and the clinic, guiding future methodologies, practices for dataset curation, and the field as a whole. By harnessing the power of AI, appropriate evaluation, and physician expertise, we hope to save more lives and improve the quality of care for patients worldwide.
Conflict of interest disclosure
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Barış Türkbey receives support as part of a CRADA between NIH and NVIDIA and between NIH and Philips Healthcare, receives royalties from NIH, and is a party to patents or potential patents related to this work. The remaining authors declare that there are no other disclosures relevant to the subject matter of this article. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. All figures were created and licensed using BioRender.com.