ABSTRACT
Radiomics aims to improve clinical decision making through the use of radiological imaging. However, the field is challenged by reproducibility issues arising from variability in imaging and in the subsequent statistical analysis, which particularly affects the interpretability of the resulting models. Radiomics extracts many highly correlated features which, combined with the small sample sizes typical of radiomics studies, result in high-dimensional datasets. These datasets, characterized by having more features than samples, have different statistical properties from other datasets and are therefore harder to model with machine learning and deep learning methods. This review critically examines both reproducibility and interpretability challenges, beginning with an overview of the radiomics pipeline, followed by a discussion of imaging and statistical reproducibility issues. It further highlights how limited model interpretability hinders clinical translation. The discussion concludes that these challenges could be mitigated by following best practices and by creating large, representative, and publicly available datasets.
Main points
• Radiomics is impeded by imaging and statistical reproducibility issues.
• Machine and deep learning modeling is complicated and requires extensive validation.
• Radiomic features found to be predictive in modeling often do not correspond to biomarkers due to high correlation, limiting their interpretability.
• Standardization practices and larger, more diverse datasets are important to improve reproducibility.
Radiomics is a recent field that uses “an automated high-throughput extraction of large amounts of quantitative features of medical images.”1-3 The method “converts imaging data into a high dimensional mineable feature space using a large number of automatically extracted data-characterization algorithms.”4
The above definition may seem complex, but it can be succinctly summarized. Similar to how clinical routine involves characterizing a patient using parameters such as age, weight, and hemoglobin levels, radiological images can be analyzed to extract analogous parameters (also called features) that ideally describe the pathology of interest. For example, in the case of a tumor lesion, features such as its volume and diameter can be measured. A critical aspect of radiomics is the extraction of not only morphological features but also the distribution of intensity and texture. This includes, for instance, assessing whether the lesion has high brightness and a homogeneous or coarse texture, and identifying the presence of bright spots. Radiomics involves the extraction of hundreds to thousands of such features to accurately represent the lesion. These features are subsequently used to train a classifier that, based on the characteristics of a new lesion, can determine, for example, whether the lesion is benign.
The main expectation of radiomics is that these features can serve as surrogates for biomarkers, and thus aid clinical decision making. Radiological imaging could reflect the underlying biological processes, allowing for indirect conclusions. For example, while necrotic cells are not directly observable in computed tomography (CT) scans, their presence may result in the appearance of a hypodense lesion (Figure 1). Thus, measuring the overall intensity of a lesion could be used as an indicator of cell necrosis.
Although radiomics as a field only emerged in the 2010s,1, 5 the idea can be traced back much further. In a seminal paper published in 1978, Harlow et al.6 introduced concepts that are strikingly similar. Later, specifically in the 1990s, similar techniques were introduced as texture analysis.7 This is no coincidence, since the underlying idea of applying machine learning to imaging is the same and dates back to the 1960s.
The primary purpose of radiomics is to support clinical decisions. Ideally, the extracted features provide insights that humans cannot see or systematically process, allowing clinicians to answer questions using this hidden information. Radiomics has also been used to non-invasively identify genetic alterations or gene expression patterns that can be used to predict the outcome or survival risk of patients with cancer.8-10
In this review, the basic concepts of radiomics are first introduced, followed by a detailed discussion of the two major reproducibility issues that persist in the current field. Subsequently, radiomics based on deep neural networks is briefly outlined and the issues involved in its application are examined. Finally, strategies for avoiding these issues are discussed.
The radiomics pipeline
As with any study, the first step in a radiomics model is to define patient cohorts, applying reasonable inclusion and exclusion criteria that reflect the target population, and defining an outcome of clinical interest.
The application of radiomics to data is technical but relatively straightforward (Figure 2).11 Images are first acquired and the region of interest (ROI) is segmented. This can be a tumor lesion or an entire organ, such as the whole prostate. The ROI plays a critical role in directing the analysis to relevant areas, thereby preventing other unrelated regions from potentially confounding the analysis.
The images are then pre-processed depending on the use case. For example, magnetic resonance imaging may require a normalization step, and CT may be thresholded to a Hounsfield unit range of interest. In addition, preprocessing filters are applied. For instance, smoothing filters can reduce noise that may adversely affect features, whereas wavelet filters can decompose the image into high-frequency and low-frequency components that may carry different information, aiding subsequent analysis.
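To make this concrete, a minimal preprocessing sketch in Python is shown below (NumPy, SciPy, and PyWavelets are assumed to be available); the Hounsfield-unit window, smoothing strength, and wavelet are illustrative choices, not recommendations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
import pywt  # PyWavelets, used here for the wavelet decomposition

def preprocess_ct(image_hu, hu_window=(-100, 300), smoothing_sigma=1.0):
    """Clip a CT volume to a Hounsfield-unit window of interest and smooth it."""
    clipped = np.clip(image_hu, *hu_window)                      # restrict to the HU range of interest
    smoothed = gaussian_filter(clipped, sigma=smoothing_sigma)   # reduce noise
    return smoothed

# Wavelet filtering: decompose the volume into low- and high-frequency sub-bands.
# Each sub-band can then be passed to the feature extractor as an additional derived image.
image = np.random.normal(40, 20, size=(64, 64, 64))  # stand-in for a real CT volume
subbands = pywt.dwtn(preprocess_ct(image), wavelet="coif1")
print(sorted(subbands))  # 'aaa', 'aad', ..., 'ddd' approximation/detail combinations
```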
Next, features are extracted from the ROI. This is a central step, and there are three main types of generic features that are extracted: morphological features, such as volume or sphericity; intensity features, which measure the distribution of values, such as mean brightness; and texture features that reflect the co-occurrence of intensity values.
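As an illustration, the sketch below uses PyRadiomics, one widely used open-source extractor; the file names are placeholders, and the selection of feature classes is only an example.

```python
# Minimal PyRadiomics sketch; image and mask paths are placeholders and must point to
# files readable by SimpleITK (e.g., NIfTI volumes with a matching ROI mask).
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("shape")       # morphological features (volume, sphericity, ...)
extractor.enableFeatureClassByName("firstorder")  # intensity distribution (mean, skewness, ...)
extractor.enableFeatureClassByName("glcm")        # texture via gray-level co-occurrence

features = extractor.execute("patient001_ct.nii.gz", "patient001_roi.nii.gz")
for name, value in features.items():
    if not name.startswith("diagnostics_"):       # skip metadata entries
        print(name, value)
```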
However, feature extraction will often generate large numbers of features, and many of them will be irrelevant (i.e., they will not help to solve the problem). Many will also be redundant, that is, their information is already present in other features. Therefore, a feature selection step is applied that retains only the relevant features; for example, a t-test can be used to filter out those that are not significant.
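A minimal sketch of such a univariate t-test filter on synthetic data follows (the significance threshold and data shapes are illustrative); note how, with hundreds of features, some irrelevant ones pass the test purely by chance.

```python
import numpy as np
from scipy.stats import ttest_ind

def ttest_filter(X, y, alpha=0.05):
    """Keep features whose means differ significantly between the two classes."""
    pvals = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                      for j in range(X.shape[1])])
    return np.where(pvals < alpha)[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))          # 60 patients, 500 mostly irrelevant features
y = rng.integers(0, 2, size=60)
X[y == 1, 0] += 1.5                     # make feature 0 genuinely informative
print(ttest_filter(X, y))               # feature 0, plus some false positives by chance
```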
These features are then fed into a classifier, which receives a set of feature values and outputs a prediction. This classifier is trained on the data using machine learning techniques: given the training data, the algorithm identifies relevant patterns so that it can make accurate predictions on new data. The resulting model can then be tested and applied to new data, such as routine clinical data.
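A minimal sketch of this final step with scikit-learn, using synthetic data as a stand-in for a selected radiomic feature table; the classifier and split ratio are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a radiomic feature table: 100 patients, 50 selected features.
X, y = make_classification(n_samples=100, n_features=50, n_informative=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # learn patterns from training data
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])    # evaluate on unseen data
print(f"Test AUC: {auc:.2f}")
```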
The radiomics pipeline appears straightforward, but good practices must be maintained at each step to avoid biased or false-positive results.12
Reproducibility issues
Although the pipeline may seem fairly rigid, the key issue is reproducibility. This term describes the requirement that similar findings should be observed when conditions do not change significantly. For example, scanning the same patient twice within a very short time frame should yield similar radiomic features and lead to similar predictions. Non-reproducible studies are essentially random and erratic and cannot be trusted. They may also lead to false positives, which would prevent clinical use.
Reproducibility in radiomics can be divided into two areas: imaging reproducibility and statistical reproducibility. The term “imaging reproducibility” refers to the acquisition of scans and the extraction of features, whereas “statistical reproducibility” refers to modeling using machine learning. Of course, if the imaging is not reproducible, no modeling can correct it (following the well-known “garbage in, garbage out” rule).13, 14 Nonetheless, the focus will be mainly on statistical reproducibility.
Imaging reproducibility
Imaging reproducibility refers to issues in the acquisition process resulting from variations in imaging parameters and techniques, vendor differences, and similar factors.15 Since radiomic features are extracted from the acquired images, parameters such as voxel size and reconstruction techniques can have a significant impact on these features.16, 17 The effect is also non-linear, which can render images highly non-reproducible.18 Post-hoc harmonization can mitigate the problem, but only to a limited extent.19, 20
Even if the imaging were reproducible, the segmentations are usually sensitive to intra- and inter-rater variability, and these differences can also have a large impact on the extracted features,21 making them partially non-reproducible. The same is true for the definition of the features themselves. Even simple features, such as sphericity, can show variations depending on the formulas used to calculate them. Accordingly, the Image Biomarker Standardisation Initiative (IBSI) was launched to standardize these features and assess their reproducibility.22 However, not all software programs are IBSI-compliant, and even the standardized features may still exhibit some differences.23
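One commonly used way to quantify such robustness is an agreement measure computed on features extracted twice per lesion, for example from two raters' segmentations or a test-retest scan pair. The sketch below implements Lin's concordance correlation coefficient as one such measure; the numbers are purely illustrative.

```python
import numpy as np

def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between two measurement series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.cov(x, y, bias=True)[0, 1]
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# The same feature extracted from two segmentations of the same lesions (illustrative values).
rater_a = [0.81, 0.93, 0.62, 0.75, 0.88]
rater_b = [0.79, 0.97, 0.55, 0.80, 0.85]
print(f"CCC = {concordance_ccc(rater_a, rater_b):.2f}")  # values near 1 indicate a robust feature
```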
Another source of variability is the use of preprocessing filters. Although standardization has recently been considered by the IBSI,24 it is unknown whether preprocessing helps at all, and if so, which filters should be applied. In practice, many preprocessing filters are therefore applied in parallel to increase the predictive power of the resulting features.25 However, this leads to statistical problems.
Statistical reproducibility
The data generated will often have two characteristics that distinguish it from many other datasets: it will be high-dimensional, meaning that there are more features than samples, and it will be highly correlated. In radiomics, there are two main reasons for this. First, the total sample size is often limited due to the time and resources required for annotation, the rarity of the disease in question, or privacy concerns. Second, the numerous preprocessing filters extract information that is highly similar. For example, two levels of smoothing will produce features that are very alike. This results in the generation of highly correlated features.
The presence of such data presents significant challenges, as the search for predictive features and patterns becomes exponentially more difficult and resembles “finding a needle in a haystack.”26 Therefore, the risk of identifying spurious patterns and producing false-positive results is significantly increased in such data. While methods such as regularization can help overcome this problem, the issue remains unresolved.
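The following small simulation illustrates the danger: with far more features than samples, screening pure noise will almost always yield a feature that looks highly "significant". The data shapes are typical of radiomics studies but otherwise arbitrary.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_samples, n_features = 50, 2000              # far more features than patients
X = rng.normal(size=(n_samples, n_features))  # pure noise, no real signal
y = rng.normal(size=n_samples)                # random "outcome"

# Screen all features and keep the best-looking one.
pvals = []
for j in range(n_features):
    _, p = pearsonr(X[:, j], y)
    pvals.append(p)
pvals = np.array(pvals)

best = pvals.argmin()
print(f"Best feature {best}: p = {pvals[best]:.1e}")  # typically p << 0.05 despite pure noise
```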
Therefore, radiomics often employs a feature selection step, where the goal is to retain only the relevant features and remove all others, thereby reducing the dimensionality of the data. However, several methods of varying complexity are currently in use.11, 27, 28 Simpler methods, such as Spearman correlation or t-tests, typically operate by considering each variable on its own. These methods are computationally efficient but may overlook dependencies between variables, potentially leading to suboptimal feature selection. More complex methods, such as the least absolute shrinkage and selection operator (LASSO),29 the minimum redundancy maximum relevance (mRMR) method,30 or the Boruta method,31 are able to account for such dependencies but are more computationally demanding. While it may be intuitive to assume that more complex methods perform better, it has been shown that for many datasets, the differences may not be significant. However, simpler methods tend to be more robust, and therefore more reproducible.27 In addition, many feature selection methods do not select relevant features but merely score them, leaving open the decision regarding how many of the highest-scoring features to retain, which reduces their reproducibility.
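As an example of an embedded, LASSO-type selection, the sketch below uses L1-regularized logistic regression from scikit-learn on synthetic data; the regularization strength C is an illustrative choice and would itself need tuning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# 80 patients, 300 correlated features, only a handful truly informative.
X, y = make_classification(n_samples=80, n_features=300, n_informative=5,
                           n_redundant=50, random_state=0)

# L1 (LASSO-type) regularization drives most coefficients to exactly zero,
# so the surviving features constitute the selected subset.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
).fit(X, y)

coefs = model.named_steps["logisticregression"].coef_.ravel()
print(f"Selected {np.count_nonzero(coefs)} of {X.shape[1]} features")
```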
Accordingly, feature selection is not a complete solution to the problem, since the task of dealing with the high-dimensional space is merely shifted from the classifier to the selection method. Feature selection may also fail or even degrade performance, especially given the inherent instability of selection methods and their dependence on the specific data sample.27 For example, removing just a few samples can substantially change the set of features considered relevant.
Subsequent classifiers are also affected by high dimensionality, either directly or indirectly if irrelevant features have been selected. Furthermore, many classifiers make assumptions about the data that may not hold, regardless of whether feature selection has been applied. These assumptions are often controlled by hyperparameters; for example, a regularization parameter may reflect the amount of noise assumed to be present in the data. The only option is therefore to test many different hyperparameter settings, which is computationally very expensive. As a result, studies test only a limited number of settings, and it remains unclear whether a substantially better model could have been obtained with more thorough hyperparameter optimization.
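A minimal hyperparameter search sketch with scikit-learn follows; the classifier and grid are illustrative, and in practice the search space is far larger than what can realistically be explored.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=200, n_informative=5, random_state=0)

# Each combination of C (regularization strength) and gamma (kernel width) encodes a
# different assumption about noise and smoothness; none of them is known in advance.
param_grid = {"C": [0.01, 0.1, 1, 10], "gamma": ["scale", 0.001, 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="roc_auc").fit(X, y)
print(search.best_params_, f"CV AUC = {search.best_score_:.2f}")
```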
Validation issues
Any model requires extensive testing, the main reason for this being that models could either memorize the data or find spurious instead of predictive patterns. Such a model would perform well during training, but worse on test data and would not generalize. This problem is called overfitting, and the risk is higher for high-dimensional data, where more patterns can fit the given data.
To avoid this problem, validation is performed first. Unlike testing, validation is mainly used for model selection, specifically to determine good values for the hyperparameters, or to identify which feature selection or classifier method performs better on the given data. Ideally, validation should be performed on a second independent dataset, but alternatively, a portion of the data can be set aside. Three schemes are commonly employed in radiomics: simple splitting, cross-validation, and bootstrapping.

In simple splitting, a portion of the data (e.g., 70%) is used for training, whereas the remainder is used exclusively for validation. While this method is conceptually simple and computationally fast, it does not use all available data for training. In addition, the results can depend strongly on the specific split, meaning they may be good, or bad, by chance. To mitigate this, the split can be repeated several times and the results averaged.

Cross-validation provides a more systematic approach by splitting the data into k subsets and iteratively training on k-1 subsets while using the remaining subset for validation. Although computationally more expensive, this method ensures that all data are used for both training and validation, providing a more reliable estimate of performance. Nested cross-validation refines this further by applying cross-validation twice: an outer loop over the entire dataset for performance estimation and an inner loop over the training data for hyperparameter tuning. This scheme provides an unbiased evaluation and is considered a gold standard.

Bootstrapping, on the other hand, uses resampling with replacement to create training and validation sets. Since samples can occur multiple times in the training set, this approach simulates different weights for each sample and can thus lead to better estimates. However, obtaining these estimates generally requires a large number of repetitions (e.g., 1,000), making it computationally highly expensive.
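A compact nested cross-validation sketch with scikit-learn (synthetic data standing in for a radiomic feature table; the grid and classifier are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=200, n_informative=5, random_state=0)

# Inner loop: hyperparameter tuning; outer loop: unbiased performance estimate.
inner = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]}, cv=3, scoring="roc_auc")
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(f"Nested CV AUC: {outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```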
However, in all cases, the golden rule of machine learning must be followed: training and test sets must be kept strictly separate. Failure to follow this rule will lead to data leakage, meaning that the classifier has already seen some aspects of the test data and could adapt to it, leading to false positives.32, 33
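The sketch below illustrates the effect on pure noise: selecting features on the full dataset before cross-validation ("leaky") yields an optimistic estimate, whereas refitting the selection inside each training fold does not. The data shapes and the choice of selector are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))              # pure noise: no real signal
y = rng.integers(0, 2, size=60)

# Leaky: features are selected on ALL data, so the test folds have already been "seen".
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_auc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                            cv=5, scoring="roc_auc").mean()

# Correct: the selection step is re-fitted inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"leaky AUC = {leaky_auc:.2f} (optimistic), "
      f"clean AUC = {clean_auc:.2f} (close to 0.5, as expected for noise)")
```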
Another issue is the variability of the data. Choosing a homogeneous cohort (e.g., from a single scanner) increases the likelihood of obtaining a working model, since the predictive patterns seen during training are likely to be present in the test data. At the same time, however, the model will be highly specific and may not generalize beyond the collected data. The opposite approach, collecting heterogeneous data, has its own pitfalls: the classifier may not be able to identify any predictive patterns at all, especially with small sample sizes, leaving no relevant model to test. However, if such a model does succeed, its clinical applicability will be much higher, which is the ultimate goal.34
Deep radiomics
Deep learning has recently shown great success in other fields,35 and it is natural to apply deep learning to radiomics. Deep learning is based on artificial neural networks, which, in a simplistic way, try to mimic the human brain, and date back to the early days of machine learning in the 1950s. Conceptually, in the simplest case, a network consists of multiple layers, each of which can be understood as a feature generation step. Layer by layer, the input is transformed into the desired output, and the training data is used to determine the parameters of the layers (Figure 3).
Applying deep learning to radiomics, termed deep radiomics, can mitigate two major drawbacks of the generic radiomics discussed above. First, it can reduce the need for segmentation, because the network can, at least in principle, determine the ROI itself. Second, the network can extract features that are optimal for the problem at hand. It can also consider more global characteristics of the data, whereas most generic features are based on local textures. Both can lead to models that perform much better than generic models. While deep learning has only recently gained importance, neural networks have been applied to radiological data since the 1990s.36, 37
Issues with deep radiomics
Deep radiomics does not magically bypass the reproducibility problems. For example, changes in acquisition parameters have been shown to have a strong effect on predictive performance, thus affecting generalizability.38 Much is unknown about the stability of deep radiomics models, such as whether a different training sample will yield different features, or whether features from different networks are highly correlated. Robustness to image noise and slightly different segmentations has also not been systematically investigated, which is complicated by the fact that many different architectures exist.
Sample size is an even bigger issue in deep radiomics. Learning directly from data usually requires many more samples to be successful.39 As a result, deep radiomics is currently not as successful as it could be.
Consequently, several mitigation strategies have been developed.40, 41 However, they all have their own drawbacks. For example, studies often resort to using image slices for training, which not only increases the sample size but also allows for the use of smaller networks.42, 43 Nonetheless, this approach partially loses the spatial information, which reduces the potential benefit.
A more common strategy is transfer learning. Here, the network is first trained on a dataset from another domain, most commonly ImageNet, a collection of photographs.44 This pre-trained network is then fine-tuned (i.e., it is trained on the radiomic data, often at lower learning rates) to slightly adjust the network. This approach can work because there is a remarkable similarity between the low-level features of the human eye and the network; at lower levels, both appear to operate with filters comparable to Gabor filters.39 Thus, fine-tuning can focus on training the higher layers and performing better with fewer samples. However, the use of non-medical data for pre-training is again suboptimal, and larger medical data corpora have been introduced only recently, although the extent to which these can help in radiomics remains unclear, as they are usually far smaller than ImageNet.45
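A minimal fine-tuning sketch with PyTorch/torchvision follows (the torchvision ≥ 0.13 weights API is assumed); the learning rates and the two-group optimizer setup are illustrative choices rather than a prescribed recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet and adapt it to a binary radiomics task.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)   # replace the 1000-class head

# Fine-tune the pre-trained backbone with a lower learning rate than the new head,
# so that the low-level (Gabor-like) filters are only slightly adjusted.
optimizer = torch.optim.Adam([
    {"params": [p for name, p in model.named_parameters() if not name.startswith("fc")],
     "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()

# Inside the training loop (batches of ROI crops replicated to 3 channels):
# logits = model(images); loss = criterion(logits, labels); loss.backward(); optimizer.step()
```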
Since training a deep network involves many hyperparameters (e.g., learning rate, learning schedule, choice of loss function) and can be relatively complicated, another alternative is to bypass any training and instead use only pre-trained networks as feature extractors (Figure 3),46 which allows more versatile classifiers, such as boosting, to perform better, especially with smaller sample sizes.47 However, since no training is performed in this approach, the disadvantage is again that the features may be less optimal, although fusing them with generic radiomics can still prove helpful.48, 49
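A sketch of this feature-extractor strategy, again assuming PyTorch/torchvision and scikit-learn; the random tensors merely stand in for preprocessed ROI crops.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.ensemble import GradientBoostingClassifier

# Use a frozen ImageNet network purely as a feature extractor: the network itself is not trained.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()          # drop the classification head, keep the 512-d embedding
backbone.eval()

@torch.no_grad()
def deep_features(images):           # images: (N, 3, 224, 224) tensor of ROI crops
    return backbone(images).numpy()

# Illustration with random tensors standing in for preprocessed ROI crops.
X_deep = deep_features(torch.randn(40, 3, 224, 224))
y = np.random.randint(0, 2, size=40)
clf = GradientBoostingClassifier().fit(X_deep, y)   # a versatile classifier on top of deep features
```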
Finally, the hope that deep radiomics can dispense with segmentation may be in vain due to the small sample size. In addition, without a proper validation method, deep radiomics is also prone to bias due to over-engineering. In fact, a recent review found no clear advantage of deep radiomics.50
Interpretability issues
A key point in radiomics is to identify features that can potentially serve as biomarkers, just as the volume of a lesion indicates its malignancy. However, radiomics attempts to establish such a correspondence “in reverse,” using the coarser and noisier radiological images, where much information is already lost during acquisition. Radiomics seeks to capture the underlying information by making multiple measurements (in the form of different features). These are often correlated, as they can be understood as noisy and incomplete versions of the inaccessible information. There is no guarantee that the information can be recovered from the extracted features, nor that the observed predictivity of a feature actually corresponds to a biomarker.
Given a set of features, radiomics can only identify those that are statistically associated with the outcome. Such an association is not causal and can only serve as the basis for a subsequent, statistically sound confirmatory test. This problem is exacerbated by the high dimensionality of the data, where the low-dimensional intuition that features have a clear meaning and that their importance can be easily measured fails.51 In fact, the very concept of distance becomes unintuitive in higher dimensions, a phenomenon often termed the curse of dimensionality, as demonstrated by the fact that in higher dimensions most of the volume of a unit sphere lies near its surface.52
In fact, the use of feature importance as a surrogate has been shown to be questionable because essentially every step in the radiomics pipeline affects the importance of features in the resulting model. Even seemingly minor preprocessing steps, such as the choice of discretization method23 and data normalization, which is performed to bring the data onto a uniform scale, can strongly influence the set of selected features and thus the interpretability.53 This influence is even more evident in the feature selection step, where different methods emphasize different aspects and thus assign different importance.27 Not only does the subsequent classifier affect interpretation, but the selection of the final model can also have a great impact, as several models will often perform very similarly while selecting different sets of features as important.51, 54 In a systematic review, Tohidinezhad et al.55 identified 23 models that predict the effect of radiation on brain health. None of these models used exactly the same features, and the models differed widely in the factors that were significantly associated with outcome.
Moreover, even if such an identification were possible, most radiomic features are not interpretable by themselves. For example, it is unclear what semantic meaning a feature such as wavelet-LHL_glrlm_GrayLevelNonUniformityNormalized carries, or how it differs from a highly correlated feature that is slightly less predictive. It is unlikely that a radiologist would be able to relate the measured values of such a feature to the scan. Feature maps may be helpful for visualization,56 but they are currently only a tool and cannot serve as the basis of an interpretation. In addition, radiomic models are rarely based on a single feature, and a meaningful interpretation of a model using multiple features is barely possible. Paradoxically, radiomics was invented precisely because humans cannot describe textural patterns well.
The potential for highly correlated features to cause interpretation problems is illustrated by a recent study by Welch et al.,57 who reexamined the model that Aerts et al.4 used in their seminal work on patients with non-small cell lung cancer. The authors showed that volume alone is as predictive as the radiomic model, and moreover, that three of the four texture features found by Aerts et al.4 are highly correlated with volume.
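A simple check of this kind can be run on any radiomic feature table, for example by flagging features whose Spearman correlation with volume is very high. The file path and the PyRadiomics-style column names in the sketch below are placeholders.

```python
import pandas as pd
from scipy.stats import spearmanr

# features.csv: one row per patient, columns are radiomic features (placeholder path,
# PyRadiomics-style naming assumed). Keep numeric columns only.
df = pd.read_csv("features.csv").select_dtypes("number")
volume = df["original_shape_VoxelVolume"]

for name in df.columns.drop("original_shape_VoxelVolume"):
    rho, _ = spearmanr(df[name], volume)
    if abs(rho) > 0.8:
        print(f"{name}: |rho| = {abs(rho):.2f} -> largely a volume surrogate")
```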
Recently, post-hoc interpretations, such as Explainable AI (XAI) methods, have been applied.58 However, these are also problematic. Since there are several different XAI methods, it is likely that the resulting meanings will also differ.59 Alternatively, explainable classifiers could be used, which generally involves a trade-off between the complexity (and thus interpretability) of the classifier and its predictive performance.60 However, even if these methods are successful, they only address the classifier and do not mitigate the problems in the overall pipeline.
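A minimal SHAP sketch (assuming the shap package and a tree-based classifier); note that a different XAI method applied to the same model may highlight different features.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=100, n_features=30, n_informative=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# SHAP attributes each prediction to the input features (here in log-odds space).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)   # global overview of feature contributions
```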
The situation is similar for deep radiomics. While the pipeline itself is less complex, training is more difficult, and there are many more choices regarding the architecture. It is highly likely that different choices will lead to vastly different features. In addition, the deep features do not have a mathematical formula, making any direct interpretation difficult. To remedy this situation, Cho et al.61 correlated deep features with radiomic features. However, since radiomic features are not fully interpretable by themselves, this approach is limited in scope.
Discussion
Currently, radiomics suffers from both imaging and statistical reproducibility issues, both of which affect the interpretability and applicability of the models. This affects the entire radiomics pipeline, and even feature normalization can lead to reproducibility issues.
Neither of these problems can be easily avoided. Image reproducibility could possibly be mitigated by strict standardization of imaging protocols, but this is all but impossible to implement in practice across multiple centers. Statistical reproducibility is also not easily mitigated. Methodological differences aside, different research groups will often reach different conclusions given the same data.62 Although such studies have not been conducted in radiomics, the impact is expected to be even greater, as there is generally less code and data sharing in the health domain.63
One major problem is small sample sizes. Radiomics studies need to include larger and more diverse datasets to have a chance of success. This is illustrated by current deep learning models for diagnosing chest X-rays or mammograms, which have been shown to perform especially well.64, 65 These models are often trained on datasets comprising tens of thousands of scans. However, they are not radiomic in the sense that they do not require segmentations; the abundance of data makes segmentations unnecessary, as the network can identify the relevant regions on its own. Although it is virtually impossible to obtain such large sample sizes for rare cancers, more data would potentially reduce the relative dimensionality of the data and thus increase reproducibility. Nonetheless, in terms of sample size, radiomics seems to have made little progress since the seminal work of Harlow et al.6 in 1976, which reported sample sizes of around 300. Small sample sizes are generally unable to reflect heterogeneity. This is true even for within-patient heterogeneity. For example, suppose two features are measured in a single patient at two time points, as in a test-retest scenario, and their sum is predictive. The two features may vary greatly between the two time points, so that neither is reproducible on its own; yet as long as their sum remains the same, this poses no problem for their predictive value. However, if the model was not trained on such data, it would not find that pattern and would fail on new data. Nevertheless, large sample sizes are useless if the images do not carry the necessary information and such predictive patterns do not exist; hence, more data is not always helpful.
Non-reproducible studies may also result from a failure to follow best practices, which can be ensured by adhering to proper guidelines.66, 67 For example, the study must be described in full detail in a manner that enables replication by others. Code should always be shared, and data should be shared if possible. Best practices encompass every step of the study; for example, it must be ensured that the data selection is appropriate and unbiased relative to the study’s objective.12, 68 The outcome should also be compared with current standards where applicable, for example, if a clinical scoring system is in current use (e.g., the Prostate Imaging Reporting and Data System), the radiomics model should be compared against it.69 Statistical tests (e.g., permutation tests) can be used to ensure that the resulting model is different from a random guess, which is crucial when sample sizes are small. While statistical significance should be computed, the clinical significance should also be considered to evaluate the impact of the model. Furthermore, the overall study design must be methodologically sound to avoid reporting false-positive results. In addition, reporting must be clear and complete to ensure reproducibility.70
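A permutation test of this kind is readily available in scikit-learn; the sketch below compares the cross-validated AUC against a label-shuffled null distribution (synthetic data and illustrative settings).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=80, n_features=100, n_informative=5, random_state=0)

# Compare the cross-validated score against a null distribution obtained by shuffling the labels.
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, scoring="roc_auc", n_permutations=200, random_state=0)
print(f"AUC = {score:.2f}, permutation p-value = {pvalue:.3f}")
```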
In a seminal paper, Ioannidis argued that around 60% of all medical studies contain false-positive results.71 Studies with such obvious false positives should therefore be retracted, but this almost never happens in radiomics. On the contrary, such studies are frequently cited.72 In addition, methodologically correct studies will fare relatively worse and may appear as “negative” studies that may not be considered for publication.73 To mitigate this, a far more rigorous review process with mandatory code or data sharing would be required, as it could help in identifying potentially biased results before their publication. Currently, such studies are often only identified following publication, making it difficult to address the issue. Ensuring that publications rigorously follow reporting guidelines could be another way to reduce the problem.66, 67, 70
It is easy to overlook the fact that image processing has gone through a similar evolution in the past. The field started with the manual extraction of many features (which is the origin of the texture features used today) and progressed to the extraction of more complicated features such as Fisher vectors,74 before the advent of deep learning made these steps obsolete. In fact, the interpretability of deep networks operates at the semantic level of images, not features, for example, asking whether the network takes the tail of a dog into account when predicting its breed. This is not easily possible in radiomics, where a visualization of the important areas of a tumor lesion would not help a radiologist understand what the network is doing. Furthermore, in current machine learning, a model is accepted if it generalizes well, not necessarily if it is interpretable. A similar strategy may be viable for radiomics, where the applicability of models is validated on large datasets.
In conclusion, radiomics currently faces substantial challenges related to imaging and statistical reproducibility that severely impact interpretability and clinical applicability. These problems are difficult to mitigate because imaging standardization is largely impractical and statistical variability is inherent in high-dimensional datasets. As a result, the potential for clinical integration remains uncertain and questionable. A shift toward rigorous data and code sharing practices and the development of large, representative datasets would be required to partially address these challenges.