1. Introduction
Locoregional recurrence (LRR) is a highly prevalent pattern of relapse, seen in about 20 to 50% of patients with head-and-neck cancer (HNC) within two years of radiation treatment [
1- Chang J.H.
- Wu Y.
- Wu A.T.H.
Locoregionally recurrent head and neck squamous cell carcinoma: incidence, survival, prognostic factors, and treatment outcomes.
,
2- Alsahafi E.
- Begg K.
- Amelio I.
- Raulf N.
- Lucarelli P.
- Sauter T.
- et al.
Clinical update on head and neck cancer: molecular biology and ongoing challenges.
,
3- Massa S.T.
- Osazuwa-Peters N.
- Christopher K.M.
- Arnold L.D.
- Schootman M.
- Walker R.J.
- et al.
Competing causes of death in the head and neck cancer population.
]. Prognostic models that accurately identify patients at risk of LRR before radiotherapy would therefore help clinicians personalize treatment.
Radiomics, a machine-learning-driven quantitative analysis of medical images, has shown promising results in predicting treatment outcomes in various cancers, including head-and-neck cancer [
4- Elhalawani H.
- Mohamed A.S.
- Mulder S.
- Grossberg A.
- Smith K.E.
- Gunn G.B.
- et al.
Radiomics prediction of radiation treatment outcomes in oropharyngeal cancer: a clinical and image repository in concert with the cancer imaging archive (TCIA).
,
5- Kalendralis P.
- Shi Z.
- Traverso A.
- Choudhury A.
- Sloep M.
- Zhovannik I.
- et al.
FAIR-compliant clinical, radiomics and DICOM metadata of RIDER, interobserver, Lung1 and head-Neck1 TCIA collections.
,
6- Vallières M.
- Kay Rivest E.
- Perrin L.J.
- Liem X.
- Furstoss C.
- Aerts H.J.W.L.
- et al.
Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer.
]. However, radiomics extracts many features from the volume of interest, leading to a high-dimensional feature space [
[7]- van Timmeren J.E.
- Cester D.
- Tanadini-Lang S.
- Alkadhi H.
- Baessler B.
Radiomics in medical imaging—“how-to” guide and critical reflection.
] and an attendant risk of overfitting. Feature selection is therefore recommended to remove redundant and irrelevant features that do not contribute to the prognostic model, and selecting highly reproducible and stable features for model building is important. In HNC, the optimal feature selection methods and classifiers for multicentre radiomics-based prognostic models remain to be studied [
8- Gangil T.
- Shahabuddin A.B.
- Dinesh Rao B.
- Palanisamy K.
- Chakrabarti B.
- Sharan K.
Predicting clinical outcomes of radiotherapy for head and neck squamous cell carcinoma patients using machine learning algorithms.
,
9- Wong A.J.
- Kanwar A.
- Mohamed A.S.
- Fuller C.D.
Radiomics in head and neck cancer: from exploration to application.
,
10- Parmar C.
- Leijenaar R.T.H.
- Grossmann P.
- Rios Velazquez E.
- Bussink J.
- Rietveld D.
- et al.
Radiomic feature clusters and prognostic signatures specific for Lung and Head & Neck cancer.
].
Like all machine learning applications, radiomics-based prognostic models benefit from being trained and validated on image datasets that are as large as feasibly achievable. However, the burden of disease for sites such as head-and-neck cancer is highly variable across the world. For example, HNC accounts for about 20% and 3% of all cancer diagnoses in India and the Netherlands, respectively [
11Trends in incidence of head and neck cancers in India.
,
12528-the-netherlands-fact-sheets.pdf n.d. https://gco.iarc.fr/today/data/factsheets/populations/528-the-netherlands-fact-sheets.pdf (accessed August 22, 2022).
]. It is therefore quite difficult for single-institution data to account for the sources of variability that reduce the generalizability of models to real-life data, such as variations across populations, image quality and dataset size.
Models trained on small samples often overfit and lack generalizability [
11Trends in incidence of head and neck cancers in India.
,
13- Devakumar D.
- Sunny G.
- Balu K.
- Bowen S.R.
- Nadaraj A.
- Jeyseelan L.
- et al.
Framework for machine learning of CT and PET radiomics to predict local failure after radiotherapy in locally advanced head and neck cancers.
,
14- Giraud P.
- Giraud P.
- Gasnier A.
- El Ayachy R.
- Kreps S.
- Foy J.P.
- et al.
Radiomics and machine learning for radiotherapy in head and neck cancers.
,
15- Ger R.B.
- Zhou S.
- Elgohari B.
- Elhalawani H.
- Mackin D.M.
- Meier J.G.
- et al.
Radiomics features of the primary tumor fail to improve prediction of overall survival in large cohorts of CT- and PET-imaged head and neck cancer patients.
], which makes pooling data from different institutions attractive. However, multi-institutional retrospective data unavoidably introduce heterogeneity due to, among other things, differences in clinical subjects, scanners, model versions, acquisition and reconstruction protocols, and target definitions. Radiomic features are particularly sensitive to such variations, often referred to as the centre or site effect [
16Understanding sources of variation to improve the reproducibility of radiomics.
,
17- Orlhac F.
- Frouin F.
- Nioche C.
- Ayache N.
- Buvat I.
Validation of a method to compensate multicenter effects affecting CT radiomics.
]. Centre effect harmonization is not trivial as different methods have shown satisfactory results in some studies [
17- Orlhac F.
- Frouin F.
- Nioche C.
- Ayache N.
- Buvat I.
Validation of a method to compensate multicenter effects affecting CT radiomics.
,
18- Orlhac F.
- Eertink J.J.
- Cottereau A.S.
- Zijlstra J.M.
- Thieblemont C.
- Meignan M.
- et al.
A guide to ComBat harmonization of imaging biomarkers in multicenter studies.
] but have had no favourable effect in others [
[19]- Da-ano R.
- Lucia F.
- Masson I.
- Abgral R.
- Alfieri J.
- Rousseau C.
- et al.
A transfer learning approach to facilitate ComBat-based harmonization of multicentre radiomic features in new datasets.
]. The best batch effect removal approaches for radiomics have been reported [
[20]- Castaldo R.
- Brancato V.
- Cavaliere C.
- Trama F.
- Illiano E.
- Costantini E.
- et al.
A framework of analysis to facilitate the harmonization of multicenter radiomic features in prostate cancer.
]; however, batch assignment within these approaches can also contribute to further variability [
[21]- Carré A.
- Battistella E.
- Niyoteka S.
- Sun R.
- Deutsch E.
- Robert C.
AutoComBat: a generic method for harmonizing MRI-based radiomic features.
].
This study investigates the combination of publicly available datasets with single-institution retrospective data to construct radiomic models for loco-regional recurrence within two years of treatment in head-and-neck cancer. We evaluated the generalizability of the radiomic models when applied to a new real-world dataset. The study was intended to show feasibility for a prospective radiomic study that is currently enrolling patients, and to identify potential issues arising in multi-institutional modelling studies of this kind. Specifically, we investigated the roles of the feature selection method, the choice of machine learning architecture, sampling effects and batch harmonization effects with regard to external validation results.
4. Discussion
In this study, we simulated a multi-institutional study to build prognostic models of loco-regional recurrence (LRR) in patients with locally advanced head-and-neck cancer (LA-HNC). We designed experiments to study the sensitivity of the models to the choice of feature selection method, machine learning classifier, batch effect normalization and data size.
This study is important because radiomics has shown potential to personalize patient treatment using routinely acquired clinical images. This is particularly relevant to the clinical management of HNC, since biological heterogeneity within a tumour is just as important as the heterogeneity seen between patients in determining the clinical outcome [
1- Chang J.H.
- Wu Y.
- Wu A.T.H.
Locoregionally recurrent head and neck squamous cell carcinoma: incidence, survival, prognostic factors, and treatment outcomes.
,
9- Wong A.J.
- Kanwar A.
- Mohamed A.S.
- Fuller C.D.
Radiomics in head and neck cancer: from exploration to application.
]. CT imaging was chosen because it is indispensable in the management of HNC, and all patients treated with radiation undergo it.
Most radiomics studies, including this one, rely on retrospective imaging, which leads to unavoidable bias in patient selection and to data heterogeneity arising from scanners, imaging and reconstruction parameters, inter-observer variation in delineation, biological variability across populations, and so on. These variations manifest more prominently in real-life data than in prospective data. In a multi-centric setting, ensuring data quality can be additionally challenging. For example, we often rely on whatever has been defined as the ROI, yet GTV definitions may differ between centres: some include just the primary tumour and others both the primary and the nodes. Being too conservative risks losing a lot of data, whereas being too permissive introduces noise, so there is a trade-off between dataset size and noise. It is also often difficult to determine from the images or their metadata whether a head-and-neck CT was acquired with or without contrast. Prospective imaging might limit some of these variations and give better control over data quality, but at the cost of much smaller samples.
Many feature selection methods are currently available, but none has yet been established as optimal for radiomics. In this study, we chose (1) LASSO and (2) SelectKBest for their popularity and ease of implementation. We observed that features from a single dataset have quite different distributions (Fig. 1, Fig. 2), may not be applicable in a multi-institutional dataset [
[25]Benchmarking feature selection methods in radiomics.
] and are dependent on the choice of the selection method [
[26]Evaluation of the dependence of radiomic features on the machine learning model.
]. We also observed that only three features were common to the two methods (Supplementary Table S2).
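The overlap between two selection methods can be checked with a few lines of code. The following is a minimal sketch on synthetic data, not the pipeline used in this study: the feature matrix, the LASSO penalty `alpha`, the value of `k` and the ANOVA F-score used by SelectKBest are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a radiomic feature matrix (hypothetical data).
X, y = make_classification(
    n_samples=200, n_features=100, n_informative=8, n_redundant=20,
    n_clusters_per_class=1, random_state=0,
)
X = StandardScaler().fit_transform(X)

# Method 1: LASSO on the binary label; features with non-zero
# coefficients are treated as selected. alpha is an assumed value.
lasso = Lasso(alpha=0.05).fit(X, y)
lasso_idx = set(np.flatnonzero(lasso.coef_).tolist())

# Method 2: SelectKBest with the ANOVA F-score; k is an assumed value.
k = 10
kbest = SelectKBest(score_func=f_classif, k=k).fit(X, y)
kbest_idx = set(np.flatnonzero(kbest.get_support()).tolist())

# Features chosen by both methods.
common = sorted(lasso_idx & kbest_idx)
print(f"LASSO: {len(lasso_idx)} features, SelectKBest: {len(kbest_idx)} features")
print(f"Common to both: {common}")
```

Repeating this on each dataset separately, and then on the pooled data, gives a quick view of how stable a selected feature set is across sources.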
Radiomic models often run the risk of overfitting (failure to predict in unseen data). One way to mitigate this is to increase the size of the training sample, which we achieved in this study by pooling data from different institutions and sources; pooling, however, also introduces more heterogeneity. We chose ComBat as it is the most popular harmonization method reported for radiomics studies, and an easy-to-use Python implementation is available [
[24]- Behdenna A.
- Haziza J.
- Azencott C.A.
- Nordor A.
pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods.
]. We saw that ComBat standardized the feature distributions and means (Supplementary Table S5), but it leaves us no clearer as to what the cause(s) of the batch effects might be. The danger is that we wipe out real clinical and demographic heterogeneity along with scanner, acquisition and delineation heterogeneity, which may lead to potentially dangerous misjudgement. In our study, applying ComBat gave similar or better performance in the validation set compared with no batch correction, suggesting that ComBat has more utility in some combinations of multi-centre datasets than in others. We do not at present have access to enough data to determine the reasons for this, and we flag it as an important question for follow-up studies.
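To make the concern concrete, the sketch below shows a deliberately simplified location-scale correction that aligns each batch's per-feature mean and standard deviation to the pooled values. It is not ComBat and not the pyComBat API: ComBat additionally applies empirical Bayes shrinkage to the batch estimates and can preserve specified covariates. The function name and the simulated data are hypothetical.

```python
import numpy as np

def location_scale_harmonize(features, batch):
    """Align each batch's per-feature mean and SD to the pooled values.

    Simplified illustration only: no empirical Bayes shrinkage and no
    covariate preservation, unlike ComBat.
    features: array of shape (n_samples, n_features)
    batch:    array of shape (n_samples,) with one label per sample
    """
    features = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    grand_mean = features.mean(axis=0)
    pooled_sd = features.std(axis=0, ddof=1)
    out = np.empty_like(features)
    for b in np.unique(batch):
        rows = batch == b
        mu = features[rows].mean(axis=0)
        sd = features[rows].std(axis=0, ddof=1)
        sd[sd == 0] = 1.0  # avoid division by zero for constant features
        out[rows] = (features[rows] - mu) / sd * pooled_sd + grand_mean
    return out

# Two simulated centres with different feature means and spreads.
rng = np.random.default_rng(0)
centre_a = rng.normal(0.0, 1.0, size=(60, 3))
centre_b = rng.normal(2.0, 3.0, size=(40, 3))
X = np.vstack([centre_a, centre_b])
batch = np.array(["A"] * 60 + ["B"] * 40)

X_h = location_scale_harmonize(X, batch)
print("Centre A mean before/after:", X[batch == "A"].mean(axis=0), X_h[batch == "A"].mean(axis=0))
print("Centre B mean before/after:", X[batch == "B"].mean(axis=0), X_h[batch == "B"].mean(axis=0))
```

After correction, both centres share the same feature means and spreads regardless of why they differed. If the difference between centres was partly clinical rather than technical, that signal is removed as well, which is the risk described above.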
We did not exhaustively search over all possible classifiers for the best one. For the feature selection and classifier methods used, we considered sensitivity to the hyper-parameters, but this was not the primary purpose of our study. We cannot conclusively say that the hyper-parameters have been optimized, but our results represent what could be seen in future multi-centre studies. We selected Logistic Regression (LR) and radial-kernel SVM as two of the most used classifiers [
11Trends in incidence of head and neck cancers in India.
,
14- Giraud P.
- Giraud P.
- Gasnier A.
- El Ayachy R.
- Kreps S.
- Foy J.P.
- et al.
Radiomics and machine learning for radiotherapy in head and neck cancers.
,
15- Ger R.B.
- Zhou S.
- Elgohari B.
- Elhalawani H.
- Mackin D.M.
- Meier J.G.
- et al.
Radiomics features of the primary tumor fail to improve prediction of overall survival in large cohorts of CT- and PET-imaged head and neck cancer patients.
], and their selection was motivated differently: LR has a linear decision boundary, and its models are simpler to explain than those of SVM, which operates on non-linear boundaries. With the multi-institutional data, we found that the LR models performed poorly on the validation data despite harmonization with ComBat, while the SVM models outperformed the LR models irrespective of ComBat harmonization. Hence, an ML framework should be chosen not only for its complexity (e.g., linear or non-linear decision boundaries) but also for how well it deals with the different sources of heterogeneity. Although more data yielded significantly better performance for both LR and SVM models compared to single-institution data (Supplementary Table S4), we observed that once the SVM models had sufficiently learnt the heterogeneity, more data did not necessarily improve performance: the AUCs achieved with approximately 200 patients and with about 500 patients both remained at 0.6. This may reflect an inherent limitation of the AUC, which is typically limited to under 0.8 in small samples, as shown by Bahn and Alber [
[27]On the limitations of the area under the ROC curve for NTCP modelling.
]. In multi-institutional studies, a trade-off between capturing the true biological heterogeneity and achieving a sample size that can statistically account for the true differences is unavoidable.
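The comparison described above (two classifiers, two training-set sizes, one held-out external set) can be sketched as follows. This is an illustration on synthetic data under stated assumptions, not a reproduction of the study: the data, the constant shift used to mimic a centre effect, and the default hyper-parameters are all hypothetical, and the resulting AUCs have no bearing on the values reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: 500 pooled training patients and 150 "external" patients.
X, y = make_classification(n_samples=650, n_features=20, n_informative=5, random_state=0)
X_pool, y_pool = X[:500], y[:500]
# A constant offset is an assumed, crude stand-in for a centre effect.
X_ext, y_ext = X[500:] + 0.5, y[500:]

def make_lr():
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

def make_svm():
    return make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))

def external_auc(model, n_train):
    """Fit on the first n_train pooled patients, score on the external set."""
    model.fit(X_pool[:n_train], y_pool[:n_train])
    return roc_auc_score(y_ext, model.decision_function(X_ext))

aucs = {}
for n_train in (200, 500):
    for name, factory in (("LR", make_lr), ("SVM-RBF", make_svm)):
        aucs[(name, n_train)] = external_auc(factory(), n_train)

for (name, n_train), auc in aucs.items():
    print(f"{name:8s} n_train={n_train}: external AUC = {auc:.2f}")
```

Keeping the external set fixed while varying only the training size separates the effect of adding data from the effect of changing the validation cohort.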
This study has certain limitations. First, the data are retrospective. Second, the time-to-LRR was dichotomised because one cohort (HN-CMC) did not have the event dates needed for time-to-event analysis (see the logic in Supplementary Section S.3). For the prospective observational trial that the institution is currently accruing, care is being taken to ensure that follow-up and time to clinical outcomes, including loco-regional recurrence, are recorded accurately. Third, the heterogeneity of the imaging data limits the generalizability of the prognostic model, which includes patients with good prognosis (HPV-positive oropharynx) as well as poor prognosis (e.g., Stage 4 hypopharynx). Next, some discrepancy was noted in the GTV definitions in the public datasets; however, we included patients solely on the basis of the tumour definitions provided by the individual centres. The HN-MONTREAL dataset included patients treated for nasopharyngeal cancer and unknown primary. We did not actively remove nasopharyngeal cancer, as it is more prevalent in some parts of the world than in others, including North-East India [
[28]- Chang E.T.
- Ye W.
- Zeng Y.X.
- Adami H.O.
The evolving epidemiology of nasopharyngeal carcinoma.
]. However, it could not be captured in the HN-CMC cohort. Similarly, on examining the images labelled as unknown primary, at least three were oropharynx; they were nevertheless included and reported on the basis of the supporting document, which mentioned the presence of a primary GTV. This might have some bearing on the performance of the models. However, no corrections were made, and these cases were retained to simulate the real clinical scenario. Next, contrast-enhanced CT is not always the standard imaging available for head-and-neck cancers. Hence, the datasets included both intravenous contrast-enhanced and non-contrast CTs, and no correction was applied, since the effect of contrast on radiomic features in head-and-neck tumours is not yet fully explored. Next, although PyRadiomics does not comply fully with all IBSI requirements, this would not have affected the present study, since features for all datasets were computed with the same PyRadiomics software and the same feature extraction parameter settings. However, IBSI compliance is needed to allow better external reproduction and validation of the results. Lastly, both LR and SVM models trained on different combinations of datasets showed decreased performance when validated on the HN3 dataset (Supplementary Table S3). Future studies will look at the effect of combinations of datasets instead of just adding more data (similar to Supplementary Table S4).
Given our results and the growing number of studies on deep learning based oncological prognostication [
29Deep learning technology for improving cancer care in society: new directions in cancer imaging driven by artificial intelligence.
,
30- Diamant A.
- Chatterjee A.
- Vallières M.
- Shenouda G.
- Seuntjens J.
Deep learning in head & neck cancer outcome prediction.
] and the ability of these models to handle heterogeneity in the data better than conventional machine learning, it would be worth exploring their utility for multi-institutional studies. It would also be interesting to see whether a federated architecture [
[31]Dekker A. Personal Health Train for Radiation Oncology in India and The Netherlands. clinicaltrials.gov; 2020.
], in which some centres can be left in training while others are held out for external validation and different combinations of data can easily be tried, would help tease out the clinical heterogeneities so that only the scanner-related heterogeneities are corrected.
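The idea of rotating which centre is held out can be illustrated without any federated infrastructure. The sketch below is a centralised stand-in using leave-one-centre-out cross-validation; it is not a federated implementation, and the data, the four centre labels and the classifier settings are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data split evenly across four hypothetical centres.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
centres = np.repeat(np.array(["centre_1", "centre_2", "centre_3", "centre_4"]), 75)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))

# Each fold trains on three centres and validates on the remaining one.
scores = cross_val_score(
    model, X, y, groups=centres, cv=LeaveOneGroupOut(), scoring="roc_auc"
)

for centre, auc in zip(np.unique(centres), scores):
    print(f"Held-out {centre}: AUC = {auc:.2f}")
```

Large differences in AUC between held-out centres would point to centre-specific heterogeneity worth investigating before any harmonization is applied.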
In summary, this study highlights the variability that arises when multi-institutional data are pooled for prognostic radiomics models in head-and-neck cancer. Based on our observations, we strongly recommend that future studies report the scanner models, imaging parameters and use of contrast agents, and provide primary and nodal volumes separately, along with the other clinical details relevant to the patients. Harmonization techniques may help reduce some variability; however, it remains unclear whether they also remove key heterogeneity that may be worth preserving. Carefully designed prospective, multi-institutional studies and data sharing will be needed to build clinically relevant radiomics models for prognostication.
Article info
Publication history
Published online: May 15, 2023
Accepted: May 2, 2023
Received in revised form: May 1, 2023
Received: October 28, 2022
Copyright
© 2023 The Authors. Published by Elsevier B.V. on behalf of European Society of Radiotherapy & Oncology.