Machine learning based on automated breast volume scanner (ABVS) radiomics for differential diagnosis of benign and malignant BI-RADS 4 lesions
Shi-jie Wang and Hua-qing Liu contributed equally to this work.
Funding information: Guangdong Province Key Field Research and Development Project, Grant/Award Number: 2018B030332001; the Sun Yat-sen University 5010 Program Cultivation Project, Grant/Award Number: 2016016
Abstract
BI-RADS category 4 represents possibly malignant lesions, and biopsy is recommended to distinguish benign from malignant ones. However, studies have revealed that up to 67%–78% of BI-RADS 4 lesions prove to be benign but still undergo biopsy, which may cause unnecessary anxiety and discomfort to patients and increase the burden on the healthcare system. In this prospective study, machine learning (ML) based on an emerging breast ultrasound technology, the automated breast volume scanner (ABVS), was constructed to distinguish benign and malignant BI-RADS 4 lesions and was compared with radiologists of different experience levels. A total of 223 pathologically confirmed BI-RADS 4 lesions were recruited and divided into training and testing cohorts. Radiomics features were extracted from axial, sagittal, and coronal ABVS images for each lesion. Seven feature selection methods and 13 ML algorithms were used to construct different ML pipelines, of which the DNN-RFE (combination of recursive feature elimination and deep neural networks) had the best performance in both training and testing cohorts. The AUC value of the DNN-RFE was significantly higher than that of the less experienced radiologist by DeLong's test (0.954 vs. 0.776, p = 0.004). Additionally, the accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the DNN-RFE were 88.9%, 95.2%, 83.3%, 83.3%, and 95.2%, respectively, which were also significantly better than those of the less experienced radiologist by McNemar's test (p = 0.043). Therefore, ML based on ABVS radiomics may be a potential method to non-invasively distinguish benign and malignant BI-RADS 4 lesions.
1 INTRODUCTION
With the increasing development of ultrasound (US) in routine breast examination,1 the American College of Radiology (ACR) updated the Breast Imaging Reporting and Data System (BI-RADS) US lexicon in 2013 to standardize descriptions of lesions and reports.2 Breast lesions detected by US can be classified into seven categories (categories 0–6) with the ACR BI-RADS. Among them, BI-RADS US category 4 (herein referred to as BI-RADS 4) lesions represent suspicious lesions with a likelihood of malignancy from 2% to 95%, and biopsy is recommended for this category to confirm the pathological properties.3 However, previous studies revealed that up to 67%–78% of BI-RADS 4 lesions are confirmed as benign,4-6 yet these lesions still undergo biopsy, which may cause unnecessary anxiety and discomfort to patients and increase the burden on the health care system.7 Therefore, improving the assessment of BI-RADS 4 lesions and avoiding unnecessary biopsies is a clinical problem that needs to be resolved.
The automated breast volume scanner (ABVS) is an emerging US technology that automatically scans the breast using a special high-frequency broadband transducer.8 It not only overcomes the operator dependency and lack of reproducibility of conventional US but also provides a three-dimensional representation of breast tissue and allows image reformatting in three planes (axial, sagittal, and coronal).9, 10 Recently, several studies have shown that some unique features of ABVS may provide additional information for distinguishing benign and malignant breast lesions.11 Specifically, the retraction phenomenon, which manifests as a stellate pattern around the lesion, has high sensitivity (80%–89%) and specificity (96%–100%) for breast cancer.12-14 However, visual assessment of these ABVS image features depends heavily on the experience of radiologists and lacks agreement among readers.15 Therefore, it is still difficult to distinguish between benign and malignant BI-RADS 4 lesions by visual interpretation of ABVS images.
Radiomics is an image processing and analysis technique that converts routine medical images into quantitative data and then mines the resulting high-dimensional data, which may reflect both macroscopic and pathophysiological characteristics of tissues.16, 17 Radiomics is usually combined with machine learning (ML) methods, such as support vector machine (SVM), random forest (RF), and deep neural networks (DNN), to select features and build decision support models. This strategy has been proven useful for analyzing breast magnetic resonance imaging (MRI) with impressive effectiveness.18, 19 Its application in ABVS is still rarely reported,20, 21 although ABVS images also have standardized, reproducible, and high-resolution characteristics similar to those of MRI images.22 Additionally, previous studies have focused on the identification of benign and malignant breast lesions in general (including BI-RADS 2–5 lesions) and were limited by the relatively small number of patients.20, 21 Whether or how the radiomics method based on ABVS can be used to distinguish between benign and malignant BI-RADS 4 lesions has not been explored.
Therefore, the purpose of this study was to investigate whether the ML method based on ABVS radiomics can improve the assessment and differential diagnosis of BI-RADS 4 lesions.
2 METHODS AND MATERIALS
2.1 Patient enrollment
This prospective study was approved by the Institutional Review Board of our hospital (KY2020163), and written informed consent was obtained from all participants. Between April 1 and August 31, 2020, consecutive women with BI-RADS 4 lesions detected by US were invited to participate in the study. Further inclusion and exclusion criteria were as follows.
The inclusion criteria were as follows: (1) each BI-RADS 4 lesion was confirmed by a senior radiologist (with 7 years of breast US experience) and ultimately assigned to a subcategory (4A, 4B, or 4C) according to the second edition of the ACR BI-RADS US atlas; (2) the patient underwent US-guided core needle biopsy (CNB); and (3) the ABVS examination was performed within 1 week before the biopsy. The exclusion criteria were as follows: (1) women who were not suitable for ABVS, such as those who were pregnant, breastfeeding, or had breast implants; (2) poor quality of the ABVS images; and (3) absence of a definitive pathological diagnosis.
2.2 ABVS images and clinicopathological information acquisition
The ABVS examinations were performed using the ACUSON S2000 Automated Breast Volume Scanner (Siemens Medical Solutions, Inc.) by one of two well-trained technologists (each with at least 6 months of previous training on ABVS). For more details on the ABVS examination, see Kim et al.8 After the examination, axial ABVS images were sent to a dedicated workstation, where the sagittal and coronal images were reconstructed automatically. A radiologist with 8 years of experience in US-based breast diagnosis and 1 year of experience in ABVS imaging selected the ABVS images that showed the maximum size of the target breast lesion on the axial, sagittal, and coronal planes. These ABVS images were exported in DICOM format, and all annotations and marks were removed for further radiomics analysis.
US-guided core needle biopsy was performed by experienced US interventional doctors. According to the standard biopsy procedure, four to eight samples per lesion were acquired using an automatic biopsy gun with a 14G or 16G needle.23 The final pathological diagnoses were divided into benign and malignant, in which malignancy was defined as infiltrating carcinoma or ductal carcinoma in situ, and all other diagnoses were considered benign. Information about age, menopausal status, and family history of breast cancer was obtained directly from the patients. Breast density was assessed on digital mammography and classified into categories A–D according to the BI-RADS classification. Lesion size was measured as the largest diameter found on the coronal plane of ABVS.
2.3 Lesion segmentation and radiomics feature extraction
Breast lesions were manually segmented using free open-source software (MaZda, version 4.6, www.eletel.p.lodz.pl/mazad/) by one investigator (with 7 years of experience in breast US) who was blinded to the pathology of the breast lesions. The ABVS images were normalized using MaZda's built-in image normalization method before segmentation to minimize the influence of contrast and brightness variation. The region of interest (ROI) covered the whole lesion and adjacent tissues within 1–2 mm from the lesion margin. Seven common feature groups were automatically extracted with the MaZda software (Table 1).
Radiomics features | Description |
---|---|
Histogram | Mean, variance, skewness, kurtosis, and 1st, 10th, 50th, 90th, and 99th percentiles |
Geometry | Descriptors of the two-dimensional size and shape of the ROI |
Absolute gradient | Mean, variance, skewness, kurtosis, and percentage of pixels with nonzero gradient |
GLCM | Angular second moment, contrast, correlation, sum of squares, inverse difference moment, sum average, sum variance, sum entropy, entropy, difference variance, and difference entropy; parameters are computed up to 20 times for (d, 0), (0, d), (d, d), (d, −d), and the d can take values of 1, 2, 3, 4, and 5 |
RLM | Run-length nonuniformity, gray-level nonuniformity, long-run emphasis, short-run emphasis, and fraction of image in runs; parameters are computed 4 times for horizontal, 45°, vertical, and 135° directions |
AM | Assumes a local interaction between image pixels in that pixel intensity is a weighted sum of neighborhood pixel intensities and has 5 unknown model parameters – the standard deviation of the driving noise es and the model parameter vector θ = [θ1, θ2, θ3, θ4] |
Wavelet transform | The discrete wavelet transform is a linear transformation that operates on a data vector whose length is an integer power of two, transforming it into a numerically different vector of the same length |
- Note: es denotes an independent and identically distributed noise; θ is a vector of model parameters.
- Abbreviations: AM, autoregressive model; GLCM, gray-level co-occurrence matrix; RLM, run-length matrix; ROI, region of interest.
To determine the intra- and interobserver reproducibility of radiomics feature extraction, intraclass correlation coefficients (ICCs) were calculated. Thirty BI-RADS 4 lesions were randomly selected for ROI segmentation by two radiologists (R1 and R2, with 3 and 6 years of experience in breast US, respectively) to evaluate the interobserver ICC. Two weeks later, R1 repeated the ROI segmentation to evaluate the intraobserver ICC. An ICC greater than 0.80 was considered to indicate satisfactory agreement.24
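The text does not state which ICC form was used. As a minimal sketch, assuming the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)), the coefficient for one radiomics feature could be computed as follows. The function name `icc2_1` and the synthetic data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings has shape (n_targets, k_raters), e.g. one radiomics feature
    measured on n lesions by k observers (or in k sessions).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between lesions
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between observers
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Synthetic illustration only (NOT study data): 30 lesions, two observers.
rng = np.random.default_rng(0)
truth = rng.normal(size=30)
feature = np.column_stack([truth, truth + rng.normal(scale=0.1, size=30)])
icc = icc2_1(feature)
satisfactory = icc > 0.80  # threshold used in this study
```

In practice this would be repeated for every extracted feature, once for the two observers and once for the repeated segmentation by the same observer.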
2.4 Machine learning based on ABVS radiomics
The flow chart of the study is outlined in Figure 1. All BI-RADS 4 lesions enrolled in the study were divided into two cohorts, the training cohort (80% of cases) and the testing cohort (20% of cases). To maintain a consistent percentage of malignant BI-RADS 4 lesions between the training and testing cohorts, stratified sampling was used to match the two cohorts.
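A stratified 80/20 split of this kind can be sketched with scikit-learn's `train_test_split`. This is an illustration under assumptions: the feature matrix is random placeholder data, the `random_state` is arbitrary, and the text does not say that this particular function was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels matching the reported totals: 103 malignant (1) and 120 benign (0).
y = np.array([1] * 103 + [0] * 120)
# Placeholder feature matrix (random values, NOT study data):
# 223 lesions x 1101 radiomics features.
X = np.random.default_rng(0).normal(size=(223, 1101))

# stratify=y keeps the malignant fraction nearly equal in both cohorts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # seed is an assumption
)
print(X_train.shape[0], X_test.shape[0])
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With 223 lesions, a 20% test fraction yields 45 testing and 178 training lesions, which matches the cohort sizes reported in Section 3.1.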

Seven feature selection methods [mutual information and maximal information coefficient (MIC), random forest (RF), recursive feature elimination (RFE), minimum redundancy maximum relevance (mRMR), linear support vector classification (LSVC), logistic regression (LR), and embedding tree] and 13 ML algorithms [logistic regression (LR), support vector machine (SVM), decision tree (DT), K-nearest neighbor (KNN), extra tree (ET), random forest (RF), Gaussian naive Bayesian (Gaussian NB), linear discriminant analysis (LDA), gradient boost (GB), adaptive boosting (AdaBoost), multilayer perceptron (MLP), deep neural networks (DNN), and Bagging] were combined in pairs to construct different ML pipelines.
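Pairing every feature selection method with every classifier gives 7 × 13 = 91 candidate pipelines. The short sketch below only enumerates the pairings; the string labels are shorthand chosen here for illustration, following the "classifier-selector" naming used in the Results (e.g., DNN-RFE).

```python
from itertools import product

# Shorthand labels for the seven feature selection methods.
feature_selectors = ["MIC", "RF", "RFE", "mRMR", "LSVC", "LR", "EmbeddingTree"]
# Shorthand labels for the 13 ML algorithms.
classifiers = ["LR", "SVM", "DT", "KNN", "ET", "RF", "GaussianNB",
               "LDA", "GB", "AdaBoost", "MLP", "DNN", "Bagging"]

# One pipeline per (selector, classifier) pair.
pipelines = [f"{clf}-{sel}" for sel, clf in product(feature_selectors, classifiers)]
print(len(pipelines))  # 91
```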
2.5 Training and testing of machine learning models
In the training cohort, the performance of each ML pipeline was evaluated based on the results of three repetitions of fivefold cross-validation (5-CV) using area under the receiver operating characteristic (ROC) curve (AUC) analysis. For each 5-CV process, the dataset of the training cohort was randomly divided into five folds of approximately equal sample size, of which four folds were used to develop the ML model and the remaining fold was used to calculate the model performance metrics. After five iterations, each fold had been used as the validation set exactly once. To select the best configuration of hyperparameters for each ML pipeline, we performed a grid search with 5-CV in the training cohort. The hyperparameters of each ML model used in this study are shown in Table 2. To avoid overfitting, the top six ML pipelines in the training cohort were selected for further verification in the testing cohort.
Methods | Hyper-parameters | Range |
---|---|---|
LR | Class-weight | ['balanced', None] |
LDA | None | None |
SVM | None | None |
DT | Max-depth | [5, 10, 20] |
DT | Min-samples-leaf | [2, 4, 8, 16] |
DT | Min-samples-split | [2, 4, 8, 16] |
DT | Class-weight | ['balanced', None] |
KNN | Weight | ['uniform', 'distance'] |
KNN | p | [1, 2] |
ET | n-estimators | [10, 100, 1000] |
RF | Max-depth | Uniform (loc = 5, scale = 10) |
RF | Min-samples-leaf | Uniform (loc = 0, scale = 0.1) |
RF | Min-samples-split | Uniform (loc = 0, scale = 0.1) |
RF | n-estimators | [10, 20, 35, 50] |
Gaussian NB | None | None |
GB | n-estimators | [5, 10, 100] |
GB | Max-depth | Uniform (loc = 2, scale = 10) |
AdaBoost | Base-estimator | [Logistic, SVM, Gaussian NB, DT-clf-best] |
AdaBoost | n-estimators | [10, 30, 100] |
MLP | Activation | ['identity', 'logistic', 'relu'] |
MLP | Learning-rate | ['constant', 'invscaling', 'adaptive'] |
MLP | Hidden-layer-sizes | [(32, 32), (64, 64), (128, 128)] |
DNN | None | None |
Bagging | Base-estimator | [Logistic, SVM, Gaussian NB, DT-clf-best] |
Bagging | n-estimators | [10, 100, 1000] |
- Abbreviations: AdaBoost, adaptive boosting; clf, classifier; DNN, deep neural networks; DT, decision tree; ET, extra tree; Gaussian NB, Gaussian naive Bayesian; GB, gradient boost; invscaling, inverse scaling; KNN, K-nearest neighbor; LDA, linear discriminant analysis; LR, logistic regression; MLP, multilayer perceptron; relu, rectified linear units; RF, random forest; SVM, support vector machine.
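The cross-validation and tuning procedure can be sketched for a single pipeline (LR-RFE) with scikit-learn. Several choices below are assumptions made for illustration: the data are synthetic, logistic regression serves as the RFE base estimator, a scaling step is included, the RFE step size is 5, and tuning and evaluation share one repeated 5-fold splitter. Only the class-weight grid and the 15 retained features come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training cohort (NOT study data).
X_train, y_train = make_classification(
    n_samples=178, n_features=40, n_informative=8, random_state=0
)

# Feature selection sits inside the pipeline so that it is refit on the
# training folds only, which keeps the validation folds unseen.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=15, step=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Three repetitions of stratified fivefold cross-validation, scored by AUC.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
search = GridSearchCV(
    pipe,
    param_grid={"clf__class_weight": ["balanced", None]},  # LR row of Table 2
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_train, y_train)
mean_auc = search.best_score_  # mean AUC over the 15 validation folds
```

Keeping the selector inside the pipeline is a design choice of this sketch; selecting features on the full training cohort before cross-validation would leak information into the validation folds.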
In the testing cohort, the output of the ML model was the malignant probability of each BI-RADS 4 lesion (ranging from 0% to 100%), and the performance metrics of the ML model were calculated, including AUC, sensitivity, specificity, accuracy, negative predictive value (NPV), and positive predictive value (PPV). To compare the diagnostic performance of ML and radiologists, three radiologists (R1, R2, and R3, with 3, 6, and 10 years of experience in breast ultrasound, respectively) independently evaluated the malignant probability of each BI-RADS 4 lesion on a scale of 2%–95%, according to their “best guess” after observing the axial, sagittal, and coronal images of ABVS.25 The malignancy probability scale of BI-RADS 4 lesions follows the second edition of the ACR BI-RADS US atlas.
2.6 Statistical analysis
Statistical analyses were performed using IBM SPSS Statistics software (version 24.0, SPSS Inc.) and Python (version 3.6.8; https://www.python.org/). Differences in characteristics between the training and testing cohorts and between the benign and malignant groups in each of these cohorts were analyzed using the SPSS software. The differences in age and tumor size were assessed using the independent-samples t-test, and the chi-square test was used to evaluate the differences in breast density, subcategory of BI-RADS 4, and family history of breast cancer. The “Keras” package (version 2.4.3) was used for DNN modeling, and the “sklearn” package (version 0.23.1) was used for other ML modeling and feature selection. The feature importance ranking was computed using the “eli5” package (version 0.10.1). ROC curve analysis was performed to determine the performance of the ML pipelines and radiologists, and accuracy, sensitivity, specificity, NPV, PPV, and AUC were calculated with the “sklearn” package. AUCs were compared using DeLong's test, and McNemar's test was performed to assess differences in performance between ML and radiologists. A two-sided p value less than 0.05 was considered to indicate statistical significance.
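McNemar's test compares two classifiers on the same lesions using only the discordant pairs. The text does not state whether the exact or the chi-square version was used; the sketch below assumes the exact (binomial) version, and the counts in the usage line are hypothetical, not taken from the study.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value.

    b: lesions classified correctly by A but incorrectly by B.
    c: lesions classified correctly by B but incorrectly by A.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(b=10, c=2)
significant = p_value < 0.05
```

Lesions that both classifiers get right, or both get wrong, do not enter the statistic, which is why only `b` and `c` are needed.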
3 RESULTS
3.1 Clinicopathologic characteristics of breast lesions
A total of 223 BI-RADS 4 lesions from 193 patients (mean age, 49.4 ± 12.3 years; range, 25–79 years) were enrolled in this study, of which 103 were malignant and 120 were benign. The average size of the lesions was 19.5 ± 10.2 mm (range, 5–59 mm). The subcategories of BI-RADS 4 lesions were as follows: BI-RADS 4A (n = 104, 46.6%), BI-RADS 4B (n = 43, 19.3%), and BI-RADS 4C (n = 76, 34.1%). The malignancy rates of categories 4A, 4B, and 4C were 11.5% (12/104), 44.2% (19/43), and 94.7% (72/76), respectively. Finally, 178 lesions (97 benign and 81 malignant) were assigned to the training cohort and 45 lesions (23 benign and 22 malignant) were assigned to the testing cohort. There were no significant differences between the two cohorts in age, lesion size, location of lesions, subcategory of BI-RADS 4, breast density, or family history of breast cancer (p = 0.229, 0.549, 0.992, 0.593, 0.878, and 0.153, respectively). We also compared these characteristics between malignant and benign lesions within each cohort (Table 3).
Characteristic | Training cohort (n = 178): malignant (n = 81) | Training cohort: benign (n = 97) | p (training) | Testing cohort (n = 45): malignant (n = 22) | Testing cohort: benign (n = 23) | p (testing) |
---|---|---|---|---|---|---|
Age (year) | 54.6 ± 12.5 | 44.1 ± 9.9 | <0.001 | 57.5 ± 9.4 | 45.4 ± 11.7 | <0.001 |
Lesion size (cm) | 2.3 ± 0.9 | 1.6 ± 0.9 | <0.001 | 2.5 ± 1.2 | 1.6 ± 0.7 | 0.005 |
Subcategory of BI-RADS 4 | | | | | | |
4A | 11 (13.6%) | 73 (75.3%) | <0.001 | 1 (4.6%) | 19 (82.6%) | <0.001 |
4B | 16 (19.8%) | 20 (20.6%) | | 3 (13.6%) | 4 (17.4%) | |
4C | 54 (66.6%) | 4 (4.1%) | | 18 (81.8%) | 0 (0.0%) | |
Breast density | | | | | | |
A | 11 (13.6%) | 4 (4.1%) | <0.001 | 3 (13.6%) | 0 (0.0%) | 0.319 |
B | 38 (46.9%) | 26 (26.8%) | | 8 (36.4%) | 9 (39.2%) | |
C | 26 (32.1%) | 42 (43.3%) | | 8 (36.4%) | 11 (47.8%) | |
D | 6 (7.4%) | 25 (25.8%) | | 3 (13.6%) | 3 (13.0%) | |
Menopausal | | | | | | |
Pre- | 33 (40.7%) | 73 (75.3%) | <0.001 | 5 (22.7%) | 14 (60.9%) | 0.010 |
Post- | 48 (59.3%) | 24 (24.7%) | | 17 (77.3%) | 9 (39.1%) | |
Family history | | | | | | |
Yes | 8 (9.9%) | 8 (8.2%) | 0.705 | 5 (22.7%) | 6 (26.1%) | 0.793 |
No | 73 (90.1%) | 89 (91.8%) | | 17 (77.3%) | 17 (73.9%) | |
Location of lesions | | | | | | |
Left | 51 (63.0%) | 52 (53.6%) | 0.208 | 11 (50.0%) | 15 (65.2%) | 0.302 |
Right | 30 (37.0%) | 45 (46.4%) | | 11 (50.0%) | 8 (34.8%) | |
- Note: The lesion size was defined as the maximum diameter on ABVS images.
- Family history refers to breast and/or ovarian cancer in first-degree relatives.
- The differences in age and lesion size between groups were compared by two-sample t test, whereas chi-square tests were used for the other variables; p < 0.05 was considered statistically significant.
- Abbreviation: BI-RADS, Breast Imaging Reporting and Data System.
3.2 Feature extraction and the intra- and interobserver agreement
In this study, a total of 1101 (367 × 3) features were extracted for each lesion from the axial, sagittal, and coronal ABVS images. The mean ICCs of intra- and interobserver agreement were 0.96 (range, 0.81–0.99) and 0.95 (range, 0.55–0.98), respectively, indicating that, on average, intra- and interobserver reproducibility was satisfactory for the radiomics features extracted from the ABVS images.
3.3 Predictive performance of multiple ML pipelines
The predictive performance of the different ML pipelines (pairwise combinations of 7 feature selection methods and 13 ML algorithms) is shown in Figure 2. The top six ML pipelines were DNN-RFE (mean AUC: 0.972), AdaBoost-RFE (mean AUC: 0.969), LR-RFE (mean AUC: 0.968), LDA-RFE (mean AUC: 0.968), Bagging-RFE (mean AUC: 0.967), and SVM-RFE (mean AUC: 0.962). Combined with the RFE feature selection method, the 13 ML algorithms showed stable and satisfactory performance, with mean AUCs ranging from 0.809 to 0.972 (Figure 3A). Fifteen important radiomics features for predicting the malignancy probability of BI-RADS 4 lesions were selected by RFE (Figure 3B).


3.4 Testing the predictive performance of the selected ML models
In the testing cohort, the performance of the six selected ML models (DNN-RFE, AdaBoost-RFE, LR-RFE, LDA-RFE, Bagging-RFE, and SVM-RFE) and three radiologists (R1, R2, and R3) is shown in Figure 4. DNN-RFE again obtained the highest AUC (0.954), followed by LR-RFE (0.948), LDA-RFE (0.942), Bagging-RFE (0.942), AdaBoost-RFE (0.940), and SVM-RFE (0.921), while the AUCs of the three radiologists were 0.776, 0.917, and 0.928, respectively. The AUC of the DNN-RFE was significantly higher than that of R1 (0.954 vs. 0.776, p = 0.004) and nonsignificantly higher than those of R2 and R3 (0.954 vs. 0.917 and 0.928, p = 0.246 and 0.322, respectively). In addition, the sensitivity, specificity, accuracy, NPV, and PPV of the six ML pipelines and three radiologists are summarized in Table 4. DNN-RFE had the highest accuracy (88.9%), with a sensitivity, specificity, NPV, and PPV of 95.2%, 83.3%, 95.2%, and 83.3%, respectively, while R1 had the lowest accuracy (64.4%), with a sensitivity, specificity, NPV, and PPV of 85.7%, 45.8%, 78.6%, and 58.1%, respectively. By McNemar's test, the readings of DNN-RFE differed significantly from those of the less experienced radiologist (R1) (p = 0.043) but not from those of R2 and R3 (p = 0.343 and 0.773, respectively). Two representative cases in which ML was superior to radiologists in predicting benign and malignant BI-RADS 4 lesions are shown in Figure 5.

Model | AUC | Accuracy | Specificity | Sensitivity | NPV | PPV | TP | FP | TN | FN |
---|---|---|---|---|---|---|---|---|---|---|
DNN-RFE | 0.954 | 88.9% | 83.3% | 95.2% | 95.2% | 83.3% | 20 | 4 | 20 | 1 |
LR-RFE | 0.948 | 86.7% | 83.3% | 90.5% | 90.9% | 82.6% | 19 | 4 | 20 | 2 |
Bagging-RFE | 0.942 | 86.7% | 83.3% | 90.5% | 90.9% | 82.6% | 19 | 4 | 20 | 2 |
LDA-RFE | 0.942 | 86.7% | 83.3% | 90.5% | 90.9% | 82.6% | 19 | 4 | 20 | 2 |
AdaBoost-RFE | 0.940 | 84.4% | 83.3% | 85.7% | 87.0% | 81.8% | 18 | 4 | 20 | 3 |
SVM-RFE | 0.921 | 82.2% | 83.3% | 81.0% | 83.3% | 81.0% | 17 | 4 | 20 | 4 |
Reader 1 | 0.776 | 64.4% | 45.8% | 85.7% | 78.6% | 58.1% | 18 | 13 | 11 | 3 |
Reader 2 | 0.917 | 77.8% | 66.7% | 90.5% | 88.9% | 70.4% | 19 | 8 | 16 | 2 |
Reader 3 | 0.928 | 86.7% | 83.3% | 90.5% | 90.9% | 82.6% | 19 | 4 | 20 | 2 |
- Note: Reader 1, Reader 2, and Reader 3 had 3, 6, and 10 years of experience in breast ultrasound, respectively.
- Abbreviations: AdaBoost, adaptive boosting; DNN, deep neural networks; FN, false negative; FP, false positive; LDA, linear discriminant analysis; LR, logistic regression; NPV, negative predictive value; PPV, positive predictive value; RFE, recursive feature elimination; SVM, support vector machine; TN, true negative; TP, true positive.
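The percentages in Table 4 follow directly from the TP, FP, TN, and FN counts. The sketch below recomputes them for two rows using the standard definitions; the helper name `diagnostic_metrics` is introduced here for illustration and is not from the study's code.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Counts taken from the DNN-RFE and Reader 1 rows of Table 4.
dnn_rfe = diagnostic_metrics(tp=20, fp=4, tn=20, fn=1)
reader_1 = diagnostic_metrics(tp=18, fp=13, tn=11, fn=3)

for name, metrics in [("DNN-RFE", dnn_rfe), ("Reader 1", reader_1)]:
    print(name, {k: f"{100 * v:.1f}%" for k, v in metrics.items()})
```

For DNN-RFE this gives an accuracy of 88.9%, sensitivity of 95.2%, specificity of 83.3%, PPV of 83.3%, and NPV of 95.2%, in agreement with the table.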

4 DISCUSSION
It is well known that BI-RADS 4 lesions have a wide range in the probability of malignancy (2%–95%), and the US characteristics of BI-RADS 4 lesions overlap to a certain degree, which may lead to a high false-positive rate and unnecessary biopsies.26 In our study, only 46.2% (103/223) of BI-RADS 4 lesions were pathologically confirmed to be malignant, meaning that 53.8% of the lesions were biopsied unnecessarily. Recently, as supplements to conventional US, shear wave elastography (SWE), contrast-enhanced ultrasound (CEUS), and MRI have provided more diagnostic information for BI-RADS 4 lesions, and the AUCs of these multi-mode methods range from 0.78 to 0.93.27-29 However, these multi-mode methods require specialized medical equipment and specially trained radiologists, which not only increases the workload of radiologists but also increases the financial burden on patients. In our study, we used an ML method based on ABVS radiomics, which is an objective, convenient, and low-cost method that can distinguish between benign and malignant BI-RADS 4 lesions, and its AUC (0.954) was higher than those reported for the previous multi-mode methods. Thus, our research supports, to some extent, the potential of ML based on ABVS radiomics for distinguishing benign and malignant BI-RADS 4 lesions.
According to the radiomics quality score (RQS) proposed by Lambin et al.,22 ABVS images, with their standardized, repeatable, and high-resolution characteristics, are suitable for radiomics analysis, but few studies have investigated the application of ABVS-based radiomics analysis. Marcon et al. used an ML algorithm (SVM) based on ABVS radiomics features to distinguish benign and malignant breast lesions with a maximum AUC value of 0.98 and a maximum accuracy of 90.7%.20 Another study used a novel ML algorithm (DNN) based on ABVS radiomics to detect and classify breast nodules, and the sensitivity, specificity, and accuracy of classification were 87.0%, 88.0%, and 87.5%, respectively.21 In our study, the selected ML pipeline (DNN-RFE) based on ABVS radiomics also showed satisfactory discrimination performance in the testing cohort, with an AUC, sensitivity, specificity, and accuracy of 0.954, 95.2%, 83.3%, and 88.9%, respectively. Interestingly, Romeo et al.30 used an ML algorithm (RF) based on US radiomics to distinguish benign and malignant breast lesions with an accuracy of 82% and an AUC of 0.82, which are lower than those of our study (88.9% and 0.954), although our study focused on more challenging lesions (BI-RADS 4). A possible reason is that the three ABVS images (axial, sagittal, and coronal) may provide more radiomic features and better represent tumor heterogeneity than a single US image. Thus, ABVS images appear suitable for radiomics and ML methods and may provide clinical decision support for the management and treatment of breast cancer. Besides, the application of radiomics and ML to ABVS neither depends on the experience of radiologists nor requires specific training of radiologists, and only three ABVS images of the lesion are needed for diagnosis, which greatly simplifies the diagnostic workflow of ABVS and may promote its clinical application.
With the increasing development of ML technology, more and more dimensionality reduction methods have been proposed, and different dimensionality reduction methods may affect the performance of ML models.31 In our study, combined with the mRMR feature selection method, the prediction performance of the 13 ML algorithms was generally low (mean AUCs: 0.762–0.840). In contrast, combined with the RFE feature selection method, the 13 ML algorithms showed relatively stable and satisfactory performance (mean AUCs: 0.809–0.972). More importantly, the top six ML combinations for predicting malignant BI-RADS 4 lesions were all based on the RFE feature selection method. Therefore, RFE may be the most appropriate method for selecting the radiomics features of ABVS images in our study. Similar to previous ML studies,32, 33 we also applied a variety of ML algorithms and found that the DNN had the best predictive performance in both the training cohort (AUC: 0.972) and the testing cohort (AUC: 0.954). A DNN is a neural network that has three or more “hidden layers” between the input layer and the output layer.34 DNNs have achieved great success in various fields of medicine, often obtaining higher accuracy than traditional ML methods and performance comparable to that of trained human specialists.35 Therefore, it was not surprising that the DNN model had the highest AUC in our study. Interestingly, we noticed that the AdaBoost model had the second-best predictive performance, after DNN, in the training cohort but the second-worst performance, ahead of only SVM, in the testing cohort. This may be due to the poor robustness of the AdaBoost algorithm in our research, and this is why we selected the top six ML models for further verification in the testing cohort.
In our study, the ML model (DNN-RFE) was superior to radiologists, especially the less experienced radiologist (R1), in predicting benign and malignant BI-RADS 4 lesions. The main reason is that ML methods are based on the radiomics features of ABVS images, which may not be identified by visual interpretation but may potentially be associated with important clinical outcomes.36 In addition, the performance of the radiologists may have been underestimated. In clinical practice, radiologists often evaluate breast lesions based on comprehensive information, including not only dynamic real-time US images but also other examination results and medical histories.37 In this study, to ensure consistency between the radiologists and the ML model, only three static ABVS images were provided to the radiologists, which undoubtedly made their task more difficult. However, from another perspective, this also indirectly supports the potential advantage of ML in predicting benign and malignant BI-RADS 4 lesions.
There were several limitations in the present study. First, the number of lesions from a single center was relatively small, and the lack of external validation may have led to incomplete or biased results. Second, radiomics features were extracted from the two-dimensional (2D) ROI of the ABVS images instead of the 3D ROI. However, 3D ROIs would significantly increase the workload and time of radiologists, and a study revealed that 2D single-slice texture analysis affords results fairly comparable to those afforded by 3D whole-tumor analyses.38 Third, the value of the ICC analysis is limited because all radiologists performed segmentation on the ABVS images selected by a single radiologist rather than selecting the images independently. Last, manual segmentation of lesions may introduce inter-operator variability. With the development of artificial intelligence, semiautomatic or automatic segmentation may become feasible in the foreseeable future.
5 CONCLUSIONS
ML based on ABVS radiomics may be a potential tool to distinguish between benign and malignant BI-RADS 4 lesions and reduce unnecessary biopsies in patients. Considering its noninvasiveness, objectivity, and convenience, this method is worthy of further validation in large-scale, multicenter studies.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the financial support by the Guangdong Province Key Field Research and Development Project (2018B030332001), as well as the Sun Yat-sen University 5010 Program Cultivation Project (2016016).
CONFLICT OF INTEREST
The authors declare no conflicts of interest.
AUTHOR CONTRIBUTIONS
Conceptualization: Shi-jie Wang, Hua-qing Liu, Tao Yang, Ming-quan Huang, Bo-wen Zheng, Tao Wu, Lan-qing Han, Yong Zhang, and Jie Ren; Methodology: Shi-jie Wang, Hua-qing Liu, Tao Yang, Ming-quan Huang, Bo-wen Zheng, Tao Wu, Lan-qing Han, Yong Zhang, and Jie Ren; Software: Shi-jie Wang, Hua-qing Liu, Lan-qing Han, and Jie Ren; Formal Analysis: Shi-jie Wang, Hua-qing Liu, Tao Yang, Ming-quan Huang, Bo-wen Zheng, Tao Wu, Lan-qing Han, Yong Zhang, and Jie Ren; Investigation: Shi-jie Wang, Hua-qing Liu, Tao Yang, Ming-quan Huang, Bo-wen Zheng, Tao Wu, Lan-qing Han, Yong Zhang, and Jie Ren; Resources: Shi-jie Wang, Tao Yang, Ming-quan Huang, and Jie Ren; Data Curation: Shi-jie Wang, Hua-qing Liu, Tao Yang, Ming-quan Huang, Bo-wen Zheng, Tao Wu, Lan-qing Han, Yong Zhang, and Jie Ren; Writing – Original Draft Preparation: Shi-jie Wang, Hua-qing Liu, Bo-wen Zheng, Tao Wu, Yong Zhang, and Jie Ren; Writing – Review and Editing: Shi-jie Wang, Hua-qing Liu, Tao Yang, Ming-quan Huang, Bo-wen Zheng, Tao Wu, Lan-qing Han, Yong Zhang, and Jie Ren; Supervision: Yong Zhang and Jie Ren; Funding Acquisition: Jie Ren. All authors have read and agreed to the published version of the manuscript.
Open Research
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.