Volume 206, Issue 2 pp. 596-606
ORIGINAL PAPER
Open Access

Development and validation of a deep learning model for morphological assessment of myeloproliferative neoplasms using clinical data and digital pathology

Rong Wang

Department of Haematology, Collaborative Innovation Center for Cancer Personalized Medicine, Jiangsu Province Hospital, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Zhongxun Shi

Department of Haematology, Collaborative Innovation Center for Cancer Personalized Medicine, Jiangsu Province Hospital, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Yuan Zhang

Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, China

Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
Liangmin Wei

Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China

Department of Public Health, School of Medicine & Holistic Integrative Medicine, Nanjing University of Chinese Medicine, Nanjing, China
Minghui Duan

Department of Haematology, Peking Union Medical College Hospital, Beijing, China
Min Xiao

Department of Haematology, Tongji Medical College, Tongji Hospital, Huazhong University of Science and Technology, Wuhan, China
Jin Wang

Department of Haematology, Tongji Medical College, Tongji Hospital, Huazhong University of Science and Technology, Wuhan, China
Suning Chen

NHC Key Laboratory of Thrombosis and Hemostasis, National Clinical Research Center for Haematologic Diseases, Jiangsu Institute of Haematology, The First Affiliated Hospital of Soochow University, Suzhou, China
Qian Wang

NHC Key Laboratory of Thrombosis and Hemostasis, National Clinical Research Center for Haematologic Diseases, Jiangsu Institute of Haematology, The First Affiliated Hospital of Soochow University, Suzhou, China
Jianyao Huang

Department of Haematology, The First Affiliated Hospital of Anhui Medical University, Hefei, China
Xiaomei Hu

Department of Pathology, Fujian Medical University Union Hospital, Fuzhou, China
Jinhong Mei

The First Affiliated Hospital of Nanchang University, Nanchang, China
Jieyu He

Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
Feng Chen

Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, China

Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
Lei Fan

Department of Haematology, Collaborative Innovation Center for Cancer Personalized Medicine, Jiangsu Province Hospital, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Guanyu Yang

Corresponding Author

Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, China
Wenyi Shen

Corresponding Author

Department of Haematology, Collaborative Innovation Center for Cancer Personalized Medicine, Jiangsu Province Hospital, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Yongyue Wei

Corresponding Author

Center for Public Health and Epidemic Preparedness & Response, Key Laboratory of Epidemiology of Major Diseases (Ministry of Education), School of Public Health, Peking University, Beijing, China

Correspondence

Yongyue Wei, Center for Public Health and Epidemic Preparedness & Response, Peking University, No. 38 Xueyuan Road, Beijing 100191, China. Email: [email protected]

Wenyi Shen, Department of Haematology, Collaborative Innovation Center for Cancer Personalized Medicine, Jiangsu Province Hospital, The First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing 210029, China. Email: [email protected]

Guanyu Yang, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, No. 2 Sipailou, Nanjing 210096, China. Email: [email protected]
Jianyong Li

Department of Haematology, Collaborative Innovation Center for Cancer Personalized Medicine, Jiangsu Province Hospital, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
First published: 10 December 2024

Rong Wang, Zhongxun Shi and Yuan Zhang contributed equally to the work as first authors.

Guanyu Yang, Wenyi Shen and Yongyue Wei contributed equally to this study.

Jianyong Li is the senior author.

Summary

The subjectivity of morphological assessment and the overlapping pathological features of different subtypes of myeloproliferative neoplasms (MPNs) make accurate diagnosis challenging. To improve the pathological assessment of MPNs, we developed a diagnosis model (the fusion model) that combines bone marrow whole-slide images (the deep learning [DL] model) with clinical parameters (the clinical model). One thousand and fifty-one MPN and non-MPN patients were divided into a training cohort, an internal testing cohort, and one internal and two external validation cohorts (together, the combined validation cohort). In the combined validation cohort, the fusion model achieved higher areas under the curve (AUCs) than the clinical model, the DL model, or both for MPN and subtype identification. Compared with haematopathologists of different experience levels, the clinical model achieved an AUC for polycythaemia vera that was comparable to senior haematopathologists and higher than junior haematopathologists (p = 0.0208). The AUCs of the fusion model were comparable to seniors and higher than juniors for essential thrombocytosis (p = 0.0141), prefibrotic primary myelofibrosis (p = 0.0085) and overt primary myelofibrosis (p = 0.0330) identification. In conclusion, the performance of our proposed models is equivalent to that of senior haematopathologists and better than that of juniors, providing a new perspective on the utilization of DL algorithms in MPN morphological assessment.

Graphical Abstract

The subjectivity of morphological assessment and the overlapping pathological features of different subtypes of myeloproliferative neoplasms (MPNs) make accurate diagnosis challenging. To improve the pathological assessment of MPNs, we collected haematoxylin–eosin-stained bone marrow whole-slide images (WSIs) and clinical features from 1051 MPN and non-MPN patients at seven hospitals in China and developed an artificial intelligence-assisted model that combines WSIs with clinical parameters for pathological diagnosis. Our model performed well in distinguishing MPNs from non-MPN conditions and in differentiating MPN subtypes. Its diagnostic accuracy surpassed that of junior haematopathologists and was equivalent to that of seniors, providing a new perspective on the comprehensive utilization of deep learning models in MPN morphological assessment.

INTRODUCTION

Myeloproliferative neoplasms (MPNs) are a group of malignant clonal disorders that originate from haematopoietic stem cells. They are characterized by the proliferation of one or more myeloid lineages and an increased risk of transformation to acute myeloid leukaemia (AML).1 Clinical behaviour and outcomes differ between subtypes, especially between primary myelofibrosis (PMF) and the other subtypes,2 making accurate diagnosis crucial. Distinction between MPN subtypes is based on integrating clinical findings with molecular profiles and bone marrow (BM) morphological evaluation.3 Somatic Janus kinase 2 (JAK2), calreticulin (CALR) or myeloproliferative leukaemia protein (MPL) mutations (so-called ‘driver mutations’) are present in the majority of MPN patients and play an important role in distinguishing MPNs from reactive conditions. However, these markers are uninformative in MPN cases that lack driver mutations and cannot differentiate between MPN subtypes. Since the 2016 World Health Organization (WHO) MPN diagnostic criteria, BM histology has been promoted from a minor to a major criterion for all subtypes,4 and it has become an indispensable tool for the diagnosis of MPN subtypes. However, the subjectivity of morphological assessment and the markedly overlapping pathological features of different subtypes, especially essential thrombocytosis (ET) and prefibrotic PMF (prePMF), lead to a lack of interobserver reproducibility, making accurate pathological diagnosis challenging and controversial.5, 6

Deep learning (DL) techniques have driven significant improvements in digital pathology image analysis,7 providing physicians with numerous potential tools and achieving state-of-the-art performance on tasks such as disease detection, classification and survival prediction for malignant tumours.8-11 In haematological disorders, DL has therefore been applied to enhance diagnostic methods, moving from conventional, subjective assessment of morphological and laboratory features to objective, quantitative and automated characterization of BM components and clinical information.12 Recent pilot studies with limited sample sizes (fewer than 200 samples) have demonstrated the potential of DL systems as effective assistive tools for MPN diagnosis based on the evaluation of megakaryocyte features and BM fibrosis grades, which are important features but lack interobserver reproducibility.13, 14 However, MPN morphological diagnosis requires global pathological assessment, and previous models have relied on expensive and time-consuming manual annotations of a single cell type or of fibrosis grades. Moreover, the generalizability of DL in large-scale deployments has not been reported.

To tackle the aforementioned challenges in MPN pathological diagnosis, we propose a DL model that utilizes annotation-free haematoxylin–eosin (HE)-stained BM whole-slide images (WSIs) and incorporates clinical information to improve the accuracy of pathological diagnosis. The model was developed and validated in a cohort of 1051 patients from seven representative hospitals, currently the largest cohort available for the construction of an artificial intelligence (AI)-assisted MPN diagnostic model. Our model effectively distinguished MPNs from non-MPNs and exhibited good performance in subtype differentiation, comparable to that of experienced haematopathologists, thereby providing a new perspective on the comprehensive utilization of DL algorithms in MPN pathological assessment.

MATERIALS AND METHODS

Study populations and diagnosis

A total of 1215 patients, comprising MPN patients with polycythaemia vera (PV), ET, prePMF or overt PMF and non-MPN patients, admitted to seven hospitals in China from March 2012 to September 2021, were retrospectively selected. After filtering for WSI and clinical information availability, 819 MPN patients, including 196 with PV, 225 with ET, 184 with prePMF and 214 with overt PMF, and 232 non-MPN patients with other haematological disorders or normal BM were finally enrolled in this study (Table 1). The final MPN diagnoses were made in accordance with the WHO (2016) MPN definitions.4 The distribution of non-MPN diagnoses is shown in Table S1. All patients underwent BM biopsy, with a specimen length of no less than 15 mm and a diameter of no less than 2 mm. Sections were prepared from paraffin-embedded specimens. HE and reticulin staining were performed for MPN cases. The pathological diagnosis for each patient was reached by consensus of five experienced haematopathologists.

TABLE 1. Diagnosis distributions of subjects from each hospital enrolled in this study.
Hospital Non-MPNs ET PV PrePMF Overt PMF Total
Jiangsu Province Hospital 161 155 108 133 134 691
Peking Union Medical College Hospital 25 19 19 8 15 86
Affiliated Provincial Hospital, Anhui Medical University 9 6 12 16 10 53
The First Affiliated Hospital of Soochow University 13 16 14 4 12 59
The First Affiliated Hospital of Nanchang University 9 9 6 5 11 40
Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology 12 14 24 13 19 82
Fujian Medical University Union Hospital 3 6 13 5 13 40
Total 232 225 196 184 214 1051
Abbreviations: ET, essential thrombocytosis; MPN, myeloproliferative neoplasm; prePMF, prefibrotic primary myelofibrosis; PV, polycythaemia vera.

Sample collection and data processing

WSIs of HE-stained BM biopsy specimens from MPN and non-MPN patients were collected and scanned at 0.2 μm/pixel using a PANNORAMIC SCAN 150 (3DHISTECH Ltd., Budapest, Hungary) (scanner 1) and a Motic VM1000 (MOTIC Ltd., Xiamen, China) (scanner 2). Patients whose WSIs were of poor image quality, preventing a diagnostic conclusion, were excluded. Clinical characteristics and laboratory measurements were obtained from medical records. For patients who underwent BM biopsy only once, the samples from that time point were used. For patients with examinations at several time points, the most recent WSIs and the corresponding clinical information were used.

Development of a DL neural network

The workflow of image preprocessing in DL neural network development is shown in Figure 1A.

FIGURE 1 Image preprocessing and visualization in deep learning (DL) model construction. (A) The workflow of image preprocessing in DL model construction. (B) Representative images of visualization of activated maps from the DL model using gradient-weighted class activation mapping.

WSIs carrying only a patient-level diagnostic label, without any cell-type-level or region-of-interest-level annotation, were used for DL model construction. Each WSI was cropped into non-overlapping 512 × 512 patches at 40× objective magnification (0.2 μm/pixel) to capture detailed image information.15 Each patch then underwent colour normalization using the Vahadane algorithm, which is specifically designed to mitigate variability in staining conditions across histopathology samples, in order to minimize inter-slide staining variability and enhance the comparability of the patches. Because the cropped patches often contained large blank background regions, whereas cell-enriched regions harbour the comprehensive information crucial for diagnosis, a cell-enriched patch selection model based on the ResNet-18 network was trained for WSI preprocessing. Details of preprocessing using ResNet-18 are provided in the Supporting Information, Methods.
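To make the patch-level workflow concrete, the following is a minimal sketch, not the authors' released code: it crops a slide-level image into non-overlapping 512 × 512 patches and keeps only those judged cell-enriched by a binary ResNet-18 classifier. The `cell_filter` checkpoint path and the probability threshold are hypothetical, and the Vahadane colour normalization step is indicated only by a comment.

```python
# Illustrative sketch: non-overlapping 512 x 512 patch cropping followed by
# cell-enriched patch selection with a binary ResNet-18 classifier.
import torch
import torchvision.transforms as T
from torchvision.models import resnet18
from PIL import Image

PATCH = 512
to_tensor = T.ToTensor()  # Vahadane colour normalization would be applied before this step

def crop_patches(slide_img: Image.Image):
    """Yield non-overlapping 512 x 512 RGB patches from a slide-level image."""
    w, h = slide_img.size
    for y in range(0, h - PATCH + 1, PATCH):
        for x in range(0, w - PATCH + 1, PATCH):
            yield slide_img.crop((x, y, x + PATCH, y + PATCH))

def load_cell_filter(weights_path: str) -> torch.nn.Module:
    """Hypothetical binary classifier: background (class 0) vs. cell-enriched (class 1)."""
    model = resnet18(weights=None)
    model.fc = torch.nn.Linear(model.fc.in_features, 2)
    model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    return model.eval()

@torch.no_grad()
def select_cell_enriched(slide_img, filter_model, threshold=0.5):
    """Keep patches whose predicted probability of being cell-enriched exceeds the threshold."""
    kept = []
    for patch in crop_patches(slide_img):
        prob = torch.softmax(filter_model(to_tensor(patch).unsqueeze(0)), dim=1)[0, 1]
        if prob.item() >= threshold:
            kept.append(patch)
    return kept
```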

We also used gradient-weighted class activation mapping (Grad-CAM) for visual interpretability of the algorithm, generating saliency maps that highlight the most discriminative areas of the input images contributing to the model's decision-making. The CAMs generated by our model consistently focused on pathological regions exhibiting abnormalities, such as areas with visible cell deformation or dense clusters, which are key indicators of the presence of disease (Figure 1B). All experiments were performed on one NVIDIA RTX 3090 GPU (24 GB), with implementation details provided in the Supporting Information, Methods.
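For readers unfamiliar with Grad-CAM, the generic mechanism can be sketched with forward and backward hooks in PyTorch. This is an illustration of the standard technique rather than the authors' implementation; the `model.layer4` usage assumes a ResNet-style backbone.

```python
# Generic Grad-CAM sketch: weight the feature maps of a convolutional layer by the
# gradients of the target class score, sum over channels, rectify and upsample.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, layer):
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        logits = model(image)                                  # image: (1, 3, H, W)
        model.zero_grad()
        logits[0, target_class].backward()
        weights = grads["a"].mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
        return cam[0, 0]                                       # saliency map at input resolution
    finally:
        h1.remove()
        h2.remove()

# Usage (assumed ResNet-style backbone): grad_cam(model, patch_tensor, predicted_class, model.layer4)
```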

SimCLR, a state-of-the-art self-supervised contrastive learning algorithm, was used for WSI representation learning in the training cohort16 by exploiting the inherent similarity between patches without requiring labelled data. A classic multiple instance learning (MIL) approach, specifically an instance-based MIL algorithm combined with a max-pooling strategy, was employed to form the whole-image (bag) embedding according to attention mechanisms, built from the embeddings of the representative patches that best match the distinguishing characteristics of each subtype. Details on SimCLR and MIL are provided in the Supporting Information, Methods. A final fully connected layer with SoftMax16 activation produced the normalized class probabilities for the four MPN subtypes and the non-MPN class, based on the average of the bag embedding and the embedding of the most representative patch; this output, an n × 5 matrix, was taken as the DL model score (DLMS) (Table S2). The focal loss17 was used as the objective function to mitigate class imbalance and achieve better generalization by focusing on hard examples and preventing the network from being dominated by easy negatives during training.
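The two building blocks named here (focal loss and instance-based MIL with max pooling) can be illustrated with a short, generic sketch. The focal loss below follows the standard formulation of Lin et al.; the `gamma` and `alpha` values, the 512-dimensional embedding size and the simple linear instance classifier are assumptions for illustration and are not the authors' exact architecture or hyperparameters.

```python
# Illustrative sketch: multi-class focal loss and an instance-based MIL head that
# scores each patch embedding and max-pools over the bag (slide) to slide-level logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multi-class focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""
    def __init__(self, gamma: float = 2.0, alpha: float = 1.0):
        super().__init__()
        self.gamma, self.alpha = gamma, alpha

    def forward(self, logits, targets):
        log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()
        return (-self.alpha * (1 - pt) ** self.gamma * log_pt).mean()

class MaxPoolMIL(nn.Module):
    """Instance-based MIL: per-patch class scores, max-pooled to a bag-level prediction."""
    def __init__(self, embed_dim: int = 512, n_classes: int = 5):
        super().__init__()
        self.instance_classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, bag):                                    # bag: (n_patches, embed_dim)
        instance_logits = self.instance_classifier(bag)        # (n_patches, n_classes)
        bag_logits, _ = instance_logits.max(dim=0)             # max over instances
        return bag_logits.unsqueeze(0)                         # (1, n_classes)

# Usage sketch: patch embeddings from the SimCLR-pretrained encoder feed the MIL head,
# and the slide-level logits are trained with FocalLoss against the patient-level label.
```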

Processing of clinical parameters and generation of the clinical model score

For each individual, a risk score representing the log-odds of MPN (or of the given subtype) was calculated as the CMSsubtype for each subtype. Parameters with missing rates below 30%, including age, gender, white blood cell count, haemoglobin (Hb), red blood cell count (RBC), haematocrit (HCT) and platelet (PLT) count, were included in clinical model development. Missing values were handled by multiple imputation using the R package mice, and the corresponding coefficients were pooled by Rubin's rules to account for imputation variance (Figure 2; Supporting Information, Methods).18
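Rubin's rules for pooling estimates across imputed datasets follow a standard form; the short sketch below (written in Python, whereas the study used the R package mice) illustrates the pooling of a coefficient and its variance across m imputations and is included only to make the procedure concrete. The example numbers are arbitrary.

```python
# Rubin's rules sketch: pool a regression coefficient estimated on m imputed datasets.
# Pooled estimate = mean of per-imputation estimates; total variance combines the
# within-imputation and between-imputation components.
import numpy as np

def pool_rubin(estimates, variances):
    """estimates, variances: per-imputation coefficient estimates and their squared SEs."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()              # pooled point estimate
    w_bar = variances.mean()              # within-imputation variance
    b = estimates.var(ddof=1)             # between-imputation variance
    t = w_bar + (1 + 1 / m) * b           # total variance
    return q_bar, np.sqrt(t)              # pooled estimate and pooled standard error

# Example (arbitrary values):
# pool_rubin([0.42, 0.45, 0.40, 0.44, 0.43], [0.010, 0.011, 0.009, 0.010, 0.012])
```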

FIGURE 2 Development of the clinical model and fusion model dealing with missing values. A logistic regression model was used to build the clinical model score (CMS) and fusion model score. The values of the parameters for each clinical variable are shown in the tables.

Multivariate logistic regression on all candidate variables was performed in the training cohort. Variables with a p-value below 0.05 (denoted xi, where i = 1, …, m and m is the number of variables retained) were selected as predictors and linearly combined, with the corresponding coefficients (bi) as weights, to form the clinical model score for MPN diagnosis (CMSMPN) (Table S2). Similarly, for the MPN subtype diagnosis models, logistic regression of the true label of each MPN subtype versus the others was performed and the subtype-specific CMS was constructed.

Generation of fusion model score

For fusion model score (FMS) generation, bivariate logistic regression was used to combine the corresponding CMS and the normalized class probability from the DL model, using the regression coefficients as weights (Table S2; Figure 2).
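The fusion step can be sketched as a two-predictor logistic regression whose linear predictor serves as the fusion score; the same logistic-regression machinery underlies the CMS construction above. The sketch uses scikit-learn in Python for illustration (the study fitted its models in R), and the function names are hypothetical.

```python
# Illustrative sketch of the fusion step for one target class (e.g. ET vs. others):
# combine the clinical model score (CMS) and the DL class probability in a
# bivariate logistic regression; the fitted linear predictor acts as the fusion score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_fusion(cms, dl_prob, y):
    """cms, dl_prob: 1-D arrays of scores in the training cohort; y: 0/1 true labels."""
    X = np.column_stack([cms, dl_prob])
    model = LogisticRegression()
    model.fit(X, y)
    return model

def fusion_score(model, cms, dl_prob):
    """Return the log-odds (linear predictor) used as the fusion model score."""
    X = np.column_stack([cms, dl_prob])
    return model.decision_function(X)
```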

Reader study

To compare the accuracy of the models with that of haematopathologists, six independent haematopathologists were involved in the reader study, including three juniors (less than 5 years of experience) and three seniors (more than 10 years of experience). One hundred patients, including 16 controls, 20 with PV, 21 with ET, 12 with prePMF and 31 with overt PMF, were randomly selected from the combined validation cohort for the reader study. We employed a random selection process using Python's internal pseudo-random number generator, which implements the Mersenne Twister algorithm. To enhance the reproducibility of our experiment, we fixed the random seed at 10 so that the sequence of pseudo-random numbers remains consistent across executions, allowing the same samples to be selected in repeated experiments. All haematopathologists reviewed the data of the 100 patients independently, with the true diagnostic labels masked. The research aims, WSIs and the clinical information used in model development were provided.
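The sampling step described here reduces to a few lines with Python's built-in random module (a Mersenne Twister generator). The sketch below is illustrative; the placeholder case IDs stand in for the combined validation cohort, and only the seed (10) and sample size (100) come from the text.

```python
# Minimal sketch of the reproducible reader-study sampling: a fixed seed makes the
# Mersenne Twister generator return the same 100 cases on every run.
import random

candidate_ids = [f"case_{i:04d}" for i in range(1, 346)]   # placeholder IDs for the combined validation cohort
random.seed(10)                                            # fixed seed for reproducibility
reader_study_sample = random.sample(candidate_ids, 100)    # 100 cases drawn without replacement
```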

Model evaluation and statistical analysis

Continuous variables were summarized as median (interquartile range) and compared among groups using ANOVA. Categorical variables were summarized as frequency and proportion [n (%)] and compared using the χ2 test. The area under the curve (AUC) and 95% confidence interval (CI) from receiver operating characteristic (ROC) analysis were estimated with the R package pROC to evaluate model accuracy. Sensitivity, specificity and accuracy were also estimated to assist in model evaluation. For patients with more than one predicted diagnosis from a proposed model, the output was judged to be accurate if the true diagnosis was included. Comparisons between the haematopathologists and the models were based on the multireader multicase analysis of variance (MRMCaov) method,19 implemented in the R package MRMCaov. Optimal cut-off points for the CMS, DLMS and FMS of the MPN and subtype diagnosis models were estimated by the Youden index in the internal testing cohort. All analyses were performed using R Statistical Software version 4.2.1 and Python version 3.9.1.
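The Youden-index cut-off selection mentioned here maximizes J = sensitivity + specificity − 1 over the ROC thresholds. A small sketch using scikit-learn's roc_curve (the study used R's pROC) shows the computation; the function name is hypothetical.

```python
# Sketch of Youden-index cut-off selection on an internal testing set:
# J = sensitivity + specificity - 1 = TPR - FPR, maximized over candidate thresholds.
import numpy as np
from sklearn.metrics import roc_curve

def youden_cutoff(y_true, scores):
    """Return the score threshold that maximizes Youden's J."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    j = tpr - fpr
    return thresholds[int(np.argmax(j))]
```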

RESULTS

Cohort setting and population characteristics

A total of 929 patients with all of the subtypes from five hospitals were divided into the training cohort (N = 482), the internal testing cohort (N = 224) and the internal validation cohort (validation cohort 1) (N = 223). Then, 122 patients were collected from two additional hospitals and divided into two external validation cohorts, namely, validation cohort 2 (N = 82) and validation cohort 3 (N = 40) (Figure S1). The distributions of the diagnoses in each cohort are shown in Table S3. Demographic and available clinical characteristics of non-MPNs and MPN subtypes in each cohort are presented in Tables S4–S8.

Development of the diagnosis model in the training and testing cohorts

The DL and clinical models were first developed based on HE-stained WSIs of BM biopsy specimens and clinical parameters, respectively. For DL model development, 482 WSIs from the training cohort were used for patch feature extraction and training of the classification algorithm, and the internal testing cohort was used to evaluate model performance and select the best model. Training and testing loss curves and the corresponding mean AUCs of the DL model in the internal testing cohort are shown in Figure S2. For clinical model construction, multivariate logistic regression was performed in the training cohort, and age, Hb, RBC and PLT were finally included (Figure 2). The fusion model was constructed by combining the DL and clinical models with subtype-specific weights using logistic regression (Figure 2).

Validation of proposed diagnosis models

We next validated our diagnosis models in validation cohorts 1, 2 and 3 (together referred to as the combined validation cohort). The performances in the combined validation cohort are shown in Table S9, and the results in each individual cohort, which were largely consistent, are detailed in Tables S10–S12.

Validation of MPN versus non-MPN diagnosis

For MPN versus non-MPN diagnosis, the fusion model achieved better performance than both the clinical model and the DL model. The AUC and accuracy of the fusion model were 0.931 (95% CI: 0.891–0.971) and 0.933 (95% CI: 0.900–0.956), respectively, which were significantly higher than those of the clinical model (AUC: 0.848, 95% CI: 0.804–0.892, p = 0.0014; accuracy: 0.783, 95% CI: 0.734–0.825, p < 0.0001) and the DL model (AUC: 0.891, 95% CI: 0.836–0.946, p = 0.0047; accuracy: 0.882, 95% CI: 0.841–0.913, p = 0.0004) in the combined validation cohort (Figure S3A,B; Table S9). The number of MPN cases misclassified as non-MPN decreased from 55 (19.9%) with the clinical model to 10 (3.6%) with the fusion model; most of these cases were overt PMF (Figure S3C,D).

Validation of MPN subtypes diagnosis

We next validated the three models in the differentiation of MPN subtypes in the combined validation cohort.

First, for PV diagnosis, the fusion model achieved better performance than the DL model and was comparable to the clinical model. The AUC and accuracy of the fusion model (AUC: 0.975, 95% CI: 0.960–0.990; accuracy: 0.911, 95% CI: 0.874–0.937) were higher than those of the DL model (AUC: 0.864, 95% CI: 0.820–0.907, p < 0.0001; accuracy: 0.776, 95% CI: 0.727–0.819, p < 0.0001) (Figure 3A,B; Table S9). Fifty-five (23.4%) cases were misclassified into the PV group by the DL model (Figure 4A,B), and this number decreased to 19 (8.1%) with the fusion model. Notably, the clinical model also showed good performance in PV identification, with an AUC of 0.975 (95% CI: 0.960–0.991) and accuracy of 0.942 (95% CI: 0.911–0.963), which were comparable with the fusion model (AUC: p = 0.9700; accuracy: p = 0.1300) (Figure 3A,B); only five (2.1%) non-PV cases were misclassified into the PV group.

FIGURE 3 Performances of the clinical, DL and fusion models in MPN subtype identification in the combined validation cohort. The ROC curves, sensitivities, specificities and accuracies of the three models for PV (A, B), ET (C, D), prePMF (E, F) and overt PMF (G, H) diagnosis in the combined validation cohort. DL, deep learning; ET, essential thrombocytosis; MPN, myeloproliferative neoplasm; ns, no significance; prePMF, prefibrotic primary myelofibrosis; PV, polycythaemia vera.

FIGURE 4 Confusion matrices and misclassified cases in the clinical, DL and fusion models in myeloproliferative neoplasm subtype identification. Confusion matrices show diagnostic errors for PV (A), ET (C), prePMF (E) and overt PMF (G) for the three models. Number of cases misclassified into the PV (B), ET (D), prePMF (F) and overt PMF (H) groups from each subtype in the three models. DL, deep learning; ET, essential thrombocytosis; ns, no significance; prePMF, prefibrotic primary myelofibrosis; PV, polycythaemia vera.

For ET identification, the fusion model showed better performance than both the clinical and DL models. The AUC of the fusion model (0.887, 95% CI: 0.850–0.925) was significantly improved compared with that of the DL model (0.837, 95% CI: 0.790–0.884, p = 0.0001) and the clinical model (0.826, 95% CI: 0.778–0.874, p = 0.0130) (Figure 3C,D; Table S9). The fusion model (0.840, 95% CI: 0.796–0.877) showed better accuracy than the clinical model (0.732, 95% CI: 0.680–0.778, p = 0.0009), with higher specificity (fusion model: 0.913, 95% CI: 0.870–0.942; clinical model: 0.708, 95% CI: 0.648–0.762, p = 0.012) and lower sensitivity (fusion model: 0.603, 95% CI: 0.488–0.707; clinical model: 0.808, 95% CI: 0.703–0.882, p < 0.0001) (Figure 3C; Table S9). Seventy (29.1%) cases were misclassified into the ET group by the clinical model, mostly from PV (N = 31) and prePMF (N = 26); these numbers decreased to 11 for PV and 5 for prePMF with the fusion model (Figure 4C,D).

Similarly, for prePMF identification, the fusion model performed better than the clinical model and did not differ from the DL model. The fusion model showed an AUC of 0.899 (95% CI: 0.851–0.947) and accuracy of 0.863 (95% CI: 0.820–0.896), which were higher than those of the clinical model (AUC: 0.723, 95% CI: 0.639–0.806, p = 0.00011; accuracy: 0.581, 95% CI: 0.526–0.635, p < 0.0001). Specificity improved significantly from 0.541 (95% CI: 0.481–0.599) with the clinical model to 0.889 (95% CI: 0.846–0.921, p < 0.0001) with the fusion model (Figure 3E,F; Table S9), reducing misdiagnoses, especially those from ET. Consistently, 124 (45.9%) cases were misclassified into the prePMF group by the clinical model, among which 70 were from ET (Figure 4E,F); this number decreased to 4 with the fusion model (Figure 4F).

For overt PMF diagnosis, the performance of the fusion model was better than that of the clinical model and comparable to that of the DL model. The fusion model achieved an AUC of 0.980 (95% CI: 0.961–0.999) and accuracy of 0.962 (95% CI: 0.934–0.978), comparable to the DL model (AUC: 0.982, 95% CI: 0.965–0.999, p = 0.4600; accuracy: 0.946, 95% CI: 0.915–0.966, p = 0.1800) and outperforming the clinical model (AUC: 0.856, 95% CI: 0.809–0.904, p < 0.0001; accuracy: 0.824, 95% CI: 0.778–0.862, p < 0.0001) (Figure 3G,H; Table S9). Notably, only three prePMF patients were misclassified into the overt PMF group by the fusion model (Figure 4G,H), indicating that our model can effectively distinguish prePMF from overt PMF.

Comparison study of haematopathologists and the proposed models

To compare the accuracies of the proposed models with those of haematopathologists, a reader study was performed with six independent haematopathologists of different experience levels. Model output on a regular workstation was much faster than manual assessment, taking less than 1 s. The performances of the proposed models and the haematopathologists are summarized in Table 2. The diagnoses of each sample and the confusion matrices for the haematopathologists and the proposed models are listed in Tables S13 and S14.

TABLE 2. Diagnosis performances of the three models and haematopathologists in 100 randomly selected samples.
Diagnostic subtype Diagnostic method AUC (95% CI) Sensitivity (95% CI) Specificity (95% CI)
PV Senior 0.908 (0.899, 0.917) 0.867 (0.795, 0.938) 0.950 (0.888, 1.000)
PV Junior 0.850 (0.809, 0.891) 0.733 (0.590, 0.877) 0.967 (0.918, 1.000)
PV Clinical 0.904 (0.859, 0.949) 0.850 (0.850, 0.850) 0.958 (0.891, 1.000)
PV DL 0.781 (0.754, 0.808) 0.850 (0.850, 0.850) 0.713 (0.612, 0.813)
PV Fusion 0.865 (0.748, 0.981) 0.833 (0.690, 0.977) 0.896 (0.819, 0.973)
ET Senior 0.860 (0.795, 0.926) 0.746 (0.565, 0.927) 0.975 (0.920, 1.000)
ET Junior 0.707 (0.503, 0.912) 0.444 (0.000, 0.892) 0.970 (0.933, 1.000)
ET Clinical 0.796 (0.787, 0.805) 0.905 (0.905, 0.905) 0.688 (0.590, 0.786)
ET DL 0.724 (0.712, 0.735) 0.603 (0.535, 0.671) 0.844 (0.756, 0.932)
ET Fusion 0.801 (0.790, 0.813) 0.746 (0.678, 0.814) 0.857 (0.771, 0.942)
PrePMF Senior 0.797 (0.593, 1.000) 0.722 (0.291, 1.000) 0.871 (0.732, 1.000)
PrePMF Junior 0.693 (0.544, 0.841) 0.472 (0.041, 0.903) 0.913 (0.798, 1.000)
PrePMF Clinical 0.715 (0.630, 0.799) 0.861 (0.742, 0.981) 0.568 (0.464, 0.672)
PrePMF DL 0.834 (0.766, 0.902) 0.778 (0.658, 0.897) 0.890 (0.831, 0.950)
PrePMF Fusion 0.828 (0.760, 0.896) 0.778 (0.658, 0.897) 0.879 (0.816, 0.941)
Overt PMF Senior 0.850 (0.683, 1.000) 0.710 (0.360, 1.000) 0.990 (0.949, 1.000)
Overt PMF Junior 0.834 (0.812, 0.857) 0.731 (0.609, 0.854) 0.930 (0.852, 1.000)
Overt PMF Clinical 0.739 (0.663, 0.814) 0.656 (0.610, 0.702) 0.821 (0.722, 0.921)
Overt PMF DL 0.907 (0.842, 0.972) 0.882 (0.835, 0.928) 0.932 (0.867, 0.998)
Overt PMF Fusion 0.912 (0.826, 0.997) 0.882 (0.835, 0.928) 0.942 (0.847, 1.000)
Abbreviations: AUC, area under curve; CI, confidence interval; DL, deep learning; ET, essential thrombocytosis; prePMF, prefibrotic primary myelofibrosis; PV, polycythaemia vera.

Overall, the average performance of the senior haematopathologists was better than that of the junior haematopathologists, especially for PV, ET and prePMF. The AUCs of the seniors were 0.908 (95% CI: 0.899–0.917) for PV, 0.860 (95% CI: 0.795–0.926) for ET and 0.797 (95% CI: 0.593–1.000) for prePMF, while those of the juniors were 0.850 (95% CI: 0.809–0.891, p = 0.0145) for PV, 0.707 (95% CI: 0.503–0.912, p = 0.0007) for ET and 0.693 (95% CI: 0.544–0.841, p = 0.0311) for prePMF (Table 2; Figure 5A–D).

FIGURE 5 Comparison of the proposed diagnosis models with haematopathologists of different experience levels. The AUCs, sensitivities and specificities of the proposed models and of haematopathologists with different experience levels in PV (A), ET (B), prePMF (C) and overt PMF (D) diagnosis. AUC, area under curve; ET, essential thrombocytosis; ns, no significance; prePMF, prefibrotic primary myelofibrosis; PV, polycythaemia vera.

We then compared the performances of the proposed models with those of the haematopathologists (Figure 5; Table S15). In MPN subtype differentiation, the clinical model achieved the highest AUC for PV (0.904, 95% CI: 0.859–0.949), which was equivalent to the seniors (p = 0.8373) and higher than the juniors (p = 0.0208), and its performance tended to be better than that of the fusion model (p = 0.073) (Figure 5A). For ET identification, the fusion model achieved the best performance, with an AUC of 0.801 (95% CI: 0.790–0.813); it was considerably better than the juniors (p = 0.0141) and comparable to the seniors (p = 0.0914) (Figure 5B). Similarly, for prePMF identification, the AUC of the fusion model (0.828, 95% CI: 0.760–0.896) was comparable to the seniors (p = 0.4651) and higher than the juniors (p = 0.0085) (Figure 5C). For overt PMF diagnosis, the fusion model (0.912, 95% CI: 0.826–0.997) performed better than the juniors (0.834, 95% CI: 0.812–0.857, p = 0.0330) and was comparable to, and tended to be better than, the seniors (0.850, 95% CI: 0.683–1.000, p = 0.0773) (Figure 5D).

DISCUSSION

In this AI-assisted modelling study, we developed and validated machine learning models with significant diagnostic value for the pathological identification of MPNs and their subtypes. The proposed model demonstrated a superior ability to distinguish MPNs from other haematological disorders and to differentiate MPN subtypes. By examining the CAMs, we observed that our DL model successfully localized disease-affected regions even without explicit pixel-level annotations. This shows that the model learned to focus on features genuinely related to the presence of pathology and demonstrates the strong feature extraction capability of a DL algorithm trained with WSI-level labels only, establishing its comparability to models supervised with cell-level annotations.

The number of cohorts and the diversity of patients from different hospitals might lead to discrepancies between validation cohorts. Multi-class diagnosis involves a trade-off between sensitivity and specificity and a compromise among the classes. The most common way to obtain a multi-class prediction from a DL network is to take the class with the largest output probability from the SoftMax layer, which has the advantage of producing a single, mutually exclusive output. However, when the differences between class probabilities are very small,20 tiny relative differences are magnified. This can produce incorrect predictions that are not meaningfully different from the alternatives and raises doubts about whether such output is consistent with clinical diagnostic criteria. Instead of taking the maximum probability, we therefore combined the DL predictions with clinical information in a statistical model for subtype pathological diagnosis.

Abnormal routine blood counts are a common reason for presentation in patients with suspected MPN, and BM biopsy and genetic testing are performed together for diagnosis. Notably, the classification of MPNs is not biology driven. Pathophysiological mechanisms play a crucial role in differentiating MPN subtypes, and characteristic changes such as megakaryocyte morphology are key to diagnosis.21 However, the lack of reproducibility of morphological assessment, especially of HE staining, leads to discordance in MPN subtype identification between haematopathologists, making accurate pathological diagnosis challenging.5, 6 AI and machine learning algorithms provide effective tools for automated feature extraction and have been applied to the diagnosis and prognosis of MPNs.8, 13, 14, 22 Recent studies have developed machine learning approaches to automatically characterize megakaryocyte features and BM fibrosis for MPN diagnosis.13, 14 However, these models rely on the evaluation of a single cell type or of fibrosis grades, which is time-consuming and requires expensive manual annotations. In our study, we developed a DL algorithm based on annotation-free HE-stained WSIs and combined it with clinical information to improve diagnostic accuracy. By taking clinical characteristics into consideration, our models reflect the real clinical scenario of MPN pathological diagnosis, and multiple external validations make the models more convincing.

Careful morphological assessment is particularly important for distinguishing MPNs from reactive alterations or other haematological disorders. Our fusion model showed high accuracy in distinguishing MPNs from non-MPNs based on clinical parameters and morphological features. Notably, incorporating driver mutation profiles could further improve identification accuracy, whereas for triple-negative MPNs, pathological evaluation may be more helpful.

Erythrocytosis is the hallmark of PV, and patients who meet the Hb or HCT threshold, especially Hb >185 g/L (male) or >165 g/L (female), or HCT >52% (male) or >48% (female), accompanied by the JAK2V617F mutation, are likely to be diagnosed with PV even in the absence of morphological assessment.23 Although combining morphological features did not further improve the AUC of the fusion model compared with the clinical model, the fusion model still provides an effective tool for PV identification. Importantly, the JAK2V617F mutation should be confirmed before a final diagnosis is made.

PrePMF with thrombocytosis shares similar clinical manifestations with ET, and the two often mimic each other at presentation, making them difficult for haematopathologists and haematologists to differentiate. Nearly 10%–20% of cases previously diagnosed as ET are reclassified as prePMF on re-evaluation of the BM specimens.24 The frequencies of driver mutations and MF grades are similar in ET and prePMF, and their differentiation mainly relies on assessment of HE staining.25 In our study, the average performance of senior haematopathologists in ET and prePMF identification was better than that of the juniors, indicating that differences in experience may influence the identification of pathological abnormalities. Our fusion model achieved AUCs of more than 0.850 for ET and prePMF identification, performing better than junior haematopathologists and comparably to seniors, which is useful when access to haematopathology expertise is restricted, especially in low-resource healthcare systems.

Overt PMF is characterized by megakaryocyte alterations and BM fibrosis, and it represents the subsequent stage in the continuum with prePMF.4 Our fusion model did not include reticulin staining, which is a limitation. Nevertheless, its AUC for overt PMF identification exceeded 0.900, and the model was able to effectively distinguish prePMF from overt PMF based on HE-stained WSIs alone, indicating high sensitivity in feature identification and extraction. It provides a tool for the preliminary diagnosis of overt PMF when reticulin staining is not accessible or has not yet been performed at the first visit.

There are some limitations to our study. First, our model provides an effective tool for assessment of HE-stained BM specimens, but additional important pathological features, such as reticulin staining, were not included; future models would benefit from incorporating more information to improve predictive accuracy. Second, samples from diseases such as myelodysplastic neoplasms (MDS) with fibrosis and MDS/MPN were not included in the non-MPN cohort. Differentiating MPNs from these diseases is more complicated, and specific training would be needed for their identification. Third, despite the favourable performance of the fusion model in MPN differentiation, further validation should be conducted in prospective, real-world cohorts including patients from different regions and from medical institutions of different levels. Fourth, the subtype identification model may overlook the uncertainty in MPN diagnosis; in future research, we will optimize the subtype identification model using Bayesian methods to account for the uncertainty of the overall MPN diagnosis. Fifth, for the proposed models, more than one diagnosis may be predicted for some cases, and the output was judged to be accurate if the true diagnosis was among them. Despite this, we believe that narrowing down the range of possible subtypes is still clinically meaningful. We will improve our models by optimizing the algorithm, expanding the sample size and incorporating additional information such as MF grades in the ongoing prospective study.

In conclusion, using the largest sample size in the field and combining morphological features with clinical characteristics, the proposed AI models achieved satisfactory performance in the pathological diagnosis and subtype differentiation of MPNs. The models outperformed junior haematopathologists and performed comparably to senior haematopathologists. Prospective validation and tool development will be crucial in future work to enhance the integration of DL into clinical diagnosis.

AUTHOR CONTRIBUTIONS

Y.W., R.W., W.S. and J.L. conceived and designed the study. Y.Z. and G.Y. developed and validated the deep learning algorithm. R.W., Z.S., L.W., J.H., M.D., M.X., J.W., S.C., Q.W., J.H., X.H., L.F. and J.M. collected the clinical data and WSIs. Y.Z. and J.H. performed WSI preprocessing. Z.S., Y.Z., W.S., R.W., J.L. and Y.W. performed the statistical analysis and clinical interpretation. F.C. and G.Y. reviewed the statistical analysis and deep learning procedures. Z.S. and Y.Z. prepared the initial draft. All authors critically reviewed the manuscript and read and approved the final report. The corresponding authors had full access to the data and had final responsibility for the decision to submit for publication.

ACKNOWLEDGEMENTS

This study was supported by the National Natural Science Foundation of China (82200151 to Z.S. and 82473728 to Y.W.) and Jiangsu Province Capability Improvement Project through Science, Technology and Education (ZDXK202209 to J.L.). We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

FUNDING INFORMATION

This study was supported by the National Natural Science Foundation of China (82200151 to Z.S. and 82473728 to Y.W.) and the Jiangsu Province Capability Improvement Project through Science, Technology and Education (ZDXK202209 to J.L.).

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

ETHICS STATEMENT

This study was approved by the ethics committee of Jiangsu Province Hospital (2021-SR-019).

PATIENT CONSENT STATEMENT

Informed consent was waived considering the retrospective nature of the study.

DATA AVAILABILITY STATEMENT

The algorithm of the models is available at https://github.com/WeiLab4Research/INDEED. The clinical data can be obtained upon reasonable request by emailing the corresponding author, in accordance with the ethical requirements of the institutions.
