Comparison of three machine learning algorithms for classification of B-cell neoplasms using clinical flow cytometry data
Abstract
Multiparameter flow cytometry data is visually inspected by expert personnel as part of standard clinical disease diagnosis practice. This is a demanding and costly process, and recent research has demonstrated that it is possible to utilize artificial intelligence (AI) algorithms to assist in the interpretive process. Here we report our examination of three previously published machine learning methods for classification of flow cytometry data and apply these to a B-cell neoplasm dataset to obtain predicted disease subtypes. Each of the examined methods classifies samples according to specific disease categories using ungated flow cytometry data. We compare and contrast the three algorithms with respect to their architectures, and we report the multiclass classification accuracies and relative required computation times. Despite different architectures, two of the methods, Flowcat and EnsembleCNN, had similarly good accuracies with relatively fast computation times. We note a speed advantage for EnsembleCNN, particularly in the case of addition of training data and retraining of the classifier.
1 INTRODUCTION
In pathology, flow cytometry has found broad applications in the diagnosis, classification, and monitoring of various diseases. It is particularly valuable in hematopathology, where it plays a crucial role in the evaluation of blood disorders such as leukemia and lymphoma. The data generated by flow cytometry is typically presented in the form of two-dimensional scatter plots or histograms of marker properties such as forward scatter, side scatter, and biomarker fluorescence intensity. These plots allow for the visualization and interpretation of heterogeneous cellular populations. Flow cytometry “gating” is a crucial step in the data analysis process that helps to identify and isolate specific populations of cells within a sample. It involves the selection of specific regions or subsets within two-dimensional scatter plots to focus the analysis on relevant cell populations and exclude unwanted background noise. The process may be repeated for subsets of cells identified in this manner, resulting in a consecutive series of such gates.
Given this manner of manual classification of flow cytometry data, it stands to reason that a machine learning technique might be designed to carry out this process automatically, and, indeed, there is a substantial published literature on the application of various machine learning approaches to classification of flow cytometry data (Aghaeepour et al., 2013; 2011; Ge & Sealfon, 2012). Note that we use “machine learning” as an umbrella term for learning methods such that it includes both deep learning methods (i.e., deep neural network-based methods) as well as other approaches that do not fall under the term “deep learning” (e.g., random forests, support vector machines). There are many examples of prior applications of machine learning and related computational methods to flow cytometry. One application is the unsupervised clustering of flow cytometry events to identify subpopulations of cells, and many clustering algorithms have been developed (Aghaeepour et al., 2013; Liu et al., 2019; Weber & Robinson, 2016). This is expected to be a less biased approach to identifying cell subpopulations, since the algorithms can identify subpopulations beyond those that the human operator is looking for via traditional gating and are not restricted to two-dimensional polygon gates. In another application, traditional gating can be enhanced by algorithm-assisted gating, which allows detailed, visual interpretation of the algorithm's approach to dividing data into subpopulations and is more similar to traditional practice; some of these algorithms include FlowDensity, FlowType, and FlowLearn (Aghaeepour et al., 2012; Lux et al., 2018; Malek et al., 2015). In some applications, the immunophenotypes of various cell populations (e.g., neoplastic cells of chronic lymphocytic leukemia) are well defined and distinct from the background population of cells. In these cases, classification of single cells can be performed, which is particularly useful, for example, in minimal residual disease analysis, wherein the number of neoplastic cells is very small (Salama et al., 2022). In other situations, a diagnosis might be more accurately rendered by considering the context of the cell population as a whole; sample-level classification methods of this kind are the focus of this paper. Most of these approaches require some sort of dimensionality reduction step to reduce a list of individual cells to an overall population representation. These dimensionality reduction approaches can also aid in visualization of flow cytometry data, helping to identify clustering of cells in a way that cannot easily be achieved using the traditional series of two-dimensional plots. Some of these dimensionality reduction approaches include principal component analysis (PCA), t-SNE, UMAP, self-organizing maps (SOMs), SPADE, Wanderlust, Citrus, and PhenoGraph (Brinkman, 2020). We note that in addition to clustering, classification, and dimensionality reduction algorithms, much work has also been invested in accompanying quality control and scaling/normalization algorithms, which help to filter out low-quality events (e.g., “doublets”, data collected during interruptions in fluid flow, and laser excitation interruptions) and scale data to minimize batch-to-batch differences.
Some available quality control packages include flowAI (Monaco et al., 2016), flowClean (Fletez-Brant et al., 2016), and PeacoQC (Emmaneel et al., 2022), while various scaling and batch normalization packages include gaussNorm (Hahne et al., 2010), fdaNorm (Finak et al., 2014), and CytoNorm (Van Gassen et al., 2020).
While there are numerous advantages of an automated process for classifying flow cytometry data, two major considerations are of great interest to the clinical laboratory: first, it could significantly ease the workload burden on the part of pathology personnel, and, second, it could potentially reveal possibilities for more accurate classification of cases for which a straightforward gating structure may not be immediately apparent to the human eye. We will return to these ideas and others in our Discussion.
Here we present a comparison of three recently published machine learning methods for classification of clinical flow cytometry data. Each of the examined methods classifies samples according to specific disease categories using ungated flow cytometry data. The first classifier employed here is “Flowcat”, published by Zhao et al. (2020). The algorithm was presented in the context of classification of subtypes of mature B-cell neoplasms along with normal cases (we perform our analyses on the same classification problem; see Methods – Section 2). Flow cytometry routinely plays a significant role in the diagnosis of many types of B-cell neoplasms (Seegmiller et al., 2019). This approach uses self-organizing maps as a dimensionality reduction tool to preprocess the data before the application of a convolutional neural network (Li et al., 2022) to obtain the classification results.
The second classifier we apply is “EnsembleCNN”, first presented by Simonson et al. (2021) for diagnostic binary classification of presence or absence of classic Hodgkin lymphoma, followed by prediction of need for additional antibody panels in real-time (Simonson et al., 2022). While this approach also utilizes convolutional neural networks, rather than train a single CNN as with Flowcat, EnsembleCNN trains a set of CNNs, one for each possible pairwise marker combination that can be generated from the antibody panels available (see Methods – Section 2). Dimensionality reduction is achieved by generating a series of two-dimensional histograms, one for each possible pairwise marker combination. The output of these CNNs is then aggregated through a random forest classifier to obtain the final predictive labels.
Finally, Ng and Zuromski (2021) presented an approach for classification of clinical flow cytometry data that uses uniform manifold approximation and projection (UMAP; McInnes et al., 2018) as a dimensionality reduction approach with a random forest (Ho, 1995) to perform the classification task; we abbreviate this as UMAP-RF, and it is the third algorithm we review herein. In Ng and Zuromski (2021), UMAP-RF is applied to the detection and classification of B-cell malignancies from peripheral blood derived flow cytometry data with a 10-color B-cell panel. The 12-parameter data (10 fluorescence parameters along with forward scatter and side scatter) was first reduced to two dimensions through UMAP embeddings and then summarized as two-dimensional histograms, which served as input for random forest classification.
Here, we report a comparison of these classifiers and evaluate them in terms of classification accuracy and relevant performance metrics as well as computational efficiency in terms of computing time (Figure 1). We also discuss the relevant explainability potential of each of these classifiers.

2 METHODS
2.1 Flow cytometry dataset
The clinical flow cytometry dataset analyzed in this work comprises 21,152 total cases from the Munich Leukemia Laboratory (MLL) spanning normal controls and eight different B-cell neoplasm subtypes according to the provided diagnosis. Specimens consisted of either peripheral blood or bone marrow aspirate, and the data were collected using Navios instruments (personal communication). The case dates range from 2016-01-02 to 2018-12-13, and the data have been made publicly available in association with publication of the Flowcat classifier (Zhao et al., 2020). Thus, our analysis of this dataset with respect to the Flowcat classifier will in large part be an independent reproduction of the results presented in Zhao et al. (2020). Note that we are using the data as it has been curated for the Flowcat project, which involves pre-processing of the data as described by Zhao et al. (2020). This includes linear gain with a factor for the SS and FS channels and exponential scaling for the other channels, as well as visual inspection of channel plots to detect machine calibration issues. In the deposited data, apart from the case date, no additional metadata was available that could be used for pre-processing of the cases on our end.
The cases have reported neoplastic cell levels varying from 0% to 97% (mean = 15.82%). Most cases have 3 antibody panels (n = 20,738) while some have only 2 (n = 414), drawn from 7 distinct panels in total. Note that institution-specific and cytometer-specific variables were not available for possible batch-wise adjustments.
Selection of cases that include the top 3 most widely available marker panel subsets provides us with a set of 20,731 cases, with the subtype-specific counts shown in Table 1 and the selected antibody panels in Table S1. The selected files range from 10,028 to 50,000 cells (mean = 47,781.5). The selected markers are as follows. For panel 1: forward scatter (FS), side scatter (SS), FMC7-FITC, CD10-PE, IgM-ECD, CD79b-PC5.5, CD20-PC7, CD23-APC, CD19-APCA750, CD5-PacBlue, CD45-KrOr. For panel 2: FS, SS, Kappa-FITC, Lambda-PE, CD38-ECD, CD25-PC5.5, CD11c-PC7, CD103-APC, CD19-APCA750, CD22-PacBlue, CD45-KrOr. For panel 3: FS, SS, CD8-FITC, CD4-PE, CD3-ECD, CD56-APC, CD19-APCA750, HLA-DR-PacBlue, CD45-KrOr.
| Type | Count | Description |
|---|---|---|
| Normal | 11,373 | Normal |
| CLL | 4,442 | Chronic lymphocytic leukemia |
| MBL | 1,614 | Monoclonal B-cell lymphocytosis |
| MZL | 1,121 | Marginal zone lymphoma |
| LPL | 731 | Lymphoplasmacytic lymphoma |
| PL | 593 | Prolymphocytic leukemia |
| MCL | 313 | Mantle cell lymphoma |
| FL | 250 | Follicular lymphoma |
| HCL | 225 | Hairy cell leukemia |
We omit hairy cell leukemia variant (HCLv), given the insufficient case counts (n = 69) to allow for training, testing, and validation. We use the remaining data for multi-class classification between 9 labels (normal and 8 B-cell neoplasm subtypes). We will use the subtype abbreviations presented in Table 1 to discuss our results from here on. It is important to note that the subtype labels presented here are clinicopathologic diagnoses and not decisions derived solely from analysis of flow data. Zhao et al. (2020) note that the diagnoses were confirmed by complementary tests where necessary based on histology, cytomorphology, fluorescence in situ hybridization, and/or molecular genetics.
2.2 Classifiers
2.2.1 Flowcat
Flowcat utilizes convolutional neural networks in conjunction with self-organizing maps (SOMs; Kohonen, 1990, 2013), an approach designed to convert flow cytometry data into lower dimensional representations. In the context of multiparameter flow cytometry (MFC) data, SOMs have been utilized in previous research to obtain two-dimensional representations, most notably by the FlowSOM method, presented by Van Gassen et al. (2015) and Quintelier et al. (2021). In FlowSOM, the two-dimensional representations are developed predominantly as a visualization tool in addition to obtaining a clustering of cells by cell type. However, in the context of Flowcat, the aim is to obtain a two-dimensional representation of the higher dimensional panel-level cell data to be utilized as input for a convolutional neural network. For example, a case analyzed using three antibody panels (resulting in three LMD, or listmode data, files, which contain all the raw data generated during acquisition and can be converted to the flow cytometry standard, or FCS, format) will be reduced to three m × m weight matrices computed independently of each other, which may then be treated similarly to typical image data.
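As a concrete illustration of this SOM step, the following is a minimal sketch using the third-party minisom package; Flowcat ships its own SOM implementation, so the grid size, training parameters, and use of minisom here are illustrative assumptions only.

```python
# Sketch: reduce one panel's events to an m x m SOM weight matrix.
# Assumptions: minisom as a stand-in for Flowcat's SOM code; m = 32;
# random data in place of a real (pre-processed) LMD file.
import numpy as np
from minisom import MiniSom

m, n_markers = 32, 11                        # grid side, panel 1 markers
events = np.random.rand(50_000, n_markers)   # stand-in for one case/panel

som = MiniSom(m, m, n_markers, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(events, num_iteration=10_000)

weights = som.get_weights()                  # shape (32, 32, 11)
# This m x m x n_markers tensor is the image-like input to the CNN below.
```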
The convolutional neural network portion of Flowcat can be described as follows: each panel (and resulting SOM weight matrix/dimensionality reduction) is processed through a separate pipeline of convolutional filters, with three levels of filters of gradually decreasing sizes applied, followed by global max pooling. The pooled outputs are then concatenated across the panels to form a single case-level input layer to a neural network comprised of two fully connected hidden layers (of 64 and 32 nodes, respectively), completed by an output layer with a set of nodes corresponding to the expected output classes (i.e., classification subtypes).
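A hedged sketch of this case-level network in Keras is shown below; the filter counts and kernel sizes are illustrative assumptions (see Zhao et al., 2020, for the exact hyperparameters), while the per-panel branches, global max pooling, concatenation, and the 64- and 32-node hidden layers follow the description above.

```python
# Sketch of the Flowcat-style multi-panel CNN (hyperparameters assumed).
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D, GlobalMaxPooling2D, Concatenate, Dense

def panel_branch(m, n_markers):
    """One convolutional pipeline per panel's m x m SOM weight matrix."""
    inp = Input(shape=(m, m, n_markers))
    x = Conv2D(32, kernel_size=5, activation="relu")(inp)  # gradually
    x = Conv2D(48, kernel_size=3, activation="relu")(x)    # decreasing
    x = Conv2D(64, kernel_size=2, activation="relu")(x)    # filter sizes
    return inp, GlobalMaxPooling2D()(x)

branches = [panel_branch(32, k) for k in (11, 11, 9)]      # panels 1-3
merged = Concatenate()([pooled for _, pooled in branches])
h = Dense(64, activation="relu")(merged)
h = Dense(32, activation="relu")(h)
out = Dense(9, activation="softmax")(h)                    # 9 classes
model = Model([inp for inp, _ in branches], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```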
2.2.2 EnsembleCNN
EnsembleCNN also employs CNNs to perform classification; however, the overall architecture is significantly different from that of Flowcat. In EnsembleCNN, a separate CNN is trained for each possible pairwise combination of markers, conceptually similar to interpretation of the two-dimensional scatterplots typically observed during human analysis of flow cytometry data (Figure S1). The ensemble of outputs from the CNNs is passed to a random forest classifier for final classification of a given case.
Given that the curated data comprise 3 marker panels of 11, 11, and 9 markers (see Table S1), EnsembleCNN trains 146 CNNs, one for each of the 55 + 55 + 36 pairwise marker combinations within the respective panels. Each marker combination (represented by a CNN) has 9 output nodes, one for each of the possible classification labels, resulting in a total of 1314 outputs being provided as input to the random forest. The random forest itself is trained to produce a categorical 9-class label output. Note that these panels may share certain markers in common, and this does not affect the procedure.
To prepare the data to be provided as input to these CNNs, the following process is applied: for every pair of markers considered, the cells/events from a single file (i.e., the LMD file corresponding to a single case and antibody panel) are binned into a 2D histogram of size b × b, where b is the number of bins along a given axis; the histograms are converted to log scale, with each bin value x transformed as log(x + 1), and standardized to the [0, 1] range. The values of these b × b matrices thus represent the relative density of cells in each region of the scatterplot.
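A minimal sketch of this histogram construction is given below; the bin count, value ranges, and the use of max-scaling for the [0, 1] standardization are illustrative assumptions.

```python
# Sketch: one b x b log-scaled histogram for a single marker pair.
import numpy as np

def pair_histogram(x, y, b=25, value_range=((0.0, 1.0), (0.0, 1.0))):
    """Bin one marker pair's events into a b x b histogram, then apply
    the log(x + 1) transform and scale to [0, 1] (max-scaling assumed)."""
    h, _, _ = np.histogram2d(x, y, bins=b, range=value_range)
    h = np.log(h + 1.0)
    return h / h.max() if h.max() > 0 else h

# Example with random stand-ins for two markers from one LMD file.
rng = np.random.default_rng(0)
cd19, cd5 = rng.random(50_000), rng.random(50_000)
hist = pair_histogram(cd19, cd5, b=25)       # shape (25, 25)
```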
The architecture of each CNN is depicted in Figure S1, which involves multiple convolutional filters with 10% dropout between them and max pooling at two levels. The output obtained from the final convolution layer is flattened into a single vector and provided as input to a single densely connected hidden layer of 64 nodes, followed by the output layer with softmax activation, with these last two layers incorporating L2 regularization to mitigate overfitting.
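The following Keras sketch conveys the general shape of one such per-marker-pair CNN; the filter counts and kernel sizes are illustrative assumptions, while the 10% dropout, two pooling levels, 64-node hidden layer, softmax output, and L2 regularization follow the description above.

```python
# Sketch of one per-marker-pair CNN (filter sizes assumed; cf. Figure S1).
from tensorflow.keras import Sequential, regularizers
from tensorflow.keras.layers import Conv2D, Dropout, MaxPooling2D, Flatten, Dense

b = 25  # bins per axis of the input histogram
model = Sequential([
    Conv2D(16, 3, activation="relu", input_shape=(b, b, 1)),
    Dropout(0.10),
    Conv2D(32, 3, activation="relu"),
    MaxPooling2D(2),                                  # first pooling level
    Dropout(0.10),
    Conv2D(32, 3, activation="relu"),
    MaxPooling2D(2),                                  # second pooling level
    Flatten(),
    Dense(64, activation="relu",
          kernel_regularizer=regularizers.l2(1e-3)),  # L2-regularized
    Dense(9, activation="softmax",
          kernel_regularizer=regularizers.l2(1e-3)),  # one node per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```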
2.2.3 UMAP-RF
In UMAP-RF, a two-dimensional representation of each panel for all cases is obtained through the UMAP algorithm. Given that the training data in most classification scenarios would be hundreds or thousands of cases and thus total many millions of cells, it would be computationally prohibitive in typical laboratory resource settings to use all of the training data to build a UMAP representation. Therefore, a down-sampling approach is suggested by Ng and Zuromski (2021), where a smaller subset of cells (ranging between 8.6 million and 21 million in the results presented) is selected across all subtypes to build a two-dimensional UMAP embedding. Following this, both the training and testing data are embedded in two dimensions via this UMAP projection.
The 2D embeddings obtained in this way are then condensed into two-dimensional histograms of 32 × 32 bins. For example, a file containing 50,000 cells and 11 markers would first be projected to the UMAP space to obtain a matrix of 50,000 × 2 dimensions, and the resulting 2D scatterplot would then be further reduced to a 32 × 32 matrix of densities. This process would be performed independently for each panel of markers.
The resulting 2D histograms are then flattened and concatenated across the panels to obtain a single vector for each case. These vectors are then provided as input to train a random forest classifier, which provides the 9-class output labels. The testing cases can then be processed in the same manner and subjected to the random forest classifier to obtain the desired predictive labels.
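A minimal, single-panel sketch of this pipeline is shown below, using the umap-learn and sklearn libraries as in our implementation; the pool size, toy data, histogram normalization, and random forest settings are illustrative assumptions.

```python
# Sketch of the UMAP-RF pipeline for one panel (toy data throughout).
import numpy as np
import umap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
pool = rng.random((20_000, 11))                 # stand-in for the sampled cells
reducer = umap.UMAP(n_components=2, random_state=0).fit(pool)

def case_vector(events):
    """Project one case's events and flatten its 32 x 32 density map."""
    emb = reducer.transform(events)
    h, _, _ = np.histogram2d(emb[:, 0], emb[:, 1], bins=32)
    return (h / h.sum()).ravel()                # 1024 values per panel

# With three panels, these vectors are concatenated (3 x 1024 = 3072
# features per case) before training; a single panel is shown here.
X = np.stack([case_vector(rng.random((5_000, 11))) for _ in range(10)])
y = rng.integers(0, 9, size=10)                 # toy 9-class labels
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```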
2.3 Explainability tools
In terms of classifier feature importance, the primary measures we utilize here with respect to EnsembleCNN are Shapley Additive Explanation (SHAP) values. Originally introduced in the context of game theory (Shapley, 1988), the Shapley value has been adapted as a tool for measuring the marginal contribution of each feature of a machine learning model to the outputs obtained (Lundberg & Lee, 2017). The approach involves computing a null or “background” set of values for each feature and running various subsets of feature combinations by replacing each feature value for each datapoint tested with the background value. The outputs obtained from the model are then compared with the outputs obtained for the correct feature value, and the differences in outputs are aggregated to estimate the contribution of each feature to the output.
Given the architecture of EnsembleCNN, we can perform explainability operations at the level of the random forest classifier. This allows us to deduce which pairs of features are of high importance for each predicted class. Further, it is possible to apply SHAP based methods to identify from each individual histogram the most impactful bin regions; see Simonson et al. (2021) for an explanation of this approach.
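The following is a minimal, self-contained sketch of this random-forest-level analysis using the python shap package; the toy data stand in for the 1314-column matrix of CNN outputs, so the sizes and labels are illustrative assumptions.

```python
# Sketch: SHAP feature importance at the random forest level (toy data).
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_features = 1314                       # 146 CNNs x 9 output nodes each
X = rng.random((200, n_features))       # stand-in for CNN output vectors
y = rng.integers(0, 9, size=200)        # toy 9-class diagnostic labels
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(rf)      # tree-model-specific SHAP estimator
shap_values = explainer.shap_values(X)  # per-class feature contributions

# Features (marker pair x CNN output node) ranked by mean |SHAP| value,
# analogous to the summary plot in Figure 3.
shap.summary_plot(shap_values, X)
```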
In Zhao et al. (2020), the contribution of various markers to the CNN output is estimated through occlusion analysis, where the output is computed with each marker omitted, and comparison of the differences in the predictions obtained is indicative of the importance of each marker to each subtype. Zhao et al. (2020) also note that saliency maps can be used to map regions of FlowSOM networks to specific diagnoses. By mapping individual flow cytometry events to the FlowSOM nodes, it is possible to plot increased or decreased numbers of cells corresponding to a given region of two-dimensional flow cytometry plots, which can aid in highlighting cells with immunophenotypes that are important for diagnosis.
2.4 Software and programming
For working with Flowcat, the python code base was obtained through the relevant GitHub repository (Zhao, 2020). Due to the large number of dependencies of this codebase, we were unable to work with Flowcat under more recent python versions, and instead the Flowcat analysis was performed using python 3.6, the version in which it was originally deployed. The classifier is implemented using the keras/tensorflow library (Abadi et al., 2016), and we made minimal modifications to the code, only to replace deprecated functions with their newer counterparts. Flowcat was then run through a high-level script, which calls various Flowcat operations, and the resulting prediction files were then analyzed.
For EnsembleCNN, we adapted the python code made available by the senior author, which was deployed using python 3.9. We did not significantly modify the CNN architecture beyond expanding the output layer from 2 nodes (binary classification) to 9 (multiclass classification), with softmax activation. The code utilizes the keras/tensorflow library, and we made use of the random forest classifier made available through sklearn (Pedregosa et al., 2011).
The analysis of UMAP-RF was also performed using python 3.9. We did not use any existing codebase for this approach besides standard python libraries and the UMAP and random forest functionalities made available through the umap-learn and sklearn libraries. For some of the UMAP generations, we utilized the Google Cloud Platform due to significant memory requirements.
For UMAP-RF, we attempted two versions: one that sampled 10 million events from the training cases (for each panel), and one that sampled 22 million events. The total number of events to be sampled was divided by 9 to obtain the number of events to be sampled for each group, and then further divided by the number of cases in each group to determine the number of cells to sample from each individual case. These cells were used to build a UMAP, onto which all training and testing data (including all cells from these cases) were projected. This process was repeated for each of the three antibody panels used in the analysis. Following this, all UMAP-projected data were converted to 32 × 32 standardized histograms, and the flattened versions of these histograms were concatenated across the three panels; this produced a single vector of length 3072, representing one case for input to the RF. With the 22 million events sampling, we encountered issues with memory requirements on our local GPU server and resorted to using Google Cloud Platform functionality to build some of the UMAPs. These UMAP objects were then copied over to the standard computing server where the projection of individual cases and the subsequent random forest training and validation were performed.
Note that all classifiers were run on the same server, with GPU processing made available through tensorflow compiled to enable GPU functionality where possible. The time command (available on Unix platforms) was used to measure the time taken by each method to perform the overall training and testing pipeline. With the exception of using a Google Cloud computing environment for the generation of some large UMAPs, all other analyses were performed on our local GPU server, which includes 376 GB RAM, two Intel(R) Xeon(R) Gold 6226R CPUs @ 2.90 GHz, and two NVIDIA Quadro RTX 6000 GPUs with 24 GB VRAM each.
3 RESULTS
We performed several analyses based on the B-cell neoplasm subtype classification problem. First, we performed a classification analysis based on a training and testing split of the data as presented in Zhao et al. (2020). We also performed a 5-fold cross-validation with the Flowcat and EnsembleCNN classifiers to compare the predictions obtained for the entire dataset (UMAP-RF was not evaluated by 5-fold cross-validation due to the greatly increased computation time required).
3.1 EnsembleCNN: Effect of histogram bin size
With EnsembleCNN, we tested the performance of varying the number of bins along a given axis for values 5, 10, 25, 50, 75, and 100 (see Figure S2 for an example of forward scatter versus CD45 for different numbers of bins). We first used the portion of training data reserved for training the random forest classifier to perform training and testing of the random forest with different parameters for each of the numbers of bins tested. With respect to the weighted averages of class-wise precision and recall, bin numbers in the range of 25 to 75 remained fairly optimal and consistent; however, with respect to the macro averages, a bin number of 25 was optimal, and we observed a small decrease in performance for certain classes (and therefore the resulting macro averages) for bin numbers 50 and 75. We also computed accuracies using the test set data for the histogram bin numbers to further investigate the effect of this parameter. Table S4 shows the class-specific sensitivities (recall) and precisions obtained for the test set data at different bin numbers. These results are also conveyed in Figure S7, which shows the corresponding averaged precision, recall, and F1-scores (both macro and weighted by class size).
Interestingly, we observed that while classification performance increased as the number of bins increased from 5 to 25, towards the higher end of the range tested there was a decline in performance, especially at 100 bins per axis. This behavior is particularly striking for mantle cell lymphoma (MCL), marginal zone lymphoma (MZL), and prolymphocytic leukemia (PL). For most other classes, the performance remains fairly consistent or fluctuates somewhat, as in the case of follicular lymphoma (FL). In the case of hairy cell leukemia (HCL), the recall reaches a maximum at bins = 100.
As a result of this comparison of numbers of bins, we selected a bin number of 25 (per axis) as our optimal parameter for comparison of EnsembleCNN results with the other classifiers.
Our tuning of the random forest within the training data suggested 1000 estimators and a maximum cutoff of 100 leaf nodes as optimal.
3.2 Primary classification analysis using held-out test set
The creation of the held-out test set from the overall case set by Zhao et al. (2020) employs a date-specific cutoff (2018-07-01) for selecting the test set (2348 cases). This was done to simulate a laboratory deployment situation in which a classifier may be trained on existing cases and then tested on cases arriving subsequently. We applied the same training and testing split to all three classifiers to compare the resulting predictions obtained (see Table 2). For EnsembleCNN, the training set was split again, with 50% randomly selected cases from each class being used for training the CNNs, and the remainder used for training the RF component.
| Group | Train (CNN) | Train (RF) | Test |
|---|---|---|---|
| CLL | 2016 | 2019 | 403 |
| MBL | 776 | 778 | 60 |
| MCL | 136 | 139 | 38 |
| PL | 278 | 277 | 36 |
| LPL | 320 | 322 | 84 |
| MZL | 498 | 499 | 116 |
| FL | 110 | 112 | 24 |
| HCL | 100 | 101 | 24 |
| Normal | 4900 | 4903 | 1563 |
For UMAP-RF, we performed two sets of tests with different levels of sampling to build the UMAPs. In the first instance, we sampled approximately 10 million events from each panel. To balance the sampling, this total was distributed equally across the classes: for example, since there are 201 HCL cases in the training set, 10,000,000/(9 × 201) = 5527 events would be sampled from each HCL case in the training set for each panel. In the second instance, we sampled 22 million events from each panel. Figures S5 and S6 show the classification results obtained for these samplings. The 22 million events sampling provided a significantly better result than the 10 million events sampling even with just 2 panels (the 10 million events sampling was performed for all 3 panels), and we have used this result as the benchmark for comparison with the other classifiers, as it is to date the best performance we have obtained with UMAP-RF.
Table 3 shows the classification performances obtained for comparison (with UMAP-RF results being those obtained with a 22 million events sampling from each panel). Both EnsembleCNN and Flowcat yielded better performances than UMAP-RF, and the remaining focus of our comparative analysis will be predominantly on these two classifiers.
| Group | Precision: Flowcat | Precision: EnsembleCNN | Precision: UMAP-RF | Recall: Flowcat | Recall: EnsembleCNN | Recall: UMAP-RF | F1: Flowcat | F1: EnsembleCNN | F1: UMAP-RF |
|---|---|---|---|---|---|---|---|---|---|
| CLL | 0.93 | **0.98** | 0.95 | **0.89** | 0.85 | 0.88 | 0.91 | 0.91 | **0.92** |
| FL | 0.51 | **0.75** | 0.62 | **0.83** | 0.75 | 0.75 | 0.63 | **0.75** | 0.68 |
| HCL | 0.92 | **1.00** | 0.92 | **1.00** | 0.96 | 0.92 | 0.96 | **0.98** | 0.92 |
| LPL | **0.64** | 0.54 | 0.34 | 0.50 | **0.63** | 0.51 | 0.56 | **0.58** | 0.41 |
| MBL | 0.44 | **0.47** | 0.44 | 0.60 | **0.75** | 0.58 | 0.51 | **0.58** | 0.50 |
| MCL | 0.67 | **0.76** | 0.61 | 0.37 | **0.58** | 0.37 | 0.47 | **0.66** | 0.46 |
| MZL | **0.75** | 0.72 | 0.67 | 0.68 | **0.71** | 0.55 | **0.71** | **0.71** | 0.61 |
| PL | 0.46 | **0.48** | 0.46 | 0.67 | **0.78** | 0.67 | 0.55 | **0.60** | 0.55 |
| Normal | **0.97** | **0.97** | 0.95 | **0.98** | 0.97 | 0.94 | **0.97** | **0.97** | 0.94 |
| Macro average | 0.70 | **0.74** | 0.66 | 0.72 | **0.77** | 0.69 | 0.70 | **0.75** | 0.66 |
| Weighted average | 0.91 | **0.92** | 0.88 | 0.90 | **0.91** | 0.87 | **0.91** | **0.91** | 0.87 |

Note: Values in bold indicate the best performance value for the given diagnosis (ties are both bolded).
Comparing Flowcat and EnsembleCNN, normal cases are classified at 98% and 97% sensitivity (recall), respectively; the difference amounts to 16 more normal cases being predicted correctly by Flowcat (see Figures S3–S6). HCL is predicted with near-perfect accuracy by both classifiers, and CLL is predicted at 89% and 85% sensitivity, respectively. For the remaining groups, there are larger differences in sensitivity between the two classifiers: MBL, MCL, PL, LPL, and MZL are predicted with higher sensitivity by EnsembleCNN, while FL is predicted with higher sensitivity by Flowcat.
Note that the F1-score is the harmonic mean of precision and recall and is considered a summary statistic of the two (F1 = 2 × [precision × recall]/[precision + recall]). Table 3 also shows macro-averaged (all classes weighted equally) and weighted (classes weighted with respect to their sizes) results. The weighted averages for precision and recall (and hence the F1-score) are comparable between the two classifiers: for precision, Flowcat provides 91%, EnsembleCNN provides 92%, and UMAP-RF provides 88%; for recall, the values are 90%, 91%, and 87%, respectively; the F1-scores are 0.91, 0.91, and 0.87, respectively. When compared using the macro-averaged results, EnsembleCNN performed slightly better. While some of the class-specific results differ from those originally presented for Flowcat (Zhao et al., 2020), the F1-scores are on par with the previously reported weighted F1 of 0.92 and macro-averaged F1 of 0.73.
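As a brief illustration of the macro versus weighted averaging used throughout Table 3, the following sketch computes both with sklearn on toy labels; the label values are illustrative only.

```python
# Sketch: macro vs. weighted averaging of precision/recall/F1 (toy labels).
from sklearn.metrics import precision_recall_fscore_support

y_true = ["CLL", "CLL", "MBL", "Normal", "Normal", "HCL"]
y_pred = ["CLL", "MBL", "MBL", "Normal", "Normal", "HCL"]

# Macro: every class contributes equally, regardless of its size.
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
# Weighted: each class's metric is weighted by its number of true cases.
pw, rw, f1w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"macro: P={p:.2f} R={r:.2f} F1={f1:.2f}")
print(f"weighted: P={pw:.2f} R={rw:.2f} F1={f1w:.2f}")
```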
3.3 Five-fold cross validation
We performed 5-fold cross-validation on random splits of the data (original training and testing sets combined) for both Flowcat and EnsembleCNN. The exact same data splits were processed with both classifiers. Figure 2 shows the cross-validation predictions aggregated across the 5 folds in the form of confusion matrix heatmaps. The rows in the heatmaps indicate the true labels and the columns indicate the predicted labels; hence, the diagonal indicates the recall/sensitivity obtained for each class.

The results obtained are fairly similar to those obtained for the two classifiers in the previous held-out-test-set analysis. Normal samples are classified with 97% recall by both classifiers. For the remaining classes, there are fairly close levels of agreement (i.e., at most 6% difference) with the exceptions of prolymphocytic leukemia (PL), marginal zone lymphoma (MZL), and follicular lymphoma (FL). With PL, Flowcat obtained 54% recall across the 5 folds whereas EnsembleCNN obtained 69%. With MZL, Flowcat obtained 55% recall and EnsembleCNN obtained 66% recall. The reverse holds for the remaining case: for FL, EnsembleCNN performed at 76% recall while Flowcat obtained 82% recall.
The full precision, recall, and F1-score results are shown in Table S2. The weighted averaged results for the two classifiers are on par with each other, in the 87%–88% range for both precision and recall; however, the macro averages show some differences: for precision, Flowcat provides 72% while EnsembleCNN provides 66%, and for recall, Flowcat provides 75% while EnsembleCNN provides 69%. Summarizing these results, Flowcat obtains a macro-averaged F1-score of 0.73 while EnsembleCNN obtains 0.67.
3.4 Explainability analysis with EnsembleCNN classifier
For EnsembleCNN, we used the python shap package to perform an analysis of feature importance for the random forest classifier through Shapley Additive Explanations. The summary of the analysis performed with respect to the predicted outputs of the testing data is shown in Figure 3.

Note again that each of the 146 CNN classifiers has 9 outputs corresponding to predicted class probabilities. Therefore, each input received by the random forest classifier can be labeled by the pair of markers derived from one of the three panels and the respective class label of the CNN output node, and feature labels derived in this manner are shown in the summary plot. For example, the topmost row has the label “(1) CD79b – CD20 –> normal”, which indicates that this corresponds to the normal class output from the CNN for the CD79b and CD20 marker combination from panel 1. The color scheme indicates that this feature is of high importance for prediction of normal cases, as are the next three features. The fifth row shows that CD20 versus CD5 is most impactful in identifying prolymphocytic leukemia, while the sixth row shows that CD19 versus CD5 is most impactful for identifying chronic lymphocytic leukemia. Additional associations can be determined in a similar manner.
Our SHAP value analysis also suggests that the most important marker pairs are found in panels 1 and 2. This is expected, as panel 3 is predominantly T-cell associated and therefore unlikely to be as relevant for the diagnostic purposes here in comparison to panels 1 and 2.
Explainability analysis was not performed using the Flowcat or UMAP-RF classifiers.
3.5 Resource usage
We also examined the classifiers with respect to the computational time used to carry out the training and testing analyses referred to above. We analyzed the clock time elapsed as well as the CPU/GPU time used by the system. A comparison of results is provided in Table S3 in minutes and seconds. The “real” time indicates the elapsed time as measured by the system clock; the “user” and “sys” times indicate the CPU time spent (by all cores) in user and system code, respectively, and the sum of the latter two indicates the total CPU time spent.
With Flowcat, the time for building the reference SOM was about 2 min on average, with the projection of all case data across all panels taking between 15 and 18 h on different iterations of the analysis carried out. The final CNN training was completed in 2 to 3 min in all the analyses performed. For the result provided in Table S3, the total clock time spent was 17.1 h, and the CPU time was 201.2 h.
For EnsembleCNN, the time taken in total (including the building of histograms) was much less. For 50 bins per axis, the time taken for training the 146 CNNs and the subsequent operations with the random forest amounted to about 1.5 h (with the time taken by the RF being negligible in comparison to the CNNs). The total clock time spent was 3.85 h and the CPU time was 3.55 h, with smaller numbers of bins taking even less time.
UMAP-RF had the highest time usage in our comparison, with multiple days being required to compute the projections of all cases. The timing results provided in Table S3 are with respect to the sampling of 10 million events from each panel; the sampling of 22 million events took much longer for building the UMAP and obtaining the subsequent projections for all cases. For the 10 million events sampling, building the UMAP for a single panel took 1.8 h in clock time and 58.81 h in CPU time. The subsequent case projection (for both training and testing) took 225.76 h (about 9 days) in clock time and 13,422.78 h in CPU time. Note that these results need to be multiplied by ~3 to obtain the time required to process all 3 panels, unless there are computational resources adequate for performing UMAP analysis of all 3 panels concurrently. The training and testing of the random forest classifier itself took only about 3 min.
With UMAP-RF, we also encountered considerable challenges with respect to memory usage, as training of the UMAP requires a considerable amount of memory. For example, with the sampling of 10 million events, for each panel a matrix of 10,000,000 cells by 11 markers (or 9 markers for panel 3) needs to be processed by the UMAP algorithm. With the 22 million events sampling, we could not complete some of the UMAP building on our local GPU server due to insufficient memory and resorted to using the Google Cloud Platform. For this level of sampling, the projection of individual cases with a single-panel UMAP took nearly 15 days with the maximum available parallelized resource usage (i.e., ~45 days for all 3 panels and about 20,000 cases).
Overall, EnsembleCNN showed a significant time advantage in these experiments when compared on overall training time. The times required for applying the EnsembleCNN and Flowcat classifiers to test sets are negligible by comparison. UMAP-RF requires considerably more resources in terms of both memory and time, and the results we have obtained so far do not suggest that UMAP-RF as implemented here will be able to outperform the other two algorithms in most scenarios.
4 DISCUSSION
Our main classification results show modest distinctions between EnsembleCNN and Flowcat, both of which considerably outperformed our implementation of UMAP-RF. Macro averaging of precision values across the classes (i.e., average precision irrespective of class size) showed a very slightly higher averaged precision for EnsembleCNN than Flowcat, and the same holds true for recall values. Weighted averaging of precision values and recall values also demonstrated very similar performance.
With regard to misclassifications, we observed that both classifiers are likely to misclassify CLL as MBL, though the reverse (classifying MBL as CLL) was predominantly observed for Flowcat. Both classifiers were also somewhat likely to misclassify mantle cell lymphoma and prolymphocytic leukemia as each other, and the same pattern can be observed between lymphoplasmacytic lymphoma and marginal zone lymphoma. The misclassifications between CLL and MBL can be expected to a certain degree since MBL may be considered a precursor to CLL (Scarfò et al., 2013). Similarly, the confusion between mantle cell lymphoma and prolymphocytic leukemia for both classifiers is not surprising given the expected immunophenotypic similarities between the two entities, and likewise for marginal zone lymphoma versus lymphoplasmacytic lymphoma. Both classifiers infrequently misidentify tumor cases from multiple classes as normal.
Given that the subtype level diagnostic labels associated with the cases have been derived through clinical diagnostic processes (including validation through complementary testing), it is to the credit of many of these algorithms that once trained, a substantial proportion of the labels could be derived through analysis of flow data alone. However, some degree of misclassification is expected, and we expect to perform further analysis regarding the misclassifications provided by each classifier in future work.
With regard to EnsembleCNN, the effect of varying the number of histogram bins on classification performance is noteworthy, and we believe this merits further investigation in the future, specifically with respect to the decrease in performance for certain classes as the bin number increases beyond 25. One possibility is that at higher resolution more noise is captured in the histogram, whereas at lower resolutions only the more globally relevant levels of variability in cell immunophenotypes are represented. These results also suggest that the same bin size may not be the optimal setting for all classes or pairwise histograms analyzed here; that is, different bin sizes, or combinations thereof, might yield more accurate classification.
Our analysis of training time usage demonstrated a considerable advantage for EnsembleCNN, by a factor of 2.5×–3.2× versus Flowcat. This could be of considerable importance in a clinical deployment setting where a classifier would be expected to be re-trained with additional incoming cases regularly to improve performance, as well as to ensure that the classifier is being trained to handle any time-specific biases being introduced into the data. The time advantage would multiply for retraining, since for EnsembleCNN the majority of time is spent in creating and saving two-dimensional histograms, which can be reused by simply concatenating the histograms with new data, while retraining with Flowcat would involve recalculating all FlowSOM embeddings each time.
Given that the UMAP-RF classifier took a considerably longer time than the other two classifiers, with much of that time being used for the projection of cases into UMAP embeddings, and given that it was unable to perform the classification tasks as well as the other two, we believe it would not be a suitable competitor for a real-time deployment setting among this set of classifiers. While the 22 million events sampling provided significantly better performance than the 10 million events sampling, it came at the cost of requiring considerably more memory to compute the UMAPs, as well as a considerable amount of time to compute the subsequent projections of individual cases. Given the time requirement, it was prohibitive to carry out the 5-fold cross-validation experiments that we carried out with the other two classifiers. We do note that the classification accuracy of UMAP-RF might be considerably improved if variability in day-to-day or instrument-to-instrument signal intensities could be corrected; however, given that instrument identification was not present in the raw data files, it was not practical to attempt such a correction. If such corrections can be performed and a method is found to accelerate the UMAP algorithm, the approach's prediction accuracy could be more useful for real-time analysis.
In the course of this work, we carried out a simple explainability analysis by calculating SHAP values for the EnsembleCNN algorithm. The use of a separate CNN for each marker pair in EnsembleCNN allows an easy analysis of important marker pairs to be performed at the level of the random forest classifier. Further analysis of each individual CNN is also possible through a SHAP value analysis to deduce important regions of various marker combination histograms, with contributions ultimately estimated for individual cells, as demonstrated in Simonson et al. (2021). An alternative approach would be to train a CNN to perform cell-level classification (normal vs. neoplastic, for example), and then subsequently utilize the cell-level predictions to provide a diagnostic prediction. Indeed, we aim to provide an example of such an approach for flow-data-based diagnostic work in future publications. However, it should also be noted that the approach described here has the advantage of revealing possibly hitherto unused or underused markers and/or bystander cells that contribute to a specific diagnosis prediction.
We did not conduct an explainability analysis for the Flowcat or UMAP-RF classifiers due to the time and resource requirements it would entail. However, in the original Flowcat publication (Zhao et al., 2020), marker importance analysis was carried out through an occlusion analysis whereby each marker is replaced with zero values and the SOM generation, projection, and classifier training are repeated to compare the results obtained. A similar occlusion-based explainability analysis focused on marker importance could be performed for the UMAP-RF approach. Single-cell highlighting in saliency maps could be performed for the Flowcat approach; we also note that further development of the software to employ Shapley Additive Explanations (SHAP) could help to identify predictive features more specific to individual cases.
Per the original article that introduced the dataset (Zhao et al., 2020), the data were collected at the Munich Leukemia Laboratory (MLL). In a follow-up study, the FlowSOM approach was applied to data from other labs using the previously learned weights and a transfer learning approach, wherein additional learning was performed using data from the outside laboratories, which resulted in improved performance at those laboratories (Mallesh et al., 2021). While transfer of learning to other labs has thus been established for the FlowSOM approach, transfer learning for the other approaches has not been established; it is expected that retraining or transfer learning will similarly be required for the other algorithms. It was not specified whether the data comprise various specimen types (blood, lymph node, bone marrow, etc.). However, because the three tested algorithms make their classification predictions using ungated sample data, which provides context for any abnormal cells in a given specimen type, it is expected that with enough training data the classifiers might perform well across multiple specimen types. A comparison of algorithms trained on specific specimen types could nonetheless be very informative, and we predict that with sufficient training data, specimen-type-specific classifiers would perform better.
This study has focused on the application of these algorithms for the classification of B-cell neoplasms, though it is expected that these and other artificial intelligence algorithms can be applied to many more diagnoses. Some examples include detection of acute myeloid leukemia (Ko et al., 2018), myelodysplastic syndrome (Duetz et al., 2021), chronic myeloid leukemia (Ni et al., 2013), acute lymphoblastic leukemia minimal residual disease detection (Fišer et al., 2012; Reiter et al., 2019), and chronic lymphocytic leukemia minimal residual disease detection (Salama et al., 2022). There are several ways in which such algorithms could be used to enhance the practice of clinical flow cytometry generally. All three of the algorithms could be used to automatically review flow cytometry data and flag cases for priority review by a pathologist, helping to speed the diagnosis of patients with aggressive disease. The algorithms could also be used for automated flagging of cases that require work-up with additional flow cytometry antibody panels. Simonson et al. previously demonstrated this in a real-time, prospective study of the potential utilization of the EnsembleCNN algorithm to flag cases that would require an additional flow cytometry antibody panel to help discriminate chronic lymphocytic leukemia from mantle cell lymphoma (Simonson et al., 2022).
In that study, the EnsembleCNN algorithm frequently checked the flow cytometry data file server for newly available data, and if the probability score threshold for needing the additional antibody panel was met, an email was immediately sent to one of the authors of the study. The conclusion of the study was that the algorithm performed very well when compared with the actual ordering of the additional antibody panel and could eventually be used to expedite work-up of cases and require less time of technologists and hematopathologists. Alternatively, identification of such cases could be used to batch similar cases for more efficient laboratory workflow when speed of work-up is not the highest priority. We also recognize that, in the current study, several of the diagnoses would not be made based on flow cytometry alone (e.g., marginal zone lymphoma and plasma cell lymphoma); however, the probability scores for those diagnoses could be used to notify the hematopathologist early regarding the potential need to order relevant immunohistochemical and special stains, which could aid in meeting time cutoffs for placing stain orders. The predictions made by the algorithms could also be used to create draft or templated reports, which can save hematopathologists time and promote consistency in reporting. These algorithms could also be used for quality improvement by flagging cases for additional review when the diagnoses rendered in draft reports by staff members do not match the diagnoses predicted by the algorithm, potentially helping improve diagnostic accuracy. Finally, with further development and validation, it is conceivable that these algorithms could serve in the place of a second official hematopathology reviewer, decreasing the workload for hematopathologists who often must serve this role for select kinds of cases/diagnoses. In short, the opportunities for using artificial intelligence to assist in and improve flow cytometry are plentiful.
In summary, while accuracy metrics are very similar for Flowcat and EnsembleCNN approaches, computational time required to train the algorithms was less for EnsembleCNN. UMAP-RF, as implemented for this publication, was an order of magnitude slower for training and was less accurate in diagnosis predictions. It is expected that additional feature engineering in combination with explainability analysis could improve the performance of all three algorithms, which will be the subject of future work.
ACKNOWLEDGMENTS
This work was supported by the Department of Pathology and Laboratory Medicine at Weill Cornell Medicine, Cornell University, with the support of National Institutes of Health awards U54CA273956 and R01CA200859. We thank Lauren Zuromski for assistance with manuscript review and Nanditha Mallesh for assistance with Flowcat setup.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
DATA AVAILABILITY STATEMENT
Source code will be made available on GitHub (https://github.com/wikum/flowComparison/) with publication of the manuscript.