Volume 3, Issue 2 e70030
METHODS ARTICLE
Open Access

From digitized whole-slide histology images to biomarker discovery: A protocol for handcrafted feature analysis in brain cancer pathology

Xuanjun Lu

School of Electronic Engineering, Xi'an Shiyou University, Xi'an, Shaanxi, China
Yawen Ying

Department of Medical Research, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong, China
Jing Chen

Department of Medical Research, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong, China
Zhiyang Chen

Department of Radiology, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong, China

Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China

School of Clinical Dentistry, University of Sheffield, Sheffield, UK
Yuxin Wu

Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia, USA

Department of Computer Science & Informatics, Emory University, Atlanta, Georgia, USA
Prateek Prasanna

Department of Biomedical Informatics, Stony Brook University, Stony Brook, New York, USA
Xin Chen

Department of Radiology, School of Medicine, Guangzhou First People's Hospital, South China University of Technology, Guangzhou, Guangdong, China
Mingli Jing

Corresponding Author

School of Electronic Engineering, Xi'an Shiyou University, Xi'an, Shaanxi, China

Correspondence

Mingli Jing, Zaiyi Liu and Cheng Lu.

Email: [email protected]; [email protected] and [email protected].
Zaiyi Liu

Corresponding Author

Department of Radiology, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong, China

Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China
Cheng Lu

Corresponding Author

Department of Radiology, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, Guangdong, China

Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China
First published: 28 May 2025

Xuanjun Lu, Yawen Ying and Jing Chen contributed equally to this work and share first authorship.

Abstract

Hematoxylin and eosin (H&E)-stained histopathological slides contain abundant information about cellular and tissue morphology and have been the cornerstone of tumor diagnosis for decades. In recent years, advancements in digital pathology have made whole-slide images (WSIs) widely applicable for diagnosis, prognosis, and prediction in brain cancer. However, there remains a lack of systematic tools and standardized protocols for using handcrafted features in brain cancer histological analysis. In this study, we present a protocol for handcrafted feature analysis in brain cancer pathology (PHBCP) to systematically extract, analyze, model, and visualize handcrafted features from WSIs. The protocol enables the discovery of biomarkers from WSIs through a series of well-defined steps. The PHBCP comprises seven main steps: (1) problem definition, (2) data quality control, (3) image preprocessing, (4) feature extraction, (5) feature filtering, (6) modeling, and (7) performance analysis. As an exemplary application, we collected pathological data of 589 patients from two cohorts and applied the PHBCP to predict the 2-year survival of glioblastoma multiforme (GBM) patients. Among the 72 models combining nine feature selection methods and eight machine learning classifiers, the optimal model combination achieved discriminative performance with an average area under the curve (AUC) of 0.615 over 100 iterations under five-fold cross-validation. In the external validation cohort, the optimal model combination achieved an AUC of 0.594, demonstrating its generalization performance. We provide an open-source code repository (GitHub: https://github.com/XuanjunLu/PHBCP) to facilitate effective collaboration between medical and technical experts, thereby advancing the field of computational pathology in brain cancer.

Key points

What is already known about this topic?

  • Hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) contain abundant information about cellular and tissue morphology. However, there remains a lack of systematic tools and standardized protocols for using handcrafted features in brain cancer histological analysis.

What does this study add?

  • This study presents a protocol for handcrafted feature analysis in brain cancer pathology to systematically extract, analyze, model, and visualize handcrafted features from WSIs, thereby promoting efficient collaboration between medical and technical experts.

1 INTRODUCTION

Histopathological slides, recognized as the “gold standard” for tumor diagnosis,1 hold significant value not only for the morphological assessment of diseases but also as a source of critical biomedical information such as tumor heterogeneity, microenvironment characteristics, and molecular phenotypes.2 In the diagnosis and treatment of brain cancer, histopathological analysis using hematoxylin and eosin (H&E)-stained slides provides indispensable diagnostic evidence for clinical decision-making. However, the traditional diagnostic workflow relies on pathologists' visual inspection of slides under a microscope from low to high magnification. This qualitative analytical approach has inherent limitations. First, subjective interpretation is prone to variability because of differences in experience, leading to diagnostic inconsistency.3 Second, conventional examination has difficulty in quantitatively extracting subvisual tissue features, which may carry crucial prognostic information.4 Third, the efficiency bottleneck of manual analysis becomes apparent when dealing with large numbers of slides.5 Thus, developing an accurate, objective, and interpretable analysis protocol is an important goal in brain cancer pathology.

In recent years, the development of digitized whole-slide image (WSI) technology has revolutionized the field of pathology by enabling permanent digital storage of histopathological slides.6 Leveraging WSI, handcrafted features—i.e., features extracted by manually designed algorithms guided by domain-specific prior knowledge and empirical expertise—have been employed to extract attributes for the discovery of biomarkers. These features can derive quantitative prognostic, predictive, and other pathological information from H&E-stained WSIs, potentially transforming precision oncology and improving patient outcomes. Over the past few years, biomarkers based on handcrafted features have been extensively applied in numerous cancers, including head and neck squamous cell carcinoma,7 urothelial cancer,8 papillary thyroid carcinoma,9 hepatocellular carcinoma,10, 11 lung cancer,12, 13 oropharyngeal squamous cell carcinoma,14 and colorectal cancer.15 However, their application in brain cancer remains insufficient and is scarcely reported in the literature.

Current brain cancer research mainly focuses on radiology images,16-19 with few studies dedicated to histopathological analysis. Even within the limited pathological investigations, deep learning approaches dominate,20 yet their “black-box” nature results in a lack of interpretability, significantly hindering broader clinical translation. The substantial computational resource requirements and intricate preprocessing pipelines associated with deep-learning models pose additional barriers, further limiting their accessibility and practical adoption in clinical settings.

In this study, we present a protocol for handcrafted feature analysis in brain cancer pathology (PHBCP) based on H&E-stained WSIs. The protocol represents a simple, flexible, and modular open-source pipeline. We demonstrate the use of the protocol using two cohorts of glioblastoma multiforme (GBM). By following this protocol, medical and technical experts will be able to promote communication and collaboration, develop novel biomarkers, and collectively tackle clinical challenges in brain cancer, ultimately improving patient outcomes.

2 METHODS

2.1 Overview of the protocol

The protocol comprises seven main steps (Figure 1): (1) problem definition, (2) data quality control, (3) image preprocessing, (4) feature extraction, (5) feature filtering, (6) modeling, and (7) performance analysis. The problem-definition step specifies the precise clinical objectives to be analyzed. The data quality-control step aims to eliminate slides that contain contamination, artifacts, and other issues. The image-preprocessing step provides a WSI-based standardized preprocessing process, encompassing region of interest (ROI) acquisition, WSI slicing, and color normalization. The feature-extraction step details the types, roles, extraction, and aggregation approaches for handcrafted features. The feature-filtering step refines a large set of redundant features to identify those most relevant to the label. The modeling step involves constructing models based on the filtered features to achieve optimal analytical performance. Lastly, the performance-analysis step visualizes important features and conducts downstream analyses. In this study, we use H&E-stained WSIs from two independent cohorts [The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA)]21 to demonstrate how to use the PHBCP. The uniqueness of the protocol lies in its use of interpretable handcrafted features, rather than deep learning, to establish the complex relationship between WSIs and the clinical question. In the remainder of this section, Sections 2.2–2.8 correspond to steps 1–7, respectively.
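The seven steps compose into a linear pipeline. The stub functions below sketch that composition; all names and the toy data structures are purely illustrative and are not taken from the PHBCP repository:

```python
# Minimal skeleton of the seven PHBCP steps as composable stubs.
# Each stub stands in for a much richer implementation.

def define_problem(clinical_question):            # step 1: problem definition
    return {"question": clinical_question}

def quality_control(slides):                      # step 2: drop flagged slides
    return [s for s in slides if s.get("ok", True)]

def preprocess(slides):                           # step 3: ROI, tiling, normalization
    return [{"slide": s, "tiles": s.get("tiles", [])} for s in slides]

def extract_features(items):                      # step 4: handcrafted features per tile
    return [{"features": [len(it["tiles"])]} for it in items]

def filter_features(feature_rows):                # step 5: keep label-relevant features
    return feature_rows

def build_model(feature_rows, labels):            # step 6: fit a classifier
    return {"n_samples": len(feature_rows), "labels": labels}

def analyze_performance(model):                   # step 7: evaluation and visualization
    return {"auc": None, "model": model}

def run_phbcp(slides, labels, question):
    ctx = define_problem(question)
    items = preprocess(quality_control(slides))
    feats = filter_features(extract_features(items))
    return analyze_performance(build_model(feats, labels)), ctx
```

In practice each stub would be replaced by the corresponding module of the protocol; the point here is only that every step consumes the previous step's output.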


Conceptual overview of the protocol. Seven main steps transform images into quantitative feature information, thereby supporting experimental conclusions (Sections 2.2–2.8 correspond to steps 1–7 shown in Figure 1, respectively).

2.2 Problem definition

First, the clinical problem is determined, followed by the collection of tissue samples and corresponding pathology reports from patients. Slicing and staining operations are then performed on the tissue samples, and the stained tissue slides are converted into digital whole-slide histology images using digital scanners for subsequent computational pathology analysis. For example, one may want to interrogate the relationship between nuclear shape features and the grade of tumors of the central nervous system.

2.3 Data quality control

The collected WSIs may be scanned by diverse clinical personnel utilizing different scanners across multiple institutions, which inevitably results in heterogeneous image quality. To mitigate the potential influence of these external variables on the experimental results, it is necessary to exclude substandard slides. However, manual inspection of image quality in a high-throughput experimental context is often unfeasible. Consequently, a tool referred to as HistoQC22 is usually employed as an objective and rapid quality-control process, identifying and flagging issues such as reagent contamination, artifacts, tissue folding, and staining irregularities, thereby enabling automated assessment of WSI quality. Combined with a pathological image viewer, QuPath,23 substandard data are excluded. Multicenter batch effects, such as stain variations, may affect the robustness of the model; Batch Effect Explorer24 is recommended for unveiling batch effects between cohorts.

2.4 Image preprocessing

In Section 2.3, we create a tissue mask, which excludes artifacts, tissue folding, and other external influences. In this step, the ROI of the current task is extracted from the tissue mask and split into image tiles of the desired size and magnification, for example, 224 × 224 pixels at 20× magnification, using the OpenSlide library,25 a Python library for processing WSIs. To ensure a relatively dense tissue distribution, only those image tiles containing more than a certain proportion of tissue area, for example, 80%, are selected. To reduce computational load and avoid subjective selection bias, K-means clustering is performed on the tiles of each WSI to group tiles with similar phenotypes together. To ensure no critical regions are missed, E tiles are selected from each cluster, resulting in L × E tiles being used to characterize each patient, in which L is the number of clusters. Due to staining variations across different centers, deconvolution-based color normalization26-28 is applied to the selected tiles to eliminate color discrepancies between WSIs.
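The clustering-based tile selection described above can be sketched as follows. This is a minimal, library-free sketch: `kmeans` and `select_tiles` are illustrative names, and a production pipeline would cluster on richer per-tile descriptors (e.g., color or texture statistics) than the toy feature vectors used here:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on feature vectors given as lists of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for idx, p in enumerate(points):
            assign[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def select_tiles(tile_features, L=4, E=2, seed=0):
    """Group a WSI's tiles into L phenotype clusters and keep E tiles per
    cluster, yielding up to L*E representative tiles per patient."""
    assign = kmeans(tile_features, L, seed=seed)
    rng = random.Random(seed)
    selected = []
    for c in range(L):
        members = [i for i, a in enumerate(assign) if a == c]
        rng.shuffle(members)          # random pick within a phenotype cluster
        selected.extend(members[:E])
    return sorted(selected)
```

With L = 4 and E = 2, each patient would be characterized by at most eight tiles, matching the L × E scheme described above.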

2.5 Feature extraction

Feature extraction refers to the process of transforming images into quantitatively described feature values. Five types of handcrafted features are provided in this protocol: first-order statistics (n = 17), gray level co-occurrence matrix (GLCM) features (n = 24), gray level run length matrix (GLRLM) features (n = 16), nuclear shape features (n = 25), and nuclear texture features (n = 13). First-order statistics describe the distribution of pixel intensity values within the tissue region. The GLCM features characterize the frequency with which pairs of pixel intensity values co-occur at a given offset in the tissue region. The GLRLM features describe runs of consecutive identical pixel intensity values along a specified direction in the tissue region. The nuclear shape features quantify the geometric properties of nuclear contours, thereby reflecting the characteristic patterns of nuclear deformation and cellular morphological changes during tumor progression. The nuclear texture features, by quantifying the heterogeneity and spatial configuration of chromatin distribution within the nucleus, enable an in-depth analysis of the structural distortions in the intranuclear microenvironment during tumor evolution. In total, 95 features can be extracted for each image tile. The details of the five types of features are as follows:

First-order statistics29:
$\text{Energy}=\sum_{i=1}^{N_p}(X(i)+c)^2$ (1)
$\text{Entropy}=-\sum_{i=1}^{N_g}p(i)\log_2(p(i)+\epsilon)$ (2)
$\text{Minimum}=\min(X)$ (3)
$\text{The 10th percentile of }X$ (4)
$\text{The 90th percentile of }X$ (5)
$\text{Maximum}=\max(X)$ (6)
$\text{Mean}=\frac{1}{N_p}\sum_{i=1}^{N_p}X(i)$ (7)
$\text{Median}=\operatorname{med}(X)$ (8)
$\text{Interquartile range}=75\text{th percentile}-25\text{th percentile}$ (9)
$\text{Range}=\max(X)-\min(X)$ (10)
$\text{Mean absolute deviation}=\frac{1}{N_p}\sum_{i=1}^{N_p}|X(i)-\overline{X}|$ (11)
$\text{Robust mean absolute deviation}=\frac{1}{N_{10-90}}\sum_{i=1}^{N_{10-90}}|X_{10-90}(i)-\overline{X}_{10-90}|$ (12)
$\text{Root mean squared}=\sqrt{\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)+c)^2}$ (13)
$\text{Skewness}=\frac{\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\overline{X})^3}{\left(\sqrt{\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\overline{X})^2}\right)^3}$ (14)
$\text{Kurtosis}=\frac{\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\overline{X})^4}{\left(\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\overline{X})^2\right)^2}$ (15)
$\text{Variance}=\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\overline{X})^2$ (16)
$\text{Uniformity}=\sum_{i=1}^{N_g}p(i)^2$ (17)
where $X$ is the set of $N_p$ pixels included in the ROI, $\overline{X}$ is the average of $X$, $c$ is an optional drift value, $\epsilon$ is an arbitrarily small positive number, and $P(i)$ is the first-order histogram with $N_g$ discrete intensity levels, normalized as
$p(i)=\frac{P(i)}{N_p}$ (18)
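As a concrete check of the first-order definitions, the pure-Python sketch below computes a subset of Eqs. (1)–(18) from a flat list of pixel values. The equal-width binning into $N_g$ histogram levels is an assumption for illustration; the protocol's own discretization settings may differ:

```python
import math

def first_order_stats(X, Ng=16, c=0.0, eps=1e-12):
    """A subset of first-order features, Eqs. (1)-(18); X is a flat pixel list."""
    Np = len(X)
    mean = sum(X) / Np
    var = sum((x - mean) ** 2 for x in X) / Np               # Eq. (16)
    lo, hi = min(X), max(X)
    # Equal-width histogram with Ng bins for the probability-based features.
    width = (hi - lo) / Ng or 1.0                            # guard constant images
    counts = [0] * Ng
    for x in X:
        counts[min(int((x - lo) / width), Ng - 1)] += 1
    p = [cnt / Np for cnt in counts]                         # Eq. (18)
    std = math.sqrt(var)
    return {
        "energy": sum((x + c) ** 2 for x in X),              # Eq. (1)
        "entropy": -sum(pi * math.log2(pi + eps) for pi in p),  # Eq. (2)
        "minimum": lo,                                        # Eq. (3)
        "maximum": hi,                                        # Eq. (6)
        "mean": mean,                                         # Eq. (7)
        "range": hi - lo,                                     # Eq. (10)
        "mad": sum(abs(x - mean) for x in X) / Np,            # Eq. (11)
        "rms": math.sqrt(sum((x + c) ** 2 for x in X) / Np),  # Eq. (13)
        "skewness": (sum((x - mean) ** 3 for x in X) / Np) / (std ** 3 or 1.0),  # Eq. (14)
        "kurtosis": (sum((x - mean) ** 4 for x in X) / Np) / (var ** 2 or 1.0),  # Eq. (15)
        "variance": var,                                      # Eq. (16)
        "uniformity": sum(pi ** 2 for pi in p),               # Eq. (17)
    }
```

For instance, the pixel list `[1, 2, 2, 3]` gives a mean of 2 and an energy of 18, matching Eqs. (7) and (1) by hand.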
GLCM features29:
$\text{Autocorrelation}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p(i,j)ij$ (19)
$\text{Joint average}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p(i,j)i$ (20)
$\text{Cluster prominence}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}(i+j-\mu_x-\mu_y)^4\,p(i,j)$ (21)
$\text{Cluster shade}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}(i+j-\mu_x-\mu_y)^3\,p(i,j)$ (22)
$\text{Cluster tendency}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}(i+j-\mu_x-\mu_y)^2\,p(i,j)$ (23)
$\text{Contrast}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}(i-j)^2\,p(i,j)$ (24)
$\text{Correlation}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p(i,j)ij-\mu_x\mu_y}{\sigma_x(i)\sigma_y(j)}$ (25)
$\text{Difference average (DA)}=\sum_{k=0}^{N_g-1}k\,p_{x-y}(k)$ (26)
$\text{Difference entropy}=-\sum_{k=0}^{N_g-1}p_{x-y}(k)\log_2\left(p_{x-y}(k)+\epsilon\right)$ (27)
$\text{Difference variance}=\sum_{k=0}^{N_g-1}(k-DA)^2\,p_{x-y}(k)$ (28)
$\text{Joint energy}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}(p(i,j))^2$ (29)
$\text{Joint entropy}=-\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p(i,j)\log_2(p(i,j)+\epsilon)$ (30)
$\text{Informational measure of correlation 1}=\frac{\text{HXY}-\text{HXY1}}{\max(\text{HX},\text{HY})}$ (31)
$\text{Informational measure of correlation 2}=\sqrt{1-e^{-2(\text{HXY2}-\text{HXY})}}$ (32)
$\text{Inverse difference moment}=\sum_{k=0}^{N_g-1}\frac{p_{x-y}(k)}{1+k^2}$ (33)
$\text{Maximal correlation coefficient}=\sqrt{\text{second largest eigenvalue of }Q},\quad Q(i,j)=\sum_{k=0}^{N_g}\frac{p(i,k)p(j,k)}{p_x(i)p_y(k)}$ (34)
$\text{Inverse difference moment normalized}=\sum_{k=0}^{N_g-1}\frac{p_{x-y}(k)}{1+\left(\frac{k^2}{N_g^2}\right)}$ (35)
$\text{Inverse difference}=\sum_{k=0}^{N_g-1}\frac{p_{x-y}(k)}{1+k}$ (36)
$\text{Inverse difference normalized}=\sum_{k=0}^{N_g-1}\frac{p_{x-y}(k)}{1+\left(\frac{k}{N_g}\right)}$ (37)
$\text{Inverse variance}=\sum_{k=1}^{N_g-1}\frac{p_{x-y}(k)}{k^2}$ (38)
$\text{Maximum probability}=\max(p(i,j))$ (39)
$\text{Sum average}=\sum_{k=2}^{2N_g}p_{x+y}(k)k$ (40)
$\text{Sum entropy}=-\sum_{k=2}^{2N_g}p_{x+y}(k)\log_2\left(p_{x+y}(k)+\epsilon\right)$ (41)
$\text{Sum squares}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}(i-\mu_x)^2\,p(i,j)$ (42)
where $\epsilon$ is an arbitrarily small positive number, $P(i,j)$ is the co-occurrence matrix, $N_g$ is the number of discrete intensity levels, $\sigma_x$ is the standard deviation of $p_x$, $\sigma_y$ is the standard deviation of $p_y$, and the other parameters are as follows:
$p(i,j)=\frac{P(i,j)}{\sum P(i,j)}$ (43)
$p_x(i)=\sum_{j=1}^{N_g}p(i,j)$ (44)
$p_y(j)=\sum_{i=1}^{N_g}p(i,j)$ (45)
$\mu_x=\sum_{i=1}^{N_g}p_x(i)i$ (46)
$\mu_y=\sum_{j=1}^{N_g}p_y(j)j$ (47)
$p_{x+y}(k)=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p(i,j),\ i+j=k,\ k=2,3,\ldots,2N_g$ (48)
$p_{x-y}(k)=\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p(i,j),\ |i-j|=k,\ k=0,1,\ldots,N_g-1$ (49)
$\text{HX}=-\sum_{i=1}^{N_g}p_x(i)\log_2\left(p_x(i)+\epsilon\right)$ (50)
$\text{HY}=-\sum_{j=1}^{N_g}p_y(j)\log_2\left(p_y(j)+\epsilon\right)$ (51)
$\text{HXY}=-\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p(i,j)\log_2(p(i,j)+\epsilon)$ (52)
$\text{HXY1}=-\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p(i,j)\log_2\left(p_x(i)p_y(j)+\epsilon\right)$ (53)
$\text{HXY2}=-\sum_{i=1}^{N_g}\sum_{j=1}^{N_g}p_x(i)p_y(j)\log_2\left(p_x(i)p_y(j)+\epsilon\right)$ (54)
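The helper below illustrates Eqs. (24), (30), (39), and (43) by building a GLCM from a small gray-level image. The symmetric accumulation and the single horizontal offset are assumptions made for brevity; the PHBCP implementation may aggregate over several offsets and angles:

```python
import math

def glcm_features(img, Ng, dx=1, dy=0, eps=1e-12):
    """Symmetric GLCM for offset (dx, dy); img is a 2-D list of gray levels
    1..Ng. Returns contrast (Eq. 24), joint entropy (Eq. 30) and maximum
    probability (Eq. 39)."""
    P = [[0.0] * Ng for _ in range(Ng)]
    rows, cols = len(img), len(img[0])
    for r in range(rows):
        for col in range(cols):
            r2, c2 = r + dy, col + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = img[r][col] - 1, img[r2][c2] - 1
                P[i][j] += 1
                P[j][i] += 1              # symmetric co-occurrence counting
    total = sum(sum(row) for row in P)
    p = [[v / total for v in row] for row in P]   # normalization, Eq. (43)
    contrast = sum((i - j) ** 2 * p[i][j]
                   for i in range(Ng) for j in range(Ng))
    joint_entropy = -sum(p[i][j] * math.log2(p[i][j] + eps)
                         for i in range(Ng) for j in range(Ng))
    max_prob = max(max(row) for row in p)
    return contrast, joint_entropy, max_prob
```

On the tiny image `[[1, 1], [2, 2]]` with `Ng=2`, the only co-occurring horizontal pairs are (1,1) and (2,2), so the contrast of Eq. (24) is zero, as expected for a pattern with no gray-level transitions along that offset.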
GLRLM features29:
$\text{Short run emphasis}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}\frac{P(i,j|\theta)}{j^2}}{N_r(\theta)}$ (55)
$\text{Long run emphasis}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}P(i,j|\theta)j^2}{N_r(\theta)}$ (56)
$\text{Gray level non-uniformity}=\frac{\sum_{i=1}^{N_g}\left(\sum_{j=1}^{N_r}P(i,j|\theta)\right)^2}{N_r(\theta)}$ (57)
$\text{Gray level non-uniformity normalized}=\frac{\sum_{i=1}^{N_g}\left(\sum_{j=1}^{N_r}P(i,j|\theta)\right)^2}{N_r(\theta)^2}$ (58)
$\text{Run length non-uniformity}=\frac{\sum_{j=1}^{N_r}\left(\sum_{i=1}^{N_g}P(i,j|\theta)\right)^2}{N_r(\theta)}$ (59)
$\text{Run length non-uniformity normalized}=\frac{\sum_{j=1}^{N_r}\left(\sum_{i=1}^{N_g}P(i,j|\theta)\right)^2}{N_r(\theta)^2}$ (60)
$\text{Run percentage}=\frac{N_r(\theta)}{N_p}$ (61)
$\text{Gray level variance}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}p(i,j|\theta)(i-\mu)^2,\quad \mu=\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}p(i,j|\theta)i$ (62)
$\text{Run variance}=\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}p(i,j|\theta)(j-\mu)^2,\quad \mu=\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}p(i,j|\theta)j$ (63)
$\text{Run entropy}=-\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}p(i,j|\theta)\log_2(p(i,j|\theta)+\epsilon)$ (64)
$\text{Low gray level run emphasis}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}\frac{P(i,j|\theta)}{i^2}}{N_r(\theta)}$ (65)
$\text{High gray level run emphasis}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}P(i,j|\theta)i^2}{N_r(\theta)}$ (66)
$\text{Short run low gray level emphasis}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}\frac{P(i,j|\theta)}{i^2 j^2}}{N_r(\theta)}$ (67)
$\text{Short run high gray level emphasis}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}\frac{P(i,j|\theta)i^2}{j^2}}{N_r(\theta)}$ (68)
$\text{Long run low gray level emphasis}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}\frac{P(i,j|\theta)j^2}{i^2}}{N_r(\theta)}$ (69)
$\text{Long run high gray level emphasis}=\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}P(i,j|\theta)i^2 j^2}{N_r(\theta)}$ (70)
where $\epsilon$ is an arbitrarily small positive number, $N_g$ is the number of discrete intensity levels, $N_r$ is the number of discrete run lengths, $N_p$ is the number of pixels, $P(i,j|\theta)$ is the run length matrix for a direction $\theta$, and the other parameters are as follows:
$N_r(\theta)=\sum_{i=1}^{N_g}\sum_{j=1}^{N_r}P(i,j|\theta),\ 1\le N_r(\theta)\le N_p$ (71)
$p(i,j|\theta)=\frac{P(i,j|\theta)}{N_r(\theta)}$ (72)
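To make the run-length construction concrete, the sketch below builds $P(i,j|\theta)$ for the horizontal direction and evaluates Short run emphasis, Eq. (55), together with the run count $N_r(\theta)$ of Eq. (71). Restricting to a single direction is an assumption for brevity; in practice runs are tallied over several angles:

```python
def glrlm_short_run_emphasis(img, Ng):
    """Horizontal (theta = 0) run-length matrix and Short run emphasis, Eq. (55).
    img is a 2-D list of gray levels 1..Ng."""
    Nr_max = max(len(row) for row in img)
    # P[i][j] counts runs of gray level i+1 with run length j+1.
    P = [[0] * Nr_max for _ in range(Ng)]
    for row in img:
        run_level, run_len = row[0], 1
        for v in row[1:]:
            if v == run_level:
                run_len += 1                       # run continues
            else:
                P[run_level - 1][run_len - 1] += 1  # close the run
                run_level, run_len = v, 1
        P[run_level - 1][run_len - 1] += 1          # close the final run
    Nr_theta = sum(sum(row) for row in P)           # Eq. (71): number of runs
    sre = sum(P[i][j] / ((j + 1) ** 2)
              for i in range(Ng) for j in range(Nr_max)) / Nr_theta
    return P, sre
```

For `[[1, 1, 2], [3, 3, 3]]` there are three runs (lengths 2, 1, and 3), so Eq. (55) evaluates to (1/4 + 1/1 + 1/9)/3, rewarding the short runs.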
Nuclear shape features are as follows:
Area ratio = area are a max $\text{Area}\,\text{ratio}=\frac{\text{area}}{\text{are}{\mathrm{a}}_{\max }}$ (73)
where area $\text{area}$ is the area of a nucleus, and are a max $\text{are}{\mathrm{a}}_{\max }$ represents the area of a circle with a radius equal to the maximum Euclidean distance from the centroid of the nucleus to its contour points.
Distance ratio = distanc e mean distanc e max $\text{Distance}\,\text{ratio}=\frac{\text{distanc}{\mathrm{e}}_{\text{mean}}}{\text{distanc}{\mathrm{e}}_{\max }}$ (74)
where distanc e mean $\text{distanc}{\mathrm{e}}_{\text{mean}}$ is the average Euclidean distance from the centroid to the contour points, and distanc e max $\text{distanc}{\mathrm{e}}_{\max }$ is the maximum Euclidean distance from the centroid to the contour points in a nucleus.
Distance std = 1 N 1 i = 1 N d c i d c 2 $\text{Distance}\,\text{std}=\sqrt{\frac{1}{N-1}\sum\limits _{i=1}^{N}{\left(d{c}_{i}-\overline{dc}\right)}^{2}}$ (75)
where d c $dc$ is the set of the centroid to contour points standardized Euclidean distances in a nucleus, d c $\overline{dc}$ is the average of d c $dc$ , and N $N$ is the number of the contour points.
Distance var = Distance st d 2 $\text{Distance}\,\text{var}=\text{Distance}\,\text{st}{\mathrm{d}}^{2}$ (76)
Long or short distance ratio = disu m L disu m S $\text{Long}\,\text{or}\,\text{short}\,\text{distance}\,\text{ratio}=\frac{\text{disu}{\mathrm{m}}_{\mathrm{L}}}{\text{disu}{\mathrm{m}}_{\mathrm{S}}}$ (77)
where disu m L $\text{disu}{\mathrm{m}}_{\mathrm{L}}$ is the long-distance sum of contour points in a nucleus. Specifically, it is calculated by uniformly sampling a certain number of points among the contour points with a longer sampling interval and then summing up the Euclidean distances between adjacent sampled points. Similarly, disu m S $\text{disu}{\mathrm{m}}_{\mathrm{S}}$ is the short-distance sum of contour points.
Perimeter ratio = perimete r 2 area $\text{Perimeter}\,\text{ratio}=\frac{\text{perimete}{\mathrm{r}}^{2}}{\text{area}}$ (78)
where perimeter $\text{perimeter}$ is the perimeter of a nucleus.
Smoothness = i = 1 N | d i d i 1 + d i + 1 2 | $\text{Smoothness}=\sum\limits _{i=1}^{N}\vert {d}_{i}-\frac{{d}_{i-1}+{d}_{i+1}}{2}\vert $ (79)
where d i ${d}_{i}$ is the Euclidean distance from the contour point of a nucleus to the centroid.
$\text{Invariant moment}_1=\eta_{20}+\eta_{02}$ (80)
$\text{Invariant moment}_2=\left(\eta_{20}-\eta_{02}\right)^2+4\eta_{11}^2$ (81)
$\text{Invariant moment}_3=\left(\eta_{30}-3\eta_{12}\right)^2+\left(3\eta_{21}-\eta_{03}\right)^2$ (82)
$\text{Invariant moment}_4=\left(\eta_{30}+\eta_{12}\right)^2+\left(\eta_{21}+\eta_{03}\right)^2$ (83)
$\text{Invariant moment}_5=\left(\eta_{30}-3\eta_{12}\right)\left(\eta_{30}+\eta_{12}\right)\left[\left(\eta_{30}+\eta_{12}\right)^2-3\left(\eta_{21}+\eta_{03}\right)^2\right]+\left(3\eta_{21}-\eta_{03}\right)\left(\eta_{21}+\eta_{03}\right)\left[3\left(\eta_{30}+\eta_{12}\right)^2-\left(\eta_{21}+\eta_{03}\right)^2\right]$ (84)
$\text{Invariant moment}_6=\left(\eta_{20}-\eta_{02}\right)\left[\left(\eta_{30}+\eta_{12}\right)^2-\left(\eta_{21}+\eta_{03}\right)^2\right]+4\eta_{11}\left(\eta_{30}+\eta_{12}\right)\left(\eta_{21}+\eta_{03}\right)$ (85)
$\text{Invariant moment}_7=\left(3\eta_{21}-\eta_{03}\right)\left(\eta_{30}+\eta_{12}\right)\left[\left(\eta_{30}+\eta_{12}\right)^2-3\left(\eta_{21}+\eta_{03}\right)^2\right]-\left(\eta_{30}-3\eta_{12}\right)\left(\eta_{21}+\eta_{03}\right)\left[3\left(\eta_{30}+\eta_{12}\right)^2-\left(\eta_{21}+\eta_{03}\right)^2\right]$ (86)
where $\eta_{pq}$ is the normalized central moment30 calculated from the contour point set of a nucleus.
$\text{Fractal dimension}=\text{slope}\left(\left\{\lg\dfrac{1}{r_k},\ \lg L(r_k)\right\}_{k=1}^{m}\right)$ (87)
where $\text{slope}(\cdot)$ is the slope of a linear regression, $r_k$ is the $k$th sampling interval of the contour points in a nucleus, $L(r_k)$ is the fractal length at the $k$th sampling interval,31 and $m=N/2$.
$\text{Fourier descriptor}=\left[\operatorname{Re}(Z_0),\operatorname{Re}(Z_1),\ldots,\operatorname{Re}(Z_9)\right]$ (88)
where $Z$ is the discrete Fourier transform of the set of contour points of a nucleus (each point represented as a complex number), and the Fourier descriptor retains the real parts of the first 10 coefficients.
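One common reading of Eq. (88), assuming each contour point $(x, y)$ is encoded as the complex number $x+iy$ before the discrete Fourier transform, can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def fourier_descriptor(contour, n_coeffs=10):
    """Real parts of the first `n_coeffs` DFT coefficients of the
    complex contour signal, per one reading of Eq. (88)."""
    contour = np.asarray(contour, dtype=float)
    z = contour[:, 0] + 1j * contour[:, 1]  # contour points as complex numbers
    Z = np.fft.fft(z)                       # discrete Fourier transform
    return Z[:n_coeffs].real
```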
Nuclear texture features are as follows:
$\text{Contrast energy}=\sum_{k}c_k^2\,p_c(c_k)$ (89)
where $c_k$ is the absolute difference of intensity-level pairs, and $p_c(c_k)$ is the normalized co-occurrence probability of the corresponding intensity-level pairs.
$\text{Contrast inverse moment}=\sum_{k}\dfrac{1}{1+c_k^2}\,p_c(c_k)$ (90)
$\text{Contrast average (CA)}=\sum_{k}c_k\,p_c(c_k)$ (91)
$\text{Contrast variance}=\sum_{k}\left(c_k-\mathit{CA}\right)^2 p_c(c_k)$ (92)
$\text{Contrast entropy}=-\sum_{k}p_c(c_k)\ln p_c(c_k)$ (93)
$\text{Intensity average (IA)}=\sum_{l}m_l\,p_m(m_l)$ (94)
where $m_l$ is the average of intensity-level pairs, and $p_m(m_l)$ is the normalized co-occurrence probability of the corresponding intensity-level pairs.
$\text{Intensity variance}=\sum_{l}\left(m_l-\mathit{IA}\right)^2 p_m(m_l)$ (95)
$\text{Intensity entropy}=-\sum_{l}p_m(m_l)\ln p_m(m_l)$ (96)
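The contrast features of Eqs. (89)–(93) depend only on the distribution of absolute intensity differences between co-occurring pixel pairs. A minimal sketch using horizontally adjacent pairs of a small integer-valued grayscale patch (the pair direction and offset are simplifying assumptions, and the function name is illustrative):

```python
import numpy as np

def cooccurrence_contrast_features(img):
    """Contrast features (Eqs. 89-93) from horizontally adjacent pixel
    pairs of a small integer-valued grayscale patch (a sketch)."""
    img = np.asarray(img, dtype=int)
    a, b = img[:, :-1].ravel(), img[:, 1:].ravel()      # intensity-level pairs
    c = np.abs(a - b)                                   # absolute differences c_k
    ck, counts = np.unique(c, return_counts=True)
    pc = counts / counts.sum()                          # normalized co-occurrence probabilities
    contrast_energy = np.sum(ck ** 2 * pc)              # Eq. (89)
    contrast_inverse_moment = np.sum(pc / (1 + ck ** 2))  # Eq. (90)
    ca = np.sum(ck * pc)                                # Eq. (91)
    contrast_variance = np.sum((ck - ca) ** 2 * pc)     # Eq. (92)
    contrast_entropy = -np.sum(pc * np.log(pc))         # Eq. (93)
    return dict(energy=contrast_energy, inv_moment=contrast_inverse_moment,
                average=ca, variance=contrast_variance, entropy=contrast_entropy)
```

A constant patch has zero contrast energy and entropy and an inverse moment of 1, which is a quick sanity check on the implementation.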

The other features, including entropy, energy, correlation, informational measure of correlation 1, and informational measure of correlation 2, are derived from GLCM features. Note that there are other types of handcrafted features, such as the spatial interaction between histological primitives,32, 33 that can be integrated into the PHBCP.

In summary, for each WSI, a feature matrix of size (L × E) × N can be obtained, in which N is the total number of extracted features. Based on this, users can aggregate the per-tile features into a per-WSI representation using statistics such as the mean, standard deviation, and skewness, and concatenate the results according to their needs.
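For example, mean, standard deviation, and skewness aggregation of a per-tile feature matrix might look as follows (a sketch; the function name and the concatenation order are illustrative choices):

```python
import numpy as np
from scipy.stats import skew

def aggregate_wsi_features(feature_matrix):
    """Collapse an (L*E, N) per-tile feature matrix into a single
    per-WSI vector by concatenating the column-wise mean, standard
    deviation, and skewness (one possible aggregation; the protocol
    leaves the choice open)."""
    X = np.asarray(feature_matrix, dtype=float)
    return np.concatenate([X.mean(axis=0), X.std(axis=0), skew(X, axis=0)])
```

For a 500 × 57 matrix like the one in the exemplary task, this particular aggregation yields a 171-dimensional per-WSI vector.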

2.6 Feature filtering

Feature filtering is a critical step in machine learning that identifies the most relevant and informative features from the feature matrix while eliminating redundant and irrelevant features. This process not only reduces the dimensionality of features but also mitigates overfitting, enhances model interpretability, and improves computational efficiency. In this study, the protocol employs comprehensive feature-selection methods to ensure robust and reliable features.

Firstly, to address multicollinearity and reduce feature redundancy, the PHBCP computes the pairwise Spearman's rank correlation coefficient matrix over all features. From each pair of features whose correlation coefficient exceeds a threshold, one feature is removed; 0.9 is a common choice, removing pairs whose ranks are more than 90% concordant. Subsequently, the feature matrix is standardized using the Z-score method.34 These steps ensure that only non-redundant features are retained for further analysis.
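A sketch of this filtering step with pandas is shown below; the greedy column-drop order and the function name are illustrative, and other tie-breaking rules are equally valid:

```python
import numpy as np
import pandas as pd

def spearman_filter_and_standardize(df, threshold=0.9):
    """Drop one feature from each pair whose absolute Spearman rank
    correlation exceeds `threshold`, then z-score the remaining columns
    (a sketch of the redundancy-filtering step)."""
    corr = df.corr(method='spearman').abs()
    # keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    kept = df.drop(columns=to_drop)
    return (kept - kept.mean()) / kept.std(ddof=0)  # z-score standardization
```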

Secondly, the protocol uses comprehensive feature selection methods to capture diverse aspects of feature importance and interactions. These methods include Lasso regression (LR), random forest (RF), elastic-net (EN), recursive feature elimination (RFE), univariate analysis (UA), minimum redundancy maximum relevance (MRMR), t-test, Wilcoxon rank-sum test (WRST), and mutual information (MI), implemented in Python using the scikit-learn, mrmr_selection, and scipy libraries. Users can select an appropriate number of features based on the sample size to achieve suitable predictive performance while avoiding overfitting and the curse of dimensionality.
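As one example of the nine selectors, mutual-information-based selection of the top-k features via scikit-learn can be sketched as follows (the function name and the choice of k are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def select_top_features(X, y, k=6, seed=0):
    """Mutual-information feature selection via SelectKBest (one of the
    nine selection methods; a sketch, not the protocol's exact code)."""
    selector = SelectKBest(
        lambda X, y: mutual_info_classif(X, y, random_state=seed), k=k)
    selector.fit(X, y)
    return selector.get_support(indices=True)  # indices of the top-k features
```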

Finally, each feature selection method is integrated with a classifier, and multi-fold cross-validation with user-defined iterations is performed to assess the consistency and reliability of the selected features across multiple data splits.

By combining correlation-based filtering with comprehensive feature selection methods and cross-validation, the protocol provides a robust framework to identify the most discriminative features while minimizing redundancy and overfitting.

2.7 Modeling

In this step, the PHBCP pairs each feature selection method with each classifier to construct candidate models. The protocol employs eight machine learning classifiers: quadratic discriminant analysis (QDA), linear discriminant analysis (LDA), RF, K-nearest neighbors (KNN), linear support vector machine (LSVM), Gaussian naive Bayes (GNB), stochastic gradient descent (SGD), and adaptive boosting (AdaBoost), all implemented in Python using the scikit-learn library. Each of the eight classifiers is combined with the top features selected by each of the nine feature selection methods, yielding 72 combinations. The classifiers are evaluated with multi-fold cross-validation with user-defined iterations within a training cohort. Ultimately, the PHBCP identifies the optimal model combination among the 72 combinations based on the highest average area under the curve (AUC) across the user-defined iterations.
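A down-scaled sketch of this grid search, with 2 selectors × 2 classifiers instead of 9 × 8 and synthetic data standing in for the WSI feature matrix (all names and sizes here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the per-WSI feature matrix and labels.
X, y = make_classification(n_samples=120, n_features=30, random_state=0)

selectors = {'UA': SelectKBest(f_classif, k=6),
             'MI': SelectKBest(mutual_info_classif, k=6)}
classifiers = {'KNN': KNeighborsClassifier(),
               'RF': RandomForestClassifier(random_state=0)}

results = {}
for s_name, sel in selectors.items():
    for c_name, clf in classifiers.items():
        pipe = make_pipeline(sel, clf)  # selection is re-fit inside each fold
        aucs = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
        results[f'{s_name}-{c_name}'] = aucs.mean()

best = max(results, key=results.get)  # combination with highest mean AUC
```

Wrapping the selector and classifier in a single pipeline ensures feature selection is re-fit within each cross-validation fold, avoiding selection leakage across folds.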

2.8 Performance analysis

Based on the optimal model combination determined during the modeling phase, one can conduct performance analysis focusing on the distributions and importance of the top features, and on survival analysis in the external validation cohort.

Firstly, the PHBCP calculates the mean, median, and skewness of the top feature values, and then divides the feature values into 10 equal-width bins. The distribution of the top features is visualized using histograms overlaid with kernel density estimation (Gaussian kernel) curves. Secondly, a horizontal bar chart visualizes the selection frequency percentage of the top features across the multi-fold cross-validation with user-defined iterations, highlighting the most important features and their contributions to the optimal model combination. A higher selection frequency indicates a greater predictive contribution to the model and a stronger clinical relevance to the research question. Finally, the PHBCP locks down the optimal model combination and the corresponding top features in the training cohort and conducts survival analysis in the external validation cohort. A Kaplan–Meier curve is used to evaluate, for example, the survival probability of patients predicted to have long- versus short-term survival. The log-rank test is employed to examine survival differences, indicating the prognostic significance of the categorical variable on the survival endpoint. All tests are two-sided, with the significance level set at 0.05.
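The Kaplan–Meier curve is built from the product-limit estimator; a minimal self-contained sketch is shown below (in practice a dedicated survival library such as lifelines is typically used, which also provides the log-rank test):

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit Kaplan-Meier estimate (a minimal sketch).
    times: follow-up times; events: 1 = death observed, 0 = censored.
    Returns a list of (event time, survival probability) pairs."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    uniq = np.unique(times[events == 1])  # distinct observed event times
    surv, s = [], 1.0
    for t in uniq:
        at_risk = np.sum(times >= t)                  # subjects still under observation
        d = np.sum((times == t) & (events == 1))      # deaths at time t
        s *= 1.0 - d / at_risk                        # product-limit update
        surv.append((t, s))
    return surv
```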

3 RESULTS

Given that GBM is the most common and aggressive type of malignant primary brain tumor,35, 36 we used a GBM survival prediction problem as an exemplary task to demonstrate how to use the PHBCP. First, WSI data and corresponding basic clinical information were obtained from two independent cohorts through The Cancer Genome Atlas (TCGA, 389 cases) and The Cancer Imaging Archive (TCIA, 200 cases).21 Subsequently, the entire tissue region was defined as the ROI, with overall survival (OS), defined as the time from surgery to death, established as the endpoint. For patients who died during the follow-up period, an OS of 2 years or less was classified as short-term survival, while an OS greater than 2 years was classified as long-term survival. For censored patients, the final follow-up time was used as the OS: cases whose OS exceeded 2 years were classified as long-term survival, while cases with an OS of 2 years or less were considered missing information and excluded from the analysis.
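The labeling rule above can be written as a small helper (a sketch; the function name is illustrative):

```python
def survival_label(os_years, event_occurred, cutoff=2.0):
    """Apply the 2-year labeling rule: returns 'long', 'short', or None
    (censored before the cutoff -> treated as missing and excluded)."""
    if event_occurred:  # death observed during follow-up
        return 'long' if os_years > cutoff else 'short'
    # censored: only usable if follow-up already exceeds the cutoff
    return 'long' if os_years > cutoff else None
```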

In Section 2.3, HistoQC was used to exclude WSIs with fewer than 250,000 usable pixels, as well as those exhibiting significant issues such as extensive blurring, tissue folding, reagent contamination, and abnormal staining. The detailed settings are provided in the supplementary parameter settings for HistoQC. For patients with multiple slides, one slide was selected for subsequent analysis based on its image quality using the pathological image viewer QuPath. Inclusion and exclusion criteria were applied to both cohorts. Inclusion criteria: (1) patients who underwent resection and were confirmed to have GBM through surgical pathological specimens; (2) patients whose OS information was complete; (3) patients with available follow-up information. Exclusion criteria: (1) missing H&E-stained WSIs at 20x magnification and (2) histopathological slides that did not meet the standard requirements for analysis. Ultimately, 207 patients from TCGA were included as the training cohort, while 57 patients from TCIA were incorporated as the external validation cohort. Table 1 presents a summary of the basic clinical information and distribution differences between the training cohort and the external validation cohort.

TABLE 1. Baseline and clinical characteristics in the training cohort and external validation cohort.
| Characteristic | Training cohort (N = 207) | External validation cohort (N = 57) | p |
| --- | --- | --- | --- |
| Age | | | 0.8546 |
| ≤ 65 | 150 (72.5%) | 42 (73.7%) | |
| > 65 | 57 (27.5%) | 15 (26.3%) | |
| Sex | | | 0.9723 |
| Male | 124 (59.9%) | 34 (59.6%) | |
| Female | 83 (40.1%) | 23 (40.4%) | |
| Race | | | <0.0001 |
| White | 189 (91.3%) | 24 (42.1%) | |
| Asian | 3 (1.4%) | 19 (33.3%) | |
| Other | 11 (5.3%) | 13 (22.8%) | |
| Unknown | 4 (1.9%) | 1 (1.8%) | |
| History of LGG | | | |
| Yes | 3 (1.4%) | NA | |
| No | 204 (98.6%) | NA | |
| Event status | | | 0.4987 |
| Occurred | 191 (92.3%) | 51 (89.5%) | |
| Censored | 16 (7.7%) | 6 (10.5%) | |
| Survival status | | | 0.5731 |
| Long term (>2 years) | 54 (26.1%) | 17 (29.8%) | |
| Short term (≤2 years) | 153 (73.9%) | 40 (70.2%) | |
  • The p-values were calculated by Pearson's Chi-square test.

In Section 2.4, the tissue mask generated through HistoQC was aligned with the corresponding WSI, and image tiles of 224 × 224 pixels were extracted at 20x magnification without overlap. The tiles from each WSI were clustered into 10 classes, and 50 tiles were randomly selected from each class to ensure a comprehensive analysis of all regions. Stain normalization was performed on 500 selected tiles for subsequent feature extraction.
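The cluster-then-sample tile selection can be sketched with scikit-learn's KMeans, under the assumption that each tile has already been embedded as a feature vector (the function name and embedding are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_representative_tiles(tile_features, n_clusters=10,
                                per_cluster=50, seed=0):
    """Cluster per-tile feature vectors and sample up to `per_cluster`
    tiles from each cluster so that all tissue regions are represented
    (a sketch of the tile-selection step)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(tile_features)
    chosen = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, idx.size)  # a cluster may hold fewer tiles
        chosen.extend(rng.choice(idx, size=take, replace=False))
    return np.array(chosen)
```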

In Section 2.5, three types of features were extracted: first-order statistics, GLCM features, and GLRLM features, totaling 57 features. For each WSI, a feature matrix of size 500 × 57 was obtained, which was then averaged to aggregate a 1 × 57 feature vector.

In Sections 2.6 and 2.7, to avoid the curse of dimensionality and overfitting, we set the number of top features to six, based on the empirical rule that the number of selected features should be approximately one-tenth of the number of minority-class samples. In this study, the training cohort contained 54 minority-class samples, so the top six features were chosen. The Spearman correlation threshold was set to 0.9. The predictive performance in the training cohort was evaluated by performing 100 iterations of five-fold cross-validation across the 72 model combinations to avoid incidental results. The detailed results are presented in Table 2, which shows that the optimal model combination was MI-KNN (AUC = 0.615 ± 0.027). The results for accuracy and F1 score are presented in Tables S1 and S2, respectively.

TABLE 2. AUC performance of eight different classifiers with nine different feature selection methods in the training cohort.
| | QDA | LDA | RF | KNN | LSVM | GNB | SGD | AdaBoost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LR | 0.595 ± 0.022 | 0.550 ± 0.029 | 0.597 ± 0.037 | 0.591 ± 0.028 | 0.511 ± 0.034 | 0.601 ± 0.028 | 0.533 ± 0.038 | 0.610 ± 0.033 |
| RF | 0.593 ± 0.028 | 0.564 ± 0.025 | 0.586 ± 0.033 | 0.580 ± 0.029 | 0.521 ± 0.039 | 0.582 ± 0.030 | 0.536 ± 0.046 | 0.578 ± 0.032 |
| EN | 0.572 ± 0.029 | 0.547 ± 0.021 | 0.598 ± 0.038 | 0.579 ± 0.032 | 0.527 ± 0.034 | 0.567 ± 0.027 | 0.532 ± 0.038 | 0.584 ± 0.035 |
| RFE | 0.585 ± 0.032 | 0.509 ± 0.032 | 0.577 ± 0.036 | 0.583 ± 0.039 | 0.454 ± 0.034 | 0.575 ± 0.022 | 0.513 ± 0.040 | 0.572 ± 0.034 |
| UA | 0.576 ± 0.028 | 0.565 ± 0.023 | 0.577 ± 0.034 | 0.557 ± 0.033 | 0.530 ± 0.032 | 0.553 ± 0.028 | 0.541 ± 0.040 | 0.590 ± 0.033 |
| MRMR | 0.573 ± 0.027 | 0.566 ± 0.022 | 0.586 ± 0.032 | 0.558 ± 0.033 | 0.540 ± 0.031 | 0.554 ± 0.026 | 0.545 ± 0.034 | 0.595 ± 0.040 |
| t-test | 0.572 ± 0.032 | 0.568 ± 0.020 | 0.579 ± 0.035 | 0.562 ± 0.030 | 0.537 ± 0.029 | 0.555 ± 0.024 | 0.535 ± 0.038 | 0.594 ± 0.035 |
| WRST | 0.596 ± 0.031 | 0.577 ± 0.025 | 0.588 ± 0.041 | 0.574 ± 0.036 | 0.532 ± 0.038 | 0.587 ± 0.028 | 0.548 ± 0.036 | 0.605 ± 0.034 |
| MI | 0.594 ± 0.025 | 0.528 ± 0.032 | 0.567 ± 0.039 | **0.615 ± 0.027** | 0.467 ± 0.036 | 0.591 ± 0.024 | 0.531 ± 0.041 | 0.586 ± 0.037 |
  • Note: The bold values represent the AUC and standard deviation of the optimal model combination.

The top six features and their distributions from the MI-KNN model combination in Section 2.8 are illustrated in Figure 2. All six features closely approximated a normal distribution, indicating that these features were relatively stable across patients, with limited interindividual variability. Figure 3 shows the contribution of the top 12 selected features to the MI-KNN model combination. The top two features, glcm_Contrast_average_20 and glcm_Imc1_average_20, were considered the most relevant to patient outcomes because of their highest selection frequency, underscoring their potential as biomarkers for GBM survival prediction. Based on the top six features of the MI-KNN model combination in the training cohort, a KNN classifier was trained; in the external validation cohort it achieved an AUC of 0.594, an accuracy (ACC) of 0.754, and an F1 score of 0.848. Finally, in the survival analysis of the external validation cohort, patients predicted as long-term survivors showed higher survival probabilities than those predicted as short-term survivors, with statistically significant differences between the two groups (Figure 4), indicating that the constructed classification model had significant predictive value for survival endpoints in GBM patients.


The distributions of the top six feature values in the training cohort (A–F) approximate a normal distribution. The x-axis represents the binned feature intervals (10 bins), and the y-axis indicates the frequency of samples.


The top 12 features at selection frequency and their percentage contributions to the MI-KNN model combination across 100 iterations of five-fold cross-validation.


Kaplan–Meier curves of GBM patients from the external validation cohort, stratified by the MI-KNN model combination into long-term and short-term survival groups.

4 DISCUSSION

In this paper, we develop and present the PHBCP, a systematic, modular, and open-source framework that provides WSI processing and analysis guidelines for brain cancer. The results and methodology outlined in this protocol demonstrate its potential to enhance the discriminability and efficiency of brain cancer prediction and prognosis.

Features can be primarily categorized into two types: handcrafted features and deep learning-derived features. Handcrafted features are extracted through manually designed algorithms, typically based on domain-specific knowledge or experience, such as texture, statistical, and geometric features.32 Given an input, these features yield a fixed and interpretable output. In contrast, deep learning-derived features are learned automatically from data by deep learning models, without the need for manual design; examples include features extracted by ResNet,37 CONCH,38 and UNI.39 In practical applications, models based on handcrafted features can provide interpretable and clinically relevant insights, which are essential for building trust between medical and technical experts. Although deep learning models have shown excellent performance in many tasks, they typically rely on large amounts of data and struggle to extract discriminative features from small samples. Additionally, the internal feature representations of deep learning models are complex, their decision-making processes are opaque, and their reasoning logic is difficult to explain, resulting in poor interpretability. It is also challenging to incorporate specific medical prior knowledge into these models. These issues limit the widespread application of deep learning-derived features in clinical practice and, to some extent, hinder efficient collaboration between medical and technical experts. The PHBCP demonstrates the importance of handcrafted features for discovering novel biomarkers and improving the understanding of tumor heterogeneity, a key challenge in brain cancer research.

In the exemplary task of predicting 2-year survival in GBM, four of the top six features were GLCM-based, one was a first-order statistic, and one was GLRLM-based. The glcm_Contrast_average_20 feature was identified as the most prognostically relevant image feature because of its highest selection frequency. This feature quantifies local intensity variation. Our analysis suggests that the magnitude of contrast is often closely correlated with the area and distribution of tumor and necrotic regions, which helps explain the reasoning behind survival prediction from image features. When combined with multiomics data, the biological significance underlying these images can be further elucidated.

By providing a step-by-step guide, the protocol enables seamless collaboration between medical and technical experts, fostering the development of innovative solutions to clinical problems. In this paper, the 2-year survival prediction in GBM serves merely as an exemplary task. Researchers can also conduct other brain cancer-related analyses based on PHBCP, such as isocitrate dehydrogenase (IDH) mutation analysis. The open-source nature of the protocol ensures its accessibility to a wide range of researchers, promoting reproducibility and scalability across different institutions and datasets.

The protocol has limitations. Although we have established a protocol for handcrafted features in brain cancer pathology, it does not encompass all types of handcrafted features; however, its modular design allows other researchers to extend the PHBCP with additional handcrafted features. As new pathological insights emerge, the set of handcrafted features may require continuous refinement. We anticipate that future contributions from medical and technical experts will enhance and expand this protocol.

5 CONCLUSION

The protocol presented in this study is a significant step forward in the analysis of handcrafted features for brain cancer pathology. By providing a structured and collaborative framework, it empowers pathologists and clinicians to harness histopathological data for improved brain cancer care. We anticipate that this protocol will serve as a valuable resource for the scientific community, driving innovation and promoting the diagnosis and treatment of brain cancer.

AUTHOR CONTRIBUTIONS

Xuanjun Lu: Methodology; data curation; validation; writing—original draft; software; investigation; formal analysis; visualization; funding acquisition. Yawen Ying: Methodology. Jing Chen: Data curation. Zhiyang Chen: Data curation. Yuxin Wu: Software; methodology. Prateek Prasanna: Data curation. Xin Chen: Resources; writing—review and editing. Mingli Jing: Writing—review and editing; resources. Zaiyi Liu: Writing—review and editing; resources. Cheng Lu: Conceptualization; methodology; project administration; funding acquisition; writing—review and editing; resources; software; writing—original draft.

ACKNOWLEDGMENTS

This study was supported by National Natural Science Foundation of China (82272084), National Key R&D Program of China (2023YFC3402800), Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application (2022B1212010011), and Postgraduate Innovation and Practical Ability Training Program of Xi'an Shiyou University (YCS23114144). The authors would like to thank the support provided by MediAI Hub, an advanced medical image analysis software developed and maintained by MediaLab. TCIA data used in this publication were generated by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

ETHICS STATEMENT

Images and data from TCGA and TCIA are publicly available21 and do not require ethics approval.

DATA AVAILABILITY STATEMENT

Images and data from The Cancer Genome Atlas (TCGA) are publicly available at https://portal.gdc.cancer.gov/. The Cancer Imaging Archive (TCIA) data used in this publication were generated by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC).