Journal of Cellular and Molecular Medicine

REVIEW

Open Access

Identification of circRNA-disease associations via multi-model fusion and ensemble learning

Jing Yang

School of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi, China

Contribution: Conceptualization (equal), Data curation (equal), Formal analysis (equal), Methodology (lead), Validation (equal), Visualization (equal), Writing - original draft (equal)

Xiujuan Lei (Corresponding Author)

[email protected]

orcid.org/0000-0002-9901-1732

School of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi, China

Correspondence

Xiujuan Lei, School of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710119, China.

Email: [email protected]

Contribution: Resources (supporting), Supervision (lead), Writing - review & editing (equal)

Fa Zhang

School of Medical Technology, Beijing Institute of Technology, Beijing, China

Contribution: Writing - review & editing (equal)

First published: 20 March 2024

https://doi.org/10.1111/jcmm.18180

Citations: 7

Abstract

Circular RNA (circRNA) is a common non-coding RNA that plays an important role in the diagnosis and therapy of human diseases, and computational prediction of circRNA-disease associations can provide a new route to better clinical diagnosis. In this article, we propose a novel ensemble learning-based method for circRNA-disease association prediction, named ELCDA. First, a heterogeneous association network is constructed by collecting multiple sources of information on circRNAs and diseases, and multiple similarity measures are adopted. Then, metapath-, matrix factorization- and GraphSAGE-based models are used to extract node features from different views, and the comprehensive features of circRNAs and diseases are obtained via ensemble learning. Finally, a soft voting ensemble strategy is used to integrate the predicted results of all classifiers. The performance of ELCDA is evaluated by fivefold cross-validation and compared with other state-of-the-art methods; the experimental results show that ELCDA outperforms the others. Furthermore, three common diseases are used as case studies, which also demonstrate that ELCDA is an effective method for predicting circRNA-disease associations.

1 INTRODUCTION

Circular RNAs (circRNAs) are a class of non-coding RNA with a closed structure, and there is accumulating evidence that circRNAs play an important role in biological processes such as the genetic aetiology of complex human diseases.¹ circRNA was first observed in the cytoplasm of eukaryotic cells.² In the past, research on circRNA was limited by technology, but in recent years high-throughput sequencing technologies have developed rapidly, the number of known circRNAs has grown exponentially, and multiple circRNA databases have been established. These include circBase,³ which collects information on circRNAs from multiple species; circBank,⁴ a comprehensive database of more than 140,000 annotated human circRNAs, built by further analysing and processing all human data in circBase, which also provides miRNA-circRNA interactions; circNet,⁵ an updated database for exploring circular RNA regulatory networks in cancers; circFunBase,⁶ a web-accessible database that provides a high-quality functional circRNA resource; exoRBase,⁷ which provides comprehensive annotation and expression landscapes of circRNAs; and circR2Disease,^{8, 9} circRNADisease¹⁰ and circ2Disease v2.0,¹¹ which are manually curated databases of experimentally supported human circRNAs related to diseases.

However, identifying potential circRNA-disease associations via wet-lab experiments is time-consuming and costly, which urges researchers to explore effective computational methods based on known associations and biological information.¹² These methods can be roughly divided into two categories: traditional machine learning-based methods and deep learning-based methods.

Traditional machine learning-based methods usually treat association prediction as a binary classification problem. Fan et al.¹³ proposed a method (KATZHCDA) based on the KATZ measure for predicting unknown circRNA-disease associations; however, the network structure has a significant impact on model performance. Zhao et al.¹⁴ proposed a computational method, IBNPKATZ, which is also based on the KATZ measure and therefore relies heavily on the network structure. Yan et al.¹⁵ developed a method (DWNN-RLS) based on Regularized Least Squares (RLS) with a Kronecker product kernel and Decreasing Weight K-Nearest Neighbour (DWNN); because of the cost of computing the Kronecker product, it is not suitable for large-scale datasets. Wei et al.¹⁶ proposed a computational method (icircDA-MF) based on Matrix Factorization (MF) that introduces gene information. Zhao et al.¹⁷ proposed a method based on locality-constrained linear coding, but the calculation of the circRNA and disease similarity matrices introduces some bias. Peng et al.¹⁸ proposed a method (RNMFLP) combining Robust Nonnegative Matrix Factorization (RNMF) and Label Propagation (LP). Wang et al.¹⁹ developed a method (KNN-NMF) using weighted K nearest neighbours to reduce the impact of false-negative associations on prediction performance; however, the similarity networks of the above three models depend only on topology information and ignore biological attribute information. Zhang et al.²⁰ predicted associations via metapath2vec++ and matrix factorization, but metapath2vec++ requires prior specification of metapaths and is inefficient for large-scale networks. Ding et al.²¹ predicted associations based on a variational graph autoencoder with matrix factorization, where the variational autoencoder assumes that the latent variable follows a simple Gaussian distribution, which limits the expressiveness of the learned embeddings. Zhang et al.²² proposed a method (ICDMOE) for predicting circRNA-disease associations through a multi-objective evolutionary algorithm, but the interaction of features is not considered in the model.

Deep learning-based methods usually learn feature embeddings of circRNAs and diseases with neural networks. Wang et al.²³ developed a method (GCNCDA) based on multi-similarity fusion and Fast learning with Graph Convolutional Network (FastGCN), where GCN is sensitive to graph structure and has limited generalization capability. Bian et al.²⁴ proposed a method (GATCDA) to predict circRNA-disease associations based on a graph attention network; the performance of the model depends heavily on the attention mechanism, which requires careful tuning and optimization. Zheng et al.²⁵ developed a method (iCDA-CGR) based on Chaos game representation to identify circRNA-disease associations, where the Chaos game needs a large number of iterations to obtain expressive representations. Ji et al.²⁶ proposed a method (GATNNCDA) that combines a Graph Attention Network (GAT) and a multi-layer neural network to infer disease-related circRNAs, but the circRNA similarity network is highly dependent on the circRNA-disease network. Wang et al.²⁷ predicted unknown associations based on GraRep, which was proposed only for homogeneous graphs, so its performance on heterogeneous graphs is limited. Chen et al.²⁸ proposed a method based on a signed heterogeneous graph network; because of its computational complexity, it is not suitable for large-scale graphs. Chen et al.²⁹ proposed a method (RGCNCDA) based on a Relational Graph Convolutional Network (RGCN) that incorporates microRNA (miRNA) to improve prediction performance; however, RGCN mainly focuses on local information and ignores global information. Guo et al.³⁰ proposed a method (THGNCDA) using a graph neural network with attention to learn the importance of each neighbour, but the model complexity is relatively high.

In this work, we propose a novel Ensemble Learning-based CircRNA-Disease Association prediction method (ELCDA for short). First, a heterogeneous network is constructed and multiple similarities are calculated from different views. Then, MAGNN (metapath aggregated graph neural network),³¹ CMF³² and GraphSAGE³³ are used to obtain comprehensive representations of circRNAs and diseases. Finally, the embeddings obtained by the different models are fed into different classifiers, and a soft voting strategy is used to fuse the classification results and obtain the final prediction results.

In summary, the main contributions of this study are listed as follows:

A 3-layer heterogeneous network is constructed among circRNAs, miRNAs and diseases, and 4 different similarity measures are calculated from multiple views;
The metapath-based feature extractor is mainly used to capture global information, GraphSAGE is used to obtain local, nonlinear features, and MF is used to obtain linear information; a comprehensive representation is obtained by integrating these features;
Multiple classifiers are used, and an ensemble learning method is then adopted to obtain the final predicted results.

2 MATERIALS AND METHODS

2.1 Problem description

The network of circRNA-disease associations can be considered as a bipartite network. Assuming there are m circRNAs and n diseases in the network, the nodes can be denoted as two sets ${T}_C=\left\{{c}_1,{c}_2,\cdots, {c}_m\right\}$ and ${T}_D=\left\{{d}_1,{d}_2,\cdots, {d}_n\right\}$, and there are three types of edges between nodes, denoted as $E=\left\{{e}_{cc},{e}_{cd},{e}_{dd}\right\}$, where ${e}_{cc}$ and ${e}_{dd}$ are the similarities between circRNAs and between diseases, respectively, and ${e}_{cd}$ is the association between a circRNA and a disease: if circRNA c is related to disease d, ${e}_{cd}=1$; otherwise, ${e}_{cd}=0$. The goal of our study is to reconstruct the adjacency matrix between circRNAs and diseases so that it is as similar as possible to the original adjacency matrix; values greater than 0 in the reconstructed matrix indicate that the corresponding circRNAs and diseases may be associated. As shown in Figure 1, the black solid lines and black dashed lines represent the known associations and predicted associations, respectively.

FIGURE 1. The circRNA-disease association prediction problem.

2.2 Materials

In this paper, we collect information on circRNAs, miRNAs and diseases. The known associations among circRNAs, miRNAs and diseases are downloaded from circBank, circR2Disease v2.0 and HMDD v3.2.³⁴ After data preprocessing, we obtain a dataset that contains 2223 circRNAs, 996 miRNAs and 199 diseases; the details are shown in Table 1.

TABLE 1. Basic information of dataset.

Types	Items	Numbers	Resources
Node	circRNA (C)	2223	circR2Disease, exoRBase
	Disease (D)	199	circR2Disease, MeSH
	miRNA (M)	996	HMDD
Edge	C-D	2970	circR2Disease
	C-M	13,408	circBank
	M-D	10,282	HMDD
Metapaths	CDC, CMC, CDMDC, CMDMC, DCD, DMD, DCMCD, DMCMD

Furthermore, we analyse the frequency distribution of each type of association, as shown in Table 2: (A) the number of circRNA-related diseases; (B) the number of disease-related circRNAs; (C) the number of circRNA-related miRNAs; (D) the number of miRNA-related circRNAs; (E) the number of miRNA-related diseases; (F) the number of disease-related miRNAs. It can be seen that most circRNAs (1834 of 2223, 82.5%) are related to only one disease, which demonstrates that the adjacency matrix of the circRNA-disease heterogeneous network is very sparse.

TABLE 2. Frequency distribution of each type of association.

Numbers	A	B	C	D	E	F
1	51 (25.6%)	1834 (82.5%)	1 (0.1%)	148 (6.7%)	1 (0.1%)	1 (0.5%)
2–5	62 (31.2%)	359 (16.1%)	699 (70.2%)	1344 (60.5%)	594 (59.6%)	51 (25.6%)
6–10	29 (14.6%)	23 (1%)	15 (1.5%)	357 (16.1%)	109 (10.9%)	16 (8%)
11–50	43 (21.6%)	7 (0.3%)	178 (17.9%)	342 (15.4%)	265 (26.6%)	65 (32.7%)
>50	14 (7%)	0 (0%)	103 (10.3%)	32 (1.4%)	27 (2.7%)	66 (33.2%)
Total	199	2223	996	2223	996	199

An overview of the proposed model is shown in Figure 2. It mainly consists of three modules: heterogeneous network construction, feature extraction and association prediction. Specifically, the high-quality, sub-structural features of nodes are obtained via the metapath-based feature extractor, the low-level, linear features of nodes are obtained via the matrix factorization (MF)-based feature extractor, and the local, nonlinear features are obtained via the GraphSAGE-based feature extractor. Ensemble learning is then used to fuse them and obtain the classification results for unknown associations.

Based on the assumption that similar circRNAs tend to be related to similar diseases, several kinds of information are introduced to calculate the similarity matrices of circRNAs and diseases. In the circRNA space, the expression profile similarity and functional similarity are used to build the circRNA similarity network. In the disease space, the semantic similarity and Gaussian interaction profile (GIP) kernel similarity are used to construct the disease similarity network. Furthermore, PathSim and HeteSim are used here. The details are as follows:

Definition 1. Heterogeneous graph.³⁵ A graph can be denoted as $G=\left(V,E\right)$, where V is the set of nodes and E is the set of edges. ${\varGamma}_v$ and ${\varGamma}_e$ are the sets of node types and edge types, respectively, with two mappings ${\phi}_v:v\to {\varGamma}_v$ and ${\phi}_e:e\to {\varGamma}_e$. If $\mid {\varGamma}_v\mid +\mid {\varGamma}_e\mid >2$, then G is a heterogeneous graph; otherwise, G is homogeneous.

Definition 2. Metapath.³⁶ A metapath P is a special path that connects two entities in the form ${o}_1\overset{R_1}{\to }{o}_2\overset{R_2}{\to}\cdots \overset{R_{q-1}}{\to }{o}_q$, which can be abbreviated as ${o}_1{o}_2\cdots {o}_q$. $R={R}_1\circ {R}_2\circ \cdots \circ {R}_{q-1}$ is the composite relation between the start node ${o}_1$ and the target node ${o}_q$, and q is the length of the path.

Definition 3. PathSim.³⁶ Given a symmetric metapath P, the PathSim between two objects x and y of the same type is defined as follows:

$$PathSim\left(x,y\right)=\frac{2\times \left|\left\{{p}_{x\to y}:{p}_{x\to y}\in P\right\}\right|}{\left|\left\{{p}_{x\to x}:{p}_{x\to x}\in P\right\}\right|+\left|\left\{{p}_{y\to y}:{p}_{y\to y}\in P\right\}\right|}$$

(1)

where ${p}_{x\to y}$ is a metapath instance from x to y.

Definition 4.HeteSim.³⁷ Given a relevance path P corresponding to the relation R defined above, the HeteSim between two objects x and y is:

$$\begin{aligned} HeteSim\left(x,y|R\right)&= HeteSim\left(x,y|{R}_1\circ {R}_2\circ \cdots \circ {R}_l\right)\\ &=\frac{1}{\left|O\left(x|{R}_1\right)\right|\left|I\left(y|{R}_l\right)\right|}\sum_{i=1}^{\left|O\left(x|{R}_1\right)\right|}\sum_{j=1}^{\left|I\left(y|{R}_l\right)\right|} HeteSim\left({O}_i\left(x|{R}_1\right),{I}_j\left(y|{R}_l\right)|{R}_2\circ \cdots \circ {R}_{l-1}\right)\end{aligned}$$

(2)

where $O\left(x|{R}_1\right)$ is the set of out-neighbours of x under relation ${R}_1$ and $I\left(y|{R}_l\right)$ is the set of in-neighbours of y under relation ${R}_l$. The HeteSim can be further simplified into the following form:

$$HeteSim\left(x,y|P\right)=\frac{{T}_{P_L}\left(x,:\right){\left({T}_{P_R^{-1}}\left(y,:\right)\right)}^{\mathrm{Transpose}}}{{\left\Vert {T}_{P_L}\left(x,:\right)\right\Vert}_2{\left\Vert {T}_{P_R^{-1}}\left(y,:\right)\right\Vert}_2}$$

(3)

Assuming the middle node between x and y via path P is mid, P can be split into ${P}_L=\left(x\cdots mid\right)$ and ${P}_R=\left( mid\cdots y\right)$, with ${P}_R^{-1}$ denoting the reverse of ${P}_R$. T is the transition probability matrix, which can be calculated as:

$${T}_{XY}\left(x,y\right)=\frac{{A}_{XY}\left(x,y\right)}{\sum_k{A}_{XY}\left(x,k\right)}$$

(4)

where $A_{XY}$ is the adjacency matrix between node types X and Y.

Taking the network in Figure 3 as an example, the PathSim and HeteSim scores between c₂ and c₄ are calculated as follows. There are two path instances between c₂ and c₄ under the metapath P = CDC.

PathSim

$$PathSim\left({c}_2,{c}_4\right)=\frac{2\times \left|\left\{{p}_{c_2\to {c}_4}:{p}_{c_2\to {c}_4}\in P\right\}\right|}{\left|\left\{{p}_{c_2\to {c}_2}:{p}_{c_2\to {c}_2}\in P\right\}\right|+\left|\left\{{p}_{c_4\to {c}_4}:{p}_{c_4\to {c}_4}\in P\right\}\right|}=\frac{2\times \left|\left\{{c}_2{d}_1{c}_4,{c}_2{d}_2{c}_4\right\}\right|}{\left|\left\{{c}_2{d}_1{c}_2,{c}_2{d}_2{c}_2\right\}\right|+\left|\left\{{c}_4{d}_1{c}_4,{c}_4{d}_2{c}_4,{c}_4{d}_3{c}_4\right\}\right|}=\frac{2\times 2}{2+3}=\frac{4}{5}=0.8$$

HeteSim

First, split P into $P_L = CD$ and $P_R = DC$. The adjacency matrices of $P_L$ and $P_R$ are denoted as $A$ and $A^{\mathrm{Transpose}}$, respectively, and the transition matrices $T_{CD}$ and $T_{DC}$ are obtained via row normalization.

$$A={A}_{CD}=\begin{bmatrix}1&0&0\\ 1&1&0\\ 0&1&0\\ 1&1&1\\ 0&0&1\end{bmatrix},\quad {T}_{CD}=\begin{bmatrix}1&0&0\\ 1/2&1/2&0\\ 0&1&0\\ 1/3&1/3&1/3\\ 0&0&1\end{bmatrix}$$

$${A}^{\mathrm{Transpose}}={A}_{DC}=\begin{bmatrix}1&1&0&1&0\\ 0&1&1&1&0\\ 0&0&0&1&1\end{bmatrix},\quad {T}_{DC}=\begin{bmatrix}1/3&1/3&0&1/3&0\\ 0&1/3&1/3&1/3&0\\ 0&0&0&1/2&1/2\end{bmatrix}$$

Since ${P}_R^{-1}=CD$, the row used for c₄ is ${T}_{CD}\left(4,:\right)$, and the HeteSim score between c₂ and c₄ is:

$$HeteSim\left({c}_2,{c}_4\right)=\frac{{T}_{CD}\left(2,:\right){\left({T}_{CD}\left(4,:\right)\right)}^{\mathrm{Transpose}}}{{\left\Vert {T}_{CD}\left(2,:\right)\right\Vert}_2{\left\Vert {T}_{CD}\left(4,:\right)\right\Vert}_2}=\frac{\left(\frac{1}{2},\frac{1}{2},0\right){\left(\frac{1}{3},\frac{1}{3},\frac{1}{3}\right)}^{\mathrm{Transpose}}}{{\left\Vert \left(\frac{1}{2},\frac{1}{2},0\right)\right\Vert}_2{\left\Vert \left(\frac{1}{3},\frac{1}{3},\frac{1}{3}\right)\right\Vert}_2}=\frac{\frac{1}{3}}{\frac{1}{\sqrt{6}}}=\frac{\sqrt{6}}{3}\approx 0.8165$$
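The worked example above can be checked numerically. The following is a minimal sketch in plain Python, restricted to the metapath CDC and the toy adjacency matrix of the example; the function names are illustrative and are not taken from the ELCDA implementation.

```python
import math

# Toy circRNA-disease adjacency from the worked example
# (rows c1..c5, columns d1..d3).
A_CD = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]

def row_normalize(matrix):
    """Divide each row by its sum (Equation 4); all-zero rows stay zero."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def count_paths(adj, a, b):
    """Number of C-D-C path instances between circRNAs a and b."""
    return sum(p * q for p, q in zip(adj[a], adj[b]))

def pathsim_cdc(adj, x, y):
    """PathSim under the symmetric metapath CDC (Equation 1).
    Assumes x and y each have at least one associated disease."""
    return 2 * count_paths(adj, x, y) / (
        count_paths(adj, x, x) + count_paths(adj, y, y)
    )

def hetesim_cdc(adj, x, y):
    """HeteSim under CDC: cosine of two rows of T_CD (Equation 3)."""
    t_cd = row_normalize(adj)
    dot = sum(p * q for p, q in zip(t_cd[x], t_cd[y]))
    norm_x = math.sqrt(sum(p * p for p in t_cd[x]))
    norm_y = math.sqrt(sum(q * q for q in t_cd[y]))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

# c2 and c4 are rows 1 and 3 with zero-based indexing.
pathsim_c2_c4 = pathsim_cdc(A_CD, 1, 3)
hetesim_c2_c4 = hetesim_cdc(A_CD, 1, 3)
print(round(pathsim_c2_c4, 4), round(hetesim_c2_c4, 4))  # 0.8 0.8165
```

For longer metapaths such as CMDMC, the same idea would apply to products of the relevant adjacency or transition matrices, which this sketch does not cover.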

2.2.1 Disease semantic similarity

In the MeSH database, each disease can be expressed as a directed acyclic graph (DAG). A disease ${d}_i$ can be represented as ${DAG}_{d_i}=\left({d}_i,{T}_{d_i},{E}_{d_i}\right)$, where ${T}_{d_i}$ is the set of ancestor nodes of ${d}_i$ (including ${d}_i$ itself) and ${E}_{d_i}$ is the set of corresponding edges. The semantic contribution of a disease ${d}_t$ in ${DAG}_{d_i}$ to ${d}_i$ can be calculated by:

$${D}_{d_i}\left({d}_t\right)=\begin{cases}1, & i=t\\ \max \left\{\Delta \ast {D}_{d_i}\left({d}_{t^{\prime }}\right)\right\}, & \text{else}\end{cases}$$

(5)

where ${d}_{t^{\prime }}$ ranges over the children of ${d}_t$ and $\Delta$ is the semantic contribution factor (following previous studies, we set $\Delta =0.5$ here). The semantic value of disease ${d}_i$ is then defined as:

$$DV\left({d}_i\right)=\sum_{d_t\in {T}_{d_i}}{D}_{d_i}\left({d}_t\right)$$

(6)

and the semantic similarity between ${d}_i$ and ${d}_j$ is defined as:

$$SS\left({d}_i,{d}_j\right)=\frac{\sum_{d_t\in {T}_{d_i}\cap {T}_{d_j}}\left({D}_{d_i}\left({d}_t\right)+{D}_{d_j}\left({d}_t\right)\right)}{DV\left({d}_i\right)+ DV\left({d}_j\right)}$$

(7)
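To make Equations (5) to (7) concrete, the sketch below computes the semantic similarity on a small, hypothetical DAG. The disease names and the parent map are invented for illustration and are not taken from MeSH or from the ELCDA code; contributions are propagated from a disease up to its ancestors with the decay factor Δ = 0.5.

```python
# Hypothetical toy DAG: each disease maps to its list of parents.
PARENTS = {
    "neoplasms": [],
    "digestive_neoplasms": ["neoplasms"],
    "liver_neoplasms": ["digestive_neoplasms"],
    "stomach_neoplasms": ["digestive_neoplasms"],
}
DELTA = 0.5  # semantic contribution factor

def semantic_contributions(disease, parents=PARENTS, delta=DELTA):
    """Return {ancestor: D_disease(ancestor)} following Equation (5)."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents[node]:
            value = delta * contrib[node]
            # Keep the maximum contribution over all children of `parent`.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_value(disease):
    """DV(d_i) from Equation (6)."""
    return sum(semantic_contributions(disease).values())

def semantic_similarity(d_i, d_j):
    """SS(d_i, d_j) from Equation (7)."""
    c_i = semantic_contributions(d_i)
    c_j = semantic_contributions(d_j)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (semantic_value(d_i) + semantic_value(d_j))

ss_liver_stomach = semantic_similarity("liver_neoplasms", "stomach_neoplasms")
```

In this toy DAG both leaf diseases have semantic value 1 + 0.5 + 0.25 = 1.75 and share two ancestors, so the similarity is 1.5 / 3.5.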

2.2.2 Disease Gaussian Interaction Profile (GIP) kernel similarity

The GIP kernel similarity is widely used to measure the similarities among biomolecules. From the HMDD v3.2 database, we can obtain the miRNA-disease association matrix MD, each column of which can be considered as the interaction profile of a disease. Given two diseases ${d}_i$ and ${d}_j$, the GIP kernel similarity between them can be calculated as follows:

$$GD\left({d}_i,{d}_j\right)=\exp \left\{-{\beta}_d{\left\Vert MD\left(:,i\right)- MD\left(:,j\right)\right\Vert}^2\right\},\quad {\beta}_d=\frac{n}{\sum_{i=1}^n{\left\Vert MD\left(:,i\right)\right\Vert}^2}$$

(8)

where MD(:,i) is the i-th column of MD and n is the number of diseases.
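Equation (8) can be illustrated with a few lines of plain Python. This is a sketch under the assumption that MD is a small binary miRNA-by-disease matrix stored as a list of rows; the example matrix is hypothetical.

```python
import math

def gip_kernel(md):
    """GIP kernel similarity between diseases (columns of md), Equation (8)."""
    n_rows, n = len(md), len(md[0])
    cols = [[md[r][c] for r in range(n_rows)] for c in range(n)]
    sq_norms = [sum(v * v for v in col) for col in cols]
    beta = n / sum(sq_norms)  # assumes md has at least one non-zero entry
    return [
        [
            math.exp(-beta * sum((a - b) ** 2 for a, b in zip(cols[i], cols[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]

# Hypothetical association matrix: 4 miRNAs (rows) x 3 diseases (columns).
MD = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]
GD = gip_kernel(MD)
```

Here every column has squared norm 2, so β_d = 3/6 = 0.5, and two diseases whose profiles differ in two positions receive a similarity of exp(-1).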

The disease similarity matrix ${SD}^{bio}$ is obtained by combining the semantic similarity and the GIP kernel similarity, that is:

$${SD}^{bio}\left({d}_i,{d}_j\right)=\begin{cases} SS\left({d}_i,{d}_j\right), & SS\left({d}_i,{d}_j\right)\ \text{exists}\\ GD\left({d}_i,{d}_j\right), & \text{otherwise}\end{cases}$$

(9)

2.2.3 CircRNA expression profile similarity

exoRBase integrates RNA expression profile information based on normalized RNA-seq data. For example, the expression profile of circRNA ${c}_i$ can be expressed as ${f}_i=\left({f}_{i1},{f}_{i2},\cdots, {f}_{ih}\right)$, and the Spearman correlation coefficient is used to measure the similarities among circRNAs:

$$SE\left({c}_i,{c}_j\right)=1-\frac{6\sum {d}_k^2}{h\left({h}^2-1\right)}$$

(10)

where ${d}_k={f}_{ik}-{f}_{jk}$ is the rank difference (with ${f}_{ik}$ and ${f}_{jk}$ replaced by their ranks within each profile) and h is the dimension of the feature vector.
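A short sketch of Equation (10) follows. It assumes the expression values within each profile are distinct, since the closed-form Spearman formula is exact only without ties; the two profiles are hypothetical.

```python
def ranks(values):
    """Rank values from 1 to h in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(f_i, f_j):
    """Spearman correlation via the rank-difference formula of Equation (10)."""
    h = len(f_i)
    r_i, r_j = ranks(f_i), ranks(f_j)
    d_squared = sum((a - b) ** 2 for a, b in zip(r_i, r_j))
    return 1 - 6 * d_squared / (h * (h ** 2 - 1))

# Hypothetical expression profiles of two circRNAs over h = 5 samples.
f_a = [0.1, 2.3, 1.7, 4.0, 3.2]
f_b = [0.3, 1.9, 2.5, 3.8, 3.1]
se_ab = spearman(f_a, f_b)
```

The ranks are [1, 3, 2, 5, 4] and [1, 2, 3, 5, 4], so the squared rank differences sum to 2 and the similarity is 1 - 12/120 = 0.9. With tied values, a tie-aware routine would be the safer choice.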

2.2.4 circRNA functional similarity

After obtaining the disease similarity matrix SD, we can define the circRNA functional similarity as follows:

$$\begin{aligned} SF\left({c}_i,{c}_j\right)&=\frac{\sum_{t\in \left[1,u\right]} SD\left({d}_{it},N\left({c}_j\right)\right)+\sum_{t\in \left[1,v\right]} SD\left({d}_{jt},N\left({c}_i\right)\right)}{u+v},\\ SD\left({d}_i,N\left({c}_i\right)\right)&=\max SD\left({d}_i,{d}_t\right),\quad t\in \left[1,\left|N\left({c}_i\right)\right|\right]\end{aligned}$$

(11)

where $N\left({c}_i\right)=\left\{{d}_{i1},{d}_{i2},\cdots, {d}_{iu}\right\}$ is the set of diseases related to circRNA ${c}_i$, and $N\left({c}_j\right)=\left\{{d}_{j1},{d}_{j2},\cdots, {d}_{jv}\right\}$ is defined analogously for ${c}_j$.

Finally, combining the expression profile similarity and the functional similarity, we obtain the circRNA similarity ${SC}^{bio}$:

$${SC}^{bio}\left({c}_i,{c}_j\right)=\begin{cases}0.5\times \left( SF\left({c}_i,{c}_j\right)+ SE\left({c}_i,{c}_j\right)\right), & SE\left({c}_i,{c}_j\right)\ \text{exists}\\ SF\left({c}_i,{c}_j\right), & \text{otherwise}\end{cases}$$

(12)
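The best-match averaging in Equation (11) and the case split in Equation (12) can be sketched as follows. The disease similarity matrix and the disease index sets are hypothetical, and both circRNAs are assumed to have at least one related disease.

```python
def functional_similarity(diseases_i, diseases_j, sd):
    """Equation (11): best-match average between two circRNAs' disease sets."""
    def best_match(d, group):
        # SD(d, N(c)): the largest similarity between d and any disease in N(c).
        return max(sd[d][t] for t in group)

    total = sum(best_match(d, diseases_j) for d in diseases_i)
    total += sum(best_match(d, diseases_i) for d in diseases_j)
    return total / (len(diseases_i) + len(diseases_j))

def combined_similarity(sf, se=None):
    """Equation (12): average SF and SE when an expression profile exists."""
    return sf if se is None else 0.5 * (sf + se)

# Hypothetical disease similarity matrix for three diseases (indices 0..2).
SD = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
sf_ij = functional_similarity([0, 1], [2], SD)   # N(c_i) = {0, 1}, N(c_j) = {2}
sc_with_expression = combined_similarity(sf_ij, se=0.9)
sc_without_expression = combined_similarity(sf_ij)
```

With these values the three best matches are 0.2, 0.4 and 0.4, giving SF = 1.0 / 3.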

2.2.5 Integrated similarity for circRNAs and diseases

The integrated disease similarity and circRNA similarity are calculated as follows:

$$SD\left({d}_i,{d}_j\right)=0.5\ast {SD}^{bio}\left({d}_i,{d}_j\right)+0.25\ast \left( HeteSim\left({d}_i,{d}_j\right)+ PathSim\left({d}_i,{d}_j\right)\right),\quad i,j=1,2,\cdots, n$$

(13)

$$SC\left({c}_i,{c}_j\right)=0.5\ast {SC}^{bio}\left({c}_i,{c}_j\right)+0.25\ast \left( HeteSim\left({c}_i,{c}_j\right)+ PathSim\left({c}_i,{c}_j\right)\right),\quad i,j=1,2,\cdots, m$$

(14)

2.3 Methods

2.3.1 Metapath-based feature extractor

As shown in Figure 3(A), we select eight different metapaths on the circRNA-miRNA-disease heterogeneous network. Because the numbers of circRNAs, miRNAs and diseases differ, we apply a node type-specific linear transformation layer to project different types of nodes into the same vector space, that is:

$${h}_c^{\prime }={W}_c\bullet {S}_c,\quad {h}_d^{\prime }={W}_d\bullet {S}_d$$

(15)

where ${W}_c\in {\mathbb{R}}^{m\times l}$ and ${W}_d\in {\mathbb{R}}^{n\times l}$ are the weight matrices, and ${S}_c\in {\mathbb{R}}^m$ and ${S}_d\in {\mathbb{R}}^n$ are the original feature vectors of the two types of nodes. Here, we use the integrated similarity matrices of circRNAs and diseases as the original features, that is, ${S}_c$ is the c-th row of SC, ${S}_d$ is the d-th row of SD, and l is the dimension of the vector space.

A special metapath instance encoder is introduced here to transform the features of all nodes along an instance into a single vector:

$${h}_{p\left({o}_1,{o}_t\right)}={f}_{\theta}\left({o}_1,{o}_t\right)={f}_{\theta}\left({h}_{o_1}^{\prime },{h}_{o_t}^{\prime },\left\{{h}_u^{\prime },\forall u\in \left\{p\left({o}_1,{o}_t\right)\setminus \left\{{o}_1,{o}_t\right\}\right\}\right\}\right)$$

(16)

where $p\left({o}_1,{o}_t\right)$ is a metapath instance connecting entities ${o}_1$ and ${o}_t$.

Then, a multi-head attention mechanism is used to aggregate the instances under the same metapath p. The goal is to learn the weight of each instance, and the weighted sum of all instances is taken as the feature of the node:

$$\begin{aligned}{e}_{o_1{o}_t}^p&= LeakyReLU\left({\alpha}_p^T\bullet \left[{h}_{o_1}^{\prime}\parallel {h}_{p\left({o}_1,{o}_t\right)}\right]\right),\quad {\alpha}_{o_1{o}_t}^p= softmax\left({e}_{o_1{o}_t}^p\right),\\ {h}_{o_1}^p&={\parallel}_{k=1}^K\sigma \left(\sum_{o_t\in {N}_{o_1}^p}{\left[{\alpha}_{o_1{o}_t}^p\right]}_k\bullet {h}_{p\left({o}_1,{o}_t\right)}\right)\end{aligned}$$

(17)

where ${N}_{o_1}^p$ is the set of metapath-based neighbours of ${o}_1$ under p and K is the number of attention heads.

An attention mechanism is also used to aggregate the information of different metapaths as follows:

$$\begin{aligned}{s}_{p_i}&=\frac{1}{\left|{\Gamma}_{v_i}\right|}\sum_{o_1\in {\Gamma}_{v_i}}\tanh \left({W}_{v_i}\bullet {h}_{o_1}^{p_i}+{b}_{v_i}\right),\quad {e}_{p_i}={q}^{\mathrm{Transpose}}\bullet {s}_{p_i},\\ {\beta}_{p_i}&= softmax\left({e}_{p_i}\right),\quad {h}_{o_1}^P=\sum_{p_i\in P}{\beta}_{p_i}\bullet {h}_{o_1}^{p_i}\end{aligned}$$

(18)

where ${v}_i$, i = 1,2, corresponds to circRNA and disease nodes, ${\Gamma}_{v_i}$ is the node-specific set, $\left|{\Gamma}_{v_i}\right|$ is the number of nodes in ${\Gamma}_{v_i}$, and ${p}_i$ denotes the metapaths related to node type i.

The objective function of the metapath-based feature extractor is defined as follows:

$$L=\frac{1}{N}\sum_i-\left({y}_{ij}\bullet \log \left({h}_c^{\mathrm{Transpose}}\bullet {h}_d\right)+\left(1-{y}_{ij}\right)\log \left(1-{h}_c^{\mathrm{Transpose}}\bullet {h}_d\right)\right)$$

(19)

where N is the number of samples.

2.3.2 Matrix factorization-based feature extractor

Matrix factorization (MF) can project the features of circRNAs and diseases onto the same low-dimensional vector space. As shown in Figure 3(B), MF seeks a factorization of the following form:

$$H=C{D}^{\mathrm{Transpose}}\quad s.t.\ C\ge 0,\ D\ge 0,\ SC=C{C}^{\mathrm{Transpose}},\ SD=D{D}^{\mathrm{Transpose}}$$

(20)

An indicator matrix W is introduced here: if there is a known association between circRNA ${c}_i$ and disease ${d}_j$, ${W}_{ij}=1$; otherwise, ${W}_{ij}=0$. With W applied element-wise, the objective function can be written as follows:

$$\begin{aligned}L=\arg \underset{C,D}{\min }&{\left\Vert W\odot \left(H-C{D}^{\mathrm{Transpose}}\right)\right\Vert}_F^2+\alpha \left({\left\Vert C\right\Vert}_F^2+{\left\Vert D\right\Vert}_F^2\right)\\ &+\lambda \left({\left\Vert SC-C{C}^{\mathrm{Transpose}}\right\Vert}_F^2+{\left\Vert SD-D{D}^{\mathrm{Transpose}}\right\Vert}_F^2\right)\end{aligned}$$

(21)

where H is the adjacency matrix of the circRNA-disease association network, $C$ and $D$ are the latent feature matrices of circRNAs and diseases, $\alpha$ and $\lambda$ are the trade-off parameters, ${\left\Vert \bullet \right\Vert}_F^2$ is the squared Frobenius norm, and SC and SD are the similarity matrices of circRNAs and diseases. The alternating direction multiplier update rule is used for optimization.
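To show how the three terms of the objective fit together, the sketch below evaluates it for given factors. It is an illustration only: the second similarity term is taken as the squared norm of SD minus D times its transpose, in line with the constraint in Equation (20), W is applied element-wise, and the helper names are hypothetical. It does not implement the update rule used for optimization.

```python
def matmul_t(a, b):
    """Return a @ b^T for matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def diff(a, b):
    """Element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def frob_sq(m):
    """Squared Frobenius norm."""
    return sum(v * v for row in m for v in row)

def mf_objective(H, W, C, D, SC, SD, alpha, lam):
    """Objective value for latent factors C (m x k) and D (n x k)."""
    residual = diff(H, matmul_t(C, D))
    masked = [[w * r for w, r in zip(rw, rr)] for rw, rr in zip(W, residual)]
    fit = frob_sq(masked)                                # reconstruction term
    ridge = alpha * (frob_sq(C) + frob_sq(D))            # norm regularizer
    sim = lam * (frob_sq(diff(SC, matmul_t(C, C)))
                 + frob_sq(diff(SD, matmul_t(D, D))))    # similarity terms
    return fit + ridge + sim

# Tiny hypothetical instance: 2 circRNAs, 1 disease, rank k = 1.
H = [[1.0], [0.0]]
W = [[1.0], [1.0]]
C = [[1.0], [0.0]]
D = [[1.0]]
SC = [[1.0, 0.0], [0.0, 0.0]]
SD = [[1.0]]
loss = mf_objective(H, W, C, D, SC, SD, alpha=0.1, lam=0.5)
```

In this instance the factors reproduce H, SC and SD exactly, so only the norm regularizer contributes and the loss equals 0.1 × (1 + 1) = 0.2.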

2.3.3 GraphSAGE-based feature extractor

Traditional graph convolutional networks (GCNs) update the node representations of the whole graph in each iteration. When the graph is large, this training strategy is time-consuming and may even be infeasible, which prompted researchers to introduce the idea of mini-batch training into GCN algorithms; the GraphSAGE algorithm was proposed for this purpose.

The details of GraphSAGE algorithm can be summarized as follows:

Neighbour sampling: unlike traditional GCN algorithms, GraphSAGE updates the representation of the target node using the information of a fixed number of sampled neighbours. Specifically, if the number of neighbours is greater than the pre-defined number of samples, under-sampling is used; conversely, if the number of neighbours is less than the pre-defined number of samples, oversampling (resampling with replacement) is used, as shown in Figure 3(C).
Aggregation: for simplicity, the mean aggregator is used in this study, that is:

$${h}_v^L\leftarrow \sigma \left(W\bullet MEAN\left(\left\{{h}_v^{L-1}\right\}\cup \left\{{h}_u^{L-1},\forall u\in N(v)\right\}\right)\right)$$

(22)

where ${h}_v^0$ is the original feature representation of node v, represented by the similarity matrices of circRNAs and diseases.
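The two steps above can be sketched in plain Python. This is a simplified, single-layer illustration that assumes the common fixed-size sampling convention (sampling without replacement when a node has enough neighbours, with replacement otherwise) and a sigmoid as the nonlinearity σ; the function names and the toy features are hypothetical.

```python
import math
import random

def sample_neighbours(neighbours, size, rng):
    """Return a fixed-size neighbour sample.
    Enough neighbours: sample without replacement (under-sampling).
    Too few neighbours: sample with replacement (oversampling)."""
    if not neighbours:
        return []
    if len(neighbours) >= size:
        return rng.sample(neighbours, size)
    return [rng.choice(neighbours) for _ in range(size)]

def mean_aggregate(h, v, sampled, weight):
    """One mean-aggregator layer in the style of Equation (22)."""
    group = [h[v]] + [h[u] for u in sampled]
    dim = len(h[v])
    mean = [sum(vec[k] for vec in group) / len(group) for k in range(dim)]
    z = [sum(w * x for w, x in zip(row, mean)) for row in weight]
    return [1.0 / (1.0 + math.exp(-val)) for val in z]  # sigmoid

# Hypothetical 2-dimensional features for three nodes; node 0 has neighbours 1 and 2.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
rng = random.Random(0)
sampled = sample_neighbours([1, 2], 2, rng)
h_next = mean_aggregate(features, 0, sampled, identity)
```

In a full model, the weight matrix would be learned and the layer stacked, with mini-batches of target nodes processed together.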

2.3.4 Model fusion via ensemble learning

To obtain the best performance, ensemble learning is used here. In this study, several classic classifiers are chosen: support vector machine (SVM),³⁸ random forest (RF),³⁹ extreme gradient boosting (XGBoost),⁴⁰ light gradient boosting machine (LightGBM)⁴¹ and Gaussian naïve Bayes (Gaussian NB),⁴² where RF is a variant of bagging, and XGBoost and LightGBM are boosting algorithms. After obtaining the classification results from the different models and classifiers, a soft voting strategy is used to obtain the final predicted result for each circRNA-disease pair (Algorithm 1).

ALGORITHM 1. Ensemble Learning based CircRNA-Disease Association prediction (ELCDA)

Input: circRNA-disease association matrix CD; circRNA-miRNA association matrix CM;

miRNA-disease association matrix MD; circRNA and disease similarity matrices SC, SD;

the dimension of vector space l; the number of heads in metapath-based feature extractor K;

the trade-off parameter in the MF-based feature extractor λ;

1. Train the metapath-based, MF-based and GraphSAGE-based feature extractors to obtain the embeddings of nodes, denoted as F₁, F₂ and F₃, respectively, and concatenate them as the final representations of nodes, denoted as $F= concat\left({F}_1,{F}_2,{F}_3\right)$ ;

2. Using selected classifiers to obtain the predicted results;

3. Using soft voting strategy to obtain the final predicted results.

Output: the final predicted association probability $p\in \left(0,1\right)$ .
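Step 3 of Algorithm 1 can be illustrated with a minimal soft voting sketch: the positive-class probabilities of the individual classifiers are averaged for each circRNA-disease pair. The probabilities below are hypothetical, and equal weights are an assumption, since the weighting used in ELCDA is not specified here.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-classifier positive-class probabilities.
    `probabilities` holds one list per classifier, one value per sample."""
    n_models = len(probabilities)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [
        sum(w * p[s] for w, p in zip(weights, probabilities)) / total
        for s in range(n_samples)
    ]

# Hypothetical outputs of three classifiers on four circRNA-disease pairs.
probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.3, 0.7],
    [0.7, 0.3, 0.3, 0.7],
]
fused = soft_vote(probs)
labels = [int(p >= 0.5) for p in fused]
```

Unlike hard (majority) voting, soft voting keeps each classifier's confidence, so a single confident classifier can outweigh two uncertain ones.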

3 EXPERIMENTS AND RESULTS

3.1 Evaluation metrics

To evaluate the performance of our model, we compare the proposed model with other state-of-the-art methods under fivefold cross-validation (5-cv). Specifically, the known circRNA-disease associations in circR2Disease v2.0 are taken as the positive samples, and the same number of negative samples is randomly selected, yielding a balanced dataset with 5940 samples. The indicators used to evaluate the model include AUC (the area under the ROC curve), AUPR (the area under the precision-recall curve), Accuracy, Recall and F1-score. Treating association prediction as a binary classification problem, the evaluation indicators are defined as follows:

$$\begin{aligned} Accuracy&=\frac{TP+ TN}{TN+ TP+ FN+ FP},\quad Recall=\frac{TP}{TP+ FN},\\ Precision&=\frac{TP}{TP+ FP},\quad F1\text{-}score=\frac{2\ast Precision\ast Recall}{Precision+ Recall}\end{aligned}$$

(23)
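The indicators in Equation (23) can be computed directly from the confusion counts. The sketch below uses hypothetical labels and predictions and returns 0 for a ratio whose denominator is zero, which is a convention chosen here rather than one stated in the text.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, Recall, Precision and F1-score as in Equation (23)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Hypothetical labels and predictions for eight circRNA-disease pairs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Here TP = 3, TN = 3, FP = 1 and FN = 1, so all four indicators equal 0.75.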

3.2 Parameters analysis

In this section, we analyse two main parameters of ELCDA: the number of heads K in MAGNN and the aggregator used in GraphSAGE. The results are shown in Figure 4. ELCDA obtains the best performance when K is 8 and the mean aggregator is used.

3.3 Comparison with other methods

In this paper, we compare ELCDA with seven other state-of-the-art (SOTA) methods:

KATZHCDA (2018)¹³: predicting unknown associations based on KATZ measure;
CD-LNLP (2019)⁴³: predicting circRNA-disease associations via linear neighbour label propagation;
KATZCPDA (2019)⁴⁴: based on the original KATZHCDA model, taking into account the impact of proteins to predict the associations between circRNAs and diseases;
icircDA-MF (2020)¹⁶: predicting the potential disease-associated circRNAs based on matrix factorization, and the circRNA-disease interaction profiles are then updated by the neighbour interaction profiles so as to correct the false negative associations;
DMFCDA (2021)⁴⁵: using deep matrix factorization to improve prediction of circRNA-disease associations;
GMNN2CD (2022)⁴⁶: using variational inference and graph Markov neural networks to predict circRNA-disease associations;
AGAEMDA (2023)⁴⁷: predicting unknown associations via node-level attention graph auto-encoder.

The ROC and PR curves are shown in Figure 5, from which we can observe that the proposed ELCDA achieves the best performance in both AUC and AUPR, reaching 0.9289 and 0.9239 under 5-cv, respectively, and outperforming all selected SOTA methods. Specifically, the AUPR values of KATZHCDA, CD-LNLP, KATZCPDA, icircDA-MF, DMFCDA and GMNN2CD are significantly lower than those of AGAEMDA and ELCDA, because the former methods did not adopt any sample balancing strategy. The ratio of positive to negative samples is close to 1:150, which indicates that the data set used in this paper is extremely imbalanced; neglecting data set balancing and preprocessing may lead to less-than-ideal results under AUPR and some other evaluation metrics. Other indicators are listed in Table 3, where the bold values are the maximums. From these results we can also observe that ELCDA is superior to the other SOTA methods in most cases.
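The balancing strategy referred to here, in which as many negative samples are drawn at random as there are known positives, can be sketched as follows. With 5940 samples in total this corresponds to 2970 positive and 2970 negative pairs. The function name, the seed parameter and the toy matrix are assumptions for illustration; the paper does not state how the random draw was implemented.

```python
import random
from typing import List, Tuple

Pair = Tuple[int, int]


def balanced_samples(assoc: List[List[int]], seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Return (positives, negatives) with equally many pairs in each list.

    assoc[i][j] == 1 marks a known circRNA-disease association;
    0 marks an unverified pair, which is treated as a candidate negative.
    """
    positives = [(i, j) for i, row in enumerate(assoc)
                 for j, v in enumerate(row) if v == 1]
    unknown = [(i, j) for i, row in enumerate(assoc)
               for j, v in enumerate(row) if v == 0]
    rng = random.Random(seed)
    # random.sample raises ValueError if there are fewer unknown pairs than positives.
    negatives = rng.sample(unknown, len(positives))
    return positives, negatives


if __name__ == "__main__":
    toy = [[1, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 0, 1]]
    pos, neg = balanced_samples(toy, seed=1)
    print(len(pos), len(neg))
```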

TABLE 3. The performance results of methods.

Method	Accuracy	Recall	F1-score
CD-LNLP	0.498	0.717	0.005
KATZHCDA	0.523	0.921	0.013
KATZCPDA	0.499	0.804	0.010
icircDA-MF	0.501	0.832	0.016
DMFCDA	0.501	0.832	0.016
GMNN2CD	0.500	0.862	0.005
AGAEMDA	0.774	0.911	0.795
ELCDA	0.856	0.832	0.852

Note: Bold value indicates the maximum value of each column.
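Table 3 does not list precision, but it can be recovered from the F1-score and Recall columns: solving $F1=2PR/(P+R)$ for $P$ gives $P=F1\cdot R/(2R-F1)$. The sketch below applies this to two rows of Table 3; because the table values are rounded, the results are approximate. The function name is hypothetical.

```python
def precision_from_f1(f1: float, recall: float) -> float:
    """Invert F1 = 2PR / (P + R) to obtain P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom


if __name__ == "__main__":
    # (F1-score, Recall) pairs copied from Table 3.
    rows = {"icircDA-MF": (0.016, 0.832), "ELCDA": (0.852, 0.832)}
    for name, (f1, recall) in rows.items():
        print(name, round(precision_from_f1(f1, recall), 3))
```

Under this calculation the implied precision of icircDA-MF is below 1%, whereas that of ELCDA is roughly 87%, which is in line with the explanation that unbalanced data depress precision-dependent metrics.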

3.4 Ablation studies

We use three feature extractors and various classifiers in this paper, and ablation studies are conducted here to illustrate the effectiveness of the different modules; the results are shown in Figure 6. The complete model obtains the best performance on the different metrics. SVM is a common and basic classifier, but it is not applicable to cases with a lot of missing data. RF is a common bagging classifier that performs well in most cases, but overfitting may occur in noisy classification problems. XGBoost and LightGBM are variants of the gradient boosting decision tree (GBDT) algorithm, which are faster and more robust, but they do not consider that the optimal solution is a synthesis of all features. GaussianNB is a classifier based on naïve Bayes, which is extremely fast but performs poorly on large data sets. Voting is a classical ensemble learning strategy, and compared with hard voting, soft voting can achieve higher classification performance.
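The difference between the two voting strategies can be seen on a small example. In the sketch below, hard voting counts the class labels predicted by the individual classifiers, whereas soft voting averages their probabilities before thresholding. The probabilities and the 0.5 threshold are hypothetical, and the example only shows that the two rules can disagree; it is not evidence that either one is more accurate.

```python
from typing import Sequence


def hard_vote(probs: Sequence[float], threshold: float = 0.5) -> int:
    """Majority vote over the class labels predicted by each classifier."""
    votes = sum(1 for p in probs if p >= threshold)
    return 1 if votes * 2 > len(probs) else 0


def soft_vote(probs: Sequence[float], threshold: float = 0.5) -> int:
    """Threshold the mean of the predicted probabilities."""
    return 1 if sum(probs) / len(probs) >= threshold else 0


if __name__ == "__main__":
    # Two classifiers are slightly below 0.5, one is very confident.
    probs = [0.45, 0.48, 0.95]
    print(hard_vote(probs), soft_vote(probs))
```

Here two weakly negative classifiers outvote one confident positive classifier under hard voting, while soft voting retains the confidence information.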

3.5 Case studies

Hepatocellular carcinoma (HCC), breast cancer (BC) and lung cancer (LC) are used to demonstrate the effectiveness of the proposed model.

At present, the global incidence of liver cancer is on the rise. It is estimated that by 2025 the annual number of liver cancer cases will exceed one million, and HCC is the most common type of liver cancer, accounting for about 90% of cases.⁴⁸ HCC is a prototypical inflammation-associated cancer and the most common type of primary liver cancer in adults; most patients have no symptoms in the early stages. It is a primary cancer that originates from hepatocytes in extensively hardened liver tissue. Table 4 lists the top-20 predicted circRNAs related to HCC; all 20 have been confirmed, which demonstrates excellent predictive performance.

TABLE 4. Predicted top-20 circRNAs related to HCC.

Rank	circRNA	Evidence (PMID)	Rank	circRNA	Evidence (PMID)
1	hsa_circ_100084	32323783	11	hsa_circ_0001001	32222024
2	hsa_circ_0008514	31004447	12	hsa_circ_0049613	32393764
3	hsa_circ_0139897	31456215	13	hsa_circ_0018764	30675276
4	hsa_circ_0010090	32907351	14	hsa_circ_0091570	31207319
5	hsa_circ_0110102	33891564	15	hsa_circ_0008717	32196586
6	circSMG1.72	31996784	16	hsa_circ_0001141	28636993
7	circZEB	30123094	17	hsa_circ_0008450	30556306
8	circWHSC1	33410156	18	hsa_circ_0003731	30630697
9	hsa_circ_0001175	33408482	19	hsa_circ_0000098	30092792
10	hsa_circ_0019456	31417632	20	circGFRA1	332154221

BC is a malignant tumour that occurs in the epithelial tissue of the breast gland. According to the latest report released by the International Agency for Research on Cancer (IARC) in 2022, it has become “the most common cancer in the world”, with 2.26 million new cases worldwide. Several factors can increase the risk of developing BC, including age, family history of BC (mainly related to gene mutations) and obesity. Early-stage BC may not cause noticeable symptoms, so awareness and early detection campaigns can significantly improve survival rates. Table 5 lists the predicted top-20 circRNAs related to BC; all 20 have been confirmed by biologists, which demonstrates that ELCDA is able to provide reliable candidate circRNA biomarkers of BC.

TABLE 5. Predicted top-20 circRNAs related to BC.

Rank	circRNA	Evidence (PMID)	Rank	circRNA	Evidence (PMID)
1	circKLHL24	33781094	11	hsa_circ_0001283	32066649
2	hsa_circ_000911	29431182	12	circTADA2A	32002039
3	hsa_circ_0089105	30657346	13	hsa_circ_0064923	30785332
4	hsa_circ_100438	29431182	14	hsa_circ_0103021	30979827
5	circNINL	33479730	15	circYY1	33603460
6	hsa_circ_0125597	31729134	16	hsa_circ_406697	28484086
7	hsa_circ_0069094	31943203	17	hsa_circ_0064923	30785332
8	hsa_circ_0068515	33136699	18	hsa_circ_100219	31127997
9	circNR3C2	33530981	19	hsa_circ_0004619	28484086
10	hsa_circ_0100213	30979827	20	hsa_circ_0000002	30810051

LC is one of the most common and deadliest cancers worldwide. There are two main types of lung cancer, non-small cell lung cancer and small cell lung cancer; the former is more common, accounting for about 85% of all cases. As with BC, early diagnosis and treatment are crucial to improving patients' chances of survival. The predicted top-20 circRNAs related to LC are listed in Table 6, and 16 of them have been confirmed by relevant studies, which also shows that ELCDA is an effective model for predicting circRNA-disease associations.

TABLE 6. Predicted top-20 circRNAs related to LC.

Rank	circRNA	Evidence (PMID)	Rank	circRNA	Evidence (PMID)
1	circHIPK3	30352682	11	hsa_circ_0007331	unconfirmed
2	hsa_circ_0067934	33155212	12	hsa_circ_101975	unconfirmed
3	circABCB10	32420810	13	circANKRD12	31185953
4	circITCH	27642589	14	hsa_circ_0072088	unconfirmed
5	circZNF609	33459380	15	circMAN1A2	31046163
6	hsa_circ_0001821	27928058	16	hsa_circ_0008193	31700878
7	hsa_circ_0008717	32572881	17	hsa_circ_0012673	32141553
8	hsa_circ_0014235	33292236	18	hsa_circ_0079471	unconfirmed
9	circBANP	29969631	19	circMAN2B2	29550475
10	hsa_circ_0046264	29891014	20	hsa_circ_0022812	32511866
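The confirmation rates reported in the case studies correspond to a precision-at-k calculation over the ranked candidate list. The sketch below reproduces the count for Table 6 using the circRNA names and the "unconfirmed" entries from the table; the function name is hypothetical.

```python
from typing import Sequence, Set


def confirmed_at_k(ranked: Sequence[str], unconfirmed: Set[str], k: int) -> float:
    """Fraction of the top-k ranked circRNAs that have supporting evidence."""
    top_k = ranked[:k]
    hits = sum(1 for name in top_k if name not in unconfirmed)
    return hits / len(top_k)


# Candidates in rank order, copied from Table 6.
TABLE6_RANKED = [
    "circHIPK3", "hsa_circ_0067934", "circABCB10", "circITCH", "circZNF609",
    "hsa_circ_0001821", "hsa_circ_0008717", "hsa_circ_0014235", "circBANP",
    "hsa_circ_0046264", "hsa_circ_0007331", "hsa_circ_101975", "circANKRD12",
    "hsa_circ_0072088", "circMAN1A2", "hsa_circ_0008193", "hsa_circ_0012673",
    "hsa_circ_0079471", "circMAN2B2", "hsa_circ_0022812",
]
# Entries marked "unconfirmed" in Table 6.
TABLE6_UNCONFIRMED = {
    "hsa_circ_0007331", "hsa_circ_101975", "hsa_circ_0072088", "hsa_circ_0079471",
}

if __name__ == "__main__":
    for k in (10, 20):
        print(k, confirmed_at_k(TABLE6_RANKED, TABLE6_UNCONFIRMED, k))
```

For k = 20 this gives 16 confirmed candidates out of 20, matching the figure stated for LC.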

Furthermore, taking HCC as an example, as shown in Figure 7, circITCH can act as a sponge of hsa-miR-184, hsa-miR-224-5p, hsa-miR-20b-5p and hsa-miR-421. This illustrates one of the mechanisms of action of circRNAs: a circRNA can function as a miRNA sponge by binding to miRNAs and preventing their interactions with target mRNAs, thereby affecting the occurrence and development of complex human diseases; thus, circITCH may have an inhibitory effect on HCC.⁴⁹ Starting from a downstream mRNA, researchers can screen for upstream circRNAs and identify the corresponding circRNA-miRNA-mRNA pathway to explore the potential role of a circRNA.
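Screening for circRNA-miRNA-mRNA pathways amounts to joining two association tables on the shared miRNA. The following sketch enumerates such paths for circITCH. The four miRNAs are those named in the text; the mRNA targets (`mRNA_A`, `mRNA_B`) are placeholders invented for illustration and do not represent real interactions.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, str, str]


def cerna_paths(circ_to_mirna: Dict[str, List[str]],
                mirna_to_mrna: Dict[str, List[str]],
                circ: str) -> List[Path]:
    """List every circRNA -> miRNA -> mRNA path that starts at `circ`."""
    return [(circ, mirna, mrna)
            for mirna in circ_to_mirna.get(circ, [])
            for mrna in mirna_to_mrna.get(mirna, [])]


if __name__ == "__main__":
    circ_to_mirna = {
        "circITCH": ["hsa-miR-184", "hsa-miR-224-5p", "hsa-miR-20b-5p", "hsa-miR-421"],
    }
    # Placeholder targets: NOT real miRNA-mRNA interactions.
    mirna_to_mrna = {
        "hsa-miR-184": ["mRNA_A"],
        "hsa-miR-421": ["mRNA_A", "mRNA_B"],
    }
    for path in cerna_paths(circ_to_mirna, mirna_to_mrna, "circITCH"):
        print(" -> ".join(path))
```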

Indeed, a growing body of research focuses on the interactions among non-coding RNAs (ncRNAs) and other biological entities to better understand their regulatory mechanisms in diseases, for instance circRNA-miRNA,^{50, 51} miRNA-lncRNA^{52, 53} and metabolite-disease interaction prediction.^{54, 55} By investigating the interactions among ncRNAs, we can gain a better understanding of their roles in cellular regulation and disease development. This knowledge can provide new insights and potential therapeutic targets for disease diagnosis and treatment. However, many aspects still require further research to unravel the regulatory networks and mechanisms of ncRNAs.

4 CONCLUSIONS

As research deepens, an increasing body of evidence suggests that circRNAs play a crucial role in the occurrence and development of complex human diseases and can be regarded as biomarkers for diagnosis, treatment and prognosis. More and more computational models have been built on experimentally verified circRNA-disease associations, but most existing methods ignore the information carried by the heterogeneous network and the intermediate nodes (miRNAs).

To address these drawbacks, we propose ELCDA, an ensemble learning-based model for predicting circRNA-disease associations. Compared with previous models, HeteSim and PathSim are introduced to enhance the extraction of information from the heterogeneous network, and circRNA-miRNA and miRNA-disease associations are included, which not only provide more detailed biological information but also expand the variety of nodes. As the number of node types increases, more kinds of metapaths can be defined. In addition, this study adopts GraphSAGE-based and MF-based feature extractors to obtain comprehensive representations of nodes, and a soft voting strategy is used to obtain the final predicted results. The experimental results indicate that ELCDA outperforms most SOTA models.
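For readers unfamiliar with PathSim,³⁶ the sketch below computes it for the symmetric metapath circRNA-miRNA-circRNA from a binary circRNA-miRNA matrix, using the standard definition: twice the number of path instances between two nodes divided by the sum of the path instances from each node to itself. Which metapaths ELCDA actually uses with PathSim is not restated in this section, so the choice of metapath, the function name and the toy matrix are assumptions for illustration.

```python
from typing import List


def pathsim(cm: List[List[int]], i: int, j: int) -> float:
    """PathSim of circRNAs i and j along circRNA-miRNA-circRNA.

    cm[a][m] == 1 if circRNA a is associated with miRNA m.
    """
    def path_count(a: int, b: int) -> int:
        # Number of shared miRNAs = number of a -> miRNA -> b path instances.
        return sum(x * y for x, y in zip(cm[a], cm[b]))

    denom = path_count(i, i) + path_count(j, j)
    return 2 * path_count(i, j) / denom if denom else 0.0


if __name__ == "__main__":
    toy_cm = [[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]]
    print(pathsim(toy_cm, 0, 1), pathsim(toy_cm, 0, 2), pathsim(toy_cm, 0, 0))
```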

The proposed model still has shortcomings, which can be addressed in future work. First, the heterogeneous network can be constructed with different links; that is, other node types such as genes can be introduced into the model, so that circRNA-gene, gene-disease and gene-gene associations can be used to further enrich the biological information of circRNAs and diseases. Second, with more associations introduced, more kinds of metapaths can be selected, which will make the model more effective and robust.

AUTHOR CONTRIBUTIONS

Jing Yang: Conceptualization (equal); data curation (equal); formal analysis (equal); methodology (lead); validation (equal); visualization (equal); writing – original draft (equal). Xiujuan Lei: Resources (supporting); supervision (lead); writing – review and editing (equal). Fa Zhang: Writing – review and editing (equal).

FUNDING INFORMATION

This work was supported by the National Natural Science Foundation of China under Grant No. 62272288 and the Fundamental Research Funds for the Central Universities, Shaanxi Normal University (GK202302006).

CONFLICT OF INTEREST STATEMENT

The authors confirm that there are no conflicts of interest.

Open Research

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in circR2Disease v2.0 at https://doi.org/10.1016/j.gpb.2021.10.002, reference number 9.

REFERENCES

1. Lei X, Mudiyanselage TB, Zhang Y, et al. A comprehensive survey on computational methods of non-coding RNA and disease association prediction. Brief Bioinform. 2021;22(4):bbaa350. doi:10.1093/bib/bbaa350
2. Hsu MT, Coca-Prados M. Electron microscopic evidence for the circular form of RNA in the cytoplasm of eukaryotic cells. Nature. 1979;280(5720):339-340. doi:10.1038/280339a0
3. Glažar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA (New York, NY). 2014;20(11):1666-1670. doi:10.1261/rna.043687.113
4. Liu M, Wang Q, Shen J, Yang BB, Ding X. Circbank: a comprehensive database for circRNA with standard nomenclature. RNA Biol. 2019;16(7):899-905. doi:10.1080/15476286.2019.1600395
5. Liu YC, Li JR, Sun CH, et al. CircNet: a database of circular RNAs derived from transcriptome sequencing data. Nucleic Acids Res. 2016;44(D1):D209-D215. doi:10.1093/nar/gkv940
6. Meng X, Hu D, Zhang P, et al. CircFunBase: a database for functional circular RNAs. Database. 2019;2019:baz003. doi:10.1093/database/baz003
7. Li S, Li Y, Chen B, et al. exoRBase: a database of circRNA, lncRNA and mRNA in human blood exosomes. Nucleic Acids Res. 2018;46(D1):D106-D112. doi:10.1093/nar/gkx891
8. Fan C, Lei X, Fang Z, et al. CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases. Database. 2018;2018:bay044. doi:10.1093/database/bay044
9. Fan C, Lei X, Tie J, Zhang Y, Wu FX, Pan Y. CircR2Disease v2.0: an updated web server for experimentally validated circRNA-disease associations and its application. Genomics Proteomics Bioinformatics. 2022;20(3):435-445. doi:10.1016/j.gpb.2021.10.002
10. Zhao Z, Wang K, Wu F, et al. circRNA disease: a manually curated database of experimentally supported circRNA-disease associations. Cell Death Dis. 2018;9(5):475. doi:10.1038/s41419-018-0503-3
11. Yao D, Zhang L, Zheng M, Sun X, Lu Y, Liu P. Circ2Disease: a manually curated database of experimentally validated circRNAs in human disease. Sci Rep. 2018;8(1):11018. doi:10.1038/s41598-018-29360-3
12. Wang CC, Han CD, Zhao Q, Chen X. Circular RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2021;22(6):bbab286. doi:10.1093/bib/bbab286
13. Fan C, Lei X, Wu FX. Prediction of CircRNA-disease associations using KATZ model based on heterogeneous networks. Int J Biol Sci. 2018;14(14):1950-1959. doi:10.7150/ijbs.28260
14. Zhao Q, Yang Y, Ren G, Ge E, Fan C. Integrating bipartite network projection and KATZ measure to identify novel CircRNA-disease associations. IEEE Trans Nanobioscience. 2019;18(4):578-584. doi:10.1109/TNB.2019.2922214
15. Yan C, Wang J, Wu FX. DWNN-RLS: regularized least squares method for predicting circRNA-disease associations. BMC Bioinformatics. 2018;19(Suppl 19):520. doi:10.1186/s12859-018-2522-6
16. Wei H, Liu B. iCircDA-MF: identification of circRNA-disease associations based on matrix factorization. Brief Bioinform. 2020;21(4):1356-1367. doi:10.1093/bib/bbz057
17. Ge E, Yang Y, Gang M, Fan C, Zhao Q. Predicting human disease-associated circRNAs based on locality-constrained linear coding. Genomics. 2020;112(2):1335-1342. doi:10.1016/j.ygeno.2019.08.001
18. Peng L, Yang C, Huang L, Chen X, Fu X, Liu W. RNMFLP: predicting circRNA-disease associations based on robust nonnegative matrix factorization and label propagation. Brief Bioinform. 2022;23(5):bbac155. doi:10.1093/bib/bbac155
19. Wang MN, Xie XJ, You ZH, Wong L, Li LP, Chen ZH. Combining K nearest neighbor with nonnegative matrix factorization for predicting circRNA-disease associations. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(5):2610-2618. doi:10.1109/TCBB.2022.3180903
20. Zhang Y, Lei X, Fang Z, Pan Y. CircRNA-disease associations prediction based on metapath2vec++ and matrix factorization. Big Data Mining and Analytics. 2020;3(4):280-291. doi:10.26599/BDMA.2020.9020025
21. Ding Y, Lei X, Liao B, Wu FX. Predicting miRNA-disease associations based on multi-view variational graph auto-encoder with matrix factorization. IEEE J Biomed Health Inform. 2022;26(1):446-457. doi:10.1109/JBHI.2021.3088342
22. Zhang Y, Lei X, Dai C, Pan Y, Wu FX. Identify potential circRNA-disease associations through a multi-objective evolutionary algorithm. Inf Sci. 2023;647:119437. doi:10.1016/j.ins.2023.119437
23. Wang L, You ZH, Li YM, Zheng K, Huang YA. GCNCDA: a new method for predicting circRNA-disease associations based on graph convolutional network algorithm. PLoS Comput Biol. 2020;16(5):e1007568. doi:10.1371/journal.pcbi.1007568
24. Bian C, Lei XJ, Wu FX. GATCDA: predicting circRNA-disease associations based on graph attention network. Cancers. 2021;13(11):2595. doi:10.3390/cancers13112595
25. Zheng K, You Z, Li J, et al. iCDA-CGR: identification of circRNA-disease associations based on chaos game representation. PLoS Comput Biol. 2020;16(5):e1007872. doi:10.1371/journal.pcbi.1007872
26. Ji C, Liu ZH, Wang Y, Ni J, Zheng C. GATNNCDA: a method based on graph attention network and multi-layer neural network for predicting circRNA-disease associations. Int J Mol Sci. 2021;22(16):8505. doi:10.3390/ijms22168505
27. Wang YY, Lei XJ, Pan Y. Predicting microbe-disease association based on heterogeneous network and global graph feature learning. Chin J Electron. 2022;31(2):345-353. doi:10.1049/cje.2020.00.212
28. Chen M, Jiang YJ, Lei XJ, et al. Drug-target interactions prediction based on signed heterogeneous graph neural networks. Chin J Electron. 2024;33(1):231-244. doi:10.23919/cje.2022.00.384
29. Chen Y, Wang Y, Ding Y, Su X, Wang C. RGCNCDA: relational graph convolutional network improves circRNA-disease association prediction by incorporating microRNAs. Comput Biol Med. 2022;143:105322. doi:10.1016/j.compbiomed.2022.105322
30. Guo Y, Yi M. THGNCDA: circRNA-disease association prediction based on triple heterogeneous graph network. Brief Funct Genomics. 2023;2023:elad042. doi:10.1093/bfgp/elad042
31. Fu X, Zhang J, Meng Z, et al. MAGNN: metapath aggregated graph neural network for heterogeneous graph embedding. Proc Web Conference. 2020;2020:2331-2341.
32. Zheng X, Ding H, Mamitsuka H, et al. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 2013; Chicago. Association for Computing Machinery; 2013:1025-1033. doi:10.1145/2487575.2487670
33. Hamilton WL, Ying Z, Leskovec J. Inductive representation learning on large graphs. NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems: 2017. Curran Associates Inc.; 2017:1025-1035.
34. Huang Z, Shi J, Gao Y, et al. HMDD v3.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res. 2019;47(D1):D1013-D1017. doi:10.1093/nar/gky1010
35. Sun Y, Han J. Mining heterogeneous information networks: a structural analysis approach. ACM SIGKDD Explor Newsl. 2013;14(2):20-28. doi:10.1145/2481244.2481248
36. Sun Y, Han J, Yan X, Yu PS, Wu T. PathSim: meta path-based top-K similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment. 2011;2011:992-1003. doi:10.14778/3402707.3402736
37. Shi C, Kong X, Huang Y, Yu P, Wu B. HeteSim: a general framework for relevance measure in heterogeneous networks. IEEE Trans Knowl Data Eng. 2014;26(10):2479-2492. doi:10.1109/TKDE.2013.2297920
38. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B. Support vector machines. IEEE Intelligent Syst Appl. 1998;13(4):18-28. doi:10.1109/5254.708428
39. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32. doi:10.1023/A:1010933404324
40. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 2016. Association for Computing Machinery; 2016:785-794. doi:10.1145/2939672.2939785
41. Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. Proceedings of the 31st International Conference on Neural Information Processing Systems: 2017. Curran Associates Inc.; 2017:3149-3157.
42. Anand MV, KiranBala B, Srividhya SR, Younus M, Rahman MH. Gaussian Naïve Bayes algorithm: a reliable technique involved in the assortment of the segregation in cancer. Mob Inf Syst. 2022;2022:2436946. doi:10.1155/2022/2436946
43. Zhang W, Yu C, Wang X, Liu F. Predicting CircRNA-disease associations through linear neighborhood label propagation method. IEEE Access. 2019;7:83474-83483. doi:10.1109/ACCESS.2019.2920942
44. Deng L, Zhang W, Shi Y, Tang Y. Fusion of multiple heterogeneous networks for predicting circRNA-disease associations. Sci Rep. 2019;9(1):9605. doi:10.1038/s41598-019-45954-x
45. Lu C, Zeng M, Zhang F, Wu FX, Li M, Wang J. Deep matrix factorization improves prediction of human CircRNA-disease associations. IEEE J Biomed Health Inform. 2021;25(3):891-899. doi:10.1109/JBHI.2020.2999638
46. Niu M, Zou Q, Wang C. GMNN2CD: identification of circRNA-disease associations based on variational inference and graph Markov neural networks. Bioinformatics (Oxford, England). 2022;38(8):2246-2253. doi:10.1093/bioinformatics/btac079
47. Zhang H, Fang J, Sun Y, Xie G, Lin Z, Gu G. Predicting miRNA-disease associations via node-level attention graph auto-encoder. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(2):1308-1318. doi:10.1109/TCBB.2022.3170843
48. Llovet JM, Kelley RK, Villanueva A, et al. Hepatocellular carcinoma. Nat Rev Dis Prim. 2021;7(1):6. doi:10.1038/s41572-020-00240-3
49. Guo W, Zhang J, Zhang D, et al. Polymorphisms and expression pattern of circular RNA circ-ITCH contributes to the carcinogenesis of hepatocellular carcinoma. Oncotarget. 2017;8(29):48169-48177. doi:10.18632/oncotarget.18327
50. Guo LX, You ZH, Wang L, et al. A novel circRNA-miRNA association prediction model based on structural deep neural network embedding. Brief Bioinform. 2022;23(5):bbac391. doi:10.1093/bib/bbac391
51. Qian Y, Zheng J, Jiang Y, Li S, Deng L. Prediction of circRNA-MiRNA association using singular value decomposition and graph neural networks. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(6):3461-3468. doi:10.1109/TCBB.2022.3222777
52. Zhang L, Yang P, Feng H, Zhao Q, Liu H. Using network distance analysis to predict lncRNA-miRNA interactions. Interdiscip Sci: Comput Life Sci. 2021;13(3):535-545. doi:10.1007/s12539-021-00458-z
53. Wang W, Zhang L, Sun J, Zhao Q, Shuai J. Predicting the potential human lncRNA-miRNA interactions based on graph convolution network with conditional random field. Brief Bioinform. 2022;23(6):bbac463. doi:10.1093/bib/bbac463
54. Lei X, Tie J, Pan Y. Inferring metabolite-disease association using graph convolutional networks. IEEE/ACM Trans Comput Biol Bioinform. 2022;19(2):688-698. doi:10.1109/TCBB.2021.3065562
55. Gao H, Sun J, Wang Y, et al. Predicting metabolite-disease associations based on auto-encoder and non-negative matrix factorization. Brief Bioinform. 2023;24(5):bbad259. doi:10.1093/bib/bbad259


Volume 28, Issue 7

April 2024

e18180
