Volume 28, Issue 9 e18345
ORIGINAL ARTICLE
Open Access

SCPLPA: An miRNA–disease association prediction model based on spatial consistency projection and label propagation algorithm

Min Chen

Hunan Institute of Technology, School of Computer Science and Engineering, Hengyang 421002, China

Contribution: Conceptualization (equal), Formal analysis (equal), Methodology (equal), Resources (equal), Software (equal), Supervision (equal), Writing - original draft (equal), Writing - review & editing (equal)

Yingwei Deng

Corresponding Author

Hunan Institute of Technology, School of Computer Science and Engineering, Hengyang 421002, China

Correspondence

Yingwei Deng, Hunan Institute of Technology, School of Computer Science and Engineering, Hengyang 421002, China.

Email: [email protected]

Contribution: Conceptualization (equal), Formal analysis (equal), Investigation (equal), Methodology (equal), Resources (equal), Software (equal), Supervision (equal), Writing - original draft (equal), Writing - review & editing (equal)

Zejun Li

Hunan Institute of Technology, School of Computer Science and Engineering, Hengyang 421002, China

Contribution: Funding acquisition (equal), Resources (equal)

Yifan Ye

Hunan Institute of Technology, School of Computer Science and Engineering, Hengyang 421002, China

Contribution: Investigation (equal), Validation (equal), Visualization (equal)

Lijun Zeng

Hunan Institute of Technology, School of Computer Science and Engineering, Hengyang 421002, China

Contribution: Project administration (equal), Visualization (equal)

Ziyi He

Hunan Institute of Technology, School of Computer Science and Engineering, Hengyang 421002, China

Contribution: Investigation (equal), Validation (equal), Visualization (equal)

Guofang Peng

Hunan Institute of Technology, School of Computer Science and Engineering, Hengyang 421002, China

Contribution: Visualization (equal)

First published: 02 May 2024

Min Chen and Yingwei Deng contributed equally to this work.

Abstract

Identifying associations between miRNAs and diseases is helpful for disease prevention, diagnosis and treatment, so computational methods that predict potential human miRNA–disease associations are of great value. Considering the shortcomings of existing computational methods, such as low prediction accuracy and weak generalization, we propose a new method called SCPLPA to predict miRNA–disease associations. First, a heterogeneous disease similarity network was constructed from the disease semantic similarity network and the disease Gaussian interaction spectrum kernel similarity network, and a heterogeneous miRNA similarity network was constructed from the miRNA functional similarity network and the miRNA Gaussian interaction spectrum kernel similarity network. Then, estimated miRNA–disease association scores were obtained by integrating the outcomes of label propagation algorithms run on the heterogeneous disease similarity network and the heterogeneous miRNA similarity network. Finally, the spatial consistency projection algorithm of the network was used to extract miRNA–disease association features and predict unverified associations between miRNAs and diseases. SCPLPA was compared with four classical methods (MDHGI, NSEMDA, RFMDA and SNMFMDA), and it achieved the best AUC, AUPR, MCC and F1 values among the five methods. Case studies have shown that SCPLPA can effectively identify miRNAs associated with colon neoplasms and kidney neoplasms. In summary, our proposed SCPLPA algorithm is easy to implement and can effectively predict miRNA–disease associations, making it a reliable auxiliary tool for biomedical research.

1 INTRODUCTION

MicroRNAs (miRNAs), with a length of approximately 20–25 nucleotides, are a class of non-coding RNAs that do not encode proteins1-3 but take part in biological processes such as tissue differentiation,4 cell proliferation2, 3 and cell apoptosis.4-7 For example, miR-30a-5p, miR-30d-5p and miR-30c-5p are known to contribute to atherosclerosis and ischemic events, which are related to the development of type 2 diabetes.8 Currently, the understanding of miRNAs is still in its infancy, and the known functions of miRNAs represent only a small fraction of the total. Therefore, identifying miRNAs associated with diseases will help clarify the regulatory mechanisms of miRNAs and the mechanisms underlying disease or tumour development. This work has great significance for human disease prevention and treatment.

In the wake of the discovery of a large number of miRNAs, various databases have been developed to store relevant information about miRNAs. An increasing number of bioinformatics computational methods have been developed to predict associations between miRNAs and diseases and provide assistance for further biological experimental validation. Existing prediction methods can be divided into network, machine learning and matrix factorization-based methods.

Network-based methods mainly aim to construct relationship networks between miRNA and diseases, proteins, environmental factors, etc. Starting from the general hypothesis in biology that ‘functionally similar miRNAs are more likely to be associated with phenotypically similar diseases, and vice versa’, corresponding algorithms are designed based on the topological structure of a relationship network. In 2009, Jiang et al.9 first proposed a computational model based on hypergeometric distribution to predict miRNA–disease associations. They used the relationships between miRNA-regulated target genes to construct an miRNA functional similarity network. Xuan et al.10 and Chen et al.11 predicted unknown miRNA–disease associations by using the K-nearest neighbour algorithm, but the accuracy of these algorithms needs to be improved. Considering that global network similarity can improve prediction accuracy more effectively than local network similarity, Chen et al.12 proposed a method called NetCBI, which uses network consistency to predict associations between miRNAs and diseases. Chen et al. also proposed a series of miRNA–disease association methods13-15 by calculating graph Laplacian scores to obtain network consistency similarity. In 2012, Chen et al.16 proposed a random walk-based association prediction model called RWRMDA, which is simple to implement but cannot predict isolated diseases or new miRNAs without any known associations. Several random walk algorithms, such as MIDP,17 NDBM,18 Mugunga's method,19 GSTRW20 and NPRWR,21 have also been developed and achieved good prediction results. Zhan et al. proposed a model called NDALMA22 based on network distance analysis for predicting lncRNA–miRNA associations, and achieved good predictive performance. However, these algorithms heavily rely on known miRNA–disease (lncRNA–miRNA) associations.

Machine learning-based methods mainly use classification algorithms, such as support vector machines, decision trees, random forests and naive Bayes classifiers, as well as the increasingly popular deep learning methods,23 for lncRNA–disease and miRNA–disease association prediction. For example, Jiang et al.24 and Xu et al.25 achieved good results with support vector machines, but the prediction performance of such models is limited by the classifiers used. Deep learning has also been applied to this field. Zhang et al.,26 Ji et al.,27 Sujamol et al.28 and Peng et al.29 applied deep autoencoders to predict miRNA–disease associations. Tang et al.,30 Dong et al.,31 Xuan et al.,32 Sun et al.33 and Wang et al.34 applied multi-layer convolutional neural networks to predict miRNA–disease, metabolite–disease and lncRNA–miRNA associations. Additionally, the graph attention mechanism35, 36 has been used for association prediction. However, these models still require both positive and negative samples during training and have not solved the problem of selecting negative samples.

Matrix factorization-based methods have also attracted researchers' attention. In 2017, Li et al.37 used matrix completion algorithms to construct an MCMDA model for prediction of miRNA–disease associations. Chen et al. improved MCMDA and developed models such as IMCMDA38 and NCMCMDA.39 Many researchers have combined matrix factorization algorithms with other methods for prediction; in particular, the NIMCGCN40 model combines matrix completion algorithms with graph convolutional networks, the NIMGSA41 model combines graph autoencoders with self-attention mechanisms and the MDA-AENMF42 model combines a five-layer autoencoder. These models can solve the sparsity problem of heterogeneous biological data networks, but they have not effectively addressed the parameter selection problem. Additionally, many scholars have conducted extensive research in related fields,43-48 which is also of reference value.

In summary, existing prediction models can predict miRNA–disease associations but still have shortcomings, such as complex algorithm design, high computational complexity and difficulty in parameter selection. Further research on predicting miRNA–disease associations is thus needed. In the present work, a novel method, SCPLPA, is proposed for predicting miRNA–disease associations; it is developed from the perspective of the structure of heterogeneous graphs and the heterogeneity of their content.

This study constructs a heterogeneous disease similarity network composed of a disease semantic similarity network and a disease Gaussian interaction spectrum kernel similarity network, as well as a heterogeneous miRNA similarity network composed of an miRNA functional similarity network and an miRNA Gaussian interaction spectrum kernel similarity network. The label propagation algorithm is then run on both heterogeneous networks, and the two results are integrated as the initial prediction scores for miRNA–disease associations. The matrices of the heterogeneous disease similarity network and the heterogeneous miRNA similarity network are each projected onto the initial prediction score matrix, and the two spatial projection scores are integrated as the final prediction score. Multiple evaluation metrics, including AUC, AUPR, ACC, MCC and F1, indicate that SCPLPA outperforms other state-of-the-art methods in terms of predictive performance. In addition, SCPLPA can predict associations for new miRNAs and isolated diseases. The AUC values for predicting new miRNAs and isolated diseases are 0.8412 and 0.8289, respectively. Two case studies further validate the ability of SCPLPA to predict unknown disease-related miRNA associations.

2 MATERIALS AND METHODS

2.1 Human miRNA–Disease association data

The experimentally validated miRNA–disease association data are taken from HMDD v2.0,49 and are represented by the adjacency matrix $$ \mathrm{MD}^{n_m\times n_d} $$. If there is a known association between miRNA $$ m_i $$ and disease $$ d_j $$, $$ \mathrm{MD}(i,j) $$ is set to 1; otherwise, it is set to 0. The variables $$ n_m $$ and $$ n_d $$ represent the number of miRNAs and diseases, respectively.
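As a minimal sketch of this encoding, the snippet below builds a small 0/1 association matrix with miRNAs as rows and diseases as columns. The function name `build_association_matrix` and the toy identifiers are illustrative assumptions and are not taken from HMDD.

```python
def build_association_matrix(mirnas, diseases, known_pairs):
    """Return an n_m x n_d 0/1 matrix MD (rows: miRNAs, columns: diseases)."""
    row = {m: i for i, m in enumerate(mirnas)}
    col = {d: j for j, d in enumerate(diseases)}
    md = [[0] * len(diseases) for _ in mirnas]
    for m, d in known_pairs:
        md[row[m]][col[d]] = 1  # known association -> 1, everything else stays 0
    return md


# Toy data: names are hypothetical placeholders, not real HMDD records.
mirnas = ["miR-a", "miR-b", "miR-c"]
diseases = ["disease-x", "disease-y"]
pairs = [("miR-a", "disease-x"), ("miR-c", "disease-y"), ("miR-a", "disease-y")]
MD = build_association_matrix(mirnas, diseases, pairs)
```

In this toy matrix, the second miRNA has no known association, which is the situation the later "new miRNA" experiments simulate.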

2.2 Disease semantic similarity

Many scholars have proposed methods to measure the semantic similarity of diseases based on the disease classification information described in MeSH (Medical Subject Headings).50 In this method, each disease $$ d $$ is represented as a directed acyclic graph (DAG) $$ \mathrm{DAG}(d)=\left(N(d),E(d)\right) $$, where $$ N(d) $$ represents the ancestor node set of disease $$ d $$ (including $$ d $$ itself) and $$ E(d) $$ represents the set of corresponding edges. The similarity between diseases is calculated as follows.

Xuan et al.10 defined the contribution value of an ancestor node $$ d_a $$ of disease $$ d $$ to the disease $$ d $$ as follows:
$$ D_d\left(d_a\right)=-\log \left(\frac{\text{the number of } N(d)}{\text{the number of diseases}}\right) $$ (1)
Based on Equation (1), the semantic value $$ \mathrm{DV}(d) $$ of disease $$ d $$ is defined as:
$$ \mathrm{DV}(d)=\sum_{d_a\in N(d)} D_d\left(d_a\right) $$ (2)
The semantic similarity between diseases $$ d_i $$ and $$ d_j $$ is calculated using the following equation:
$$ \mathrm{DD}(i,j)=\frac{\sum_{d_t\in N(d_i)\cap N(d_j)}\left(D_{d_i}(d_t)+D_{d_j}(d_t)\right)}{\mathrm{DV}(d_i)+\mathrm{DV}(d_j)} $$ (3)

The data are downloaded from the literature51 and stored as the matrix $$ \mathrm{DD}^{n_d\times n_d} $$.
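The sketch below works through Equations (1) to (3) on a toy DAG. It rests on one loudly stated assumption: the contribution of a term is read in the information-content sense, that is, minus the log of the fraction of diseases whose DAG contains that term. Equation (1) as printed does not pin this down, so this is one plausible reading and not necessarily the exact computation behind the downloaded DD matrix. The `parents` dictionary and all term names are hypothetical.

```python
import math


def ancestors(d, parents):
    """N(d): d itself plus all of its ancestors in the DAG."""
    seen, stack = {d}, [d]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def semantic_similarity(parents):
    """Pairwise similarity per Eq. (3); every term must be a key of `parents`."""
    diseases = sorted(parents)
    n_sets = {d: ancestors(d, parents) for d in diseases}
    # ASSUMED reading of Eq. (1): -log(fraction of diseases whose DAG holds t).
    contrib = {
        t: -math.log(sum(t in n_sets[d] for d in diseases) / len(diseases))
        for t in diseases
    }
    dv = {d: sum(contrib[t] for t in n_sets[d]) for d in diseases}  # Eq. (2)
    dd = {}
    for a in diseases:
        for b in diseases:
            shared = n_sets[a] & n_sets[b]
            denom = dv[a] + dv[b]
            # Under this reading both contribution terms in Eq. (3) coincide.
            dd[a, b] = sum(2 * contrib[t] for t in shared) / denom if denom else 0.0
    return dd


# Toy hierarchy (hypothetical terms): A1 and A2 share the parent A.
parents = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
dd = semantic_similarity(parents)
```

In this toy hierarchy the siblings A1 and A2 share the informative ancestor A and receive a positive similarity, whereas A1 and B share only the root, which carries no information under the assumed reading.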

2.3 miRNA functional similarity

The functional similarity between miRNAs is calculated based on the semantic similarity of the diseases associated with them. The specific process is described as follows.52

For any two miRNAs $$ m_i $$ and $$ m_j $$, the sets of diseases associated with them are denoted as:
$$ D^{(m_i)}=\left\{d_{1'},d_{2'},\dots ,d_{m'}\right\}=\left\{d_{i'}\right\}_m\subset D \quad \text{and} \quad D^{(m_j)}=\left\{d_1,d_2,\dots ,d_n\right\}=\left\{d_j\right\}_n\subset D $$
For a given disease $$ d_{i'} $$ and a given set $$ D^{(m_j)} $$ of diseases, the degree of association between them is calculated as:
$$ S\left(d_{i'},D^{(m_j)}\right)=\max_{d_t\in D^{(m_j)}}\mathrm{DD}\left(d_{i'},d_t\right) $$ (4)

Here, $$ \mathrm{DD}\left(d_{i'},d_t\right) $$ represents the semantic similarity value between disease $$ d_{i'} $$ and disease $$ d_t $$.

The functional similarity between any two miRNAs $$ m_i $$ and $$ m_j $$ is then represented as:
$$ mm_{ij}=\frac{\sum_{d_t\in D^{(m_i)}}S\left(d_t,D^{(m_j)}\right)+\sum_{d_t\in D^{(m_j)}}S\left(d_t,D^{(m_i)}\right)}{m+n} $$ (5)

In the above equation, $$ m $$ and $$ n $$ refer to the number of diseases associated with miRNA $$ m_i $$ and miRNA $$ m_j $$, respectively.

The matrix $$ \mathrm{MM}^{n_m\times n_m} $$, whose entries are $$ \mathrm{MM}(i,j)=mm_{ij} $$, is used to represent the miRNA functional similarity matrix.
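A small worked example may make Equations (4) and (5) concrete. The sketch below assumes diseases are referred to by integer index into a toy semantic similarity matrix; the function name `functional_similarity` and the numbers are illustrative only.

```python
def functional_similarity(dis_i, dis_j, dd):
    """Eq. (5): best-match average between the disease sets of two miRNAs."""
    if not dis_i or not dis_j:
        return 0.0  # no associated diseases -> no evidence of similarity

    def best(d, group):
        # Eq. (4): S(d, group) = highest semantic similarity between d and the group
        return max(dd[d][t] for t in group)

    total = sum(best(d, dis_j) for d in dis_i) + sum(best(d, dis_i) for d in dis_j)
    return total / (len(dis_i) + len(dis_j))


# Toy semantic similarity matrix over three diseases (hypothetical values).
DD = [
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
# miRNA i is linked to diseases 0 and 1; miRNA j to diseases 1 and 2.
sim_ij = functional_similarity([0, 1], [1, 2], DD)
sim_ii = functional_similarity([0, 1], [0, 1], DD)
```

Here the four best-match terms are 0.2, 1.0, 1.0 and 0.5, so the similarity is 2.7 / 4 = 0.675, and a miRNA compared with itself scores 1.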

2.4 Gaussian interaction spectral kernel similarity

When the similarity between diseases is measured with the semantic similarity method alone, the similarity of many disease pairs is set directly to 0 because of missing data. The Gaussian kernel spectral similarity53 is introduced to compensate for this drawback. The similarity between diseases $$ d_i $$ and $$ d_j $$ is defined as:
$$ \mathrm{GD}(i,j)=\exp \left(-\gamma_d{\left\Vert \mathrm{mp}(d_i)-\mathrm{mp}(d_j)\right\Vert}^2\right) $$ (6)
where $$ \mathrm{mp}(d_i) $$ represents the interaction profile of disease $$ d_i $$, that is, the binary vector recording which miRNAs are associated with $$ d_i $$ (the $$ i $$-th column of $$ \mathrm{MD} $$), and $$ \gamma_d $$ is the width of the kernel spectrum, defined as:
$$ \gamma_d=\frac{1}{\frac{1}{n_d}\sum_{i=1}^{n_d}{\left\Vert \mathrm{mp}(d_i)\right\Vert}^2} $$ (7)
The Gaussian kernel spectral similarity between miRNAs is calculated using the same method:
$$ \mathrm{GM}(i,j)=\exp \left(-\gamma_m{\left\Vert \mathrm{dp}(m_i)-\mathrm{dp}(m_j)\right\Vert}^2\right) $$ (8)
where $$ \mathrm{dp}(m_i) $$ is the interaction profile of miRNA $$ m_i $$, recording which diseases are associated with it (the $$ i $$-th row of $$ \mathrm{MD} $$), and $$ \gamma_m $$ is the width of the kernel spectrum, defined as follows:
$$ \gamma_m=\frac{1}{\frac{1}{n_m}\sum_{i=1}^{n_m}{\left\Vert \mathrm{dp}(m_i)\right\Vert}^2} $$ (9)
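The kernel above can be sketched in a few lines. The code assumes that each profile is the binary association vector of an entity (a row of MD for a miRNA, a column of MD for a disease), which is the usual form of this kernel and the only reading under which the squared norm of a difference is meaningful. The function name `gip_kernel` and the toy matrix are illustrative.

```python
import math


def gip_kernel(profiles):
    """Gaussian kernel between the rows of `profiles` (one profile per entity)."""
    n = len(profiles)
    sq_norms = [sum(x * x for x in p) for p in profiles]
    mean_sq = sum(sq_norms) / n
    gamma = 1.0 / mean_sq if mean_sq > 0 else 0.0  # kernel width, Eqs. (7)/(9)
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            k[i][j] = math.exp(-gamma * dist2)  # Eqs. (6)/(8)
    return k


# Toy association matrix: 3 miRNAs (rows) x 2 diseases (columns).
MD = [[1, 1], [0, 0], [0, 1]]
GM = gip_kernel(MD)                            # miRNA kernel uses the rows of MD
GD = gip_kernel([list(c) for c in zip(*MD)])   # disease kernel uses the columns
```

For the miRNA kernel in this toy case the mean squared norm is 1, so the width is 1 and the similarity between the first two miRNAs is exp(-2).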

2.5 Integration of disease similarity and miRNA similarity

The semantic similarity between diseases and the Gaussian spectral kernel similarity between diseases are combined to construct the integrated disease similarity by using the following formula:
$$ \mathrm{DD}^f(i,j)=\frac{\mathrm{GD}(i,j)+\mathrm{DD}(i,j)}{2} $$ (10)

This heterogeneous disease similarity network is represented by the matrix $$ \mathrm{DD}^f $$.

The similarity between miRNAs is constructed by integrating the miRNA functional similarity and the Gaussian kernel spectral similarity as follows: if the functional similarity between miRNA $$ m_i $$ and miRNA $$ m_j $$ is 0, then their similarity is taken as the Gaussian kernel spectral similarity $$ \mathrm{GM}(i,j) $$; otherwise, it is taken as the functional similarity $$ \mathrm{MM}(i,j) $$. The formula is as follows:
$$ \mathrm{MM}^f(i,j)=\begin{cases}\mathrm{MM}(i,j), & \text{if } \mathrm{MM}(i,j)\ne 0\\ \mathrm{GM}(i,j), & \text{otherwise}\end{cases} $$ (11)

This heterogeneous miRNA similarity network is represented by the matrix $$ \mathrm{MM}^f $$.
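The two integration rules differ in kind: the disease rule averages the two sources, while the miRNA rule falls back to the Gaussian kernel only where the functional similarity is missing. A minimal sketch, with hypothetical function names and toy values:

```python
def integrate_disease(gd, dd):
    """Eq. (10): element-wise mean of the Gaussian and semantic similarities."""
    return [[(g + s) / 2 for g, s in zip(g_row, s_row)] for g_row, s_row in zip(gd, dd)]


def integrate_mirna(mm, gm):
    """Eq. (11): keep the functional similarity where it is nonzero, else use GM."""
    return [[m if m != 0 else g for m, g in zip(m_row, g_row)] for m_row, g_row in zip(mm, gm)]


# Toy 2 x 2 inputs (hypothetical values).
DD_f = integrate_disease([[1.0, 0.2], [0.2, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
MM_f = integrate_mirna([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.3], [0.3, 1.0]])
```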

2.6 SCPLPA

The algorithm consists of three steps. The first step involves constructing accurate disease similarity networks and miRNA similarity networks by using heterogeneous data sources (Equations 6-11). The second step involves using the label propagation algorithm to obtain estimated scores for miRNA–disease associations. The third step involves using the spatial consistency projection algorithm to obtain precise scores for miRNA–disease associations. The flowchart is shown in Figure 1.

FIGURE 1. The flowchart of the whole modelling procedure.

2.6.1 Estimated scores for miRNA–Disease associations

The label propagation algorithm is applied separately to the heterogeneous disease similarity network and the heterogeneous miRNA similarity network to obtain initial scores for miRNA–disease associations. These initial scores are combined to obtain the estimated scores.

The label propagation algorithm in the heterogeneous disease network is defined as follows:
$$ F_D(t+1)=\left(1-\alpha \right)\mathrm{DD}^{\ast }F_D(t)+\alpha\, \mathrm{MD}^T $$ (12)
In the above equation, $$ F_D(t) $$ represents the result of the $$ t $$-th iteration of the label propagation algorithm; $$ \mathrm{MD}^T $$ is the transpose of the known miRNA–disease association matrix $$ \mathrm{MD} $$; $$ \alpha \in \left[0,1\right] $$; and $$ \mathrm{DD}^{\ast } $$ is the normalized matrix of the heterogeneous disease similarity network $$ \mathrm{DD}^f $$, calculated as follows:
$$ \mathrm{DD}^{\ast }(i,j)=\frac{\mathrm{DD}^f(i,j)}{\sum_{k=1}^{n_d}\mathrm{DD}^f(k,j)+\sum_{k=1}^{n_d}\mathrm{DD}^f(i,k)} $$ (13)

Starting from $$ F_D(0)=\mathrm{MD}^T $$, Equation (12) is iterated until $$ \left|F_D(t+1)-F_D(t)\right|<10^{-6} $$, at which point the iteration stops. The converged result, represented by the matrix $$ F_D^{\infty } $$, is the initial score for miRNA–disease associations based on the heterogeneous disease similarity network.

The label propagation algorithm in the heterogeneous miRNA network is defined by the following iteration equation:
$$ F_M(t+1)=\left(1-\beta \right)\mathrm{MM}^{\ast }F_M(t)+\beta\, \mathrm{MD} $$ (14)
where $$ \beta \in \left[0,1\right] $$ and $$ \mathrm{MM}^{\ast } $$ is the normalized matrix of the heterogeneous miRNA similarity network $$ \mathrm{MM}^f $$, calculated as follows:
$$ \mathrm{MM}^{\ast }(i,j)=\frac{\mathrm{MM}^f(i,j)}{\sum_{k=1}^{n_m}\mathrm{MM}^f(k,j)+\sum_{k=1}^{n_m}\mathrm{MM}^f(i,k)} $$ (15)

Starting from $$ F_M(0)=\mathrm{MD} $$, Equation (14) is iterated until $$ \left|F_M(t+1)-F_M(t)\right|<10^{-6} $$, at which point the iteration is terminated. The probability space then reaches a stable state, denoted as $$ F_M^{\infty } $$. This matrix is the initial score for miRNA–disease associations based on the heterogeneous miRNA similarity network.

The predicted results $$ F_M^{\infty } $$ and $$ F_D^{\infty } $$, obtained by running the label propagation algorithm on the two networks, are integrated as the estimated score for miRNA–disease associations. Because $$ F_D^{\infty } $$ has size $$ n_d\times n_m $$ and $$ F_M^{\infty } $$ has size $$ n_m\times n_d $$, $$ F_D^{\infty } $$ is transposed before the two are combined:
$$ F^e=\left(1-\delta \right){\left(F_D^{\infty }\right)}^T+\delta\, F_M^{\infty } $$ (16)
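The propagation step can be sketched in plain Python as below. Three points are assumptions made for this illustration: the normalization divides each entry by its column sum plus its row sum, as Equations (13) and (15) read; the stopping test uses the largest absolute element-wise change; and the disease-side result is transposed before the weighted sum so that both terms have miRNAs as rows. The function names, the `max_iter` guard and the toy matrices are hypothetical, and the default weights are the values selected later in the parameter study.

```python
def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def normalize(s):
    """Divide each entry by (its row sum + its column sum)."""
    row = [sum(r) for r in s]
    col = [sum(c) for c in zip(*s)]
    n = len(s)
    return [[s[i][j] / (row[i] + col[j]) if row[i] + col[j] else 0.0
             for j in range(n)] for i in range(n)]


def propagate(sim, labels, alpha, tol=1e-6, max_iter=1000):
    """Iterate F <- (1 - alpha) * S* F + alpha * Y until the largest change < tol."""
    s = normalize(sim)
    f = [r[:] for r in labels]
    for _ in range(max_iter):
        sf = matmul(s, f)
        nxt = [[(1 - alpha) * sf[i][j] + alpha * labels[i][j]
                for j in range(len(labels[0]))] for i in range(len(labels))]
        change = max(abs(x - y) for rn, rf in zip(nxt, f) for x, y in zip(rn, rf))
        f = nxt
        if change < tol:
            break
    return f


def estimated_scores(mm_f, dd_f, md, alpha=0.9, beta=0.9, delta=0.9):
    f_d = propagate(dd_f, transpose(md), alpha)  # n_d x n_m, disease network
    f_m = propagate(mm_f, md, beta)              # n_m x n_d, miRNA network
    f_d_t = transpose(f_d)                       # align shapes before mixing
    return [[(1 - delta) * a + delta * b for a, b in zip(ra, rb)]
            for ra, rb in zip(f_d_t, f_m)]


# Toy inputs (hypothetical values): 3 miRNAs, 2 diseases.
MD = [[1, 1], [0, 0], [0, 1]]
MM_f = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
DD_f = [[1.0, 0.4], [0.4, 1.0]]
F_e = estimated_scores(MM_f, DD_f, MD)
```

In this toy run the second miRNA has no known association, yet it still receives small positive scores through its similar neighbours, which is the behaviour that lets the method score unlabelled nodes.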

2.6.2 Accurate scores for miRNA–Disease Associations

In this stage, the spatial consistency projection algorithm is used to calculate the final predicted scores. The spatial consistency projection based on the miRNA network rests on the following idea: in the integrated miRNA similarity matrix, if the miRNAs that are highly similar to miRNA $$ m_i $$ are themselves strongly associated with disease $$ d_j $$, then the association between miRNA $$ m_i $$ and disease $$ d_j $$ receives a high credibility score. To calculate this score, the estimated association information between disease $$ d_j $$ and all miRNAs $$ m_k $$ $$ \left(k=1,2,\dots ,n_m\right) $$ obtained in the previous stage is combined with the similarity between miRNA $$ m_i $$ and the other miRNAs. The formula is as follows:
$$ \mathrm{MD}_{pm}(i,j)=\frac{\mathrm{MM}^f(i,:)\times F^e(:,j)}{\left\Vert F^e(:,j)\right\Vert } $$ (17)

In the above formula, $$ \left\Vert F^e(:,j)\right\Vert $$ is the 2-norm of $$ F^e(:,j) $$.

A similar approach is used to calculate the predicted scores of the spatial consistency projection based on the disease network. Here, $$ \mathrm{MD}_{pd} $$ is indexed by disease $$ j $$ and miRNA $$ i $$:
$$ \mathrm{MD}_{pd}(j,i)=\frac{\mathrm{DD}^f(j,:)\times {\left(F^e\right)}^T(:,i)}{\left\Vert {\left(F^e\right)}^T(:,i)\right\Vert } $$ (18)
Finally, $$ \mathrm{MD}_{pm} $$ and the transpose of $$ \mathrm{MD}_{pd} $$ are combined to obtain the final prediction score:
$$ \mathrm{MD}^{\ast }=\varepsilon\, \mathrm{MD}_{pm}+\left(1-\varepsilon \right)\mathrm{MD}_{pd}^T $$ (19)
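The projection step can be illustrated with the sketch below. It assumes that the disease-side projection is stored with diseases as rows and is transposed before mixing, which is the arrangement under which the shapes in the final weighted sum agree. The helper names `project` and `final_scores` are hypothetical, and a zero norm is mapped to a score of 0 as a guard that the text does not discuss.

```python
import math


def project(sim, scores):
    """out[i][j] = sim[i, :] . scores[:, j] / ||scores[:, j]||_2 (0 if the norm is 0)."""
    cols = list(zip(*scores))
    norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
    return [[sum(a * b for a, b in zip(row, c)) / n if n else 0.0
             for c, n in zip(cols, norms)] for row in sim]


def final_scores(mm_f, dd_f, f_e, eps=0.6):
    md_pm = project(mm_f, f_e)                 # n_m x n_d, miRNA-side projection
    f_e_t = [list(r) for r in zip(*f_e)]
    md_pd = project(dd_f, f_e_t)               # n_d x n_m, disease-side projection
    md_pd_t = [list(r) for r in zip(*md_pd)]   # back to n_m x n_d
    return [[eps * a + (1 - eps) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(md_pm, md_pd_t)]


# Toy case (hypothetical values): 2 miRNAs, 1 disease, identity miRNA similarity.
scores = final_scores([[1.0, 0.0], [0.0, 1.0]], [[1.0]], [[3.0], [4.0]])
```

With the estimated column (3, 4), the norm is 5, so the miRNA-side projections are 0.6 and 0.8; the disease-side projections are both 1, and the weighted sums with a weight of 0.6 are 0.76 and 0.88.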

3 RESULTS

3.1 Evaluation metrics

We evaluated the performance of SCPLPA using LOOCV (leave-one-out cross-validation), where each known miRNA–disease association was selected as the test sample once, with all other miRNA–disease associations used as the training set, until all miRNA–disease associations had been tested once. By setting different thresholds and plotting the ROC (receiver operating characteristic) curve with TPR (true positive rate or sensitivity) as the y-axis and FPR (false positive rate or 1 − specificity) as the x-axis, the AUC (area under the ROC curve) was calculated. The curve plotted with recall on the x-axis and precision on the y-axis is known as the PR (precision-recall) curve. The area under the PR curve is referred to as the AUPR (area under the PR curve) value.
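The area under the ROC curve equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise form instead of an explicit threshold sweep; it is a generic illustration, not the evaluation script behind the reported results, and the function name `auc` and the toy scores are assumptions.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy ranking: one positive (0.3) falls below one negative (0.8).
value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Three of the four positive-negative pairs are ordered correctly here, giving 0.75.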

The formulas for TPR, FPR, precision and recall are as follows:
$$ \mathrm{TPR}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$ (20)
$$ \mathrm{FPR}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}} $$ (21)
$$ \mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$ (22)
$$ \mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$ (23)

In the above formulas, TP (true positive) refers to the number of correctly predicted positive samples, that is, the number of positive samples predicted as positive. FP (false positive) refers to the number of incorrectly predicted positive samples, that is, the number of negative samples predicted as positive. TN (true negative) refers to the number of correctly predicted negative samples, that is, the number of negative samples predicted as negative. FN (false negative) refers to the number of incorrectly predicted negative samples, that is, the number of positive samples predicted as negative.

In addition to AUC and AUPR, we also used other metrics, including accuracy (ACC), F1-score (F1) and the Matthews correlation coefficient (MCC), to evaluate the performance of the model. They are defined as follows:
$$ \mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} $$ (24)
$$ \mathrm{F}1=\frac{2}{\frac{1}{\mathrm{Precision}}+\frac{1}{\mathrm{Recall}}} $$ (25)
$$ \mathrm{MCC}=\frac{\mathrm{TP}\times \mathrm{TN}-\mathrm{FP}\times \mathrm{FN}}{\sqrt{\left(\mathrm{TP}+\mathrm{FP}\right)\left(\mathrm{TP}+\mathrm{FN}\right)\left(\mathrm{TN}+\mathrm{FP}\right)\left(\mathrm{TN}+\mathrm{FN}\right)}} $$ (26)
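For a single threshold, all of these quantities follow from the four confusion counts. The sketch below computes them with F1 taken as the usual harmonic mean of precision and recall, 2PR / (P + R); the function name, the zero-denominator guards and the toy counts are illustrative assumptions.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Threshold metrics from the four confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # identical to TPR
    fpr = fp / (fp + tn) if fp + tn else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr,
            "acc": acc, "f1": f1, "mcc": mcc}


# Toy counts (hypothetical): 8 TP, 2 FP, 6 TN, 4 FN.
m = confusion_metrics(tp=8, fp=2, tn=6, fn=4)
```

For these counts, precision is 0.8, recall is 2/3, accuracy is 0.7, F1 is 8/11 and MCC is 40 divided by the square root of 9600, roughly 0.408.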

3.2 Effect of parameter selection

In Equations 12 and 14, $$ \alpha $$ and $$ \beta $$ represent the probabilities of receiving the initial label information in the label propagation algorithm for miRNA–disease associations, while $$ 1-\alpha $$ and $$ 1-\beta $$ control the rate at which information from neighbours is retained. For simplicity, $$ \alpha $$ and $$ \beta $$ are set to the same value. The estimated score for miRNA–disease associations is calculated by weighting the label propagation results $$ F_M^{\infty } $$ and $$ F_D^{\infty } $$ from the heterogeneous miRNA network and the heterogeneous disease network, with $$ \delta $$ representing the weight between the two prediction results. The accurate score for miRNA–disease associations is calculated by weighting the prediction scores based on the miRNA network spatial consistency projection and the disease network spatial consistency projection, with $$ \varepsilon $$ representing the weight between the two prediction results. This section mainly discusses the effect of these parameters on the predictive performance of SCPLPA.

In the first step, the optimal values for $$ \alpha $$ and $$ \beta $$ are determined. Parameters $$ \delta $$ and $$ \varepsilon $$ are initially fixed at 0.5. Parameter $$ \alpha $$ (equal to $$ \beta $$) is increased from 0.1 to 0.9 with a step size of 0.1, and leave-one-out cross-validation is performed to calculate the AUC (Figure 2). When $$ \alpha =\beta =0.9 $$, the AUC value is maximized at 0.9335. Therefore, parameters $$ \alpha $$ and $$ \beta $$ are set to 0.9. The optimal value for $$ \delta $$ is then determined. With $$ \alpha =\beta =0.9 $$ and $$ \varepsilon $$ kept at 0.5, the parameter $$ \delta $$ is increased from 0.1 to 0.9 with a step size of 0.1, and the cross-validation is performed again to calculate the AUC values. When $$ \delta $$ is 0.9, the AUC is maximized at 0.9346 (Figure 2). Therefore, $$ \delta $$ is set to 0.9. Finally, with $$ \alpha =\beta =0.9 $$ and $$ \delta =0.9 $$, the parameter $$ \varepsilon $$ is increased from 0.1 to 0.9 with a step size of 0.1. When $$ \varepsilon $$ is 0.6, the AUC value is maximized at 0.9356. Thus, the following optimal parameter values are obtained: $$ \alpha =\beta =0.9 $$, $$ \delta =0.9 $$, $$ \varepsilon =0.6 $$.
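The tuning procedure is a one-parameter-at-a-time search, not a full grid search: each parameter is swept while the others stay at their current values. The sketch below shows that pattern only. The `evaluate` callable stands in for a full LOOCV run returning an AUC, and `toy_auc` is a made-up smooth function whose peak is deliberately placed at the final values reported above; neither reproduces the actual experiments.

```python
def coordinate_search(evaluate, grid, start):
    """Tune one parameter at a time, holding the others at their current values."""
    best = dict(start)
    for name in start:  # the key order of `start` fixes the tuning order
        scored = [(evaluate({**best, name: v}), v) for v in grid]
        best[name] = max(scored)[1]  # keep the grid value with the highest score
    return best


def toy_auc(p):
    # Hypothetical stand-in for LOOCV AUC, peaking at (0.9, 0.9, 0.6).
    return -((p["alpha_beta"] - 0.9) ** 2 + (p["delta"] - 0.9) ** 2 + (p["eps"] - 0.6) ** 2)


grid = [round(0.1 * k, 1) for k in range(1, 10)]       # 0.1, 0.2, ..., 0.9
start = {"alpha_beta": 0.5, "delta": 0.5, "eps": 0.5}  # initial setting
best = coordinate_search(toy_auc, grid, start)
```

This needs 27 evaluations instead of the 729 a full three-dimensional grid would take, at the cost of possibly missing interactions between parameters.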

FIGURE 2. Influence of parameter variation on model predictive accuracy.

3.3 Comparison with state-of-the-art methods

To the best of our knowledge, MDHGI,54 NSEMDA,55 RFMDA56 and SNMFMDA57 are excellent computational methods for predicting miRNA–disease associations. These methods use information similar to that used by SCPLPA and can also predict associations for isolated diseases and new miRNAs. Here, SCPLPA is compared with these methods, each run with the parameter settings described in its original paper. LOOCV is performed, and the AUC value is used as the primary performance metric (Figure 3). The AUC values for SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.9356, 0.8945, 0.8899, 0.8891 and 0.9007, respectively.

To strengthen the comparison, we also compared SCPLPA with the other models on AUPR, ACC, MCC and F1 (Table 1). In the following, each percentage difference is expressed relative to the SCPLPA value. The AUPR value of SCPLPA is 0.4596, while those of MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.3367, 0.3198, 0.3345 and 0.3489, respectively; SCPLPA is higher than these methods by 26.74%, 30.42%, 27.22% and 24.09%, respectively. The ACC values of SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.5503, 0.5607, 0.5321, 0.5215 and 0.5317, respectively; SCPLPA is 1.89% lower than MDHGI but higher than NSEMDA, RFMDA and SNMFMDA by 3.31%, 5.23% and 3.38%, respectively. The MCC values of SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.1762, 0.1507, 0.1472, 0.1356 and 0.1681, respectively; SCPLPA is higher than the other methods by 14.47%, 16.46%, 23.04% and 4.60%, respectively. The F1 values of SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.1102, 0.1023, 0.1054, 0.1012 and 0.1045, respectively; SCPLPA is higher than the other methods by 7.17%, 4.36%, 8.17% and 5.17%, respectively. On these indicators, SCPLPA achieves the best AUC, AUPR, MCC and F1 values and the second-best ACC. Overall, SCPLPA outperforms the other prediction models in terms of predictive performance.
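The percentage gaps quoted above can be reproduced from the Table 1 values when each gap is divided by the SCPLPA value, not by the competitor's value. A short check, using a hypothetical helper `relative_gap`:

```python
def relative_gap(ours, other):
    """Gap between two metric values as a percentage of `ours`, to 2 decimals."""
    return round(100 * (ours - other) / ours, 2)


# AUPR values of the four comparison methods, as listed in Table 1.
aupr = {"MDHGI": 0.3367, "NSEMDA": 0.3198, "RFMDA": 0.3345, "SNMFMDA": 0.3489}
gaps = {name: relative_gap(0.4596, value) for name, value in aupr.items()}

# ACC is the one metric where SCPLPA trails MDHGI, so the gap is negative.
acc_gap_mdhgi = relative_gap(0.5503, 0.5607)
```

For example, the AUPR gap to MDHGI is (0.4596 − 0.3367) / 0.4596, which rounds to 26.74%.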

FIGURE 3. The ROC curves and AUC values of SCPLPA compared with other methods.
TABLE 1. Comparative experimental analysis of SCPLPA and other four methods.
Method AUC AUPR ACC MCC F1
SCPLPA 0.9356 0.4596 0.5503 0.1762 0.1102
MDHGI 0.8945 0.3367 0.5607 0.1507 0.1023
NSEMDA 0.8899 0.3198 0.5321 0.1472 0.1054
RFMDA 0.8891 0.3345 0.5215 0.1356 0.1012
SNMFMDA 0.9007 0.3489 0.5317 0.1681 0.1045
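The percentage differences quoted in the text can be reproduced from Table 1. The following sketch assumes each difference is normalised by the SCPLPA value, which is the convention that matches the reported figures; `relative_gap` is an illustrative helper name.

```python
# Values copied from Table 1.
table1 = {
    "SCPLPA":  {"AUC": 0.9356, "AUPR": 0.4596, "ACC": 0.5503, "MCC": 0.1762, "F1": 0.1102},
    "MDHGI":   {"AUC": 0.8945, "AUPR": 0.3367, "ACC": 0.5607, "MCC": 0.1507, "F1": 0.1023},
    "NSEMDA":  {"AUC": 0.8899, "AUPR": 0.3198, "ACC": 0.5321, "MCC": 0.1472, "F1": 0.1054},
    "RFMDA":   {"AUC": 0.8891, "AUPR": 0.3345, "ACC": 0.5215, "MCC": 0.1356, "F1": 0.1012},
    "SNMFMDA": {"AUC": 0.9007, "AUPR": 0.3489, "ACC": 0.5317, "MCC": 0.1681, "F1": 0.1045},
}

def relative_gap(metric, other, reference="SCPLPA"):
    """Percentage by which `reference` exceeds `other`, relative to the reference value.

    A negative result means the reference method is lower on that metric.
    """
    ref = table1[reference][metric]
    return 100.0 * (ref - table1[other][metric]) / ref

for metric in ("AUPR", "ACC", "MCC", "F1"):
    gaps = {name: round(relative_gap(metric, name), 2)
            for name in table1 if name != "SCPLPA"}
    print(metric, gaps)
```

Normalising by the competitor's value instead would give larger percentages (for example, about 36.5% rather than 26.74% for AUPR against MDHGI), so the choice of denominator should be kept in mind when reading the comparison.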

3.4 Prediction of new miRNAs and isolated diseases

New miRNAs are miRNAs that have not yet been associated with specific diseases or biological functions in the existing literature or databases. They may be newly discovered, or their functions and mechanisms may not be fully understood. Rapid and accurate identification of the relationships between new miRNAs and diseases would greatly enhance our understanding of the molecular mechanisms of diseases. However, predicting associations for new miRNAs is a significant challenge because no association information is known for them; a model that depends on such information therefore cannot be directly used for prediction. To evaluate the performance of SCPLPA in this setting, the following procedure is performed once for each miRNA: all known associations between the query miRNA and all diseases are removed so that it is simulated as a new miRNA, and SCPLPA is then used to predict its associated diseases. This process is repeated until every miRNA has served as the test sample once. The prediction results are evaluated using the ROC curve and AUC value. Figure 4 shows that SCPLPA achieves an AUC value of 0.8412, indicating good performance in predicting new miRNA–disease associations.

FIGURE 4. Results of SCPLPA for new miRNAs and isolated diseases.

Diseases for which no miRNA associations are known are referred to as isolated diseases. Predicting associations between isolated diseases and miRNAs is a challenging but promising research area. To simulate an isolated disease, all association data between the disease to be predicted and all miRNAs are removed, and SCPLPA is used for prediction; this is repeated until each disease has been tested once. Figure 4 shows that the resulting AUC value is 0.8289, indicating that SCPLPA can effectively predict associations between isolated diseases and miRNAs.
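The two hold-out protocols described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: `neighbour_scores` is a placeholder similarity-weighted predictor standing in for SCPLPA (it is not the authors' model), the toy matrices are invented, and pooling all held-out scores into a single AUC is an assumption, since the pooling scheme is not specified here.

```python
from typing import List, Sequence

Matrix = List[List[float]]

def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def neighbour_scores(a_train: Matrix, sim: Matrix, i: int) -> List[float]:
    """Placeholder predictor (NOT SCPLPA): similarity-weighted vote of the other rows."""
    return [sum(sim[i][k] * a_train[k][j] for k in range(len(a_train)) if k != i)
            for j in range(len(a_train[0]))]

def new_mirna_auc(a: Matrix, sim: Matrix, predictor=neighbour_scores) -> float:
    """Hide every association of one miRNA (row) at a time and score it."""
    scores, labels = [], []
    for i, row in enumerate(a):
        if sum(row) == 0:
            continue  # no known associations to hold out
        a_train = [r[:] for r in a]
        a_train[i] = [0] * len(row)  # simulate miRNA i as a new miRNA
        scores.extend(predictor(a_train, sim, i))
        labels.extend(int(v) for v in row)
    return auc(scores, labels)

def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]

def isolated_disease_auc(a: Matrix, dis_sim: Matrix) -> float:
    """Same protocol applied to columns: hide every association of one disease."""
    return new_mirna_auc(transpose(a), dis_sim)

# Invented toy data: 4 miRNAs x 3 diseases, plus similarity matrices.
A_toy = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
mir_sim_toy = [[1.0, 0.9, 0.1, 0.1], [0.9, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.8], [0.1, 0.1, 0.8, 1.0]]
dis_sim_toy = [[1.0, 0.1, 0.1], [0.1, 1.0, 0.7], [0.1, 0.7, 1.0]]

print(new_mirna_auc(A_toy, mir_sim_toy))
print(isolated_disease_auc(A_toy, dis_sim_toy))
```

Because the held-out row (or column) is zeroed before prediction, the predictor can only rely on similarity to other miRNAs (or diseases), which is what makes this a test of the new-miRNA and isolated-disease settings rather than of ordinary LOOCV.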

3.5 Case analysis

Colon and kidney neoplasms were selected as case studies to demonstrate the predictive ability of the proposed SCPLPA model for disease–miRNA associations. All of the prediction results were validated in two independent databases, namely, HMDD v3.2 (ref. 58) and dbDEMC 2.0 (ref. 59).

Colon neoplasm is a tumour that poses a threat to human health and presents a complex pathological and physiological landscape.60 Identifying miRNAs associated with colon neoplasms plays a crucial role in understanding the pathogenesis, treatment and prognosis of these tumours. The HMDD v2.0 database contains 78 known miRNA–colon neoplasm associations, which were used as training samples to predict potential miRNAs associated with colon neoplasms. Table 2 lists the top 50 miRNAs predicted by the SCPLPA model to be related to colon neoplasms, together with their supporting evidence. Among these miRNAs, 49 candidates were confirmed in the HMDD v3.2 or dbDEMC 2.0 databases, and only hsa-mir-367 was not validated. We expect that future experiments will further clarify the relationship of these miRNAs to colon neoplasms.

TABLE 2. The top 50 colon neoplasm-related miRNAs.
Rank miRNA name Evidences Rank miRNA name Evidences
1 hsa-mir-135a dbDEMC 26 hsa-mir-34b dbDEMC
2 hsa-mir-135b HMDD, dbDEMC 27 hsa-mir-193a dbDEMC
3 hsa-mir-18b HMDD, dbDEMC 28 hsa-mir-425 dbDEMC
4 hsa-mir-625 dbDEMC 29 hsa-mir-129 dbDEMC
5 hsa-mir-139 dbDEMC 30 hsa-mir-99a dbDEMC
6 hsa-mir-185 dbDEMC 31 hsa-mir-149 dbDEMC
7 hsa-mir-375 HMDD, dbDEMC 32 hsa-mir-34c dbDEMC
8 hsa-mir-497 dbDEMC 33 hsa-mir-409 dbDEMC
9 hsa-mir-215 HMDD, dbDEMC 34 hsa-mir-373 dbDEMC
10 hsa-mir-25 HMDD, dbDEMC 35 hsa-mir-103a dbDEMC
11 hsa-mir-27a HMDD, dbDEMC 36 hsa-mir-429 HMDD, dbDEMC
12 hsa-mir-224 HMDD, dbDEMC 37 hsa-mir-124 dbDEMC
13 hsa-mir-302c dbDEMC 38 hsa-mir-96 HMDD, dbDEMC
14 hsa-mir-186 dbDEMC 39 hsa-mir-148a HMDD, dbDEMC
15 hsa-mir-338 dbDEMC 40 hsa-mir-339 HMDD, dbDEMC
16 hsa-mir-151a dbDEMC 41 hsa-mir-93 HMDD, dbDEMC
17 hsa-mir-183 dbDEMC 42 hsa-mir-182 dbDEMC
18 hsa-mir-542 dbDEMC 43 hsa-mir-335 HMDD, dbDEMC
19 hsa-mir-345 dbDEMC 44 hsa-mir-320a dbDEMC
20 hsa-mir-708 dbDEMC 45 hsa-mir-203 HMDD, dbDEMC
21 hsa-mir-194 HMDD, dbDEMC 46 hsa-mir-100 dbDEMC
22 hsa-mir-130a HMDD, dbDEMC 47 hsa-mir-153 dbDEMC
23 hsa-mir-199b dbDEMC 48 hsa-mir-526a dbDEMC
24 hsa-mir-200a HMDD, dbDEMC 49 hsa-mir-302d dbDEMC
25 hsa-mir-367 Unconfirmed 50 hsa-mir-95 dbDEMC
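The validation step behind Table 2 amounts to counting how many top-ranked candidates have evidence in at least one database. A minimal sketch, using only ranks 21 to 25 of Table 2 as input; the helper name `confirmation_rate` and the dictionary layout are illustrative assumptions.

```python
def confirmation_rate(evidence_by_mirna):
    """Return (number confirmed, fraction confirmed) for a set of candidates.

    A candidate counts as confirmed if at least one database lists it.
    """
    confirmed = [name for name, sources in evidence_by_mirna.items() if sources]
    return len(confirmed), len(confirmed) / len(evidence_by_mirna)

# Ranks 21-25 of Table 2 (colon neoplasms); an empty set means "Unconfirmed".
ranks_21_to_25 = {
    "hsa-mir-194": {"HMDD", "dbDEMC"},
    "hsa-mir-130a": {"HMDD", "dbDEMC"},
    "hsa-mir-199b": {"dbDEMC"},
    "hsa-mir-200a": {"HMDD", "dbDEMC"},
    "hsa-mir-367": set(),
}

n_confirmed, rate = confirmation_rate(ranks_21_to_25)
print(n_confirmed, rate)
```

Applied to the full table, the same count gives 49 confirmed candidates out of 50, that is, a confirmation rate of 98%.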

Kidney neoplasm is a common tumour with an increasing incidence rate. It has multiple histological subtypes, each with its own unique molecular characteristics. The most common subtype is clear cell renal cell carcinoma, which accounts for 75% of all cases. The 5-year survival rate of clear cell renal cell carcinoma is less than 10%.61 Hence, predicting miRNAs associated with kidney neoplasms is of great practical significance.

The HMDD v2.0 database contains only seven known miRNA–kidney neoplasm associations. These associations were used as known information to run SCPLPA and predict potential miRNAs associated with kidney neoplasms, with the aim of discovering new molecular associations that could serve as prognostic or therapeutic markers. As shown in Table 3, all of the top 50 predicted kidney neoplasm-related miRNAs have been confirmed in HMDD v3.2 or dbDEMC 2.0. The two cases demonstrate that the SCPLPA model performs satisfactorily in predicting new potential miRNA–disease associations.

TABLE 3. The top 50 kidney neoplasm-related miRNAs.
Rank miRNA name Evidences Rank miRNA name Evidences
1 hsa-mir-155 HMDD, dbDEMC 26 hsa-mir-134 dbDEMC
2 hsa-mir-146a dbDEMC 27 hsa-mir-7 dbDEMC
3 hsa-mir-122 HMDD, dbDEMC 28 hsa-mir-17 HMDD, dbDEMC
4 hsa-mir-34a HMDD, dbDEMC 29 hsa-mir-142 dbDEMC
5 hsa-mir-221 dbDEMC 30 hsa-mir-708 HMDD
6 hsa-mir-125b dbDEMC 31 hsa-mir-9 dbDEMC
7 hsa-mir-16 dbDEMC 32 hsa-mir-184 dbDEMC
8 hsa-mir-29a dbDEMC 33 hsa-mir-106b dbDEMC
9 hsa-mir-210 HMDD, dbDEMC 34 hsa-mir-148a dbDEMC
10 hsa-mir-31 dbDEMC 35 hsa-mir-19a dbDEMC
11 hsa-mir-29b dbDEMC 36 hsa-mir-27a HMDD, dbDEMC
12 hsa-mir-199a HMDD, dbDEMC 37 hsa-mir-1207 dbDEMC
13 hsa-mir-26a dbDEMC 38 hsa-mir-19b dbDEMC
14 hsa-mir-145 dbDEMC 39 hsa-mir-373 dbDEMC
15 hsa-mir-133a dbDEMC 40 hsa-let-7b dbDEMC
16 hsa-mir-222 dbDEMC 41 hsa-mir-200a HMDD, dbDEMC
17 hsa-mir-196a dbDEMC 42 hsa-mir-126 HMDD, dbDEMC
18 hsa-mir-206 dbDEMC 43 hsa-mir-137 dbDEMC
19 hsa-mir-20a dbDEMC 44 hsa-mir-30b dbDEMC
20 hsa-mir-1 dbDEMC 45 hsa-mir-34c dbDEMC
21 hsa-mir-200b dbDEMC 46 hsa-mir-212 dbDEMC
22 hsa-mir-15b dbDEMC 47 hsa-let-7a dbDEMC
23 hsa-mir-218 dbDEMC 48 hsa-mir-92a dbDEMC
24 hsa-mir-29c dbDEMC 49 hsa-mir-124 dbDEMC
25 hsa-mir-223 dbDEMC 50 hsa-mir-204 dbDEMC

To test the predictive performance of SCPLPA for isolated diseases, all known miRNA associations of the disease under study were removed before running SCPLPA. For colon neoplasms, the 78 known colon neoplasm–miRNA associations were deleted and SCPLPA was used to predict potential miRNA–colon neoplasm associations. All of the top 50 predicted miRNAs were supported by evidence in the HMDD v3.2 or dbDEMC databases (Table 4). Similarly, the seven known kidney neoplasm–miRNA associations were deleted, and the SCPLPA model was used to predict kidney neoplasm-related miRNAs. All of the top 50 predicted associations were supported by evidence in HMDD v3.2 or dbDEMC (Table 5).

TABLE 4. The top 50 colon neoplasm-related miRNA candidates predicted by SCPLPA after removing all known colon neoplasm–miRNA associations, with the evidence confirming these associations.
Rank miRNA name Evidences Rank miRNA name Evidences
1 hsa-mir-145 HMDD, dbDEMC 26 hsa-let-7b HMDD, dbDEMC
2 hsa-mir-218 HMDD, dbDEMC 27 hsa-mir-101 HMDD, dbDEMC
3 hsa-mir-200c HMDD, dbDEMC 28 hsa-mir-19a HMDD, dbDEMC
4 hsa-mir-126 HMDD, dbDEMC 29 hsa-mir-221 HMDD, dbDEMC
5 hsa-mir-125b HMDD, dbDEMC 30 hsa-mir-210 HMDD, dbDEMC
6 hsa-let-7a HMDD, dbDEMC 31 hsa-mir-124 dbDEMC
7 hsa-mir-34a HMDD, dbDEMC 32 hsa-mir-222 HMDD, dbDEMC
8 hsa-mir-200b HMDD, dbDEMC 33 hsa-mir-148a HMDD, dbDEMC
9 hsa-mir-21 HMDD, dbDEMC 34 hsa-mir-203 HMDD, dbDEMC
10 hsa-mir-16 HMDD, dbDEMC 35 hsa-let-7c HMDD, dbDEMC
11 hsa-mir-143 HMDD, dbDEMC 36 hsa-let-7d HMDD, dbDEMC
12 hsa-mir-31 HMDD, dbDEMC 37 hsa-mir-25 HMDD, dbDEMC
13 hsa-mir-34c dbDEMC 38 hsa-mir-214 dbDEMC
14 hsa-mir-27a HMDD, dbDEMC 39 hsa-mir-199a dbDEMC
15 hsa-mir-155 HMDD, dbDEMC 40 hsa-mir-135a dbDEMC
16 hsa-mir-183 dbDEMC 41 hsa-mir-181a HMDD, dbDEMC
17 hsa-mir-20a HMDD, dbDEMC 42 hsa-mir-196a HMDD, dbDEMC
18 hsa-mir-200a HMDD, dbDEMC 43 hsa-mir-18b HMDD, dbDEMC
19 hsa-mir-17 HMDD, dbDEMC 44 hsa-mir-125a HMDD, dbDEMC
20 hsa-mir-92a HMDD, dbDEMC 45 hsa-mir-146b dbDEMC
21 hsa-mir-34b dbDEMC 46 hsa-mir-205 HMDD, dbDEMC
22 hsa-mir-375 HMDD, dbDEMC 47 hsa-mir-107 HMDD, dbDEMC
23 hsa-mir-182 dbDEMC 48 hsa-mir-142 HMDD, dbDEMC
24 hsa-mir-18a HMDD, dbDEMC 49 hsa-mir-127 HMDD, dbDEMC
25 hsa-mir-10b HMDD, dbDEMC 50 hsa-mir-9 dbDEMC
TABLE 5. The top 50 kidney neoplasm-related miRNA candidates predicted by SCPLPA after removing all known kidney neoplasm–miRNA associations, with the evidence confirming these associations.
Rank miRNA name Evidences Rank miRNA name Evidences
1 hsa-mir-145 dbDEMC 26 hsa-mir-18a dbDEMC
2 hsa-mir-218 dbDEMC 27 hsa-let-7b dbDEMC
3 hsa-mir-200c HMDD, dbDEMC 28 hsa-mir-10b dbDEMC
4 hsa-mir-126 HMDD, dbDEMC 29 hsa-mir-182 dbDEMC
5 hsa-mir-125b dbDEMC 30 hsa-mir-221 dbDEMC
6 hsa-mir-200b dbDEMC 31 hsa-mir-210 HMDD, dbDEMC
7 hsa-mir-34a HMDD, dbDEMC 32 hsa-let-7c dbDEMC
8 hsa-let-7a dbDEMC 33 hsa-mir-203 HMDD, dbDEMC
9 hsa-mir-21 HMDD, dbDEMC 34 hsa-mir-375 dbDEMC
10 hsa-mir-34c dbDEMC 35 hsa-mir-127 dbDEMC
11 hsa-mir-200a HMDD, dbDEMC 36 hsa-mir-9 dbDEMC
12 hsa-mir-143 dbDEMC 37 hsa-mir-124 dbDEMC
13 hsa-mir-20a dbDEMC 38 hsa-let-7f dbDEMC
14 hsa-mir-27a HMDD, dbDEMC 39 hsa-mir-199a HMDD, dbDEMC
15 hsa-mir-92a dbDEMC 40 hsa-let-7i dbDEMC
16 hsa-mir-155 HMDD, dbDEMC 41 hsa-mir-222 dbDEMC
17 hsa-mir-16 dbDEMC 42 hsa-mir-19b dbDEMC
18 hsa-mir-101 dbDEMC 43 hsa-mir-100 dbDEMC
19 hsa-mir-17 HMDD, dbDEMC 44 hsa-mir-142 dbDEMC
20 hsa-mir-31 dbDEMC 45 hsa-mir-214 HMDD, dbDEMC
21 hsa-let-7d dbDEMC 46 hsa-mir-146b dbDEMC
22 hsa-mir-183 HMDD, dbDEMC 47 hsa-mir-223 dbDEMC
23 hsa-mir-34b dbDEMC 48 hsa-mir-125a dbDEMC
24 hsa-mir-19a dbDEMC 49 hsa-mir-148a dbDEMC
25 hsa-mir-205 dbDEMC 50 hsa-mir-146a dbDEMC

The above experimental results further demonstrate the reliability of SCPLPA in predicting miRNAs related to isolated diseases. The model also addresses the limitation of many current miRNA–disease association prediction models in predicting miRNAs related to isolated diseases.

4 DISCUSSION

The association between miRNAs and diseases has attracted considerable research attention. Variations and dysregulation of miRNAs can lead to various diseases. As such, identifying and predicting the association between miRNAs and diseases is beneficial for understanding the function and pathogenesis of miRNAs. Existing biological experimental methods for identifying miRNA–disease associations are time consuming and labour intensive. Computational prediction methods can serve as effective supplementary tools for experimental validation. Predicting potential miRNA–disease associations through computational methods has become a hot topic in bioinformatics, resulting in the development of related prediction models. However, several issues remain to be addressed, such as low prediction accuracy, difficulty in obtaining negative samples and challenges in predicting associations for isolated diseases and new miRNAs.

This paper proposes an SCPLPA model based on network consistency projection and a label propagation algorithm to predict potential miRNA–disease associations. SCPLPA not only performs well in predicting unknown miRNA–disease interactions but also effectively predicts associations for isolated diseases and new miRNAs. SCPLPA was compared with four state-of-the-art models, namely, MDHGI, NSEMDA, RFMDA and SNMFMDA, to evaluate its performance. The ACC value of SCPLPA is 0.5503, while those of MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.5607, 0.5321, 0.5215 and 0.5317, respectively. The AUC values of the five models obtained through LOOCV are 0.9356, 0.8945, 0.8899, 0.8891 and 0.9007, respectively. Furthermore, the AUPR value of SCPLPA is 0.4596, while those of MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.3367, 0.3198, 0.3345 and 0.3489, respectively. The AUPR of SCPLPA exceeds that of each of the four compared models by at least 24.09%. This indicates that, on the given datasets with imbalanced positive and negative samples, the predictive performance of SCPLPA has a clear advantage over the other state-of-the-art models, demonstrating better robustness in handling imbalanced datasets. Additionally, the MCC values of SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.1762, 0.1507, 0.1472, 0.1356 and 0.1681, respectively. The F1 values of SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.1102, 0.1023, 0.1054, 0.1012 and 0.1045, respectively. SCPLPA also has a slight lead in F1 and MCC values. In conclusion, compared with the other four state-of-the-art models, SCPLPA can improve robustness on imbalanced datasets while maintaining high prediction accuracy, showing superior performance in miRNA–disease association tasks. To evaluate the prediction performance of SCPLPA for new miRNAs and isolated diseases, each miRNA was simulated as a new miRNA and each disease as an isolated disease, and cross-validation was performed for each of them. The resulting AUC values of SCPLPA are 0.8289 for isolated diseases and 0.8412 for new miRNAs.

Colon and kidney neoplasms were selected for case analysis to further validate the reliability of the SCPLPA model in predicting the relationship between potential miRNAs and diseases. Among the top 50 predicted miRNAs for colon and kidney neoplasms, the proportions verified by the HMDD v3.2 and dbDEMC databases are 98% and 100%, respectively. In the isolated disease cases, all of the top 50 predictions were confirmed by the two databases. The reliable predictions of SCPLPA provide insights for the identification of potential miRNA biomarkers and contribute to future research on the involvement of miRNAs in human disease mechanisms.

The outstanding predictive performance of SCPLPA is mainly due to two reasons. First, it integrates disease semantic similarity and disease Gaussian interaction profile kernel similarity to construct a heterogeneous disease similarity network, and it integrates miRNA functional similarity and miRNA Gaussian interaction profile kernel similarity to construct a heterogeneous miRNA similarity network; these networks characterize the similarity between diseases and between miRNAs more accurately. Second, SCPLPA combines label propagation and network consistency projection sub-models. The label propagation algorithm estimates miRNA–disease associations, alleviates the sparsity of the known miRNA–disease association data and addresses the positive and unlabelled learning problem. The consistency projection sub-model obtains consistency information between different networks, thereby solving the problem of predicting isolated diseases and new miRNAs and improving the accuracy of predicting potential miRNA–disease associations. Although SCPLPA can effectively predict miRNA–disease associations, it has certain limitations. First, integrating more omics data could yield more accurate disease and miRNA similarity networks than those currently used. Second, our algorithm is based on known miRNA–disease associations, which may bias the results towards diseases with known associated miRNAs. Inspired by association prediction methods in related areas such as drug–target interaction prediction62 and ligand–receptor interactions,63-66 we plan to explore boosting-based or deep learning-based models to enhance miRNA–disease prediction in future research.
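Two of the ingredients named above, the Gaussian interaction profile (GIP) kernel and label propagation, can be sketched in their standard textbook forms. This is a minimal illustration, not the authors' implementation: the row normalisation, the update rule F <- alpha * W * F + (1 - alpha) * Y, the values of `alpha` and `gamma_prime`, and the toy matrix are all assumptions and may differ from the formulation in the Methods section.

```python
import math
from typing import List

Matrix = List[List[float]]

def gip_kernel(profiles: Matrix, gamma_prime: float = 1.0) -> Matrix:
    """GIP kernel: K[i][j] = exp(-gamma * ||p_i - p_j||^2),
    with gamma = gamma_prime / (mean squared norm of the profiles)."""
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    return [[math.exp(-gamma * sum((a - b) ** 2
                                   for a, b in zip(profiles[i], profiles[j])))
             for j in range(n)] for i in range(n)]

def row_normalise(s: Matrix) -> Matrix:
    """Scale each row to sum to 1 so that propagation cannot blow up."""
    return [[v / sum(row) for v in row] for row in s]

def label_propagation(s: Matrix, y: Matrix, alpha: float = 0.5,
                      tol: float = 1e-9, max_iter: int = 1000) -> Matrix:
    """Iterate F <- alpha * W F + (1 - alpha) * Y until the update is below tol."""
    w = row_normalise(s)
    f = [row[:] for row in y]
    for _ in range(max_iter):
        f_next = [[alpha * sum(w[i][k] * f[k][j] for k in range(len(f)))
                   + (1 - alpha) * y[i][j]
                   for j in range(len(y[0]))] for i in range(len(y))]
        delta = max(abs(a - b) for ra, rb in zip(f_next, f) for a, b in zip(ra, rb))
        f = f_next
        if delta < tol:
            break
    return f

# Invented toy association matrix: 4 miRNAs (rows) x 3 diseases (columns).
Y_toy = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
K_toy = gip_kernel(Y_toy)                # miRNA-miRNA similarity from profiles
F_toy = label_propagation(K_toy, Y_toy)  # soft association scores

for row in F_toy:
    print([round(v, 3) for v in row])
```

After propagation, entries that were 0 in the input receive non-zero scores from similar miRNAs, which is how label propagation densifies a sparse association matrix and turns unlabelled pairs into ranked candidates.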

AUTHOR CONTRIBUTIONS

Min Chen: Conceptualization (equal); formal analysis (equal); methodology (equal); resources (equal); software (equal); supervision (equal); writing – original draft (equal); writing – review and editing (equal). Yingwei Deng: Conceptualization (equal); formal analysis (equal); investigation (equal); methodology (equal); resources (equal); software (equal); supervision (equal); writing – original draft (equal); writing – review and editing (equal). Zejun Li: Funding acquisition (equal); resources (equal). Yifan Ye: Investigation (equal); validation (equal); visualization (equal). Lijun Zeng: Project administration (equal); visualization (equal). Ziyi He: Investigation (equal); validation (equal); visualization (equal). Guofang Peng: Visualization (equal).

ACKNOWLEDGEMENTS

The work was supported by the Natural Science Foundation of Hunan Province, China (Grant No. 2024JJ7115) and the National Natural Science Foundation of China (Grant No. 62172158).

CONFLICT OF INTEREST STATEMENT

The authors confirm that there are no conflicts of interest.

DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/supplementary material.
