NGCN: Drug-target interaction prediction by integrating information and feature learning from heterogeneous network
Abstract
Drug-target interaction (DTI) prediction is essential for new drug design and development. Constructing a heterogeneous network from diverse information about drugs, proteins and diseases provides new opportunities for DTI prediction. However, the inherent complexity, high dimensionality and noise of such a network make it difficult to take full advantage of its characteristics. This article proposes a novel method, NGCN, which predicts drug-target interactions from an integrated heterogeneous network by extracting relevant biological properties and association information while preserving the topology information. Unlike traditional methods, it learns low-dimensional topological representations of drugs and targets via a graph convolutional neural network to improve the performance of DTI prediction. NGCN outperforms other state-of-the-art methods, with a nearly 1.0% increase in AUPR value. Moreover, we verify the robustness of NGCN through benchmark tests, and the experimental results demonstrate that it is an extensible framework capable of combining heterogeneous information for DTI prediction.
1 INTRODUCTION
The design and development of new drugs is a long process characterised by high risk, long cycles and large investment. In addition, the side effects of drugs on unexpected diseases and drug interactions have been shown to be potential risks to human health. Traditional biological experiments are effective in finding drug-target interactions, but they are usually time-consuming and costly.1, 2 Thus, computational approaches for detecting drug-target interactions have recently become one of the most important parts of pharmacology development. With the growth of data on drugs, targets and their interactions, computation-based methods not only make predicting drug-target interactions more economical and effective but also enhance experimental reliability, since they help explain the mechanisms of drug action and potential target activities.
Existing computational approaches can be divided into three categories:
- The approach using molecular docking requires a known 3D structure of proteins, whereas the complex structures of known protein ligands are scarce and generally unavailable.
- The approach by ligand similarity employs the knowledge of known ligand interactions to make predictions. Nevertheless, if the target has insufficient ligands, the results may be poor.
- Machine learning is the most popular and effective approach at present, which can fully explore the relevant characteristics of drugs and the potential drug-target interactions.
In recent years, many machine learning-based methods have been proposed to predict potential DTIs. They mainly fall into three groups: kernel methods, matrix factorization and multi-source information integration.
Based on chemical and genomic information, Yamanishi et al.6 used kernel regression for DTI prediction and constructed a bipartite local model (BLM) using bipartite graphs. Van Laarhoven et al.7 defined a Gaussian interaction profile kernel based on the topological characteristics of the adjacency matrix and then used the kernel regularized least squares (KRLS) algorithm to predict DTIs. Pahikkala et al.8 also employed the Kronecker regularized least squares (KRLS) algorithm, but they characterised drugs by 2D compound similarity and targets by Smith-Waterman sequence similarity. The kernel-based methods employ only simple linear combinations of several individual kernels to generate the final kernel matrix. This may be inappropriate if the relationship between the kernels is not clearly linear.
Matrix factorization is also widely used for DTI prediction. The kernelized Bayesian matrix factorization with twin kernels (KBMF2K) proposed by Gonen et al.9 maps target proteins and drug compounds into a shared subspace and estimates the interaction network from their similarity in that subspace. Hao et al.10 established a drug-target prediction model called DNILMF based on logistic matrix factorization. This model constructs two new kernel matrices, performs nonlinear diffusion between these two matrices and the two original similarity matrices, and predicts drug-target interactions by gathering neighbour information. Ding et al.11 proposed a multiple kernel-based triple collaborative matrix factorization (MK-TCMF) method to predict DTIs, in which a multi-kernel learning (MKL) algorithm regulates the weight of each kernel matrix according to the prediction error. The aforementioned methods utilise only direct drug-target associations. This is limiting because the known interaction information is often incomplete.
With the rapid development of bioinformatics, various data on drugs, proteins, genes and other entities have also been adopted for DTI prediction. Wan et al.12 constructed a large integrated network by combining data from multiple heterogeneous networks, captured the topological characteristics of the integrated network by using neighbourhood aggregation technology13 and reconstructed the topological representation of all relational matrices. Yu et al.14 developed an ensemble model (KenDTI) based on both biochemical characteristics of drugs via network integration and molecular sequences via word embedding to predict DTIs. Shao et al.15 regarded DTI prediction as a link prediction problem and proposed an end-to-end model based on heterogeneous graphs with attention mechanisms (DTI-HETA). Fu et al.16 proposed a multi-view graph convolutional network (MVGCN) framework for link prediction in biological networks, which combines similarity networks to build a multi-view heterogeneous network and obtain node attributes. In addition, a Neighbourhood Information Aggregation (NIA) layer was designed for inter- and intra-domain information updating. Ren et al.17 integrated a large amount of unlabelled drug molecular graph information and target information and designed a pre-training framework, MGP-DR (molecular graph pretraining for drug representation), for drug pair representation learning. The model used a self-supervised learning strategy to mine contextual information within and between drug molecules to predict drug–drug interactions and drug combinations. The graph convolutional neural network was utilised to obtain the embedded representation of the drugs and targets. The performance of network prediction tasks using graph convolution technology for large-scale graph data has been significantly improved18 owing to the application of graph neural networks.19 In multi-source data processing, it is usually easy to concatenate the features of different data sources.
Therefore, the key to improving DTI prediction accuracy is to make full use of the contributions of data from varied sources and to fuse them efficiently.
Motivated by the recent success of deep learning techniques in learning powerful representations from complex data,20-23 Zhang et al.24 introduced related datasets for DTI prediction. In addition to the previously mentioned self-supervised learning framework, MGP-DR, introduced by Ren et al.,17 Chu et al.25 proposed HGRL-DTA, a novel approach for predicting drug-target binding affinity through hierarchical graph representation learning. By incorporating both global affinity relationships and local chemical structures of drug and target molecules, and by utilising message broadcasting strategies, the model can synergistically integrate hierarchical information. The heterogeneous graph automatic meta-path learning-based DTI prediction method (HampDTI), proposed by Wang et al.,26 employed a node-type specific graph convolutional network (NSGCN) to learn the embeddings of drugs and targets using meta-paths learned from a heterogeneous graph. The embeddings from multiple meta-path graphs are then combined to predict new DTIs.
The advantage of deep learning methods is their ability to identify hidden interactions between drugs and targets. However, they still have room for improvement in the following two aspects: (1) the goal of DTI prediction is to discover new DTIs, so selecting drug-target pairs that truly do not interact is a thorny issue; and (2) good performance on test datasets does not guarantee that these methods will perform well in discovering real drugs.
This paper proposes a novel method, NGCN, to predict DTIs. It integrates information from heterogeneous data sources, extracts drug and target information from heterogeneous networks and reduces the feature information of each drug or target to a low-dimensional feature representation. Based on these low-dimensional feature vectors, a spectral graph convolutional network (GCN) is further applied to learn the drug and target features and to mitigate the inaccuracy caused by the noise and incompleteness of large-scale biological data. We compare NGCN with other methods to demonstrate its effectiveness and gradually increase the number of networks to demonstrate its integration capability. The results demonstrate that NGCN is promising for drug-target interaction prediction.
2 PRELIMINARIES
Network fusion-based drug-target interaction prediction aims to conduct prediction tasks by jointly utilising different views and exploiting their complementarity. Two strategies are commonly used:
- Gather multiple networks to build a large integrated network and extract information for prediction.
- Extract feature information from each network and then fuse them for similarity or correlation prediction.
It is difficult to distinguish the discrepancies between different networks while constructing a large integrated network. Moreover, if the number of integrated networks is too large, computations on such a network become challenging owing to the increasing network complexity.
Extracting information from each network and then making fusion predictions is the primary approach to drug-target interaction prediction. The process is mainly composed of three steps: (1) extracting drug or protein information from each network; (2) feature fusion and dimensionality reduction; and (3) correlation prediction or drug repositioning prediction based on the extracted feature information.
Information extraction on a single network is the key step in network fusion. Common feature extraction techniques include matrix decomposition and random walk with restart (RWR). The former usually decomposes the incidence matrix into two factor matrices and minimises the reconstruction loss. However, this strategy might lead to information loss and fail to capture the global characteristics of the incidence matrix.
3 METHOD
The diffusion state is inaccurate, partially because the network data set in the experiment is noisy and incomplete. Luo et al.27 improved the diffusion component analysis (DCA) method28 and proposed clusDCA, which performs dimension reduction in the form of an effective matrix decomposition. We incorporate it into our proposed model, NGCN.
NGCN first conducts the RWR process for each drug or protein within each similarity network to acquire the distribution of each drug or protein node, termed the diffusion state. The diffusion state of a node captures its topological relationship with all other nodes in the heterogeneous network. Subsequently, the improved clusDCA algorithm is employed to compute the low-dimensional representation of the nodes. Leveraging the learned low-dimensional features of drugs and proteins (where each row in the low-dimensional drug features represents a feature vector of a drug and each column in the low-dimensional protein features is a feature vector of a protein), NGCN executes spectral graph convolution to further refine the features of drugs and proteins. Finally, the drug-target matrix is reconstructed to identify unknown drug-target interactions. Details of the NGCN model are depicted in Figure 1.

3.1 Diffusion state of nodes by RWR
Our network data consist of homogeneous interaction networks, such as the PPI network, and heterogeneous interaction networks, such as protein-disease association networks. For the input homogeneous interaction networks (e.g. drug–drug interaction networks), we compute the “diffusion state” of each drug or target by directly running the RWR algorithm on each of these networks. For heterogeneous interaction networks, we first build similarity networks (e.g. a protein–protein similarity network derived from the protein-disease association network) and then run the RWR process on the derived similarity networks to obtain the diffusion states of drugs or proteins. Overall, we construct similarity networks for drugs based on (i) drug–drug interactions, (ii) drug-disease associations and (iii) drug-side-effect associations. Similarly, we construct similarity networks for proteins based on (i) protein–protein interactions and (ii) protein-disease associations.
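As an illustration of how a similarity network might be derived from an association network, the following Python sketch computes pairwise Jaccard similarity between nodes from their association sets. The choice of Jaccard similarity is an assumption made here for illustration (this subsection does not name the measure, although Jaccard similarity is used later for redundancy removal); the function name `jaccard_similarity_network` and the toy protein and disease identifiers are hypothetical.

```python
def jaccard_similarity_network(associations):
    """Build a similarity network from an association network.

    associations: dict mapping each node (e.g. a protein) to the set of
    entities it is associated with (e.g. diseases).
    Returns a dict mapping ordered node pairs to a similarity in [0, 1].
    """
    nodes = sorted(associations)
    similarity = {}
    for a in nodes:
        for b in nodes:
            union = associations[a] | associations[b]
            shared = associations[a] & associations[b]
            # Two nodes without any association are treated as dissimilar.
            similarity[(a, b)] = len(shared) / len(union) if union else 0.0
    return similarity


# Hypothetical toy protein-disease associations.
protein_disease = {
    "P1": {"d1", "d2"},
    "P2": {"d2", "d3"},
    "P3": {"d4"},
}
protein_similarity = jaccard_similarity_network(protein_disease)
```

The resulting weighted similarity network can then be treated like any homogeneous network when running RWR.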
The diffusion state of each network is then obtained by running the RWR process on each similarity network, as described in Equation 2.
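The RWR step can be sketched as follows. The update rule used here, S ← (1 − p)·S·B + p·I, with B the row-normalised transition matrix and p the restart probability, is the standard RWR formulation and is assumed rather than copied from the paper's equation. The function name, the default restart probability, the convergence tolerance and the self-loop treatment of isolated nodes are illustrative choices.

```python
import numpy as np


def rwr_diffusion_states(adj, restart_prob=0.5, tol=1e-8, max_iter=1000):
    """Return a matrix whose i-th row is the diffusion state of node i."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Assumption: isolated nodes receive a self-loop so that every row of
    # the transition matrix is a valid probability distribution.
    isolated = adj.sum(axis=1) == 0
    adj = adj + np.diag(isolated.astype(float))
    transition = adj / adj.sum(axis=1, keepdims=True)

    restart = np.eye(n)      # walker i restarts at node i
    states = restart.copy()
    for _ in range(max_iter):
        updated = (1.0 - restart_prob) * states @ transition + restart_prob * restart
        if np.abs(updated - states).max() < tol:
            return updated
        states = updated
    return states


# Toy 3-node path graph: 0 - 1 - 2.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
diffusion = rwr_diffusion_states(path_graph, restart_prob=0.5)
```

Each row of the result sums to one, so it can be read as the probability distribution of a walker that starts from, and keeps restarting at, the corresponding node.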
3.2 Performing feature reduction and feature extraction
3.3 Updating feature information
Although we have obtained the low-dimensional representation of drug or target nodes, the node features need to be further updated due to the noisy and uncertain biological information. Here, we use the spectral graph-based convolutional neural network for updating features.
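A single feature-update step can be sketched with the widely used first-order spectral graph convolution, H = ReLU(D^(-1/2) (A + I) D^(-1/2) X W), where D is the degree matrix of A + I. This particular propagation rule, the ReLU activation and the names `gcn_layer`, `adj`, `features` and `weights` are assumptions for illustration; the exact layer used by NGCN may differ.

```python
import numpy as np


def gcn_layer(adj, features, weights):
    """One graph-convolution step with symmetric normalisation and ReLU.

    adj:      (n, n) non-negative adjacency (or similarity) matrix
    features: (n, d_in) node features, e.g. the low-dimensional vectors
    weights:  (d_in, d_out) trainable weight matrix
    """
    adj = np.asarray(adj, dtype=float)
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    degree = a_hat.sum(axis=1)                # >= 1 because of self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Aggregate neighbour features, transform them, then apply ReLU.
    return np.maximum(norm_adj @ features @ weights, 0.0)


# Two connected nodes with one-hot features and identity weights.
updated = gcn_layer([[0, 1], [1, 0]], np.eye(2), np.eye(2))
```

In this toy case each node ends up with the average of its own and its neighbour's features, which shows the smoothing effect of the convolution.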
3.4 Reconstructing drug-target matrix
By minimising the final objective function, gradient descent training can be carried out.
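The following sketch illustrates gradient descent training for the reconstruction step. Because the objective function is not reproduced in this subsection, a regularised squared-error bilinear reconstruction, P ≈ X Z Y, is assumed here, with drug features X stored as rows and protein features Y stored as columns, as described earlier. The function name `fit_projection`, the learning rate, the regularisation weight and the number of epochs are hypothetical.

```python
import numpy as np


def fit_projection(interactions, drug_feats, prot_feats, lr=0.01, reg=0.0, epochs=500):
    """Learn Z that minimises ||X Z Y - P||_F^2 + reg * ||Z||_F^2.

    interactions: (n_drugs, n_prots) known drug-target matrix P
    drug_feats:   (n_drugs, d_drug) matrix X, one row per drug
    prot_feats:   (d_prot, n_prots) matrix Y, one column per protein
    """
    z = np.zeros((drug_feats.shape[1], prot_feats.shape[0]))
    for _ in range(epochs):
        residual = drug_feats @ z @ prot_feats - interactions
        # Gradient of the assumed objective with respect to Z.
        grad = 2.0 * drug_feats.T @ residual @ prot_feats.T + 2.0 * reg * z
        z -= lr * grad
    return z


def predict_scores(drug_feats, z, prot_feats):
    """Reconstructed drug-target matrix; larger entries suggest interactions."""
    return drug_feats @ z @ prot_feats
```

With real features the learning rate has to be small enough for the updates to remain stable, and unknown pairs can then be ranked by their reconstructed scores.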
3.5 Pseudocode of NGCN
The pseudocode for NGCN is provided in Algorithm 1 below.
ALGORITHM 1. Pseudocode of NGCN
- Run random walk with restart on each of the multiple networks;
- Use diffusion component analysis (clusDCA) to perform feature reduction and feature extraction on the diffusion state set of the drugs and the diffusion state set of the proteins;
- Apply the spectral graph-based convolutional neural network to update the features of drugs and targets;
- Reconstruct the drug-target matrix;
- Return the reconstructed drug-target matrix.
The steps are explained as follows:
- Step 1: the diffusion state of each drug or target is derived by performing the RWR algorithm (as shown in Equation 2) on each network.
- Step 2: clusDCA takes the diffusion state set of the drugs and the diffusion state set of the proteins as input, performs feature reduction on the node features and obtains the important topological feature information of the nodes from the diffusion states.
- Step 3: the spectral graph-based convolutional neural network is constructed according to Equation 11, and the drug and target features obtained in Step 2 are updated.
- Step 4: the drug-target matrix is reconstructed by Equation 13 from the updated drug and target features.
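Step 2 above can be illustrated with a commonly used closed-form approximation of diffusion component analysis: the diffusion states of all networks are concatenated, smoothed and log-transformed, and a truncated SVD yields the low-dimensional vectors. Whether the improved clusDCA step in NGCN matches this procedure exactly is not stated here, so the smoothing constant 1/n, the square-root scaling of the singular values and the function name `reduce_diffusion_states` should be read as assumptions.

```python
import numpy as np


def reduce_diffusion_states(state_list, dim):
    """Compress diffusion states from several networks into `dim` features.

    state_list: list of (n, n) diffusion-state matrices for the same n nodes
    dim:        target feature dimension (at most n)
    Returns an (n, dim) matrix with one feature vector per row.
    """
    n = state_list[0].shape[0]
    stacked = np.hstack(state_list)          # (n, n * number_of_networks)
    # Smooth with 1/n so the logarithm is defined for zero probabilities.
    logged = np.log(stacked + 1.0 / n) - np.log(1.0 / n)
    u, s, _ = np.linalg.svd(logged, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


# Two hypothetical diffusion-state matrices for the same four nodes.
rng = np.random.default_rng(0)
toy_states = []
for _ in range(2):
    raw = rng.random((4, 4))
    toy_states.append(raw / raw.sum(axis=1, keepdims=True))
drug_features = reduce_diffusion_states(toy_states, dim=2)
```

Concatenating the per-network diffusion states before the decomposition is what allows a single feature vector to summarise a node's topology across all networks.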
4 EXPERIMENTAL RESULTS
4.1 Dataset
Throughout training, we used the same dataset as Luo et al.27 The dataset contains four types of nodes: drug nodes, protein nodes, disease nodes and side-effect nodes. All isolated nodes were excluded.
The dataset includes two similarity networks and six association networks. The association networks are the drug-protein association network,30 protein–protein association network,31 drug–drug interaction network,30 drug-disease network,32 protein-disease association network32 and drug-side effect network.33 These networks can be used to construct corresponding similarity networks for proteins and drugs. Of the two similarity networks, the protein similarity network is generated from the sequence similarity of proteins, and the drug similarity network is constructed from the similarity of chemical structures.
4.2 Superiority in DTI prediction
A drug-target pair with a known interaction is considered a positive sample, and a drug-target pair with an unknown interaction is generally viewed as a negative sample. To measure the performance of NGCN in predicting DTIs, we first performed 10-fold cross-validation on all positive pairs and a set of randomly sampled negative pairs, 10 times as many as the positive pairs. This scenario simulates the practical situation in which DTIs are sparsely labelled. For each fold, a randomly chosen subset of 90% of the positive and negative pairs was used as training data to construct the heterogeneous networks and train the parameters of NGCN, and the remaining 10% of the positive and negative pairs were held out as the test set.
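The sampling and splitting protocol described above can be sketched as follows. The function names, the fixed random seeds and the toy interaction matrix are hypothetical, and the split shown here shuffles positives and negatives together, which is one possible reading of the protocol.

```python
import random


def sample_labelled_pairs(interactions, neg_ratio=10, seed=0):
    """Return (drug, target, label) triples: all positives plus sampled negatives."""
    rng = random.Random(seed)
    positives, unknown = [], []
    for i, row in enumerate(interactions):
        for j, value in enumerate(row):
            (positives if value == 1 else unknown).append((i, j))
    # Unknown pairs are treated as negatives, neg_ratio times the positives.
    n_neg = min(len(unknown), neg_ratio * len(positives))
    negatives = rng.sample(unknown, n_neg)
    return [(i, j, 1) for i, j in positives] + [(i, j, 0) for i, j in negatives]


def k_fold_splits(samples, k=10, seed=0):
    """Yield (train, test) lists; each sample is held out exactly once."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    folds = [shuffled[f::k] for f in range(k)]
    for f in range(k):
        train = [s for g in range(k) if g != f for s in folds[g]]
        yield train, folds[f]


# Hypothetical 10 x 12 interaction matrix with 10 known interactions.
toy_matrix = [[1 if i == j else 0 for j in range(12)] for i in range(10)]
toy_samples = sample_labelled_pairs(toy_matrix, neg_ratio=10)
toy_splits = list(k_fold_splits(toy_samples, k=10))
```

When such a split is used, the positive pairs in the held-out fold should also be hidden from the drug-target network used for training, so that no test information leaks into the learned features.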
We compared NGCN with six baseline methods, including NeoDTI,12 DTINet,27 BLMNII,34 MOLIERE,35 NetLapRLS36 and HNM.37 Two evaluation indicators including AUPR (the area under the precision-recall curve) and AUROC (the area under the receiver operating characteristic curve) were used to measure performance.
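For reference, the two indicators can be computed as in the sketch below. AUROC is obtained from pairwise comparisons of positive and negative scores, and AUPR is approximated by average precision, which is one common estimator of the area under the precision-recall curve and may differ slightly from other interpolation schemes. The function names and the toy labels and scores are illustrative only and are not results from the paper.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties count as half a win; this pairwise form is quadratic in sample size.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy example with two positives and two negatives.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)              # 0.75
toy_aupr = average_precision(toy_labels, toy_scores)   # 5/6
```

Because negatives outnumber positives in DTI data, AUPR is usually the more sensitive of the two indicators to false positives among the top-ranked predictions.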
Figure 2 shows that NGCN performs better than all the other methods, including the best-performing baseline. In addition to known DTI data, the chemical structure, protein sequence information and other properties of drugs and targets can also be characterised through their various functional roles in biological systems, such as protein–protein interactions and drug-disease associations. By integrating disparate information from heterogeneous data sources, methods such as DTINet, NeoDTI and HNM can further improve the accuracy of DTI predictions. However, these approaches still have limitations. For example, the HNM method considers only three types of data when predicting relationships, thus discarding a lot of valuable information. In addition, methods such as BLMNII and MOLIERE take only relatively simple forms (such as bilinear or log-linear functions), which may not be sufficient to capture the complex hidden features behind heterogeneous data. NGCN's performance can be attributed to its use of RWR to compute the diffusion state of nodes in each network, followed by dimensionality reduction with clusDCA, which substantially reduces the noise in the data. The spectral graph convolutional neural network is then employed to further learn drug and target features. Unlike DTINet, where predictions are based solely on the dimension-reduced diffusion states, NGCN enhances its predictive capability by optimising the features with the graph convolution model, thereby achieving better results.

To verify the performance of NGCN under sparse positive samples, we changed the number of samples while fixing the ratio of positive to negative examples at 1:10. The performance of all other algorithms decreased, whereas NGCN still achieved the best prediction performance. This shows that NGCN remains superior to the other methods even in the case of sparse labelling. In addition, we performed statistical significance tests at the 95% confidence level on the 10-fold cross-validation results of NGCN and NeoDTI (the best-performing baseline in the comparison experiment). The results show that the observed differences between the two methods are statistically significant.
The data may be redundant: for example, the dataset may contain multiple homologous proteins for one protein or multiple highly similar drugs for one drug, which may negatively affect the performance. Therefore, we applied the same strategy as Luo et al. to reduce the impact of data redundancy by removing drug-target associations of similar drugs or targets from the drug-target interaction matrix. We eliminated drug-target associations for which the Jaccard similarity in the association network exceeded 0.6, the structure similarity score in the medicinal chemical similarity network exceeded 0.6, and the identity score in the protein–protein sequence similarity network exceeded 0.4.
In this experiment, we kept the ratio of negative to positive samples at 1:1. As expected, after the removal of similar drugs and targets, the performance of NGCN declined but remained superior to that of the baseline methods.
4.3 Effects of NGCN components
NGCN is a multi-network integration algorithm that uses a GCN model to predict drug-target interactions. The GCN aggregates neighbourhood features to further improve the quality of the features. Spectral graph convolutional networks introduce filters from the perspective of graph signal processing to define graph convolution, so that the graph convolution operation can be interpreted as removing noise from the graph signal. To evaluate the contribution of the GCN component, we implemented a multi-network integration framework without feature updating (i.e. without using the spectral graph convolutional neural network to update features) and compared it with the full NGCN. The experimental results are reported in Table 1. Feature updating improves AUPR in every configuration and AUROC in all but one (drug dimension 100 and protein dimension 400, where AUROC drops slightly from 0.904 to 0.901), which indicates that the feature updating operation benefits drug-target interaction prediction.
Feature-update | Drug-dimension | Protein-dimension | AUPR | AUROC |
---|---|---|---|---|
NO | 100 | 200 | 0.889 | 0.863 |
YES | 100 | 200 | 0.901 | 0.880 |
NO | 200 | 200 | 0.894 | 0.875 |
YES | 200 | 200 | 0.914 | 0.895 |
NO | 100 | 400 | 0.924 | 0.904 |
YES | 100 | 400 | 0.926 | 0.901 |
NO | 200 | 400 | 0.921 | 0.900 |
YES | 200 | 400 | 0.943 | 0.910 |
- Note: The best performance results are highlighted in bold.
4.4 Robustness
In the experiment, we mainly evaluated the influence of parameters and the robustness of NGCN. The robustness of NGCN was tested by changing the number of networks related to the drugs or target, the feature dimension and the hyperparameters of NGCN. All experimental results were obtained by adopting 10-fold cross-validation.
We start by examining the effect of aggregating multiple heterogeneous networks on the prediction results. We first used only the drug- and protein-related networks (i.e. drug similarity network, drug–drug association network, protein–protein association network, protein similarity network and drug-protein association network) to conduct performance evaluation. The prediction performance was significantly lower than that of the original NGCN model, which obtains features from all networks. We then added the networks associated with diseases and side effects. As expected, the prediction performance improved as these additional drug- and protein-related networks were added. These experiments show that aggregating heterogeneous information from networks generated by multiple data sources improves prediction accuracy.
Furthermore, we applied NGCN to predict drug-target interactions under different feature dimensions and compared the AUPR values of the predicted results. According to the experiments of Wang et al.,29 a feature-vector dimension of 10%–20% of the diffusion-state dimension achieved the best results. We expanded the range to 10%–30% and set the drug dimension to 80, 110, 140, 170 and 200 and the protein dimension to 200, 250, 300, 350 and 400. The feature dimension had little impact on the prediction results (see Figure 3).

We further investigated the impact of hyperparameters on experimental performance, focusing on the restart probability p of the random walk. We varied the restart probability between 0.4 and 0.7 to observe the stability of the performance. As Figure 3 shows, NGCN achieves stable performance across this range. Thus, this parameter has little impact on the experimental performance.
To validate the robustness and scalability of our proposed approach, we evaluated it on four drug-target interaction datasets created by Yamanishi et al.6: Enzyme, Ion Channels, G-protein-coupled receptors (GPCR) and Nuclear receptors. The datasets are available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. We compared NGCN with six other prediction methods in terms of prediction effects using 10-fold cross-validation. Table 2 shows that NGCN outperforms other methods, indicating that our approach can be applied to other drug-target interaction prediction scenarios.
Dataset | NetLapRLS | BLMNII | HNM | DTINet | MOLIERE | NeoDTI | NGCN |
---|---|---|---|---|---|---|---|
Enzyme | 0.945 | 0.951 | 0.949 | 0.953 | 0.961 | 0.957 | 0.973 |
Ion channels | 0.953 | 0.957 | 0.924 | 0.971 | 0.963 | 0.961 | 0.975 |
GPCR | 0.928 | 0.939 | 0.940 | 0.937 | 0.944 | 0.945 | 0.957 |
Nuclear receptors | 0.891 | 0.914 | 0.894 | 0.921 | 0.913 | 0.938 | 0.944 |
- Note: The best performance results are highlighted in bold.
4.5 Identification of new targets for known drugs
We analysed the predicted scores of DTIs in the results. Among the predictions for unknown DTIs, we selected the 10 with the highest scores, and three of them are supported by previous studies in the literature. For example, nifedipine is a drug that has been approved to suppress spontaneous arrhythmia, and NGCN predicted that it interacts with SCN5A, which plays an important role in ventricular arrhythmia.38, 39 COX-2, encoded by the PTGS2 gene, is an inducible enzyme that can be highly induced by pro-inflammatory cytokines and tumour promoters in various cells, and nifedipine inhibits the expression of COX-2 in human OA chondrocytes.40 This interaction was also predicted by NGCN. In addition, nifedipine has a good clinical effect on high-altitude pulmonary oedema and has been approved for adjunctive treatment.41 NGCN predicted that nifedipine interacts with NR3C1, and NR3C1 gene polymorphisms are associated with high-altitude pulmonary oedema.42 These literature-supported cases further demonstrate the predictive ability of our model.
5 CONCLUSIONS
The challenge of integrating information from multiple networks for DTI prediction mainly arises from the complexity and heterogeneity of different drug-related networks, as well as from the high-dimensional, incomplete and noisy nature of data. To solve this problem, we propose a novel method called NGCN, which updates features through GCN by fusing features from multiple networks. NGCN analyses the structural characteristics of each network through a network diffusion process and extracts low-dimensional hidden vectors of the network. It has demonstrated significant improvement over baseline approaches for DTI prediction by leveraging updated features via convolutional optimization. Moreover, NGCN is an extensible framework that can incorporate more information about drugs and targets, offering flexibility to enhance features and integrate more heterogeneous information to improve the prediction accuracy. In our future work, we will focus on two main aspects to enhance our approach. Firstly, we will enhance the utilisation of biological information by integrating more diverse network data into our framework, leading to a more comprehensive understanding of drug-target interactions. Secondly, we will address the issue of significant differences in node degrees within the graph network to ensure effective extraction of information from low-degree nodes. These enhancements aim to achieve more precise and reliable prediction of drug-target interactions.
AUTHOR CONTRIBUTIONS
Junyue Cao: Conceptualization (equal); formal analysis (equal); funding acquisition (equal); investigation (equal); methodology (equal); project administration (equal); software (equal); writing – original draft (equal); writing – review and editing (equal). Qingfeng Chen: Conceptualization (equal); funding acquisition (equal); project administration (equal); supervision (equal). Junlai Qiu: Software (equal); validation (equal); writing – review and editing (equal). Yiming Wang: Software (equal); validation (equal). Wei Lan: Validation (equal); writing – review and editing (equal). Xiaojing Du: Visualization (equal); writing – review and editing (equal). Kai Tan: Validation (equal).
ACKNOWLEDGEMENTS
This work was supported in part by the National Natural Science Foundation of China (Grant 61963004) and the Specific Research Project of Guangxi for Research Bases and Talents (Grant 2022AC21066).
CONFLICT OF INTEREST STATEMENT
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Research
DATA AVAILABILITY STATEMENT
The source code and data are available at https://github.com/Junyue28/NGCN/.