Enabling the Application of Graph Neural Networks on Graphs With Unknown Connectivity
Funding: This work has been co-funded by the AETHER-UA project (PID2020-112540RB-C43) funded by the Spanish Ministry of Science and Innovation, and the BALLADEER (PROMETEO/2021/088) project, funded by the Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital (Generalitat Valenciana). Jorge García-Carrasco holds a predoctoral contract (CIACIF/2021/454) granted by the Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital (Generalitat Valenciana). This work is part of the TSI-100927-2023-6 Project, funded by the Recovery, Transformation and Resilience Plan from the European Union Next Generation through the Ministry for Digital Transformation and the Civil Service. This work has also been partially funded by the Sectorial Data Spaces projects European Mobility (TSI-100121-2024-10) and AGROVAL-AI (TSI-100122-2024-10), funded by the Spanish State Secretariat for Digitization and Artificial Intelligence.
ABSTRACT
Graph Neural Networks (GNNs) have proven to be reliable methods for working with graph-structured data. However, it is common to find graphs with partially or fully inaccessible connectivity patterns, hindering the direct application of GNNs to the task at hand. To tackle this problem, several Graph Structure Learning (GSL) methods have been proposed, with the objective of jointly optimizing both the graph structure and the GNN model by adding loss terms that enforce desired graph properties. These properties, such as sparseness and connectivity of similar nodes, can have a drastic impact on the performance of a GNN. However, current methods offer little control over the desired degree of sparseness, which may lead to non-optimal connectivity and reduced efficiency. In this paper, we propose a new method called Adaptative Sparsification Graph Learning (ASGL), which enables fine-grained, linear control over the total number of edges in the resulting learned graph via a novel perturbation-based loss term. ASGL not only provides flexibility in sparsity control but also improves both accuracy and computational efficiency, outperforming state-of-the-art methods in most benchmarks. We demonstrate its robustness through extensive experiments and highlight how adjusting sparsity enables optimizing the trade-off between accuracy, complexity, and interpretability.
1 Introduction
There is a wide variety of data that can be naturally and efficiently expressed as graphs, such as social networks (Sen et al. 2008; Rossi and Ahmed 2015; Namata et al. 2012), traffic networks (Cui et al. 2019), 3D mesh data (Lin et al. 2021), molecular structures (X. Wang et al. 2019) or brain EEG data (Zhong et al. 2022), among others. Entities can be represented as nodes in the graph, whereas the relationship between two entities can be represented as edges connecting the different nodes.
Graph Neural Networks (GNNs) are Deep Learning (DL) architectures designed for extracting relevant patterns from graph data. Due to their representation learning capabilities, GNN models have been successfully applied to a variety of tasks, such as social network-related tasks (e.g., sentiment analysis (Y. Zhou et al. 2024)), forecasting hotel cancellations (Herrera et al. 2024), recommender systems (H. Wang et al. 2018; X. Zhang et al. 2023), emotion recognition via brain EEG signals (Zhong et al. 2022) and bioinformatics-related tasks such as drug development and discovery or the prediction of molecular properties (X.-M. Zhang et al. 2021), among others.
GNNs are particularly good at node classification tasks, that is, classifying nodes in a given graph. However, it is common to find graphs with partially or fully inaccessible connectivity patterns, hindering the direct application of GNNs to the node classification task. For example, one may have a record of previous terrorist attacks and the relationships among them stored as a graph. If, unfortunately, another attack occurs, we would not be able to directly apply a GNN for classifying the node, as the connectivity of this node remains unknown. Consequently, there has been an increasing amount of proposals revolving around the theme of Graph Structure Learning (GSL) (Zhu et al. 2021). GSL methods aim to jointly learn a proper connectivity pattern and meaningful representations for the task at hand in order to properly perform such task on graphs where the connectivity is unknown or unavailable.
GSL methods can be divided into three categories: metric learning, probabilistic modeling, and direct optimization approaches. Metric learning approaches compute the weight of an edge between two nodes via a distance measure between their representations. For example, (R. Li et al. 2018) uses the Mahalanobis distance, (Yu et al. 2021) uses the inner product of the embeddings of the two nodes, (X. Wang et al. 2020) uses the cosine distance, and attentive approaches such as the GAT (Veličković et al. 2018) learn edge weights via the attention mechanism. However, these approaches can often oversimplify complex relationships by relying on fixed similarity metrics, limiting their adaptability.
The probabilistic modeling approaches are based on the assumption that the graph is generated via a sampling process and focus on modeling the probability of sampling edges with learnable parameters. For example, (Franceschi et al. 2019) model the edges by sampling from a Bernoulli distribution with learnable parameters, (Zheng et al. 2020) remove task-irrelevant edges by parameterizing the sparsification process with a neural network, and (T. Wu et al. 2020) use the attentional weights of GAT as parameters for the sampling edge distribution. Falling into this category, it is also worth mentioning Bayesian Network structure learning (Kitson et al. 2023). However, despite the flexibility of these methods, they can suffer from high computational overhead.
Finally, direct optimization approaches treat learning edge connectivities as an optimization problem, where the graph structure is learned along with the GNN. These methods include additional regularization terms in the loss function that enforce desirable properties of the graph connectivity, such as feature smoothness, sparsity, and low-rank. For example, (L. Yang et al. 2019) obtain a refined graph by including a refinement loss term based on the assumption that neighboring nodes tend to share the same label. GLNN (Gao et al. 2020) incorporates several terms to enforce properties such as sparsity and feature smoothness. Similarly, (Jin et al. 2020) includes a term for enforcing low-rank and applies it to obtain GNN models robust to adversarial attacks. Likewise, (Wan and Kokel 2021) reformulates the problem of structure learning as a meta-learning optimization problem by treating the graph structure as hyperparameters. In (W. Zhao et al. 2023), the authors successfully attempt to learn a generalizable graph structure learning model that is trained with multiple source graphs and can be directly adapted for inference on new unseen target graphs.
Despite the advances obtained so far, current direct optimization GSL methods offer little control over the degree of sparseness of the graph (i.e., the number of edges), which is one of the most important properties to take into account, as the total number of edges has a direct impact on the performance, computational cost, and complexity of the GNN model.
More specifically, current direct optimization GSL methods impose sparseness by adding an L1 regularization term on the adjacency matrix. Even though L1 regularization has been used extensively in many situations to impose sparseness, it might not be the most suitable option for GSL, because (i) it offers little to no control over the total number of edges of the learned graph and (ii) it is incompatible with learning unweighted (i.e., binary) graphs. To address these issues, this paper makes the following contributions:
- We propose ASGL, a direct optimization GSL method that enables full control over the degree of sparseness of the learned graph thanks to the addition of a novel loss term based on perturbation-based interpretability methods.
- We prove the robustness of ASGL compared to the state-of-the-art via a series of node classification experiments on different datasets, where the connectivity between the nodes is fully or partially unavailable and has to be learned.
- We demonstrate that using L1 regularization to control the degree of sparseness is suboptimal. Conversely, the sparseness hyperparameter in ASGL exhibits a linear correlation with the degree of sparseness, allowing for seamless tuning and resulting in improved performance.
2 Background
In this section, we present the problem statement and the necessary background to understand our approach.
2.1 Problem Statement
Let $G = (V, E)$ denote an undirected graph, where $V$ and $E$ represent the set of nodes and the set of connections between the different nodes, respectively. The connectivity of the graph can also be specified by the adjacency matrix $A \in \mathbb{R}^{N \times N}$, where $N = |V|$ is the number of nodes, which can be binary ($A_{ij} = 1$ if nodes $i$ and $j$ are connected, $A_{ij} = 0$ otherwise) or weighted. In this problem setup, the adjacency matrix is completely or partially unavailable, thus the connections between the different nodes are unknown.
Let each node $v_i \in V$ have an associated feature vector $x_i \in \mathbb{R}^{F}$, where $F$ is the number of features. The features of every node in the graph can be represented in a feature matrix $X \in \mathbb{R}^{N \times F}$. Moreover, the set of nodes is partially labeled, that is, some nodes have an associated label $y_i$ indicating their class. The objective is to predict the labels of the unlabeled nodes by exploiting the given graph information, that is, to obtain a function $f(X, A) = \hat{Y}$, where $\hat{Y}$ contains the predicted labels for every node. This task is denoted as semi-supervised node classification.
However, the adjacency matrix $A$ (i.e., the connectivity between the nodes) is unknown, drastically increasing the difficulty of the task. Hence, the goal is to jointly obtain a node classifier while learning a proper adjacency matrix, that is, jointly optimizing both $f$ and $A$.
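As a sketch of this joint objective (a generic direct-optimization GSL formulation; the symbols $\theta$, $\lambda$, and $Y_L$ are illustrative, and the specific regularizers used by ASGL are introduced in Section 2.3), the problem can be written as

$$
\min_{\theta,\, A} \;\; \mathcal{L}_{\text{task}}\big(f_{\theta}(X, A),\, Y_L\big) \;+\; \lambda\, \mathcal{L}_{\text{reg}}(A, X),
$$

where $\theta$ denotes the parameters of the node classifier, $Y_L$ the labels of the labeled nodes, and $\mathcal{L}_{\text{reg}}$ groups the structural regularization terms (e.g., feature smoothness and sparseness).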
2.2 Graph Convolutional Network
Convolutional Neural Networks (CNNs) have achieved impressive results when working with Euclidean, grid-like data such as images. Similarly, Graph Convolutional Networks (GCNs) are the generalization of CNNs to graph-structured data. However, the non-Euclidean nature of graphs presents several difficulties, and many approximations and variations of the convolution operation in graph data have been presented (Kipf and Welling 2017; Veličković et al. 2018; F. Wu et al. 2019; Hamilton et al. 2017).
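For reference, the layer-wise propagation rule of the GCN introduced by Kipf and Welling (2017), which is also the node classifier used later in our experiments, is

$$
H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\, \tilde{A}\, \tilde{D}^{-\frac{1}{2}}\, H^{(l)}\, W^{(l)}\right), \qquad \tilde{A} = A + I, \quad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij},
$$

where $H^{(l)}$ is the node representation at layer $l$ (with $H^{(0)} = X$), $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a non-linear activation.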
2.3 Graph Structure Learning
As previously mentioned, direct optimization GSL methods work by jointly optimizing the GNN model and the adjacency matrix representing the connectivity of the graph. This can be done by adding regularization terms that enforce desired properties for the adjacency matrix, such as feature smoothness or sparseness, and by imposing certain constraints, such as symmetry. A formal presentation of such regularization terms and constraints is presented below.
2.3.1 Feature Smoothness
If the $i$th and $j$th nodes are connected ($A_{ij} = 1$), then their features should be similar, that is, $\|x_i - x_j\|^2$ should be small in order to keep the loss term from growing quadratically. On the other hand, if two nodes have considerably different features, that is, $\|x_i - x_j\|^2$ is large, then it is less likely that there is a connection between them ($A_{ij} \approx 0$), as the loss term would otherwise be considerably increased.
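A common form of this smoothness term (a sketch of the standard formulation used in direct-optimization GSL methods such as GLNN; the exact weighting used by ASGL may differ) is

$$
\mathcal{L}_{fs} = \frac{1}{2} \sum_{i,j} A_{ij}\, \|x_i - x_j\|^2 = \operatorname{tr}\!\left(X^{\top} L X\right),
$$

where $L = D - A$ is the graph Laplacian and $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$.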
2.3.2 Sparseness
Similarly, sparseness is a topological property that can be observed across many graphs and domains (K. Zhou et al. 2013). Typically, nodes tend to form communities, where most of the nodes are only connected to a small number of neighbors.
Essentially, what this regularization term does is to encourage a given fraction of the adjacency matrix values, controlled by a sparseness hyperparameter, to become 1 while the rest of the values are pulled to 0. Compared to the L1 loss term, it presents the following advantages:
First, the number of edges of the learned graph is now explicitly controlled by the sparseness hyperparameter, which enables experimenting with different numbers of connections. Increasing its value directly increases the complexity of the graph, which can improve the performance (or overfit, if the graph becomes too complex), whereas reducing it decreases the number of connections, which may reduce the performance but yields a simpler and therefore more interpretable graph, a quality that may be required for some tasks. Second, unlike the L1 loss, it enforces the resulting graph to be binary (i.e., the connection between two nodes is either 1 or 0), which is a better approximation when working with unweighted graphs, reduces the computational cost (i.e., the learned graph is simpler), and increases the interpretability of the adjacency matrix.
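As a minimal illustration of this qualitative behavior (a hypothetical sketch of ours, not the exact perturbation-based ASGL term; the function name and the `edge_fraction` parameter are assumptions), a loss that pushes a target fraction of adjacency entries towards 1 and the rest towards 0 could look as follows:

```python
import torch
import torch.nn.functional as F

def fraction_sparseness_loss(adj: torch.Tensor, edge_fraction: float) -> torch.Tensor:
    # Illustrative only (NOT the exact ASGL term): push the `edge_fraction`
    # largest adjacency entries towards 1 and every remaining entry towards 0.
    values = adj.flatten()
    k = max(1, int(edge_fraction * values.numel()))
    top_idx = torch.topk(values, k).indices   # entries treated as candidate edges
    target = torch.zeros_like(values)
    target[top_idx] = 1.0
    return F.mse_loss(values, target)

# Example: randomly initialized dense adjacency, aiming for roughly 10% edges.
A = torch.rand(50, 50, requires_grad=True)
loss = fraction_sparseness_loss(A, edge_fraction=0.10)
loss.backward()
```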
3 Methodology
Figure 1 shows an overview of the methodology. First, the graph learning layer receives the graph node data and predicts the structure (i.e., the connections between nodes) of the graph. Then, the node information together with the learned edges is used by the graph node classifier to predict the label of every node. Finally, the ground-truth labels are used to compute the loss, which has two components, the task loss and the ASGL loss, used to update the weights of the graph node classifier and the adjacency matrix via backpropagation, respectively.

3.1 Graph Learning Module
3.2 Graph Node Classifier
3.3 Optimization
- First, a forward pass is performed and the cross-entropy loss term is computed.
- While keeping the adjacency matrix frozen, the parameters of the classifier are updated via backpropagation to minimize this loss term.
- Once the parameters of the classifier have been updated, another forward pass is performed and the ASGL loss is computed.
- The adjacency matrix is updated via backpropagation to minimize the ASGL loss.
In other words, the optimization procedure is separated into two steps where the cross-entropy loss is used to train the classifier, followed by updating the adjacency matrix with the ASGL loss. The total loss, which is a sum of the two terms, is only used for monitoring the improvement of the model during the training procedure. The pseudocode for this optimization procedure is shown in Algorithm 1.
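A minimal sketch of this alternating scheme is shown below (hypothetical names throughout: it assumes a classifier `model` that accepts a dense adjacency matrix, a learnable adjacency `A`, and an `asgl_loss` function combining the regularization terms; early stopping and the symmetry/clamping handling discussed next are omitted):

```python
import torch
import torch.nn.functional as F

def train_asgl(model, A, X, y, train_mask, asgl_loss, num_epochs=200):
    """Alternating optimization sketch with two separate Adam optimizers."""
    clf_opt = torch.optim.Adam(model.parameters())
    adj_opt = torch.optim.Adam([A])

    for _ in range(num_epochs):
        # Step 1: update only the classifier with the cross-entropy loss.
        logits = model(X, A.detach())          # adjacency kept frozen here
        ce = F.cross_entropy(logits[train_mask], y[train_mask])
        clf_opt.zero_grad()
        ce.backward()
        clf_opt.step()

        # Step 2: update only the adjacency matrix with the ASGL loss.
        reg = asgl_loss(A, X)
        adj_opt.zero_grad()
        reg.backward()
        adj_opt.step()

        # ce + reg is only monitored; the two terms are never backpropagated jointly.
```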
Instead of using additional loss terms, there are a few other properties that can be directly imposed as constraints, thereby reducing the complexity of the problem. Specifically, we impose the adjacency matrix to be symmetric ($A = A^{\top}$), which essentially implies that if node $i$ is connected to node $j$, then node $j$ is also connected to node $i$. Enforcing symmetry also implies that the number of learnable parameters in the adjacency matrix is reduced from $N^2$ to $N(N+1)/2$.
This is purely done for practical reasons: during training, the values of $A$ will approach either 0 or 1 due to the sparseness term, but will never exactly reach these values due to the nature of gradient descent. Instead, the values will oscillate around 0 or 1. By clamping the values, we ensure that the adjacency matrix is truly binary.
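An illustrative sketch of how these constraints can be realized in practice (an assumption of ours, not the exact implementation: only the strict upper triangle is parameterized, which also excludes self-loops, and the parameters are clamped after every optimizer step) is:

```python
import torch

N = 6  # illustrative number of nodes

# Learn only the strict upper triangle: the resulting graph is symmetric and
# has no self-loops by construction.
triu_idx = torch.triu_indices(N, N, offset=1)
params = torch.rand(triu_idx.shape[1], requires_grad=True)

def build_adjacency(p: torch.Tensor) -> torch.Tensor:
    A = torch.zeros(N, N, dtype=p.dtype)
    A[triu_idx[0], triu_idx[1]] = p
    return A + A.T          # mirror the upper triangle -> symmetric adjacency

# ... after every optimizer step on `params`:
with torch.no_grad():
    params.clamp_(0.0, 1.0)  # overshooting values snap exactly to 0 or 1
```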
Figure 2 shows an illustrative example of how clamping works during the optimization. Specifically, it shows the evolution of the values of a 2 × 2 randomly initialized adjacency matrix trained via backpropagation to minimize the sparseness loss. The experiment is performed twice, first with no clamping, and then applying clamping at every optimization step.

The first experiment clearly shows the previously-mentioned oscillatory behavior: initially, the values higher than 0.5 will approach 1, and the values lower than 0.5 will approach 0. As training progresses, the values will be closer to their target values. However, due to the nature of the backpropagation algorithm, there will be a point where the desired target is overshot, that is, the update rule of the values of the adjacency matrix will cause the values to be either higher than 1 or lower than 0. The following update will cause the opposite effect, therefore generating an oscillatory behavior of the values of the adjacency matrix around the target values.
On the other hand, if clamping is applied, the behavior is exactly the same until the values overshoot their targets. When a value surpasses either 0 or 1, it is clamped exactly to 0 or 1, respectively, and is therefore no longer affected by the gradients of the sparseness term; that is, the gradients flowing from the sparseness loss become zero, as the values have reached their exact targets thanks to clamping, avoiding the oscillatory behavior and yielding a truly binary matrix.
For the sake of completeness, Figure 3 shows the effect of the proposed sparseness loss on the adjacency matrix for different values of the sparseness hyperparameter, together with the imposed properties of symmetry, no self-loops, and clamping. As can be seen, the resulting matrices are binary and the number of edges can be easily controlled via this parameter: for lower values, there are fewer edges in the learned graph, that is, the adjacency matrix is sparser, whereas for higher values the number of edges is higher, and thus the degree of sparseness of the adjacency matrix is lower.

4 Experiments
To obtain an initial adjacency matrix from the node features (used as the fixed baseline), the following procedure is applied:
- Compute the Euclidean distance between every pair of feature vectors, obtaining a distance matrix $D$ with entries $D_{ij} = \|x_i - x_j\|_2$.
- Avoid self-loops by setting the elements on the diagonal to the maximum distance.
- Compute the adjacency matrix by setting an edge ($A_{ij} = 1$) between the top-$k$ closest pairs and $A_{ij} = 0$ otherwise, resulting in a binary, sparse adjacency matrix (a sketch of this construction is shown below).
Two evaluation scenarios are considered:
- The adjacency matrix is completely unknown; therefore, it is randomly initialized by sampling from a uniform distribution and learned during training by the GSL methods. In the baseline case, the adjacency matrix is initialized as described above and remains fixed during training.
- Information about the adjacency matrix is provided only for the training data, whereas the rest of the adjacency matrix is randomly initialized and has to be learned during training. In this scenario, an extra loss term is added to encode the available ground-truth connectivity of the training nodes.
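A minimal sketch of the baseline adjacency construction described above (a hypothetical helper of ours, assuming the top-$k$ criterion is applied per node, i.e., each node is connected to its $k$ nearest neighbors, and the result is symmetrized):

```python
import torch

def knn_adjacency(X: torch.Tensor, k: int) -> torch.Tensor:
    """Build a binary, sparse adjacency by connecting each node to its k closest nodes."""
    D = torch.cdist(X, X)                        # pairwise Euclidean distances
    D.fill_diagonal_(D.max().item())             # avoid self-loops
    nn_idx = D.topk(k, largest=False).indices    # k nearest neighbors of every node
    A = torch.zeros_like(D)
    A.scatter_(1, nn_idx, 1.0)
    return ((A + A.T) > 0).float()               # symmetrize -> undirected binary graph

# Example: 100 nodes with 16 features, 5 neighbors each.
A = knn_adjacency(torch.rand(100, 16), k=5)
```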
4.1 Datasets
We consider two citation network datasets, Citeseer (Rossi and Ahmed 2015) and Cora (Sen et al. 2008), as well as the Terrorist Attacks (B. Zhao et al. 2006) graph network dataset.
Regarding the citation networks, each dataset contains a set of scientific publications which are classified into classes according to their topic of study. Each publication is represented as a node, which contains a bag-of-words vector indicating which words in the vocabulary are present in the document. If the $i$th publication cites the $j$th, then there is an edge between nodes $i$ and $j$ ($A_{ij} = A_{ji} = 1$), hence forming an undirected binary graph. Regarding the evaluation scheme, the datasets are split into the training, validation, and test splits defined by (Z. Yang et al. 2016).
The Terrorist Attacks dataset consists of 1293 terrorist attacks classified according to the type of attack. Each attack is described by a feature vector containing a total of 106 distinct features. In this case, each attack is considered as a node, whereas the edges indicate the relations between the attacks. As this dataset is smaller than the previously mentioned ones, the methods are evaluated by performing stratified 5-fold cross-validation, taking one fold of the dataset for training and using the remaining folds for evaluation purposes. Notice that the size of the training set is considerably smaller than the validation set, unlike other common Machine Learning setups. The reason behind this decision is that, given that the main advantage of ASGL is its ability to learn the connectivity of a given graph, it is important to evaluate it on setups where the training set is relatively small, which translates to a setup where the connectivity is either fully unknown or only a small number of connections are known.
Table 1 shows statistics for each dataset, that is, the number of documents (or nodes), the number of features per document, the number of edges, and the number of classes, as well as the number of nodes in each split (i.e., training, validation and test splits).
4.2 Implementation Details
We implemented the proposed method with the PyTorch (Paszke et al. 2019) and PyTorch Geometric (Fey and Lenssen 2019) libraries. We use two separate instances of the Adam optimizer, one for the GCN and another for the adjacency matrix, both with the same learning rate. The adjacency matrix is randomly initialized from a uniform distribution between 0 and 1. In the experiments where the connectivity of the training split is given, we include such partial values into the randomly initialized adjacency matrix. To perform a proper comparison, the adjacency matrix in the baseline case is initialized to have a comparable proportion of the possible connections. Similarly, the L1 weighting hyperparameter of GLNN is set, via trial and error, to obtain a similar degree of sparseness on the learned graphs; this value depends on the setup. One of the main advantages of ASGL is that the degree of sparseness can be controlled simply by setting its sparseness hyperparameter. Regarding the weighting term for the feature smoothness loss, we follow the setup of GLNN. The input features are row-normalized for numerical stability. The models were trained for a fixed maximum number of epochs with an early stopping strategy, where training stopped if the validation loss did not improve after 500 epochs. The source code of the implementation will be made publicly available upon acceptance.
5 Results
Table 2 shows the results of the evaluation of the different methods on the three datasets when no information about the connectivity is provided, that is, the adjacency matrix has to be completely learned from scratch. To provide a better comparison, the experiment was repeated five times for every method and dataset, and the final result is the mean accuracy across the five repetitions. As can be seen, ASGL obtains the best results on the Terrorist dataset, with an average accuracy of 79.32%. ASGL also obtains the best results on the Cora dataset, surpassing both the baseline and GLNN. Regarding the CiteSeer dataset, ASGL obtains a lower result than GLNN, but still comparable. As expected, the accuracies are generally low since no information about the connectivity of the graph is given and the methods have to completely learn the adjacency matrix from scratch.
| Method | Terrorist | Citeseer | Cora |
|---|---|---|---|
| GCN (Kipf and Welling 2017) | | | |
| GLNN (Gao et al. 2020) | | 54.68 ± 13.56 | |
| ASGL | 79.32 ± 2.26 | 53.24 ± 7.43 | |
- Note: Bold values indicate the highest accuracy among all the rows within each respective table.
Table 3 shows the results of the evaluation when the adjacency matrix is partially known, that is, the edges connecting nodes in the training split are given. In this case, the baseline GCN shows no improvement in terms of accuracy; in fact, the accuracy slightly decreases on the CiteSeer dataset. This behavior is expected, as the baseline has no GSL mechanism that could generalize the structure of the training split to the test split. On the other hand, both GLNN and ASGL show considerable improvements in accuracy on the Citeseer and Cora datasets when compared to the results presented in Table 2, indicating that both methods are able to take advantage of the partial information about the connectivity of the training split. However, in the case of the Terrorist dataset, the accuracy remains similar and even decreases for the GLNN method. This is probably due to the fact that the Terrorist dataset is considerably less complex than the Citeseer and Cora datasets; therefore, it is easier for the GCN model to learn meaningful representations even when the graph structure is initialized at random, that is, GSL methods have less influence on the resulting accuracy.
| Method | Terrorist | Citeseer | Cora |
|---|---|---|---|
| GCN (Kipf and Welling 2017) | | | |
| GLNN (Gao et al. 2020) | | 63.98 ± 6.19 | |
| ASGL | 79.00 ± 4.23 | 67.94 ± 3.96 | |
- Note: Bold values indicate the highest accuracy among all the rows within each respective table.
In general, both ASGL and GLNN achieve similar mean accuracies across the experiments, but ASGL tends to show less variation in the results, that is, the standard deviation of the accuracies is lower, indicating that the results are more robust. Given that the main difference between GLNN and ASGL is the loss term controlling the sparseness, we hypothesize that, while L1 regularization encourages all values of the adjacency matrix to become small (i.e., makes all edges disappear), our proposed loss term only encourages the targeted fraction of edges to disappear. We believe this is the main cause of GLNN being less stable during training: the probability of learning a suboptimal adjacency matrix is higher, therefore leading to a lower accuracy and a higher variability across repetitions.
In addition to the increase in robustness, ASGL has a clear advantage over GLNN as it gives a clear mechanism for controlling the sparseness of the learned graph, which will be covered in the next section.
The main findings of these experiments can be summarized as follows:
- GSL methods greatly improve the accuracy when the adjacency matrix is fully or partially unavailable, especially as the graph becomes larger and more complex. For simpler datasets, the accuracy boost is still present, but less drastic.
- Both ASGL and GLNN give similar results, but ASGL tends to be more stable, that is, have less variability in the accuracy across repeated experiments.
- The accuracy of ASGL increases when partial information about the adjacency matrix is provided, compared to the experiments when no information was provided and the adjacency matrix has to be fully learned from scratch. Hence, the results show that it is capable of taking advantage of such information. On simpler graphs, such as Terrorist, the increase in accuracy is negligible.
5.1 Fine-Tuning the Learned Graph Sparseness via ASGL
The degree of sparseness of the learned graph can have a great impact on the performance of the algorithm. Being able to explicitly specify the degree of sparseness is desirable, as it enables studying the trade-off associated with the number of learned edges: a large number of edges can lower the performance and increase the computational cost, whereas a small number of edges can improve the interpretability of the graph but can also reduce the performance. Therefore, experiments with different degrees of sparseness are required in order to find the optimal one.
Comparing ASGL with the L1-based approach of GLNN, two main differences arise:
- In the case of ASGL, there is a one-to-one linear correlation between the sparseness hyperparameter and the percentage of edges in the learned graph. On the other hand, it is considerably more difficult to control the degree of sparseness in GLNN, as only a small range of values of the L1 weighting term has an impact on the sparseness. More specifically, for small values of this term, the percentage of edges in the learned graph remains at a fixed value. As the term increases, it reaches a point where the sparseness can be controlled, until a stagnation point is reached, where the percentage of edges remains 0 for increasingly larger values.
- Another important difference is that, in the case of ASGL, the degree of sparseness of the learned graph can be varied from 0% (lowest number of edges, maximum sparseness) to 100% (highest number of edges, minimum sparseness) of the total possible edges, whereas using the L1 loss limits the achievable sparseness to a much narrower range.

This analysis shows one of the key advantages of ASGL: the degree of sparseness (i.e., the total number of edges) of the learned graph can be easily controlled, unlike L1-based alternatives such as GLNN, where the degree of sparseness is constrained to a small range that is also difficult to control. Therefore, evaluating ASGL with different values of the sparseness hyperparameter enables finding the optimal value for the task at hand.
This behavior is analogous to the classical relationship between model capacity and generalization:
- If the capacity of a model is small, it might not have enough power to properly learn from the training dataset, leading to underfitting (i.e., low accuracy on the validation set because the model does not have enough capacity).
- On the other hand, if the capacity of the model is too large and no regularization is applied, it will be able to learn from the training dataset so well that it will overfit to this dataset and will not be able to generalize to the validation dataset.

In both extremes, the resulting models are not able to properly generalize; therefore, a middle ground has to be found to obtain a proper accuracy. As shown by this experiment, the same happens with the sparseness of the adjacency matrix: A smaller number of edges (more sparse) corresponds to underfitting, whereas a larger number of edges (less sparse) corresponds to overfitting. With ASGL, the sparseness can be easily controlled so that the optimal adjacency matrix can be found.
Moreover, the sparsity of the learned graph also has an impact on the training duration. Figure 6 shows the training duration for different values of the sparseness hyperparameter. It can clearly be seen that the time required for the same number of epochs increases as the hyperparameter (i.e., the number of edges) increases. This is expected, as more edges imply more complexity in terms of message propagation.

To summarize, given the results, it is clear that the degree of sparseness of the learned graph can have a considerable impact on both the accuracy and the computational cost of the model; therefore, different trials with varying degrees of sparseness are required, which can be easily carried out with ASGL, unlike its L1-based counterparts.
It is also insightful to consider the two extremes. When the sparseness hyperparameter is set to its minimum value, the graph exhibits maximum sparsity, meaning there are no edges between nodes. This situation is akin to training a classifier to predict the class of each node based solely on its input features, as there are no edges to provide relational information. Consequently, this lack of information typically results in lower accuracy, although the computational cost of training is reduced. In contrast, when the hyperparameter is set to its maximum value, the graph is fully connected, with every node linked to every other node. This often leads to overfitting and significantly increases the computational cost. In fact, we were unable to conduct experiments with the largest values of the hyperparameter due to excessive memory requirements.
5.2 Ablation Study
We also conducted an ablation study to quantify how each loss component of ASGL contributes to the total performance. Table 4 shows the results of the ablation study of ASGL on the Cora dataset. First, it shows that completely removing the connectivity information (i.e., removing the ground-truth connectivity term) causes a noticeable drop in accuracy, indicating that ASGL is able to take advantage of the partially available connectivity data.
| Model | Accuracy |
|---|---|
| ASGL (full) | |
| w/o ground-truth connectivity term | |
| w/o ground-truth connectivity and sparseness terms | |
| w/o ground-truth connectivity and feature smoothness terms | |
| w/o ground-truth connectivity, sparseness, and feature smoothness terms | |
Further removing the sparseness regularizer (i.e., removing both the ground-truth connectivity and sparseness terms) decreases the accuracy further with respect to the previous experiment. In this case, as no sparseness is enforced, the learned graph has a considerably large number of edges, which clearly impacts its performance and computational cost.
On the other hand, removing the feature smoothness term while keeping the sparseness regularizer (i.e., removing the ground-truth connectivity and feature smoothness terms) decreases the accuracy even further than the previous experiment. In this case, the learned graph is sparse, but the edges are learned at random, as there is no feature smoothness term enforcing connections between similar nodes. Hence, as expected, the results are worse than in the previous experiment: there, even though the resulting graph had a large number of edges, some of those edges were relevant for the prediction, whereas here the number of edges is considerably smaller but also randomly selected.
Interestingly, removing every term (ground-truth connectivity, sparseness, and feature smoothness) yields better results than keeping only the sparseness term (the previous configuration). However, this is expected, as only applying the sparseness term results in a graph with a small number of randomly learned edges. In this experiment the edges are also randomly set, but no sparseness is enforced, that is, the number of edges is higher; therefore, more information can be propagated and the accuracy is expected to be slightly higher.
6 Conclusion
We have proposed ASGL, a direct optimization GSL method for semi-supervised node classification where the connectivity of the graph is fully or partially unavailable. By performing multiple experiments on several datasets, we have shown that ASGL achieves competitive results compared to the state-of-the-art.
Moreover, we have shown that current methods based on L1 regularization offer little to no control over the degree of sparseness of the learned graph structure, whereas ASGL enables full control via a single hyperparameter, which is linearly correlated with the percentage of edges in the learned graph.
This work provides both theoretical and practical insights into controlling the degree of sparseness in graph learning. From a theoretical perspective, our approach offers a more principled mechanism for sparsity control compared to conventional L1 regularization. While L1 regularization is commonly used to enforce sparsity, it has inherent limitations in range control, as it saturates at a certain point and does not allow fine-grained manipulation of the number of edges. In contrast, our method enables precise and linear control over the sparsity level through a dedicated hyperparameter, providing greater flexibility and interpretability in graph structure learning. From a practical standpoint, the ability to adjust the density of the learned graph is highly desirable for real-world applications, where different domains exhibit varying levels of connectivity. Some problems require highly dense graphs to capture intricate relationships, while others benefit from sparser structures for improved efficiency and generalization. By allowing explicit control over graph density, our method empowers practitioners to tailor the learned graph structure to the specific needs of their applications, striking a balance between computational cost, model interpretability, and predictive performance.
We also show that being able to properly fine-tune the degree of sparseness of the learned graph can considerably improve performance, and allows studying the trade-off associated with the number of edges: as shown by the experiments, too many edges can lower the performance and increase the computational cost, whereas a smaller number of edges reduces the computational cost and is easier to interpret, but performance may also decrease. Hence, being able to properly control the degree of sparseness enables finding the optimal result.
Despite the advantages and improvements of ASGL, it is also important to acknowledge its limitations. One limitation is its applicability to highly dynamic graphs, where the structure changes frequently over time. In such scenarios, ASGL may struggle to adapt quickly to evolving connectivity patterns, potentially leading to suboptimal performance. However, it is important to remark that dynamic graph structure learning remains an open challenge (Z. L. Li et al. 2023). Additionally, given that ASGL has a loss term that encourages that similar nodes should be connected, it is susceptible to noisy feature data. However, it is important to remark that these limitations were already present in previous direct optimization approaches to GSL.
Future work will be directed towards solving these challenges. Specifically, we will explore adaptive mechanisms for dynamic graphs, such as online or incremental learning, to enhance the applicability of ASGL in evolving network structures. Additionally, we aim to investigate robustness to noisy features and the integration of ASGL with other GNN architectures, such as Graph Attention Networks, to improve performance across diverse real-world applications. Another key focus will be on optimizing the joint learning process of both the node classifier and the adjacency matrix. Currently, the two-step optimization scheme can cause the adjacency matrix to converge rapidly, limiting the information flow between the learned graph structure and the classifier. Addressing this limitation will be crucial in refining the ability of ASGL to generalize and adapt to complex graph scenarios.
Acknowledgments
This work has been co-funded by the AETHER-UA project (PID2020-112540RB-C43) funded by the Spanish Ministry of Science and Innovation, and the BALLADEER (PROMETEO/2021/088) project, funded by the Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital (Generalitat Valenciana). Jorge García-Carrasco holds a predoctoral contract (CIACIF/2021/454) granted by the Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital (Generalitat Valenciana). This work is part of the TSI-100927-2023-6 Project, funded by the Recovery, Transformation, and Resilience Plan from the European Union Next Generation through the Ministry for Digital Transformation and the Civil Service.
Conflicts of Interest
The authors declare no conflicts of interest.
Open Research
Data Availability Statement
The data that support the findings of this study are openly available in GitHub at https://github.com/jgcarrasco/asgl.