Volume 2025, Issue 1 9994758
Research Article
Open Access

Comprehensive Network Analysis of Lung Cancer Biomarkers Identifying Key Genes Through RNA-Seq Data and PPI Networks

Meshrif Alruily
Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka, 72388, Saudi Arabia, ju.edu.sa

Murtada K. Elbashir
Information Systems Department, College of Computer and Information Sciences, Jouf University, Sakaka, 72388, Saudi Arabia, ju.edu.sa

Mohamed Ezz
Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka, 72388, Saudi Arabia, ju.edu.sa

Bader Aldughayfiq
Information Systems Department, College of Computer and Information Sciences, Jouf University, Sakaka, 72388, Saudi Arabia, ju.edu.sa

Majed Abdullah Alrowaily
Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka, 72388, Saudi Arabia, ju.edu.sa

Hisham Allahem
Information Systems Department, College of Computer and Information Sciences, Jouf University, Sakaka, 72388, Saudi Arabia, ju.edu.sa

Mohanad Mohammed
Department of Mathematics, School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg Private Bag X01, Scottsville, 3209, South Africa, ukzn.ac.za

Elsayed Mostafa
Engineering & Research International (ERI), Riyadh, Saudi Arabia

Ayman Mohamed Mostafa (Corresponding Author)
Information Systems Department, College of Computer and Information Sciences, Jouf University, Sakaka, 72388, Saudi Arabia, ju.edu.sa

First published: 10 February 2025
Academic Editor: Eugenio Vocaturo

Abstract

This study addresses the pressing need for improved lung cancer diagnosis and treatment by leveraging computational methods and omics data analysis. Lung cancer remains a leading cause of cancer-related deaths globally, highlighting the urgency for more effective diagnostic and therapeutic approaches. Current diagnostic methods, such as imaging and biopsies, suffer from limitations in sensitivity, specificity, and accessibility, often due to factors such as poor data quality, small sample sizes, and variability in data sources. These limitations highlight the necessity for the development of advanced noninvasive techniques. Computational methods utilizing omics data have shown promise in overcoming these challenges by comprehensively understanding the molecular pathways involved in lung cancer. We propose a novel approach that utilizes RNA-Seq data and employs LASSO regression with attention mechanisms to identify lung cancer biomarkers. Our results demonstrate the effectiveness of this approach in identifying potential biomarkers for lung cancer, including well-known genes such as TP53, EGFR, KRAS, ALK, and PIK3CA, validating the model’s ability to uncover key genes associated with lung cancer development and progression. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses revealed significant associations of the identified genes with critical biological processes and pathways, including protein synthesis, folding, cell adhesion, gene regulation, and immune responses. The PPI network analysis, constructed using the STRING database and Cytoscape application, highlighted a highly interconnected interaction landscape, with central hub genes playing pivotal roles in lung cancer progression. RPSA emerged as a crucial hub gene, consistently identified across different centrality measures. 
This study sheds light on the potential of computational methods and omics data analysis in improving lung cancer diagnosis and treatment, offering new insights for future research directions and personalized medicine strategies.

1. Introduction

Lung cancer is among the leading causes of cancer mortality worldwide and is one of the most dangerous types of cancer [1]. It is a malignant neoplastic disease characterized by the uncontrolled growth of lung cells into tumors [2]. Lung cancer comes in two primary forms: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), of which NSCLC is the more common [3]. Causes of lung cancer include tobacco smoking, exposure to tobacco smoke, radon, asbestos in the environment, and genetic factors [4]. Symptoms vary between patients, but the most common signs include weight loss, wheezing, chest pain, and a persistent cough [5]. Imaging examinations such as computed tomography and conventional radiography are the usual diagnostic techniques in lung cancer, and biopsies are usually performed for confirmation [6]. Although widely used, these traditional methods have limitations. For example, they may fail to identify small or early-stage tumors, producing false negative results. Sputum cytology may be imprecise, particularly when the sputum sample is small or random [7]. These methods also have limited screening capabilities, are often costly, inaccessible, and time-consuming, and provide limited information. This highlights the need for more advanced and noninvasive diagnostic approaches, such as molecular biomarker testing and advanced imaging, to improve early detection and treatment outcomes in lung cancer. Compared to traditional techniques, computational methods utilizing omics data have demonstrated considerable promise in improving lung cancer detection and therapy [8]. By combining data from many omics layers, these methods offer a more thorough understanding of the molecular pathways underlying lung cancer and can potentially overcome the drawbacks of traditional diagnostic methods [9].

Recent advances in high-throughput technologies that produce genomic, imaging, proteomic, and gene expression data have paved the way for discovering biomarkers. These biomarkers are being utilized for early diagnosis and prognosis of lung cancer. Biomarkers play a vital role in lung cancer diagnosis and prediction; among these are lung cancer microRNAs (miRNAs), several of which have been used extensively for modeling [10]. Researchers have used lung ADC and LCSG-1 datasets focusing on Group 1 and Group 1 combined healthy controls, bronchial lung cancer datasets, LC–MS proteomics, and TCGA datasets for predicting and identifying lung cancer biomarkers. Xue et al. analyzed the genomes of lung, liver, kidney, cervical, and breast cancers to identify gene coexpression characteristics across various cancers [11]. They found that genes such as TOP2A, ECT2, RRM2, ANLN, NEK2, ASPM, BUB1B, CDK1, DTL, and PRC1 are linked to patient prognosis, with ASPM potentially associated with immune cell infiltration. These genes could serve as therapeutic targets, aiding personalized treatments and basket clinical trials. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses indicated that these differentially expressed genes (DEGs) are involved in critical processes such as cell division and the cell cycle.

In addition, the study highlighted the importance of further research on the interplay between these genes and immune responses. Park et al. created an approach for early lung cancer diagnosis that addresses the critical need to improve the current low survival rate [12]. They integrated mRNA expression, DNA methylation, and DNA sequencing data to create a multiomics data-affinitive artificial intelligence algorithm based on a graph convolutional network. With a high macro F1-score of 93.7%, their prediction model demonstrated minimal rates of false positives and negatives. Both distinct and shared biological processes for lung adenocarcinoma and lung squamous cell carcinoma were identified using Gene Ontology (GO) enrichment and pathway analysis, which also improved the precision and accuracy of NSCLC diagnosis by identifying a large number of new and reliable biomarkers. Maharjan et al. developed a computational framework to identify lung cancer biomarker genes using gene expression data from the GEO database [13].

The researchers analyzed nontreatment and treatment studies and discovered 32 biomarker genes with different prognostic characteristics. Nontreatment biomarkers generally indicated better survival for low-expression groups, while treatment biomarkers showed better survival for high-expression groups. These biomarkers are linked to crucial processes such as tumor progression, cell cycle regulation, and ubiquitination, highlighting their potential for diagnosis and therapeutic development in lung cancer. Luo et al. [14] presented an integrated approach to identifying plasma exosomal biomarkers for early detection and metastasis prediction in NSCLC. Two distinct biomarker panels were identified by isolating and characterizing exosomes from NSCLC patients and healthy individuals and performing a comprehensive proteomics analysis. The first panel, including FGB, FGG, and VWF, shows potential for early diagnosis and correlates with patient survival. The second panel, featuring CFHR5, C9, and MBL2, is associated with metastasis, with CFHR5 notably linked to overall survival. These findings highlight the promise of exosomal biomarkers in improving NSCLC management, aiding in early detection, and monitoring metastasis, ultimately enhancing patient outcomes. Xie et al. introduced an interdisciplinary approach to the early detection of lung cancer by integrating metabolomics with machine learning methods [15]. Their research analyzed plasma metabolites from 110 lung cancer patients and 43 healthy individuals and identified a combination of six metabolic biomarkers that can accurately discriminate between Stage I lung cancer patients and healthy individuals.

Previous lung cancer biomarker identification studies have typically relied on conventional methods, which leads to insufficient prediction accuracy, and they have lacked advanced network analysis, so they fail to capture complex protein–protein interactions [16]. In addition, these studies often blur the distinction between prognostic and predictive biomarkers and struggle with clinical validation and translation. In contrast, the proposed approach uses RNA-Seq data and least absolute shrinkage and selection operator (LASSO) regression with an attention mechanism to identify potential lung cancer biomarkers. In this study, we identify key lung cancer biomarkers, including well-known cancer genes such as tumor protein p53 (TP53), epidermal growth factor receptor (EGFR), KRAS proto-oncogene (KRAS), anaplastic lymphoma kinase (ALK), and phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha (PIK3CA).

We compared the results with other biomarker identification studies to ensure clinical relevance. Our method is expected to improve the accuracy of biomarker identification and patient outcome predictions, facilitating personalized treatment strategies and bridging the gap between biomarker discovery and clinical application for better patient management and outcomes.

2. Materials and Methods

2.1. Dataset

The lung cancer gene expression dataset was obtained from the Pan-Cancer Atlas [17] using RStudio, with the query formulated through the GDCquery function of the TCGAbiolinks library [18]. The Genomic Data Commons (GDC), established by the National Cancer Institute (NCI), provides unified data storage for sharing data across cancer genomic studies. The GDCquery function takes several parameters, including project, legacy, data.category, data.type, platform, file.type, experimental.strategy, and sample.type. The project argument specifies the project to be downloaded from the Pan-Cancer Atlas, with “TCGA-LUAD” (lung adenocarcinoma) being the value used for lung cancer. The legacy argument, set to true in this case, directs the query to the legacy repository, which holds unmodified copies of data previously stored in the TCGA Data Portal. Each project has a specific data category; here, data.category is set to “Gene expression”. The data.type argument filters the files to download based on data type and is set here to “Gene expression quantification” for RNA-Seq read counts per gene. The platform parameter selects the sequencing platform, with “Illumina HiSeq” chosen in this case. For file.type, “results” is used for the legacy database. The experimental.strategy argument offers options such as RNA-Seq, miRNA-Seq, and genotyping array, with RNA-Seq selected for generating the gene expression profile. The sample.type argument is set to “primary solid tumor” and “solid tissue normal” so that gene expression data are downloaded for tumor and normal cases only. The downloaded LUAD data are then transformed into a matrix format, where columns represent samples or cases, and rows represent genomic ranges of interest. The LUAD dataset comprises 598 clinical samples and 60,660 genes. To reduce the number of genes, preprocessing steps are applied to select genes that positively contribute to lung cancer development and progression.

2.2. DEGs

Differential expression analysis (DEA) is a critical step in RNA-Seq data analysis, particularly for identifying DEGs across experimental conditions [19]. DEA compares the expression levels of genes between different sample groups to identify genes that show significant changes in expression. We used the DESeq2 package in R to perform DEA on the lung cancer RNA-Seq data. DESeq2 is widely used in bioinformatics and computational biology for identifying genes that are differentially expressed between experimental conditions, such as different treatments or disease states. DESeq2 employs a model based on the negative binomial distribution to estimate the gene-wise dispersion parameter and then tests for differential expression using statistical tests such as the Wald test. The negative binomial distribution is well-suited for modeling RNA-Seq count data because it accommodates the overdispersion often observed in such data, where the variance is greater than the mean, and it can be defined mathematically as
$$P(Y = y) = \binom{y + r - 1}{y}\, p^{r} (1 - p)^{y}$$
where $P(Y = y)$ is the probability of observing $y$ counts, $r$ is the size parameter of the distribution (related to the dispersion), and $p$ is the probability of success in a single trial (related to the mean). In the lung cancer RNA-Seq data, DESeq2 identified 27,758 genes that are differentially expressed between the sample groups in the dataset. These genes show significant changes in expression levels at a p value threshold of 0.05.
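
To make the overdispersion argument concrete, the following is a minimal sketch of the negative binomial probability mass function in the size and success-probability parameterization described above. The values of `r` and `p` are hypothetical and chosen only for illustration; this is not DESeq2's estimation procedure, which fits dispersions per gene.

```python
import math


def nb_pmf(y: int, r: float, p: float) -> float:
    """P(Y = y) for a negative binomial with size r and success probability p."""
    # Log of the generalized binomial coefficient C(y + r - 1, y); lgamma allows non-integer r.
    log_coeff = math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
    return math.exp(log_coeff + r * math.log(p) + y * math.log1p(-p))


def nb_mean_var(r: float, p: float) -> tuple:
    """Closed-form mean and variance for this parameterization."""
    mean = r * (1.0 - p) / p
    var = r * (1.0 - p) / p ** 2
    return mean, var


# Hypothetical parameters, not values estimated from the LUAD data.
r, p = 5.0, 0.1
mean, var = nb_mean_var(r, p)
total_mass = sum(nb_pmf(y, r, p) for y in range(2000))

print(f"mean={mean:.1f}, variance={var:.1f}, total probability={total_mass:.6f}")
```

Because the variance equals the mean divided by `p`, it exceeds the mean whenever `p` is below one, which is the overdispersion property that motivates this distribution for count data.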

2.3. LASSO Regression

LASSO is a type of linear regression that enhances model simplicity and interpretability by adding a penalty equal to the absolute value of the magnitude of coefficients, which helps in feature selection by shrinking less important feature coefficients to zero. The cost function for LASSO regression is shown in the following equation:
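$$J(\beta) = \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j} \left|\beta_j\right|$$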
where $n$ is the number of observations, $y_i$ and $\hat{y}_i$ are the actual and predicted values, $\beta_j$ are the feature coefficients, and $\alpha$ is the regularization parameter. To implement LASSO regression, we split the RNA-Seq data into training and testing sets, initialized the LASSO model with a chosen regularization parameter ($\alpha = 1.0$), and fit the model to the training data. LASSO regression reduced the 27,758 differentially expressed genes (features) to 1019 genes.
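
The way the L1 penalty drives coefficients to exactly zero can be sketched with a small coordinate-descent solver. This is an illustrative implementation under stated assumptions: the data are synthetic, no intercept is fitted, the objective is the squared error scaled by 1/(2n) plus the L1 penalty, and the function names are hypothetical. It is not the implementation used in the study.

```python
import random


def soft_threshold(rho: float, alpha: float) -> float:
    """Shrink rho toward zero by alpha; values inside [-alpha, alpha] become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * sum((y - X beta)^2) + alpha * sum(|beta|) without an intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j's own contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Synthetic data: only features 0 and 3 carry signal (hypothetical stand-in for expression data).
rng = random.Random(0)
n_samples, n_features = 80, 8
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0.0, 0.1) for row in X]

beta = lasso_coordinate_descent(X, y, alpha=1.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected feature indices:", selected)
```

The retained coefficients are shrunk relative to the true values of 3 and -2, which is the usual price of the penalty; feature selection relies on which coefficients remain nonzero rather than on their magnitudes.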

2.4. The Attention Mechanism

The attention mechanism in neural networks, particularly in sequence-to-sequence models, enhances the model’s ability to focus on relevant parts of the input sequence for each output step. This mechanism computes a score for each input based on the current decoder’s hidden state using the score function
$$e_{t,i} = \operatorname{score}(s_{t-1}, h_i) = s_{t-1}^{\top} W h_i$$
where $h_i$ is the encoder’s hidden state for the ith input, $s_{t-1}$ is the previous decoder’s hidden state, and $W$ is a learned weight matrix. The alignment scores $e_{t,i}$ are then normalized using the softmax function to obtain the attention weights shown in the following equation:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k} \exp(e_{t,k})}$$
where $\alpha_{t,i}$ represents the attention weight for the ith input at time step t. These attention weights are used to compute the context vector
$$C_t = \sum_{i} \alpha_{t,i}\, h_i$$
The context vector is a weighted sum of the encoder’s hidden states. Finally, the context vector is combined with the decoder’s hidden state to produce the attention output
$$a_t = \tanh\left(W_C\, [C_t; s_{t-1}]\right)$$
where $[C_t; s_{t-1}]$ is the concatenation of the context vector and the previous decoder’s hidden state and $W_C$ is a learned weight matrix. This attention output is then used to generate the next output token, enabling the model to dynamically focus on different parts of the input sequence as needed.

Our approach to biomarker identification uses a simplified attention mechanism implemented in PyTorch as a basic single-layer component that dynamically weighs the importance of input features. The input dimension of the attention layer is the number of genes (features) retained by LASSO regression, which is 1019. Upon initialization, the layer creates one learnable weight per input feature using PyTorch’s built-in function torch.rand(), and these weights are updated during training. During the forward pass, the weights are normalized using a softmax function so that they sum to one, making them interpretable as probabilities. The normalized weights are then applied to the input features through elementwise multiplication, resulting in an attention-weighted output. This process allows the model to emphasize more relevant features while downplaying less important ones. The model returns both the attention-weighted output and the attention weights, which makes it possible to inspect which features are being prioritized during the computation.

This mechanism allows the model to focus on the most relevant genes and sort them based on their importance, as determined by the attention weights, thereby enhancing the interpretability and performance of the downstream analysis or classification tasks. The workflow of the gene biomarkers’ identification method is depicted in Figure 1.
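
The forward pass and the ranking step described above can be sketched as follows. This is a framework-free illustration in plain Python rather than the PyTorch module used in the study: the gene names, raw weights, and expression values are hypothetical, and training (which would update the raw weights) is omitted.

```python
import math


def softmax(raw_weights):
    """Normalize raw weights into positive values that sum to one."""
    shift = max(raw_weights)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(w - shift) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def attention_forward(features, raw_weights):
    """Return the attention-weighted features and the attention weights."""
    attn = softmax(raw_weights)
    weighted = [x * a for x, a in zip(features, attn)]
    return weighted, attn


def rank_features(names, attn, top_k):
    """Sort feature names by descending attention weight and keep the first top_k."""
    order = sorted(range(len(names)), key=lambda i: attn[i], reverse=True)
    return [names[i] for i in order[:top_k]]


# Hypothetical placeholders: four features instead of the 1019 genes kept by LASSO.
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
raw_weights = [0.2, 1.5, -0.3, 0.9]   # stand-in for trained parameters
expression = [2.0, 1.0, 4.0, 3.0]     # one sample's expression values

weighted, attn = attention_forward(expression, raw_weights)
top_genes = rank_features(gene_names, attn, top_k=2)
print("attention weights:", [round(a, 3) for a in attn])
print("top genes:", top_genes)
```

In the study, the same ranking idea is applied to the trained weights to obtain the top 500 genes; here `top_k=2` only keeps the toy example readable.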

Figure 1. Workflow of biomarkers’ identification.

3. Experimental Results

3.1. The GO Term Analysis of the Top 500 Genes Identified by the Attention Mechanism

To investigate the potential biological functions of the top 500 genes identified by the attention mechanism, GO enrichment and KEGG pathway analyses were conducted. The GO enrichment analysis was classified into three main categories: biological process, molecular function, and cellular component. This analysis facilitates the classification of DEGs based on their functional clusters or GO groups. The ranking of genes was based on the adjusted p value calculated using Fisher’s exact test, which provides an estimate of the likelihood of each gene belonging to a specific GO term category. Fisher’s test uses a 2 × 2 contingency table to determine the significance of the overlap between the input gene set and a specific GO term. The p value is calculated as [20, 21]
$$p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,(a+b+c+d)!}$$
where a, b, c, and d are values from the contingency table. To correct for multiple testing, the adjusted p value was obtained using the Benjamini–Hochberg method to control the false discovery rate (FDR).
The enrichment analysis also assessed the strength of association using the odds ratio (OR), defined as follows [22]:
$$\mathrm{OR} = \frac{a \cdot d}{b \cdot c}$$

An OR greater than one indicates enrichment, with larger values suggesting a stronger association between the gene set and the GO term.
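
The three quantities above (the exact-test probability, the OR, and the Benjamini–Hochberg adjustment) can be sketched in a few lines. The cell layout is an assumption made for illustration: `a` is the number of input genes annotated to the term, `b` the input genes not annotated, `c` the background genes annotated, and `d` the background genes not annotated. The one-sided tail sum is also an illustrative choice; the counts and p values are toy numbers, not data from the study.

```python
from math import comb


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Hypergeometric probability of one 2 x 2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_right_tail(a: int, b: int, c: int, d: int) -> float:
    """One-sided p value: probability of an overlap of at least a under fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return numerator / comb(n, col1)


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Cross-product ratio of the contingency table."""
    return (a * d) / (b * c)


def benjamini_hochberg(pvalues):
    """Adjusted p values that control the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so each adjusted value stays monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy contingency table and toy p values (hypothetical).
a, b, c, d = 3, 1, 1, 3
p_single = table_probability(a, b, c, d)
p_tail = fisher_right_tail(a, b, c, d)
ratio = odds_ratio(a, b, c, d)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_single, p_tail, ratio, adjusted)
```

Enrichment tools differ in details such as background size and tail choice, so this sketch should be read as an illustration of the definitions rather than a reproduction of any tool's output.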

In addition, Enrichr calculates a combined score, which integrates the significance and magnitude of enrichment. The combined score is computed as [23]
$$c = \ln(p) \cdot z$$
where p is the raw p value, and the z-score represents the deviation of the observed gene set association from the expected distribution. Higher scores indicate stronger evidence for enrichment, combining statistical significance and magnitude of association.
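
As a consistency check on the tabulated values, the short script below recomputes a score from the p values and ORs reported for the first four biological process terms in Table 1. It rests on an observation rather than on anything the text states: the reported combined scores appear to be reproduced by the negative natural logarithm of the p value multiplied by the OR, which suggests that the OR served as the magnitude term. The function name is hypothetical.

```python
import math

# (term, p value, odds ratio, reported combined score), copied from Table 1.
table1_rows = [
    ("Cytoplasmic translation", 6.35603e-17, 13.38394729, 499.148189),
    ("Peptide biosynthetic process", 2.68335e-15, 8.439923826, 283.1738786),
    ("Macromolecule biosynthetic process", 1.1148e-13, 7.078224101, 211.107531),
    ("Translation", 2.20274e-10, 5.087552743, 113.1275861),
]


def neg_log_p_times_or(p_value: float, odds: float) -> float:
    """Assumed relation: -ln(p) multiplied by the odds ratio."""
    return -math.log(p_value) * odds


recomputed = [neg_log_p_times_or(p, odds) for _, p, odds, _ in table1_rows]
for (term, _, _, reported), value in zip(table1_rows, recomputed):
    print(f"{term}: recomputed {value:.3f}, reported {reported:.3f}")
```

Small differences in the last digits are expected because the tabulated p values are rounded.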

The enriched GO terms within the biological process category revealed significant associations of the top 500 genes with several critical biological processes. Specifically, the genes showed substantial enrichment in cytoplasmic translation (GO: 0002181, p value and adjusted p value of 6.35603E − 17 and 1.76761E − 13, respectively), peptide biosynthetic process (GO: 0043043, p value and adjusted p value of 2.68335E − 15 and 3.7312E − 12, respectively), and macromolecule biosynthetic process (GO: 0009059, p value and adjusted p value of 1.1148E − 13 and 1.03342E − 10, respectively). These terms also show large ORs and combined scores. For cytoplasmic translation (GO: 0002181), the OR of 13.38 means that the odds of a gene in our set being annotated to cytoplasmic translation are over 13 times the odds for the background. Because the p value and the adjusted p value are both statistically significant, this large OR indicates strong enrichment. A large combined score typically means that the GO term is both highly enriched and highly significant.

Other significant processes included translation (GO: 0006412, adjusted p value: 1.53145E − 07), gene expression (GO: 0010467, adjusted p value: 2.28563E − 07), and the regulation of programmed cell death, particularly negative regulation of programmed cell death (GO: 0043069, adjusted p value: 4.33769E − 06), negative regulation of the apoptotic process (GO: 0043066, adjusted p value: 7.59821E − 06), and regulation of the apoptotic process (GO: 0042981, adjusted p value: 6.91463E − 05). Additional processes of interest included the positive regulation of the protein metabolic process (GO: 0051247, adjusted p value: 0.000160396) and supramolecular fiber organization (GO: 0097435, adjusted p value: 0.001440254).

These enriched GO terms indicate that the identified genes play crucial roles in various fundamental biological processes, such as biosynthesis, translation, gene expression, cell death regulation, and protein metabolism. This enrichment analysis underscores the functional importance of the top genes selected by the attention mechanism, highlighting their potential involvement in essential cellular and molecular activities as presented in Table 1.

Table 1. Top 10 enriched GO terms in which the top 500 genes obtained by the attention mechanism were significantly enriched in the biological process group.
| Term | p value | Adjusted p value | Odds ratio | Combined score |
|---|---|---|---|---|
| Cytoplasmic translation (GO: 0002181) | 6.35603E − 17 | 1.76761E − 13 | 13.38394729 | 499.148189 |
| Peptide biosynthetic process (GO: 0043043) | 2.68335E − 15 | 3.7312E − 12 | 8.439923826 | 283.1738786 |
| Macromolecule biosynthetic process (GO: 0009059) | 1.1148E − 13 | 1.03342E − 10 | 7.078224101 | 211.107531 |
| Translation (GO: 0006412) | 2.20274E − 10 | 1.53145E − 07 | 5.087552743 | 113.1275861 |
| Gene expression (GO: 0010467) | 4.10936E − 10 | 2.28563E − 07 | 4.43519645 | 95.85605619 |
| Negative regulation of programmed cell death (GO: 0043069) | 9.35855E − 09 | 4.33769E − 06 | 3.61650929 | 66.85831883 |
| Negative regulation of the apoptotic process (GO: 0043066) | 1.91253E − 08 | 7.59821E − 06 | 3.208270188 | 57.01819089 |
| Regulation of the apoptotic process (GO: 0042981) | 1.98911E − 07 | 6.91463E − 05 | 2.605445672 | 40.20309494 |
| Positive regulation of the protein metabolic process (GO: 0051247) | 5.19081E − 07 | 0.000160396 | 4.312308397 | 62.4043037 |
| Supramolecular fiber organization (GO: 0097435) | 5.17891E − 06 | 0.001440254 | 3.160831706 | 38.47021962 |

We also examined the enriched GO term results within the cellular component category. The analysis revealed that the top genes obtained by the attention mechanism are significantly enriched and associated with several key cellular components. Specifically, the terms include cell–substrate junction (GO: 0030055, p value and adjusted p value of 8.9967E − 24 and 2.56406E − 21, respectively), focal adhesion (GO: 0005925, p value and adjusted p value of 1.51735E − 22 and 2.16222E − 20, respectively), collagen-containing extracellular matrix (GO: 0062023, p value and adjusted p value of 3.21243E − 16 and 3.05181E − 14, respectively), intracellular organelle lumen (GO: 0070013, p value and adjusted p value of 1.09265E − 15 and 7.78513E − 14, respectively), secretory granule lumen (GO: 0034774, p value and adjusted p value of 1.70095E − 13 and 9.69544E − 12, respectively), endoplasmic reticulum lumen (GO: 0005788, p value and adjusted p value of 1.19821E − 12 and 5.6915E − 11, respectively), cytosolic large ribosomal subunit (GO: 0022625, p value and adjusted p value of 2.30422E − 11 and 8.20879E − 10, respectively), large ribosomal subunit (GO: 0015934, p value and adjusted p value of 2.30422E − 11 and 8.20879E − 10, respectively), ficolin-1-rich granule (GO: 0101002, p value and adjusted p value of 2.35055E − 10 and 7.4434E − 09, respectively), and ficolin-1-rich granule lumen (GO: 1904813, p value and adjusted p value of 1.67074E − 09 and 4.76161E − 08, respectively). These terms also show large ORs and combined scores, indicating that they are highly enriched and highly significant. This enrichment analysis highlights the cellular components with which the genes are significantly associated, providing insights into the underlying biological processes and pathways involved, as presented in Table 2.

Table 2. Top 10 enriched GO terms in which the top 500 genes obtained by the attention mechanism were significantly enriched in the cellular component group.
| Term | p value | Adjusted p value | Odds ratio | Combined score |
|---|---|---|---|---|
| Cell–substrate junction (GO: 0030055) | 8.9967E − 24 | 2.56406E − 21 | 6.641901174 | 352.4537083 |
| Focal adhesion (GO: 0005925) | 1.51735E − 22 | 2.16222E − 20 | 6.478444162 | 325.4764267 |
| Collagen-containing extracellular matrix (GO: 0062023) | 3.21243E − 16 | 3.05181E − 14 | 5.310742886 | 189.4572068 |
| Intracellular organelle lumen (GO: 0070013) | 1.09265E − 15 | 7.78513E − 14 | 3.534257524 | 121.7557745 |
| Secretory granule lumen (GO: 0034774) | 1.70095E − 13 | 9.69544E − 12 | 5.148012092 | 151.3639974 |
| Endoplasmic reticulum lumen (GO: 0005788) | 1.19821E − 12 | 5.6915E − 11 | 5.222629223 | 143.3621776 |
| Cytosolic large ribosomal subunit (GO: 0022625) | 2.30422E − 11 | 8.20879E − 10 | 14.7535196 | 361.3681804 |
| Large ribosomal subunit (GO: 0015934) | 2.30422E − 11 | 8.20879E − 10 | 14.7535196 | 361.3681804 |
| Ficolin-1-rich granule (GO: 0101002) | 2.35055E − 10 | 7.4434E − 09 | 5.791853848 | 128.4123635 |
| Ficolin-1-rich granule lumen (GO: 1904813) | 1.67074E − 09 | 4.76161E − 08 | 6.898043865 | 139.4094596 |

We also investigated the enriched GO term results within the molecular function category, which revealed that the top genes are significantly enriched and associated with several crucial molecular functions. Notably, these include cadherin binding (GO: 0045296, p value and adjusted p value of 4.26636E − 14 and 2.06492E − 11, respectively), RNA binding (GO: 0003723, p value and adjusted p value of 3.70234E − 12 and 8.95967E − 10, respectively), MHC Class II protein complex binding (GO: 0023026, p value and adjusted p value of 5.08015E − 09 and 8.19597E − 07, respectively), actin binding (GO: 0003779, p value and adjusted p value of 5.92648E − 05 and 0.007171046, respectively), kinase binding (GO: 0019900, p value and adjusted p value of 0.000104042 and 0.0087715, respectively), WW domain binding (GO: 0050699, p value and adjusted p value of 0.000108738 and 0.0087715, respectively), peptidase inhibitor activity (GO: 0030414, p value and adjusted p value of 0.00028506 and 0.017020128, respectively), protease binding (GO: 0002020, p value and adjusted p value of 0.000305081 and 0.017020128, respectively), aldo–keto reductase (NADP) activity (GO: 0004033, p value and adjusted p value of 0.00031649 and 0.017020128, respectively), and ubiquitin-like protein ligase binding (GO: 0044389, p value and adjusted p value of 0.000379176 and 0.018352098, respectively). These terms also show large ORs and combined scores. This enrichment analysis underscores the molecular functions with which the genes are significantly associated, offering insights into the underlying biological mechanisms and pathways involved, as shown in Table 3.

Table 3. Top 10 enriched GO terms in which the top 500 genes obtained by the attention mechanism were significantly enriched in the molecular function group.
| Term | p value | Adjusted p value | Odds ratio | Combined score |
|---|---|---|---|---|
| Cadherin binding (GO: 0045296) | 4.26636E − 14 | 2.06492E − 11 | 5.268459851 | 162.1918067 |
| RNA binding (GO: 0003723) | 3.70234E − 12 | 8.95967E − 10 | 2.600121641 | 68.44054477 |
| MHC Class II protein complex binding (GO: 0023026) | 5.08015E − 09 | 8.19597E − 07 | 22.3212831 | 426.2902129 |
| Actin binding (GO: 0003779) | 5.92648E − 05 | 0.007171046 | 3.558910162 | 34.64063152 |
| Kinase binding (GO: 0019900) | 0.000104042 | 0.0087715 | 2.409710475 | 22.098775 |
| WW domain binding (GO: 0050699) | 0.000108738 | 0.0087715 | 13.12121212 | 119.7516991 |
| Peptidase inhibitor activity (GO: 0030414) | 0.00028506 | 0.017020128 | 7.627922163 | 62.26529151 |
| Protease binding (GO: 0002020) | 0.000305081 | 0.017020128 | 3.825314821 | 30.96566598 |
| Aldo–keto reductase (NADP) activity (GO: 0004033) | 0.00031649 | 0.017020128 | 15.71774194 | 126.6570058 |
| Ubiquitin-like protein ligase binding (GO: 0044389) | 0.000379176 | 0.018352098 | 2.649798656 | 20.87381857 |

3.2. KEGG Pathway Enrichment Analysis of the Top 500 Genes Identified by the Attention Mechanism

Figure 2 shows a bar chart of the KEGG pathway analysis for the top 500 genes obtained by the attention mechanism from the lung cancer RNA-Seq data. The enrichment analysis showed that these genes were significantly associated with several major biological processes and pathways. For instance, the largest share of genes was associated with the coronavirus disease pathway, which could be relevant to studies of viral interaction or immunity. Ribosome and protein processing in the endoplasmic reticulum were also highly enriched, and their significance in lung cancer was identified in this analysis. Pathways such as fluid shear stress and atherosclerosis, phagosome, and antigen processing and presentation were significant, pointing to the involvement of the cellular stress response, the immune system, and inflammation. Other pathways with statistically significant enrichment were focal adhesion, lysosome, prostate cancer, and the estrogen signaling pathway, suggesting that lung cancer involves cell adhesion, intracellular degradation, cancer-related signaling, and estrogen signaling, respectively. This integrated enrichment analysis offers insight into the molecular aspects of lung cancer and suggests directions for future development.

Figure 2. Analysis of the top 500 genes obtained using the attention mechanism for lung cancer RNA-Seq data based on KEGG pathway analysis.

3.3. PPI Network Construction and Biomarkers’ Gene Selection

To gain further insight into the top 500 genes, we analyzed them through a protein–protein interaction (PPI) network. The PPI network was constructed using the STRING database and the Cytoscape application, with the 500 genes as input. As presented in Figure 3, the PPI network analysis of the top 500 genes in lung cancer RNA-Seq data reveals a highly interconnected and complex interaction landscape. Central hub genes and gene clusters play pivotal roles in the network, representing critical biological processes and potential targets for therapeutic intervention. This analysis provides a comprehensive overview of the molecular interactions in lung cancer, offering valuable insights for future research and clinical applications.

The PPI network of the top 500 lung cancer genes that are obtained using the attention mechanism, where dark red indicates a high degree of connection and light red denotes a low degree of connection.

Furthermore, 10 top biomarker genes from the PPI network were selected using maximal clique centrality (MCC), maximum neighborhood component (MNC), and degree methods. This selection was performed with the CytoHubba plugin in Cytoscape, employing the default parameters. The top 10 genes exhibiting the highest MNC, MCC, and node degree scores were identified as hub genes, as illustrated in Figures 4(a), 4(b), and 4(c).
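
As a rough illustration of the three ranking scores, the sketch below computes degree, MNC, and MCC on a small hypothetical graph in plain Python. It is not the CytoHubba implementation. The definitions used here are stated as assumptions following the commonly cited formulation: MNC is the size of the largest connected component among a node's neighbors, and MCC is the sum of (|C| - 1)! over the maximal cliques C that contain the node. The node names and edges are invented for the example.

```python
from math import factorial


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def degree(adj, v):
    return len(adj[v])


def mnc(adj, v):
    """Size of the largest connected component induced by v's neighbors."""
    nbrs = adj[v]
    seen, best = set(), 0
    for start in nbrs:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u] & nbrs:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def mcc(v, cliques):
    """Sum of (|C| - 1)! over the maximal cliques C containing v."""
    return sum(factorial(len(c) - 1) for c in cliques if v in c)


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = build_graph(edges)
cliques = maximal_cliques(adj)

scores = {
    v: {"degree": degree(adj, v), "MNC": mnc(adj, v), "MCC": mcc(v, cliques)}
    for v in sorted(adj)
}
for node, s in scores.items():
    print(node, s)

# Rank nodes by MCC, breaking ties by name, and keep the top 3.
top_by_mcc = sorted(scores, key=lambda n: (-scores[n]["MCC"], n))[:3]
print(top_by_mcc)
```

Selecting the top 10 nodes under each score, as done in the study, would correspond to running the final ranking step once per score with a cutoff of 10.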

The biomarker gene network was identified and ranked according to (a) MNC, (b) MCC, and (c) node degree. Nodes colored in red indicate higher MNC, MCC, or node degrees within the network. Nodes with colors transitioning between red and yellow represent medium levels of MNC, MCC, or node degree, while nodes colored in yellow signify lower levels of MNC, MCC, or node degree in the network.

Figure 4 highlights the top 10 hub genes obtained by each method with nodes colored from red (higher significance) to yellow (lower significance). In the MCC subnetwork, genes such as HSP90AA1, HSP90AB1, CTNNB1, JUN, and FOS are prominent, highlighting their roles in protein folding, cell adhesion, and gene regulation. The MNC subnetwork emphasizes ribosomal proteins such as RPL19, RPS8, and RPL13A, indicating the importance of protein synthesis in lung cancer. Similarly, the node degree subnetwork also underscores the significance of ribosomal proteins and includes EIF3E, a translation initiation factor, and NACA, involved in protein folding. Notably, RPSA appears in all three subnetworks, marking it as a crucial hub gene with roles in protein synthesis and cell signaling. This consistent presence underscores its potential as a therapeutic target. Overall, the analysis reveals the complex interaction landscape in lung cancer, highlighting key biological processes such as protein synthesis, folding, and cell signaling, and identifies significant genes such as RPSA for further research and clinical applications.
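
The overlap between the three hub lists can be checked with a simple set intersection. The sketch below uses only the genes named in the text above, so each set is a partial list rather than the full top 10 for that method. The variable names are illustrative.

```python
from collections import Counter

# Partial hub lists: only genes explicitly named in the text (not the full top 10).
named_hubs = {
    "MCC": {"HSP90AA1", "HSP90AB1", "CTNNB1", "JUN", "FOS", "RPSA"},
    "MNC": {"RPL19", "RPS8", "RPL13A", "RPSA"},
    "degree": {"EIF3E", "NACA", "RPSA"},
}

# Genes present in every list.
shared = set.intersection(*named_hubs.values())

# Number of lists in which each gene appears.
counts = Counter(gene for genes in named_hubs.values() for gene in genes)

print("shared by all methods:", sorted(shared))
for gene, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
    print(gene, n)
```

With the complete top-10 lists from CytoHubba, the same two steps would show which genes are supported by all three rankings and which by only one.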

4. Discussion

The top 10 hub genes identified by each ranking method are shown in Figure 4, where the nodes range from red, representing high significance, to yellow, representing low significance. The MCC subnetwork contains the following genes of interest: HSP90AA1, HSP90AB1, CTNNB1, JUN, and FOS, which are involved in protein folding, cell adhesion, and gene regulation. The heat shock proteins HSP90AA1 and HSP90AB1 are well known in the cancer literature for their ability to stabilize many oncogenic proteins and are therefore important for the survival and proliferation of cancer cells [24, 25]. Recently, HSP90 inhibitors have been reported to modulate these pathways and thereby affect the growth and metastatic capability of lung cancer. CTNNB1, which encodes beta-catenin, is a component of the Wnt signaling pathway and is often associated with cancer and therapy resistance mechanisms [26]. AP-1, composed of JUN and FOS proteins, is implicated in cell growth and apoptosis, and aberrant expression of JUN and FOS has been reported to predict poor survival in lung cancer patients [27, 28]. A notable finding is the consistent presence of RPSA across all three subnetworks, marking it as a crucial hub gene with dual roles in protein synthesis and cell signaling. RPSA, also known as the 37/67 kDa laminin receptor, is involved in cell adhesion and migration, both of which are essential for cancer metastasis. Its dual functionality and central role in lung cancer biology make it a compelling target for therapeutic intervention [29, 30]. The agreement of these findings with the existing literature highlights the importance of these hub genes in maintaining the malignant phenotype of lung cancer cells. This analysis reveals the intricate interaction landscape in lung cancer, emphasizing key biological processes such as protein synthesis, folding, and cell signaling, and identifies significant genes such as RPSA for further research and clinical applications.

The MNC subnetwork emphasizes the importance of ribosomal proteins, such as RPL19, RPS8, and RPL13A, indicating the critical role of protein synthesis in lung cancer [31]. The upregulation of ribosomal biogenesis and protein synthesis machinery is a hallmark of cancer cells, supporting their rapid growth and proliferation [32]. Similarly, the node degree subnetwork also highlights ribosomal proteins and includes EIF3E, a translation initiation factor, and NACA, involved in protein folding [32, 33]. EIF3E is essential for the initiation phase of protein translation, and its involvement in cancer underscores the heightened need for efficient protein synthesis in tumor cells [33]. NACA, part of the nascent polypeptide–associated complex, aids in the proper folding and assembly of new proteins, ensuring cellular function and stability.

The GO analysis provided valuable insights into the biological terms enriched among the identified genes. Key enriched terms included cytoplasmic translation, peptide biosynthetic process, cell-substrate junction, focal adhesion, cadherin binding, and RNA binding, which are central to cancer progression.

5. Conclusion

This study demonstrates the potential of using computational methods and omics data analysis to enhance lung cancer diagnosis and treatment. By leveraging RNA-Seq data and employing LASSO regression with attention mechanisms, we identified significant lung cancer biomarkers that align with known cancer driver genes such as TP53, EGFR, KRAS, ALK, and PIK3CA. Our GO and KEGG pathway enrichment analyses further highlighted the involvement of these biomarkers in critical biological processes and pathways, such as protein synthesis, folding, cell adhesion, gene regulation, and immune responses. The PPI network analysis revealed a highly interconnected interaction landscape, with central hub genes playing pivotal roles in lung cancer progression. RPSA emerged as a crucial hub gene, consistently identified across different centrality measures, emphasizing its dual role in protein synthesis and cell signaling. Integrating the STRING database and the Cytoscape application for constructing the PPI network with the CytoHubba plugin for hub gene selection provided a robust framework for identifying potential therapeutic targets. Our findings underscore the importance of advanced computational techniques in uncovering the complex molecular mechanisms of lung cancer, paving the way for personalized treatment strategies and improved patient outcomes. Future research should focus on the clinical validation of these biomarkers to bridge the gap between discovery and application, ultimately enhancing the precision of lung cancer diagnostics and therapeutics.

Ethics Statement

The authors have nothing to report.

Consent

The authors have nothing to report.

Conflicts of Interest

The authors declare no conflicts of interest.

Author Contributions

Data curation: M.A. and M.K.E.; formal analysis: B.A., H.A., M.M., and A.M.M.; investigation: H.A., E.M., and A.M.M.; supervision: M.A., M.E., B.A., and M.A.A.; writing–original draft: A.M.M., H.A., and M.E.; writing–review and editing: M.A., M.E., B.A., and H.A. All authors have read and agreed to the published version of the manuscript.

Funding

The Deputyship of Research & Innovation, Ministry of Education in Saudi Arabia, funded this research through the project number: 223202.

Acknowledgments

The authors acknowledge the Deputyship of Research & Innovation, Ministry of Education in Saudi Arabia, for funding this research through the project number: 223202.

Data Availability Statement

The experimental data and the simulation results that support the findings of this study are available at the following website: https://www.cancer.gov/ccg/research/genome-sequencing/tcga.
