Biomarkers for respiratory diseases: Present applications and future discoveries
[Correction added on November 5, 2022 after first online publication: the Author Contributions, Acknowledgments, Funding, Conflicts of Interest, Data Availability Statement and Ethical approval have been updated].
Abstract
Biomarkers such as clinical characteristics and laboratory indices have been widely used in many respiratory diseases, such as chronic airway diseases, lung infection, lung cancer and acute respiratory distress syndrome. Accordingly, various non-invasive and invasive samples, such as blood, urine, induced sputum, bronchoalveolar lavage fluid and lung biopsy, are currently being collected. Omics-based discovery strategies facilitate the identification of next-generation candidate biomarkers. The identification and integration of multiple omics biomarkers (genomics, epigenomics, proteomics, metabolomics and radiomics) can help answer many unsolved questions about respiratory diseases. Translating novel biomarkers into clinical practice remains a major challenge because good biomarkers must be sensitive, technically feasible, non-invasive, inexpensive and, most importantly, validated. The four stages of clinical trial design in the development of biomarkers are also discussed. The concept of systems biology not only focuses on novel molecular biomarkers but also integrates clinical, laboratory and omics information as a whole. Finally, developing multiple biomarker panels is a goal to improve the healthcare of respiratory diseases.
1 INTRODUCTION
Biomarkers have developed rapidly in biology and medicine. They have become an important concept because of their tremendous effects on diagnosis, monitoring, prediction and prognosis. However, the definition and concept of biomarkers remain varied and confusing, and their complexity and rapid development make them even more difficult to understand. To clarify the confusing definitions, a working group was formed to standardize the concept of biomarkers. The continuously updated definition is now available online as the ‘Biomarkers, EndpointS, and other Tools’ resource.1 According to the group, the basic definition of a biomarker is ‘A defined characteristic that is measured as an indicator of normal biological processes, pathogenic processes or responses to an exposure or intervention.’1 Clearly, an accurate concept of biomarkers can help researchers and relevant specialists communicate more effectively. The broad definition allows biomarkers to be applied for many purposes, giving rise to diagnostic biomarkers, monitoring biomarkers, pharmacodynamic/response biomarkers, predictive biomarkers, prognostic biomarkers and so on (Figure 1).

Past decades have seen great advances in the treatment of various respiratory diseases, such as tuberculosis, community-acquired pneumonia (CAP), chronic obstructive pulmonary disease (COPD) and asthma, thus greatly improving the health status and increasing the longevity of patients.2-5 However, many challenges still exist in both respiratory clinical practice and basic research. To date, no effective drug or treatment can be provided to patients with severe asthma and exacerbation-prone COPD.6-8 There are few studies on the exact mechanisms and pathogenesis of virus-induced lung infection, such as coronavirus disease 2019, which resulted in high mortality.9 Despite progress in mechanical ventilation and life support therapy, acute respiratory distress syndrome (ARDS) is still a life-threatening disease, partially because there is limited knowledge about the underlying molecular mechanisms and a lack of aetiological treatment. More broadly, respiratory diseases are heterogeneous because of different pathogeneses, clinical manifestations, courses and drug responses. Thus, the use of biomarkers can help discover the underlying pathogenic mechanisms, identify susceptibility factors for specific diseases, improve diagnosis, monitor progression and assess treatment response.10, 11 Here, we illustrate the current application of biomarkers in respiratory diseases and the biosamples frequently adopted in this area. New technologies, systems biology-based strategies and biomarker development methodologies are also discussed.
2 PRESENT APPLICATION OF BIOMARKERS IN RESPIRATORY DISEASES
2.1 Chronic airway inflammation diseases
Chronic airway diseases, such as asthma and COPD, cause a major public health burden worldwide every year. Researchers found that some COPD patients have high blood eosinophil counts, which were thought to be characteristic of asthma in the past.12, 13 As reports of high eosinophil levels in the blood and sputum of COPD patients accumulated,14-18 scientists discovered that these patients had more severe disease and a better response to glucocorticoids than patients with normal eosinophil levels.19
Another well-known chronic airway inflammatory disease is asthma. Asthma can be divided into T2 and non-T2 endotypes according to T2 inflammatory mediators such as eosinophils, Th2 cytokines and IgE.20 For example, patients with high levels of eosinophils can be classified as having T2 asthma.21 These patients showed a good response to benralizumab, a monoclonal antibody directed against the IL-5 receptor. Fractional exhaled nitric oxide (FeNO) is another T2 inflammatory biomarker. Patients with elevated FeNO show a better response to glucocorticoids than those with normal or low FeNO.
2.2 Lung cancer
Osimertinib, a third-generation epidermal growth factor receptor (EGFR) tyrosine kinase inhibitor (TKI), is the standard therapy for advanced EGFR-positive non-small cell lung cancer (NSCLC). However, approximately 30%–40% of patients still have a poor response to osimertinib.22, 23 Thus, there is an unmet need for a biomarker to judge the indication of osimertinib in advanced EGFR-positive NSCLC patients. Zheng et al. discovered that the T790M relative mutation abundance (RMA) is an independent predictor in T790M-positive NSCLC patients who are receiving osimertinib.24 A higher level of T790M RMA indicated longer progression-free survival (PFS) in NSCLC patients.
Similarly, Fang et al. considered circulating tumour DNA as a possible surrogate method for patients with no solid pathological specimens.25 They compared the tumour mutational burden (TMB) from blood and tissue in advanced NSCLC patients and found that blood TMB can be a biomarker to predict the response to immunotherapy in NSCLC patients with no solid pathological tissue.25
2.3 Lung infection
Procalcitonin is a serum biomarker of bacterial infection. It is synthesized by most organs and tissues and released into the blood when bacterial infection occurs. For patients with lower respiratory tract infections, procalcitonin can help distinguish bacterial infections from other kinds of infection or inflammation. Rodriguez et al. conducted a prospective, multicentre study in intensive care unit (ICU) patients with H1N1 influenza. They found that procalcitonin had a 94% negative predictive value for excluding bacterial co-infection.26 Procalcitonin can therefore be applied not only in differential diagnosis but also in the guidance of antibiotic treatment. For CAP patients, regular measurement of procalcitonin can help adjust the course of antibiotics, that is, decide when to stop them. Ito et al. demonstrated that procalcitonin guidance can shorten the antibiotic duration in CAP from 12.6 days to 8.6 days.27 Procalcitonin can thus help doctors use antibiotics more rationally. Both the European Respiratory Society (ERS) and the American Thoracic Society suggested using clinical criteria alone to determine when to start antibiotics, while using clinical criteria together with procalcitonin to determine when to stop.28
2.4 Pulmonary embolism
Pulmonary embolism is an emergency that can cause sudden death. Its atypical and latent symptoms may lead to misdiagnosis and delayed therapy. Early diagnosis and treatment are important to decrease mortality. Many data have suggested that most pulmonary embolism patients show elevated D-dimer levels. When a patient shows an unexplained, markedly elevated D-dimer level, pulmonary embolism should be considered, and relevant examinations should be carried out immediately to diagnose or exclude this emergency. However, many factors can influence the level of D-dimer, which gives the test high sensitivity but poor specificity.29 As a diagnostic biomarker for pulmonary embolism, the D-dimer result should therefore be interpreted together with other clinical criteria.
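The consequence of high sensitivity combined with poor specificity can be illustrated with a small worked example. The sketch below computes sensitivity, specificity and predictive values from a 2x2 table. The cohort size, prevalence and test performance are hypothetical numbers chosen only for illustration and are not taken from the cited studies.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased patients correctly flagged
        "specificity": tn / (tn + fp),  # non-diseased patients correctly cleared
        "ppv": tp / (tp + fp),          # probability of disease given a positive test
        "npv": tn / (tn + fn),          # probability of no disease given a negative test
    }


# Hypothetical cohort: 1000 patients with suspected pulmonary embolism,
# 100 of whom truly have it. Assumed sensitivity 0.95 and specificity 0.40.
metrics = diagnostic_metrics(tp=95, fp=540, fn=5, tn=360)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed numbers, a negative result is reassuring (high negative predictive value), whereas a positive result is weak evidence on its own (low positive predictive value), which is why a positive D-dimer needs to be combined with other clinical information.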
Once the patient is diagnosed with pulmonary embolism, anticoagulant therapy should start as soon as possible. Warfarin is the most commonly used drug because of its affordability. Its therapeutic window is so narrow that the international normalized ratio (INR) should be measured regularly to evaluate the anticoagulant effect and the bleeding risk.30 The INR therefore acts as a monitoring biomarker for the efficacy and safety of anticoagulation treatment.
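As a brief illustration of how the INR is derived, the sketch below applies the standard definition, INR = (patient prothrombin time / mean normal prothrombin time) raised to the power of the reagent's international sensitivity index (ISI). The function names, the example values and the default target range of 2.0 to 3.0 are assumptions made for this example and are not stated in the text; the appropriate range depends on the indication and local guidance.

```python
def inr(pt_patient_s, pt_normal_s, isi):
    """International normalized ratio from prothrombin times (seconds) and ISI."""
    if pt_patient_s <= 0 or pt_normal_s <= 0 or isi <= 0:
        raise ValueError("prothrombin times and ISI must be positive")
    return (pt_patient_s / pt_normal_s) ** isi


def classify_inr(value, low=2.0, high=3.0):
    """Compare an INR with an assumed target range (illustrative defaults)."""
    if value < low:
        return "below range"
    if value > high:
        return "above range"
    return "in range"


# Hypothetical measurement: patient PT 24 s, mean normal PT 12 s, ISI 1.0.
example_inr = inr(pt_patient_s=24.0, pt_normal_s=12.0, isi=1.0)
print(round(example_inr, 2), classify_inr(example_inr))
```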
3 RESPIRATORY BIOSAMPLES UTILIZED IN BIOMARKER MEASUREMENT
Different diseases have different biomarkers for diagnosis, prognosis and other purposes. Given the large number of candidate biomarkers, clinical feasibility should be taken into consideration. The use of non-invasive biosamples for novel biomarker development is a trend in clinical practice. The following section discusses the advantages and disadvantages of various non-invasive samples, such as blood, plasma, nasal brushings, sputum and urine, and invasive samples, such as bronchoalveolar lavage fluid (BALF) and tissue biopsy (Table 1, Figure 2).
Specimen source | Biomarker name | Biomarker type | Range of application | Clinical application
---|---|---|---|---
Blood and plasma | Eosinophil | Cell | Prognosis | COPD15
Blood and plasma | cfDNA/ctDNA | DNA | Prognosis | NSCLC37
Blood and plasma | T790M RMA | Gene | Prognosis | NSCLC22
Blood and plasma | ABG | Oxygen | Diagnosis | ARDS26
Blood and plasma | CURB-65 | Scale | Diagnosis | CAP125
Blood and plasma | PCT | Laboratory test | Monitoring | CAP29
Blood and plasma | D-D | Laboratory test | Diagnosis | PE34
Blood and plasma | INR | Laboratory test | Monitoring | PE35
Blood and plasma | T-spot | Laboratory test | Diagnosis | TB40
Exhaled air | FeNO | Physiological test | Monitoring | Asthma21
Physiological function | FVC %pre | Physiological test | Prognosis | IPF32
Physiological function | DLCO %pre | Physiological test | Prognosis | IPF32
Sputum | Eosinophil | Cell | Diagnosis | Asthma42
Nasal brushing | Nasal epithelial cells | Cell | Diagnosis | Asthma46
Urine | Proteinuria | Laboratory test | Diagnosis | COPD49
BALF | G test/GM test | Laboratory test | Diagnosis | Fungal infection51
Tissue | PD-L1 | Protein | Prognosis | NSCLC54
Tissue | EGFR gene | Gene | Prognosis | NSCLC36
- Abbreviations: ABG, arterial blood gas; ARDS, acute respiratory distress syndrome; BALF, bronchoalveolar lavage fluid; CAP, community-acquired pneumonia; COPD, chronic obstructive pulmonary disease; D-D, D-dimer; DLCO, carbon monoxide diffusion capacity; EGFR, epidermal growth factor receptor; FeNO, fractional exhaled nitric oxide; FVC, forced vital capacity; INR, international normalized ratio; IPF, idiopathic pulmonary fibrosis; NSCLC, non-small cell lung cancer; PCT, procalcitonin; PD-L1, programmed death ligand-1; PE, pulmonary embolism; RMA, relative mutation abundance; TB, tuberculosis.

3.1 Blood and plasma
Targeted therapy plays a critical role in NSCLC patients. Nevertheless, many patients show resistance to targeted drugs. At this time, a second biopsy should be arranged, but some patients cannot tolerate this invasive examination. Phan et al. discovered that the plasma EGFR test can be a prognostic biomarker for resistance to TKI treatment in NSCLC patients.31
Circulating cell-free (cf) DNA is released into the circulatory system from apoptotic and necrotic cells, which are common in tumours. cfDNA is applied in the early discovery and monitoring of carcinoma.32 Furthermore, Mirtavoos-Mahyari et al. quantitatively compared two circulating cfDNA test kits (MN, QIAGEN) in NSCLC patients.33 Their results indicate that liquid samples, such as blood cfDNA, can help monitor and predict the clinical condition of NSCLC patients in place of invasive biopsy.
Tuberculosis is a worldwide infectious disease that causes a tremendous public health burden. The prevalence of tuberculosis in China is higher than that in many other countries, and the resulting economic burden is substantial. Early discovery and diagnosis are important for the control of tuberculosis. The clinical features of tuberculosis vary and sometimes lead to misdiagnosis. The T-SPOT.TB assay (an interferon [IFN]-γ release assay) detects IFN-γ secreted by M. tuberculosis-specific T cells stimulated with Mycobacterium-specific antigens.34 The T-spot test has excellent sensitivity and good specificity for tuberculosis infection, and it is easy to perform on blood specimens.35 When tuberculosis is suspected, the T-spot test should be performed to support the diagnosis.
3.2 Sputum
Induced sputum is obtained by having patients who produce little or no sputum inhale nebulized hypertonic saline. The cellular components of induced sputum can then be tested, including the proportions of eosinophils, neutrophils, lymphocytes, mononuclear phagocytes and so on. Cells in the sputum reflect the local condition of the respiratory tract better than blood, with better sensitivity and specificity.36 In asthmatic patients, a high level of eosinophils in the sputum indicates an excellent response to glucocorticoid treatment. After glucocorticoid treatment, the proportion of eosinophils in sputum decreased, and the cellular morphology changed.37 A high level of neutrophils in sputum indicates a poor response to glucocorticoids and a greater chance of difficulty in controlling symptoms.37 In the management of chronic airway disease, induced sputum can help to dynamically monitor respiratory tract inflammation and precisely guide the treatment strategy.
3.3 Urine
Researchers have previously reported that COPD patients have dysfunction in pulmonary microvessels. Proteinuria indicates not only damage to microvessels in the kidney but also microvascular dysfunction in the pulmonary circulation.38 The severity of proteinuria is related to the severity of COPD, including pulmonary function.39 Oelsner et al. analysed the association between proteinuria and FEV1/FVC in COPD patients.40, 41 FEV1 decreased with increasing proteinuria levels, and a higher level of proteinuria was accompanied by an increased hospitalization risk.38 These findings suggest that urine samples might be used in the future as a predictive biomarker of pulmonary microvascular dysfunction.
3.4 BALF
Fungal infection usually occurs when the human immune system is compromised. Most of the time, fungal infection is insidious and hard to detect, eventually leading to severe infection and death. To control the infection, early detection and the identification of pathogens are the most important steps. The 1,3-β-D-glucan test (G test) and galactomannan test (GM test) are easy and convenient blood tests for fungal infection. It has been discovered that the BALF GM test has better sensitivity and specificity than the blood test. In 2016, the Infectious Diseases Society of America suggested the BALF GM test as a biomarker of invasive pulmonary aspergillosis.42 Previous studies revealed that the GM test together with the G test of BALF can further decrease the false positive rate.43 In other words, combining the GM test and G test can enhance the clinical diagnostic value in fungal infection, especially in BALF rather than blood.
3.5 Tissue
Pathology is the gold standard for the diagnosis of carcinoma. The diagnosis of pulmonary carcinoma depends on tissue specimens from bronchoscopy and surgery. The detection of alterations in human bronchial epithelial cells (HBECs) via bronchoscopic examination may be used to detect early-stage pulmonary injury, such as carcinoma. Alterations in gene and protein expression reflect early metabolic alterations in HBECs.44, 45 Apart from targeted therapy, immunotherapy plays a critical role in pulmonary carcinoma. Tumour cells can express programmed death-ligand 1 (PD-L1), which binds to PD-1 on T cells. This ligand-receptor binding suppresses T cell function and activity, leading to immune escape. Inhibitors of PD-L1/PD-1 block the pathway, relieving the suppression of T cell activity. High PD-L1 expression in carcinoma is an indicator for the use of immunotherapy. The PD-L1 test via immunohistochemistry has become a routine test in clinical practice.46
4 SYSTEMS BIOLOGY STUDIES TO DISCOVER NOVEL BIOMARKERS FOR THE RESPIRATORY SYSTEM
The heterogeneity of respiratory diseases calls for personalized medicine that allows physicians to choose a specific regimen for a patient according to personal characteristics, such as demographic features (age, sex, smoking status) and genomic, genetic and other omic characteristics.47-49 In recent years, clinical and basic studies have shifted from the ‘hypothesis-driven’ to the ‘discovery-driven’ model.50 The ‘hypothesis-driven’ model is a traditional strategy in which a single biomarker is identified while exploring the pathogenetic mechanism of a disease, whereas the ‘discovery-driven’ model is a newer strategy in which a considerable number of candidate biomarkers emerge from progress in high-throughput technologies, analytic methodologies and the digital revolution. Omics studies, comprising genomics, epigenomics, proteomics, metabolomics and radiomics, are a hallmark of this conceptual and biotechnological revolution.51-53
As organisms are complex and multi-dimensional, error and bias are inevitable if clinical decision-making is largely dependent on a single biomarker or specific ‘omics’ data. Systems biology regards the body as a whole. The information collected from demographic features, laboratory tests, physiological indices, imaging and molecular changes from omics should be integrated.54, 55 This will further our understanding of disease pathophysiology and improve the diagnosis and precise treatment of patients. The following section will discuss the research progress and clinical application of omics studies in respiratory diseases.
4.1 Genomic studies in respiratory diseases
The success of the human genome project has opened a new era of omics. The next-generation sequencing (NGS) technique can be used to sequence whole exomes in the genome with high efficiency and at a relatively low cost.56-58 More than 500 000 single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) in the genome from thousands of participants were found. Data released from the HapMap project, which focuses on mapping SNPs in the whole genome, found that 99.5% of the genome information between two individuals is identical; however, the remaining 0.5% of variants contribute to disease susceptibility.59 Thousands of novel genes and regulatory regions were discovered through whole-genome sequencing, which facilitated the development of the ‘genome-wide association study (GWAS)’, which aims to study the association between gene variants and various diseases.60 One of the most successful examples of genomic research in respiratory disease is the finding of EGFR mutations in advanced NSCLC patients. The response to EGFR-TKI drugs can be predicted in advance in lung cancer patients with this mutation, which indicates that physicians should choose EGFR-TKIs rather than chemotherapy as first-line therapy for these populations.10 Genetic risk scores (GRSs) combine several gene variants that are genetic risk factors for a specific disease; for example, a GRS study for predicting the onset of COPD revealed that populations with high GRSs have a five-fold higher chance of developing COPD than those with low GRSs. The GRS combined with smoking status increased the absolute risk of COPD to 82.4% in the high-score group, whereas the probability of COPD was 17.4% in the low-score group.38, 61
Warfarin is a traditional anticoagulation medicine for pulmonary embolism and has been widely used because of its efficacy and low cost. However, it does not work well for some patients because of its narrow therapeutic window and the varied responses of individuals to the drug. Polymorphism of the vitamin K epoxide reductase complex 1 (VKORC1) gene affects the individual's response to warfarin, so a test for VKORC1 variants enables the assessment of warfarin sensitivity and determination of the initial dosage.62
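To make the idea of a genetic risk score concrete, the sketch below computes a score as a weighted sum of risk-allele counts, which is one common way such scores are constructed. The variant names and weights are hypothetical placeholders, not the variants or effect sizes used in the cited COPD study.

```python
def genetic_risk_score(genotypes, weights):
    """Weighted sum of risk-allele counts (0, 1 or 2 copies per variant)."""
    score = 0.0
    for variant, weight in weights.items():
        count = genotypes.get(variant, 0)  # missing variants count as 0 copies
        if count not in (0, 1, 2):
            raise ValueError(f"invalid allele count for {variant}: {count}")
        score += weight * count
    return score


# Hypothetical per-allele weights (e.g. log odds ratios from an association study).
HYPOTHETICAL_WEIGHTS = {"variant_a": 0.20, "variant_b": 0.10, "variant_c": 0.35}

# Hypothetical individual: two risk alleles at variant_a, one at variant_b.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
score = genetic_risk_score(person, HYPOTHETICAL_WEIGHTS)
print(f"GRS = {score:.2f}")
```

In practice, individuals would then be grouped into low- and high-score strata, and the score could be combined with non-genetic factors such as smoking status, as described above.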
4.2 Transcriptomic studies in respiratory disease
Transcripts are products of gene expression. The transcriptome is a bridge linking the genome to the proteome. Transcriptomic studies are essential because transcription is the first step of gene expression, upstream of translation into proteins. Transcriptomic research has passed through different stages with distinct technologies, from gene chip microarray to serial analysis of gene expression (SAGE) and NGS. The latter, also called RNA-seq, has become the mainstream sequencing technology, favoured over microarray and SAGE for its higher sensitivity, broader dynamic range and better efficiency.
Single-cell sequencing can help us to learn how cell proportions and states vary and how this variation correlates with genome variants, disease course and treatment response.63 Single-cell RNA sequencing (scRNA-Seq) is at the cutting edge of transcriptomic research. It aims to profile the gene expression of a single cell, such as a sperm or an oocyte, or of a single cell subset with specific characteristics. It overcomes the limitation of traditional RNA-seq of a bulk heterogeneous cell population, which cannot provide precise, cell-specific information.64-66 Technical advances have enabled the development of scRNA-Seq. The first major challenge is how to isolate target cells from a mixed cell population or tissue. Three main techniques, namely micromanipulation, microfluidics and fluorescence-activated cell sorting (FACS), have made the separation and labelling of single cells possible. Second, the quantity of genetic material in a single cell is too small for direct sequencing. Single-cell whole-genome amplification techniques, such as multiple annealing and looping-based amplification cycles and multiple displacement amplification (MDA), have come onto the market to take the place of traditional PCR, with higher efficiency and fidelity. As an example of scRNA-Seq in the study of allergic airway inflammation, the mRNA of 91 individual cells was sequenced; 52 cells were interleukin 13 (IL-13) positive, and the other 39 cells were IL-13 negative. The results indicated that the enzyme Cyp11a1, which catalyses the first step in the conversion of cholesterol to steroid hormones such as glucocorticoids, played an important role in maintaining homeostasis and regulating glucocorticoid production.
4.3 Epigenomic studies in respiratory diseases
The epigenome refers to modifications of the genome that regulate gene expression without changing the DNA sequence. Although each cell contains almost the same gene sequence, gene expression varies according to tissue, developmental stage and environmental stimuli. DNA methylation, histone modification and noncoding RNAs are three main manifestations of the epigenome.67, 68 For example, several studies revealed that DNA methylation is involved in several steps of lung cancer pathogenesis, such as cell differentiation and epithelial-mesenchymal transition. In a genome-wide profile of DNA methylation for lung cancer, the methylation of the polycomb-group family was downregulated, and methylation-mediated WNT pathway disruption contributed to cell proliferation, cell cycle progression and invasion in NSCLC.69-71
miRNAs are non-coding RNAs. In mammalian cells, miRNAs are generated from several intron regions. Approximately 400 human miRNAs can target more than half of the human transcripts by interfering with mRNA transcription, splicing and translation.72-75 The expression of miR-101 and miR-144 is increased in COPD patients who smoke compared with healthy smokers. This finding was also confirmed in an in vitro study, and miRNA expression can be induced when human epithelial cells are exposed to tobacco smoke extract.76 Sarcoidosis is a lung granulomatous disease with unknown aetiology. A number of studies have been performed to show the pattern of miRNA expression to explore the underlying mechanism or develop novel biomarkers for this rare disease. The miRNA pattern was different in the lungs and peripheral blood mononuclear cells of sarcoidosis patients compared to healthy controls. For example, the levels of miR-34a increased in sarcoidosis, which can downregulate sirtuin (SIRT) 1 expression and stimulate IFN-γ expression. SIRT1 and IFN-γ are involved in energy metabolism and systemic inflammation, which can partially explain the pathogenesis of sarcoidosis.77
4.4 Proteomic studies in respiratory diseases
The proteome is the whole set of proteins expressed by the genome. Differential and quantitative changes in the proteome are very helpful in understanding the pathogenesis of diseases.78 Proteomics refers to the study of the dynamics of whole protein expression and function. Based on new approaches, such as microarray, nuclear magnetic resonance (NMR) and mass spectrometry (MS), protein expression levels and post-translational modifications in respiratory diseases have been deeply and extensively studied.
Bronchopulmonary dysplasia (BPD) is a disease entity diagnosed by clinical criteria, including the demand for oxygen therapy after 36 weeks of post-menstrual age, hyaline membrane disease or mechanical ventilation in preterm infants. Due to the heterogeneity and complexity of the disease, BPD lacks specific and targeted treatment.79 Tracheal aspirate was collected from infants with BPD for proteomic analysis. The result indicated that the expression of calcyphosine, calcium and integrin binding protein-1 was significantly different between mild and severe BPD.79 Mass spectrometry is a powerful technique that has advanced in recent years and enables high-throughput profiling of the proteome. Using a mass-spectrometry-based top-down and bottom-up strategy, the post-translational modification characteristics of seven main allergens (Der p 1, Der f 1, Phl p 1, Phl p 5, Der p 2, Der f 2 and Bet v 1) have been illustrated in detail; these allergens are candidate targets for the further development of allergy tests and vaccines.80
4.5 Metabolomics studies in respiratory diseases
The metabolome is the complete set of metabolites synthesized by an organism. It comprises a spectrum of small molecular compounds, such as amino acids, peptides, organic acids, carbohydrates and oxidative stress products.81-83 Metabolomics can reflect the response of organisms to environmental stimuli and represent various disease statuses and patterns of metabolite changes.84-86 The most common metabolomic platforms are gas chromatography-MS (GC-MS), liquid chromatography-MS (LC-MS) and NMR spectroscopy.87-89 For instance, GC-MS analysis was adopted to measure metabolites in the exhaled breath of 23 ventilated patients with ARDS and 20 ventilated ICU control patients. Three volatile organic compounds (VOCs), octane, acetaldehyde and 3-methylheptane, were found to be significantly different. These three VOCs were established as a composite biomarker for the early diagnosis of ARDS, with a diagnostic accuracy as high as 0.8.90-92 In a recent study of tuberculosis, a total of 400 metabolites were identified in serum by MS screening, of which 20 were regarded as robust biomarkers to distinguish patients with tuberculosis from healthy individuals. Urine can also be used as a specimen, and its metabolites are biosignatures for early tuberculosis treatment.93
4.6 Radiomics studies in respiratory diseases
Radiomics refers to the extraction and analysis of computerized image features derived from medical images such as computed tomography (CT), positron emission tomography or magnetic resonance imaging.94, 95 The approach of radiomics involves image acquisition and segmentation, feature extraction and quantification, database setup and statistical analysis. The development of radiomics provides a large amount of information to correlate image features with disease clinical characteristics and genomic signatures by establishing descriptive and predictive models. In respiratory disease, the most commonly used imaging modality is chest CT. By deep learning techniques and convolutional neural networks, machines can be trained to identify the lesion's intensity, texture, structure and location and then provide quantitative data. Previous studies indicated that radiomics is helpful for identifying the malignant potential of lung nodules in chest CT. In another study, to discriminate lung invasive adenocarcinomas from non-invasive adenocarcinomas before surgery, 3D radiomics was adopted and successfully discriminated invasive and non-invasive adenocarcinomas with accuracies of 86.3% and 90.8% and 84.0% and 88.1% in the primary and validation cohorts, respectively. Radiomics showed great advantages compared to traditional CT indices such as morphology and mean CT value.96-98 CT-based radiomics can identify and quantify the severity of lung emphysema. A study revealed that the emphysema severity index from radiomics was correlated with the airway obstruction level from a pulmonary function test.99, 100
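The feature extraction and quantification step can be sketched with a toy example. The code below computes a few first-order intensity features (mean, standard deviation, range and histogram entropy) inside a segmented region of interest. The 2D image, the mask, the bin width and the function name are all hypothetical and chosen for illustration; real radiomics pipelines operate on 3D volumes with dedicated software and many more feature classes (shape, texture and so on).

```python
import math
from collections import Counter


def first_order_features(image, mask, bin_width=25):
    """First-order intensity features of the voxels where mask is non-zero."""
    voxels = [
        image[r][c]
        for r in range(len(image))
        for c in range(len(image[r]))
        if mask[r][c]
    ]
    if not voxels:
        raise ValueError("the region of interest is empty")
    n = len(voxels)
    mean = sum(voxels) / n
    variance = sum((v - mean) ** 2 for v in voxels) / n
    # Discretize intensities into fixed-width bins, then compute Shannon entropy.
    counts = Counter(math.floor(v / bin_width) for v in voxels)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return {
        "n_voxels": n,
        "mean": mean,
        "std": math.sqrt(variance),
        "min": min(voxels),
        "max": max(voxels),
        "entropy": entropy,
    }


# Hypothetical 3x3 CT slice (Hounsfield units) and a segmentation mask of a nodule.
image = [[-900, -850, -800], [-50, 30, 40], [20, 35, -700]]
mask = [[0, 0, 0], [0, 1, 1], [1, 1, 0]]
features = first_order_features(image, mask)
print(features)
```

Features like these, computed for every lesion in a cohort, form the table on which the descriptive and predictive models mentioned above are built.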
5 APPROACHES TO IDENTIFY AND VALIDATE NOVEL RESPIRATORY BIOMARKERS
Apart from omics studies, a large amount of data comes from clinical characteristics, laboratory tests, radiological imaging and physiological indices. The top-down strategy is applied in systems biology to integrate all of these data and create new biomarkers for the improvement of diagnosis, disease monitoring and precise intervention.101, 102 Newly identified candidate biomarkers need to be verified before bench-to-bedside translation. Commercialization takes a long time because the accuracy, reproducibility and low cost of new biomarkers must be ensured. Figure 3 graphically displays the process and elements in the development of novel biomarkers in the medical field.

5.1 Challenges encountered in the biomarker discovery phase
Taking omics studies as an example, top-down-based population studies will generate numerous novel biomarkers. However, few have been applied in clinical practice. Some notable outcomes cannot be confirmed by other researchers because the discrimination is from measurement procedures or sample processing differences but not from disease. Various samples, such as blood, urine, induced sputum and BALF, have been used for respiratory omics research. Specimen collection, storage and handling are essential for preserving sample stability. Establishing a standardized protocol for measurement, data mining and reporting is very important to minimize bias. The application of quality control in the measurement ensures comparable and repeated results.
Another challenge for omics research is to analyse big data. Data analysis technology involves raw data processing, data mining and network modelling. A specialized and well-designed database is critical for data storage, organization and interpretation. It contains sequences of the genome and proteome, structures of peptide and small molecules, annotation analysis and literature reviews. Examples of biological databases are as follows. National Center for Biotechnology Information is a freely accessible sequence database that includes data on genes, genomes, nucleotides, proteins, SNPs and so on. Kyoto Encyclopedia of Genes and Genomes (KEGG) is a major pathway database that includes signal and biochemical pathways. Search tool for the retrieval of interacting genes/proteins (STRING) is the most useful database for generating protein–protein interaction networks. These databases enable us to further understand the disease mechanisms and provide a more accurate definition for various respiratory diseases.103
Big data require specialized statistical tools. Instead of univariate analysis, such as Welch's t-test and its p-value, omics studies usually adopt multivariate analysis, which is capable of processing thousands of variables and elucidating complex and multi-dimensional information. Bioinformatics tools are advancing as well. For an unknown sequence discovered by an omics study, bioinformatics tools such as BLASTN and BLASTP provide high-speed sequence matching. Alignment results are annotated position by position for the two aligned nucleotide or amino acid sequences, and summary statistics such as ‘identities’, ‘positives’ and ‘gaps’ are reported. The interpretation of omics big data has contributed to the understanding of the pathogenesis of respiratory diseases and to the identification of potential diagnostic and predictive biomarkers.
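To make the alignment statistics mentioned above more concrete, the following minimal Python sketch counts identities and gaps for two sequences that have already been aligned. It is an illustration only and not the BLAST implementation: the function name `alignment_stats`, the use of '-' as the gap character and the toy sequences are assumptions introduced here. ‘Positives’ are omitted because they require a substitution matrix.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Count identities and gaps in two pre-aligned sequences.

    Assumption: both strings are already aligned, have equal length,
    and use '-' to mark a gap position.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    length = len(query)
    # An identity is a column where both residues match and neither is a gap.
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # A gap column is any column where either sequence has a gap character.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "identity_pct": 100.0 * identities / length if length else 0.0,
    }


# Toy nucleotide alignment (hypothetical sequences, for illustration only).
stats = alignment_stats("ACGT-ACGTT", "ACGTTACGAT")
print(stats)
```

For this toy alignment of 10 columns, 8 columns are identical and 1 column contains a gap, which corresponds to an identity of 80%.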
5.2 Validation is crucial for the development of novel biomarkers
Prior to clinical application, the analytical and clinical validity of newly identified biomarkers should be proven. Analytical validity refers to how accurately and reliably the test measures the biomarker; clinical validity refers to whether the biomarker adds valuable information to the other clinical indices. Therefore, biomarker validation is not limited to the repetition of a clinical trial but covers ‘all activities of biomarker development’ to ensure the quality of the conclusion. As mentioned previously, the standardization of protocols, quality control, the design of clinical trials, blinding and randomization, etc. can minimize bias.
Rigorous clinical trials ensure clinical validity. For diagnostic and predictive biomarkers, there are four phases as follows: phase I: adopt the top-down strategy and omics to search for novel candidate biomarkers; phase II: perform a retrospective clinical trial to validate the biomarker; phase III: perform a prospective randomized controlled trial to validate the biomarker; and phase IV: study the impact of the biomarker on human health after marketing.104 Test participants are usually divided into a training set and a validation set. The training set is usually small, and confounders can affect the results of the trial. To validate the identified biomarker, a larger cohort from another population should be involved. When performing validation trials, good clinical practice, such as inclusion and exclusion criteria, the standardization of procedures, the monitoring of procedures and data entry and analysis, should be adhered to. Finally, multi-centre clinical trials are very useful for validation tests.
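The training and validation logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed method: the cut-off is chosen on the training set by maximizing Youden's index (one common choice among several), the function names are hypothetical, and all biomarker values and labels are invented toy numbers rather than study data.

```python
def sensitivity_specificity(values, labels, cutoff):
    """Return (sensitivity, specificity) when value >= cutoff is called positive.

    Assumption: labels are 1 (diseased) or 0 (non-diseased), and both
    classes are present, so neither denominator is zero.
    """
    tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= cutoff)
    fn = sum(1 for v, y in zip(values, labels) if y == 1 and v < cutoff)
    tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < cutoff)
    fp = sum(1 for v, y in zip(values, labels) if y == 0 and v >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(values, labels):
    """Pick the observed value that maximizes Youden's J on the training set."""
    def youden(cutoff):
        sens, spec = sensitivity_specificity(values, labels, cutoff)
        return sens + spec - 1
    return max(sorted(set(values)), key=youden)


# Hypothetical training cohort: small, and perfectly separable by chance.
train_values = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical independent validation cohort.
val_values = [1.2, 2.8, 3.2, 3.6, 2.2, 4.1, 1.9, 3.4]
val_labels = [0, 1, 0, 1, 0, 1, 0, 1]

cutoff = best_cutoff(train_values, train_labels)
val_sens, val_spec = sensitivity_specificity(val_values, val_labels, cutoff)
print(cutoff, val_sens, val_spec)
```

In this toy example the cut-off separates the training set perfectly, whereas sensitivity and specificity in the validation cohort both fall to 0.75, which illustrates why performance estimated on a small training set should be confirmed in an independent cohort.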
5.3 Setting up a multiple biomarker model
Respiratory diseases are very complex and multi-factorial. The application of a single biomarker in diagnosis or in the prediction of drug response in the absence of other information, such as clinical characteristics, imaging and physiology, can lead to ‘false’ conclusions due to bias or low clinical utility. Recently, interest has focused on establishing multiple biomarker models, which can be more efficient in diagnosis and prediction.16, 105, 106 Compared with a single biomarker model, a multiple biomarker model has an intrinsic statistical advantage, in that each variable can be assigned a statistical weight. Therefore, a composite index can be calculated as the weighted sum of the values of multiple biomarkers.107, 108
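The composite index described above can be illustrated with a short Python sketch. The marker names, values and weights below are purely hypothetical placeholders; in practice the weights would typically be estimated from data, for instance as coefficients of a regression model fitted on a training cohort, which is an assumption here rather than a method taken from the cited studies.

```python
def composite_index(values: dict, weights: dict) -> float:
    """Return the weighted sum of biomarker values.

    Assumption: `values` and `weights` are keyed by biomarker name, and the
    values have already been put on comparable scales.
    """
    missing = set(weights) - set(values)
    if missing:
        raise KeyError(f"missing biomarker values: {sorted(missing)}")
    return sum(weights[name] * values[name] for name in weights)


# Hypothetical weights and measurements (illustrative numbers only).
weights = {"marker_a": 0.5, "marker_b": 0.25, "marker_c": 0.25}
patient = {"marker_a": 2.0, "marker_b": 4.0, "marker_c": 8.0}

score = composite_index(patient, weights)
print(score)  # 0.5*2.0 + 0.25*4.0 + 0.25*8.0 = 4.0
```

A marker with a larger weight contributes more to the score, so a model of this kind lets informative biomarkers dominate the index while weaker ones still add information.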
For example, idiopathic pulmonary fibrosis is a refractory respiratory disease with an unpredictable disease course and varied treatment response.109, 110 One previous study revealed that the use of a combination of serum biomarkers such as KL-6 and SP-D, SP-A and SP-D or a panel of miRNAs can increase prognostic accuracy.111 In the Prospective Observation of Fibrosis in the Lung Clinical Endpoints (PROFILE) study, the role of four serum biomarkers (SP-D, MMP7, CA 19-9 and CA 125) was explored. Increases in CA 19-9 and CA 125 suggested epithelial damage. The baseline values of SP-D and CA 19-9 were significantly higher in patients with progressive disease.112 Another study that explored early diagnostic, endotype and disease monitoring biomarkers for allergic rhinitis and asthma found more than 100 biomarkers associated with epithelial damage and local inflammation.113, 114 Further studies are needed to translate this panel into different biomarker models with higher sensitivity and specificity.113, 115
In addition, new branches of science are emerging through the integration of different fields. Pathology focuses on the observation of tissues and cells, and in such observational studies it is challenging to establish causal relationships. Clarifying causal relationships is one of the strengths of epidemiology, which belongs to the field of data science. In the era of big-data omics science, molecular pathology can produce multi-omic data using techniques such as massively parallel sequencing (next-generation sequencing, NGS). The causal inference methods of epidemiology can be applied to molecular pathology research. Molecular pathological epidemiology (MPE), a hybrid of pathology and data science, shows the strengths of such interdisciplinary integration, which can promote precision medicine.116 MPE research may help to detect intermediary biomarkers that can predict full-blown disease in the future.117 MPE research on environmental factors and personalized molecular biomarkers may open new horizons.
6 CONCLUSION
Current respiratory disease treatment still relies on traditional evaluation based on clinical assessment, imaging, pulmonary physiology and biochemical biomarkers. In the future, omics-based discovery strategies will enable the emergence of numerous novel molecular biomarkers. These will greatly improve the accuracy and efficiency of diagnosis, prognosis prediction and tailored treatment for respiratory diseases. Furthermore, the concept of systems biology, which focuses on the integration of clinical characteristics, radiologic features, laboratory findings and genomic, proteomic and metabolomic profiles, will further our understanding of the pathogenesis of respiratory disease and promote the translation of novel biomarkers into clinical use. However, many experimental, analytical and financial obstacles must still be overcome to keep pace with future clinical needs.
AUTHOR CONTRIBUTIONS
Dr. Xiaojing Liu, Dr. Bo Cui, Dr. Qian Wang and Dr. Yuan Ma contributed to the preparation and collection of the original literature and figures and to the writing and editing of the manuscript. Dr. Zhihong Chen and Dr. Li Li were responsible for the structural design, scientific quality and writing.
ACKNOWLEDGEMENTS
Not applicable.
FUNDING INFORMATION
Not applicable.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
ETHICAL APPROVAL
Not applicable.
Open Research
DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.