Volume 23, Issue 12 pp. 1117-1126

Invited Review Series: Molecular Techniques for Respiratory Diseases

Free Access

Insights into respiratory disease through bioinformatics

Bhavna Hurgobin,

Bhavna Hurgobin

orcid.org/0000-0001-9603-2493

Telethon Kids Institute, The University of Western Australia, Perth, WA, Australia

These authors contributed equally to the work.Search for more papers by this author

Emma de Jong,

Emma de Jong

orcid.org/0000-0002-2501-6119

Telethon Kids Institute, The University of Western Australia, Perth, WA, Australia

These authors contributed equally to the work.Search for more papers by this author

Anthony Bosco,

Corresponding Author

Anthony Bosco

[email protected]

orcid.org/0000-0002-4335-615X

Telethon Kids Institute, The University of Western Australia, Perth, WA, Australia

Correspondence: Anthony Bosco, Telethon Kids Institute, The University of Western Australia, 15 Hospital Avenue, Nedlands, Perth, WA 6009, Australia. Email: [email protected]Search for more papers by this author

Bhavna Hurgobin,

Bhavna Hurgobin

orcid.org/0000-0001-9603-2493

Telethon Kids Institute, The University of Western Australia, Perth, WA, Australia

These authors contributed equally to the work.Search for more papers by this author

Emma de Jong,

Emma de Jong

orcid.org/0000-0002-2501-6119

Telethon Kids Institute, The University of Western Australia, Perth, WA, Australia

These authors contributed equally to the work.Search for more papers by this author

Anthony Bosco,

Corresponding Author

Anthony Bosco

[email protected]

orcid.org/0000-0002-4335-615X

Telethon Kids Institute, The University of Western Australia, Perth, WA, Australia

First published: 14 September 2018

https://doi.org/10.1111/resp.13401

Citations: 19

Series Editors: Ian A. Yang and Yuben Moodley

Share a link

Email
Wechat
Bluesky

ABSTRACT

Respiratory diseases such as asthma, chronic obstructive pulmonary disease and lung cancer represent a critical area for medical research as millions of people are affected globally. The development of new strategies for treatment and/or prevention, and the identification of biomarkers for patient stratification and early detection of disease inception are essential to reducing the impact of lung diseases. The successful translation of research into clinical practice requires a detailed understanding of the underlying biology. In this regard, the advent of next-generation sequencing and mass spectrometry has led to the generation of an unprecedented amount of data spanning multiple layers of biological regulation (genome, epigenome, transcriptome, proteome, metabolome and microbiome). Dealing with this wealth of data requires sophisticated bioinformatics and statistical tools. Here, we review the basic concepts in bioinformatics and genomic data analysis and illustrate the application of these tools to further our understanding of lung diseases. We also highlight the potential for data integration of multi-omic profiles and computational drug repurposing to define disease subphenotypes and match them to targeted therapies, paving the way for personalized medicine.

INTRODUCTION

Asthma, chronic obstructive pulmonary disease (COPD) and lung cancers represent the most common, non-communicable respiratory diseases. Asthma and COPD each affect over 200 million people globally, with asthma being the most common chronic disease among children,1 and COPD the fourth leading cause of death worldwide.2 While lung cancer affects comparatively fewer people (<2 million), it is the most commonly diagnosed cancer and the leading cause of cancer-related deaths.3 Together, these respiratory diseases contribute an immense global health burden, which exceeds €46 billion annually in Europe alone.4 The development of new strategies for the treatment or prevention of these diseases requires a comprehensive understanding of the underlying biology. In this regard, the advent of omics has enabled the generation of rich data sets that span a broad range of cellular and molecular states that underpin the pathogenesis of lung diseases. However, processing large data sets to pinpoint relevant causal pathways and therapeutic targets among the vast array of possible candidates is a challenging task, and below we highlight contributions from bioinformatics towards this goal.

TRANSCRIPTOMICS: FROM SHORT SEQUENCING READS TO BIOLOGICAL MECHANISMS

The transcriptome (the entire set of mRNA transcripts and non-coding RNA expressed at a given time) changes dynamically in response to environmental exposures and disease states. Assessment of the transcriptome can provide important insights into the molecular mechanisms of disease. Notably, the first distinctions between Th2-high and Th2-low asthma were based on expression levels of a panel of IL-13-induced genes.5 RNA-sequencing (RNA-Seq) is currently the method of choice for the assessment of transcriptome, and has several advantages over the previous gold-standard DNA microarray (reviewed).6 In essence, RNA-Seq is performed on short fragments of RNA (typically mRNA) to generate tens of millions of short sequencing reads per sample. The biggest challenge faced by researchers is extracting biological meaning from these reads. Below, we provide an overview of the tools and techniques employed for this purpose.

A workflow for the analysis of RNA-Seq data is illustrated in Figure 1. First, raw sequencing reads are checked for overall sequencing quality. FastQC7 has become the standard for performing this task, where each sequencing file is evaluated with respect to several quality control metrics. Each file is assigned a ‘pass’, ‘warning’ or ‘fail’ for each of these metrics, and MultiQC8 is employed to aggregate the FastQC results across all samples. Where necessary, data trimming tools such as Trimmomatic9 can be used to clean up the raw sequencing data prior to alignment of the genome, by removing artificial sequences, or those with low quality scores. These steps can improve the efficiency of the alignment step, and ensure only the best quality data are used for downstream analyses. There are several splice-aware read alignment tools (TopHat,10 HISAT11 and STAR),12 which map each sequencing read to a reference genome, to determine which genes they belong to. The percentage of mapped reads is an important quality metric, and should be around 70–90% in good quality human data sets.13 Post-alignment statistics can be evaluated with SAMStat.14 Next, the number of reads that uniquely align to each gene are counted, ultimately producing a large matrix of gene counts, where each column contains a sample and each row corresponds to a gene. There are a number of tools available for this step including summarizeOverlaps,15 featureCounts16 and htseq-count.17 From here, downstream analyses are typically performed using R or RStudio.18 First, exploratory data analysis techniques such as unsupervised clustering are performed on the gene count data to visualize how the samples group together, and identify outliers or the presence of batch effects. The most common techniques are principal component analysis (PCA) and multidimensional scaling (MDS). Both are data reduction techniques, which are applied to normalized data rather than the raw gene counts to account for any technical differences that may be present between samples. PCA operates directly on the gene count information while MDS is based on a measure of distance between samples (derived from the gene counts). In both analyses, closely related samples are expected to cluster together. Filtering out genes with low counts prior to clustering is recommended, as genes with low counts can be highly variable between samples, and may skew the data. Furthermore, the presence of batch effects or other sources of unwanted variation can significantly impact downstream analyses. RUVseq19 can systematically model unwanted variation in the data, and remove it from the analysis. In addition to finding problems with data quality, clustering techniques can also be used to identify subphenotypes. For example, Kuo et al. employed cluster analysis to transcriptomic data from >90 bronchial biopsies or epithelial brushings from patients with moderate-to-severe asthma.20 By focusing on inflammatory genes, they identified a distinct subgroup of patients, who were characterized by high counts of submucosal eosinophils, high exhaled nitric oxide and high use of oral corticosteroids.

Details are in the caption following the image — **Figure 1**
Open in figure viewer PowerPoint

Overview of the steps involved in the analysis of RNA-sequencing (RNA-Seq) data, including various types of downstream analyses that can be performed with gene expression data. The grey boxes point to examples of tools that can be used to process and analyse the data.

Following initial exploratory data analysis steps, the goal of most RNA-Seq experiments is to identify differentially expressed genes. This can be performed with methods such as edgeR,21 DESeq222 or limma-voom.23 A comparison of these tools is beyond the scope of this article, but is available elsewhere.24 The typical output from a differential gene expression analysis contains a list of genes with an official gene identifier (e.g. Ensembl ID and gene symbol) alongside a test statistic, log fold-change and a significant P-value (both unadjusted and adjusted for multiple comparisons). It is at this point that biological interpretation of the data begins.

Pathways analysis provides important insights into the collective biological functions of a set of genes. Popular pathways analysis tools include DAVID,25 InnateDB,26 Enrichr27 and the Molecular Signatures Database.28, 29 Each tool has inherent biases (e.g. some tools specialize in immune pathways), and in our experience performing pathways analysis with multiple tools can provide complementary information. These analyses, which require a list of differentially expressed genes (upregulated and downregulated) as input, are straight forward to perform. Statistical significance can be assessed using a Fisher's exact test, hypergeometric test or gene-set enrichment test (see below), and P-values are adjusted for multiple testing.

Whilst informative, pathways analyses cannot unveil the regulatory mechanisms that are driving a biological response or disease process. In this regard, Upstream Regulator Analysis can be performed to infer the putative molecular drivers of a set of differentially expressed genes, based on experimentally determined cause-and-effect relationships extracted from the literature.30 Two statistical measures are calculated; the overlap P-value and an activation Z-score. The overlap P-value is derived from a Fisher's exact test to determine whether there is a significant overlap between the user-identified gene list, and the known downstream targets of a candidate regulator. The activation Z-score is used to infer the activation state of a predicted regulator, leveraging prior knowledge of the direction of target gene regulation (up or down) when the predicted transcriptional regulator is ‘activated’ or ‘inhibited’. The predicted regulators identified from Upstream Regulator Analysis can be experimentally validated using gene silencing techniques.31

A limitation of Upstream Regulator and pathways analysis is that they do not provide a holistic, systems-level view of the data. Notably, biological systems are governed by a universal set of organizing principles, or common design elements that are found in other complex interconnected systems such as the stock market and the world-wide-web.32 Network analysis is a powerful tool to study the organization and function of biological systems. The underlying concept is that a biological system can be represented as a network graph of interacting genes or gene products (mRNA transcripts and proteins). The network graph reveals fundamental topological properties of biological systems. Gene networks have a modular architecture and power-law degree distribution. The modular property describes the organization of the network into smaller subnetworks that are functionally and/or topologically isolated; the power-law distribution means that most genes have few interactions, and a few genes have many, behaving as hubs that dominate the network structure are more essential to network integrity. A popular algorithm for network analysis of gene expression data is weighted gene co-expression network analysis (WGCNA).33 This algorithm infers network structure from experimental data based on the correlation patterns between all gene pairs across the samples. Prior to network analysis, gene counts are transformed using a variance stabilizing transformation from the DESeq2 package. Modules of highly co-expressed genes are identified by hierarchical clustering. The resulting network modules can be interrogated using the same pathways analysis tools noted above. Genes can be ranked according to properties such as intramodular connectivity (defined as the sum of the correlation patterns for a given within each module) to identify ‘hub’ genes and prioritize them for follow-up mechanistic studies. WGCNA of sputum transcriptomes was recently utilized to further our understanding of why only a subset of house dust mite sensitized individuals suffer from asthma despite ubiquitous environmental exposure.34 The data showed that Th2-associated inflammation was upregulated in house dust mite allergic subjects regardless of the presence or absence of asthma. However, in the subjects with asthma, Th2-related pathways formed interlinked networks with airway epithelial pathways (EGFR, ERBB2 and CDH1), suggesting that pathological Th2 inflammation involves rewiring of Th2 gene networks with epithelial repair/remodelling networks.

An alternative approach to WGCNA is to build regulatory networks between transcription factors and target genes. The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) uses mutual information to infer statistical dependence between transcription factors and target genes across a gene expression data set.35 The Passing Attributes between Networks for Data Assimilation (PANDA) algorithm starts with a pre-specified network model that is based on transcription factor motif data, and employs message passing theory to integrate information derived from multiple layers (sequence motif data, gene expression profiles and protein–protein interactions) resulting in more accurate network models.36 Both algorithms have extended functionality to extract sample-specific information on gene network patterns, and identify network driver genes that determine biological state transitions.35, 37-39

GENOME-WIDE ASSOCIATION STUDIES

Genome-wide association studies (GWAS) identify associations between genetic markers and phenotypic traits by assessing thousands of individuals in either case–control40-43 or quantitative40, 44 study designs. GWAS leverage the discovery of millions of single nucleotide polymorphism (SNP) across the human genome, and the fact that a subset of these SNP encapsulates common genetic variation via a phenomenon called linkage disequilibrium (LD; process by which a group of correlated SNP tend to be inherited more often than by chance).40, 41, 44-46 This has resulted in a deluge of data, requiring the development and application of bioinformatics tools for pre-processing and analysis. Here, we describe a general framework for the analysis of data from a typical GWAS, and discuss how the application of bioinformatics and multi-omics integrative approaches have led to advances in our understanding of lung diseases.

There are three major steps involved in the analysis of data from a typical GWAS (Fig. 2). Whole-genome sequencing (WGS) data would undergo read quality assessment, trimming of low-quality bases, read alignment and post-alignment quality checks as described in the previous section of transcriptomics. However, read alignment would be performed by tools such as SOAP47 and BWA,48 which have been designed to specifically handle WGS data. Additionally, several post-alignment processing steps would be carried out. For instance, PCR duplicates, which arise during DNA amplification,49 would be removed using Picard tools (http://broadinstitute.github.io/picard). The elimination of uniquely mapped reads that fall below a given quality threshold or reads that do not uniquely map in proper pairs may also be required using SAMtools.50

The second step consists of performing variant calling on the output from the first step, to identify all sites where an alternate allele is present either in the case or control samples.51 Some popular tools available for this are Varscan2,52 GATK,51 FreeBayes53 and SAMtools/BCFtools.50 It is likely that the initial variant calling step will produce more SNP than expected. This normally occurs as a result of the reduced sensitivity of the variant callers during this initial step to reduce false negatives. Therefore, post-variant filtering is carried out by the variant caller itself to remove false positives. Otherwise, additional tools such as vcflib (https://github.com/vcflib/vcflib), which are specifically designed for this purpose and required by some variant callers, can be used.

In the third step, several quality control checks40, 44-46 are performed on the filtered variants prior to GWAS statistical analyses as described in Table 1. The majority of these steps (1–4, 7 and 8) can be performed by the popular software PLINK.56 Data imputation (i.e. statistical inference of unobserved genotypes), on the other hand, would be performed with tools such as IMPUTE2,57 MAcH58 and BEAGLE.59 After quality checks, association analysis is performed to find relationships between genotypes and phenotypes. This can be achieved via a single locus analysis40, 44 whereby individual SNP are tested sequentially under the null hypothesis of no association. Several methods available for this purpose include logistic regression, contingency tables (chi-square test and Fisher's exact test) and mixed linear models.40, 44, 60, 61 Given the large number of SNP being tested, the results have to be corrected for multiple hypothesis testing (Table 2).40, 44 However, these stringent corrective measures often lead to the exclusion of important loci. Therefore, the application of multi-locus analyses may be preferred as the joint effects of multiple markers is considered, thus preventing the need for multiple test correction.40, 44, 62

Table 1. Quality control steps that are applied to genetic data prior to statistical analysis in GWAS. The order in which these steps appear does not reflect the actual order in which they would be carried out

Step	Explanation
(1) SNP missing rate	The number of subjects in a population who are missing information on a specific SNP. SNP with a high missing rate will more likely introduce bias in downstream analyses and should be removed
(2) Subject missing rate	Number of SNP missing for a particular subject. High missing rate at the individual level could indicate poor DNA quality or technical issues during library preparation or sequencing
(3) Sex discrepancy	Discrepancies between the assigned sex of the individual in the study and the sex determined based on the genotype would be checked. This can only be done if SNP on the X and Y chromosomes have been surveyed. Any inconsistencies would point to a probable sample mix-up
(4) MAF	Only SNP that are above an MAF (frequency of the less common allele) threshold would be included. The threshold depends on the population under study and its size, for example large populations would use lower MAF thresholds. SNP with low MAF do not have sufficient power and should be discarded. Such SNP may also indicate genotyping errors
(5) HWE	The HWE law assumes that in the absence of selection, mutation or migration, the allele and genotype frequencies in a very large population should not change from generation to generation.46 SNP which deviate from HWE should be removed as their expected frequencies do not match their observed frequencies. Such SNP can arise due to genotyping errors. However, in cases compared to controls, the deviation from HWE can signify true association with disease risk
(6) Data imputation	This process determines the identity of missing genotypes in SNP data set. This is particularly useful for meta-analyses of GWAS where different genotyping platforms each with a different set of markers were used. In other words, some markers would be present in some GWAS but absent in others. In such cases, genotype imputation would be carried out whereby LD information from reference panels such as HapMap54 and the 1000 Genomes project55 would be used to estimate the missing genotypes
(7) Relatedness	This step determines how strongly two individuals are linked at the genetic level as the inclusion of related subjects could lead to incorrect estimations of standard errors of SNP effect sizes. A standard GWA study assumes that none of the individuals are related unless family data is being analysed
(8) Population stratification	GWAS where subjects of different ethnicities are involved would be subject to population stratification. This is due to the difference in allele frequencies that exist in different subpopulations, which can either lead to or hinder false positive and true positive associations, respectively. To correct for population stratification, the ancestry of each subject would be verified using dedicated software or approaches that compare the allele frequencies across their genome to those of the HapMap ethnic groups. Subjects whose genetic profiles do not match the target populations would be excluded. Principal component analysis and multidimensional scaling plots of the genetic data can also be generated to identify outliers

HWE, Hardy–Weinberg equilibrium; GWAS, genome-wide association study; LD, linkage disequilibrium; MAF, minor allele frequency; QC, quality control; SNP, single nucleotide polymorphism.

Table 2. Statistical methods used to correct for multiple testing. These are available in PLINK

Test	Comment
(1) Bonferroni correction	This method assumes that each SNP association test in a GWAS is independent of the others. This is not necessarily true as SNP in LD are correlated and therefore not independent. This method can be overly stringent and lead to false negatives
(2) FDR	This approach evaluates which percentage of the statistically significant results is true false positives and corrects for them. The FDR method is less stringent than the Bonferroni correction
(3) Permutation testing	In this approach, the genotype information of a given data set is kept constant, but the sample labels are rearranged n times. P-values are then calculated for each permutated data set to establish a threshold. This is the most computationally intensive method among the three, but it also yields the best results

FDR, false discovery rate; GWAS, genome-wide association study; LD, linkage disequilibrium; SNP, single nucleotide polymorphism.

The main limitation of GWAS is the lack of statistical power due to modest effect sizes of the variants identified.40, 44, 45 To overcome this limitation, multiple studies can be combined in a process called meta-analysis. By increasing sample size and surveying more variants throughout the genome, meta-analyses increase power to detect association signals and reduce false positives.40, 44, 45 METAL,63 which is commonly used for this purpose, pools evidence for association from individual GWAS through various statistical approaches revolving around P-values, Z-scores, effect size estimates and standard errors. Examples of how the leveraged statistical power of meta-analyses has revealed novel information in the context of asthma are well-documented.60, 61, 64-66 For instance, Demenais et al.60 utilized GWAS data from ~140 000 individuals of various ancestries and identified more than 800 asthma-related SNP at 18 loci, 5 of which had not previously been associated with asthma. The authors also observed significant overlap between these SNP and those associated with autoimmune, inflammatory and allergic diseases as well as lung cancer, supporting a pleiotropic effect and shared genetic susceptibility across these diseases.

A key challenge of interpreting GWAS is that many candidate variants are not replicated across studies, reflecting the complexity of these diseases and the marked heterogeneity observed among asthma and COPD endotypes. In addition, determining the functional relevance of GWAS hits remains a challenge as most risk variants are located within noncoding regions of the genome. Thus, it is becoming clear that integration of GWAS data with additional biological information is necessary to elucidate the intervening biology that links genetic variation with phenotypic traits, as illustrated in the study by Demenais et al. The authors leveraged reference epigenome data sets to show that the asthma-associated loci were enriched in enhancer marks found in cells of the immune system. Similarly, Ferreira et al.61 searched for shared susceptibility genes across asthma, allergic rhinitis and eczema. They identified 136 disease risk variants at 99 loci across 13 individual GWAS. By comparing these loci with cell type specific regulatory annotations contained within the ENCODE database, the authors showed that within blood, the strongest SNP enrichment occurred within T-helper cell subsets, supporting the well-established role of these cells in allergic disease.

COMPUTATIONAL DRUG REPURPOSING

Bringing a new drug to market is time-consuming (12–15 years) and expensive (>$1 billion), and many drugs with good safety profiles fail due to problems with efficacy.67 These drugs may potentially be useful for the treatment of alternative indications. Through online data repositories such as the Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/), bioinformaticians can access data from millions of samples derived from patients with lung diseases. In parallel, large-scale efforts such as the Library of Integrated Network-Based Cellular Signatures (LINCS) are underway to systematically profile the response of multiple cell lines and primary cells to thousands of genetic (gene knockout and overexpression) and small molecule perturbations.68 The LINCS profiles can be interrogated to identify drug compounds/combinations that are predicted to reverse disease-associated expression signatures or mimic therapeutic responses. The LINCS data can be accessed via a web-based search engine (http://amp.pharm.mssm.edu/L1000CDS2/#/index).69 Researchers can query this search engine by submitting lists of upregulated and downregulated genes.

Network analysis can be used in combination with computational drug repurposing to identify compounds that perturb gene network patterns. This approach has been employed to find repurposed drugs that enhance cancer immunotherapy.70 Notably, the development of drugs that block immune checkpoint pathways (CTLA4, PD-1 and PD-L1) was a major advance in the treatment of lung cancer; however, only around one-third of patients respond.71 Employing a mouse model of mesothelioma that was characterized by a dichotomous response to anti-CTLA4 therapy, gene expression profiles were obtained in whole tumours derived from responding and non-responding mice. WGCNA was performed to identify modules associated with treatment response. The module response patterns were compared with drug-induced perturbation signatures from the connectivity map (cMap) database72 to identify compounds that were predicted to reinforce the therapeutic response. This analysis was based on gene set enrichment analysis (GSEA) methods, which utilize Kolmogorov–Smirnov like statistics to determine if a set of genes is overrepresented at the extreme tails of a ranked list of differentially expressed genes.28, 72 This analysis unveiled all-trans retinoic acid as a candidate that was predicted to enhance the response rate to immune checkpoint blockade therapy, and this prediction was confirmed in experimental validation studies.

Transcription factors play a crucial role in driving the molecular states that underpin pathogenic states; however, these proteins are often considered to be undruggable.73 An exception relevant to respiratory disease was the development of a DNAzyme that targets the transcription factor GATA3.74, 75 Using bioinformatics, Shen et al. has developed a systematic approach to prioritize candidate inhibitors of transcriptional regulators.76 The approach relies on an algorithm called VIPER (virtual inference of protein activity by enriched regulon analysis), which estimates the activity of transcriptional regulators in an unbiased way based on the expression profile of their inferred target genes. Compounds that modulate the activity of transcriptional regulators are identified by running VIPER on the cMap or LINCS databases. VIPER requires a network model to be specified that connects transcriptional regulators to their target genes. The networks are constructed from multiple sources including ARACNe analysis of RNA-Seq profiles, systematic studies of transcription factor binding to regulatory regions of target genes and gene knockdown-induced expression profiles from the GEO database. For proof-of-concept, several candidate inhibitors of MYC and STAT3 were identified and experimentally validated. This approach represents a significant advance towards targeting network driver genes.

MULTI-OMIC DATA INTEGRATION

Biological systems are governed by multiple layers of molecular regulation (Fig. 3).77 A complete understanding of disease will therefore require an integrated model that describes the organization and behaviour of the whole system. The integration of multi-omic data is a computationally challenging task and an ongoing area of method development.78 One major challenge is the large number of variables that are measured in comparison to the small number of samples. Another challenge is variation in the scale, complexity and the correlation structure between data sets. For example, patterns of methylation and mRNA expression in the same set of genes can be inversely or positively correlated, or independent,79 and mRNA and protein expression levels are often not strongly correlated.80 Similarity Network Fusion proposes a unique solution to these problems.81 The approach entails constructing networks of patients instead of molecular features to integrate multi-omic profiles. A subject-to-subject similarity matrix is constructed for each type of omic data, and then message passing theory is employed to fuse the networks into a single similarity matrix. Because the number of predictors depends on the number of subjects, rather than molecular features, the method is well suited to integrate highly heterogeneous omic data sets. In addition, the unsupervised nature of the analysis is well suited to the discovery of novel molecular subphenotypes.

Li et al. employed Similarity Network Fusion to integrate multi-omic profiles from COPD patients, healthy non-smokers and smokers with normal lung function.82 They combined nine blocks of omic data across multiple molecular levels (mRNA, miRNA, proteins and metabolites) and anatomical locations (airway epithelium, lung resident immune cells, airway exudates, exosomes and serum). They found that the mean accuracy of subgroup prediction was 0.28 when each omic data block was analysed in isolation. However, combining data from multiple omic platforms increased the mean accuracy of prediction to 0.9. Moreover, the relationship between classification accuracy and sample size was also evaluated, and they found that combining data from at least five omic blocks could classify COPD patients with 100% accuracy, even with group sizes as small as n = 6. These data highlight the utility of Similarity Network Fusion-based data integration for accurate classification of patient subgroups with small number of patients in the presence of the confounding effects of cigarette smoking.

Integration of multi-omic profiles can provide insights into the vast connections between genes and environmental factors that impact on disease risk. In a landmark study, Price et al. followed 108 ‘healthy’ individuals for 9 months and collected biological samples every 3 months.83 Multi-omic profiles were generated, including WGS, gut microbiome, clinical diagnostic tests, proteins and metabolites. The genome sequencing data was summarized into polygenic risk scores for 127 traits. These risk scores provide a single variable that estimates genome-wide risk for a given trait by summing the number of risk alleles for each individual, weighted by effect size estimates from large GWAS.84 An inter-omic correlation network was then constructed to identify pairwise relationships between the omic data layers. The resulting network consisted of 766 nodes and 3470 edges. For example, α-diversity (species richness) of the microbiome was positively correlated with height and β-nerve growth factor levels, and negatively correlated with CSF-1, IL-8 and FLT3 ligand. The clinical diagnostic tests were utilized to identify deviations from wellness—or test results outside of normal reference ranges. The data were used to suggest evidence-based changes to diet (including supplements) and lifestyle (exercise and stress management) that were personalized for each participant and designed to improve their health.

Inferring network structure from multi-omic data is challenging because there are large numbers of highly correlated variables. If the aim of the study is to find a multi-omic signature comprised of a small set of biomarkers to discriminate a biological outcome of interest, statistical methods such as Multiblock Partial least squares (or projection to latent structures, MBPLS) can be employed.78 These methods seek to maximize covariance between summary vectors derived from each omic data block and a biological phenotype or response. The identified molecular features derived from multi-omic layers can be thought of as jointly contributing to a biological trait. Software for performing these analyses is available in Matlab and R.85, 86

CONCLUSIONS

Non-communicable lung diseases represent a significant global health problem. The pathogenesis of chronic lung diseases is determined by interactions between multiple genes and environmental factors, which are subject to multiple layers of molecular regulation. Bioinformatics will be essential to define molecular subphenotypes of disease in an unbiased way, and identify new therapies that are specifically targeted to the molecular drivers of each phenotype. As the field moves from single to multi-omic studies, we will start to develop a more complete understanding of the complex biology that determines disease expression, which in turn will lead to new opportunities for the diagnosis, treatment and prevention of chronic lung diseases.

Acknowledgements

A.B. is supported by a Fellowship from the Simon Lee Foundation and funding from the National Health and Medical Research Council.

The Authors

The main research interests of B.H. from the Telethon Kids Institute are in the implementation and application of bioinformatics approaches in the fields of transcriptomics, systems immunology and cancer genomics for the interpretation and visualization of biological data. The main research interests of E.d.J. from the Telethon Kids Institute are in using bioinformatics approaches to understand the underlying biology of childhood asthma and allergy, in particular assessing transcriptomic profiles of airway epithelium to investigate inherent changes associated with atopic asthma. The main research interests of A.B. from the Telethon Kids Institute are in using network graph theory and computational biology to work backwards (reverse engineering) from genomic profiles of immune responses to reconstruct the wiring diagram of the underlying gene networks. The long-term goal of this work is to unlock the basic mechanisms and principles that govern the functionality of the immune system in healthy and diseased states.

Abbreviations

ARACNe: Algorithm for the Reconstruction of Accurate Cellular Networks
BWA: Burrows-Wheeler Aligner
CDH1: Cadherin 1
CSF-1: colony stimulating factor 1
cMap: connectivity map
CTLA4: Cytotoxic T-Lymphocyte Associated Protein 4
ERBB2: erb-b2 receptor tyrosine kinase 2
EGFR: epidermal growth factor receptor
FDR: false discovery rate
FLT3: Fms Related Tyrosine Kinase 3
GEO: Gene Expression Omnibus
GWAS: genome-wide association study
HWE: Hardy–Weinberg equilibrium
LD: linkage disequilibrium
LINCS: Library of Integrated Network-Based Cellular Signatures
MAF: minor allele frequency
MDS: multidimensional scaling
PCA: principal component analysis
PD-1: programmed cell death protein 1
PD-L1: programmed death-ligand 1
RNA-Seq: RNA-sequencing
SNP: single nucleotide polymorphism
SOAP: Short Oligonucleotide Analysis Package
Th2: T helper cell type 2
VIPER: virtual inference of protein activity by enriched regulon analysis
WGCNA: weighted gene co-expression network analysis
WGS: whole-genome sequencing

REFERENCES

1Poole JA. Asthma is a major noncommunicable disease affecting over 230 million people worldwide and represents the most common chronic disease among children. Int. Immunopharmacol. 2014; 23: 315.
10.1016/j.intimp.2014.09.013
CAS PubMed Web of Science® Google Scholar
2López-Campos JL, Tan W, Soriano JB. Global burden of COPD. Respirology 2016; 21: 14–23.
10.1111/resp.12660
PubMed Web of Science® Google Scholar
3Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin Donald M, Forman D, Bray F. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 2014; 136: E359–86.
10.1002/ijc.29210
CAS PubMed Web of Science® Google Scholar
4European Lung White Book. Available from URL: https://www.erswhitebook.org/chapters/the-economic-burden-of-lung-disease/the-cost-of-respiratory-disease/. Accessed: 3 May 2018.
Google Scholar
5Woodruff PG, Modrek B, Choy DF, Jia G, Abbas AR, Ellwanger A, Arron JR, Koth LL, Fahy JV. T-helper type 2–driven inflammation defines major subphenotypes of asthma. Am. J. Respir. Crit. Care Med. 2009; 180: 388–95.
10.1164/rccm.200903-0392OC
CAS PubMed Web of Science® Google Scholar
6Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009; 10: 57–63.
10.1038/nrg2484
CAS PubMed Web of Science® Google Scholar
7Andrews S. FastQC: a quality control tool for high throughput sequence data. Available from URL: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed: 5 July 2018.
Google Scholar
8Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016; 32: 3047–8.
10.1093/bioinformatics/btw354
CAS PubMed Web of Science® Google Scholar
9Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30: 2114–20.
10.1093/bioinformatics/btu170
CAS PubMed Web of Science® Google Scholar
10Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009; 25: 1105–11.
10.1093/bioinformatics/btp120
CAS PubMed Web of Science® Google Scholar
11Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 2015; 12: 357–60.
10.1038/nmeth.3317
CAS PubMed Web of Science® Google Scholar
12Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013; 29: 15–21.
10.1093/bioinformatics/bts635
CAS PubMed Web of Science® Google Scholar
13Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016; 17: 13.
10.1186/s13059-016-0881-8
PubMed Web of Science® Google Scholar
14Lassmann T, Hayashizaki Y, Daub CO. SAMStat: monitoring biases in next generation sequencing data. Bioinformatics 2011; 27: 130–1.
10.1093/bioinformatics/btq614
CAS PubMed Web of Science® Google Scholar
15Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 2013; 9: e1003118.
10.1371/journal.pcbi.1003118
CAS PubMed Web of Science® Google Scholar
16Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014; 30: 923–30.
10.1093/bioinformatics/btt656
CAS PubMed Web of Science® Google Scholar
17Anders S, Pyl PT, Huber W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 2015; 31: 166–9.
10.1093/bioinformatics/btu638
CAS PubMed Web of Science® Google Scholar
18 R Core Team. R: a language and environment for statistical computing. 2013. Available from URL: http://www.R-project.org/. Accessed: 6 May 2018.
Google Scholar
19Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 2014; 32: 896–902.
10.1038/nbt.2931
CAS PubMed Web of Science® Google Scholar
20Kuo C-HS, Pavlidis S, Loza M, Baribaud F, Rowe A, Pandis I, Hoda U, Rossios C, Sousa A, Wilson SJ et al. A transcriptome-driven analysis of epithelial brushings and bronchial biopsies to define asthma phenotypes in U-BIOPRED. Am. J. Respir. Crit. Care Med. 2016; 195: 443–55.
10.1164/rccm.201512-2452OC
Web of Science® Google Scholar
21Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010; 26: 139–40.
10.1093/bioinformatics/btp616
CAS PubMed Web of Science® Google Scholar
22Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15: 550.
10.1186/s13059-014-0550-8
CAS PubMed Web of Science® Google Scholar
23Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma Powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43: e47.
10.1093/nar/gkv007
CAS PubMed Web of Science® Google Scholar
24Schurch NJ, Schofield P, Gierliński M, Cole C, Sherstnev A, Singh V, Wrobel N, Gharbi K, Simpson GG, Owen-Hughes T et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 2016; 22: 839–51.
10.1261/rna.053959.115
CAS PubMed Web of Science® Google Scholar
25Huang d W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009; 4: 44–57.
10.1038/nprot.2008.211
CAS PubMed Web of Science® Google Scholar
26Breuer K, Foroushani AK, Laird MR, Chen C, Sribnaia A, Lo R, Winsor GL, Hancock REW, Brinkman FSL, Lynn DJ. InnateDB: systems biology of innate immunity and beyond—recent updates and continuing curation. Nucleic Acids Res. 2013; 41: D1228–33.
10.1093/nar/gks1147
CAS PubMed Web of Science® Google Scholar
27Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, Koplev S, Jenkins SL, Jagodnik KM, Lachmann A et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016; 44: W90–7.
10.1093/nar/gkw377
CAS PubMed Web of Science® Google Scholar
28Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 2005; 102: 15545–50.
10.1073/pnas.0506580102
CAS PubMed Web of Science® Google Scholar
29Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov Jill P, Tamayo P. The Molecular Signatures Database hallmark gene set collection. Cell Syst. 2015; 1: 417–25.
10.1016/j.cels.2015.12.004
CAS PubMed Web of Science® Google Scholar
30Krämer A, Green J, Pollard J, Tugendreich S. Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics 2014; 30: 523–30.
10.1093/bioinformatics/btt703
CAS PubMed Web of Science® Google Scholar
31Bosco A, Wiehler S, Proud D. Interferon regulatory factor 7 regulates airway epithelial cell responses to human rhinovirus infection. BMC Genomics 2016; 17: 76.
10.1186/s12864-016-2405-z
PubMed Web of Science® Google Scholar
32Ma'ayan A. Complex systems biology. J. R. Soc. Interface 2017; 14: 20170391.
10.1098/rsif.2017.0391
PubMed Web of Science® Google Scholar
33Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008; 9: 559.
10.1186/1471-2105-9-559
CAS PubMed Web of Science® Google Scholar
34Jones AC, Troy NM, White E, Hollams EM, Gout AM, Ling K-M, Kicic A, Stick SM, Sly PD, Holt PG et al. Persistent activation of interlinked type 2 airway epithelial gene networks in sputum-derived cells from aeroallergen-sensitized symptomatic asthmatics. Sci. Rep. 2018; 8: 1511.
10.1038/s41598-018-19837-6
PubMed Web of Science® Google Scholar
35Alvarez MJ, Shen Y, Giorgi FM, Lachmann A, Ding BB, Ye BH, Califano A. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat. Genet. 2016; 48: 838–47.
10.1038/ng.3593
CAS PubMed Web of Science® Google Scholar
36Glass K, Huttenhower C, Quackenbush J, Yuan GC. Passing messages between biological networks to refine predicted interactions. PLoS One 2013; 8: e64832.
10.1371/journal.pone.0064832
CAS PubMed Web of Science® Google Scholar
37van IDG, Glass K, Quackenbush J, Kuijjer ML. PyPanda: a Python package for gene regulatory network reconstruction. Bioinformatics 2016; 32: 3363–5.
10.1093/bioinformatics/btw422
PubMed Web of Science® Google Scholar
38Schlauch D, Glass K, Hersh CP, Silverman EK, Quackenbush J. Estimating drivers of cell state transitions using gene regulatory network models. BMC Syst. Biol. 2017; 11: 139.
10.1186/s12918-017-0517-y
PubMed Web of Science® Google Scholar
39Padi M, Quackenbush J. Detecting phenotype-driven transitions in regulatory network structure. NPJ Syst. Biol. Appl. 2018; 4: 16.
10.1038/s41540-018-0052-5
PubMed Web of Science® Google Scholar
40Bush WS, Moore JH. Genome-wide association studies. PLoS Comput. Biol. 2012; 8: e1002822.
10.1371/journal.pcbi.1002822
CAS PubMed Web of Science® Google Scholar
41Witte JS. Genome-wide association studies and beyond. Annu. Rev. Public Health 2010; 31: 9–20.
10.1146/annurev.publhealth.012809.103723
PubMed Web of Science® Google Scholar
42Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996; 273: 1516–7.
10.1126/science.273.5281.1516
CAS PubMed Web of Science® Google Scholar
43Lander ES. The new genomics: global views of biology. Science 1996; 274: 536–9.
10.1126/science.274.5287.536
CAS PubMed Web of Science® Google Scholar
44Zeng P, Zhao Y, Qian C, Zhang L, Zhang R, Gou J, Liu J, Liu L, Chen F. Statistical analysis for genome-wide association study. J. Biomed. Res. 2015; 29: 285.
PubMed Web of Science® Google Scholar
45De R, Bush WS, Moore JH. Bioinformatics challenges in genome-wide association studies (GWAS). In: RJ Trent (ed) Clinical Bioinformatics. New York: Springer, 2008.
Google Scholar
46Marees AT, de Kluiver H, Stringer S, Vorspan F, Curis E, Marie-Claire C, Derks EM. A tutorial on conducting genome-wide association studies: quality control and statistical analysis. Int. J. Methods Psychiatr. Res. 2018; 27: e1608.
10.1002/mpr.1608
PubMed Web of Science® Google Scholar
47Li R, Yu C, Li Y, Lam T-W, Yiu S-M, Kristiansen K, Wang J. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009; 25: 1966–7.
10.1093/bioinformatics/btp336
CAS PubMed Web of Science® Google Scholar
48Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009; 25: 1754–60.
10.1093/bioinformatics/btp324
CAS PubMed Web of Science® Google Scholar
49Ebbert MT, Wadsworth ME, Staley LA, Hoyt KL, Pickett B, Miller J, Duce J, Kauwe JS, Ridge PG. Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches. BMC Bioinformatics 2016; 17: 239.
10.1186/s12859-016-1097-3
PubMed Web of Science® Google Scholar
50Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics 2009; 25: 2078–9.
10.1093/bioinformatics/btp352
CAS PubMed Web of Science® Google Scholar
51DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Del Angel G, Rivas MA, Hanna M. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011; 43: 491–8.
10.1038/ng.806
CAS PubMed Web of Science® Google Scholar
52Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012; 22: 568–76.
10.1101/gr.129684.111
CAS PubMed Web of Science® Google Scholar
53Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. [Preprint at http://arxiv.org/abs/1207.3907]: Cornell University; 2012.
Google Scholar
54 The International HapMap Consortium. The international HapMap project. Nature 2003; 426: 789.
10.1038/nature02168
CAS PubMed Web of Science® Google Scholar
55 The 1000 Genomes Project. Variation in genome-wide mutation rates within and between human families. Nat. Genet. 2011; 43: 712.
10.1038/ng.862
PubMed Web of Science® Google Scholar
56Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007; 81: 559–75.
10.1086/519795
CAS PubMed Web of Science® Google Scholar
57Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009; 5: e1000529.
10.1371/journal.pgen.1000529
CAS PubMed Web of Science® Google Scholar
58Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 2010; 34: 816–34.
10.1002/gepi.20533
CAS PubMed Web of Science® Google Scholar
59Browning BL, Browning SR. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 2016; 98: 116–26.
10.1016/j.ajhg.2015.11.020
CAS PubMed Web of Science® Google Scholar
60Demenais F, Margaritte-Jeannin P, Barnes KC, Cookson WO, Altmüller J, Ang W, Barr RG, Beaty TH, Becker AB, Beilby J. Multiancestry association study identifies new asthma risk loci that colocalize with immune-cell enhancer marks. Nat. Genet. 2018; 50: 42–53.
10.1038/s41588-017-0014-7
CAS PubMed Web of Science® Google Scholar
61Ferreira MA, Vonk JM, Baurecht H, Marenholz I, Tian C, Hoffman JD, Helmer Q, Tillander A, Ullemar V, van Dongen J. Shared genetic origin of asthma, hay fever and eczema elucidates allergic disease biology. Nat. Genet. 2017; 49: 1752–7.
10.1038/ng.3985
CAS PubMed Web of Science® Google Scholar
62Wang S-B, Feng J-Y, Ren W-L, Huang B, Zhou L, Wen Y-J, Zhang J, Dunwell JM, Xu S, Zhang Y-M. Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology. Sci. Rep. 2016; 6: 19444.
10.1038/srep19444
CAS PubMed Web of Science® Google Scholar
63Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 2010; 26: 2190–1.
10.1093/bioinformatics/btq340
CAS PubMed Web of Science® Google Scholar
64Salpeter SR, Buckley NS, Ormiston TM, Salpeter EE. Meta-analysis: effect of long-acting β-agonists on severe asthma exacerbations and asthma-related deaths. Ann. Intern. Med. 2006; 144: 904–12.
10.7326/0003-4819-144-12-200606200-00126
CAS PubMed Web of Science® Google Scholar
65Shrewsbury S, Pyke S, Britton M. Meta-analysis of increased dose of inhaled steroid or addition of salmeterol in symptomatic asthma (MIASMA). BMJ 2000; 320: 1368–73.
10.1136/bmj.320.7246.1368
CAS PubMed Web of Science® Google Scholar
66Torgerson DG, Ampleford EJ, Chiu GY, Gauderman WJ, Gignoux CR, Graves PE, Himes BE, Levin AM, Mathias RA, Hancock DB. Meta-analysis of genome-wide association studies of asthma in ethnically diverse North American populations. Nat. Genet. 2011; 43: 887.
10.1038/ng.888
CAS PubMed Web of Science® Google Scholar
67Nosengo N. Can you teach old drugs new tricks? Nature 2016; 534: 314–6.
10.1038/534314a
PubMed Web of Science® Google Scholar
68Keenan AB, Jenkins SL, Jagodnik KM, Koplev S, He E, Torre D, Wang Z, Dohlman AB, Silverstein MC, Lachmann A et al. The library of integrated network-based cellular signatures NIH program: system-level cataloging of human cells response to perturbations. Cell Syst. 2018; 6: 13–24.
10.1016/j.cels.2017.11.001
CAS PubMed Web of Science® Google Scholar
69Duan Q, Reid SP, Clark NR, Wang Z, Fernandez NF, Rouillard AD, Readhead B, Tritsch SR, Hodos R, Hafner M et al. L1000CDS(2): LINCS L1000 characteristic direction signatures search engine. NPJ Syst. Biol. Appl. 2016; 2: pii: 16015.
10.1038/npjsba.2016.15
Google Scholar
70Lesterhuis WJ, Rinaldi C, Jones A, Rozali EN, Dick IM, Khong A, Boon L, Robinson BW, Nowak AK, Bosco A et al. Network analysis of immunotherapy-induced regressing tumours identifies novel synergistic drug combinations. Sci. Rep. 2015; 5: 12298.
10.1038/srep12298
PubMed Web of Science® Google Scholar
71Jain P, Jain C, Velcheti V. Role of immune-checkpoint inhibitors in lung cancer. Ther. Adv. Respir. Dis. 2018; 12: 1753465817750075.
10.1177/1753465817750075
Web of Science® Google Scholar
72Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 2006; 313: 1929–35.
10.1126/science.1132939
CAS PubMed Web of Science® Google Scholar
73Dang CV, Reddy EP, Shokat KM, Soucek L. Drugging the 'undruggable' cancer targets. Nat. Rev. Cancer 2017; 17: 502–8.
10.1038/nrc.2017.36
CAS PubMed Web of Science® Google Scholar
74Krug N, Hohlfeld JM, Kirsten AM, Kornmann O, Beeh KM, Kappeler D, Korn S, Ignatenko S, Timmer W, Rogon C et al. Allergen-induced asthmatic responses modified by a GATA3-specific DNAzyme. N. Engl. J. Med. 2015; 372: 1987–95.
10.1056/NEJMoa1411776
PubMed Web of Science® Google Scholar
75Greulich T, Hohlfeld JM, Neuser P, Lueer K, Klemmer A, Schade-Brittinger C, Harnisch S, Garn H, Renz H, Homburg U et al. A GATA3-specific DNAzyme attenuates sputum eosinophilia in eosinophilic COPD patients: a feasibility randomized clinical trial. Respir. Res. 2018; 19: 55.
10.1186/s12931-018-0751-x
PubMed Web of Science® Google Scholar
76Shen Y, Alvarez MJ, Bisikirska B, Lachmann A, Realubit R, Pampou S, Coku J, Karan C, Califano A. Systematic, network-based characterization of therapeutic target inhibitors. PLoS Comput. Biol. 2017; 13: e1005599.
10.1371/journal.pcbi.1005599
PubMed Web of Science® Google Scholar
77Noell G, Faner R, Agusti A. From systems biology to P4 medicine: applications in respiratory medicine. Eur. Respir. Rev. 2018; 27: pii: 170110.
10.1183/16000617.0110-2017
Web of Science® Google Scholar
78Huang S, Chaudhary K, Garmire LX. More is better: recent Progress in multi-omics data integration methods. Front. Genet. 2017; 8: 84.
10.3389/fgene.2017.00084
CAS PubMed Web of Science® Google Scholar
79van Eijk KR, de Jong S, Boks MP, Langeveld T, Colas F, Veldink JH, de Kovel CG, Janson E, Strengman E, Langfelder P et al. Genetic analysis of DNA methylation and gene expression levels in whole blood of healthy human subjects. BMC Genomics 2012; 13: 636.
10.1186/1471-2164-13-636
CAS PubMed Web of Science® Google Scholar
80Liu Y, Beyer A, Aebersold R. On the dependency of cellular protein levels on mRNA abundance. Cell 2016; 165: 535–50.
10.1016/j.cell.2016.03.014
CAS PubMed Web of Science® Google Scholar
81Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibe-Kains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 2014; 11: 333–7.
10.1038/nmeth.2810
CAS PubMed Web of Science® Google Scholar
82Li CX, Wheelock CE, Skold CM, Wheelock AM. Integration of multi-omics datasets enables molecular classification of COPD. Eur. Respir. J. 2018; 51: pii: 1701930.
10.1183/13993003.01930-2017
Web of Science® Google Scholar
83Price ND, Magis AT, Earls JC, Glusman G, Levy R, Lausted C, McDonald DT, Kusebauch U, Moss CL, Zhou Y et al. A wellness study of 108 individuals using personal, dense, dynamic data clouds. Nat. Biotechnol. 2017; 35: 747–56.
10.1038/nbt.3870
CAS PubMed Web of Science® Google Scholar
84Lewis CM, Vassos E. Prospects for using risk scores in polygenic medicine. Genome Med. 2017; 9: 96.
10.1186/s13073-017-0489-y
PubMed Web of Science® Google Scholar
85Rohart F, Gautier B, Singh A, Le Cao KA. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput. Biol. 2017; 13: e1005752.
10.1371/journal.pcbi.1005752
PubMed Web of Science® Google Scholar
86Li W, Zhang S, Liu C-C, Zhou XJ. Identifying multi-layer gene regulatory modules from multi-dimensional genomic data. Bioinformatics 2012; 28: 2458–66.
10.1093/bioinformatics/bts476
CAS PubMed Web of Science® Google Scholar

Citing Literature

Volume23, Issue12

December 2018

Pages 1117-1126

This article also appears in:

Molecular Techniques for Respiratory Diseases

Insights into respiratory disease through bioinformatics

ABSTRACT

INTRODUCTION

TRANSCRIPTOMICS: FROM SHORT SEQUENCING READS TO BIOLOGICAL MECHANISMS

GENOME-WIDE ASSOCIATION STUDIES

COMPUTATIONAL DRUG REPURPOSING

MULTI-OMIC DATA INTEGRATION

CONCLUSIONS

Acknowledgements

The Authors

Abbreviations

REFERENCES

Citing Literature

Figures

References

Information

About Wiley Online Library

Help & Support

Opportunities

Connect with Wiley

Insights into respiratory disease through bioinformatics

ABSTRACT

INTRODUCTION

TRANSCRIPTOMICS: FROM SHORT SEQUENCING READS TO BIOLOGICAL MECHANISMS

GENOME-WIDE ASSOCIATION STUDIES

COMPUTATIONAL DRUG REPURPOSING

MULTI-OMIC DATA INTEGRATION

CONCLUSIONS

Acknowledgements

The Authors

Abbreviations

REFERENCES

Citing Literature

Figures

References

Related

Information