Introduction

This vignette will give a brief overview of the practical problem that motivated us to work on this project, some mathematical details about our method, and a walkthrough of a real data analysis example.

To install this package, you will need the devtools:: package and either Rtools (Windows) or Xcode (Mac) to build packages from GitHub. Then, install this package via

library(devtools)
install_github("TransBioInfoLab/pathwayMultiomics")

Motivating Example

Some time ago, our research team attempted to use multi-omics tools to answer a few questions concerning colorectal cancer. While the tools were often well-documented and had nice statistical properties, we consistently ran into one of the following two constraints:

  • Multi-Omics tools must match across platforms by samples, or
  • Multi-Omics tools must match across platforms by features (e.g. genes).

Joining data platforms

In many cancer data sets, we could usually find a nice overlapping set of samples between two platforms of data (for instance, copy number variation and DNA methylation). However, as soon as we added in a third or fourth platform (such as gene and/or protein expression), the intersection of the samples recorded with these new platforms rapidly approached the empty set. We could start with 600+ samples, and we’d barely have 50 samples by the time we added a third platform and only a handful of samples (or even none at all) if we added a fourth platform.

If instead, we tried to match on features, we would run into a different set of issues. What is the best way to map from the “tallest” data, site- or probe-level data (as in DNA methylation, SNP, or RNA editing levels) to the “shortest” data, such as expressions of small sets of individual proteins? Using a feature-matching strategy would require us to discard many relevant biological features simply because they were not recorded in the most restricted data.

A Pathway-based solution

Rather than using only the intersection of the samples or only the intersection of the genomic features, we proposed to match across data sets by data summaries of biological pathway p-values. For example, consider omics data measured on three data platforms for colorectal adenocarcinoma: copy number variation, DNA methylation, and protein expression. These three data sets may share samples, but that isn’t required.

Our first step is to statistically assess the activity of a set of biological pathways in each genomic dataset independently and then match the pathway p-values across the three datasets. The resulting summary data would have one row per pathway in the collection and three columns of p-values for each pathway.

In this package, we provide an example of SNP, DNA methylation, and gene expression results for 640 subjects with Alzheimer’s disease (from the ROSMAP study) evaluated over the Broad Institute’s C2 CP collection. We show the first 10 rows (of the 2833 pathways in the collection with fewer than 200 genes or more than 4 genes):

pathway snpPval dnamPval rnaseqPval
BIOCARTA_41BB_PATHWAY 0.7792572 0.8163709 0.0371308
BIOCARTA_ACE2_PATHWAY 0.8778045 0.2253418 0.0011873
BIOCARTA_ACETAMINOPHEN_PATHWAY 0.2675038 0.2972964 0.6133866
BIOCARTA_ACH_PATHWAY 0.5626200 0.5347566 0.0389610
BIOCARTA_ACTINY_PATHWAY 0.4538971 0.1351410 0.0359640
BIOCARTA_AGPCR_PATHWAY 0.5998621 0.0118939 0.0187746
BIOCARTA_AGR_PATHWAY 0.4561920 0.9892614 0.2667333
BIOCARTA_AHSP_PATHWAY 0.7556747 0.4608289 0.2947053
BIOCARTA_AKAP13_PATHWAY 0.8725796 0.3237901 0.2807193
BIOCARTA_AKAP95_PATHWAY 0.7723686 0.2584549 0.0369630

The MiniMax Statistic and its Distribution

The MiniMax statistic

From a systems biology perspective, when we perform a multi-omics analysis, we often care about biological processes which are dysregulated across multiple layers. For example, a moderate cascading effect from DNA methylation upstream through gene expression to protein expression may be of more practical significance than a heavily dysregulated effect in a single layer of the process. The latter scenario will be easily and quickly detected in single-omics analysis, but the former scenario will be missed.

Therefore, the pathway MiniMax statistic is defined to be the minimum p-value among all pairwise maxima for that biological pathway. Biologically, it indicates if the pathway is dysregulated in two or more layers of genomic data. Mathematically, such a measure is equivalent to requiring that the second smallest p-value is significant. That is, we can equivalently define the MiniMax statistic to be the second-order statistic of the p-values.

The Distribution of the MiniMax statistic

Let us consider the scenario wherein we test that a single pathway is dysregulated for a single data platform \(g \in 1, 2, \ldots, G\). Under the null hypothesis \(H_0\) (that there is no signal within that pathway for genomics data set \(g\)), the \(p\)-value of a well-defined statistical test for this question follows a uniform distribution; that is, \(p_g \sim U[0,1]\). Further, consider \(G\) such tests for that pathway, and for the sake of argument, assume that they are independent (a horribly inaccurate assumption to be sure); then under the null hypothesis \(H_0\), \[ p_g \overset{i.i.d.}{\sim} U[0,1],\ g = 1, 2, \ldots, G. \] It is a known result that the order statistics for such a collection of uniform random variables follows a Beta distribution (denoted \(\mathbb{B}(\alpha, \beta)\) herein). Thus, \[ \text{MiniMax}(p_1, p_2, \ldots, p_G) \equiv p_{[2]} \sim \mathbb{B}(2, G + 1 - 2). \]

Adjusting the parameters for dependence across data platforms

We know that this assumption of independence may not hold under real data. We believe that the pathway activity for one layer of genomic data should be related to the activity for that pathway in another genomic data layer. However, we believe that the use of the Beta distribution itself will still be appropriate, but that the parameters need to be adjusted to account for the dependence between \(p_1, p_2, \ldots, p_G\).

In order to estimate these Beta distribution parameters, we recommend that you run your single-platform analyses twice: once with the real data, and once with a random permutation of phenotypes in that dataset (to simulate pathway conditions under \(H_0\)). Once this has been completed, pass the MiniMax statistics calculated from platform \(p\)-values for each pathway under \(H_0\) to the MiniMax_estBetaParams() function. This function has options to use either the Method of Moments or Maximum Likelihood to estimate the parameters of this Beta distribution. If you do not have access to the original data (such as in the case of a meta-analysis), then the only option we offer right now is to use the closed-form definitions of these parameters (\(\{\alpha = 2, \beta = G + 1 - 2\}\)).

Continuing the Alzheimer’s disease example

In the data set above, the MiniMax statistics and corresponding \(p\)-values are given below. Note that we assume that \(\{\alpha = 2, \beta = 2\}\) since the individual omics data p-values for DNA methylation and SNP were based on meta-analyses (we do not have access to the raw data to estimate the parameters as we would like). However, we verified in silico that these fixed values performed reasonably well. Therefore, we can add the statistic and its significance as follows:

###  The results table  ---
alzheimersMultiOmics_df
#> # A tibble: 2,833 x 8
#>    pathway          size  snpPval   snpFDR dnamPval dnamFDR rnaseqPval rnaseqFDR
#>    <chr>           <dbl>    <dbl>    <dbl>    <dbl>   <dbl>      <dbl>     <dbl>
#>  1 BIOCARTA_GRANU~    14 9.67e- 1 9.99e- 1  0.00241   0.359     0.919     0.936 
#>  2 BIOCARTA_LYM_P~    11 6.79e- 1 9.99e- 1  0.0156    0.538     0.0749    0.158 
#>  3 BIOCARTA_BLYMP~    12 3.31e-69 1.17e-66  0.138     0.858     0.999     0.999 
#>  4 BIOCARTA_CARM_~    24 3.30e- 1 9.99e- 1  0.0263    0.643     0.0287    0.0878
#>  5 BIOCARTA_LAIR_~    17 2.17e- 1 9.99e- 1  0.0463    0.718     0.630     0.700 
#>  6 BIOCARTA_VDR_P~    24 8.78e- 1 9.99e- 1  0.115     0.840     0.0350    0.0988
#>  7 BIOCARTA_MTA3_~    18 7.98e- 1 9.99e- 1  0.0494    0.721     0.105     0.197 
#>  8 BIOCARTA_GABA_~     9 3.71e- 1 9.99e- 1  1         1         0.0280    0.0865
#>  9 BIOCARTA_EGFR_~    11 6.77e- 1 9.99e- 1  0.0339    0.668     0.118     0.213 
#> 10 BIOCARTA_MONOC~    10 5.26e- 1 9.99e- 1  0.657     1         0.167     0.268 
#> # ... with 2,823 more rows


###  Pathway Multi-Omics Significance with the MiniMax  ---
# We accept the default values of the Beta Distribution
adMiniMax_df <- 
    alzheimersMultiOmics_df %>% 
    # The MiniMax() function takes in the gene set name and the three p-value
    #   columns
    select(pathway, ends_with("Pval")) %>% 
    rename(
        SNP = snpPval, DNAm = dnamPval, RNAseq = rnaseqPval
    ) %>% 
    MiniMax()

adMiniMax_df
#> # A tibble: 2,833 x 7
#>    pathway                     SNP     DNAm   RNAseq MiniMax  MiniMaxP drivers  
#>    <chr>                     <dbl>    <dbl>    <dbl>   <dbl>     <dbl> <chr>    
#>  1 PID_PDGFRB_PATHWAY     6.99e- 1 0.000145  1.67e-4 1.67e-4   8.33e-8 DNAm and~
#>  2 WP_CHEMOKINE_SIGNALIN~ 8.19e- 1 0.000317  1.94e-5 3.17e-4   3.02e-7 DNAm and~
#>  3 KEGG_HEMATOPOIETIC_CE~ 3.67e-36 0.000340  7.61e-1 3.40e-4   3.48e-7 DNAm and~
#>  4 PID_TCR_PATHWAY        4.48e- 4 0.0275    4.90e-4 4.90e-4   7.20e-7 RNAseq a~
#>  5 WP_REGULATION_OF_TOLL~ 3.76e- 5 0.0332    6.55e-4 6.55e-4   1.29e-6 RNAseq a~
#>  6 KEGG_CHEMOKINE_SIGNAL~ 7.90e- 1 0.000672  2.98e-4 6.72e-4   1.35e-6 DNAm and~
#>  7 PID_KIT_PATHWAY        2.55e- 1 0.000684  1.10e-4 6.84e-4   1.40e-6 DNAm and~
#>  8 WP_KIT_RECEPTOR_SIGNA~ 3.05e- 2 0.000368  1.41e-3 1.41e-3   5.94e-6 DNAm and~
#>  9 PID_CXCR4_PATHWAY      4.87e- 4 0.00150   7.29e-2 1.50e-3   6.76e-6 DNAm and~
#> 10 REACTOME_TCR_SIGNALING 6.06e-52 0.207     2.16e-3 2.16e-3   1.40e-5 RNAseq a~
#> # ... with 2,823 more rows

Finally, we can adjust these pathway \(p\)-values for multiple comparisons and filter to those most significant.

adRes_df <- 
    adMiniMax_df %>% 
    mutate(MiniMaxFDR = p.adjust(MiniMaxP, method = "fdr")) %>% 
    filter(MiniMaxFDR < 0.01) 

adRes_df %>% 
    select(pathway, MiniMaxFDR, drivers) %>% 
    mutate(pathway = str_trunc(pathway, width = 45)) %>% 
    kableExtra::kable()
pathway MiniMaxFDR drivers
PID_PDGFRB_PATHWAY 0.0002360 DNAm and RNAseq
WP_CHEMOKINE_SIGNALING_PATHWAY 0.0003282 DNAm and RNAseq
KEGG_HEMATOPOIETIC_CELL_LINEAGE 0.0003282 DNAm and SNP
PID_TCR_PATHWAY 0.0005100 RNAseq and SNP
WP_REGULATION_OF_TOLLLIKE_RECEPTOR_SIGNALI… 0.0005686 RNAseq and SNP
KEGG_CHEMOKINE_SIGNALING_PATHWAY 0.0005686 DNAm and RNAseq
PID_KIT_PATHWAY 0.0005686 DNAm and RNAseq
WP_KIT_RECEPTOR_SIGNALING_PATHWAY 0.0021048 DNAm and RNAseq
PID_CXCR4_PATHWAY 0.0021267 DNAm and SNP
REACTOME_TCR_SIGNALING 0.0039757 RNAseq and SNP
REACTOME_HATS_ACETYLATE_HISTONES 0.0071556 RNAseq and SNP
REACTOME_MHC_CLASS_II_ANTIGEN_PRESENTATION 0.0071556 RNAseq and SNP
WP_INSULIN_SIGNALING 0.0071556 DNAm and RNAseq
REACTOME_RAC1_GTPASE_CYCLE 0.0076145 DNAm and RNAseq
REACTOME_DDX58_IFIH1_MEDIATED_INDUCTION_OF… 0.0076145 RNAseq and SNP
KEGG_NEUROTROPHIN_SIGNALING_PATHWAY 0.0076145 DNAm and RNAseq
WP_TRANSCRIPTION_FACTOR_REGULATION_IN_ADIP… 0.0077424 DNAm and RNAseq
REACTOME_SUMOYLATION_OF_CHROMATIN_ORGANIZA… 0.0094065 RNAseq and SNP
REACTOME_HCMV_INFECTION 0.0094065 RNAseq and SNP
REACTOME_NEUROTRANSMITTER_RECEPTORS_AND_PO… 0.0097309 RNAseq and SNP

In our manuscript, we demonstrate many of the top pathways have been implicated in Alzheimer’s disease or in neurodegenerative conditions at large.


Session Information

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] pathwayMultiomics_0.0.0.9006 forcats_0.5.1               
#>  [3] stringr_1.4.0                dplyr_1.0.7                 
#>  [5] purrr_0.3.4                  readr_2.0.2                 
#>  [7] tidyr_1.1.4                  tibble_3.1.5                
#>  [9] ggplot2_3.3.5                tidyverse_1.3.1             
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.7         svglite_2.0.0      lubridate_1.8.0    assertthat_0.2.1  
#>  [5] rprojroot_2.0.2    digest_0.6.28      utf8_1.2.2         R6_2.5.1          
#>  [9] cellranger_1.1.0   backports_1.3.0    RcppZiggurat_0.1.6 reprex_2.0.1      
#> [13] evaluate_0.14      highr_0.9          httr_1.4.2         pillar_1.6.4      
#> [17] rlang_0.4.12       readxl_1.3.1       rstudioapi_0.13    jquerylib_0.1.4   
#> [21] rmarkdown_2.11     pkgdown_1.6.1      textshaping_0.3.6  desc_1.4.0        
#> [25] webshot_0.5.2      munsell_0.5.0      broom_0.7.10       compiler_4.1.1    
#> [29] modelr_0.1.8       xfun_0.27          pkgconfig_2.0.3    systemfonts_1.0.3 
#> [33] htmltools_0.5.2    tidyselect_1.1.1   viridisLite_0.4.0  fansi_0.5.0       
#> [37] crayon_1.4.2       tzdb_0.2.0         dbplyr_2.1.1       withr_2.4.2       
#> [41] grid_4.1.1         jsonlite_1.7.2     gtable_0.3.0       lifecycle_1.0.1   
#> [45] DBI_1.1.1          magrittr_2.0.1     scales_1.1.1       Rfast_2.0.3       
#> [49] cli_3.1.0          stringi_1.7.5      cachem_1.0.6       fs_1.5.0          
#> [53] xml2_1.3.2         bslib_0.3.1        ellipsis_0.3.2     ragg_1.2.0        
#> [57] generics_0.1.1     vctrs_0.3.8        kableExtra_1.3.4   tools_4.1.1       
#> [61] glue_1.4.2         hms_1.1.1          parallel_4.1.1     fastmap_1.1.0     
#> [65] yaml_2.2.1         colorspace_2.0-2   rvest_1.0.2        memoise_2.0.0     
#> [69] knitr_1.36         haven_2.4.3        sass_0.4.0