vignettes/intro_to_minimax.Rmd
intro_to_minimax.RmdThis vignette will give a brief overview of the practical problem that motivated us to work on this project, some mathematical details about our method, and a walkthrough of a real data analysis example.
To install this package, you will need the devtools:: package and either Rtools (Windows) or Xcode (Mac) to build packages from GitHub. Then, install this package via
library(devtools)
install_github("TransBioInfoLab/pathwayMultiomics")
Some time ago, our research team attempted to use multi-omics tools to answer a few questions concerning colorectal cancer. While the tools were often well-documented and had nice statistical properties, we consistently ran into one of the following two constraints:
In many cancer data sets, we could usually find a nice overlapping set of samples between two platforms of data (for instance, copy number variation and DNA methylation). However, as soon as we added in a third or fourth platform (such as gene and/or protein expression), the intersection of the samples recorded with these new platforms rapidly approached the empty set. We could start with 600+ samples, and we’d barely have 50 samples by the time we added a third platform and only a handful of samples (or even none at all) if we added a fourth platform.
If instead, we tried to match on features, we would run into a different set of issues. What is the best way to map from the “tallest” data, site- or probe-level data (as in DNA methylation, SNP, or RNA editing levels) to the “shortest” data, such as expressions of small sets of individual proteins? Using a feature-matching strategy would require us to discard many relevant biological features simply because they were not recorded in the most restricted data.
Rather than using only the intersection of the samples or only the intersection of the genomic features, we proposed to match across data sets by data summaries of biological pathway p-values. For example, consider omics data measured on three data platforms for colorectal adenocarcinoma: copy number variation, DNA methylation, and protein expression. These three data sets may share samples, but that isn’t required.
Our first step is to statistically assess the activity of a set of biological pathways in each genomic dataset independently and then match the pathway p-values across the three datasets. The resulting summary data would have one row per pathway in the collection and three columns of p-values for each pathway.
In this package, we provide an example of SNP, DNA methylation, and gene expression results for 640 subjects with Alzheimer’s disease (from the ROSMAP study) evaluated over the Broad Institute’s C2 CP collection. We show the first 10 rows (of the 2833 pathways in the collection with fewer than 200 genes or more than 4 genes):
| pathway | snpPval | dnamPval | rnaseqPval |
|---|---|---|---|
| BIOCARTA_41BB_PATHWAY | 0.7792572 | 0.8163709 | 0.0371308 |
| BIOCARTA_ACE2_PATHWAY | 0.8778045 | 0.2253418 | 0.0011873 |
| BIOCARTA_ACETAMINOPHEN_PATHWAY | 0.2675038 | 0.2972964 | 0.6133866 |
| BIOCARTA_ACH_PATHWAY | 0.5626200 | 0.5347566 | 0.0389610 |
| BIOCARTA_ACTINY_PATHWAY | 0.4538971 | 0.1351410 | 0.0359640 |
| BIOCARTA_AGPCR_PATHWAY | 0.5998621 | 0.0118939 | 0.0187746 |
| BIOCARTA_AGR_PATHWAY | 0.4561920 | 0.9892614 | 0.2667333 |
| BIOCARTA_AHSP_PATHWAY | 0.7556747 | 0.4608289 | 0.2947053 |
| BIOCARTA_AKAP13_PATHWAY | 0.8725796 | 0.3237901 | 0.2807193 |
| BIOCARTA_AKAP95_PATHWAY | 0.7723686 | 0.2584549 | 0.0369630 |
From a systems biology perspective, when we perform a multi-omics analysis, we often care about biological processes which are dysregulated across multiple layers. For example, a moderate cascading effect from DNA methylation upstream through gene expression to protein expression may be of more practical significance than a heavily dysregulated effect in a single layer of the process. The latter scenario will be easily and quickly detected in single-omics analysis, but the former scenario will be missed.
Therefore, the pathway MiniMax statistic is defined to be the minimum p-value among all pairwise maxima for that biological pathway. Biologically, it indicates if the pathway is dysregulated in two or more layers of genomic data. Mathematically, such a measure is equivalent to requiring that the second smallest p-value is significant. That is, we can equivalently define the MiniMax statistic to be the second-order statistic of the p-values.
Let us consider the scenario wherein we test that a single pathway is dysregulated for a single data platform \(g \in 1, 2, \ldots, G\). Under the null hypothesis \(H_0\) (that there is no signal within that pathway for genomics data set \(g\)), the \(p\)-value of a well-defined statistical test for this question follows a uniform distribution; that is, \(p_g \sim U[0,1]\). Further, consider \(G\) such tests for that pathway, and for the sake of argument, assume that they are independent (a horribly inaccurate assumption to be sure); then under the null hypothesis \(H_0\), \[ p_g \overset{i.i.d.}{\sim} U[0,1],\ g = 1, 2, \ldots, G. \] It is a known result that the order statistics for such a collection of uniform random variables follows a Beta distribution (denoted \(\mathbb{B}(\alpha, \beta)\) herein). Thus, \[ \text{MiniMax}(p_1, p_2, \ldots, p_G) \equiv p_{[2]} \sim \mathbb{B}(2, G + 1 - 2). \]
We know that this assumption of independence may not hold under real data. We believe that the pathway activity for one layer of genomic data should be related to the activity for that pathway in another genomic data layer. However, we believe that the use of the Beta distribution itself will still be appropriate, but that the parameters need to be adjusted to account for the dependence between \(p_1, p_2, \ldots, p_G\).
In order to estimate these Beta distribution parameters, we recommend that you run your single-platform analyses twice: once with the real data, and once with a random permutation of phenotypes in that dataset (to simulate pathway conditions under \(H_0\)). Once this has been completed, pass the MiniMax statistics calculated from platform \(p\)-values for each pathway under \(H_0\) to the MiniMax_estBetaParams() function. This function has options to use either the Method of Moments or Maximum Likelihood to estimate the parameters of this Beta distribution. If you do not have access to the original data (such as in the case of a meta-analysis), then the only option we offer right now is to use the closed-form definitions of these parameters (\(\{\alpha = 2, \beta = G + 1 - 2\}\)).
In the data set above, the MiniMax statistics and corresponding \(p\)-values are given below. Note that we assume that \(\{\alpha = 2, \beta = 2\}\) since the individual omics data p-values for DNA methylation and SNP were based on meta-analyses (we do not have access to the raw data to estimate the parameters as we would like). However, we verified in silico that these fixed values performed reasonably well. Therefore, we can add the statistic and its significance as follows:
### The results table ---
alzheimersMultiOmics_df
#> # A tibble: 2,833 x 8
#> pathway size snpPval snpFDR dnamPval dnamFDR rnaseqPval rnaseqFDR
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 BIOCARTA_GRANU~ 14 9.67e- 1 9.99e- 1 0.00241 0.359 0.919 0.936
#> 2 BIOCARTA_LYM_P~ 11 6.79e- 1 9.99e- 1 0.0156 0.538 0.0749 0.158
#> 3 BIOCARTA_BLYMP~ 12 3.31e-69 1.17e-66 0.138 0.858 0.999 0.999
#> 4 BIOCARTA_CARM_~ 24 3.30e- 1 9.99e- 1 0.0263 0.643 0.0287 0.0878
#> 5 BIOCARTA_LAIR_~ 17 2.17e- 1 9.99e- 1 0.0463 0.718 0.630 0.700
#> 6 BIOCARTA_VDR_P~ 24 8.78e- 1 9.99e- 1 0.115 0.840 0.0350 0.0988
#> 7 BIOCARTA_MTA3_~ 18 7.98e- 1 9.99e- 1 0.0494 0.721 0.105 0.197
#> 8 BIOCARTA_GABA_~ 9 3.71e- 1 9.99e- 1 1 1 0.0280 0.0865
#> 9 BIOCARTA_EGFR_~ 11 6.77e- 1 9.99e- 1 0.0339 0.668 0.118 0.213
#> 10 BIOCARTA_MONOC~ 10 5.26e- 1 9.99e- 1 0.657 1 0.167 0.268
#> # ... with 2,823 more rows
### Pathway Multi-Omics Significance with the MiniMax ---
# We accept the default values of the Beta Distribution
adMiniMax_df <-
alzheimersMultiOmics_df %>%
# The MiniMax() function takes in the gene set name and the three p-value
# columns
select(pathway, ends_with("Pval")) %>%
rename(
SNP = snpPval, DNAm = dnamPval, RNAseq = rnaseqPval
) %>%
MiniMax()
adMiniMax_df
#> # A tibble: 2,833 x 7
#> pathway SNP DNAm RNAseq MiniMax MiniMaxP drivers
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 PID_PDGFRB_PATHWAY 6.99e- 1 0.000145 1.67e-4 1.67e-4 8.33e-8 DNAm and~
#> 2 WP_CHEMOKINE_SIGNALIN~ 8.19e- 1 0.000317 1.94e-5 3.17e-4 3.02e-7 DNAm and~
#> 3 KEGG_HEMATOPOIETIC_CE~ 3.67e-36 0.000340 7.61e-1 3.40e-4 3.48e-7 DNAm and~
#> 4 PID_TCR_PATHWAY 4.48e- 4 0.0275 4.90e-4 4.90e-4 7.20e-7 RNAseq a~
#> 5 WP_REGULATION_OF_TOLL~ 3.76e- 5 0.0332 6.55e-4 6.55e-4 1.29e-6 RNAseq a~
#> 6 KEGG_CHEMOKINE_SIGNAL~ 7.90e- 1 0.000672 2.98e-4 6.72e-4 1.35e-6 DNAm and~
#> 7 PID_KIT_PATHWAY 2.55e- 1 0.000684 1.10e-4 6.84e-4 1.40e-6 DNAm and~
#> 8 WP_KIT_RECEPTOR_SIGNA~ 3.05e- 2 0.000368 1.41e-3 1.41e-3 5.94e-6 DNAm and~
#> 9 PID_CXCR4_PATHWAY 4.87e- 4 0.00150 7.29e-2 1.50e-3 6.76e-6 DNAm and~
#> 10 REACTOME_TCR_SIGNALING 6.06e-52 0.207 2.16e-3 2.16e-3 1.40e-5 RNAseq a~
#> # ... with 2,823 more rowsFinally, we can adjust these pathway \(p\)-values for multiple comparisons and filter to those most significant.
adRes_df <-
adMiniMax_df %>%
mutate(MiniMaxFDR = p.adjust(MiniMaxP, method = "fdr")) %>%
filter(MiniMaxFDR < 0.01)
adRes_df %>%
select(pathway, MiniMaxFDR, drivers) %>%
mutate(pathway = str_trunc(pathway, width = 45)) %>%
kableExtra::kable()| pathway | MiniMaxFDR | drivers |
|---|---|---|
| PID_PDGFRB_PATHWAY | 0.0002360 | DNAm and RNAseq |
| WP_CHEMOKINE_SIGNALING_PATHWAY | 0.0003282 | DNAm and RNAseq |
| KEGG_HEMATOPOIETIC_CELL_LINEAGE | 0.0003282 | DNAm and SNP |
| PID_TCR_PATHWAY | 0.0005100 | RNAseq and SNP |
| WP_REGULATION_OF_TOLLLIKE_RECEPTOR_SIGNALI… | 0.0005686 | RNAseq and SNP |
| KEGG_CHEMOKINE_SIGNALING_PATHWAY | 0.0005686 | DNAm and RNAseq |
| PID_KIT_PATHWAY | 0.0005686 | DNAm and RNAseq |
| WP_KIT_RECEPTOR_SIGNALING_PATHWAY | 0.0021048 | DNAm and RNAseq |
| PID_CXCR4_PATHWAY | 0.0021267 | DNAm and SNP |
| REACTOME_TCR_SIGNALING | 0.0039757 | RNAseq and SNP |
| REACTOME_HATS_ACETYLATE_HISTONES | 0.0071556 | RNAseq and SNP |
| REACTOME_MHC_CLASS_II_ANTIGEN_PRESENTATION | 0.0071556 | RNAseq and SNP |
| WP_INSULIN_SIGNALING | 0.0071556 | DNAm and RNAseq |
| REACTOME_RAC1_GTPASE_CYCLE | 0.0076145 | DNAm and RNAseq |
| REACTOME_DDX58_IFIH1_MEDIATED_INDUCTION_OF… | 0.0076145 | RNAseq and SNP |
| KEGG_NEUROTROPHIN_SIGNALING_PATHWAY | 0.0076145 | DNAm and RNAseq |
| WP_TRANSCRIPTION_FACTOR_REGULATION_IN_ADIP… | 0.0077424 | DNAm and RNAseq |
| REACTOME_SUMOYLATION_OF_CHROMATIN_ORGANIZA… | 0.0094065 | RNAseq and SNP |
| REACTOME_HCMV_INFECTION | 0.0094065 | RNAseq and SNP |
| REACTOME_NEUROTRANSMITTER_RECEPTORS_AND_PO… | 0.0097309 | RNAseq and SNP |
In our manuscript, we demonstrate many of the top pathways have been implicated in Alzheimer’s disease or in neurodegenerative conditions at large.
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pathwayMultiomics_0.0.0.9006 forcats_0.5.1
#> [3] stringr_1.4.0 dplyr_1.0.7
#> [5] purrr_0.3.4 readr_2.0.2
#> [7] tidyr_1.1.4 tibble_3.1.5
#> [9] ggplot2_3.3.5 tidyverse_1.3.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.7 svglite_2.0.0 lubridate_1.8.0 assertthat_0.2.1
#> [5] rprojroot_2.0.2 digest_0.6.28 utf8_1.2.2 R6_2.5.1
#> [9] cellranger_1.1.0 backports_1.3.0 RcppZiggurat_0.1.6 reprex_2.0.1
#> [13] evaluate_0.14 highr_0.9 httr_1.4.2 pillar_1.6.4
#> [17] rlang_0.4.12 readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.4
#> [21] rmarkdown_2.11 pkgdown_1.6.1 textshaping_0.3.6 desc_1.4.0
#> [25] webshot_0.5.2 munsell_0.5.0 broom_0.7.10 compiler_4.1.1
#> [29] modelr_0.1.8 xfun_0.27 pkgconfig_2.0.3 systemfonts_1.0.3
#> [33] htmltools_0.5.2 tidyselect_1.1.1 viridisLite_0.4.0 fansi_0.5.0
#> [37] crayon_1.4.2 tzdb_0.2.0 dbplyr_2.1.1 withr_2.4.2
#> [41] grid_4.1.1 jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.1
#> [45] DBI_1.1.1 magrittr_2.0.1 scales_1.1.1 Rfast_2.0.3
#> [49] cli_3.1.0 stringi_1.7.5 cachem_1.0.6 fs_1.5.0
#> [53] xml2_1.3.2 bslib_0.3.1 ellipsis_0.3.2 ragg_1.2.0
#> [57] generics_0.1.1 vctrs_0.3.8 kableExtra_1.3.4 tools_4.1.1
#> [61] glue_1.4.2 hms_1.1.1 parallel_4.1.1 fastmap_1.1.0
#> [65] yaml_2.2.1 colorspace_2.0-2 rvest_1.0.2 memoise_2.0.0
#> [69] knitr_1.36 haven_2.4.3 sass_0.4.0