Find Gene Set / Pathway Significance across Multi-Omics Data

Given a data frame of pathway-level p-values across multiple -omics platforms, use the MiniMax technique to assign statistical significance to concordant or cascading pathway-level biological effects.

MiniMax(
  pValues_df,
  pValuesNull_df = NULL,
  orderStat = 2L,
  method = c("parametric", "MLE", "MoM"),
  annotateResults = TRUE,
  ...
)

Arguments

pValues_df	A data frame of pathway / gene set p-values under true responses (this data set should contain true biological signal). The rows correspond to gene sets / pathways, and the columns correspond to the data platforms for the disease of interest.
pValuesNull_df	A data frame of pathway / gene set p-values under the null hypothesis, most likely constructed from randomly permuting the response and re-estimating all significance levels (this data set should NOT contain any true biological signal). As with `pValues_df`, the rows correspond to gene sets / pathways, and the columns correspond to the data platforms for the disease of interest. NOTE: if this data set is not provided, only `method = "parametric"` will be available.
orderStat	How many platforms should show a biological signal for a pathway / gene set to have multi-omic "enrichment"? Defaults to 2. See "Details" for more information.
method	If `pValuesNull_df` is provided, which estimation method will be used to find the parameters of the Beta Distribution? Options are `"parametric"` (no estimation from the data; this should be used only in cases where no MiniMax statistics under the null hypothesis are available, such as in the case of pure meta-analysis approaches), `"MLE"` (Maximum Likelihood Estimates), or `"MoM"` (Method of Moments estimates). Using `"MLE"` or `"MoM"` requires the user to provide `pValuesNull_df`. See "Details" for more information.
annotateResults	Should the platforms driving each result be marked? Defaults to `TRUE`. See `MiniMax_calculateDrivers` for more information.
...	Additional arguments passed to the `MiniMax_calculateDrivers` function.

Value

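A copy of the `pValues_df` data frame with two additional columns: `MiniMax` (the statistic value for each gene set) and `MiniMaxP` (the p-value of that statistic). When `annotateResults = TRUE`, a `drivers` column marking the platforms that drive each result is also included, as in the example output. The data frame is sorted by ascending MiniMax p-value.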

Details

Concerning Parameter Estimation Methods: We currently support 3 options to estimate the parameters of the Beta Distribution. The "parametric" option does not use the data, and it is therefore the only option available if pValuesNull_df is not provided. Instead, it assumes that the MiniMax statistics follow a Beta \((k, n + 1 - k)\) distribution, where \(k\) is the value of orderStat and \(n\) is the number of data platforms (nPlatforms). This is the distribution of the \(k\)-th smallest of \(n\) independent Uniform(0, 1) values, which is what the platform p-values are expected to be under the null hypothesis. See https://en.wikipedia.org/wiki/Order_statistic.
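The parametric case described above can be illustrated with base R alone. This is a minimal sketch, not the package's internal code: the helper name `parametric_minimax_p` is hypothetical, and the input value 0.012 is simply the MiniMax statistic of the first row in the example output (three platforms, `orderStat = 2`).

```r
# Hypothetical helper (not part of the package): the parametric MiniMax
# p-value is the Beta(k, n + 1 - k) CDF evaluated at the observed statistic.
parametric_minimax_p <- function(stat, orderStat = 2L, nPlatforms) {
  stats::pbeta(
    stat,
    shape1 = orderStat,
    shape2 = nPlatforms + 1L - orderStat
  )
}

# Three platforms, second-smallest p-value of 0.012: Beta(2, 2) null
p_example <- parametric_minimax_p(0.012, orderStat = 2L, nPlatforms = 3L)

# For Beta(2, 2) the CDF has the closed form 3x^2 - 2x^3
p_closed_form <- 3 * 0.012^2 - 2 * 0.012^3
```

The result need not match the `MiniMaxP` value shown in the Examples section, because that call uses `method = "MLE"` with parameters estimated from null data.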

The next two estimation options make use of the pValuesNull_df data frame. To build it, randomly permute the outcome of interest, then re-run the same statistical tests used on the real data (for each pathway and data platform) and record the resulting significance levels; more permutations are better. The "MLE" option uses the beta.mle function to find the Maximum Likelihood Estimates of \(\alpha\) and \(\beta\). The "MoM" option uses the closed-form Method of Moments estimators of \(\alpha\) and \(\beta\) as shown in https://en.wikipedia.org/wiki/Beta_distribution#Method_of_moments.
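The Method of Moments estimators referenced above are simple enough to write out. The following is a sketch under stated assumptions rather than the package's implementation: `mom_beta` is a hypothetical helper, and the null MiniMax statistics are simulated from a Beta(2, 2) distribution here purely as a stand-in for statistics computed from a real `pValuesNull_df`.

```r
# Hypothetical helper: closed-form Method of Moments estimates of the Beta
# shape parameters from a vector of values in (0, 1).
mom_beta <- function(x) {
  m <- mean(x)
  v <- stats::var(x)
  # The estimators are only defined when v < m * (1 - m)
  stopifnot(v < m * (1 - m))
  common <- m * (1 - m) / v - 1
  c(alpha = m * common, beta = (1 - m) * common)
}

# Stand-in for null MiniMax statistics (assumption: simulated, not real data)
set.seed(1)
null_stats <- stats::rbeta(5000, shape1 = 2, shape2 = 2)
est <- mom_beta(null_stats)

# p-value of an observed MiniMax statistic under the fitted null distribution
p_obs <- stats::pbeta(0.012, shape1 = est[["alpha"]], shape2 = est[["beta"]])
```

By construction, the fitted Beta distribution has the same mean and variance as the null statistics supplied to it.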

Concerning Appropriate Order Statistics: With the default orderStat = 2, the MiniMax operation is equivalent to sorting the p-values of each pathway and taking the second smallest. In our experience, setting this "order statistic" cutoff to 2 is appropriate for 5 or fewer data platforms. Biologically, this is equivalent to saying "if this pathway is dysregulated in at least two data types for this disease / condition, it is worthy of additional consideration". In situations where more than 5 data platforms are available for the disease of interest, we recommend increasing the orderStat value to 3.
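The equivalence described above can be checked on a few rows. This sketch uses the p-values of the first three pathways from the example output; the helper `minimax_stat` is hypothetical and only illustrates the order-statistic view, not the package internals.

```r
# p-values for three pathways across three platforms (from the example output)
p_mat <- rbind(
  cluster05 = c(pVal_CNV = 0.213, pVal_RNAseq = 0.007, pVal_Prot = 0.012),
  cluster24 = c(0.030, 0.016, 0.669),
  cluster32 = c(0.030, 0.037, 0.661)
)

# Hypothetical helper: the orderStat-th smallest p-value of one pathway
minimax_stat <- function(pvals, orderStat = 2L) {
  sort(pvals)[[orderStat]]
}

# Order-statistic view: second-smallest p-value per row
stats_sorted <- apply(p_mat, 1, minimax_stat)

# "Min of max" view: take the larger p-value of every pair of platforms,
# then keep the smallest of those maxima
stats_pairs <- apply(
  p_mat, 1,
  function(p) min(utils::combn(p, 2, FUN = max))
)
```

Both vectors agree with the `MiniMax` column shown for these pathways in the Examples section.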

Examples

 data("multiOmicsMedSignalResults_df")
 data("nullMiniMaxResults_df")

 MiniMax(
   pValues_df = multiOmicsMedSignalResults_df,
   pValuesNull_df = nullMiniMaxResults_df[, -5],
   method = "MLE",
   # Passed to the MiniMax_calculateDrivers() function
   drivers_char = c("cnv", "rnaSeq", "protein")
 )
#> # A tibble: 50 x 8
#>    terms     treated pVal_CNV pVal_RNAseq pVal_Prot MiniMax MiniMaxP drivers    
#>    <chr>     <lgl>      <dbl>       <dbl>     <dbl>   <dbl>    <dbl> <chr>      
#>  1 cluster05 TRUE       0.213       0.007     0.012   0.012  0.00105 protein an~
#>  2 cluster24 FALSE      0.03        0.016     0.669   0.03   0.00532 cnv and rn~
#>  3 cluster32 FALSE      0.03        0.037     0.661   0.037  0.00771 cnv and rn~
#>  4 cluster19 TRUE       0.18        0.004     0.044   0.044  0.0105  protein an~
#>  5 cluster40 FALSE      0.103       0.102     0.515   0.103  0.0459  cnv and rn~
#>  6 cluster26 FALSE      0.115       0.019     0.754   0.115  0.0554  cnv and rn~
#>  7 cluster49 FALSE      0.595       0.127     0.158   0.158  0.0947  protein an~
#>  8 cluster35 FALSE      0.034       0.201     0.688   0.201  0.141   cnv and rn~
#>  9 cluster45 FALSE      0.094       0.23      0.249   0.23   0.175   cnv and rn~
#> 10 cluster11 FALSE      0.247       0.159     0.473   0.247  0.197   cnv and rn~
#> # ... with 40 more rows