A function to select groups of variables that are good at predicting a given phenotype. The groups considered correspond to various cut levels in a user-defined hierarchy. The selection is performed by penalty-based regression methods (weighted-LASSO or group-LASSO).

getHierLevel(
  X,
  y,
  hc.object,
  selection = c("rho-sicomore", "sicomore", "mlgl"),
  compression = "mean",
  depth.cut = 3,
  choice = c("lambda.min", "lambda.1se"),
  mc.cores = 2,
  stab = FALSE,
  stab.param = list(B = c(100, 100), cutoff = c(0.75, 0.75), PFER = c(1, 1))
)

Arguments

X

input matrix

y

response variable

hc.object

output of a hierarchical clustering algorithm in the hclust format (must be an "hclust" object)
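
As an illustration only, a suitable object can be built with base R. The sketch below assumes that the hierarchy is defined over the variables (the columns of X), since the function selects groups of variables; the toy data, the variable names and the Ward linkage are arbitrary choices made for the example.

```r
set.seed(1)
# Toy data: 20 samples (rows) and 8 variables (columns); names are illustrative
X <- matrix(rnorm(20 * 8), nrow = 20, ncol = 8,
            dimnames = list(NULL, paste0("v", 1:8)))

# Cluster the variables: distances are computed between the columns of X
d  <- dist(t(X))
hc <- hclust(d, method = "ward.D2")

inherits(hc, "hclust")   # check that the object has the expected class
```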

selection

method used to perform variable selection. Either 'sicomore', 'rho-sicomore' or 'mlgl' (see details). Default is 'rho-sicomore'.

compression

a string (either "mean" or "SNP.dist"). Indicates how groups of variables are compressed before variable selection is performed at each level of the hierarchy. Only relevant for 'sicomore' or 'rho-sicomore'.
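
To give an idea of what "mean" compression amounts to, the following base R sketch cuts a hierarchy into k groups and replaces each group by the average of its variables. It is a minimal sketch under the assumption that compression is a plain within-group average; it is not the package's internal code, and k is a hypothetical cut level chosen for the example.

```r
set.seed(2)
X  <- matrix(rnorm(10 * 6), nrow = 10, ncol = 6)
hc <- hclust(dist(t(X)))

# Cut the hierarchy into k groups of variables (k is illustrative)
k   <- 3
grp <- cutree(hc, k = k)

# One compressed column per group: row-wise mean of the variables in the group
X.comp <- sapply(seq_len(k),
                 function(g) rowMeans(X[, grp == g, drop = FALSE]))
dim(X.comp)   # one row per sample, one column per group
```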

depth.cut

an integer specifying the depth of the search space for the variable selection part of the algorithm. Restricting the search space speeds up the algorithm without affecting performance too much. A value between 3 and 6 is recommended; the smaller the value, the faster the algorithm.

choice

a string (either "lambda.min" or "lambda.1se"). Indicates how the tuning parameter is chosen in the penalized regression approach.
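
The two options can be illustrated on a made-up cross-validation curve. The sketch assumes that the names follow the usual glmnet convention, where "lambda.min" minimizes the mean cross-validation error and "lambda.1se" is the largest penalty whose error lies within one standard error of that minimum. The vectors lambda, cvm and cvsd are hypothetical values, not output of this function.

```r
# Hypothetical cross-validation summary over a decreasing grid of penalties
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)
cvm    <- c(2.00, 1.40, 1.10, 1.00, 1.05)  # mean CV error for each lambda
cvsd   <- c(0.20, 0.15, 0.15, 0.12, 0.12)  # standard error of each mean

# Penalty with the smallest mean CV error
i.min      <- which.min(cvm)
lambda.min <- lambda[i.min]

# Largest penalty whose error is within one standard error of the minimum
threshold  <- cvm[i.min] + cvsd[i.min]
lambda.1se <- max(lambda[cvm <= threshold])

c(lambda.min = lambda.min, lambda.1se = lambda.1se)
```

The "lambda.1se" choice trades a slightly higher error for a stronger penalty, and therefore typically for fewer selected groups.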

mc.cores

an integer for the number of cores to use in the parallelization of the cross-validation and some other functions. Default is 2.

stab

A boolean indicating whether the algorithm performs a lasso stability selection using the stabsel function from the stabs package.

stab.param

A list of parameters for the stabsel function, used if stab = TRUE. The parameters to choose are the PFER (1 by default), the cutoff (0.75 by default) and the number of resampling replicates B (100 by default).

Value

an RC object with class 'sicomore-model', with methods nGrp(), nVar(), getGrp(), getVar(), getCV(), getX.comp(), getCoef() and with the following fields:

  • groups: a list with the selected groups of predictors

  • coefficients: a vector with the estimated coefficients (one per selected group) if stab=FALSE

  • X.comp: the compressed version of the original input matrix (as many columns as the number of selected groups)

  • cv.error: for the best grouping, a data frame showing the cross-validation error used in the variable selection procedure if stab=FALSE

  • selection: the selection method used

  • compression: the compression method used

  • group_inference: the group selection inferred by "lasso", or by "hclust" if no selection is made by the lasso.
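
A possible call and inspection of the returned object is sketched below on simulated data. The package name SIComORe is an assumption (the call is skipped if no such package is installed), the simulated phenotype is purely illustrative, and the methods are called with the usual `$` syntax for RC objects.

```r
set.seed(3)
n <- 50; p <- 20
X  <- matrix(rnorm(n * p), n, p)
# Illustrative phenotype driven by the first five variables
y  <- rowMeans(X[, 1:5]) + rnorm(n, sd = 0.5)
hc <- hclust(dist(t(X)), method = "ward.D2")

# Assumption: getHierLevel() is exported by a package named SIComORe
if (requireNamespace("SIComORe", quietly = TRUE)) {
  res <- SIComORe::getHierLevel(X, y, hc,
                                selection = "sicomore",
                                compression = "mean",
                                choice = "lambda.min",
                                mc.cores = 1)
  res$nGrp()      # number of selected groups
  res$getGrp()    # the selected groups
  res$getCoef()   # estimated coefficients (available when stab = FALSE)
}
```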

Details

The methods for variable selection are variants of the LASSO or the group-LASSO designed to select interactions between multiple hierarchies. 'sicomore' and 'rho-sicomore' (see Ambroise et al. (2018) and Park et al. (2007)) use a LASSO penalty on compressed groups of variables along the hierarchies to select the interactions. The rho-sicomore variant is a weighted version of sicomore, whose weights depend on the levels in the hierarchies. The method 'mlgl' of Grimonprez (2016) uses a weighted group-LASSO penalty, which does not require compression but is more computationally demanding.

References

Ambroise C, Chiquet J, Guinot F, Szafranski M (2018). “Fast Computation of Genome-Metagenome Interaction Effects.” arXiv preprint arXiv:1810.12169.

Grimonprez Q (2016). Sélection de groupes de variables corrélées en grande dimension. Ph.D. thesis, Université de Lille.

Park MY, Hastie T, Tibshirani R (2007). “Averaged gene expressions for regression.” Biostatistics, 8(2), 212-227.