Bayesian inference on high-dimensional Seemingly Unrelated Regressions, applied to metabolomics data
Speaker: Alex Lewin, Reader in Statistics, Institute of Environment, Health and Societies, Brunel University London, UK.
This biostatistics seminar is jointly organised with the Sven Furberg Seminars in Bioinformatics and Statistical Genomics. At the end of the seminar simple food and refreshments will be served.
Increasingly, epidemiologists are collecting multiple high-dimensional molecular data sets on large cohorts of people. The interest is in finding associations among these data sets and between them and genetic variants. To do this effectively, these multivariate data sets should be modelled jointly, taking into account correlations in the data. Sparse solutions are usually required, and performing variable selection in this setting is critical.
We present a Bayesian Seemingly Unrelated Regressions (SUR) model for associating metabolomics outcomes with genetic variants, allowing for both sparse variable selection and correlation between the outcomes. This model can be fitted using a Gibbs sampler, but this quickly becomes computationally infeasible as the dimensions of the problem grow. Previous approaches have either assumed independence between the outcomes (Bottolo et al. 2011, Lewin et al. 2015) or selected predictors jointly for all the outcomes (Bhadra and Mallick 2013, Bottolo et al. 2013).
In order to overcome some of the computational difficulty with the general SUR model, Zellner and Ando (2010) proposed a reparametrisation of the model in which the likelihood factorises completely into a product of conditional distributions, and used a Direct Monte Carlo (DMC) approach to estimate the posterior. This improves computational time; however, their method requires re-sampling of the regression coefficients in order to obtain the correct posterior distribution.
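The factorisation idea can be illustrated numerically. The following is a minimal sketch, not the sampler presented in the talk: it only checks that, for Gaussian residuals with correlated outcomes, the joint log-likelihood equals a sum of univariate conditional terms in which each outcome's residual is regressed on the residuals of the preceding outcomes. The function names `joint_loglik` and `factorised_loglik` are hypothetical, and the exact parametrisation used by Zellner and Ando (2010) may differ in detail.

```python
import numpy as np

def joint_loglik(U, Sigma):
    """Log-likelihood of the n rows of U, each drawn i.i.d. from N(0, Sigma)."""
    n, m = U.shape
    _, logdet = np.linalg.slogdet(Sigma)
    # Sum over rows of u_i' Sigma^{-1} u_i
    quad = np.einsum("ij,jk,ik->", U, np.linalg.inv(Sigma), U)
    return -0.5 * (n * m * np.log(2 * np.pi) + n * logdet + quad)

def factorised_loglik(U, Sigma):
    """Same quantity, written as a product of univariate conditionals.

    Residual k is regressed on residuals 0..k-1, so each factor is an
    ordinary univariate normal density with its own mean and variance.
    """
    n, m = U.shape
    total = 0.0
    for k in range(m):
        if k == 0:
            cond_mean = np.zeros(n)
            cond_var = Sigma[0, 0]
        else:
            S11 = Sigma[:k, :k]
            s12 = Sigma[:k, k]
            rho = np.linalg.solve(S11, s12)   # coefficients on earlier residuals
            cond_mean = U[:, :k] @ rho
            cond_var = Sigma[k, k] - s12 @ rho
        e = U[:, k] - cond_mean               # conditionally independent error
        total += -0.5 * (n * np.log(2 * np.pi * cond_var) + e @ e / cond_var)
    return total

# Illustrative check on simulated residuals (all values here are made up).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + np.eye(4)                   # a valid covariance matrix
U = rng.multivariate_normal(np.zeros(4), Sigma, size=50)
print(np.isclose(joint_loglik(U, Sigma), factorised_loglik(U, Sigma)))
```

Because each factor involves only one outcome at a time, the conditional terms can be handled as separate univariate regressions, which is the source of the computational gain described above.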
We extend their work by allowing for a more general prior distribution, and we show that it is possible to build a Gibbs-DMC sampler without the need for re-sampling. Zellner and Ando (2010) demonstrated their DMC method on examples with up to 3 responses. We are aiming higher, with real molecular biology data involving hundreds or thousands of responses. The proposed method is applied both to simulated data, to illustrate the computational gains, and to a real metabolomics analysis in which the dimension of the data precludes the use of the traditional sampler.
Zhi (George) Zhao, a PhD student at the Department of Biostatistics of the University of Oslo, will present his talk entitled "Approaches to incorporate drug-drug similarity in multiple-response penalised likelihood methods for predicting drug sensitivity based on multi-omics data."