Statistical methods for biobanks and registries

Medical registries and databases are growing steadily and now hold large amounts of data. Combining information from large population-based cohorts with biological material from biobanks and with registry data offers clear potential for increased knowledge about how genetic composition and environmental exposures influence health. Against this background, our health registries and biobanks are often considered a unique treasure with endless opportunities. However, taking full advantage of the quality, depth and breadth of these valuable data requires targeted statistical modeling. The purpose of the present project is to develop statistical methods specifically tailored to the large and complex datasets that result from the data linkages described above. We will focus on two essential analytical problems: integration of different sources and layers of high-dimensional data, and causal inference. This project is a collaboration between several research groups at the department.

High-dimensional data integration

Multiple data sources give complementary information about systems or individuals, viewed from different angles and on different scales. Each source contributes unique information, but the sources also overlap considerably, and the data are often heavily contaminated by noise. Disentangling signal from noise in complex and high-dimensional data is therefore a key step. Much of the existing work has focused directly on genomics, where one has to deal with different layers or levels of genomic information. Efficient integration of complementary information from these multiple levels can greatly facilitate the discovery of true causes and states of disease in specific subgroups of patients sharing a common genetic background. We will follow two lines of research into data integration: i) methods based on principal component analysis (PCA), and ii) rank-based methods.

High-dimensional Principal Component Analysis (PCA)

The asymptotic behavior of PCA in the high-dimensional setting has attracted substantial attention over the last few years. As a by-product of our work on the methods for high-dimensional data integration mentioned above, we are also investigating some theoretical properties of high-dimensional PCA.

High-dimensional statistical inference under noisy conditions

Modern biomedical research produces enormous amounts of data. Prominent examples are sequencing-based or array-based measurements within genomics. A crucial problem that has so far received little attention in this area is the effect of measurement error and measurement uncertainty. Together with the inevitable problem of missing data, measurement error in such high-dimensional data leads to high-dimensional inference under noisy conditions. In this project we are working with high-dimensional regression methods that account for measurement error and missing data. Motivating examples are sequencing data based on a varying number of reads, and array-based gene expression measurements with technical and biological variability and noise.
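As a small illustration of why measurement error matters for regression, the sketch below simulates the classical one-covariate case, in which additive noise in the covariate pulls the least-squares slope toward zero. It is a textbook toy example and not the project's method. The simulated data, the names ols_slope, slope_noisy and slope_corrected, and in particular the assumption that the error variance is known are all made up for illustration.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on a single covariate x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1)   # fixed seed so the run is reproducible
n = 20000
true_slope = 1.0

# True covariate (variance 1) and response with independent noise.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0.0, 0.5) for xi in x]

# Observed covariate: the truth plus additive measurement error (variance 1).
w = [xi + rng.gauss(0.0, 1.0) for xi in x]

slope_clean = ols_slope(x, y)   # uses the error-free covariate
slope_noisy = ols_slope(w, y)   # uses the error-prone covariate

# ASSUMPTION: the error variance is known, so the reliability ratio
# var(x) / (var(x) + var(error)) = 1 / (1 + 1) can be used to rescale.
reliability = 1.0 / (1.0 + 1.0)
slope_corrected = slope_noisy / reliability

print(slope_clean, slope_noisy, slope_corrected)
```

With these settings the slope based on the error-prone covariate is expected to be close to half the true slope. In high-dimensional regression the effect of measurement error is less transparent, which is part of the motivation for the methods studied here.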

Mixed lasso

A popular way of dealing with the dimensionality problem in regression is to invoke a so-called sparsity assumption, which states that only a small number of the explanatory variables are related to the response, that is, responsible for the biological action in question. The best-known regression method of this type is the lasso. In this project we will study the performance of the lasso in the presence of correlated data, e.g. longitudinal data. Correlated data are typically analyzed with random effects models (mixed models), and random effects have been introduced into the lasso (the mixed lasso). The goal of the current project is to investigate a number of methodological problems related to the mixed lasso.
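To make the sparsity idea concrete, the following minimal sketch fits the ordinary lasso by cyclic coordinate descent, minimising (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1. It covers the plain lasso only, with no random effects, and is not the mixed lasso studied in the project. The function names, the simulated data and the penalty value lam = 0.3 are illustrative assumptions.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1.

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Simulated example (hypothetical data): more variables than observations,
# with only the first three variables related to the response.
rng = random.Random(0)
n, p = 40, 60
true_beta = [3.0, -2.0, 1.5] + [0.0] * (p - 3)
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(xij * bj for xij, bj in zip(row, true_beta)) + rng.gauss(0.0, 0.5)
     for row in X]

beta_hat = lasso_cd(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print(len(selected), "of", p, "coefficients are nonzero")
```

The soft-thresholding step sets small coefficients exactly to zero, which is how the lasso carries out variable selection. In practice the penalty lam would be chosen by cross-validation or a similar criterion rather than fixed in advance.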

Published Feb. 24, 2011 8:29 PM - Last modified Jan. 20, 2016 2:00 PM