Fast inference for intractable ultra high-dimensional Potts models for genome sequence data
Speaker: Jukka Corander, Professor, Oslo Centre for Biostatistics and Epidemiology, Dept. of Biostatistics, University of Oslo.
The potential for genome-wide modeling of epistasis has recently emerged, thanks to the possibility of sequencing densely sampled populations and to new families of statistical interaction models. Direct coupling analysis (DCA) has earlier been shown to yield valuable predictions for single protein structures, and has recently been extended to genome-wide analysis of bacteria, identifying novel interactions in the co-evolution between resistance, virulence and core genome elements. However, earlier computational DCA methods have not scaled to fitting a model simultaneously to 10^4-10^5 polymorphisms, which represents the upper bound of variation observed in genomic analyses of many bacterial species. We will introduce a novel inference method (SuperDCA) which employs a new scoring principle, efficient parallelization, optimization and filtering on phylogenetic information to achieve scalability for up to 10^5 polymorphisms. Using large population samples of Streptococcus pneumoniae, we demonstrate the ability of SuperDCA to make significant biological findings about this major human pathogen. We also show that our method can uncover weak signals of selection that are not detectable by genome-wide association analysis, even though our analysis does not require phenotypic measurements. SuperDCA thus holds considerable potential for building understanding of numerous organisms at a systems biological level.
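To give a rough sense of why fitting a pairwise Potts model to 10^4-10^5 polymorphisms is a scalability problem, the short sketch below counts the site pairs and the raw pairwise coupling parameters involved. It is only an illustration, not part of SuperDCA: the function name `potts_parameter_count` is hypothetical, and the choice of 5 states per site (four nucleotides plus a gap) is an assumption made for the example. The count is the unreduced one, before any constraints or filtering.

```python
from math import comb


def potts_parameter_count(n_sites: int, n_states: int) -> tuple[int, int]:
    """Return (number of site pairs, number of raw pairwise coupling parameters).

    Each unordered pair of sites carries an n_states x n_states coupling matrix
    in a pairwise Potts model, so the raw parameter count grows quadratically
    with the number of sites.
    """
    n_pairs = comb(n_sites, 2)  # unordered pairs of sites
    return n_pairs, n_pairs * n_states * n_states


# Assumed alphabet size for illustration: A, C, G, T and a gap symbol.
ASSUMED_STATES = 5

for n_sites in (10**4, 10**5):
    pairs, params = potts_parameter_count(n_sites, ASSUMED_STATES)
    print(f"{n_sites:>7} sites: {pairs:,} pairs, {params:,} coupling parameters")
```

Under these assumptions, moving from 10^4 to 10^5 sites multiplies the number of pairs by roughly one hundred, which is the quadratic growth that a genome-wide method has to cope with.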