Deep learning for sequence-based gene expression prediction

Biostatistical seminar with Ksenia Sokolova, Department of Computer Science, Princeton University, USA.

Abstract

Human biology is defined by specialized cell types driven by a common genome, 98% of which is outside of genes. This noncoding genetic space is linked to the majority of disease risk but remains poorly understood. In this talk, I will discuss how deep learning can be used to predict the effects of noncoding variants on gene expression in primary human cell types. I will introduce ExPectoSC, an atlas of deep-learning models that predict cell-type-specific gene expression from genomic sequences, covering 105 primary cell types across seven organ systems, and show how it can be used in the disease context. Additionally, I will present a novel genomics-centered contrastive pre-training method, cGen, which improves the training of models from sequence alone in limited-data contexts. After pre-training with sequence augmentations, cGen generates unsupervised embeddings that highlight functional clusters and are informative of gene expression in the absence of any labeled information.

Published Apr. 11, 2024 10:41 AM - Last modified Apr. 16, 2024 9:03 AM