Abstract
Human biology is defined by specialized cell types driven by a common genome, 98% of which lies outside of genes. This noncoding genetic space is linked to the majority of disease risk but remains poorly understood. In this talk, I will discuss how deep learning can be used to predict the effects of noncoding variants on gene expression in primary human cell types. I will introduce ExPectoSC, an atlas of deep-learning models that predict cell-type-specific gene expression from genomic sequence, covering 105 primary cell types across seven organ systems, and show how it can be used in the disease context. Additionally, I will present cGen, a novel genomic-centered contrastive pre-training method that improves model training from sequence alone in limited-data contexts. After pre-training with sequence augmentations, cGen generates unsupervised embeddings that highlight functional clusters and are informative of gene expression in the absence of any labeled information.
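
To make the idea of contrastive pre-training with sequence augmentations more concrete, here is a minimal sketch in plain Python. It is not cGen's implementation: the abstract does not specify the augmentations, encoder, or loss, so everything here is an assumption chosen for illustration. The augmentations (`random_shift`, `reverse_complement`), the toy dinucleotide-count encoder `kmer_embed` standing in for a trainable network, and the InfoNCE-style loss `info_nce` are all hypothetical choices.

```python
import math
import random
from itertools import product

# All names below are hypothetical; none come from cGen itself.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dinucleotides
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def random_shift(seq, max_shift, rng):
    """Shift the sequence by up to max_shift bases, padding with N."""
    shift = rng.randint(-max_shift, max_shift)
    if shift > 0:
        return "N" * shift + seq[:-shift]
    if shift < 0:
        return seq[-shift:] + "N" * (-shift)
    return seq


def augment(seq, rng, max_shift=3):
    """Produce one augmented view: a small shift, then a random strand flip."""
    view = random_shift(seq, max_shift, rng)
    if rng.random() < 0.5:
        view = reverse_complement(view)
    return view


def kmer_embed(seq):
    """Toy stand-in for a learned encoder: dinucleotide counts (N skipped)."""
    vec = [0.0] * len(KMERS)
    for i in range(len(seq) - 1):
        j = KMER_INDEX.get(seq[i:i + 2])
        if j is not None:
            vec[j] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    the other positives in the batch act as negatives."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denom - logits[i]
    return total / n


# Two augmented views of each sequence form the positive pairs.
rng = random.Random(0)
seqs = ["".join(rng.choice("ACGT") for _ in range(40)) for _ in range(4)]
view_a = [kmer_embed(augment(s, rng)) for s in seqs]
view_b = [kmer_embed(augment(s, rng)) for s in seqs]
loss = info_nce(view_a, view_b)
```

In an actual pre-training setup the fixed `kmer_embed` would be replaced by a neural encoder whose parameters are updated to minimize this loss, so that two views of the same sequence map to nearby embeddings without any expression labels.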