Tractable and Expressive Generative Models of Genetic Variation Data


Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models based on hidden Markov models, generative adversarial networks, restricted Boltzmann machines, and variational autoencoders have gained popularity due to their ability to generate AGs that closely resemble empirical data. These models, however, present a tradeoff between expressivity and tractability. Here, we propose to use hidden Chow-Liu trees (HCLTs) and their representation as probabilistic circuits (PCs) as a solution to this tradeoff. We first learn an HCLT structure that captures the long-range dependencies among SNPs in the training data set. We then convert the HCLT to its equivalent PC as a means of supporting tractable and efficient probabilistic inference. The parameters of these PCs are inferred with an expectation-maximization algorithm using the training data. Compared to other models for generating AGs, HCLT attains the largest log-likelihood on test genomes, both for SNPs sampled genome-wide and for SNPs from a contiguous genomic region. Moreover, the AGs generated by HCLT more closely resemble the source data set in their patterns of allele frequencies, linkage disequilibrium, pairwise haplotype distances, and population structure. Lastly, we show that HCLT can accurately impute missing genetic data. This work not only presents a new and robust AG simulator but also demonstrates the potential of PCs in population genetics.
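The structure-learning step described above builds on the classic Chow-Liu algorithm: estimate pairwise mutual information between variables, then take a maximum-weight spanning tree. The sketch below illustrates that core step on a binary haplotype matrix; it is a simplified illustration, not the paper's full HCLT procedure (which additionally introduces latent variables and compiles the tree into a PC), and the function names here are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def pairwise_mutual_information(X):
    """Empirical mutual information between all pairs of binary SNP columns.

    X: (n_samples, n_snps) array of 0/1 haplotype values.
    Returns a symmetric (n_snps, n_snps) matrix of MI values in nats.
    """
    n, d = X.shape
    mi = np.zeros((d, d))
    eps = 1e-12  # guard against log(0) for empty cells
    for i in range(d):
        for j in range(i + 1, d):
            # Empirical joint distribution over the four (x_i, x_j) configurations
            joint = np.zeros((2, 2))
            for a in (0, 1):
                for b in (0, 1):
                    joint[a, b] = np.mean((X[:, i] == a) & (X[:, j] == b))
            p_i = joint.sum(axis=1)
            p_j = joint.sum(axis=0)
            val = np.sum(joint * np.log((joint + eps) / (np.outer(p_i, p_j) + eps)))
            mi[i, j] = mi[j, i] = val
    return mi

def chow_liu_tree(X):
    """Return the edges of a maximum-MI spanning tree over the SNP variables."""
    mi = pairwise_mutual_information(X)
    # scipy finds a MINIMUM spanning tree, so negate the MI matrix; the +1
    # shift keeps every edge weight strictly negative (nonzero), so edges
    # with zero MI are not dropped by the sparse representation.
    mst = minimum_spanning_tree(-(mi + 1.0))
    rows, cols = mst.nonzero()
    return sorted((min(i, j), max(i, j)) for i, j in zip(rows, cols))
```

In the paper's pipeline, the resulting tree would then be augmented with latent variables and compiled into a probabilistic circuit whose parameters are fit by EM; this sketch covers only the dependency-tree step.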

In Proceedings of the 26th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2022
Meihua Dang
Master’s student in Computer Science

My research interests include probabilistic modeling and deep generative models.