Abstract #470

# 470
SSGP: SNP-set based genomic prediction to incorporate biological information.
J. Jiang*1, J. O'Connell2, P. VanRaden3, L. Ma1, 1Department of Animal and Avian Sciences, University of Maryland, College Park, MD, 2University of Maryland School of Medicine, Baltimore, MD, 3Animal Genomics and Improvement Laboratory, ARS-USDA, Beltsville, MD.

Genomic prediction has emerged as an effective approach in plant and animal breeding and in precision medicine. Incorporating biological information into the genomic model can be of great advantage. Because of the statistical and computational challenges in large genomics studies, however, a fast and flexible method to incorporate such external information is still lacking. Here, we propose a linear mixed model that incorporates biological information in a flexible way and present SSGP, a fast software package based on variational Bayes. In our model, whole-genome markers can be split into groups in a user-defined manner, and each group of markers is given a common effect variance. Because previous functional genomics studies have accumulated much evidence on which genes, genomic regions, or pathways are more or less important for a trait of interest, genome-wide SNPs can be divided into several groups according to their level of importance, and these predefined SNP sets can then be used in SSGP. In addition, each marker carries a pre-specified weight, which can be assigned by any rule, e.g., based on minor allele frequency or linkage disequilibrium (LD) pattern. The model was implemented with the parameter-expanded variational Bayesian method. For testing purposes, we analyzed a large cattle data set consisting of ~24k bulls (20k in the training set and 4k in the validation set) and ~760k whole-genome SNP markers. With markers grouped simply by proximity (contiguous, non-overlapping chunks of 1k SNPs each) and only additive effects considered, SSGP already performed better than Bayes A for all 5 milk traits analyzed, with an increase of up to 8 percentage points in prediction accuracy. Each trait took only ~5 h with 20 threads. We also analyzed many simulated data sets and the WTCCC heterogeneous stock mice data set, for which the results of many existing methods have been reported. In general, SSGP achieved prediction performance similar to that of the best approaches reported, even though only proximity was used to group SNPs. Collectively, the method and software show great potential to increase accuracy in genomic prediction, particularly as more useful biological information becomes available.
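
The proximity-based grouping described above can be illustrated with a minimal Python sketch. This is not the SSGP implementation. The function names `proximity_groups` and `expand_group_variances` are hypothetical, the SNPs are assumed to be already sorted in map order, and the per-marker weights are left out because the abstract does not state how they enter the model.

```python
import numpy as np


def proximity_groups(n_snps: int, chunk_size: int = 1000) -> np.ndarray:
    """Assign SNPs (assumed sorted by map position) to contiguous,
    non-overlapping chunks; returns one group index per SNP."""
    if n_snps < 0 or chunk_size <= 0:
        raise ValueError("n_snps must be >= 0 and chunk_size must be > 0")
    # Integer division maps SNPs 0..999 to group 0, 1000..1999 to group 1, etc.
    return np.arange(n_snps) // chunk_size


def expand_group_variances(groups: np.ndarray, group_var: np.ndarray) -> np.ndarray:
    """Give every SNP the effect variance of the group it belongs to,
    i.e. one common variance per group (hypothetical helper)."""
    return np.asarray(group_var, dtype=float)[groups]


# Illustrative sizes taken from the cattle data set: ~760k SNPs, 1k per chunk.
groups = proximity_groups(760_000, chunk_size=1000)
n_groups = int(groups.max()) + 1

# Placeholder variances (all equal); a fitted model would estimate these.
per_snp_var = expand_group_variances(groups, np.full(n_groups, 1e-4))
```

A grouping informed by genes, genomic regions, or pathways would replace `proximity_groups` with any other function that returns one group index per SNP; the rest of the sketch would stay the same.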

Key Words: genomic prediction, SNP set, biological information