Abstract #466

Cattle variant-detection modelling using selective-sequencing experimental design and statistical learning.
K. Bakshy*¹, R. Schnabel², D. Bickhart¹; ¹USDA-Agricultural Research Service Dairy Forage Research Center, Madison, WI; ²University of Missouri, Columbia, MO.

The objective of this study was to generate a gold-standard variant data set specific to the Holstein breed for training the mixture models used in SNP variant identification from whole-genome sequence data. It is now feasible to catalog genetic variation comprehensively and economically using whole-genome DNA sequencing data. Nevertheless, the data still suffer from a low signal-to-noise ratio, which results in a high rate of false-positive variant site detections. To accurately distinguish rare variant sites from the noise in sequencing data, the Genome Analysis Toolkit (GATK) implements a statistical learning method that uses a previously developed training set of validated variant sites to identify true-positive variants in a data set. Currently, no such validated set of variant sites exists for model training in cattle variant surveys. We used an inverse-weight algorithm to prioritize Holstein bulls for sequencing based on the rarity of their homozygous SNP haplotype segments identified in the US national dairy evaluation database. The final 172 prioritized Holstein bulls, which represented approximately 85% of the homozygous haplotypes found in the database, were sequenced to at least 20X coverage on an Illumina HiSeq X. Raw reads were aligned to the ARS-UCD1.2 reference genome using BWA-MEM, and 23,912,824 SNPs were called using the SAMtools workflow. By exploiting the expected homozygous nature of the haplotype sequence from these individuals, we curated a list of ~200K high-quality, lower-frequency variant sites for use in variant-detection modeling. We used these variant sites as training data for the GATK Variant Quality Score Recalibration module to assess the improvement in SNP-calling accuracy and identified 1.1% more rare variants (frequency <5%) in a cut-off study using several different model training parameters. By establishing a high-confidence variant site data set for Holstein cattle, we enable more accurate prediction of low-frequency variants in the population for future whole-genome sequence surveys.
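
The abstract does not specify the inverse-weight algorithm, so the following is only a minimal sketch of one way such a prioritization could work. It assumes each homozygous haplotype is weighted by the reciprocal of the number of animals carrying it, and that animals are chosen greedily by the total weight of haplotypes not yet covered. The function name `prioritize_animals`, the input layout, and the greedy strategy are all hypothetical and are not taken from the study.

```python
from collections import Counter


def prioritize_animals(animal_haplotypes, max_animals):
    """Greedy inverse-frequency selection (illustrative assumption only).

    animal_haplotypes: dict mapping an animal ID to an iterable of the
        homozygous haplotype IDs it carries (hypothetical input layout).
    max_animals: maximum number of animals to select.
    Returns (selected animal IDs in order, fraction of haplotypes covered).
    """
    # Count how many animals carry each homozygous haplotype.
    counts = Counter(
        hap for haps in animal_haplotypes.values() for hap in set(haps)
    )
    if not counts:
        return [], 0.0

    # Assumed weighting: rarer haplotypes receive larger weights.
    weights = {hap: 1.0 / n for hap, n in counts.items()}

    covered = set()
    selected = []
    candidates = set(animal_haplotypes)

    def gain(animal):
        # Total weight of this animal's haplotypes that are still uncovered.
        return sum(weights[h] for h in set(animal_haplotypes[animal]) - covered)

    while candidates and len(selected) < max_animals:
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(candidates), key=gain)
        if gain(best) == 0:
            break  # remaining animals add no new haplotypes
        selected.append(best)
        covered |= set(animal_haplotypes[best])
        candidates.remove(best)

    return selected, len(covered) / len(counts)


if __name__ == "__main__":
    # Toy data with made-up animal and haplotype IDs.
    toy = {
        "bullA": ["h1", "h2"],
        "bullB": ["h2"],
        "bullC": ["h2", "h3", "h4"],
    }
    chosen, coverage = prioritize_animals(toy, max_animals=2)
    print(chosen, coverage)
```

In this toy example, the animal carrying the most rare haplotypes is chosen first, and selection stops once no remaining animal contributes an uncovered haplotype. The returned coverage fraction is analogous in spirit to the approximately 85% haplotype representation reported above, although the study's actual weighting and stopping rule may differ.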

Key Words: SNP, variant detection, whole-genome sequencing