Research Project: Investigating Microbial, Digestive, and Animal Factors to Increase Dairy Cow Performance and Nutrient Use Efficiency

Location: Cell Wall Biology and Utilization Research

Title: Cattle variant-detection modelling using selective-sequencing experimental design and statistical learning

Authors: Bakshy, Kiranmayee; Schnabel, Robert (University of Missouri); Bickhart, Derek

Submitted to: American Dairy Science Association Abstracts
Publication Type: Abstract Only
Publication Acceptance Date: 2/20/2019
Publication Date: N/A
Citation: N/A

Technical Abstract:
Objective: Generate a gold-standard variant dataset specific to the Holstein breed in order to train the mixture models used in SNP variant identification from whole genome sequence data.
Introduction: It is now feasible to catalog genetic variation comprehensively and economically using whole genome DNA sequencing data. Nevertheless, these data still suffer from a low signal-to-noise ratio, which results in a high rate of false-positive variant site detections. To accurately distinguish rare variant sites from the noise in sequencing data, the Genome Analysis Toolkit (GATK) implements a statistical learning method that uses a previously developed training set of validated variant sites to identify true-positive variants in a dataset. Currently, there is no highly validated set of variant sites for use in model training for cattle variant surveys.
Results: We used an inverse-weight algorithm to prioritize Holstein bulls for sequencing based on the rarity of their homozygous SNP haplotype segments identified in the US national dairy evaluation database. The 172 bulls on the final prioritized list, which represented approximately 85% of the homozygous haplotypes found in the database, were sequenced to at least 20X coverage on an Illumina HiSeqX. Raw reads were aligned to the reference genome ARS-UCDv1.2 using BWA MEM, and 23,912,824 SNPs were called using the SAMtools workflow. By exploiting the expected homozygous nature of the haplotype sequence from these individuals, we were able to curate a list of ~200K high-quality, lower-frequency variant sites for use in variant-detection modeling. We used these variant sites as training data for the GATK Variant Quality Score Recalibration module to assess the improvement in SNP-calling accuracy. In a cut-off study using several different model-training parameters, we identified 1.1% more rare variants (frequency < 5%).
Conclusion: By establishing a high-confidence variant site dataset for Holstein cattle, we enable more accurate prediction of low-frequency variants in the population in future whole genome sequence surveys.
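
Illustrative sketch: the abstract does not give the details of the inverse-weight algorithm used to prioritize bulls. The Python below shows one plausible reading only, and it should not be taken as the authors' implementation. It assumes that each haplotype is weighted by the inverse of the number of animals carrying it in homozygous form, and that animals are picked greedily by the total weight of the haplotypes they would newly cover. The function name `prioritize_animals`, the input layout, and the toy data are all hypothetical.

```python
from collections import Counter


def prioritize_animals(homozygous_haplotypes, n_select):
    """Greedy inverse-frequency selection (hypothetical sketch).

    homozygous_haplotypes: dict mapping an animal ID to the set of
        haplotype IDs that animal carries in homozygous form.
    n_select: maximum number of animals to pick.
    Returns (selected IDs in pick order, fraction of haplotypes covered).
    """
    # Count how many animals carry each haplotype in homozygous form.
    counts = Counter(h for haps in homozygous_haplotypes.values() for h in haps)
    # Assumed weighting: rarer haplotypes receive larger weights.
    weights = {h: 1.0 / c for h, c in counts.items()}

    covered = set()
    selected = []
    remaining = dict(homozygous_haplotypes)

    while remaining and len(selected) < n_select:
        def score(animal):
            # Total weight of this animal's haplotypes not yet covered.
            return sum(weights[h] for h in remaining[animal] - covered)

        # Sorting the IDs makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        if score(best) == 0:
            break  # no remaining animal adds a new haplotype
        selected.append(best)
        covered |= remaining.pop(best)

    coverage = len(covered) / len(counts) if counts else 0.0
    return selected, coverage


# Toy data (made up): four bulls and four haplotypes.
bulls = {
    "A": {"h1", "h2"},
    "B": {"h1"},
    "C": {"h1", "h3", "h4"},
    "D": {"h2"},
}
picked, coverage = prioritize_animals(bulls, n_select=2)
print(picked, coverage)
```

In this toy case, bull C is picked first because it alone carries the two rarest haplotypes. A real run would draw the haplotype sets from the evaluation database and stop at a target coverage or at the sequencing budget.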