
Title: Fast imputation using medium- or low-coverage sequence data

Authors
VanRaden, Paul
Sun, Chuanyu - National Association of Animal Breeders

Submitted to: World Congress of Genetics Applied in Livestock Production
Publication Type: Proceedings
Publication Acceptance Date: 4/21/2014
Publication Date: 8/17/2014
Citation: Van Raden, P.M., Sun, C. 2014. Fast imputation using medium- or low-coverage sequence data. World Congress of Genetics Applied in Livestock Production. Vancouver, Canada, Aug. 17–22. 3 pp.

Interpretive Summary: Estimating unknown genotypes (imputation) directly from raw genome sequence data can be more accurate than determining genotypes first and then imputing the missing ones, especially if the sequence data have a low number of reads per nucleotide (read depth) or high error rates. However, this approach requires imputation strategies different from those used previously for data from genotyping chips. A fast computing algorithm previously designed to impute from lower- to higher-density chips was adapted to use each animal’s sequence reads directly. The new algorithm imputes genotypes more accurately, especially if high-density genotypes are included in the data for each sequenced animal. Sequencing tools offer a tradeoff between the number of animals sequenced and the average read depth. The imputation strategy can reduce sequencing cost by accurately imputing genotypes using lower coverage for more individuals. More efficient imputation will allow geneticists to locate and test the effects of more DNA variants and to include those variants in future selection programs, thereby increasing the rate of genetic improvement for traits of economic interest.
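The effect of read depth on genotype certainty can be illustrated with a small calculation. The following is a minimal sketch, not the authors' method: it assumes a biallelic locus, independent reads, a single per-read error rate, and a Hardy-Weinberg prior, and the names `genotype_posteriors`, `alt_freq`, and `error_rate` are hypothetical.

```python
def genotype_posteriors(n_ref, n_alt, alt_freq=0.5, error_rate=0.01):
    """Return P(genotype | reads) for 0, 1, and 2 copies of the alternate allele.

    Assumptions (illustrative only): reads are independent, each read
    reports the wrong allele with probability error_rate, and the prior
    follows Hardy-Weinberg proportions for allele frequency alt_freq.
    """
    # Probability that one read shows the alternate allele, by genotype.
    p_alt_read = [error_rate, 0.5, 1.0 - error_rate]
    q = alt_freq
    prior = [(1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2]
    # Prior times binomial likelihood (the binomial coefficient cancels).
    unnorm = [
        pr * (p ** n_alt) * ((1.0 - p) ** n_ref)
        for pr, p in zip(prior, p_alt_read)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two reference reads and no alternate reads: a heterozygote is still plausible.
low_depth = genotype_posteriors(n_ref=2, n_alt=0)
# Eight reference reads: the heterozygote becomes very unlikely.
high_depth = genotype_posteriors(n_ref=8, n_alt=0)
print(low_depth, high_depth)
```

Under these assumptions, two reads that both show the reference allele leave roughly a one-in-three chance that the animal is heterozygous, so calling a homozygote at that point would discard real uncertainty. Carrying the probabilities forward into imputation keeps that information available.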

Technical Abstract: Direct imputation from raw sequence reads can be more accurate than calling genotypes first and then imputing, especially if read depth is low or error rates are high, but it requires imputation strategies different from those used for data from genotyping chips. A fast algorithm previously developed to impute from lower- to higher-density chips was adapted to use sequence data directly. An efficient strategy chooses the 2 haplotypes most likely to form the genotype and, as each animal’s sequence is processed, updates the posterior allele probabilities from the prior probabilities within those haplotypes. Imputation of 1 million loci on 1 chromosome required 37 minutes and 5 gigabytes of memory using 10 processors for 500 bulls simulated at 8X coverage plus 250 younger bulls that either were sequenced at lower coverage or were genotyped with high-, medium-, or low-density chips. Percentages of correct genotypes were 99.2, 97.0, and 94.1 for bulls sequenced at 8X, 4X, and 2X coverage, respectively, and were 98.1, 96.8, and 91.7 for bulls genotyped with 600K, 60K, and 10K density chips, respectively. Imputation using sequence data with low coverage or high error rates was less accurate if genotypes from a high-density chip were not included in the sequence data. Including high-density genotypes for each sequenced animal ensured accurate haplotype matching for all animals. Sequencing tools offer a tradeoff between the number of animals sequenced and the average read depth. The new algorithm can reduce sequencing cost by accurately imputing genotypes using lower coverage for more individuals. More efficient imputation will allow geneticists to locate and test the effects of more DNA variants and to include those variants in future selection programs.
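The strategy of choosing the 2 most likely haplotypes and then updating allele probabilities within them can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: haplotypes are stored as lists of prior probabilities of the alternate allele, reads are independent with one error rate, and pairs are searched exhaustively, which a fast algorithm would presumably avoid. All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import log


def read_likelihood(n_ref, n_alt, genotype, error_rate):
    """P(reads | genotype), where genotype counts alternate alleles (0, 1, 2)."""
    p_alt = (error_rate, 0.5, 1.0 - error_rate)[genotype]
    return (p_alt ** n_alt) * ((1.0 - p_alt) ** n_ref)


def locus_joint(p1, p2, n_ref, n_alt, error_rate):
    """Unnormalized joint weights over the alleles (a1, a2) carried by two haplotypes."""
    joint = {}
    for a1 in (0, 1):
        for a2 in (0, 1):
            prior = (p1 if a1 else 1.0 - p1) * (p2 if a2 else 1.0 - p2)
            joint[(a1, a2)] = prior * read_likelihood(n_ref, n_alt, a1 + a2, error_rate)
    return joint


def pair_log_likelihood(hap1, hap2, reads, error_rate):
    """Log-likelihood of all reads given a candidate pair of haplotypes."""
    return sum(
        log(sum(locus_joint(p1, p2, r, a, error_rate).values()))
        for p1, p2, (r, a) in zip(hap1, hap2, reads)
    )


def best_pair(haplotypes, reads, error_rate=0.01):
    """Exhaustive search (illustration only) for the most likely haplotype pair."""
    pairs = combinations_with_replacement(range(len(haplotypes)), 2)
    return max(
        pairs,
        key=lambda ij: pair_log_likelihood(
            haplotypes[ij[0]], haplotypes[ij[1]], reads, error_rate
        ),
    )


def update_pair(hap1, hap2, reads, error_rate=0.01):
    """Posterior alternate-allele probabilities for both haplotypes, locus by locus."""
    new1, new2 = [], []
    for p1, p2, (r, a) in zip(hap1, hap2, reads):
        joint = locus_joint(p1, p2, r, a, error_rate)
        total = sum(joint.values())
        new1.append((joint[(1, 0)] + joint[(1, 1)]) / total)
        new2.append((joint[(0, 1)] + joint[(1, 1)]) / total)
    return new1, new2


# Toy library: 3 haplotypes, 3 loci; 0.5 marks an allele that is still unknown.
haplotypes = [
    [0.99, 0.01, 0.5],
    [0.01, 0.01, 0.5],
    [0.99, 0.99, 0.5],
]
reads = [(0, 3), (2, 2), (0, 2)]  # (reference reads, alternate reads) per locus

i, j = best_pair(haplotypes, reads)
post1, post2 = update_pair(haplotypes[i], haplotypes[j], reads)
print((i, j), post1, post2)
```

In this toy example the third locus starts with no information in either haplotype, and the animal's two alternate reads shift both posterior probabilities toward the alternate allele. Repeating the update as each animal is processed is one way the haplotype library could accumulate evidence from many low-coverage animals.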