
Title: Fast imputation using medium or low-coverage sequence data

Author
VanRaden, Paul
Sun, Chuanyu - National Association of Animal Breeders
O'Connell, Jeffrey - University of Maryland

Submitted to: BioMed Central (BMC) Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/29/2015
Publication Date: 7/14/2015
Citation: VanRaden, P.M., Sun, C., O'Connell, J.R. 2015. Fast imputation using medium or low-coverage sequence data. BioMed Central (BMC) Genetics. 16:82.

Interpretive Summary: Direct estimation of unknown genotypes (imputation) from raw genome sequence data can be more accurate than determining sequence genotypes first and then imputing missing genotypes from chips, especially if the sequence data have a low number of reads per nucleotide (read depth) or high error rates. This requires imputation strategies different from those previously used for data with known genotypes. A fast computing algorithm previously designed to impute from lower to higher density chips was adapted to use each animal’s sequence reads directly. The new algorithm imputes genotypes more accurately, especially if high-density genotypes are included in the data for each sequenced animal. For a fixed budget, sequencing involves a tradeoff between the number of animals sequenced and the average read depth per animal. The imputation strategy can reduce sequencing cost by accurately imputing genotypes using lower coverage for more individuals. More efficient imputation will allow geneticists to locate and test effects of more DNA variants and to include those in future selection programs, thereby increasing the rate of genetic improvement for traits of economic interest.

Technical Abstract: Accurate genotype imputation can greatly reduce costs and increase benefits by combining whole-genome sequence data of varying read depth and microarray genotypes of varying densities. For large populations, an efficient strategy chooses the two haplotypes most likely to form each genotype and updates the posterior allele probabilities from the prior probabilities within those two haplotypes as each individual’s sequence is processed. Directly using allele read counts can improve imputation accuracy and reduce computation compared to calling or computing genotype probabilities first and then imputing, especially if read depths are low or error rates are high. A new algorithm was implemented in findhap version 4 software and tested using simulated bovine and actual human sequence data with different combinations of reference population size, sequence read depth and error rate. Read depths of ≥8X may be desired for direct investigation of sequenced individuals, but for a given total cost, sequencing more individuals at read depths of 2X to 4X gave more accurate imputation from array genotypes. Accuracy of imputation improved further if reference individuals had both low-coverage sequence and high-density microarray data. With read depths of ≤4X, findhap version 4 had higher accuracy than Beagle version 4, and computation was up to 400 times faster with findhap than with Beagle. For 10,000 sequenced individuals plus 250 with high-density array genotypes to test imputation, findhap used 7 hours, 10 processors and 50 gigabytes of memory for 1 million loci on one chromosome.
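
As a rough illustration of why allele read counts remain informative at low read depth, the sketch below combines a binomial read model with a prior over genotypes at a single locus. It is not the findhap version 4 implementation: the function names, the symmetric per-read error model, and the Hardy-Weinberg prior (standing in for the haplotype-based priors the abstract describes) are all assumptions made for this example.

```python
from math import comb


def genotype_likelihoods(n_ref, n_alt, error_rate):
    """P(read counts | genotype) for 0, 1 and 2 copies of the alternate allele.

    Assumes independent reads and a symmetric per-read error rate
    (an illustrative model, not taken from the publication).
    """
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac = alt_copies / 2  # share of the individual's alleles that are alternate
        # Chance that one read shows the alternate allele, allowing for read errors
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        likelihoods.append(
            comb(n_ref + n_alt, n_alt) * p_alt**n_alt * (1 - p_alt) ** n_ref
        )
    return likelihoods


def hardy_weinberg_prior(alt_freq):
    """Prior genotype probabilities from an allele frequency (hypothetical prior)."""
    q = alt_freq
    return [(1 - q) ** 2, 2 * q * (1 - q), q**2]


def genotype_posterior(n_ref, n_alt, error_rate, prior):
    """Posterior genotype probabilities: prior times likelihood, normalized."""
    lik = genotype_likelihoods(n_ref, n_alt, error_rate)
    unnormalized = [l * p for l, p in zip(lik, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    prior = hardy_weinberg_prior(0.3)
    # Two reads, both showing the alternate allele, with a 1% read error rate
    print(genotype_posterior(0, 2, 0.01, prior))
```

In this toy setting, two alternate reads and no reference reads move probability sharply away from the reference homozygote, yet the heterozygote is still not ruled out. A hard genotype call at that depth would discard this uncertainty, whereas carrying the read counts into the probability update keeps it. With no reads at all, the posterior simply equals the prior.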