Location: Plant, Soil and Nutrition Research
Title: The practical haplotype graph, a platform for storing and using pangenomes for imputation
Authors:
BRADBURY, PETER - Retired ARS Employee
CASSTEVENS, TERRY - Cornell University
JENSEN, SARA - Cornell University
JOHNSON, LYNN - Cornell University
MILLER, ZACHARY - Cornell University
MONIER, BRANDON - Cornell University
ROMAY, MARIA CINTA - Cornell University
SONG, BAOXING - Cornell University
Buckler, Edward - Ed
Submitted to: Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/22/2022
Publication Date: 6/24/2022
Citation: Bradbury, P.J., Casstevens, T., Jensen, S.E., Johnson, L.E., Miller, Z.R., Monier, B., Romay, M., Song, B., Buckler IV, E.S. 2022. The practical haplotype graph, a platform for storing and using pangenomes for imputation. Bioinformatics. 38(15):3698-3702. https://doi.org/10.1093/bioinformatics/btac410.
DOI: https://doi.org/10.1093/bioinformatics/btac410
Interpretive Summary: The genetic blueprint, or genome, that directs the development of every individual consists of DNA. While mostly identical between individuals of the same species, the DNA code differs enough between individuals to produce observed differences in traits, or phenotypes. In agricultural crops, knowledge about those differences and their resulting phenotypes helps plant breeders develop new, better-performing varieties. Because the DNA code of an individual contains billions of characters, or base pairs, collecting and storing that information for the tens of thousands (or more) of individual strains in a breeding program is an immense challenge. The Practical Haplotype Graph (PHG) described in this publication provides both an efficient way to store that information and a method for inexpensively imputing and storing the sequence of newly tested individuals. Here, imputing means starting with a relatively small sample of DNA sequence from a new individual, then inferring its full sequence by comparing it with known DNA sequence stored in a PHG database. The PHG consists of a database and an extensive set of software applications for writing data to the database and using it for imputation. "Practical" means that it stores a very large amount of data in an efficient and usable form. The database organizes sequence using haplotypes, or short segments of sequence, while the software organizes the haplotypes as a graph to use in imputation.
This paper demonstrates the effectiveness of a PHG constructed from 26 maize lines with high-quality, complete sequence. Simulations show that a DNA sample containing only a fraction of the genome can be used to infer the entire genome accurately. The paper refers to other publications describing the application of the PHG to four major agricultural crops: maize, wheat, sorghum, and cassava.
Technical Abstract: Motivation: Pangenomes provide novel insights for population and quantitative genetics, genomics and breeding that are not available from studying a single reference genome. Instead, a species is better represented by a pangenome, or collection of genomes. Unfortunately, managing and using pangenomes for genomically diverse species is computationally and practically challenging. We developed a trellis graph representation anchored to the reference genome that represents most pangenomes well and can be used to impute complete genomes from low-density sequence or variant data.
Results: The Practical Haplotype Graph (PHG) is a pangenome pipeline, database (PostgreSQL and SQLite), data model (Java, Kotlin or R) and Breeding API (BrAPI) web service. The PHG has already been able to accurately represent diversity in four major crops including maize, one of the most genomically diverse species, with up to 1000-fold data compression. Using simulated data, we show that, at even 0.1× coverage, with appropriate reads and sequence alignment, imputation results in extremely accurate haplotype reconstruction. The PHG is a platform and environment for the understanding and application of genomic diversity.
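The idea of imputing through a trellis graph anchored to reference ranges can be illustrated with a small sketch. This is not the PHG's code or API, and the abstract does not specify the path-finding method; a standard Viterbi search is swapped in here for illustration. The function name `impute_path`, the parameters `stay_prob` and `read_error`, and the toy read counts are all hypothetical. Each reference range holds one node per known line, sparse reads vote for nodes, and the search picks the most likely path, carrying a haplotype across ranges that received no reads.

```python
import math


def impute_path(read_counts, stay_prob=0.99, read_error=0.01):
    """Toy Viterbi search through a haplotype trellis (illustrative only).

    read_counts: one dict per reference range, mapping a line name to the
    number of reads that matched that line's haplotype in the range.
    Returns the most likely line name for each range.
    """
    names = sorted({name for counts in read_counts for name in counts})
    if len(names) < 2:
        # Zero or one candidate line: nothing to choose between.
        return names * len(read_counts)

    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (len(names) - 1))

    def emission(counts, name):
        # Log-likelihood of the reads in one range given the line `name`.
        # A range with no reads scores 0 for every line (uninformative).
        matches = counts.get(name, 0)
        misses = sum(counts.values()) - matches
        return (matches * math.log(1.0 - read_error)
                + misses * math.log(read_error))

    # Uniform prior over lines in the first range (constant term omitted).
    score = {n: emission(read_counts[0], n) for n in names}
    backpointers = []
    for counts in read_counts[1:]:
        new_score, pointers = {}, {}
        for cur in names:
            def arrive(prev):
                step = log_stay if prev == cur else log_switch
                return score[prev] + step

            best_prev = max(names, key=arrive)
            pointers[cur] = best_prev
            new_score[cur] = arrive(best_prev) + emission(counts, cur)
        backpointers.append(pointers)
        score = new_score

    # Trace back from the best final node to recover the full path.
    state = max(names, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical low-coverage sample: two of six ranges have no reads at all.
reads = [{"B73": 3}, {}, {"B73": 2}, {"Mo17": 4}, {}, {"Mo17": 3}]
print(impute_path(reads))
```

With these toy counts the path stays on B73 for the first three ranges and on Mo17 for the last three, so both empty ranges are filled in from their neighbors. A high `stay_prob` encodes the assumption that adjacent ranges usually come from the same parental line, which is what lets very sparse reads still determine a complete path.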