
Research Project: Improving Crop Efficiency Using Genomic Diversity and Computational Modeling

Location: Plant, Soil and Nutrition Research

Title: The practical haplotype graph, a platform for storing and using pangenomes for imputation

Author
BRADBURY, PETER - Retired ARS Employee
CASSTEVENS, TERRY - Cornell University
JENSEN, SARA - Cornell University
JOHNSON, LYNN - Cornell University
MILLER, ZACHARY - Cornell University
MONIER, BRANDON - Cornell University
ROMAY, MARIA CINTA - Cornell University
SONG, BAOXING - Cornell University
Buckler, Edward - Ed

Submitted to: Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/22/2022
Publication Date: 6/24/2022
Citation: Bradbury, P.J., Casstevens, T., Jensen, S.E., Johnson, L.E., Miller, Z.R., Monier, B., Romay, M., Song, B., Buckler IV, E.S. 2022. The practical haplotype graph, a platform for storing and using pangenomes for imputation. Bioinformatics. 38(15):3698-3702. https://doi.org/10.1093/bioinformatics/btac410.
DOI: https://doi.org/10.1093/bioinformatics/btac410

Interpretive Summary: The genome, the genetic blueprint that directs the development of every individual, consists of DNA. Although the DNA code is mostly identical among individuals of the same species, the differences that do exist produce observable differences in traits, or phenotypes. In agricultural crops, knowledge of those differences and the resulting phenotypes helps plant breeders develop new, better-performing varieties. Because the DNA code of an individual contains billions of characters, or base pairs, collecting and storing that information for the tens of thousands (or more) of individual strains in a breeding program is an immense challenge. The Practical Haplotype Graph (PHG) described in this publication provides both an efficient way to store that information and a method for inexpensively imputing and storing the sequence of newly tested individuals. Here, imputing means starting with a relatively small sample of DNA sequence from a new individual, then inferring its full sequence by comparing the sample with known DNA sequence stored in a PHG database. The PHG consists of a database and an extensive set of software applications for writing data to the database and using it for imputation. "Practical" means that it stores a very large amount of data in an efficient and usable form. The database organizes sequence into haplotypes, or short segments of sequence, while the software organizes the haplotypes as a graph for use in imputation. This paper demonstrates the effectiveness of a PHG constructed from 26 maize lines with high-quality, complete sequence. Simulations show that a DNA sample containing only a fraction of the genome can be used to infer the entire genome accurately. The paper refers to other publications describing the application of the PHG to four major agricultural crops: maize, wheat, sorghum, and cassava.

Technical Abstract: Motivation: Pangenomes provide novel insights for population and quantitative genetics, genomics and breeding that are not available from studying a single reference genome. A species is therefore better represented by a pangenome, or collection of genomes, than by a single reference. Unfortunately, managing and using pangenomes for genomically diverse species is computationally and practically challenging. We developed a trellis graph representation anchored to the reference genome that represents most pangenomes well and can be used to impute complete genomes from low-density sequence or variant data. Results: The Practical Haplotype Graph (PHG) is a pangenome pipeline, database (PostgreSQL and SQLite), data model (Java, Kotlin or R) and Breeding API (BrAPI) web service. The PHG has already been able to accurately represent diversity in four major crops including maize, one of the most genomically diverse species, with up to 1000-fold data compression. Using simulated data, we show that, even at 0.1× coverage, with appropriate reads and sequence alignment, imputation results in extremely accurate haplotype reconstruction. The PHG is a platform and environment for the understanding and application of genomic diversity.
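
The trellis idea in the abstract can be illustrated with a small sketch. The abstract does not say how the PHG chooses a path through the graph, so the code below assumes a Viterbi-style dynamic program; it is not the PHG's actual implementation or API. The line names (lineA, lineB, lineC), haplotype IDs, the function impute_path and the parameters p_switch and p_error are all hypothetical and chosen only for illustration. Each reference range holds a few haplotypes, each haplotype records which lines carry it, and sparse read counts are used to pick the most likely haplotype in every range, including ranges with no reads.

```python
import math

# Toy trellis: one dict per reference range, mapping a haplotype ID to the
# set of lines that carry it. All IDs and line names are hypothetical.
TRELLIS = [
    {"h1": {"lineA", "lineB"}, "h2": {"lineC"}},
    {"h3": {"lineA"}, "h4": {"lineB", "lineC"}},
    {"h5": {"lineA"}, "h6": {"lineB"}, "h7": {"lineC"}},
]


def impute_path(trellis, read_counts, p_switch=0.05, p_error=0.01):
    """Return the most likely haplotype ID per reference range.

    read_counts[i] maps haplotype ID -> number of reads assigned to it in
    range i (an empty dict means no reads). p_switch and p_error are assumed
    illustrative values, not parameters taken from the PHG.
    """

    def emission(hap, counts):
        # Reads hitting this haplotype are expected; other reads are errors.
        hits = counts.get(hap, 0)
        misses = sum(counts.values()) - hits
        return hits * math.log(1 - p_error) + misses * math.log(p_error)

    def transition(prev_lines, lines):
        # Staying on a line shared by both haplotypes is cheap; switching is not.
        return math.log(1 - p_switch) if prev_lines & lines else math.log(p_switch)

    # score[hap] is the best log-probability of any path ending at hap.
    score = {hap: emission(hap, read_counts[0]) for hap in trellis[0]}
    backpointers = []
    for i in range(1, len(trellis)):
        new_score, pointers = {}, {}
        for hap, lines in trellis[i].items():
            candidates = {
                prev: score[prev] + transition(trellis[i - 1][prev], lines)
                for prev in score
            }
            best_prev = max(candidates, key=candidates.get)
            new_score[hap] = candidates[best_prev] + emission(hap, read_counts[i])
            pointers[hap] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final haplotype to recover the full path.
    hap = max(score, key=score.get)
    path = [hap]
    for pointers in reversed(backpointers):
        hap = pointers[hap]
        path.append(hap)
    return path[::-1]
```

With one read in the first range and one in the last, impute_path(TRELLIS, [{"h1": 1}, {}, {"h6": 1}]) returns ["h1", "h4", "h6"]: the middle range has no reads, but h4 is the only haplotype there that lies on a line (lineB) consistent with both flanking observations. This is the sense in which a sparse sample can recover a complete path through the graph.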