
Research Project: Developing a Systems Biology Approach to Enhance Efficiency and Sustainability of Beef and Lamb Production

Location: Genetics and Animal Breeding

Title: Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C

Author
KRONENBERG, ZEV - Phase Genomics, Inc
RHIE, ARANG - National Human Genome Research Institute
KOREN, SERGEY - National Human Genome Research Institute
CONCEPCION, GREGORY - Pacific Biosciences Inc
PELUSO, PAUL - Pacific Biosciences Inc
MUNSON, KATHERINE - University Of Washington Medical School
PORUBSKY, DAVID - University Of Washington Medical School
Kuhn, Kristen
MUELLER, KATHRYN - Phase Genomics, Inc
LOW, WAI YEE - University Of Adelaide
HIENDLEDER, STEFAN - University Of Adelaide
FEDRIGO, OLIVIER - Rockefeller University
LIACHKO, IVAN - Phase Genomics, Inc
HALL, RICHARD - National Human Genome Research Institute
PHILLIPPY, ADAM - National Human Genome Research Institute
EICHLER, EVAN - University Of Washington Medical School
WILLIAMS, JOHN - University Of Adelaide
Smith, Timothy - Tim
JARVIS, ERICH - Rockefeller University
SULLIVAN, SHAWN - Phase Genomics, Inc
KINGAN, SARAH - Pacific Biosciences Inc

Submitted to: Nature Communications
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/12/2020
Publication Date: 4/28/2021
Citation: Kronenberg, Z.N., Rhie, A., Koren, S., Concepcion, G., Peluso, P., Munson, K.M., Porubsky, D., Kuhn, K.L., Mueller, K.A., Low, W., Hiendleder, S., Fedrigo, O., Liachko, I., Hall, R.J., Phillippy, A.M., Eichler, E.E., Williams, J.L., Smith, T.P.L., Jarvis, E.D., Sullivan, S.T., Kingan, S.B. 2021. Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C. Nature Communications. 12. Article 1935. https://doi.org/10.1038/s41467-020-20536-y.
DOI: https://doi.org/10.1038/s41467-020-20536-y

Interpretive Summary: Accurate reference genomes are essential for nearly all genomic analyses. Creating accurate references is challenging because sequencing technologies cannot yet read across entire chromosomes. Instead, the chromosome sequence must be reconstructed as segments representing fragments of chromosomes, called contigs, by assembling large numbers of relatively short, random reads. Repetitive sequence along the chromosome interrupts the assembly whenever the repeat is longer than the sequence reads, creating gaps. Long-read technologies, which produce reads longer than most repetitive sequences in mammalian genomes, have supported development of much higher quality assemblies than previously possible. However, another level of complexity has plagued even long-read genome assembly projects: variation between the two parental copies (one from the mother and one from the father) of each chromosome. Wherever the maternal and paternal chromosomes differ substantially, the assembler cannot determine which version should be included in the assembly. Most assemblies are therefore a “smashed” representation of the genome that mixes portions of the maternal and paternal chromosomes, so the final assembly does not represent either actual set of chromosomes. The present manuscript describes a new method that uses Hi-C chromatin interaction data, together with the information contained in long reads, to identify and separate the parental alleles, or haplotypes, into separate contigs, reducing confusion and increasing the accuracy with which the assembly represents the chromosomes present in the individual being sequenced.
We had previously reported an even more accurate method that phases the assembly into two complete genomes, one representing the maternally derived chromosomes and one the paternally derived chromosomes, but that method depends on also collecting sequence from the parents of the individual. The new method, called FALCON-Phase, supports creation of phased genome assemblies without the requirement for parental sequence, which is sometimes difficult or impossible to obtain.

Technical Abstract: Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without parental data, and performance is better in samples with higher heterozygosity. For cow and zebra finch the accuracy is 97%, compared to 80–91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.
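
Illustrative sketch (not part of the published record): the idea of extending phase blocks with Hi-C can be pictured with a toy example. Hi-C read pairs tend to join sequences on the same physical chromosome copy more often than sequences on the other copy, so counting the read pairs that link haplotigs in different phase blocks hints at which haplotigs belong together. The Python below is a simplified greedy procedure written for this page under that assumption; it is not the FALCON-Phase algorithm, and the function name `phase_blocks`, the `links` data layout, and the example counts are all hypothetical.

```python
from collections import defaultdict


def phase_blocks(n_blocks, links):
    """Greedily orient phase blocks using Hi-C link counts (toy sketch).

    Each phase block i has two haplotigs, labelled 0 and 1.
    `links` maps ((block_i, hap_a), (block_j, hap_b)) to the number of
    Hi-C read pairs joining those two haplotigs (hypothetical layout).
    Returns a list `assigned` where assigned[i] is the haplotig of
    block i placed on haplotype A; the other haplotig goes to B.
    """
    # Store counts symmetrically so lookup order does not matter.
    counts = defaultdict(int)
    for (u, v), n in links.items():
        counts[(u, v)] += n
        counts[(v, u)] += n

    assigned = [0]  # block 0 anchors the arbitrary A/B labelling
    for j in range(1, n_blocks):
        keep = swap = 0
        for i in range(j):
            a = assigned[i]  # haplotig of block i already on haplotype A
            b = 1 - a        # haplotig of block i on haplotype B
            # Support for placing haplotig 0 of block j on haplotype A.
            keep += counts[((i, a), (j, 0))] + counts[((i, b), (j, 1))]
            # Support for placing haplotig 1 of block j on haplotype A.
            swap += counts[((i, a), (j, 1))] + counts[((i, b), (j, 0))]
        assigned.append(0 if keep >= swap else 1)
    return assigned


# Made-up link counts for three phase blocks.
example_links = {
    ((0, 0), (1, 1)): 9, ((0, 1), (1, 0)): 8,
    ((0, 0), (1, 0)): 1, ((0, 1), (1, 1)): 2,
    ((1, 1), (2, 0)): 7, ((1, 0), (2, 1)): 6,
    ((1, 0), (2, 0)): 1, ((0, 0), (2, 0)): 3,
}
result = phase_blocks(3, example_links)
print(result)  # [0, 1, 0]
```

In this made-up data, haplotig 0 of block 0 shares most of its links with haplotig 1 of block 1 and haplotig 0 of block 2, so those three are grouped on one haplotype. No variant calling is involved, which mirrors the point made in the abstract, but a real tool must also handle noise, read mapping, and many more blocks than a greedy pass can treat reliably.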