
Research Project: Mapping Crop Genome Functions for Biology-Enabled Germplasm Improvement

Location: Plant, Soil and Nutrition Research

Title: Highly accurate HiFi long read sequencing data for five complex genome samples

Author
HON, TING - Pacific Biosciences Inc
MARS, KRISTIN - Pacific Biosciences Inc
YOUNG, GREG - Pacific Biosciences Inc
TSAI, YU-CHIH - Pacific Biosciences Inc
KAURALIS, JOSEPH - Pacific Biosciences Inc
LANDOLIN, JANE - Ravel Biotechnology
MAURER, NICHOLAS - University Of California Santa Cruz
KUDRNA, DAVID - Arizona Genomics Institute
HARDIGAN, MICHAEL - University Of California, Davis
STEINER, CYNTHIA - Beckman Research Institute
KNAPP, STEVE - University Of California, Davis
Ware, Doreen
SHAPIRO, BETH - University Of California Santa Cruz
PELUSO, PAUL - Pacific Biosciences Inc
RANK, DAVID - Pacific Biosciences Inc

Submitted to: Scientific Data - Nature
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 10/27/2020
Publication Date: 10/27/2020
Citation: Hon, T., Mars, K., Young, G., Tsai, Y., Kauralis, J., Landolin, J.M., Maurer, N., Kudrna, D., Hardigan, M.A., Steiner, C.C., Knapp, S., Ware, D., Shapiro, B., Peluso, P., Rank, D.R. 2020. Highly accurate HiFi long read sequencing data for five complex genome samples. Scientific Data - Nature. 7. Article e399. https://doi.org/10.1038/s41597-020-00743-4.
DOI: https://doi.org/10.1038/s41597-020-00743-4

Interpretive Summary: Benchmarking data sets are needed to validate and improve genome assembly algorithms. In this paper we present deep-coverage PacBio HiFi sequencing data for mouse, frog, corn, and strawberry genomes, with reads averaging 10-25 kb in length and greater than 99.5% accuracy. We also include a data set from a mock microbial community metagenome. These data sets can be used without restriction to develop new algorithms that support the assembly and analysis of complex genome structure and evolution.

Technical Abstract: The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets whose reads average 10-25 kb with accuracies greater than 99.5%. These accurate long reads improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, sample datasets are needed both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools, including genome assemblers, variant callers, and haplotyping algorithms. We present deep-coverage HiFi datasets for five complex samples, including the two inbred model genomes Mus musculus and Zea mays, as well as two outbred complex genomes, the octoploid Fragaria × ananassa and the anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II instrument.