
Research Project: Managing Honey Bees Against Disease and Colony Stress

Location: Bee Research Laboratory

Title: polishCLR: a Nextflow workflow for polishing PacBio CLR genome assemblies

Author
Chang, Jennifer - ORISE Fellow
Stahlke, Amanda
Chudalayandi, Sivanandan - Iowa State University
Rosen, Benjamin (Ben)
Childers, Anna
Severin, Andrew - Iowa State University

Submitted to: Genome Biology and Evolution
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 2/8/2023
Publication Date: 2/16/2023
Citation: Chang, J., Stahlke, A.R., Chudalayandi, S., Rosen, B.D., Childers, A.K., Severin, A. 2023. polishCLR: a Nextflow workflow for polishing PacBio CLR genome assemblies. Genome Biology and Evolution. 15(3): Article evad020. https://doi.org/10.1093/gbe/evad020.
DOI: https://doi.org/10.1093/gbe/evad020

Interpretive Summary: Computationally assembling and error-correcting a high-quality genome takes many steps. To simplify error correction so that users can more easily follow accepted best practices, we have developed a new software workflow called polishCLR that is easily implemented on a range of computer systems. The workflow allows flexibility for special input assembly considerations, manages files through numerous steps, and provides reports with metrics computed throughout the process.

Technical Abstract: Long-read sequencing has revolutionized genome assembly, yielding highly contiguous, chromosome-level contigs. However, assemblies from some third-generation long-read technologies, such as Pacific Biosciences (PacBio) Continuous Long Reads (CLR), have a high error rate. Such errors can be corrected with short reads through a process called polishing. Although best practices for polishing non-model de novo genome assemblies were recently described by the Vertebrate Genome Project (VGP) Assembly community, there is a need for a publicly available, reproducible workflow that can be easily implemented and run on a conventional high-performance computing environment. Here, we describe polishCLR (https://github.com/isugifNF/polishCLR), a reproducible Nextflow workflow that implements best practices for polishing assemblies made from CLR data. PolishCLR can be initiated from several input options that extend best practices to suboptimal cases. It also provides re-entry points at several key stages: identifying duplicate haplotypes with purge_dups, pausing for scaffolding if data are available, and running multiple rounds of polishing and evaluation with Arrow and FreeBayes. PolishCLR is containerized and publicly available to the greater assembly community as a tool for completing assemblies from existing, error-prone long-read data.