Publication #396771

Research Project: Improving Crop Efficiency Using Genomic Diversity and Computational Modeling

Location: Plant, Soil and Nutrition Research

Title: A multiple alignment workflow shows the effect of repeat masking and parameter tuning on alignment in plants

Author
WU, YAOYAO - Cornell University
JOHNSON, LYNN - Cornell University
SONG, BAOXING - Cornell University
ROMAY, MARIA - Cornell University
STITZER, MICHELLE - Cornell University
SIEPEL, ADAM - Cold Spring Harbor Laboratory
Buckler, Edward - Ed
SCHEBEN, ARMIN - Cold Spring Harbor Laboratory

Submitted to: The Plant Genome
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 2/21/2021
Publication Date: 4/13/2022
Citation: Wu, Y., Johnson, L., Song, B., Romay, M.C., Stitzer, M., Siepel, A., Buckler IV, E.S., Scheben, A. 2022. A multiple alignment workflow shows the effect of repeat masking and parameter tuning on alignment in plants. The Plant Genome. 15(2). Article e20204. https://doi.org/10.1002/tpg2.20204.
DOI: https://doi.org/10.1002/tpg2.20204

Interpretive Summary: To identify the genetic causes underlying differences between species, one must first compare their genome sequences. Existing tools can accomplish this but require specialist knowledge to implement. The many requirements and types of software involved can make the seemingly straightforward task of multiple sequence comparison technically challenging for individual researchers. We developed the msa_pipeline workflow (https://bitbucket.org/bucklerlab/msa_pipeline) to allow comparison of diverged plant genomes with minimal user input. The msa_pipeline leverages existing tools to provide a practical solution for rapid multiple alignment of genomes. As the pace of genome sequencing and assembly accelerates, comparison of the genomes of tens to hundreds of species will drive biological discovery in plants. The workflow presented here provides a practical first step toward performing these comparisons.

Technical Abstract: Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow (https://bitbucket.org/bucklerlab/msa_pipeline) to allow practical and sensitive multiple alignment of diverged plant genomes and calculation of conservation scores with minimal user input. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the effect of different masking approaches and parameters of the LAST aligner using genome assemblies of 33 grass species. Compared with conventional masking with RepeatMasker, a masking approach based on k-mers (nucleotide sequences of length k) increased the alignment rate of coding sequence and noncoding functional regions by 25% and 14%, respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for noncoding functional regions by over 52% compared with default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of noncoding sites that can be scored for conservation by over 76%. Overall, tuning of masking and alignment parameters can generate optimized multiple alignments to drive biological discovery in plants.
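
To illustrate the general idea behind k-mer-based masking mentioned in the abstract, the following is a minimal Python sketch that soft-masks (lowercases) bases covered by k-mers occurring many times in a sequence. It is not the masking method used in the study or in msa_pipeline; the function name `kmer_soft_mask` and the values of `k` and `min_count` are hypothetical choices for illustration only, and a real genome-scale approach would likely also count reverse complements and use far larger k.

```python
from collections import Counter


def kmer_soft_mask(seq: str, k: int = 4, min_count: int = 3) -> str:
    """Lowercase every base covered by a k-mer seen at least min_count times.

    Illustrative only: k and min_count are hypothetical toy values.
    """
    seq = seq.upper()
    n = len(seq)
    if k <= 0 or n < k:
        # Sequence too short to contain any k-mer; nothing to mask.
        return seq

    # Count every overlapping k-mer in the sequence.
    counts = Counter(seq[i:i + k] for i in range(n - k + 1))

    # Mark all positions covered by a high-frequency k-mer.
    masked = [False] * n
    for i in range(n - k + 1):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True

    # Soft-mask: repeats become lowercase, unique sequence stays uppercase.
    return "".join(b.lower() if m else b for b, m in zip(seq, masked))


if __name__ == "__main__":
    # A unique prefix followed by a short tandem repeat.
    print(kmer_soft_mask("ACGGTCATATATATAT", k=4, min_count=3))
```

Soft-masking in lowercase, rather than replacing bases with N, follows a common convention in which aligners can be told to avoid seeding alignments in lowercase regions while still extending through them.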