Location: Characterization and Interventions for Foodborne Pathogens
Title: The GEA pipeline for characterizing Escherichia coli and Salmonella genomes
Submitted to: Scientific Reports
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/3/2024
Publication Date: 6/10/2024
Citation: Dickey, A.M., Schmidt, J.W., Bono, J.L., Guragain, M. 2024. The GEA pipeline for characterizing Escherichia coli and Salmonella genomes. Scientific Reports. 14. Article 13257. https://doi.org/10.1038/s41598-024-63832-z.
Interpretive Summary: Escherichia coli and Salmonella are major foodborne human pathogenic bacteria, and their genomes (total DNA) are routinely sequenced for purposes including risk assessment, hazard identification, and clinical surveillance. We report the development of the Gammaproteobacteria Epidemiologic Annotation (GEA) pipeline, a series of computer applications for analyzing large batches of E. coli and Salmonella genomes using the Center for Genomic Epidemiology's resources on high-performance computing systems. The GEA pipeline is designed to assemble and characterize bacterial genomes by utilizing the most current information from annotation databases, and to increase the coverage of these databases over time. The predictive genome annotations produced by the GEA pipeline allow the identification of pathogenic E. coli (both intestinal and extraintestinal) and Salmonella. In this work, we successfully tested the GEA pipeline on E. coli genomes across multiple compute environments and demonstrated large-scale annotation of more than 14,000 Salmonella genome assemblies. The GEA pipeline is flexible and can be continuously updated, so it can evolve as new tools and features become available. High-throughput genome assembly and characterization by the pipeline allows rapid identification and comprehensive characterization of potential pathogens.
Technical Abstract: Salmonella enterica and Escherichia coli are major foodborne human pathogens, and their genomes are routinely sequenced for clinical surveillance.
Computational pipelines designed for analyzing pathogen genomes should both utilize the most current information from annotation databases and increase the coverage of these databases over time. We report the development of the Gammaproteobacteria Epidemiologic Annotation (GEA) pipeline to analyze large batches of E. coli and S. enterica genomes. The GEA pipeline takes paired Illumina raw read files as input, assembles them, and annotates the resulting assemblies. Alternatively, assemblies can be provided as input and annotated directly. The pipeline provides predictive genome annotations for E. coli and S. enterica with a focus on the Center for Genomic Epidemiology tools. Annotation results are provided as a tab-delimited text file. The GEA pipeline is designed for large-scale E. coli and S. enterica genome assembly and characterization using the Center for Genomic Epidemiology command-line tools and high-performance computing. Large-scale annotation is demonstrated by an analysis of more than 14,000 Salmonella genome assemblies. Testing the GEA pipeline on E. coli raw reads demonstrates reproducibility across multiple compute environments, and computational usage is optimized on high-performance computers.
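
The two input modes and the tab-delimited output described in the abstract can be illustrated with a minimal Python sketch. This is not the GEA pipeline's own code. The `_R1`/`_R2` file-naming convention, the assembly file extensions, and the function names `classify_inputs` and `read_annotation_table` are all assumptions made for illustration; the abstract does not specify them.

```python
import csv
import re
from collections import defaultdict

# Assumed (hypothetical) naming convention for paired Illumina reads:
# <sample>_R1.fastq[.gz] and <sample>_R2.fastq[.gz]
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")
# Assumed (hypothetical) extensions for pre-built assemblies
ASSEMBLY_SUFFIXES = (".fasta", ".fa", ".fna")


def classify_inputs(filenames):
    """Split file names into complete read pairs and assemblies.

    Returns (pairs, assemblies), where pairs maps a sample name to its
    (R1, R2) file names. Samples missing one mate are left out.
    """
    mates = defaultdict(dict)
    assemblies = []
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            mates[match.group("sample")][match.group("mate")] = name
        elif name.endswith(ASSEMBLY_SUFFIXES):
            assemblies.append(name)
    pairs = {
        sample: (found["1"], found["2"])
        for sample, found in mates.items()
        if "1" in found and "2" in found
    }
    return pairs, sorted(assemblies)


def read_annotation_table(handle):
    """Parse a tab-delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))
```

Read pairs would go to assembly and then annotation, while assemblies would go straight to annotation. The result table can then be loaded from any open text file whose first row holds the column names; the actual column names in GEA output are not given in the abstract.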