
Research Project: Development of Tools, Models and Datasets for Genome-enabled Studies of Bacterial Phytopathogens

Location: Emerging Pests and Pathogens Research

2019 Annual Report


Objectives
Objective 1: Develop datasets and computational tools to facilitate the study of large-scale genomic and pan-genomic features of plant-associated bacteria, including genomic islands and virulence pathways. [NP303, C2, PS2A]
Subobjective 1A: Develop deep proteogenomic datasets to guide the annotation of poorly characterized type strains and field isolates of select bacterial plant pathogens and other plant-associated bacteria.
Subobjective 1B: Develop or refine annotation methods for genomic regions of anomalous nucleotide composition and for the systems-level analysis of pathways related to virulence and adaptation to plant-associated niches.
Objective 2: Identify genes and candidate transcription factor binding sites using comparative genomics and available ChIP-Seq, RNA-Seq and proteomics datasets, and ensure that gene calls include experimental evidence whenever appropriate. [NP303, C2, PS2A]
Subobjective 2A: Extend comparative genomics methods to propagate experimentally supported genome annotation updates from targeted bacterial strains to related strains.
Subobjective 2B: Leverage proteomics and other high-throughput datasets, along with comparative genomics methods, to identify conserved motifs representing candidate promoters and other regulatory binding sites.


Approach
A good genome annotation includes a complete set of biological components (e.g., coding and non-coding genes) and a description of the interactions between them (e.g., promoters and binding sites for transcriptional regulators). Constructing this level of detail relies on painstaking experimental investigations of individual genes and their regulation, a luxury enjoyed by only a small handful of model organisms such as Escherichia coli, Pseudomonas aeruginosa, and Bacillus subtilis. The goal of this project is to use proteomics and other evidence-based computational analyses to rapidly produce high-quality bacterial genome annotations that biologists can use to design experiments and interpret experimental results. Our primary goal is to develop high-quality genomic resources for field isolates currently causing disease outbreaks, including Clavibacter michiganensis, Pantoea ananatis, Xylella fastidiosa, and Dickeya species. In addition, we will use existing and novel computational methods to establish pipelines for propagating our experimentally driven genome annotations to other members of their clades, with special emphasis on pathways related to virulence and fitness. This work will be conducted in collaboration with the Prokaryotic Genome Annotation Pipeline (PGAP) team at the National Center for Biotechnology Information (NCBI). In this manner, improvements to a small number of genomes will result in improvements to literally thousands of genome annotations. Both of these objectives build on our prior experience leading experimental and computational efforts to develop genomic resources for P. syringae pv. tomato DC3000.


Progress Report
Subobjective 1A: Researchers are continuing to work on the proteogenomic analysis of the important plant-pathogenic bacterium Xylella fastidiosa, the causative agent of Pierce’s disease in grapes, citrus variegated chlorosis (CVC), and olive quick decline syndrome (OQDS). Initial efforts focused on improving the existing genome annotation of the “Temecula” strain, which was isolated from diseased grapes in California. Cultures of this strain are being grown by ARS researchers at Ft. Detrick, Maryland. X. fastidiosa is difficult to maintain in the laboratory because of its growth requirements; however, difficulties in obtaining sufficient quantities of protein for proteomics were overcome, and samples were recently sent to ARS researchers in Ithaca, New York. These samples are currently being prepared for analysis by mass spectrometry. After several rounds of peer review and additional laboratory assays, a paper describing a proteogenomic analysis of Clavibacter michiganensis subsp. michiganensis was published. The raw and processed spectral datasets were deposited in ProteomeXchange. During the review process, some important concerns were raised about the approach to the proteomics analysis and the computational pipeline. These concerns are being evaluated in order to improve the current proteogenomics pipeline.
Subobjective 1B: One goal of this project is to use the proteomics data to discover gene regulation binding motifs in bacteria. These motifs are patterns in the DNA that provide targets for regulatory proteins. Binding to these targets allows regulatory proteins to activate or deactivate entire sets of genes. Unraveling these regulatory networks provides a great deal of information about how a bacterium survives and responds to environmental signals, such as those from a plant host.
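To make the notion of a binding motif concrete, the following minimal Python sketch scans the upstream regions of a few genes for a short degenerate consensus pattern. It is an illustration only and is not the project's pipeline: the gene names, the sequences, and the consensus "TGWCA" are all made up for the example, and real motif models are usually probabilistic (e.g., position weight matrices) rather than exact consensus strings.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif_sites(upstream_regions, consensus):
    """Return {gene: [0-based start positions]} for genes with at least one match.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    body = "".join(IUPAC[base] for base in consensus.upper())
    pattern = re.compile("(?=" + body + ")")
    hits = {}
    for gene, seq in upstream_regions.items():
        positions = [m.start() for m in pattern.finditer(seq.upper())]
        if positions:
            hits[gene] = positions
    return hits


# Hypothetical upstream sequences and a toy consensus (W means A or T).
upstream_regions = {
    "geneA": "AATGACATTTGTCAGG",
    "geneB": "CCCCGGGGCCCC",
    "geneC": "GGTGTCATT",
}
motif_hits = find_motif_sites(upstream_regions, "TGWCA")
print(motif_hits)
```

Genes that share a site in their upstream regions (here the hypothetical geneA and geneC) would be candidates for regulation by the same protein; such predictions still need experimental follow-up.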
Previously, Ithaca ARS researchers and other research groups have used RNA-Seq datasets to identify putatively co-regulated genes and have used bioinformatic tools, such as MEME, to discover common DNA sequence motifs that could serve as regulatory binding targets. Follow-on experiments have validated many of these computational predictions. We hypothesized that proteomics datasets could be used in place of RNA-Seq datasets to perform a similar analysis. This required only a minor modification to our existing motif discovery software pipeline. Preliminary studies were performed using the DC3000 proteomics datasets to assess the efficacy of this approach. We found that the quality of the results obtained from the proteomics datasets was significantly lower than that of previously published results based on RNA-Seq. We have since identified several reasons for the poor quality of the results. One reason is that the coverage provided by RNA-Seq is 100 to 1,000 times greater than that of proteomics, so the view of gene expression it provides is much more complete and nuanced. Another reason is that regulatory motifs govern transcriptional activity, which RNA-Seq measures directly, whereas proteomics measures proteins (translational products). Therefore, when available, RNA-Seq datasets will be preferable to proteomics datasets for motif discovery. To determine whether proteomics datasets can nonetheless be useful for motif discovery, we plan to perform motif discovery using the X. fastidiosa proteomics dataset. There is a lack of RNA-Seq data for X. fastidiosa, so motif discovery using proteomics may provide insights into how this organism is able to survive within the plant and cause disease symptoms. These preliminary results will help in the design of future experiments.
Existing genome annotation pipelines have an especially difficult time correctly annotating regions of bacterial genomes that have been acquired through horizontal gene transfer. Unfortunately, virulence factors and other bacterial genes important to survival within the plant are often found in these regions. A key motivation of this project was to use proteogenomics to improve the genome annotation of important phytopathogenic bacteria, especially in regions acquired from other organisms.
During the proteogenomic analysis of Clavibacter, two important parts of our method were identified that require further study. The first is the use of a very large database containing all theoretically possible protein translations of the target bacterial genome. Such a database is used so that spectra can be matched with peptides in a manner that is not biased towards the existing genome annotation. However, most of the records in the database are spurious, and including these spurious records increases the likelihood of spurious matches, or false discoveries. Several methods have been proposed to address this concern, including (a) using a smaller database or a series of databases of increasing size and (b) imposing very strict statistical tests to reduce false discoveries.
The second matter of concern is that there is no way to measure the power or precision of any proteogenomics pipeline without additional high-throughput data generated using an independent method. For example, we could run the proteomics pipeline using several different methods for constructing the protein database and several different statistical methods for filtering the results, and we would likely obtain many different sets of results. There is no analytical method for determining which combination of methods yields the best results. The best approach for evaluating the quality of proteomic results is to compare them with results from an RNA-Seq experiment. The idea is that, if a protein is observed, then a corresponding mRNA should be present under the same conditions.
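The annotation-independent search database described above, containing all theoretically possible protein translations of a genome, is commonly built by six-frame translation. The Python sketch below illustrates the idea under stated assumptions and should not be read as the pipeline used in this project: it treats the genome as a linear string of A, C, G and T, uses the standard amino acid assignments (the same assignments as the bacterial genetic code, NCBI table 11), splits each frame at stop codons, and applies a hypothetical `min_length` cutoff. The function and variable names are invented for the example.

```python
from itertools import product

# Standard codon table in NCBI ordering (bases T, C, A, G); "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g., with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_peptides(genome, min_length=3):
    """Yield (strand, frame, peptide) for every stop-to-stop stretch.

    min_length is an illustrative cutoff; a real pipeline would choose it
    (and record genomic coordinates) according to its own requirements.
    """
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            for peptide in translate(seq[frame:]).split("*"):
                if len(peptide) >= min_length:
                    yield strand, frame, peptide


# Toy 12-nucleotide "genome" used only to show the output shape.
peptides = list(six_frame_peptides("ATGGCCAAATAG"))
for strand, frame, peptide in peptides:
    print(strand, frame, peptide)
```

Even this toy sequence yields one entry per frame although at most one frame would normally be coding, which illustrates why most records in such a database are spurious and why false discovery control matters.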
We are currently working to obtain an adequate suite of RNA-Seq datasets for calibrating and evaluating the different proteogenomics pipelines.
This research project has been operating with only one of two planned scientist positions and without any support personnel due to the federal hiring freeze. As a result, progress on Objective 1 has been delayed until a support scientist is hired. In addition, the research outlined in Objective 2 has been abandoned.


Accomplishments
1. ARS scientists use novel method for discovering genes in tomato pathogen. Bacteria, including those that cause disease, swap DNA with other bacteria. This so-called “horizontal gene transfer” results in the emergence of antibiotic resistance, the ability to overcome resistant varieties of crops, and other traits that reduce crop yield. It also makes it more difficult to identify all of the genes in emerging and persistent pathogens using existing computational methods. ARS scientists in Ithaca, New York, and their Cornell collaborators have performed an experiment using Clavibacter michiganensis subsp. michiganensis, the causative agent of bacterial wilt and canker of tomato, in which the proteins of the bacterium were captured and identified. The observed proteins were compared with the bacterium’s existing genome annotation. Approximately 70% of the annotated proteins were observed using this approach. Fifty-nine existing gene annotations were found to be wrong and required correction. In addition, 26 unannotated proteins were discovered, some of which are encoded in a genomic region known to be involved in disease progression and others of which are predicted to be membrane-bound and may be involved in survival in the plant environment. These results will enable a better understanding of this and other strains of this important plant pathogen. In addition, this method is currently being used to study other pathogenic bacteria.


Review Publications
Peritore-Galve, F., Schneider, D.J., Yang, Y., Thannhauser, T.W., Smart, C.D., Stodghill, P. 2019. Proteome profile and genome refinement of the tomato-pathogenic bacterium Clavibacter michiganensis subsp. michiganensis. Proteomics. https://doi.org/10.1002/pmic.201800224.