Research Project: GrainGenes: Enabling Data Access and Sustainability for Small Grains Researchers

Location: Crop Improvement and Genetics Research

2023 Annual Report

Objectives
GrainGenes is an international, centralized crop database and information portal for peer-reviewed small grains data that serves the small grains research and breeding communities (wheat, barley, oat, and rye). The GrainGenes project ensures long-term data curation, accessibility, and sustainability so that small grains researchers can develop new, more nutritious, disease- and pest-resistant, high-yielding cultivars.
Objective 1: Accelerate small grains (wheat, oats, barley, and rye) trait analysis, germplasm analysis, genetic studies, and breeding by providing open access to small grains genome sequences, germplasm diversity information, trait mapping information, and phenotype data at GrainGenes. Goal 1A: Integrate small grains genome assemblies, pangenomes, and annotations into GrainGenes. Goal 1B: Integrate genetic, diversity, functional, and phenotypic data into GrainGenes with a genome-centric focus.
Objective 2: Develop an infrastructure to curate, integrate, query, and visualize the genetic, genomic, and phenotypic relationships in small grains germplasm. Goal 2A: Develop methods and pipelines to link genetic, genomic, functional, and phenotypic information and to enhance genome-centric focus. Goal 2B: Implement web-based and computational tools to integrate and visualize genomic data linked with genetic, expression, functional, and diversity data. Goal 2C: Update database structure to align with community migration to a unified interface.
Objective 3: Collaborate with database developers and plant researchers to develop improved methods and mechanisms for open, standardized data and knowledge exchange to enhance database utility and interoperability. Goal 3A: Collaborate with data and germplasm repositories and organizations to facilitate the curation, sharing, and linking of data. Goal 3B: Collaborate with community software development efforts to adopt database schema design and tool development.
Objective 4: Provide community support and training for small grains researchers through workshops, webinars, and other outreach activities. Goal 4: Facilitate communication and information sharing among the small grains communities and GrainGenes to support research needs.

Approach
As a service project, the GrainGenes team does not perform hypothesis-driven research, but rather fulfills its long-term objectives by adding value to peer-reviewed data generated by others. It provides data curation, management and integration, long-term sustainability, and digital platforms as needed. Driven by stakeholder input, GrainGenes will maintain a central location for curated genomic, genetic, functional, and phenotypic data sets, downloadable in standardized formats, enhanced by intuitive query and visualization tools. Tutorial videos will be created to train small grains researchers on how to efficiently access and retrieve information from GrainGenes, and to show them different ways to reach and use multiple types of data to help develop better small grain crops.
Objective 1: Our approach will be to (a) curate genomic, pangenomic, and diversity data into the GrainGenes database; (b) create gene model pages to aggregate and link genomic and genetic data at GrainGenes; (c) curate high-impact, peer-reviewed genetic, trait, and phenotypic data into GrainGenes; (d) visualize more accurate genetic maps at GrainGenes; and (e) curate functional gene annotations.
Objective 2: We will implement computational pipelines to (a) align genomic and genetic features between different genome assemblies; (b) assign gene function for small grain genomes; (c) facilitate data curation into the GrainGenes database; (d) visualize SNP data online; and (e) display pedigree information. In addition, we will implement and maintain genome browsers to display tracks for multiple genome assemblies and create a multi-species Basic Local Alignment Search Tool (BLAST) interface to allow users to align their sequences against small grains genome assemblies; in parallel, we will prepare for a new release of GrainGenes with an updated content management system.
Objective 3: We will enhance links and data sharing between GrainGenes and the Triticeae Toolbox for small grains data, and collaborate with other data and germplasm repositories, groups, and organizations to facilitate the curation, sharing, and linking of data.
Objective 4: We will (a) present GrainGenes tools and resources at conferences and site visits; (b) create training videos to teach our users how they can use GrainGenes more efficiently; (c) organize annual meetings between GrainGenes and the GrainGenes Liaison Committee to receive community feedback; and (d) maintain GrainGenes e-mail lists to facilitate communication among members of the small grains community.

Progress Report
This is the final report for project 2030-21000-024-000D, GrainGenes: Enabling Data Access and Sustainability for Small Grains Researchers, which has been replaced by new project 2030-21000-056-000D, GrainGenes - A Global Data Repository for Small Grains. For additional information, see the new project report. In support of Objective 1, multiple genome browsers were made publicly available and new genome browser tracks were added for wheat, barley, and oat. ARS researchers in Albany, California, worked with the genome sequencing consortia to obtain their genomic and genetic data. In some cases, these genetic data sets required heavy curation, and a few oversights were corrected in the process. Among the new assemblies and new tracks, wheat received the most data. A durum wheat genome assembly and annotations were publicly released, along with version 2 of the wild emmer genome assembly and annotations. The largest addition was for the Chinese Spring hexaploid wheat genome. The paper for the 1,000 wheat exomes project was published in May 2019, and the genomic project outcomes were added as a separate section on the International Wheat Genome Sequencing Consortium Reference Sequence (RefSeq) v1.0 genome browser. Multiple data sets were curated into GrainGenes: 1) the Spring Wheat Nested Association Mapping (NAM) map; 2) the Global Tetraploid Wheat Collection germplasm; 3) durum quantitative trait loci; 4) uniform regional nursery data; 5) legacy oat maps; 6) the Oat 2018 Consensus map; and 7) updated GrainGenes trait records. GrainGenes indexed the following data sets into the Wheat Information System: 1) 16,106 germplasm records, including lines from the Global Tetraploid Wheat Collection and all other diploid, tetraploid, and hexaploid accessions in GrainGenes; 2) 548 quantitative trait loci; 3) 101 genetic and physical maps; and 4) 3,119 genes from the Wheat Gene Catalogue.
The GrainGenes curation team worked toward a complete representation of the data in the Wheat Gene Catalogue and converted the data into the interactive Structured Query Language-based database at GrainGenes. More than 70 genes from this gold-standard set were curated for display at the GrainGenes website. New gene classes, genes, loci, maps, references, and germplasm records were created to accommodate these data, and their links to the Komugi database were added. Data including 15 yield-related quantitative trait loci, four maps, 26 KASP (Kompetitive allele-specific polymerase chain reaction) probes with primer information, and germplasm were added to GrainGenes. Hyperlinks from 13,826 genes from the International Wheat Genome Sequencing Consortium’s Chinese Spring version 1 assembly were added to probe records to take users directly to 1) ExpVIP (wheatexpression.com), which is an RNA-seq data analysis and visualization platform that holds expression data for nearly 40 studies; 2) KnetMiner (https://knetminer.com), which is a graph-based gene discovery platform; 3) PhyloGenes (http://www.phylogenes.org), which displays pre-computed phylogenetic trees of gene families alongside experimental gene function data; 4) Ensembl Plants (http://plants.ensembl.org), which is a genome-centric portal for plant species; 5) Persephone (https://persephonesoft.com), which is a genome browser that facilitates comparative genomic views; and 6) the Rust expression browser (http://www.rust-expression.com), which specializes in expression of rust-disease related genes. The following genome assemblies were brought into GrainGenes, and genome browsers and associated pages were created: Kariega wheat, Fielder wheat, Morex barley version 3, Lo7 rye, Weining rye, Sang oat, Avena longiglumis oat, Avena insularis oat, Aegilops tauschii Aet version 5.0, Aegilops tauschii T093, Aegilops tauschii AY61, Aegilops tauschii XJ02, and Aegilops tauschii AY17.
A total of 37,789 records for loci, probes, and sequences for the 2016 50K marker set, acquired from the James Hutton Institute Germinate project page, were added to GrainGenes. Significant markers in genome-wide association studies in barley were linked to quantitative trait loci during curation. An example sequence record, JHI-Hv50k-2016-100012, was aligned against over 100 small grains genome databases, many of which were linked to the alignment results on the accompanying JBrowse genome browser. In addition, for the Morex barley version 3 and the PepsiCo OT3098 v2 hexaploid oat genome browsers, 659 quantitative trait loci were curated and reciprocal links from the browser to the GrainGenes quantitative trait loci and probe pages for significant markers were created.
For Objective 2, the reference genome sequence data are available to download from the genome browser pages and from the data download site created on GrainGenes. GrainGenes collaborated with a group at U.C. Berkeley to develop a prototype JBrowse plug-in called JBlast to allow BLASTing of sequences directly from tracks on JBrowse-based browsers and link them with genetic marker information. The production-quality plug-in will be publicly available to anyone who uses a browser based on JBrowse, one of the most widely used genome browsers in the world. GrainGenes data sources, including the formal MySQL relational database (315 tables), the companion CMap genetic maps’ MySQL database (over 250 data sets), and the BLAST nucleic acid and peptide databases (over 500 data sets), were deconstructed to create base files for future integration into newer database tools in development. Docker containers were created to prepare migration strategies for newer operating systems, migrated programming tools, and updated software versions used for data visualization and curation. Attention focused on the Tripal suite of modules, JBrowse for genome visualization, and Pretzel for genetic map visualization.
Steps were taken to convert the current MySQL (v5.5) version of the GrainGenes database into a PostgreSQL (v10.8) version in preparation for migration to Tripal v3, a suite of modules for biological data driven by the Drupal 7 content management system (CMS). A PostgreSQL version of GrainGenes was created, and a workflow of test data queries was tested and adjusted. The Drupal 7 environment has a projected end of life (EOL) of January 2025, so these preparations were halted in order to prepare for the Drupal 8+ and Tripal v4 releases. Computational pipelines were run and subsequent manual curation was performed for the Morex barley version 3 and the PepsiCo OT3098 v2 hexaploid oat browsers to assign track positions and genomic sequences to the following quantitative trait loci data: 83 quantitative trait loci for beta-glucan, as well as 576 quantitative trait loci for net blotch, other diseases, agronomic traits, and malt traits. The diversity data tracks were created on JBrowse-based genome browsers at GrainGenes for the International Wheat Genome Sequencing Consortium Chinese Spring wheat version 1 assembly. These tracks included varietal single nucleotide polymorphism datasets and the datasets displaying the outcomes of the 1000 Wheat Exomes Project, which, according to its publication, “aimed to generate a haplotype map on the basis of targeted re-sequencing of 890 diverse wheat landraces and cultivars, and tetraploid wild and domesticated relatives to identify genomic regions showing the signals of introgression from wild emmer.” The resulting track in GrainGenes contains 348,372 single nucleotide polymorphisms on the A and B genomes. GrainGenes now contains 176 BLAST databases, including 81 JBrowse-linked databases, where hit results have links back to our JBrowse instances; 32 of the BLAST databases are new, of which 24 are JBrowse-linked.
In support of Objective 3, collaboration with the Wheat Information System (WheatIS; wheatis.org) and personnel at Unité de Recherche en Génomique-Info (URGI) in Versailles, France, continued in FY19. Operating under the Wheat Initiative, WheatIS is a platform that provides a single hub of access to the wheat data that is distributed among the small grains databases worldwide through a common API. GrainGenes has started a closer collaboration with the USDA-ARS Triticeae Toolbox (T3) project on genomic data representation. To reduce cost and increase efficiency, both databases decided to maintain and populate a common set of genome browsers housed at GrainGenes. These collaborative efforts were described in a GrainGenes Database article. A PostgreSQL-based CMS hosting the Tripal v3 module suite was created but placed on hold pending Tripal v4 development. The collaboration with Agriculture and Agri-Food Canada continued in curating oat genetic maps, genomics data, and pedigree information into GrainGenes. More than 325 oat pedigrees were entered and linked to the Triticeae Toolbox database. Several oat maps, along with more than 100 new locus markers, were entered. Through the collaboration with Agriculture and Agri-Food Canada, 74 oat genetic and physical maps are now available at GrainGenes. Participation in monthly meetings and community message boards continues, in order to keep track of modules and newer versions of the Tripal package adapted for Drupal version 9.
In support of Objective 4, ARS researchers designed a new interface for the USDA-ARS Small Grains Genotyping Labs website. The website describes the four ARS genotyping labs in the United States, which use it to share information, links, and contact details. Barley Genetics Newsletter volumes 47 - 49 (2017-2020), created by USDA-ARS scientists, were made available at GrainGenes as PDF documents. Barley Genetics Newsletter issues can be found here: https://wheat.pw.usda.gov/ggpages/bgn/.
Online issues of the Annual Wheat Newsletter, including volumes 63 - 68 (2018-2022) and earlier issues back to the 37th, are at https://wheat.pw.usda.gov/ggpages/awn/. Additionally, 10 training videos were created and disseminated through the GrainGenes YouTube channel, making them freely and publicly available without a firewall to globally distributed stakeholders.

Accomplishments
1. GrainGenes increased its global userbase by 16.75%. GrainGenes (https://wheat.pw.usda.gov) is the ARS flagship database for small grains data, including wheat, barley, rye, and oat. The userbase of GrainGenes is distributed across six continents, and more than half of its users are located in the United States, China, and India. Compared with the previous year, the number of GrainGenes site visitors increased by 16.75% to 54,128, based on unique internet protocol (IP) addresses.

2. Improved phosphorylation prediction was achieved using gradient boosting and protein language models. Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes, but existing methods for predicting phosphorylation sites fail to identify many experimentally verified sites, i.e., they have low recall. Recent advances in natural language processing have enabled the development of protein language models (pLMs). Here, ARS researchers in Albany, California, presented a novel machine learning approach, PhosBoost, that harnesses pLMs and gradient boosting trees to predict protein phosphorylation from experimentally derived phosphorylation data. Benchmarking the performance of PhosBoost at serine and threonine prediction indicates that PhosBoost achieves higher recall at all probability thresholds. ARS researchers demonstrated that PhosBoost provides more confident scores for true positives than for false positives and developed a complementary approach for improving phosphorylation site annotation by using sequence alignments to make inferences from experimental data. PhosBoost is simple and scalable to implement, freely and publicly available at GitHub, and enables practical genome-wide predictions of protein phosphorylation coupled with improved phosphorylation site annotation.
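
For readers unfamiliar with the recall metric mentioned in the second accomplishment, the following minimal Python sketch shows how recall can be computed at several probability thresholds. It is an illustration only and is not taken from PhosBoost: the function name `recall_at_threshold`, the labels, and the scores are all hypothetical, made-up values. Recall here is the fraction of experimentally verified sites (label 1) whose predicted score is at or above the threshold.

```python
def recall_at_threshold(labels, scores, threshold):
    """Return recall = TP / (TP + FN), calling a site positive when score >= threshold.

    labels: 1 for an experimentally verified site, 0 otherwise (hypothetical data).
    scores: predicted probabilities from some classifier (hypothetical data).
    """
    true_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    actual_pos = sum(1 for y in labels if y == 1)
    if actual_pos == 0:
        raise ValueError("recall is undefined without any positive labels")
    return true_pos / actual_pos


# Made-up toy data: four verified sites and two non-sites.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]

# Recall falls as the threshold rises, because fewer verified sites clear it.
recall_curve = {t: recall_at_threshold(labels, scores, t) for t in (0.25, 0.5, 0.75)}
for threshold, recall in recall_curve.items():
    print(f"threshold={threshold:.2f} recall={recall:.2f}")
```

A method with higher recall at every threshold would recover a larger share of the verified sites than a competing method regardless of where the cutoff is placed.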

Review Publications
Boden, S.A., McIntosh, R.A., Uauy, C., Krattinger, S., Dubcovsky, J., Rogers, W., Xia, X., Badaeva, E.D., Bentley, A.R., Brown Guedira, G.L., Caccamo, M., Cattivelli, L., Chhuneja, P., Cockram, J., Contreras-Moreira, B., Dreisigacker, S., Edwards, D., Gonzalez, F., Guzman, C., Ikeda, T., Karsai, E.I., Nasuda, S., Pozniak, C., Prins, R., Sen, T.Z., Silva, P., Simkova, H., Zhang, Y. 2023. Updated guidelines for gene nomenclature in wheat. Theoretical and Applied Genetics. 136. Article 72. https://doi.org/10.1007/s00122-023-04253-w.
Cagirici, B.H., Andorf, C.M., Sen, T.Z. 2022. Co-expression pan-network reveals genes involved in complex traits within maize pan-genome. BMC Plant Biology. 22. Article 595. https://doi.org/10.1186/s12870-022-03985-z.
Ranganathan, S., Mahesh, S., Suresh, S., Nagarajan, A., Sen, T.Z., Yennamalli, R.M. 2022. Experimental and computational studies of cellulases as bioethanol enzymes. Bioengineered. 13(5):14028-14046. https://doi.org/10.1080/21655979.2022.2085541.
