Research Project: GrainGenes: Enabling Data Access and Sustainability for Small Grains Researchers

Location: Crop Improvement and Genetics Research

2020 Annual Report

Objectives
GrainGenes is an international, centralized database and information portal for peer-reviewed small grains data that serves the small grains (wheat, barley, oat, and rye) research and breeding communities. The GrainGenes project ensures long-term data curation, accessibility, and sustainability so that small grains researchers can develop new, more nutritious, disease- and pest-resistant, high-yielding cultivars.
Objective 1: Accelerate small grains (wheat, oats, barley, and rye) trait analysis, germplasm analysis, genetic studies, and breeding by providing open access to small grains genome sequences, germplasm diversity information, trait mapping information, and phenotype data at GrainGenes. Goal 1A: Integrate small grains genome assemblies, pangenomes, and annotations into GrainGenes. Goal 1B: Integrate genetic, diversity, functional, and phenotypic data into GrainGenes with a genome-centric focus.
Objective 2: Develop an infrastructure to curate, integrate, query, and visualize the genetic, genomic, and phenotypic relationships in small grains germplasm. Goal 2A: Develop methods and pipelines to link genetic, genomic, functional, and phenotypic information and to enhance the genome-centric focus. Goal 2B: Implement web-based and computational tools to integrate and visualize genomic data linked with genetic, expression, functional, and diversity data. Goal 2C: Update the database structure to align with community migration to a unified interface.
Objective 3: Collaborate with database developers and plant researchers to develop improved methods and mechanisms for open, standardized data and knowledge exchange to enhance database utility and interoperability. Goal 3A: Collaborate with data and germplasm repositories and organizations to facilitate the curation, sharing, and linking of data. Goal 3B: Collaborate with community software development efforts to adopt database schema design and tool development.
Objective 4: Provide community support and training for small grains researchers through workshops, webinars, and other outreach activities. Goal 4: Facilitate communication and information sharing among the small grains communities and GrainGenes to support research needs.

Approach
As a service project, the GrainGenes team does not perform hypothesis-driven research, but rather fulfills its long-term objectives by adding value to peer-reviewed data generated by others. It provides data curation, management and integration, long-term sustainability, and digital platforms as needed. Driven by stakeholder input, GrainGenes will maintain a central location for curated genomic, genetic, functional, and phenotypic data sets, downloadable in standardized formats, enhanced by intuitive query and visualization tools. Tutorial videos will be created to train small grains researchers on how to efficiently access and retrieve information from GrainGenes, and to show them different ways to reach and use multiple types of data to help develop better small grain crops.
Objective 1: Our approach will be to (a) curate genomic, pangenomic, and diversity data into the GrainGenes database; (b) create gene model pages to aggregate and link genomic and genetic data at GrainGenes; (c) curate high-impact, peer-reviewed genetic, trait, and phenotypic data into GrainGenes; (d) visualize more accurate genetic maps at GrainGenes; and (e) curate functional gene annotations.
Objective 2: We will implement computational pipelines to (a) align genomic and genetic features between different genome assemblies; (b) assign gene function for small grain genomes; (c) facilitate data curation into the GrainGenes database; (d) visualize SNP data online; and (e) display pedigree information. In addition, we will implement and maintain genome browsers to display tracks for multiple genome assemblies and create a multi-species Basic Local Alignment Search Tool (BLAST) interface to allow users to align their sequences against small grains genome assemblies. In parallel, we will prepare for a new release of GrainGenes with an updated content management system.
Objective 3: We will enhance links and data sharing between GrainGenes and the Triticeae Toolbox for small grains data, and collaborate with other data and germplasm repositories, groups, and organizations to facilitate the curation, sharing, and linking of data.
Objective 4: We will (a) present GrainGenes tools and resources at conferences and site visits; (b) create training videos to teach our users how they can use GrainGenes more efficiently; (c) organize annual meetings between GrainGenes and the GrainGenes Liaison Committee to receive community feedback; and (d) maintain GrainGenes e-mail lists to facilitate communication among members of the small grains community.

Progress Report
In support of Sub-objective 1A, GrainGenes acquired, curated, and displayed multiple genome assemblies, annotations, and genetic/sequence diversity datasets that form the basis of pangenome development. Chinese Spring RefSeq 1.0 genome browsers were populated with diversity data sets from the wheat Targeting Induced Local Lesions IN Genomes (TILLING), haplotype map (HAPMAP), and WHEAt and barley Legacy for Breeding Improvement (WHEALBI) projects. The GrainGenes team created a genome browser for the barley cv. Morex genome assembly (version MorexV2, assembled with TRITEX). The following oat genome assemblies and annotations are being displayed at GrainGenes: diploid Avena atlantica and Avena eriantha, and hexaploid OT3098. A new variant track, built from data generated from 3,000-year-old Egyptian wild emmer wheat chaff, has been added to the wild emmer wheat (Zavitan) WEWSeq v2.0 genome browser.
In support of Sub-objective 1B, functional annotations of the high-confidence gene datasets from the International Wheat Genome Sequencing Consortium Chinese Spring wheat RefSeq 1.0 and the International Barley Sequencing Consortium barley Morex RefSeq 1.0 were loaded onto GrainGenes. The probe records for this collection provide external links to the Pfam (a database of protein families), InterPro (a database of protein families, domains, and functional sites), and AmiGO (a database of gene ontology terms) websites. The GrainGenes curation team is working toward a complete representation of the data in the Wheat Gene Catalog and has converted the data into the interactive Structured Query Language (SQL)-based database at GrainGenes. More than 70 genes from this gold-standard set were curated for display on the GrainGenes website. New gene classes, genes, loci, maps, references, and germplasm records are being created to accommodate these data, and their links to the Komugi database have been added.
Data including 15 yield-related quantitative trait loci, 4 maps, 26 Kompetitive allele-specific polymerase chain reaction (KASP) probes with primer information, and germplasm were added to GrainGenes. In addition, a dataset including 3,408 new quantitative trait loci, 56 genetic maps, and 2 physical maps was indexed to be discoverable and searchable at WheatIS.org, a website managed by the International Wheat Information System committee.
For Sub-objective 2A, current scripts were modified and new scripts were created to facilitate genomic and genetic data integration into the back-end database. After confidential information was removed from the scripts, they were deposited on GitHub for public availability, sharing, and long-term data sustainability.
Research in support of Sub-objective 2B included the creation of an online Multi-Basic Local Alignment Search Tool (BLAST) application interface that allows users to align their sequences against multiple sequence sets. A freely available new software suite called SequenceServer, developed by a team at Université de Lausanne, was also implemented at GrainGenes. More than 20 genome assembly databases were converted for use in the new software. The new online BLAST tool allows multiple sequence searches and links to the genome browsers for wheat, barley, rye, and oat, enhancing data integration and linking at GrainGenes and helping users reach the appropriate information.
In support of Sub-objective 2C, GrainGenes data tables were successfully extracted for migration from MySQL to a PostgreSQL (v10) platform; however, conversion and optimization of the data handling scripts from Perl5 to Python3 is still needed. Python 2.7 is no longer supported and was replaced by Python 3.6, which required the ODBC data connectors to be updated. Data capture is still in the testing phase. Because MariaDB serves as the replacement for the MySQL environment, indexing adjustments were required.
Docker will supplement some of the testing while the schema is developed for export to a comparison schema based on Chado (v3.1). Chado is an ontology-based modular schema standard adopted by the Generic Model Organism Database (GMOD) research community to describe biological data; it can increase the speed of data query and extraction, enabling users to reach data faster at GrainGenes.
For Sub-objective 3A, the collaboration with Agriculture and Agri-Food Canada continued in order to curate oat genetic maps, genomic data, and pedigree information into GrainGenes. More than 325 oat pedigrees were entered and linked to the Triticeae Toolbox database. Several oat maps, along with more than 100 new locus markers, were entered. Through the collaboration with Agriculture and Agri-Food Canada, 74 oat genetic and physical maps are now available at GrainGenes.
Progress in support of Sub-objective 3B included monthly Tripal Core and User group meetings through Zoom to discuss trends in developing databases around the Tripal module platform. Tripal is based on a set of Drupal7 modules, and Drupal7 is nearing the end of its support timeline. Tripal required modifications for Drupal8, and Drupal9 is already the current version. There are radical changes between Drupal7 and Drupal8. A rescue effort to support Drupal7 using BackdropCMS (v1.16.2) may be possible. Discussions within the Tripal group also cover implementations of the Tripal suite in programming languages other than the current PHP7 environment. A newer Tripal3.3 version is now available; this version allows additional databases to be integrated alongside the Drupal and Chado databases through configured hooks in the Tripal module.
For Sub-objective 4A, two new training videos about how to use GrainGenes were created: 1) “Navigating between Database Records and Genome Browsers” and 2) “Saving Information from GrainGenes Genome Browsers.” The videos were uploaded to YouTube, where they are freely and publicly available without a firewall. In FY20, they received 24 and 26 views, respectively. One presentation about GrainGenes activities was given at the Plant and Animal Genome Conference in San Diego, California.

Accomplishments
1. A private-academic-government partnership resulted in free open access to critical genomics and genome sequence assembly data for oat improvement. Oats are among the healthiest grains on earth, and free access to critical genetic and genomic resource information is important to the research community for crop improvement research. ARS scientists in Albany, California, worked with private companies and university scientists to host, share, and display the OT3098 hexaploid oat genome sequence data in the GrainGenes database. The oat genome sequence data were generated by PepsiCo and Corteva in collaboration with GrainGenes, the University of North Carolina Charlotte, and the University of Saskatchewan to advance oat research. GrainGenes provides a home for the long-term stewardship of the oat assembly and annotations. An ARS press release was distributed and is available at https://www.ars.usda.gov/news-events/news/research-news/2020/oat-genome-available-on-ars-website

2. Putative G-quadruplex regulatory regions were identified for the first time in wheat. Wheat is one of the world’s most important staple foods, providing approximately 20% of the calories consumed by humans. ARS scientists in Albany, California, identified close to 1 million G-quadruplex motifs across the bread wheat genome. Functional enrichment analysis revealed that gene models enriched with G-quadruplex motifs are involved in developmental processes, localization, and cellular component organization or biogenesis. A publicly available track showing G-quadruplex positions was created in the Chinese Spring wheat genome browser at GrainGenes. The study shows for the first time the prevalence and possible functional roles of G-quadruplexes in wheat.
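The report does not state how the G-quadruplex motifs were detected. As an illustration only, the sketch below scans a sequence with a commonly used regular-expression definition of a G-quadruplex motif (four runs of three or more G's separated by loops of 1 to 7 bases) and formats the hits as BED lines of the kind a genome browser track could load. The pattern, the function names find_g4_motifs and to_bed_lines, and the chromosome label are assumptions made for this example, not the method used in the study.

```python
import re

# Assumed canonical motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+.
# The C-rich pattern captures motifs located on the reverse strand.
G4_PLUS = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")
G4_MINUS = re.compile(r"(?:C{3,}[ACGT]{1,7}){3}C{3,}")


def find_g4_motifs(sequence):
    """Return sorted (start, end, strand) tuples using 0-based, half-open
    coordinates. Matches are non-overlapping within each strand."""
    seq = sequence.upper()
    hits = []
    for strand, pattern in (("+", G4_PLUS), ("-", G4_MINUS)):
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)


def to_bed_lines(chrom, hits):
    """Format hits as six-column BED lines (chrom, start, end, name, score, strand)."""
    return [f"{chrom}\t{start}\t{end}\tG4\t0\t{strand}"
            for start, end, strand in hits]


if __name__ == "__main__":
    # Hypothetical toy sequence and chromosome label, for demonstration only.
    toy = "TTGGGAGGGTGGGAGGGTT"
    for line in to_bed_lines("chr1A", find_g4_motifs(toy)):
        print(line)
```

Because the search is a simple pattern match, it reports putative motifs only; whether a given site forms a G-quadruplex in vivo is a separate question.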

Review Publications
Walsh, J.R., Woodhouse, M.R., Andorf, C.M., Sen, T.Z. 2020. Tissue-specific gene expression and protein abundance patterns are associated with fractionation bias in maize. Biomed Central (BMC) Plant Biology. 20. https://doi.org/10.1186/s12870-019-2218-8.
