Location: Corn Insects and Crop Genetics Research
2023 Annual Report
Objectives
Objective 1: Accelerate trait analyses, germplasm analyses, genetic studies, and breeding of soybean and other economically important legume crops through stewardship of genomes, genetic data, genotype data, and phenotype data.
Objective 2: Develop an infrastructure that enhances the integration of genotype and phenotype information and corresponding data sets with query and visualization tools to facilitate efficient plant breeding for soybean and select legume crops.
Objective 3: Collaborate with database developers and plant researchers to develop improved methods and mechanisms for open, standardized data and knowledge exchange to enhance database utility and interoperability.
Objective 4: Provide support and research coordination services for the soybean and other legume research and breeding communities; train new scientists and expand outreach activities through workshops, web-based tutorials, and other communications.
Approach
Incorporate revised primary reference genome sequence for soybean into SoyBase. House and provide access to genome sequences for other soybean accessions, haplotype data, and related annotations. Incorporate revised gene models and annotations into SoyBase. Install or implement web-based tools for curation and improvement of soybean gene models and gene annotations. Incorporate available legume genome sequences and annotations. Working with collaborators, collect and add genetic map and QTL data for crop legumes. Extend web-based tools for navigation among biological sequence data across the legumes. Extend and develop methods and storage capacity for accepting genomic data sets for soybean and other legume species. Develop a complete set of descriptors (ontologies) for soybean biology (anatomy, traits, and development), and for other significant crop legumes as needed. Work with the relevant ontology communities-of-practice to incorporate these descriptors into broadly accessible ontologies. Develop web tutorials for important typical uses of SoyBase and the Legume Clade Database. Present and train about features at relevant conferences and workshops. Regularly seek feedback from users about desired features and usability.
Progress Report
The SoyBase and Legume Clade Database project in Ames, Iowa, focuses on development of two online databases: SoyBase (soybase.org) and the Legume Information System (LIS) (legumeinfo.org). LIS development is a collaboration between ARS researchers in Ames, Iowa, and developers at the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico. Software development is shared by these databases to improve efficiency and user experience.
SoyBase and LIS are used by plant breeders and researchers throughout the U.S. and worldwide as a source of information about the genetic basis for traits important in legume crops. The legume family includes crops such as soybean, chickpea, lentils, peanuts, alfalfa, and many others. SoyBase holds 52 reference-quality soybean genome assemblies and gene annotation sets, and LIS holds 86 genome assemblies and annotation sets for other crop and model legumes. These are integrated with other resources such as genetic markers, information about gene function and expression, and genotype information about many thousands of accessions in the U.S. National Plant Germplasm System. Because the LIS and SoyBase project team has curated genetic and genomic data for soybean and other legume species, these databases are heavily utilized by researchers and breeders worldwide. Plant breeders can use this information to find genetic elements responsible for traits such as flowering time, growth habit, or disease resistance, and directly target traits and select for desired features. The SoyBase database averages about three thousand users per month and the LIS database averages about two thousand users per month. Over the last five years, 1,430 and 428 articles indexed in Google Scholar referenced SoyBase and LIS, respectively. SoyBase was also referenced in 30 U.S. patents, indicating the scientific and commercial utility of these databases and resources.
In support of Objective 1, a central Data Store has been developed to provide data used by SoyBase and LIS for display in the respective websites, as well as for direct access to the underlying files by researchers. By the end of this project period, the Data Store included data for 44 legume species across 20 genera, from Arachis (peanut) to Vigna (cowpea, mung bean, and adzuki bean). The data types include genome assemblies, gene annotations (predicted gene locations and sequences), genetic markers and maps, marker-trait associations (both quantitative trait loci [QTL] and genome-wide association studies [GWAS]), genetic diversity collections, gene expression information, and other data. These files were manually processed and verified by project personnel. The Data Store holds 138 genome assemblies, with 52 for Glycine (including soybean and wild relatives), 32 for Medicago (alfalfa as well as the model species Medicago truncatula), 12 for Vigna (including cowpea, mung bean, and adzuki bean), and 42 for 15 other genera. Beyond genome assemblies and annotations, the largest number of data collections (groups of related files of a particular type and accession) are associated with genetic studies (QTL and GWAS). Altogether, the Data Store includes more than 900 collections and more than 3,000 primary data files.
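The assembly counts reported above can be cross-checked with a short tally. The sketch below is illustrative only: the dictionary labels are chosen here for readability and are not identifiers used by the Data Store.

```python
# Genome assembly counts per group, as reported in the paragraph above.
# The group labels are illustrative, not Data Store identifiers.
assemblies_by_group = {
    "Glycine": 52,
    "Medicago": 32,
    "Vigna": 12,
    "15 other genera": 42,
}


def total_assemblies(counts):
    """Sum the per-group assembly counts."""
    return sum(counts.values())


def non_glycine_assemblies(counts):
    """Sum the counts for every group except Glycine."""
    return sum(n for group, n in counts.items() if group != "Glycine")


if __name__ == "__main__":
    print("All assemblies:", total_assemblies(assemblies_by_group))
    print("Non-Glycine assemblies:", non_glycine_assemblies(assemblies_by_group))
```

The two sums correspond to the Data Store total and to the number of assemblies reported for LIS (legumes other than soybean and its wild relatives).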
Other key components of the data storage are metadata files and the InterMine data warehouse system for incorporating the data and providing web and computer interfaces to the data. The metadata files, which are found in each thematic collection (e.g., the set of gene annotation files for an accession and assembly), describe the origin, references, and other essential information about the files in each collection. InterMine provides a web interface that permits users to query the data and get reports about, for example, genes, markers, or genetic studies. It also provides an application programming interface (API) for programmatic access to the data by other programs, facilitating interoperability of that data. These programmatic access methods are used by the main websites (SoyBase and LIS) for presenting and displaying the information in other ways, but they are also available to other projects to be used similarly in other contexts. There are currently ten mines, each corresponding to one genus in the Data Store: GlycineMine for soybean and relatives, CicerMine for chickpea and relatives, PhaseolusMine for common bean and relatives. There are similar mines for Arachis (peanut), Cajanus (pigeon pea), Lens (lentil), Lupinus (lupin), Medicago (barrel medic), and Vigna (cowpea, mung bean, and adzuki bean).
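As a rough illustration of the programmatic access described above, the following sketch builds (but does not send) a query URL for an InterMine web service. It is a minimal example under stated assumptions, not the project's own client code: the service address, the field names (`primaryIdentifier`, `description`), and the gene identifier are placeholders chosen for illustration and should be checked against the data model of the mine actually being queried.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed service address for one mine; substitute the address of the mine in use.
EXAMPLE_SERVICE_ROOT = "https://mines.legumeinfo.org/glycinemine/service"


def build_path_query(root_class, fields, constraint_path, value):
    """Return an InterMine PathQuery XML string with one equality constraint."""
    view = " ".join(f"{root_class}.{field}" for field in fields)
    return (
        f'<query model="genomic" view={quoteattr(view)}>'
        f'<constraint path={quoteattr(constraint_path)} op="=" value={quoteattr(value)}/>'
        "</query>"
    )


def build_results_url(service_root, query_xml, fmt="json"):
    """Return the URL that would request results for the given query."""
    params = urlencode({"query": query_xml, "format": fmt})
    return f"{service_root}/query/results?{params}"


if __name__ == "__main__":
    # Hypothetical gene identifier, used only to show the shape of the request.
    xml = build_path_query(
        "Gene",
        ["primaryIdentifier", "description"],
        "Gene.primaryIdentifier",
        "EXAMPLE_GENE_ID",
    )
    print(build_results_url(EXAMPLE_SERVICE_ROOT, xml))
```

Keeping URL construction separate from the network call makes the request easy to inspect before it is sent, and the same URL can be fetched from a browser, a script, or another website.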
In support of Objective 2, website work over this project period has consisted of both fundamental redevelopment of the project websites and databases to accommodate the rapidly increasing pace of genomic data generation, and incorporation of new web applications to provide improved user experiences and access to greatly expanded collections of data.
This website redevelopment has enabled both web/database sub-projects to make use of new technologies with improved capacities for handling large and numerous genomic data sets, as well as greater modularity of code and increased use of software container technologies. The transition to the new framework and technologies is mostly complete for LIS. The transition is well underway for SoyBase and is expected to be completed early in the next five-year project cycle (2024-2028).
The following new web applications have been added to help extend and enrich SoyBase and LIS. InterMine instances, available for ten legume genera, enable users to make queries and to see useful reports about most types of data in the project. For example, gene pages include a genome browser view around the gene, expression information, functional descriptions, gene family membership, and information about location and neighboring genes. The Funnotate tool (Berendzen et al., 2021, LOG NO. 379729) enables users to submit one or more sequences to find related gene families. Users can also examine evolutionary relationships between the submitted sequences and similar sequences from other legume species. The Genome Context Viewer (Cleary and Farmer, 2023; https://doi.org/10.1093/nar/gkad391) shows genomic neighborhoods around a submitted gene or region of interest, from related species or accessions of interest. A tool for conducting sequence searches with user-supplied sequences is provided by the third-party SequenceServer application (Priyam et al., 2019; https://doi.org/10.1093/molbev/msz185).
In support of Objective 3, the SoyBase and LIS projects have worked to provide and use more efficient machine access to data. Use of well-defined, stable, web-accessible APIs permits various websites to query one another and access and use data sets at those respective sites. This is one of the strengths and features of the InterMine instances: they provide a mature, well-described API for accessing data programmatically.
The SoyBase and LIS sites use other APIs for data access. For example, both sites make use of germplasm (variety) data at USDA’s GRIN-Global (Germplasm Resources Information Network), to access trait and collection-location data. This information is used for displaying the collection locations of GRIN germplasm on an interactive geographic information system map. The APIs are also used internally, for efficient and stable access to data within the SoyBase and LIS projects, for example accessing genes, genomic sequences from selected regions, or genomic variants from regions and accessions.
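To make the mapping step concrete, the sketch below converts a list of germplasm records into GeoJSON, a format that interactive map libraries commonly accept. It is an assumption-laden illustration: the record fields (`accession`, `latitude`, `longitude`) and the sample values are hypothetical and do not reflect the actual response format of GRIN-Global or the code used by SoyBase and LIS.

```python
import json


def accessions_to_geojson(records):
    """Convert accession records with coordinates into a GeoJSON FeatureCollection.

    Records that lack a latitude or longitude are skipped.
    """
    features = []
    for rec in records:
        lat = rec.get("latitude")
        lon = rec.get("longitude")
        if lat is None or lon is None:
            continue
        features.append(
            {
                "type": "Feature",
                # GeoJSON orders coordinates as [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"accession": rec["accession"]},
            }
        )
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample = [
        {"accession": "EXAMPLE-1", "latitude": 42.03, "longitude": -93.62},
        {"accession": "EXAMPLE-2", "latitude": None, "longitude": None},
    ]
    print(json.dumps(accessions_to_geojson(sample), indent=2))
```

Skipping records without coordinates keeps accessions that have no recorded collection site from being plotted at a misleading default position.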
Objective 4, focusing on research community support and training, was met through outreach efforts and support of community meetings throughout the year. Outreach related to research and community support included responses to more than 55 requests for information or data from the SoyBase database and support of two community meetings: the Soybean Breeders Workshop and the Biennial Meeting of the Molecular and Cellular Biology of Soybean. In addition, numerous job postings and announcements of other community events were disseminated. The project group participated in several working groups of the AgBioData Research Coordination Network (RCN) project, including the Generic Feature Format (GFF3) specification working group, the Ontology Working Group, the Data Federation Working Group, the Diversity Recruitment Working Group, and the Pan-Genome Working Group. A project scientist also serves on the AgBioData Steering Committee. Other research community activities include membership by a project scientist on the Soybean Genetics Committee and the REE Data Stewards Community of Practice.
Another important aspect of research-community support for soybean has been the incorporation of trial data for the Northern and Southern Uniform Soybean Tests. These breeding-variety data, which comprise the results of trials from 1989 to the present, have been collected and added to the SoyBase database. Additionally, 577 new strain pedigrees have been added to the SoyBase Soybean Parentage database. The phenotype, strain, and trial data have been loaded into SoyBase. These data are also being added to the Breeding Insight-sponsored SoybeanBase, which is an instance of the USDA-sponsored BreedBase project.
Outreach to communicate project research in the recently concluded project cycle includes manuscripts describing new features of SoyBase and LIS (Brown et al., 2020, LOG NO. 378108; Berendzen et al., 2021, LOG NO. 379729), methods for predicting rare or novel species-specific genes (Li et al., 2021, LOG NO. 384375), and manuscripts describing the analyses of diverse germplasm collections in mung bean, soybean and its wild relatives, and peanut and its wild relatives (Bertioli et al., 2021, LOG NO. 380899; Chiteri et al., 2022, LOG NO. 389399; Valliyodan et al., 2020, LOG NO. 360407; Otyama et al., 2020, LOG NO. 385124).
Accomplishments
1. Incorporation of 39 new legume genomes into the Legume Information System to aid breeders in development of improved traits to improve yields and decrease pesticide use. Genome sequences describe the order and content of the DNA in all chromosomes of an organism, providing a “road map” for the organism and serving as a common framework or backbone for much of the work done by breeders and other researchers. In 2022, ARS researchers in Ames, Iowa, collected 39 new full genome assemblies and associated sets of gene predictions, across seven legume species, and incorporated these into the Legume Information System (LIS) Data Store. These included 28 new genomes for soybean, two for peanut, four for alfalfa and relatives, two for faba bean, one for mung bean, one for hyacinth bean, and one for the redbud tree. These data will be used by plant breeders and biologists to aid in crop improvement. This information can be used to identify genetic markers for traits such as tolerance to environmental stresses and resistance to pathogens. Breeding and research on this group of crops impacts people worldwide, as legume crops provide protein and other nutrients for most of the global population.
2. Incorporation of the 2022 soybean variety trial data (Northern Uniform Soybean Tests) into SoyBase to aid in identification of superior genetics for breeding. For all major crops, variety trials are used to determine which new varieties are most suited to a particular region or to meet grower and consumer objectives. Traits that are typically assessed in soybean variety trials include yield, tolerance to adverse field conditions such as nutrient deficiencies or pathogens, seed characteristics such as protein and oil concentration and quality, and growth and harvest characteristics such as germination rate and plant architecture at harvest. ARS researchers in Ames, Iowa, have added soybean phenotypic data for 582 testing strains submitted to the Northern Uniform Soybean Tests (NUST). Additionally, parentage information for those strains was added to the SoyBase Soybean Parentage Database. Incorporating phenotypic data on these strains into SoyBase (soybase.org) gives breeders access to performance data of testing strains from 1989 to the present. This will allow breeders to easily see the results of breeding activity across programs, to evaluate any increase in grain yield and other seed quality measurements, and to incorporate strains with superior genetics into their breeding programs.
Review Publications
Kulkarni, R., Zhang, Y., Cannon, S.B., Dorman, K.S. 2022. CAPG: comprehensive allopolyploid genotyper. Bioinformatics. 39(1). Article btac729. https://doi.org/10.1093/bioinformatics/btac729.
Otyama, P.I., Chamberlin, K., Ozias-Akins, P., Graham, M.A., Cannon, E.K.S., Cannon, S.B., MacDonald, G.E., Anglin, N.L. 2022. Genome-wide approaches delineate the additive, epistatic, and pleiotropic nature of variants controlling fatty acid composition in peanut (Arachis hypogaea L.). Genes, Genomes, Genetics. 12(1). Article jkab382. https://doi.org/10.1093/g3journal/jkab382.
Newman, C.S., Andres, R.J., Youngblood, R.C., Campbell, J.D., Simpson, S.A., Cannon, S.B., Scheffler, B.E., Oakley, A.T., Hulse-Kemp, A.M., Dunne, J.C. 2023. Initiation of genomics-assisted breeding in Virginia-type peanuts through the generation of a de novo reference genome and informative markers. Frontiers in Plant Science. 13. Article 1073542. https://doi.org/10.3389/fpls.2022.1073542.