Project : USDA ARS

Research Project: Improving Crop Efficiency Using Genomic Diversity and Computational Modeling

Location: Plant, Soil and Nutrition Research

2023 Annual Report

Objectives
Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A]
Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A]
Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B]
Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A]

Approach
Increasing grass crop productivity is key to feeding the world over the next 50 years. Doing so will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species, the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation to drought, flooding, and temperature stress, and characterize their properties. These approaches and models will be deployed via integration with big data bioinformatics. This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing.

Progress Report
Deleterious mutations are common, with each generation of a plant acquiring a dozen or more. Some are mildly harmful, others nearly lethal. Over recent years, we have shown how these mutations affect yield and hybrid vigor. In the last two years, our collaborations have revealed that harmful mutations, particularly in protein sequences, can be detected by machine learning models trained on all proteins found across life forms. These models, calibrated using recent evolutionary comparisons, can also rank these mutations by their likely impact. Our research has led to improved breeding models and the identification of causal variants in maize, cassava, and, through collaboration, potato. Several seed companies now combine genome editing with these methods to eliminate these harmful variants.

This year, we examined how transposons, mobile DNA elements that make up nearly 60% of the maize genome, impact crop yield. Our largest and most sensitive study found that transposons, despite their diversity in age, size, and location within the genome, only slightly affect fitness and yield. They have evolved to coexist with their host with minimal impact. This underscores that most harmful mutations occur in protein-coding and regulatory sequences.

This year we continued our genomic efforts, focusing on sequencing wild species in the maize and sorghum clade, the Andropogoneae. Forty species have been assembled and annotated to a high quality. The 12 species most closely related to maize are now publicly available through MaizeGDB, while the remaining ones will be available soon in a second-round release. In addition, we have assembled, using short-read DNA sequencing, the gene space of over 400 genomes, including many from the U.S. National Plant Germplasm System (NPGS), and we can now see how individual genes have evolved and adapted to various environments across hundreds of closely related species.
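The evolutionary reasoning used throughout this report (a position that stays unchanged across many related species is more likely to be under functional constraint, so a new variant there is more likely to be harmful) can be illustrated with a toy calculation. The sketch below is not the project's machine learning model. It is a simple per-column conservation score based on Shannon entropy, applied to a small made-up protein alignment; the function name `column_conservation` and the example sequences are hypothetical.

```python
import math
from collections import Counter


def column_conservation(aligned_seqs):
    """Return one conservation score per alignment column, in [0, 1].

    A score of 1.0 means every sequence carries the same residue at that
    column; lower scores mean the column is more variable.
    Gap characters ('-') are ignored when counting residues.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # upper bound for 20 standard amino acids
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in aligned_seqs if seq[i] != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps: no evidence
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Hypothetical alignment of one protein segment from four species.
alignment = ["MKTLV", "MKSLV", "MKALV", "MKTIV"]
scores = column_conservation(alignment)
for position, score in enumerate(scores, start=1):
    print(f"column {position}: conservation = {score:.2f}")
```

In this toy alignment the invariant columns score 1.0, while the third column, which differs in three of the four species, scores lowest. A variant at an invariant column would be a stronger candidate for a deleterious mutation than one at a variable column, although real models draw on far more species and context than this sketch does.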
The first major application of this resource has been to resolve problems with the gene models of maize. Genome annotation (where we model which DNA sequences are transcribed into RNA and then translated into protein) is a significant bioinformatics challenge. Current methods, which focus on functional data from within a species, are often hampered by molecular assay noise and biological system noise. To overcome this, we combined machine learning and evolutionary comparison of nearly 100 sequenced species, enabling us to identify the most functional mRNA and protein. This tool allows us to locate crucial harmful mutations and functional genes, even those that are expressed rarely or only under specific conditions. We also developed a tool that identifies open chromatin across flowering plants, and we are currently developing machine learning models to predict gene expression.

Our most substantial project this year was the launch of the CERCA (Circular Economy for Reimaging Corn Agriculture) project, aimed at revolutionizing sustainable maize production while increasing productivity and efficiency through improved nitrogen cycling. The project involves 27 labs across the U.S., funded through a combination of USDA-ARS, FFAR, industry, and two foundations. The project's primary goals are recycling nitrogen on farms, reducing grain nitrogen demand, returning nitrogen to the soil at season's end, and extending the growing season through cold tolerance. In the first year, germplasm from a wide range of perennial wild relatives was advanced and field evaluations began.

Complementing the CERCA project, we are exploring opportunities to redesign storage proteins, which are crucial for nitrogen provision in seedlings and winter storage in perennials, and are our main source of grain and legume protein. Using machine learning algorithms, we have scanned the maize genome for evolved vegetative storage proteins. While we found a few candidates, proteomic analysis indicates that maize's perennial relatives use numerous proteins to store nitrogen, unlike some other species. These tools, combined with analyses of hundreds of other species' seed storage proteins, are assisting us in redesigning storage proteins for durability, efficiency, nutrition, and digestibility.

Our Practical Haplotype Graph (PHG), a powerful way to represent the haplotype diversity of a crop, has received a number of software improvements that expand the efficiency and accessibility of the system. Several adjustments were made to allow the PHG system to use large public supercomputing systems like the USDA SCINet system, as well as to speed up the system and reduce the computing resources required.

BioKotlin, another library designed to provide high-performance bioinformatics in a scripting environment, is continually being updated with new functionality to make it more usable. The main updates have been adding multiple sequence alignment support, updating the documentation and examples, and adding initial support for parsing GFF, the common file format for storing genomic annotation information.

To make these breeding tools more available, we designed, implemented, and deployed a Breeder Genomics Hub in collaboration with USAID-funded Cornell collaborators. This hub bundles a number of breeding tools into a single computational platform built on the open-source software JupyterHub, giving users immediate access to these tools and to cloud-based computation. JupyterHub supports the R programming language, which many scientists use to analyze data and produce results. To further ease the use of our existing tools like the PHG and TASSEL, a tool for associating traits with genomic information, R interfaces to these tools (rPHG and rTASSEL) have been developed and are available on the Hub. These R interfaces use BrAPI-compliant services, which allow users to load publicly available genotypic and phenotypic data, including our PHG databases, into the JupyterHub environment without needing to download or copy large files. As part of the development of this hub environment, we have given a number of talks, workshops, and poster presentations at various conferences. Currently, a test instance of this system is publicly available for testing the most recent maize build of the PHG.

We have created a new build of the maize PHG using 84 available assemblies to build the graph. This build (v 2.1) was used to impute ~2,000 samples spanning a decade of historical data across several different sequencing technologies for the Genomes to Fields (G2F) project, in support of a prediction competition (Nov-Dec 2022) organized by ARS in Columbia, Missouri, and Raleigh, North Carolina, in collaboration with the University of Wisconsin. Additionally, nearly 5,000 maize landraces from the CIMMYT SEEDs project were imputed, which is allowing a complete reanalysis of this key resource with complete genome sequence. Importantly, we are now identifying a key group of genes involved in temperature adaptation. The public results from this PHG build are available through a BrAPI service hosted by MaizeGDB. In parallel, substantial efforts have been made to generate assembly-based PHG databases for sorghum, in collaboration with HudsonAlpha and ARS Cold Spring Harbor and leveraging USAID funding, and for cassava, with funding from the Bill & Melinda Gates Foundation.

Breeding Insight (BI) is the ARS initiative to increase the adoption of genomics, phenomics, and analytics tools (including data management software) in ARS specialty crop and animal breeding programs, which have lagged behind major crop and animal breeding programs. BI is currently in year 5 (phase II) and its sister program, BI OnRamp, is in year 3. Together, BI and OnRamp provide breeding support services for 19 ARS species (blueberry, table grape, sweetpotato, alfalfa, rainbow trout, North American Atlantic salmon, honeybee, strawberry, cranberry, oat, pecan, lettuce, cucumber, sorghum, hemp, citrus, sugarcane, soybean, and cotton), with BI providing support to multiple breeder programs for some species. The future goal is expansion to all ARS specialty crop, animal, and natural resource breeding programs.

In FY 2022-2023, BI’s most significant accomplishment was proving the feasibility of using haplotype genotyping (from custom 3K panels) to create genetic maps and to run GWAS, QTL analysis, and genomic predictions for two autotetraploid species, blueberry and alfalfa, and one autohexaploid species, sweetpotato. Over 10,000 alfalfa and more than 8,000 blueberry samples were genotyped. This is a substantial and important leap forward in capabilities and genomic insight from 2019, when there were no genetic markers available for these species. We have created a genotyping ordering database to manage all orders across all species. Furthermore, the genotyping data prove that breeders of species with complex polyploid genomes, a common feature in specialty crops and a major blocker for genomic resource creation, can successfully incorporate genomic insight into their breeding programs. These panels are also having an impact beyond ARS stakeholders, as public breeders in academia and private breeders in industry around the globe genotype their own breeding material and contribute data back to the public upon publication. The adoption of the genotyping platform and pipeline benefits the entire global breeding effort: all breeders have access to the same markers, which improves data sharing in line with FAIR data principles.

Given this success, Breeding Insight created additional species-specific 3K marker panels for cranberry, pecan, lettuce, and cucumber, none of which had routine or affordable genotyping platforms available to them. Putting these powerful analyses and genomic tools into the hands of ARS’s excellent specialty crop and animal breeders helps to improve breeding decisions and to meet public demand for more nutritious and flavorful foods.
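Returning to the software updates described earlier in this report: BioKotlin gained initial support for parsing the GFF annotation format. As a rough illustration of what such parsing involves, the sketch below reads a single GFF3 feature line, which has nine tab-separated columns (sequence ID, source, feature type, start, end, score, strand, phase, and attributes). It is written in Python for illustration and does not show BioKotlin's API; the `GffFeature` class, the `parse_gff3_line` function, and the example record are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import unquote


@dataclass
class GffFeature:
    seqid: str
    source: str
    type: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the column is "."
    strand: str
    phase: Optional[int]    # None when the column is "."
    attributes: Dict[str, str]


def parse_gff3_line(line):
    """Parse one GFF3 line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields

    # Column 9 holds semicolon-separated key=value pairs, URL-escaped.
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[unquote(key)] = unquote(value)

    return GffFeature(
        seqid=seqid,
        source=source,
        type=ftype,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        phase=None if phase == "." else int(phase),
        attributes=attributes,
    )


# Hypothetical record, not taken from any real annotation.
example = "chr1\texample\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=hypothetical_gene"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["ID"])
```

A production parser would also need to handle directives such as `##FASTA`, multi-valued attributes, and parent-child relationships between features, which this sketch leaves out.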

Accomplishments
1. Natural and synthetic nitrogen is lost from our food systems before it reaches the consumer. Over 80% of natural and synthetic nitrogen is lost from our food systems before it reaches the consumer, contributing to 97% of US agricultural greenhouse gas emissions (nitrous oxide, methane) and over 60% of water pollution. The CERCA (Circular Economy for Reimaging Corn Agriculture) project being launched this year focuses on corn, the single largest player in the US agricultural nitrogen system. The goal of this project is to develop corn genetics, in concert with agronomy, that reduce corn’s environmental impact by well over 50% by shifting the growing season earlier to capture natural soil nitrogen, reducing corn’s demand for nitrogen, and recycling nitrogen back to the soil at the end of the season, as perennials do. The CERCA project, led by USDA scientists from across the country and university collaborators (27 labs in total), has initiated integrated research covering modeling, agronomy, genetics, and physiology to accomplish these goals.

2. Specialty crops and livestock are central to human nutrition, wellbeing, and cultural preservation. Together their production is responsible for over $150 billion in cash receipts. The USDA-ARS and university partner breeder teams who work on these species develop outstanding biological and practical know-how but often lack specialized expertise in genomic DNA-based breeding or advanced information technologies. USDA-ARS Breeding Insight, in collaboration with Cornell University, centralizes that expertise and adds the flexibility to apply advanced genomic and information/automation technologies to these many idiosyncratic species across the country. This year, the project expanded to support 19 species, with 32 breeding teams across 18 states. Genomic markers that accelerate breeding were developed for an additional 5 species (70% increase from last year). Nearly 40,000 potential new varieties were genomically evaluated (80% increase from last year). Information and machine learning technologies deployed to 18 species have integrated historical data and automated the collection of new field data, resulting in a 190% increase in databased records from last year. Centralization and flexibility are working to enable a scaling not seen before within USDA specialty crops and livestock. Having more data, effectively organized, matters: in sugarcane, blueberry, citrus, and sweetpotato, ARS breeders are collecting data faster and with fewer errors while, for the first time, leveraging all aggregated historical datasets to improve precision in selection. In partnership with ARS breeders in St. Paul, Minnesota, and Prosser, Washington, working on alfalfa, the collaboration identified genomic markers for key disease resistance despite alfalfa's complex genome; these markers will accelerate the delivery of highly digestible feed alfalfa to farmers. Breeding Insight helps US breeders accelerate the delivery of nutritious and resilient crops and livestock.
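Part of what makes genomes like alfalfa's complex for marker work is polyploidy; the Progress Report describes alfalfa and blueberry as autotetraploid and sweetpotato as autohexaploid. At a biallelic marker, a diploid has three genotype classes, but an autopolyploid with ploidy p can carry anywhere from 0 to p copies of the alternate allele, giving p + 1 dosage classes that genotyping must distinguish. The sketch below simply enumerates those classes; the function name `dosage_classes` and the allele labels are hypothetical, and the code does not represent Breeding Insight's genotyping pipeline.

```python
def dosage_classes(ploidy, ref="A", alt="B"):
    """List the genotype classes at one biallelic marker.

    Each class is written as a string of allele labels, ordered from
    zero copies of the alternate allele up to `ploidy` copies.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return [ref * (ploidy - copies) + alt * copies for copies in range(ploidy + 1)]


for name, ploidy in [("diploid", 2), ("autotetraploid", 4), ("autohexaploid", 6)]:
    classes = dosage_classes(ploidy)
    print(f"{name}: {len(classes)} classes: {', '.join(classes)}")
```

Going from three classes in a diploid to five in an autotetraploid and seven in an autohexaploid means the differences in signal between adjacent classes become smaller, which is one reason marker development in these species is harder.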

Review Publications
Monier, B., Casstevens, T.M., Bradbury, P., Buckler IV, E.S. 2022. rTASSEL: An R interface to TASSEL for analyzing genomic diversity. Journal of Open Source Software. https://doi.org/10.21105/joss.04530.
Washburn, J.D., Cimen, E., Ramstein, G., Reeves, T., O'Briant, P., McLean, G., Cooper, M., Hammer, G., Buckler IV, E.S. 2021. Predicting phenotypes from genetic, environment, management, and historical data using CNNs. Theoretical and Applied Genetics. 134:3997–4011. https://doi.org/10.1007/s00122-021-03943-7.
Long, E.K., Romay, M., Ramstein, G., Buckler IV, E.S., Robbins, K.R. 2023. Utilizing evolutionary conservation to detect deleterious mutations and improve genomic prediction in cassava. Frontiers in Plant Science. 13:1041925. https://doi.org/10.3389/fpls.2022.1041925.
Wrightsman, T., Marand, A.P., Crisp, P.A., Springer, N.M., Buckler IV, E.S. 2022. Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks. The Plant Genome. 15(3):e20249. https://doi.org/10.1002/tpg2.20249.
Khaipho-Burch, M., Ferebee, T., Giri, A., Ramstein, G., Monier, B., Yi, E., Romay, M., Buckler IV, E.S. 2023. Elucidating the patterns of pleiotropy and its biological relevance in maize. PLoS Genetics. 19(3):e1010664. https://doi.org/10.1371/journal.pgen.1010664.
Bradbury, P.J., Casstevens, T., Jensen, S.E., Johnson, L.E., Miller, Z.R., Monier, B., Romay, M., Song, B., Buckler IV, E.S. 2022. The practical haplotype graph, a platform for storing and using pangenomes for imputation. Bioinformatics. 38(15):3698-3702. https://doi.org/10.1093/bioinformatics/btac410.
Samayoa, L., Olukolu, B.A., Yang, C.J., Chen, Q., Stetter, M.G., York, A.M., Sanchez-Gonzalez, J., Glaubitz, J.C., Bradbury, P., Cinta Romay, M., Sun, Q., Yang, J., Ross-Ibarra, J., Buckler IV, E.S., Doebley, J.F., Holland, J.B. 2021. Domestication reshaped the genetic basis of inbreeding depression in a maize landrace compared to its wild relative, teosinte. PLoS Genetics. 2:1009797. https://doi.org/10.6084/m9.figshare.14750790.
Lima, D.C., Washburn, J.D., Varela, J.I., Chen, Q., Gage, J.L., Romay, M.C., Holland, J.B., Ertl, D., Lopez-Cruz, M., Aguate, F.M., De Los Campos, G., Kaeppler, S., Beissinger, T., Bohn, M., Buckler IV, E.S., Edwards, J.W., Flint Garcia, S.A., Gore, M.A., Hirsch, C.N., Knoll, J.E., Mckay, J., Minyo, R., Murray, S.C., Ortez, O.A., Schnable, J., Sekhon, R.S., Singh, M.P., Sparks, E.E., Thompson, A., Tuinstra, M., Wallace, J., Weldekidan, T., Xu, W., De Leon, N. 2023. Genomes to fields 2022 maize genotype by environment prediction competition. BMC Research Notes. 16: Article 148. https://doi.org/10.1186/s13104-023-06421-z.

Plant, Soil and Nutrition Research: Ithaca, NY