Research Project: Curation of Nutrient-Associated Enzymes and Machine Learning Analysis of Mutational Effects in Enzymatic Activity in Wheat and Barley Pangenomes

Location: Crop Improvement and Genetics Research

Project Number: 2030-21000-056-009-S
Project Type: Non-Assistance Cooperative Agreement

Start Date: Sep 1, 2024
End Date: Aug 31, 2026

Objective:
The primary goal of this summer residency proposal is to equip the Cooperator PI with skills and research linkages at an ARS laboratory (Albany, CA) to conduct cooperative research of mutual interest that is relevant to the ARS mission. This sabbatical grant will allow the Cooperator PI to receive training from an ARS mentor and to improve the PI's seed research expertise in the newest analysis tools. The specific project objectives are:
Objective 1: Harness the outcomes of computational annotation pipelines to identify a set of enzymes associated with the nutrients zinc, phosphorus, and nitrogen in the wheat and barley pangenomes.
Objective 2: Enhance this set of nutrient-associated enzymes based on experimental evidence, and curate and display them in the GrainGenes small grains community database (https://wheat.pw.usda.gov).
Objective 3: Design machine learning pipelines using large language models to model the impact of mutations in nutrient-associated enzymes on their activities, in order to develop more realistic sequence-activity models.

Approach:
Obj. 1: Computational pipelines for gene functional annotation are available for plants and are widely used for many genome assemblies; such pipelines were used for the functional annotations of bread wheat and barley. For some other plants, such as einkorn, other pipelines were used that harness protein sequence similarities against functional annotation databases. As part of this Objective, we will use what is already available in public databases and the scientific literature to gather a detailed list of nutrition-associated enzymes, their sequences, and their enzymatic activity profiles for wheat, barley, and related species, such as einkorn.

Obj. 2: Obtaining a set of automatically generated nutrient-associated enzymes from the scientific literature is already valuable for researchers, as these enzyme family groups may reveal gaps in scientific knowledge. However, genome assemblies and their annotations are generated in bulk using automated pipelines and may not always produce the most accurate results for specific gene and protein families. In Objective 2, we will aim to improve the structural and functional gene annotations manually by comparing the outcomes of genome assembly and annotation pipelines against the scientific literature, focusing on enzymes individually. This comparison will help enhance 1) the structural annotations of genes, 2) the functional annotations of genes, where additional annotation will be added through literature search, and 3) features that define enzymatic activity, where we will, for example, delineate “hot spots” in enzymes that contribute to enzymatic activity. We will then create genome browser tracks and associated pages in GrainGenes (https://wheat.pw.usda.gov) to display these highly curated sets in their genomic contexts and make them available for data download.

Obj. 3: Generating a resource of highly curated enzymes for nutrition is useful in itself to demonstrate how mutations influence nutritional content in plants. In addition, we will harness this resource as a training set for machine learning methods that associate mutations with their effects on enzymatic activity in nutrition-associated enzymes. We will use simple large language models and other machine learning methods, such as neural networks and support vector machines, trained on experimentally available enzymatic activity data to create predictive sequence-activity models. These models are intended to help plant breeders form hypotheses about which mutations would be more desirable for increasing the activity of specific nutrition-associated enzymes.
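
To make the idea of a sequence-activity model concrete, the following is a minimal Python sketch, not the project's pipeline. It swaps in a much simpler technique than the language models, neural networks, or support vector machines named above: a one-hot encoding of each variant sequence and a ridge regression fitted by plain gradient descent. The wild-type sequence, the substitutions, the activity values, and all function names are hypothetical and chosen only for illustration.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy data: (substitutions, relative activity). Not real measurements.
WILD_TYPE = "MKTAY"
TOY_DATA = [
    ([], 1.0),       # wild type
    (["K2R"], 1.4),  # assumed activity-increasing substitution
    (["T3A"], 0.6),  # assumed activity-decreasing substitution
    (["A4G"], 0.9),
]


def apply_mutations(wild_type, mutations):
    """Apply substitutions such as 'K2R' (1-based positions) to a sequence."""
    residues = list(wild_type)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation: {mutation}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= pos <= len(wild_type) or wild_type[pos - 1] != ref:
            raise ValueError(f"{mutation} does not match the wild-type sequence")
        residues[pos - 1] = alt
    return "".join(residues)


def one_hot(sequence):
    """Encode a sequence as 20 indicator features per position."""
    features = []
    for residue in sequence:
        features.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return features


def fit_ridge(rows, targets, l2=0.01, learning_rate=0.05, epochs=2000):
    """Fit a linear model with an L2 penalty on the weights by gradient descent."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [l2 * w for w in weights]
        grad_b = 0.0
        for row, target in zip(rows, targets):
            error = bias + sum(w * x for w, x in zip(weights, row)) - target
            grad_b += error / n
            for j, x in enumerate(row):
                if x:
                    grad_w[j] += error * x / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(model, sequence):
    """Predict the relative activity of a sequence with a fitted model."""
    weights, bias = model
    return bias + sum(w * x for w, x in zip(weights, one_hot(sequence)))


def train_toy_model():
    """Train on the hypothetical variants listed in TOY_DATA."""
    rows = [one_hot(apply_mutations(WILD_TYPE, muts)) for muts, _ in TOY_DATA]
    targets = [activity for _, activity in TOY_DATA]
    return fit_ridge(rows, targets)


if __name__ == "__main__":
    model = train_toy_model()
    for muts in ([], ["K2R"], ["T3A"], ["K2R", "A4G"]):
        variant = apply_mutations(WILD_TYPE, muts)
        print(muts or "wild type", variant, round(predict(model, variant), 2))
```

Because the model is linear, its prediction for an unseen double mutant such as K2R plus A4G is simply the sum of the learned single-substitution effects. Real enzymes often deviate from this additivity, which is one reason more expressive models may be needed for realistic sequence-activity prediction.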