
Research Project: MaizeGDB: Enabling Access to Basic, Translational, and Applied Research Information

Location: Corn Insects and Crop Genetics Research

Title: FASSO: An AlphaFold based method to assign functional annotations by combining sequence and structure orthology

Author
Andorf, Carson
SEN, SHATABDI - Iowa State University
HAYFORD, RITA - ORISE Fellow
Portwood, John
Cannon, Ethalinda
HARPER, LISA - Retired ARS Employee
GARDINER, JACK - University of Missouri
Sen, Taner
Woodhouse, Margaret

Submitted to: bioRxiv
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/12/2022
Publication Date: 11/15/2022
Citation: Andorf, C.M., Sen, S., Hayford, R.K., Portwood II, J.L., Cannon, E.K., Harper, L.C., Gardiner, J.M., Sen, T.Z., Woodhouse, M.H. 2022. FASSO: An AlphaFold based method to assign functional annotations by combining sequence and structure orthology. bioRxiv. https://doi.org/10.1101/2022.11.10.516002.
DOI: https://doi.org/10.1101/2022.11.10.516002

Interpretive Summary: Advances in sequencing technologies have allowed researchers to generate whole-genome assemblies for a wide range of organisms, including most crop species. However, determining the function of genes and their relationship to traits remains expensive and labor-intensive. For this reason, functional annotation in most plant species relies on inferring gene function from protein sequence similarity with other genomes. This process is limited because it relies solely on sequence data and does not consider the 3-D structure of proteins and its role in function. With recent advances in machine learning, more than 200 million predicted protein structures are now available, including predictions for all proteins in dozens of key organisms. We developed a software pipeline that functionally annotates proteins by finding annotated proteins from other species with both sequence and structure similarity. We used our method to annotate five plant species (maize, sorghum, rice, soybean, Arabidopsis) and three well-annotated outgroups (human, budding yeast, and fission yeast). The approach assigned 270,000 functional annotations across the eight proteomes, including annotations for over 5,600 proteins with previously unknown functions. These results demonstrate the benefit of using both sequence and structure similarity to provide high-quality annotations for plant species lacking experimentally validated functions.

Technical Abstract: Methods to predict orthology play an important role in bioinformatics for phylogenetic analysis by identifying orthologs within or across any level of biological classification. Sequence-based reciprocal best hit approaches are commonly used in functional annotation because orthologous genes are expected to share functions. This process is limited because it relies solely on sequence data and does not consider structural information and its role in function. Previously, determining protein structure was highly time-consuming, inaccurate, and limited by the size of the protein, all of which resulted in a structural biology bottleneck. With the release of AlphaFold, there are now over 200 million predicted protein structures, including full proteomes for dozens of key organisms. The reciprocal best structural hit approach uses protein structure alignments to identify structural orthologs. We propose combining sequence- and structure-based reciprocal best hit approaches to obtain a more accurate and complete set of orthologs across diverse species, a method called Functional Annotations using Sequence and Structure Orthology (FASSO). Using FASSO, we annotated orthologs between five plant species (maize, sorghum, rice, soybean, Arabidopsis) and three distant outgroups (human, budding yeast, and fission yeast). We inferred over 270,000 functional annotations across the eight proteomes, including annotations for over 5,600 uncharacterized proteins. FASSO provides confidence labels on ortholog predictions and flags potential misannotations in existing proteomes. We further demonstrate the utility of the approach by exploring the annotation of the maize proteome.
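
The reciprocal best hit idea described in the abstract can be illustrated with a minimal Python sketch. This is not the FASSO implementation: the abstract does not specify the alignment tools, scoring scheme, or label names, so the function names, the toy protein IDs, the scores, and the three labels ("sequence+structure", "conflict", and the "-only" labels) are all assumptions made for illustration. The sketch assumes each search has already produced, for every query protein, a list of (target, score) pairs where a higher score means a better match.

```python
from typing import Dict, List, Optional, Tuple

# query ID -> list of (target ID, score); higher score = better match (assumed)
Hits = Dict[str, List[Tuple[str, float]]]


def best_hit(hits: Hits) -> Dict[str, str]:
    """Map each query to its single highest-scoring target."""
    return {q: max(targets, key=lambda t: t[1])[0]
            for q, targets in hits.items() if targets}


def reciprocal_best_hits(a_to_b: Hits, b_to_a: Hits) -> Dict[str, str]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hit(a_to_b)
    best_ba = best_hit(b_to_a)
    return {a: b for a, b in best_ab.items() if best_ba.get(b) == a}


def combine(seq_rbh: Dict[str, str], struct_rbh: Dict[str, str]
            ) -> Dict[str, Tuple[Optional[str], Optional[str], str]]:
    """Merge sequence and structure RBH sets with a hypothetical label."""
    combined = {}
    for a in sorted(set(seq_rbh) | set(struct_rbh)):
        s, t = seq_rbh.get(a), struct_rbh.get(a)
        if s is not None and t is not None:
            label = "sequence+structure" if s == t else "conflict"
        elif s is not None:
            label = "sequence-only"
        else:
            label = "structure-only"
        combined[a] = (s, t, label)
    return combined


# Toy data only: IDs and scores are invented, not taken from the study.
seq_ab = {"zm1": [("at1", 250.0), ("at2", 90.0)],
          "zm2": [("at2", 120.0)],
          "zm3": [("at3", 80.0)]}
seq_ba = {"at1": [("zm1", 245.0)],
          "at2": [("zm2", 118.0)],
          "at3": [("zm4", 200.0), ("zm3", 80.0)]}
struct_ab = {"zm1": [("at1", 0.92)],
             "zm2": [("at4", 0.70), ("at2", 0.60)],
             "zm3": [("at3", 0.85)]}
struct_ba = {"at1": [("zm1", 0.91)],
             "at4": [("zm2", 0.72)],
             "at3": [("zm3", 0.84)]}

seq_rbh = reciprocal_best_hits(seq_ab, seq_ba)
struct_rbh = reciprocal_best_hits(struct_ab, struct_ba)
combined = combine(seq_rbh, struct_rbh)
```

In this toy example, a pair supported by both searches receives the strongest label, a pair found only by structure shows how structural similarity could add orthologs that sequence alone misses, and a disagreement between the two searches is the kind of case that might be flagged for review.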