Research Project: Enabling Mechanistic Allele Mining to Accelerate Genomic Selection for New Agro-Ecosystems

Location: Plant, Soil and Nutrition Research

Title: Fishing for a reelGene: evaluating gene models with evolution and machine learning

Author
SCHULZ, AIMEE - Cornell University
ZHAI, JINGJING - Cornell University
AUBUCHON-ELDER, TAYLOR - Danforth Plant Science Center
EL-WALID, MOHAMED - Cornell University
FEREBEE, TAYLOR - Cornell University
GILMORE, ELIZABETH - Cornell University
HUFFORD, MATTHEW - Iowa State University
JOHNSON, LYNN - Cornell University
KELLOGG, ELIZABETH - Danforth Plant Science Center
LA, THUY - Cornell University
LONG, EVAN - Cornell University
MILLER, ZACHARY - Cornell University
ROMAY, M CINTA - Cornell University
SEETHARAM, ARUN - Iowa State University
STITZER, MICHELLE - Cornell University
WRIGHTSMAN, TRAVIS - Cornell University
Buckler, Edward - Ed
MONIER, BRANDON - Cornell University
HSU, SHENG-KAI - Cornell University

Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 9/29/2024
Publication Date: 9/29/2024
Citation: Schulz, A.J., Zhai, J., Aubuchon-Elder, T., El-Walid, M., Ferebee, T.H., Gilmore, E.H., Hufford, M.B., Johnson, L.C., Kellogg, E.A., La, T., Long, E., Miller, Z.R., Romay, M., Seetharam, A.S., Stitzer, M.C., Wrightsman, T., Buckler IV, E.S., Monier, B., Hsu, S. 2024. Fishing for a reelGene: evaluating gene models with evolution and machine learning. bioRxiv. https://doi.org/10.1101/2023.09.19.558246.
DOI: https://doi.org/10.1101/2023.09.19.558246

Interpretive Summary: Scientists have made significant progress in understanding how genes function by examining the genetic codes of different organisms. However, each newly decoded genetic sequence gives rise to new hypotheses about gene function. Some of these hypotheses prove incorrect because of factors such as misidentified gene types, mobile genetic elements, or errors in interpreting the genetic code. To address this challenge, researchers developed a computer program called reelGene, which uses machine learning to improve the precision of gene predictions. The program analyzes patterns in the genetic code and compares them with similar patterns in related species. This allows reelGene to evaluate the integrity of each gene prediction and assess whether the predicted gene can produce a functional protein. When tested on maize (corn) genes, reelGene found that nearly 28% of previous gene predictions were inaccurate or nonfunctional. In summary, reelGene helps scientists better understand genes and how they work by checking the accuracy of gene predictions against a wealth of genetic information from closely related species.

Technical Abstract: Assembled genomes and their associated annotations have transformed our study of gene function. However, each new assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million gene models in maize, reelGene found that 28% were incorrectly annotated or nonfunctional. By leveraging a large cohort of related species and learning the conserved grammar of proteins, reelGene provides a tool for both evaluating gene model accuracy and studying genome biology.