Location: Plant, Soil and Nutrition Research
Title: Fishing for a reelGene: evaluating gene models with evolution and machine learning
Author
SCHULZ, AIMEE - Cornell University
ZHAI, JINGJING - Cornell University
AUBUCHON-ELDER, TAYLOR - Danforth Plant Science Center
EL-WALID, MOHAMED - Cornell University
FEREBEE, TAYLOR - Cornell University
GILMORE, ELIZABETH - Cornell University
HUFFORD, MATTHEW - Iowa State University
JOHNSON, LYNN - Cornell University
KELLOGG, ELIZABETH - Danforth Plant Science Center
LA, THUY - Cornell University
LONG, EVAN - Cornell University
MILLER, ZACHARY - Cornell University
ROMAY, M CINTA - Cornell University
SEETHARAM, ARUN - Iowa State University
STITZER, MICHELLE - Cornell University
WRIGHTSMAN, TRAVIS - Cornell University
Buckler, Edward - Ed
MONIER, BRANDON - Cornell University
HSU, SHENG-KAI - Cornell University
Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 9/29/2024
Publication Date: 9/29/2024
Citation: Schulz, A.J., Zhai, J., Aubuchon-Elder, T., El-Walid, M., Ferebee, T.H., Gilmore, E.H., Hufford, M.B., Johnson, L.C., Kellogg, E.A., La, T., Long, E., Miller, Z.R., Romay, M., Seetharam, A.S., Stitzer, M.C., Wrightsman, T., Buckler IV, E.S., Monier, B., Hsu, S. 2024. Fishing for a reelGene: evaluating gene models with evolution and machine learning. bioRxiv. https://doi.org/10.1101/2023.09.19.558246.
DOI: https://doi.org/10.1101/2023.09.19.558246
Interpretive Summary: Scientists have made significant progress in understanding how genes function by examining the genetic codes of different organisms. However, with each new genetic sequence decoded, there arise new hypotheses about gene function. Occasionally, these hypotheses prove incorrect due to factors like misidentifying specific types of genes, mobile genetic elements, or errors in interpreting the genetic code. To address this challenge, researchers have developed a computer program called reelGene, employing machine learning techniques to enhance the precision of gene predictions. This program analyzes patterns in the genetic code and compares them to similar patterns in other related species. This allows reelGene to evaluate the integrity of each gene prediction and assess whether they are producing functional proteins. When tested on maize (corn) genes, reelGene identified that nearly 28% of previous gene predictions were inaccurate or nonfunctional. In summary, reelGene is a valuable tool that aids scientists in gaining a deeper understanding of genes and their operations by meticulously verifying the accuracy of their predictions, drawing on a wealth of genetic information from closely related species.
Technical Abstract: Assembled genomes and their associated annotations have transformed our study of gene function. However, each new assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million gene models in maize, reelGene found that 28% were incorrectly annotated or nonfunctional. By leveraging a large cohort of related species and through learning the conserved grammar of proteins, reelGene provides a tool for both evaluating gene model accuracy and genome biology.