
Research Project: Mapping Crop Genome Functions for Biology-Enabled Germplasm Improvement

Location: Plant, Soil and Nutrition Research

Title: Double triage to identify poorly annotated genes in maize: The missing link in community curation

Author
TELLO-RUIZ, MARCELA - Cold Spring Harbor Laboratory
MARCO, CHRISTINA - Cold Spring Harbor Laboratory
HSU, FEI-MAN - University Of Tokyo
KHANGURA, RAJDEEP - Purdue University
QIAO, PENGFEI - Cornell University
SAPKOTA, SIRJAN - Clemson University
STISZER, MICHELLE - UC Davis Medical Center
WASIKOWSKI, RACHAEL - University Of Toledo
WU, HAO - Iowa State University
JUNPENG, ZHAN - University Of Arizona
CHOUGULE, KAPEEL - Cold Spring Harbor Laboratory
BARONE, LINDSAY - Cold Spring Harbor Laboratory
GHIBAN, CORNEL - Cold Spring Harbor Laboratory
MUNA, DEMITRI - Cold Spring Harbor Laboratory
OLSON, ANDREW - Cold Spring Harbor Laboratory
WANG, LIYA - Cold Spring Harbor Laboratory
Ware, Doreen
MICKLOS, DAVID - Cold Spring Harbor Laboratory

Submitted to: PLoS Computational Biology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 10/5/2019
Publication Date: 10/28/2019
Citation: Tello-Ruiz, M.K., Marco, C.F., Hsu, F., Khangura, R.S., Qiao, P., Sapkota, S., Stiszer, M.C., Wasikowski, R., Wu, H., Junpeng, Z., Chougule, K., Barone, L.C., Ghiban, C., Muna, D., Olson, A.C., Wang, L.C., Ware, D., Micklos, D.A. 2019. Double triage to identify poorly annotated genes in maize: The missing link in community curation. PLoS Computational Biology. 14(10). https://doi.org/10.1371/journal.pone.0224086.
DOI: https://doi.org/10.1371/journal.pone.0224086

Interpretive Summary: Maize is the most important cereal crop worldwide and a crucial staple for global food security. The maize reference genome sequence was published a decade ago, and a combination of genetic and molecular evidence and automated software programs has been used to identify genes and other genome features. The process of identifying the locations of genes and determining what those genes do is called gene annotation. Although improvements in sequencing technologies have dramatically increased the number and quality of genome assemblies, gene prediction software remains error prone, so accurate gene annotation continues to be a challenge. We estimated that in the current maize sequence assembly, about 13% of protein-coding transcripts are poorly supported by the available biological evidence, which suggests that their structural annotation is incorrect. Manual gene curation offers a reliable, although laborious, approach to this problem. In this study, we worked with graduate students to identify genes with suspect structures using two different approaches and to build new gene structures for them. The approach offers the potential to support community curation of eukaryotic genomes by scientists, students, and potentially even citizen scientists.

Technical Abstract: The sophistication of gene prediction algorithms and the abundance of RNA-based evidence for the maize genome may suggest that manual curation of gene models is no longer necessary. However, quality metrics generated by the MAKER-P gene annotation pipeline identified 17,225 of 130,330 (13%) protein-coding transcripts in the B73 Reference Genome V4 gene set as having models with low concordance to available biological evidence. Working with eight graduate students, we used the Apollo annotation editor to curate 86 transcript models flagged by quality metrics and by a complementary method using the Gramene gene tree visualizer. All of the triaged models had significant errors, including missing or extra exons, non-canonical splice sites, and incorrect UTRs. A correct transcript model existed for about 60% of genes (or transcripts) flagged by quality metrics; we attribute this to the convention of elevating the transcript with the longest coding sequence (CDS) to the canonical, or first, position. The remaining 40% of flagged genes resulted in novel annotations and represent a manual curation space of about 10% of the maize genome (~4,000 protein-coding genes). MAKER-P metrics have a specificity of 100% and a sensitivity of 85%; the gene tree visualizer has a specificity of 100%. Together with the Apollo graphical editor, our double triage provides an infrastructure to support the community curation of eukaryotic genomes by scientists, students, and potentially even citizen scientists.
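
For readers unfamiliar with the figures quoted in the abstract, the following minimal sketch reproduces the 13% share from the two transcript counts given above and shows how sensitivity and specificity are conventionally defined. The function names and the confusion-matrix counts are hypothetical, chosen only so the arithmetic yields the reported 85% and 100%; they are not the counts from the study.

```python
import math

# Counts taken directly from the abstract.
FLAGGED_TRANSCRIPTS = 17_225
TOTAL_TRANSCRIPTS = 130_330

# Share of protein-coding transcripts flagged by the quality metrics.
flagged_share = FLAGGED_TRANSCRIPTS / TOTAL_TRANSCRIPTS


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly faulty models that the triage flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of correct models that the triage leaves unflagged."""
    return true_neg / (true_neg + false_pos)


# HYPOTHETICAL counts for illustration only (not from the paper):
# 17 of 20 faulty models flagged, and no correct model flagged.
example_sensitivity = sensitivity(true_pos=17, false_neg=3)
example_specificity = specificity(true_neg=20, false_pos=0)

if __name__ == "__main__":
    print(f"Flagged share: {flagged_share:.1%}")
    print(f"Example sensitivity: {example_sensitivity:.0%}")
    print(f"Example specificity: {example_specificity:.0%}")
```

A specificity of 100% means that no correctly annotated model was flagged, which is consistent with the statement that all of the triaged models had significant errors.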