
Research Project: GrainGenes: Enabling Data Access and Sustainability for Small Grains Researchers

Location: Crop Improvement and Genetics Research

Title: LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants

Author
CAGIRICI, BUSRA - Oak Ridge Institute For Science And Education (ORISE)
GALVEZ, SERGIO - University Of Malaga
Sen, Taner
BUDAK, HIKMET - Montana Bioagriculture Inc

Submitted to: Functional and Integrative Genomics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 1/25/2021
Publication Date: 2/26/2021
Citation: Cagirici, B.H., Galvez, S., Sen, T.Z., Budak, H. 2021. LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants. Functional and Integrative Genomics. 21(1): 195-204. https://doi.org/10.1007/s10142-021-00769-w.
DOI: https://doi.org/10.1007/s10142-021-00769-w

Interpretive Summary: Wheat is among the most important crops for human nutrition. However, our understanding of how genes contribute to wheat plant traits is limited. Identifying the regulatory elements of the wheat genome will help improve wheat breeding by providing information about specific genes for selection. Long noncoding ribonucleic acids (lncRNAs) play critical roles in cell regulation. In this study, genome sequences that produce lncRNAs are predicted using a new machine learning methodology called LncMachine. The prediction performance of LncMachine was compared with that of currently available methods and shown to be superior. The program code, prediction models, and the training/test datasets are freely and publicly available.

Technical Abstract: Long noncoding RNAs (lncRNAs) have attracted considerable attention in recent years, following elucidation of the critical roles they play in numerous important biological processes. Manual annotation of lncRNAs is restricted by known gene annotations and is prone to false predictions due to the incompleteness of available data. However, with the advent of high-throughput sequencing technologies, a large volume of high-quality data has become available for annotation, especially for plant species such as wheat. Here, we compared the prediction accuracies of several machine learning algorithms using 10-fold cross-validation. This study includes a comprehensive feature selection step to remove irrelevant and redundant features. We present LncMachine, an alignment-free coding potential prediction tool that, with the Random Forest algorithm, is specific to crop species and achieves higher accuracies than the currently available popular tools (CPC2, CPAT, and CNIT). In addition, LncMachine with Random Forest also performed well on human and mouse data, with an average accuracy of 92.67%. LncMachine can run several algorithms in real time and provide the best model for a specific study. It accepts either a FASTA file or a tab-separated CSV file containing features for each sample. As it is open to implementation, LncMachine can be applied to a wide range of studies.
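
To illustrate what "alignment-free" input preparation can look like, the following minimal Python sketch reads FASTA records, computes a few sequence-derived features, and writes them as a tab-separated table with one row per sample. It is not LncMachine's code. The feature set (length, GC content, longest forward-strand ORF, ORF coverage), the column names, and all function names are assumptions chosen for illustration, and the actual features used by LncMachine may differ.

```python
import csv
import io

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            # Normalize case and treat RNA (U) as DNA (T).
            chunks.append(line.upper().replace("U", "T"))
    if name is not None:
        yield name, "".join(chunks)


def longest_orf_length(seq):
    """Longest ATG-to-stop ORF on the forward strand, in nucleotides (stop included)."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def sequence_features(seq):
    """Compute illustrative (assumed) alignment-free features for one sequence."""
    length = len(seq)
    orf = longest_orf_length(seq)
    return {
        "length": length,
        "gc_content": (seq.count("G") + seq.count("C")) / length if length else 0.0,
        "longest_orf": orf,
        "orf_coverage": orf / length if length else 0.0,
    }


def features_to_tsv(records):
    """Return a tab-separated feature table (with header) for (id, sequence) pairs."""
    fields = ["id", "length", "gc_content", "longest_orf", "orf_coverage"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for name, seq in records:
        row = {"id": name}
        row.update(sequence_features(seq))
        writer.writerow(row)
    return out.getvalue()


# Usage with two made-up transcripts (not real data).
example = io.StringIO(">tx1 example\nATGAAATTTGGGTAACC\n>tx2\nGGCCGGCC\n")
print(features_to_tsv(read_fasta(example)))
```

A table like this could then be passed to any classifier and evaluated with 10-fold cross-validation, as described in the abstract. Only the forward strand is scanned here to keep the sketch short.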