Location: Crop Improvement and Genetics Research
Title: LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants
Author:
CAGIRICI, BUSRA - Oak Ridge Institute For Science And Education (ORISE)
GALVEZ, SERGIO - University Of Malaga
Sen, Taner
BUDAK, HIKMET - Montana Bioagriculture Inc
Submitted to: Functional and Integrative Genomics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 1/25/2021
Publication Date: 2/26/2021
Citation: Cagirici, B.H., Galvez, S., Sen, T.Z., Budak, H. 2021. LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants. Functional and Integrative Genomics. 21(1): 195-204. https://doi.org/10.1007/s10142-021-00769-w.
DOI: https://doi.org/10.1007/s10142-021-00769-w
Interpretive Summary: Wheat is among the most important crops for human nutrition. However, our understanding of how genes contribute to wheat traits is limited. Identifying the regulatory elements of the wheat genome will help improve wheat breeding by providing information about specific genes for selection. Long noncoding ribonucleic acids (lncRNAs) play critical roles in cell regulation. In this study, genome sequences that produce lncRNAs are predicted using a new machine learning methodology called LncMachine. The prediction performance of LncMachine was compared with that of currently available methods and shown to be superior. The program code, prediction models, and the training/test datasets are freely and publicly available.
Technical Abstract: Following elucidation of the critical roles they play in numerous important biological processes, long noncoding RNAs (lncRNAs) have gained considerable attention in recent years. Manual annotation of lncRNAs is restricted by known gene annotations and is prone to false predictions due to the incompleteness of available data. However, with the advent of high-throughput sequencing technologies, a large amount of high-quality data has become available for annotation, especially for plant species such as wheat. Here, we compared the prediction accuracies of several machine learning algorithms using 10-fold cross-validation. This study includes a comprehensive feature selection step to remove irrelevant and redundant features.
We present an alignment-free coding potential prediction tool, LncMachine with the Random Forest algorithm, specific to crop species, with higher accuracy than the currently available popular tools (CPC2, CPAT, and CNIT). In addition, LncMachine with Random Forest also performed well on human and mouse data, with an average accuracy of 92.67%. LncMachine can implement several algorithms in real time and provide the best model for a specific study. It accepts either a FASTA file or a tab-separated CSV file containing features for each sample. As it is open to implementation, LncMachine can be applied to a wide range of studies.
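To illustrate the two input forms mentioned above (a FASTA file, or a tab-separated table with one row of features per sample), the following is a minimal sketch of how alignment-free features might be derived from FASTA records. It is not LncMachine's code or feature set: the function names and the four features chosen here (length, GC content, longest forward-strand ORF, ORF coverage) are assumptions made purely for illustration, using only the Python standard library.

```python
import csv
import io

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            # Normalise case and treat RNA (U) as DNA (T).
            chunks.append(line.upper().replace("U", "T"))
    if name is not None:
        yield name, "".join(chunks)


def longest_orf_length(seq):
    """Length (nt, stop codon included) of the longest ATG..stop ORF, forward strand only."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def features(seq):
    """Illustrative alignment-free features (hypothetical, not LncMachine's own set)."""
    length = len(seq)
    gc = (seq.count("G") + seq.count("C")) / length if length else 0.0
    orf = longest_orf_length(seq)
    coverage = orf / length if length else 0.0
    return [length, round(gc, 4), orf, round(coverage, 4)]


def fasta_to_feature_table(fasta_text):
    """Return a tab-separated table with a header row and one feature row per sequence."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "length", "gc_content", "longest_orf", "orf_coverage"])
    for name, seq in read_fasta(io.StringIO(fasta_text)):
        writer.writerow([name] + features(seq))
    return out.getvalue()


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    example = ">tx1 toy transcript\nCCATGAAATAGGG\n>tx2\nAUGC\nGGCC\n"
    print(fasta_to_feature_table(example), end="")
```

A table of this shape could then be passed to any classifier for coding versus noncoding prediction. Which features and which model work best is exactly what the feature selection and 10-fold cross-validation described in the abstract are meant to determine.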