
Research Project: Improving Crop Efficiency Using Genomic Diversity and Computational Modeling

Location: Plant, Soil and Nutrition Research

Title: Evolutionarily informed deep learning methods: Predicting transcript abundance from DNA sequence

Author
WASHBURN, JACOB - Cornell University
MEJIA GUERRA, MARIA KATHERINE - Cornell University
RAMSTEIN, GUILLAUME - Cornell University
KREMLING, KARL - Cornell University
VALLURU, RAVI - Cornell University
Buckler, Edward - Ed
WANG, HAI - Chinese Academy Of Agricultural Sciences

Submitted to: Proceedings of the National Academy of Sciences (PNAS)
Publication Type: Review Article
Publication Acceptance Date: 1/31/2019
Publication Date: 3/19/2019
Citation: Washburn, J.D., Mejia Guerra, M., Ramstein, G., Kremling, K., Valluru, R., Buckler IV, E.S., Wang, H. 2019. Evolutionarily informed deep learning methods: Predicting transcript abundance from DNA sequence. Proceedings of the National Academy of Sciences. 116(12):5542-5549. https://doi.org/10.1073/pnas.1814551116.
DOI: https://doi.org/10.1073/pnas.1814551116

Interpretive Summary: Machine learning methodologies can easily be applied to biological problems, but standard training and testing methods are not designed to control for evolutionary relatedness or other biological phenomena. To overcome this challenge, two novel methods that control for and make use of evolutionary relatedness were developed within a predictive deep learning framework. The methods were tested and applied in the context of predicting mRNA expression levels from whole-genome DNA sequence data, and they are applicable across biological organisms. Potential use cases include plant and animal breeding, disease research, and gene editing, among others.

Technical Abstract: Deep learning methodologies have revolutionized prediction in many fields, and show potential to do the same in molecular biology and genetics. However, applying these methods in their current forms ignores evolutionary dependencies within biological systems and can result in false positives and spurious conclusions. We developed two novel approaches that account for evolutionary relatedness in machine learning models: 1) gene-family guided splitting, and 2) ortholog contrasts. The first approach accounts for evolution by constraining the model's training and testing sets to include different gene families. The second uses evolutionarily informed comparisons between orthologous genes to both control for and leverage evolutionary divergence during the training process. The two approaches were explored and validated within the context of mRNA expression level prediction, and yielded prediction auROC values ranging from 0.72 to 0.94. Model weight inspections showed biologically interpretable patterns, resulting in the novel hypothesis that the 3' UTR is more important for fine-tuning mRNA abundance levels, while the 5' UTR is more important for large-scale changes.
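
The idea behind gene-family guided splitting, as described above, is that no gene family contributes genes to both the training set and the testing set. The following is a minimal sketch of that constraint, not the authors' implementation. The function name `family_guided_split`, the `test_fraction` parameter, the greedy fill strategy, and the toy gene and family identifiers are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def family_guided_split(gene_to_family, test_fraction=0.2, seed=0):
    """Split genes so that no gene family appears in both sets.

    gene_to_family: dict mapping gene ID -> gene family ID (hypothetical input format).
    Returns (train_genes, test_genes).
    """
    # Group genes by the family they belong to.
    families = defaultdict(list)
    for gene, family in gene_to_family.items():
        families[family].append(gene)

    # Shuffle whole families, not individual genes, so relatives stay together.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)

    target_test_size = test_fraction * len(gene_to_family)
    train_genes, test_genes = [], []
    for family in family_ids:
        # Greedy fill (an assumption): assign whole families to the test set
        # until it reaches the target size, then send the rest to training.
        if len(test_genes) < target_test_size:
            test_genes.extend(families[family])
        else:
            train_genes.extend(families[family])
    return train_genes, test_genes


# Hypothetical toy data: ten genes in five families.
gene_to_family = {
    "geneA1": "famA", "geneA2": "famA", "geneA3": "famA",
    "geneB1": "famB", "geneB2": "famB",
    "geneC1": "famC",
    "geneD1": "famD", "geneD2": "famD",
    "geneE1": "famE", "geneE2": "famE",
}

train_genes, test_genes = family_guided_split(gene_to_family, test_fraction=0.3)
train_families = {gene_to_family[g] for g in train_genes}
test_families = {gene_to_family[g] for g in test_genes}
```

Because families are assigned as whole units, the test set may come out somewhat larger than the requested fraction; the property that matters here is that `train_families` and `test_families` do not overlap. Group-aware splitters in standard machine learning libraries enforce the same kind of constraint when the gene family is passed as the group label.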