Location: Plant, Soil and Nutrition Research
Title: Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model
Authors:
ZHAI, JINGJING - Cornell University
GOKASLAN, AARON - Cornell University
SCHIFF, YAIR - Cornell University
BERTHEL, ANA - Cornell University
LIU, ZONG-YAN - Cornell University
MILLER, ZACHARY - Cornell University
SCHEBEN, ARMIN - Cold Spring Harbor Laboratory
STITZER, MICHELLE - Cornell University
ROMAY, M CINTA - Cornell University
Buckler, Edward - Ed
KULESHOV, VOLODYMYR - Cornell University
Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 6/10/2024
Publication Date: 6/10/2024
Citation: Zhai, J., Gokaslan, A., Schiff, Y., Berthel, A., Liu, Z., Miller, Z.R., Scheben, A., Stitzer, M.C., Romay, M., Buckler IV, E.S., Kuleshov, V. 2024. Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model. bioRxiv. https://doi.org/10.1101/2024.06.04.596709.
DOI: https://doi.org/10.1101/2024.06.04.596709
Interpretive Summary: Crop improvement and adaptation require scaling genetic discoveries from a few model species and varieties to tens of thousands of species and varieties. USDA ARS scientists in Ithaca, NY, in collaboration with Cornell University, developed PlantCaduceus, a plant DNA language model that uses a novel computing architecture optimized for DNA and was trained on genomes from diverse angiosperms. It was then fine-tuned on data from the model plant Arabidopsis to predict the key elements defining a species' gene and protein composition. PlantCaduceus outperformed existing models severalfold in cross-species predictions, demonstrating high transferability across flowering plants. PlantCaduceus was also able to identify deleterious mutations without comparisons to other data, which facilitates breeding directly for yield and hybrid vigor. PlantCaduceus accelerates research and practical applications in plant genomics, potentially leading to more resilient and productive crop varieties that benefit farmers, scientists, and society as a whole.
Technical Abstract: Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are thus expected to offer better cross-species prediction, through fine-tuning on limited labeled data, than supervised deep learning models.
We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize, which diverged from Arabidopsis 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.