Publication #417822

Research Project: Enabling Mechanistic Allele Mining to Accelerate Genomic Selection for New Agro-Ecosystems

Location: Plant, Soil and Nutrition Research

Title: Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model

Author
ZHAI, JINGJING - Cornell University
GOKASLAN, AARON - Cornell University
SCHIFF, YAIR - Cornell University
BERTHEL, ANA - Cornell University
LIU, ZONG-YAN - Cornell University
MILLER, ZACHARY - Cornell University
SCHEBEN, ARMIN - Cold Spring Harbor Laboratory
STITZER, MICHELLE - Cornell University
ROMAY, M CINTA - Cornell University
Buckler, Edward - Ed
KULESHOV, VOLODYMYR - Cornell University

Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 6/10/2024
Publication Date: 6/10/2024
Citation: Zhai, J., Gokaslan, A., Schiff, Y., Berthel, A., Liu, Z., Miller, Z.R., Scheben, A., Stitzer, M.C., Romay, M., Buckler IV, E.S., Kuleshov, V. 2024. Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model. bioRxiv. https://doi.org/10.1101/2024.06.04.596709.
DOI: https://doi.org/10.1101/2024.06.04.596709

Interpretive Summary: Crop improvement and adaptation require scaling genetic discoveries from a few model species and varieties to tens of thousands of species and varieties. USDA ARS scientists in Ithaca, NY, in collaboration with Cornell University, developed PlantCaduceus, a plant DNA language model that uses a novel computing architecture optimized for DNA and is trained on genomes from diverse Angiosperms. The model was then fine-tuned on data from the model plant Arabidopsis to predict the key elements defining a species' gene and protein composition. PlantCaduceus outperformed existing models severalfold in cross-species predictions, demonstrating high transferability across flowering plants. PlantCaduceus was also able to identify deleterious mutations without comparisons to other data, which facilitates breeding directly for yield and hybrid vigor. PlantCaduceus accelerates research and practical applications in plant genomics, potentially leading to more resilient and productive crop varieties that benefit farmers, scientists, and society as a whole.

Technical Abstract: Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are thus expected to offer better cross-species prediction, through fine-tuning on limited labeled data, than supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures and pre-trained on a carefully curated dataset of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize, which diverged from Arabidopsis 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide identification of deleterious mutations without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.
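
Illustrative note: the abstract reports an enrichment of rare alleles among prioritized deleterious mutations but does not describe how it was computed. The sketch below shows one plausible way to quantify such an enrichment, as the ratio of the rare-allele fraction among the top-ranked variants to the rare-allele fraction among all variants. The function name, the thresholds, and the convention that lower scores mean more deleterious are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def rare_allele_enrichment(scores, mafs, top_fraction=0.1, rare_maf=0.01):
    """Fold enrichment of rare alleles among the most deleterious-scored variants.

    scores: per-variant scores; ASSUMPTION: lower means more deleterious.
    mafs: per-variant minor allele frequencies, same order as scores.
    top_fraction, rare_maf: hypothetical thresholds chosen for illustration.
    """
    if len(scores) != len(mafs) or len(scores) == 0:
        raise ValueError("scores and mafs must be non-empty and equal in length")

    # Rank variants from most to least deleterious and keep the top slice.
    n_top = max(1, int(len(scores) * top_fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    top = order[:n_top]

    # Fraction of rare alleles in the prioritized set versus the background.
    rare_in_top = sum(1 for i in top if mafs[i] < rare_maf) / n_top
    rare_in_all = sum(1 for m in mafs if m < rare_maf) / len(mafs)

    if rare_in_all == 0:
        return math.nan  # enrichment is undefined without any rare alleles
    return rare_in_top / rare_in_all


# Toy data, invented purely to exercise the function.
toy_scores = [-5.0, -4.0, -1.0, 0.0, 0.5, 1.0, 1.2, 2.0, 2.5, 3.0]
toy_mafs = [0.001, 0.002, 0.2, 0.3, 0.005, 0.4, 0.25, 0.1, 0.35, 0.45]
toy_enrichment = rare_allele_enrichment(toy_scores, toy_mafs, top_fraction=0.2)
```

To compare two scoring methods in this way, the same function would be applied to each method's scores over the same set of variants and the resulting enrichments compared.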