
Research Project: GrainGenes- A Global Data Repository for Small Grains

Location: Crop Improvement and Genetics Research

Title: PhosBoost: improved phosphorylation prediction using gradient boosting and protein language models

Author
PORETSKY, ELLY - Oak Ridge Institute For Science And Education (ORISE)
Andorf, Carson
Sen, Taner

Submitted to: Journal of Plant Physiology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/26/2023
Publication Date: 12/20/2023
Citation: Poretsky, E., Andorf, C.M., Sen, T.Z. 2023. PhosBoost: improved phosphorylation prediction using gradient boosting and protein language models. Journal of Plant Physiology. https://doi.org/10.1002/pld3.554.
DOI: https://doi.org/10.1002/pld3.554

Interpretive Summary: Protein phosphorylation is a cellular process that acts as a control mechanism in many metabolic pathways. Although some experimental datasets identify which amino acid sites are phosphorylated, they are not comprehensive. Using machine learning approaches and cutting-edge protein language models, we developed a new method called PhosBoost. PhosBoost recovers true phosphorylation sites at a higher rate than current methods. We expect PhosBoost to be highly useful for those who study how cells rely on protein phosphorylation to create plant traits.

Technical Abstract: Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes. It is one of the most extensively studied post-translational modifications, with phosphorylation events detected and quantified across diverse biological systems. The regulatory role of phosphorylation in cellular signaling pathways, protein-protein interactions, and enzymatic activities has motivated extensive research efforts to understand its functional implications. The significance of phosphorylation as a key regulatory post-translational modification, coupled with the abundance of data, has provided valuable resources for the development of increasingly sophisticated protein phosphorylation prediction tools. Recent developments in protein language models have shown great promise in improving protein phosphorylation prediction by inherently representing complex sequence patterns and dependencies in proteins. While the accuracy and precision of protein phosphorylation prediction methods have been steadily increasing, recall remains relatively low. Here, we present a novel machine learning approach, PhosBoost, that harnesses large language models and gradient boosting trees to predict protein phosphorylation from experimentally derived phosphorylation data. We show that PhosBoost performance is comparable to, or better than, that of state-of-the-art classifiers, possibly due to higher robustness to data imbalance. Furthermore, PhosBoost is simple to implement and scalable, allowing for practical genome-wide predictions of protein phosphorylation coupled with improved phosphosite annotation.
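
The abstract's point that accuracy and precision can be high while recall stays low can be illustrated with a small worked example. The sketch below is not PhosBoost code and uses no data from the paper. The labels are hypothetical toy values (100 candidate sites, 10 of them truly phosphorylated), and the function names are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = phosphorylated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy, precision, and recall; undefined ratios fall back to 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical imbalanced toy data: 10 true phosphosites among 100 candidates.
y_true = [1] * 10 + [0] * 90

# Hypothetical conservative classifier: finds 3 of the 10 true sites
# and raises a single false positive among the 90 negatives.
y_pred = [1] * 3 + [0] * 7 + [1] * 1 + [0] * 89

scores = metrics(y_true, y_pred)
print(scores)  # accuracy 0.92, precision 0.75, recall 0.3
```

In this toy case the classifier looks strong on accuracy and precision because negatives dominate, yet it misses 7 of the 10 true sites. That gap is why recall is a useful metric to report for imbalanced phosphorylation data.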