Location: Crop Improvement and Genetics Research
Title: PhosBoost: Improved phosphorylation prediction recall using gradient boosting and protein language models
Author:
Poretsky, Elly - Oak Ridge Institute For Science And Education (ORISE)
Andorf, Carson
Sen, Taner
Submitted to: Plant Direct
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/26/2023
Publication Date: 1/20/2024
Citation: Poretsky, E., Andorf, C.M., Sen, T.Z. 2024. PhosBoost: Improved phosphorylation prediction recall using gradient boosting and protein language models. Plant Direct. 7(12). Article e554. https://doi.org/10.1002/pld3.554.
DOI: https://doi.org/10.1002/pld3.554
Interpretive Summary: Protein phosphorylation is a cellular process that acts as a control mechanism in many metabolic pathways. Although some experimental datasets are available to identify which amino acid sites are phosphorylated, they are not comprehensive. Using machine learning approaches and cutting-edge language models, we developed a new methodology called PhosBoost. PhosBoost identifies a larger share of true phosphorylation sites than current methods. We predict that PhosBoost will be highly useful for those who study how cells rely on protein phosphorylation to create plant traits.
Technical Abstract: Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes. It is one of the most extensively studied post-translational modifications, with extensive detection and quantification of phosphorylation events across diverse biological systems. The regulatory role of phosphorylation in cellular signaling pathways, protein-protein interactions, and enzymatic activities has motivated extensive research efforts to understand its functional implications. The significance of phosphorylation as a key regulatory post-translational modification, coupled with the abundance of data, has provided valuable resources for the development of increasingly sophisticated protein phosphorylation prediction tools. Recent developments in protein language models have shown great promise in improving protein phosphorylation prediction by inherently representing complex sequence patterns and dependencies in proteins.
While the accuracy and precision of protein phosphorylation prediction methods have been steadily increasing, recall remains relatively low. Here, we present a novel machine learning approach, PhosBoost, that harnesses large language models and gradient boosting trees to predict protein phosphorylation from experimentally derived phosphorylation data. Our results show that PhosBoost performance is comparable to, or better than, that of state-of-the-art classifiers, possibly due to higher robustness to data imbalance. Furthermore, PhosBoost is simple to implement and scalable, allowing for practical genome-wide predictions of protein phosphorylation coupled with improved phosphosite annotation.
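To illustrate why recall can stay low while accuracy and precision look strong on imbalanced data, the following minimal Python sketch computes all three metrics on a toy example. The data, the class ratio, and the function names `confusion_counts` and `classification_metrics` are hypothetical and chosen only for illustration; they are not taken from PhosBoost or its evaluation.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = phosphorylated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical toy data: 100 candidate sites, only 10 truly phosphorylated.
y_true = [1] * 10 + [0] * 90
# A conservative classifier: finds 4 of the 10 true sites and makes 1 false call.
y_pred = [1] * 4 + [0] * 6 + [1] * 1 + [0] * 89

scores = classification_metrics(y_true, y_pred)
print(scores)
```

In this toy case the classifier is right on 93 of 100 sites and 4 of its 5 positive calls are correct, yet it recovers only 4 of the 10 true sites. This is the kind of gap between accuracy or precision and recall that the abstract describes.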