Research Project: GrainGenes: Enabling Data Access and Sustainability for Small Grains Researchers

Location: Crop Improvement and Genetics Research

Title: G4Boost: A machine learning-based tool for quadruplex identification and stability prediction

Author
CAGIRICI, H - Oak Ridge Institute For Science And Education (ORISE)
BUDAK, HIKMET - Montana Bioagriculture Inc
Sen, Taner

Submitted to: BMC Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/9/2022
Publication Date: 6/18/2022
Citation: Cagirici, H.B., Budak, H., Sen, T.Z. 2022. G4Boost: A machine learning-based tool for quadruplex identification and stability prediction. BMC Bioinformatics. 23. Article 240. https://doi.org/10.1186/s12859-022-04782-z.
DOI: https://doi.org/10.1186/s12859-022-04782-z

Interpretive Summary: G-quadruplexes (G4s) are four-stranded nucleic acid structures in which closely spaced guanine bases form square planar arrangements; they occur in genomes and have been implicated in genome stability and cellular regulation. In this study, we developed a machine learning approach called G4Boost to identify G-quadruplexes and predict their stability based on sequence and energy calculations. G4Boost correctly predicts the folding state of G4 structures with greater than 93% accuracy. G4Boost was successfully applied and validated to predict the stability of experimentally determined G4 structures, including those from plants and humans. Accurate prediction of G-quadruplexes and their stabilities will provide a better understanding of the role these important functional structures play in cellular control, including in human diseases and plant traits.

Technical Abstract: Guanine-rich nucleic acid sequences can adopt a variety of functional structures, including G-quadruplexes (G4s). G4s have been associated with many important biological functions, including telomere maintenance, replication, and recombination. Although every G4 motif has the potential to form a G4 structure, not every motif forms a stable one, and accurate energy-based methods are needed to assess structural stability. Here, we present a decision tree-based prediction tool, G4Boost, to locate quadruplex motifs and predict their secondary structure folding probability and stability based on their sequences, nucleotide compositions, and estimated structural topologies. Five-fold cross-validation experiments showed that G4Boost correctly predicts the folding state of G4 structures with greater than 93% accuracy, and predicts the secondary structure folding energy with a root-mean-square error of 4.28 and an R-squared of 0.95 for diverse species, including plants. Although trained mainly on plant species, G4Boost also accurately classifies experimental X-ray crystallography data for human quadruplexes. G4Boost was successfully applied and validated to predict the stability of experimentally determined G4 structures, including those from plants and humans. Accurate prediction of G-quadruplexes and their stabilities will provide a better understanding of the role these important functional structures play in cellular regulation.
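
To make the "locate quadruplex motifs" and "nucleotide compositions" steps concrete, the following is a minimal illustrative sketch in Python. It is not the G4Boost implementation. It assumes the widely used canonical motif definition (four runs of at least three guanines separated by loops of 1 to 7 nucleotides), which may differ from the motif definitions G4Boost actually uses. The function names `find_g4_motifs` and `composition_features` are hypothetical. The decision tree-based model that would consume such features is not reproduced here.

```python
import re

# Assumption: canonical G4 motif, G{3,} N{1,7} G{3,} N{1,7} G{3,} N{1,7} G{3,}.
# This is a common convention, not necessarily the definition used by G4Boost.
G4_PATTERN = re.compile(
    r"G{3,}[ACGTN]{1,7}G{3,}[ACGTN]{1,7}G{3,}[ACGTN]{1,7}G{3,}"
)


def find_g4_motifs(sequence):
    """Return (start, end, motif) tuples for non-overlapping canonical
    G4 motifs on the given strand. Coordinates are 0-based, end-exclusive."""
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


def composition_features(motif):
    """Compute simple sequence-composition features for one motif.
    These are illustrative features only (hypothetical feature set)."""
    length = len(motif)
    g_count = motif.count("G")
    c_count = motif.count("C")
    return {
        "length": length,
        "g_fraction": g_count / length,
        "gc_fraction": (g_count + c_count) / length,
        # Number of guanine runs of length three or more
        "g_runs": len(re.findall(r"G{3,}", motif)),
    }


if __name__ == "__main__":
    # Human telomeric repeat with short flanking bases, used as a toy input
    example = "ttGGGTTAGGGTTAGGGTTAGGGaa"
    for start, end, motif in find_g4_motifs(example):
        print(start, end, motif, composition_features(motif))
```

In a pipeline of the kind the abstract describes, feature dictionaries like these would be passed to a trained classifier (folding state) and a trained regressor (folding energy). This sketch scans only the given strand and does not search the reverse complement.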