Location: Plant, Soil and Nutrition Research
Title: A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset
Author
ZHOU, YONG - King Abdullah University Of Science And Technology
KATHIRESAN, NAGARAJAN - King Abdullah University Of Science And Technology
YU, ZHICHAO - King Abdullah University Of Science And Technology
RIVERA, LUIS - King Abdullah University Of Science And Technology
THIMMA, MANJULA - King Abdullah University Of Science And Technology
MANICKAM, KEERTHANA - King Abdullah University Of Science And Technology
CHEBOTAROV, DMYTRO - International Rice Research Institute
MAULEON, RAMIL - International Rice Congress
CHOUGULE, KAPEEL - Cold Spring Harbor Laboratory
WING, ROD - King Abdullah University Of Science And Technology
WEI, SHARON - Cold Spring Harbor Laboratory
GAO, TINGTING - Huazhong Agricultural University
GREEN, CARL - King Abdullah University Of Science And Technology
ZUCCOLO, ANDREA - King Abdullah University Of Science And Technology
WARE, DOREEN
ZHANG, JIANWEI - Huazhong Agricultural University
MCNALLY, KENNETH - International Rice Research Institute
YANG, YUJIAN - Huazhong Agricultural University
XIE, WEIBO - Huazhong Agricultural University
Submitted to: BMC Biology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/18/2023
Publication Date: 1/25/2024
Citation: Zhou, Y., Kathiresan, N., Yu, Z., Rivera, L.F., Thimma, M., Manickam, K., Chebotarov, D., Mauleon, R., Chougule, K., Wing, R.A., Wei, S., Gao, T., Green, C., Zuccolo, A., Ware, D., Zhang, J., McNally, K.L., Yang, Y., Xie, W. 2024. A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset. BMC Biology. 22:13. https://doi.org/10.1186/s12915-024-01820-5.
DOI: https://doi.org/10.1186/s12915-024-01820-5
Interpretive Summary: The Genome Analysis Toolkit (GATK) is one of the most popular software packages for identifying single-nucleotide polymorphisms (SNPs) and has been used for SNP calling in many species. As sequencing costs fall and sequencing speed rises, vast amounts of resequencing data are being generated for studies of genetic variation. These unprecedentedly large datasets pose new challenges for SNP detection, which fall into three areas: 1) intelligent data management solutions to compress and store data; 2) high-throughput computing on high-performance computing (HPC) systems; and 3) flexible workflows and job-monitoring software to run analyses efficiently on different HPC platforms. To address these challenges, the authors developed a novel workflow called the “HPC-based genome variant calling workflow” (HPC-GVCW). To demonstrate its usability, portability, efficiency, and accuracy, the authors tested HPC-GVCW on different platforms, including supercomputers, clusters, and high-end workstations, using a subset of the 3K-RGP dataset called against the IRGSP reference genome. The results showed an 83-94% identical call rate with previously published data. The authors then applied the workflow to the most recently released high-quality pan-genomes of different species and efficiently called an average of 27.3 M, 32.6 M, 168.9 M, and 16.2 M SNPs for rice, sorghum, maize, and soybean, respectively. Analysis of a rice pan-genome reference panel revealed 2.1 M novel SNPs that have yet to be publicly released. The software is open source and available for download on GitHub (https://github.com/IBEXCluster/Rice-Variant-Calling).
Technical Abstract: A high-performance computing genome variant calling workflow was designed to run GATK on HPC platforms. This workflow efficiently called an average of 27.3 M, 32.6 M, 168.9 M, and 16.2 M SNPs for rice, sorghum, maize, and soybean, respectively, on the most recently released high-quality reference sequences. Analysis of a rice pan-genome reference panel revealed 2.1 M novel SNPs that have yet to be publicly released.