Publication #406689

Research Project: Mapping Crop Genome Functions for Biology-Enabled Germplasm Improvement

Location: Plant, Soil and Nutrition Research

Title: A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset

Author
ZHOU, YONG - King Abdullah University Of Science And Technology
KATHIRESAN, NAGARAJAN - King Abdullah University Of Science And Technology
YU, ZHICHAO - King Abdullah University Of Science And Technology
RIVERA, LUIS - King Abdullah University Of Science And Technology
THIMMA, MANJULA - King Abdullah University Of Science And Technology
MANICKAM, KEERTHANA - King Abdullah University Of Science And Technology
CHEBOTAROV, DMYTRO - International Rice Research Institute
MAULEON, RAMIL - International Rice Congress
CHOUGULE, KAPEEL - Cold Spring Harbor Laboratory
WING, ROD - King Abdullah University Of Science And Technology
WEI, SHARON - Cold Spring Harbor Laboratory
GAO, TINGTING - Huazhong Agricultural University
GREEN, CARL - King Abdullah University Of Science And Technology
ZUCCOLO, ANDREA - King Abdullah University Of Science And Technology
Ware, Doreen
ZHANG, JIANWEI - Huazhong Agricultural University
MCNALLY, KENNETH - International Rice Research Institute
YANG, YUJIAN - Huazhong Agricultural University
XIE, WEIBO - Huazhong Agricultural University

Submitted to: BMC Biology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/18/2023
Publication Date: 1/25/2024
Citation: Zhou, Y., Kathiresan, N., Yu, Z., Rivera, L.F., Thimma, M., Manickam, K., Chebotarov, D., Mauleon, R., Chougule, K., Wing, R.A., Wei, S., Gao, T., Green, C., Zuccolo, A., Ware, D., Zhang, J., Mcnally, K.L., Yang, Y., Xie, W. 2024. A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset. BMC Biology. 22:13. https://doi.org/10.1186/s12915-024-01820-5.
DOI: https://doi.org/10.1186/s12915-024-01820-5

Interpretive Summary: The Genome Analysis Toolkit (GATK) is one of the most popular software packages for SNP identification and has been used for SNP calling in many species. As sequencing costs fall and sequencing speed rises, vast amounts of resequencing data are being generated for genetic variation studies. These unprecedentedly large datasets pose new challenges for SNP detection, which fall into three areas: 1) intelligent data management solutions to compress and store data; 2) high-throughput computing on high-performance computing (HPC) systems; and 3) flexible workflows and job-monitoring software to run analyses efficiently on different HPC platforms. To address these challenges, the authors developed a novel workflow called the “HPC-based genome variant calling workflow” (HPC-GVCW). To demonstrate the usability, portability, efficiency, and accuracy of HPC-GVCW, the authors tested it on different platforms, including supercomputers, clusters, and high-end workstations, using a subset of the 3K-RGP dataset called against the IRGSP reference genome. The resulting calls were 83-94% identical to previously published data. The authors then applied the workflow to the most recently released high-quality pan-genomes of different species and efficiently called an average of 27.3 M, 32.6 M, 168.9 M, and 16.2 M SNPs for rice, sorghum, maize, and soybean, respectively. Analysis of a rice pan-genome reference panel revealed 2.1 M novel SNPs that have yet to be publicly released. The software is open source and available for download on GitHub (https://github.com/IBEXCluster/Rice-Variant-Calling).

Technical Abstract: A high-performance computing genome variant calling workflow was designed to run GATK on HPC platforms. This workflow efficiently called an average of 27.3 M, 32.6 M, 168.9 M, and 16.2 M SNPs for rice, sorghum, maize, and soybean, respectively, on the most recently released high-quality reference sequences. Analysis of a rice pan-genome reference panel revealed 2.1 M novel SNPs that have yet to be publicly released.