
Research Project: Enhancement of Apple, Pear, and Sweet Cherry Quality

Location: Physiology and Pathology of Tree Fruits Research

Title: GEMmaker: Process massive RNA-seq datasets on heterogeneous computational infrastructure

Author
HADISH, JOHN - Washington State University
BIGGS, TYLER - Washington State University
SHEALY, BEN - Clemson University
BENDER, M - Clemson University
MCKNIGHT, COLEMAN - Clemson University
WYTKO, CONNOR - Washington State University
SMITH, MELISSA - Clemson University
FELTUS, ALEX - Clemson University
Honaas, Loren
FICKLIN, STEPHEN - Washington State University

Submitted to: BMC Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 3/7/2022
Publication Date: 5/2/2022
Citation: Hadish, J., Biggs, T., Shealy, B., Bender, M.R., McKnight, C., Wytko, C., Smith, M., Feltus, A.F., Honaas, L.A., Ficklin, S. 2022. GEMmaker: Process massive RNA-seq datasets on heterogeneous computational infrastructure. BMC Bioinformatics. 23. Article 156. https://doi.org/10.1186/s12859-022-04629-7.
DOI: https://doi.org/10.1186/s12859-022-04629-7

Interpretive Summary: Successive revolutions in DNA sequencing technology have transformed the way researchers analyze gene activity. The second generation of technology allows researchers to measure the activity of all genes simultaneously. For instance, in Rosaceous tree fruit crop species, this means measuring the activity of roughly 40,000 genes in a single experiment. A single gene measurement data set can consist of hundreds of millions of gene tags or counts, with file sizes in the gigabytes, making it necessary to employ specialized software and powerful computers to process and analyze the data. As the second generation of sequencing technology has matured, the data sets have grown larger and more numerous, with collections consisting of hundreds or thousands of individual data sets. While these advances have enabled researchers to relate gene activity to important traits, the sheer size of the data sets creates issues related to data management, computer resources, software expertise, and navigation of high-performance computing systems. Here we report the development of GEMmaker, workflow software for processing these large gene activity data sets. It is unique among similar workflow software packages because it can scale up without also requiring a similar increase in data storage capability. This allows users with limited computer resources to process large data sets, while also addressing other issues related to the processing of very large data sets.

Technical Abstract: Background: Quantification of gene expression from RNA-seq data is a prerequisite for transcriptome analyses such as differential gene expression (DGE) analysis and gene co-expression network (GCN) construction. RNA-seq experiments are increasingly large in number of samples, and there is continued interest in forming large sample sets by combining experiments from large sequence repositories. However, processing hundreds to thousands of RNA-seq datasets can result in challenges related to data management, access to sufficient computational resources, navigation of high-performance computing (HPC) systems, installation of required software dependencies, and reproducibility. Processing of larger and deeper RNA-seq experiments will become more common as sequencing technology matures. Results: GEMmaker is an nf-core compliant Nextflow workflow that quantifies gene expression from small to massive Illumina RNA-seq datasets. It ensures results are highly reproducible through use of versioned containerized software, and can execute on a single workstation, institutional compute cluster, Kubernetes platform, or the cloud. It supports popular alignment and quantification tools, providing results in raw and normalized formats. GEMmaker is unique in that it can scale to process thousands of locally or remotely stored samples without exceeding available data storage. Conclusions: Workflows that quantify gene expression are not new, and many already address issues of portability, re-usability, and scale in terms of access to CPUs. GEMmaker provides these benefits but adds the ability to scale despite low data storage infrastructure. Such an advantage will allow anyone to process hundreds of RNA-seq samples even when data storage resources are limited. GEMmaker is freely available and fully documented with step-by-step setup and execution instructions.
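
Illustrative sketch: The storage-bounded scaling described in the abstract can be pictured as processing samples in small batches and deleting each batch's input files before the next batch starts, so that peak disk use depends on the batch size rather than on the total number of samples. The Python below is a minimal sketch of that general idea only; it is not GEMmaker's implementation, which is a Nextflow workflow. The function names (process_in_batches, fake_fetch, fake_quantify), the sample identifiers, and the batch size are all hypothetical stand-ins chosen for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def process_in_batches(sample_ids, fetch, quantify, batch_size=2):
    """Quantify samples batch by batch, deleting input files after each batch.

    Returns (results, peak_files): the per-sample results and the largest
    number of input files that were on disk at the same time.
    """
    results = {}
    peak_files = 0
    for start in range(0, len(sample_ids), batch_size):
        batch = sample_ids[start:start + batch_size]
        workdir = Path(tempfile.mkdtemp(prefix="batch_"))
        try:
            # Stage only this batch's inputs on disk.
            paths = [fetch(sample_id, workdir) for sample_id in batch]
            peak_files = max(peak_files, len(list(workdir.iterdir())))
            for sample_id, path in zip(batch, paths):
                results[sample_id] = quantify(path)
        finally:
            # Remove the staged inputs before the next batch begins.
            shutil.rmtree(workdir)
    return results, peak_files


def fake_fetch(sample_id, workdir):
    """Hypothetical stand-in for a download step: writes a one-read FASTQ."""
    path = workdir / f"{sample_id}.fastq"
    path.write_text("@read1\nACGT\n+\nIIII\n")
    return path


def fake_quantify(path):
    """Hypothetical stand-in for quantification: counts FASTQ records."""
    return len(path.read_text().splitlines()) // 4


if __name__ == "__main__":
    samples = ["S1", "S2", "S3", "S4", "S5"]  # hypothetical sample IDs
    results, peak = process_in_batches(samples, fake_fetch, fake_quantify)
    print(results, peak)
```

In this sketch, five samples processed with a batch size of two never leave more than two input files on disk at once, while the small per-sample results are retained for all five.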