
Title: Accurate detection and quantification of functional genes in complex short-read metagenomic datasets: method development and application to nitrogen cycle genes

Author
ORELLANA, L - GEORGIA TECH
RODRIGUEZ-R, L - GEORGIA TECH
Chee Sanford, Joanne
SANFORD, R - UNIVERSITY OF ILLINOIS
LOEFFLER, F - UNIVERSITY OF TENNESSEE
KONSTANTINIDIS, K - GEORGIA TECH

Submitted to: mBio
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 3/1/2016
Publication Date: 8/1/2016
Citation: Orellana, L.H., Rodriguez-R, L.M., Chee Sanford, J.C., Sanford, R.A., Loeffler, F.E., Konstantinidis, K.T. 2016. Accurate detection and quantification of functional genes in complex short-read metagenomic datasets: method development and application to nitrogen cycle genes. mBio. 7:e01693-16.

Interpretive Summary: New sequencing technologies (-omics) make it possible to characterize genes in natural systems such as soils, yielding very large databases of gene sequences that can help predict the nature of unseen, uncultured microbial populations and their activities in their natural environments. These data can indicate the presence of important processes such as nutrient (carbon and nitrogen) cycling, greenhouse gas emissions, antibiotic resistance, and biodegradation. Making sense of this sequence information requires new computational tools that can accurately identify genes against a background of billions of sequenced gene fragments. A new, publicly available computational tool was developed that identifies true proteins among the many unknown sequences generated by comparing them with proven, previously identified reference proteins. Among the genes used to demonstrate the utility of this approach was nosZ, which codes for the enzyme that reduces the greenhouse gas N2O to N2. The analysis showed a high abundance of a recently discovered "atypical" nosZ gene relative to its "typical" counterpart, which for decades had been considered the only group. This study resulted in a new bioinformatics tool ("ROCker") for reliably and accurately detecting genes of interest in natural environments without the bias normally associated with molecular-based methods of study. It represents a large advance for users of new sequencing technology who seek a more thorough understanding of the abundance and diversity of important genes in natural systems.

Technical Abstract: Metagenomics can elucidate the diversity, abundance, and dynamics of the microbial genes participating in many of the biogeochemical transformations of key nutrients such as nitrogen in a variety of ecosystems. However, accurate thresholds that can discriminate between true and false positive matches during sequence similarity searches are rarely evaluated. Further, the effects of read length on the frequencies of such matches remain speculative. To overcome these limitations, we developed a methodology that identifies position-specific, most-discriminant thresholds for functional genes of interest. To determine the thresholds, we employed Receiver-Operator Curve (ROC) analysis of in silico generated metagenomic reads mapped onto well-curated reference gene sequences, together with a sliding window across the length of the alignment of the reference sequences to deal with non-discriminative domains and motifs that are shared between different proteins. Using popular similarity search algorithms such as BLASTx or DIAMOND, our strategy improved the false discovery rate (FDR) by ~2- to 14-fold compared with the common practice of using a fixed e-value cut-off across the length of the alignment. This strategy also exhibited better sensitivity (average increase of ~24%) than hidden Markov model searches of the same parts of the sequence of the target gene, although at the expense of computational time. Based on the determined thresholds, we investigated the abundance and diversity of (unassembled) metagenomic reads encoding nitrous oxide reductase (nosZ), which mediates the reduction of N2O to N2, in publicly available terrestrial short-read metagenomes. The results revealed, for example, that in most soils the recently discovered atypical nosZ genes are more abundant than their typical counterparts, which have been preferentially studied to date. Therefore, this study provides a bioinformatic strategy for reliable detection of target short gene fragments in metagenomes and advances our understanding of the abundance and diversity of nitrogen cycle genes in soils. Our publicly available pipeline "ROCker" is fully automated and can be used to investigate any other genes or processes of interest (www.enveomics.gatech.edu).
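
The idea of position-specific thresholds derived from ROC analysis over a sliding window can be illustrated with a minimal Python sketch. This is not the ROCker implementation: the `Hit` record, the function names, the window and step sizes, and the use of Youden's J (sensitivity + specificity - 1) as the criterion for the "most-discriminant" point are all assumptions made for illustration. It assumes simulated reads whose true origin is known, each with a bit score and a position on the reference alignment.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One simulated read matched against the reference alignment (hypothetical record)."""
    position: int      # alignment column where the hit is centred
    bit_score: float   # similarity-search bit score
    is_target: bool    # known, because the reads were generated in silico


def best_threshold(hits):
    """Return the bit-score cutoff that maximizes Youden's J, or None if undefined."""
    positives = sum(h.is_target for h in hits)
    negatives = len(hits) - positives
    if positives == 0 or negatives == 0:
        return None  # a ROC curve needs both true and false matches
    best_j, best_cut = -1.0, None
    # Try every observed score as a candidate cutoff (hits scoring >= cutoff are kept).
    for cut in sorted({h.bit_score for h in hits}):
        tp = sum(h.is_target and h.bit_score >= cut for h in hits)
        fp = sum((not h.is_target) and h.bit_score >= cut for h in hits)
        j = tp / positives - fp / negatives  # sensitivity - false positive rate
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut


def windowed_thresholds(hits, alignment_length, window=20, step=10):
    """Map each window start column to the cutoff chosen for that window."""
    thresholds = {}
    for start in range(0, alignment_length, step):
        in_window = [h for h in hits if start <= h.position < start + window]
        thresholds[start] = best_threshold(in_window)
    return thresholds
```

In this sketch, a read from a real metagenome would be accepted only if its bit score meets the cutoff of the window in which it aligns, so regions containing motifs shared with other proteins can demand a higher score than more discriminative regions.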