Publication #365648

Research Project: Database Tools for Managing and Analyzing Big Data Sets to Enhance Small Grains Breeding

Location: Plant, Soil and Nutrition Research

Title: A statistical framework for detecting mislabeled and contaminated samples using shallow-depth sequence data

Author
Chan, Ariel - Cornell University
Williams, Amy - Cornell University
Jannink, Jean-Luc

Submitted to: BMC Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/19/2018
Publication Date: 12/12/2018
Citation: Chan, A.W., Williams, A.L., Jannink, J. 2018. A statistical framework for detecting mislabeled and contaminated samples using shallow-depth sequence data. BMC Bioinformatics. 19:478. https://doi.org/10.1186/s12859-018-2512-8.
DOI: https://doi.org/10.1186/s12859-018-2512-8

Interpretive Summary: Genotype mislabeling is more common than most breeders or researchers care to admit. Before merging data from replicate analyses of the same genotype, it is important to verify that no mix-up occurred. We developed a verification method suited to next-generation sequencing data. The method gives an easily interpreted output indicating which samples are correct. It is implemented as an R package called Bayes Inferred Genotype Replicate Error Detector (BIGRED), which is freely available.

Technical Abstract:
Background: Researchers typically sequence a given individual multiple times, either re-sequencing the same DNA sample (technical replication), sequencing different DNA samples collected from the same individual (biological replication), or both. Before merging the data from these replicate sequence runs, it is important to verify that no errors, such as DNA contamination or mix-ups, occurred during the data collection pipeline. Methods to detect such errors exist, but they are often ad hoc and cannot handle missing data, and several require phased data. Because they require some combination of genotype calling, imputation, and haplotype phasing, these methods are unsuitable for error detection in low- to moderate-depth sequence data, where such tasks are difficult to perform accurately. Additionally, because most existing methods compare samples pairwise rather than analyzing the putative replicates jointly, results may be difficult to interpret.
Results: We introduce a new method for error detection suitable for shallow-, moderate-, and high-depth sequence data. Using Bayes' theorem, we calculate the posterior probability distribution over the set of relations describing the putative replicates and infer which of the samples originated from an identical genotypic source.
Conclusions: Our method addresses key limitations of existing approaches and produced highly accurate results in simulation experiments. Our method is implemented as an R package called BIGRED (Bayes Inferred Genotype Replicate Error Detector), which is freely available for download: https://github.com/ac2278/BIGRED.
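To make the idea of a posterior over relations concrete, the following is a minimal Python sketch and not the BIGRED implementation or its API. It assumes a toy model that the abstract does not specify: biallelic sites with known alternate-allele frequencies, Hardy-Weinberg genotype priors, binomial read counts with a fixed error rate, and a uniform prior over all ways of grouping the putative replicates by source. All function names and parameters here are hypothetical.

```python
from math import comb, exp, log


def set_partitions(items):
    """Yield every way to split `items` into non-empty groups (candidate relations)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a group of its own.
        yield [[first]] + partition


def read_likelihood(alt, depth, genotype, err):
    """Binomial probability of `alt` alternate reads out of `depth` given a genotype.

    `genotype` counts alternate alleles (0, 1, or 2). A site with depth 0
    (missing data) contributes a factor of 1, so it needs no special handling.
    """
    p_alt = {0: err, 1: 0.5, 2: 1.0 - err}[genotype]
    return comb(depth, alt) * p_alt ** alt * (1.0 - p_alt) ** (depth - alt)


def group_log_likelihood(group, reads, freqs, err):
    """Log-likelihood of the reads of all samples in `group` sharing one genotype."""
    total = 0.0
    for site, q in enumerate(freqs):
        prior = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]  # Hardy-Weinberg (assumed)
        site_lik = 0.0
        for genotype in range(3):
            lik = prior[genotype]
            for sample in group:
                alt, depth = reads[sample][site]
                lik *= read_likelihood(alt, depth, genotype, err)
            site_lik += lik  # marginalize over the unknown shared genotype
        total += log(site_lik)
    return total


def posterior_over_relations(reads, freqs, err=0.01):
    """Posterior probability of each grouping of samples, with a uniform prior."""
    partitions = list(set_partitions(list(range(len(reads)))))
    log_liks = [
        sum(group_log_likelihood(g, reads, freqs, err) for g in p)
        for p in partitions
    ]
    top = max(log_liks)  # subtract the maximum before exponentiating, for stability
    weights = [exp(value - top) for value in log_liks]
    norm = sum(weights)
    keys = [tuple(sorted(tuple(sorted(g)) for g in p)) for p in partitions]
    return {key: w / norm for key, w in zip(keys, weights)}


# Toy data: (alt reads, depth) per site. Samples 0 and 1 agree; sample 2 does not.
freqs = [0.5] * 8
sample_a = [(0, 2), (2, 2)] * 4
sample_b = [(0, 0)] + sample_a[1:]  # same source as sample_a, first site missing
sample_c = [(2, 2), (0, 2)] * 4     # reads consistent with a different genotype
reads = [sample_a, sample_b, sample_c]

posterior = posterior_over_relations(reads, freqs)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

In this toy example the grouping that pairs samples 0 and 1 and separates sample 2 receives most of the posterior mass. Because all putative replicates are scored jointly, the output is a single distribution over groupings rather than a set of pairwise comparisons, and no genotype calling, imputation, or phasing is needed.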