
Title: Variable selection in omics data: a practical evaluation of small sample sizes

Authors
item KIRPICH, ALEXANDER - University Of Florida
item Ainsworth, Elizabeth - Lisa
item WEDOW, JESSICA - University Of Illinois
item NEWMAN, JEREMY R B - University Of Florida
item MICHAILIDIS, GEORGE - University Of Florida
item MCINTYRE, LAUREN - University Of Florida

Submitted to: PLOS ONE
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 5/10/2018
Publication Date: 6/2/2018
Citation: Kirpich, A., Ainsworth, E.A., Wedow, J.M., Newman, J., Michailidis, G., McIntyre, L.M. 2018. Variable selection in omics data: a practical evaluation of small sample sizes. PLoS One. 13(6):e0197910.

Interpretive Summary: Transcriptomics and metabolomics experiments pose statistical challenges arising from the large number of features (genes or metabolites) and the small number of samples. The goal of many 'omics experiments is to identify a few features that differ among experimental conditions. Correlation among features, as well as structural aspects of the experimental design such as blocking or batch processing of samples, can complicate analysis. This study tested different statistical approaches for analyzing omics data sets and found that ANOVA is an effective analytical tool for the initial screening of features in omics experiments.

Technical Abstract: In omics experiments, variable selection involves a large number of metabolites/genes and a small number of samples (the n < p problem). The ultimate goal is often the identification of one, or a few, features that differ among conditions: a biomarker. Complicating biomarker identification, the p variables often contain a correlation structure due to the biology of the experiment, making it difficult to distinguish causal compounds from merely correlated ones. Additionally, there may be elements of the experimental design (blocks, batches) that introduce structure into the data. While this problem has been discussed in the literature and various strategies proposed, the overfitting problems concomitant with such approaches are rarely acknowledged. Instead of viewing a single omics experiment as a definitive test for a biomarker, an unrealistic analytical goal, we propose to view such studies as screening studies, where the goal is to reduce the number of features carried into a second round of testing and to limit the Type II error. Using this perspective, the performance of LASSO, ridge regression, and the Elastic Net was compared with that of an ANOVA via a simulation study and two real-data comparisons. Interestingly, a dramatic increase in the number of features had no effect on the Type I error of the ANOVA approach. The ANOVA, even without multiple-test correction, has low false positive rates in the scenarios tested. The Elastic Net has an inflated Type I error (from 10 to 50%) for small numbers of features, which increases with sample size. The Type II error rate for the ANOVA is comparable to, or lower than, that for the Elastic Net, leading us to conclude that an ANOVA is an effective analytical tool for the initial screening of features in omics experiments.
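The screening comparison described above can be sketched in a few lines. The following is a minimal illustration, not the paper's actual simulation: the sample sizes, effect size, penalty settings, and group layout below are assumptions chosen for a toy n < p example. It contrasts per-feature ANOVA screening (a one-way F-test per feature) with Elastic Net feature selection (nonzero coefficients) on simulated two-condition data.

```python
# Toy n < p screening example (illustrative parameters, not the paper's design):
# compare per-feature ANOVA screening with Elastic Net selection.
import numpy as np
from scipy.stats import f_oneway
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n_per_group, p, n_true = 10, 200, 5          # small n, large p
groups = np.repeat([0, 1], n_per_group)

# Features are noise except for n_true truly shifted between conditions.
X = rng.normal(size=(2 * n_per_group, p))
X[groups == 1, :n_true] += 2.0               # signal in the first n_true features

# ANOVA screening: one F-test per feature, keep features with p < 0.05
# (no multiple-test correction, matching the screening perspective).
anova_p = np.array([f_oneway(X[groups == 0, j], X[groups == 1, j]).pvalue
                    for j in range(p)])
anova_hits = np.flatnonzero(anova_p < 0.05)

# Elastic Net: regress the group label on all features at once;
# features with nonzero coefficients are "selected".
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, groups)
enet_hits = np.flatnonzero(enet.coef_ != 0)

print("ANOVA kept:", len(anova_hits), "features")
print("Elastic Net kept:", len(enet_hits), "features")
```

In a screening framing, what matters is that the truly different features survive the cut (low Type II error) while the retained set is far smaller than p; the paper's point is that the simple per-feature ANOVA achieves this without the inflated false positive rates observed for the Elastic Net.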