Title: Training set optimization under population structure in genomic selection

Author
Isidro, Julio - Cornell University
Jannink, Jean-Luc
Akdemir, Deniz - Cornell University
Poland, Jesse - Kansas State University
Heslot, Nicholas - Cornell University
Sorrells, Mark - Cornell University

Submitted to: Theoretical and Applied Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 10/12/2014
Publication Date: 1/5/2015
Publication URL: DOI 10.1007/s00122-014-2418
Citation: Isidro, J., Jannink, J., Akdemir, D., Poland, J., Heslot, N., Sorrells, M. 2015. Training set optimization under population structure in genomic selection. Theoretical and Applied Genetics. 128(1):145-158.

Interpretive Summary: Genomic selection requires a training set (TRS) of individuals with both genotype and phenotype data, on which the prediction model is based. Optimization of the TRS has received much interest in both animal and plant breeding because it is critical to the accuracy of the prediction models. Population structure in a breeding program means that individuals fall into groups and are more closely related within groups than between them. In this study, we evaluated five TRS design algorithms for prediction accuracy under different levels of population structure. Across traits and population sizes, accuracies ranged from 0.12 to 0.59 in the wheat population and from 0.20 to 0.72 in the rice population. The sampling method that captured the most phenotypic variation in the TRS also performed best in the presence of population structure. The wheat dataset showed mild population structure, whereas the rice dataset showed strong population structure, and this difference affected which method performed best. Our results indicate that the best method for optimizing the TRS depends on the level of population structure in the dataset, so population structure plays an important role in TRS optimization.

Technical Abstract: The optimization of the training set (TRS) in genomic selection (GS) has received much interest in both animal and plant breeding because it is critical to the accuracy of the prediction models. In this study, five TRS sampling algorithms were evaluated for prediction accuracy under different levels of population structure: stratified sampling, mean of the coefficient of determination (CDmean), mean of the prediction error variance (PEVmean), stratified CDmean (StratCDmean), and random sampling. Across traits and population sizes, accuracies ranged from 0.12 to 0.59 in the wheat population and from 0.20 to 0.72 in the rice population. The sampling method that captured the most phenotypic variation in the TRS had the best performance in the presence of population structure. The wheat dataset showed mild population structure, and the CDmean and stratified CDmean methods gave the highest accuracies for all traits except test weight and heading date. The rice dataset showed strong population structure, and stratified sampling gave the highest accuracies for all traits. In general, CDmean minimized the relationship between genotypes within the TRS while maximizing the relationship between the TRS and the test set (TS). This makes it suitable as an optimization criterion for long-term selection. Our results indicate that the best criterion for optimizing the TRS depends on the level of population structure in the dataset, so population structure plays an important role in TRS optimization.
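
The CDmean and PEVmean criteria named above can be illustrated with a minimal sketch. It assumes a commonly used GBLUP formulation that the abstract itself does not spell out: a genomic relationship matrix K, an intercept as the only fixed effect, a variance ratio lam = sigma_e^2 / sigma_g^2, and contrasts between each non-training individual and the population mean. The function name `cd_and_pev_mean` and all of its parameters are hypothetical, and this is not necessarily the exact computation used in the study.

```python
import numpy as np


def cd_and_pev_mean(K, train_idx, lam):
    """Mean CD and mean PEV for one candidate training set (illustrative).

    K         : (n, n) relationship matrix, assumed positive definite
    train_idx : indices of the individuals placed in the training set
    lam       : assumed variance ratio sigma_e^2 / sigma_g^2
    PEV is returned in units of sigma_e^2.
    """
    n = K.shape[0]
    train_idx = np.asarray(train_idx)
    n_train = len(train_idx)

    # Z maps the n_train phenotype records onto the n individuals.
    Z = np.zeros((n_train, n))
    Z[np.arange(n_train), train_idx] = 1.0

    # M projects out the only fixed effect assumed here (an intercept).
    M = np.eye(n_train) - np.ones((n_train, n_train)) / n_train

    # Inverse of the genetic block of the mixed-model equations.
    C_inv = np.linalg.inv(Z.T @ M @ Z + lam * np.linalg.inv(K))

    # One contrast per non-training individual: individual minus population mean.
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    T = np.full((n, len(test_idx)), -1.0 / n)
    T[test_idx, np.arange(len(test_idx))] += 1.0

    # Column-wise quadratic forms c' A c for every contrast c.
    def quad(A):
        return np.einsum("ij,ij->j", T, A @ T)

    cd = quad(K - lam * C_inv) / quad(K)
    pev = quad(C_inv) / np.einsum("ij,ij->j", T, T)
    return cd.mean(), pev.mean()
```

Under these assumptions, a search procedure would compare candidate training sets and prefer those with a higher mean CD (or a lower mean PEV); the search strategy itself is left out of this sketch.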