Author
ISIDRO, JULIO - Cornell University
Jannink, Jean-Luc
AKDEMIR, DENIZ - Cornell University
POLAND, JESSE - Kansas State University
HESLOT, NICHOLAS - Cornell University
SORRELLS, MARK - Cornell University
Submitted to: Theoretical and Applied Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 10/12/2014
Publication Date: 1/5/2015
Publication URL: http://DOI: 10.1007/s00122-014-2418
Citation: Isidro, J., Jannink, J., Akdemir, D., Poland, J., Heslot, N., Sorrells, M. 2015. Training set optimization under population structure in genomic selection. Theoretical and Applied Genetics. 128(1):145-158.
Interpretive Summary: Genomic selection requires a training set (TRS) of individuals with genotype and phenotype upon which the prediction model is based. The optimization of the TRS in genomic selection has received much interest in both animal and plant breeding, because it is critical to the accuracy of the prediction models. Population structure in a breeding program indicates the existence of groups such that individuals are more related within than between groups. In this study, we evaluated five different TRS design algorithms for prediction accuracy in the presence of different levels of population structure. Accuracies ranged across traits and population size from 0.12 to 0.59 and from 0.20 to 0.72 in wheat and rice populations, respectively. The sampling method that captured the most phenotypic variation in the TRS also had the best performance in the presence of population structure. The wheat dataset showed mild population structure while the rice dataset had high population structure and this difference affected which method to use. Our results indicated that the best method to optimize the TRS depends on the level of population structure in the dataset, indicating that population structure plays an important role in the optimization of the TRS.
Technical Abstract: The optimization of the training set (TRS) in genomic selection (GS) has received much interest in both animal and plant breeding, because it is critical to the accuracy of the prediction models.
In this study, five TRS sampling algorithms were evaluated for prediction accuracy in the presence of different levels of population structure: stratified sampling, mean of the Coefficient of Determination (CDmean), mean of Predictor Error Variance (PEVmean), stratified CDmean (StratCDmean), and random sampling. Accuracies ranged across traits and population size from 0.12 to 0.59 and from 0.20 to 0.72 in wheat and rice populations, respectively. The sampling method that captured the most phenotypic variation in the TRS had the best performance in the presence of population structure. The wheat dataset showed mild population structure, and the CDmean and StratCDmean methods showed the highest accuracies for all traits except test weight and heading date. The rice dataset had high population structure, and the approach based on stratified sampling showed the highest accuracies for all traits. In general, CDmean minimized the relationship between genotypes within the TRS while maximizing the relationship between the TRS and the test set (TS), which makes it suitable as an optimization criterion for long-term selection. Our results indicated that the best criterion for optimizing the TRS depends on the level of population structure in the dataset, so population structure plays an important role in TRS optimization.
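As a rough illustration of what stratified sampling of a training set can look like, the sketch below draws individuals from each subpopulation in proportion to its size. It is not the procedure used in the study, which this summary does not describe in detail. The function name `stratified_training_set`, the proportional allocation rule, the largest-remainder rounding, and the assumption that each individual already carries a subpopulation label (for example from a prior clustering step) are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def stratified_training_set(groups, n_train, seed=0):
    """Select n_train individuals, sampling each subpopulation in
    proportion to its size.

    groups  : dict mapping individual ID -> subpopulation label
              (labels are assumed to be known in advance)
    n_train : desired training set size
    seed    : seed for the random number generator (reproducibility)
    """
    total = len(groups)
    if not 0 < n_train <= total:
        raise ValueError("n_train must be between 1 and the population size")

    # Collect the members of each subpopulation.
    by_group = defaultdict(list)
    for individual, label in groups.items():
        by_group[label].append(individual)

    # Ideal fractional share of the training set for each subpopulation.
    quotas = {g: n_train * len(m) / total for g, m in by_group.items()}

    # Largest-remainder rounding: take the integer part first, then give
    # the leftover slots to the groups with the largest fractional parts.
    alloc = {g: int(q) for g, q in quotas.items()}
    leftover = n_train - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1

    # Sample without replacement inside each subpopulation.
    rng = random.Random(seed)
    chosen = []
    for g, members in by_group.items():
        chosen.extend(rng.sample(members, alloc[g]))
    return sorted(chosen)


# Toy population with three subpopulations of unequal size.
population = {f"A{i}": "A" for i in range(60)}
population.update({f"B{i}": "B" for i in range(30)})
population.update({f"C{i}": "C" for i in range(10)})

training_set = stratified_training_set(population, n_train=20)
```

With a 60/30/10 split and 20 training individuals, this allocation draws 12, 6, and 2 individuals from the three groups, so every subpopulation is represented. Purely random sampling gives no such guarantee, and small groups can be missed entirely when structure is strong.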