Research Project: Genetic Diversity and Disease Resistance in Maize

Location: Plant Science Research

Title: Don’t BLUP twice

Author
Holland, Jim
Piepho, H.-P. - University of Hohenheim

Submitted to: G3, Genes/Genomes/Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 10/4/2024
Publication Date: N/A
Citation: N/A

Interpretive Summary: Genome-wide association studies are an important tool that geneticists use to identify genes affecting complex traits. In crops, these studies often associate genetic variation with mean trait values of lines replicated within and across environments. There has been confusion about exactly how to compute the mean trait values used as inputs to genome-wide association studies. In this paper we point out that an often-used approach, computing best linear unbiased predictors (BLUPs) of lines, is not optimal. We review best-practice approaches for summarizing trait data as inputs to genome-wide association studies.

Technical Abstract: Large and complex data sets can be difficult to model in a single comprehensive analysis. In many crop experiments, phenotypic data are collected on lines replicated within and across environments. Optimal analyses of these experiments would account for replication and experimental design factors along with genomic relationships among lines. In the context of genome-wide association study (GWAS) scans, fitting complex models accounting for multiple levels of random variation (both genetic and non-genetic) in the data can result in prohibitively large computation time requirements. Ignoring these complexities, however, results in increased residual variance, reduced statistical power, and potential confounding of marker-trait associations with population structure. A useful analytical approach in this situation is two-stage analysis, in which a first stage analysis is conducted to account for the complexities of experimental design and generate summary values for each line. For data from a single trial, this will be a summary value across replicates for each line. For multi-environment data, the summary will be across environments. Importantly, a first-stage model can include many covariance parameters and incur a relatively high computational burden and time cost because it needs to be fit to the data only once per trait, no matter how many markers have been scored. The summary values of each line obtained from the first stage can then be used as phenotypic inputs to the second-stage analysis to test each marker effect. Because the replicated line phenotypes have been reduced to a single summary value already adjusted for any covariates or random non-genetic factors, a computationally efficient GWAS model can be fit to test millions of markers in reasonable time. Two-stage analysis, therefore, is a good compromise that accounts for complex experimental designs and non-genetic factors while keeping computational time constrained. 
However, confusion remains about how to model line effects in two-stage analyses for GWAS, quantitative trait locus (QTL) mapping, and genomic selection (GS) analyses.
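
The two-stage workflow described in the abstract can be illustrated with a minimal Python sketch on simulated data. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (stage_one_line_means, stage_two_marker_scan), the simulated trial dimensions, and the modeling choices. Stage one here fits environment and line as fixed effects by ordinary least squares and returns one adjusted mean per line. Stage two regresses those means on each marker separately. The sketch deliberately omits features a real analysis would likely need, such as random design effects, weights derived from stage-one standard errors, and corrections for population structure or genomic relationships.

```python
import numpy as np


def stage_one_line_means(y, line_idx, env_idx, n_lines, n_envs):
    """Stage 1: fit y = environment + line + error by least squares.

    Lines are treated as fixed effects (an assumption of this sketch).
    Returns one adjusted mean per line, averaged over environments.
    """
    n = y.size
    rows = np.arange(n)
    # One column per environment, plus one per line except line 0 (reference).
    X = np.zeros((n, n_envs + n_lines - 1))
    X[rows, env_idx] = 1.0
    not_ref = line_idx > 0
    X[rows[not_ref], n_envs + line_idx[not_ref] - 1] = 1.0
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    env_effects = beta[:n_envs]
    line_effects = np.concatenate([[0.0], beta[n_envs:]])
    return env_effects.mean() + line_effects


def stage_two_marker_scan(line_means, genotypes):
    """Stage 2: simple linear regression of line means on each marker.

    Vectorized over markers. Assumes every marker is polymorphic;
    a monomorphic marker would cause a division by zero.
    """
    n_lines = genotypes.shape[0]
    yc = line_means - line_means.mean()
    gc = genotypes - genotypes.mean(axis=0)
    sxx = (gc ** 2).sum(axis=0)
    sxy = gc.T @ yc
    slope = sxy / sxx
    resid_ss = (yc ** 2).sum() - slope * sxy
    std_err = np.sqrt(resid_ss / (n_lines - 2) / sxx)
    return slope, slope / std_err


# Simulated balanced trial: every line appears in every environment.
rng = np.random.default_rng(0)
n_lines, n_envs, n_reps, n_markers = 40, 3, 2, 50
genotypes = rng.choice([0.0, 2.0], size=(n_lines, n_markers))  # inbred lines
line_value = 1.5 * genotypes[:, 0] + rng.normal(0.0, 0.5, n_lines)  # marker 0 is causal
env_effect = rng.normal(0.0, 2.0, n_envs)

line_idx = np.repeat(np.arange(n_lines), n_envs * n_reps)
env_idx = np.tile(np.repeat(np.arange(n_envs), n_reps), n_lines)
y = 10.0 + env_effect[env_idx] + line_value[line_idx] + rng.normal(0.0, 1.0, line_idx.size)

# Stage 1 runs once per trait; stage 2 reuses its output for every marker.
line_means = stage_one_line_means(y, line_idx, env_idx, n_lines, n_envs)
slopes, t_stats = stage_two_marker_scan(line_means, genotypes)
print("marker with largest |t|:", int(np.argmax(np.abs(t_stats))))
```

The split mirrors the computational argument in the abstract: the stage-one model, however costly, is fit once per trait, while the stage-two scan reduces to inexpensive array operations that scale to many markers.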