Author
SCHMITZ CARLEY, CARI - University Of Wisconsin
COOMBS, JOSEPH - Michigan State University
DOUCHES, DAVID - Michigan State University
Bethke, Paul
PALTA, JIWAN - University Of Wisconsin
Novy, Richard - Rich
ENDELMAN, JEFFREY - University Of Wisconsin
Submitted to: Theoretical and Applied Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/22/2016
Publication Date: 1/9/2017
Publication URL: http://handle.nal.usda.gov/10113/5678119
Citation: Schmitz Carley, C.A., Coombs, J.J., Douches, D.S., Bethke, P.C., Palta, J.P., Novy, R.G., Endelman, J.B. 2017. Automated tetraploid genotype calling by hierarchical clustering. Theoretical and Applied Genetics. 130(4):717-726. doi: 10.1007/s00122-016-2845-5.
Interpretive Summary: Molecular markers are used to describe differences between genetically distinct plants. Plant breeders and researchers use these markers to identify superior crops or link plant genetics with observed plant characteristics. Describing the molecular markers within an individual is difficult when plants contain four copies of each chromosome in their DNA. Here we report on a computational method developed to do this. The method was found to have high accuracy when tested using data from a wide range of potatoes. By developing a method that simplifies and automates the process of characterizing the genetic makeup of individual plants in crops such as potato, highbush blueberry and alfalfa, we have enabled plant breeders and others to better utilize molecular marker data for research and crop improvement.
Technical Abstract: SNP arrays are transforming breeding and genetics research for autotetraploids. To fully utilize these arrays, however, the relationship between signal intensity and allele dosage must be inferred independently for each marker. We developed an improved computational method to automate this process, which is provided as the R package ClusterCall. In the training phase of the algorithm, hierarchical clustering within a biparental family is used to group samples with similar intensity values, and allele dosages are assigned to clusters based on expected segregation ratios.
In the prediction phase, multiple biparental families and the prediction set are clustered together, and the genotype for each cluster is the mode of the training set samples. A concordance metric, defined as the proportion of training set samples equal to the mode, can be used to eliminate unreliable markers and compare different algorithms. Across three potato families genotyped with an 8K potato SNP array, ClusterCall had 4711 markers, representing 98.5% of the total, with at least 95% concordance. By comparison, the benchmark software fitTetra scored 4600 markers, representing 86.0% of its total, above the same threshold. The three families were used to predict genotypes for 3716 SNPs in the SolCAP diversity panel, compared with 3521 SNPs in a previous study where genotypes were called manually. One of the additional markers produced a significant association for vine maturity near a well-known causal locus on chromosome 5. In conclusion, ClusterCall is an efficient method for making accurate genotype calls that enables those working with tetraploids to better utilize SNP data for research and plant breeding.
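
The prediction-phase rule and the concordance metric described in the abstract can be illustrated for a single marker with a minimal Python sketch. This is not the ClusterCall R package or its API. The function name `call_by_cluster_mode`, the input dictionaries, and the choice to return `None` for clusters that contain no training samples are all assumptions made for illustration, and the cluster labels are taken as given rather than computed by hierarchical clustering.

```python
from collections import Counter


def call_by_cluster_mode(cluster_ids, training_dosage):
    """Assign each cluster the modal dosage of its training samples.

    cluster_ids: dict mapping sample name -> cluster label for ONE marker,
        covering training and prediction samples clustered together
        (hypothetical input; the clustering step itself is not shown).
    training_dosage: dict mapping training sample name -> allele dosage (0-4).
    Returns (calls, concordance).
    """
    # Collect the training-set dosages that fall in each cluster.
    dosages_by_cluster = {}
    for sample, dosage in training_dosage.items():
        dosages_by_cluster.setdefault(cluster_ids[sample], []).append(dosage)

    # The genotype for a cluster is the mode of its training samples.
    # Ties are broken arbitrarily (first value encountered).
    cluster_mode = {
        cluster: Counter(dosages).most_common(1)[0][0]
        for cluster, dosages in dosages_by_cluster.items()
    }

    # Assumption: clusters with no training samples get a missing call (None).
    calls = {s: cluster_mode.get(c) for s, c in cluster_ids.items()}

    # Concordance: proportion of training samples whose dosage equals the
    # mode of the cluster they were placed in.
    agree = sum(
        1 for s, d in training_dosage.items() if cluster_mode[cluster_ids[s]] == d
    )
    concordance = agree / len(training_dosage)
    return calls, concordance


# Toy data: five training samples (t1-t5) and three prediction samples (p1-p3).
cluster_ids = {"t1": 1, "t2": 1, "t3": 1, "t4": 2, "t5": 2,
               "p1": 1, "p2": 2, "p3": 3}
training_dosage = {"t1": 0, "t2": 0, "t3": 1, "t4": 2, "t5": 2}

calls, concordance = call_by_cluster_mode(cluster_ids, training_dosage)
print(calls["p1"], calls["p2"], calls["p3"], concordance)  # 0 2 None 0.8
```

In this toy example four of the five training samples match their cluster mode, so the marker's concordance of 0.8 would fall below the 95% threshold used in the abstract and the marker would be filtered out.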