Cluster Analysis - Intro
A recurring decision for the NAGP is which males to collect for the gene bank. To capture the maximum genetic diversity of each breed, we try to sample lowly related males from across the population. The least related males should have the fewest alleles in common, which allows us to capture all (or most) of the allelic diversity within the breed. Pedigree relationship data are used to sort the available pool of animals into clusters whose members are highly related to each other and lowly related to animals in other clusters. Individual males within each cluster can then be targeted for collection. If a chosen male is not available, a substitute from the same cluster can be used instead. Because the active animals within a breed are constantly changing, the clusters change over time, too. Cluster analysis is an ongoing tool used to assess the current status of the collection and to plan which additional animals to add to the repository.
The Clustering Process
After receiving a pedigree file from a participating breed association, we create a list of the current breeding male population to cluster. The SAS software (Version 9.2; SAS, 2009) limits us to fewer than 5,000 animals for the clustering procedure, so the 'current breeding population' is defined differently depending on breed registration counts, species, etc. For a large dataset, we may need to limit the list to males that have produced more than a minimum number of offspring. Once the list is determined, the pedigrees of these males are used to calculate the relationships among all of the animals. Each relationship is subtracted from one to generate a file of distances between animals.
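The relationship and distance step can be illustrated with a minimal Python sketch. It is not the SAS workflow itself. It assumes the relationships are additive relationships computed with the standard tabular method, that the pedigree is ordered with parents before offspring, and that each animal's distance to itself is set to zero. The function names and the toy pedigree are hypothetical.

```python
def relationship_matrix(pedigree):
    """Additive relationships by the tabular method.

    pedigree: list of (animal, sire, dam) tuples, parents listed before
    their offspring; use None for an unknown parent (an assumption of
    this sketch, not a format taken from the source).
    """
    index = {animal: i for i, (animal, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (_, sire, dam) in enumerate(pedigree):
        s = index.get(sire)  # None when the sire is unknown
        d = index.get(dam)   # None when the dam is unknown
        # Diagonal is 1 + inbreeding, where inbreeding = 0.5 * a(sire, dam).
        inbreeding = 0.5 * A[s][d] if s is not None and d is not None else 0.0
        A[i][i] = 1.0 + inbreeding
        # Relationship with each earlier animal is the mean of that
        # animal's relationships with the two parents.
        for j in range(i):
            a = 0.0
            if s is not None:
                a += 0.5 * A[j][s]
            if d is not None:
                a += 0.5 * A[j][d]
            A[i][j] = A[j][i] = a
    return A


def distance_matrix(A):
    """Distance = 1 - relationship; self-distance is fixed at zero."""
    n = len(A)
    return [[0.0 if i == j else 1.0 - A[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical toy pedigree: B1 and B2 are paternal half sibs, B3 is unrelated.
toy_pedigree = [
    ("S1", None, None),
    ("D1", None, None),
    ("D2", None, None),
    ("B1", "S1", "D1"),
    ("B2", "S1", "D2"),
    ("B3", None, None),
]
A = relationship_matrix(toy_pedigree)
D = distance_matrix(A)
```

In this toy example the half sibs B1 and B2 have a relationship of 0.25 and therefore a distance of 0.75, while the unrelated B3 sits at a distance of 1.0 from both.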
The Ward method of the Cluster procedure in SAS initially places each animal in its own cluster and then joins closely related animals into clusters, step by step, until the entire list is in a single cluster. This is what the cluster graph shows: a cluster tree with a single cluster at the top and every animal in its own cluster at the bottom. We use a combination of t-statistics and input from the breed association to look for a natural break in the cluster output, which determines the number of clusters we use. There should be enough clusters to adequately split the breed into groups, but not so many that each cluster is reduced to a specific breeder or a bull/son group. Because the number of clusters is chosen from a combination of practical input from the breed associations and statistics generated by the analysis, the decision is more art than science.
We have taken a slightly different approach to determining clusters for pig breeds. Unlike cattle breeds, for which we receive material that is already cryopreserved, we receive fresh boar collections and do the cryopreservation in our lab. We want 100 boars per major breed in the repository, so we set the number of clusters to 100 for those breeds. This eliminates the need to make decisions within a cluster, which simplifies the process while still bringing the least related animals into the collection.
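The joining process described above can be sketched in a few lines of Python. This is an illustration only, not the SAS Cluster procedure: it assumes Ward's method is applied through the Lance-Williams update on squared distances, it stops when a requested number of clusters remains, and it does not compute the statistics used to look for a natural break. The function name `ward_clusters` and the toy distance matrix are hypothetical.

```python
def ward_clusters(dist, n_clusters):
    """Agglomerative clustering with Ward's criterion.

    dist: symmetric matrix (list of lists) of distances between animals.
    n_clusters: number of clusters at which to stop joining.
    Returns a sorted list of clusters, each a sorted list of animal indices.
    """
    n = len(dist)
    if not 1 <= n_clusters <= n:
        raise ValueError("n_clusters must be between 1 and the number of animals")

    # Every animal starts in its own cluster.
    members = {i: [i] for i in range(n)}
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n

    while len(members) > n_clusters:
        # Join the pair of clusters with the smallest Ward distance.
        i, j = min(d2, key=d2.get)
        ni, nj, dij = len(members[i]), len(members[j]), d2[(i, j)]

        # Lance-Williams update for Ward's method (on squared distances).
        updated = {}
        for k, animals in members.items():
            if k in (i, j):
                continue
            nk = len(animals)
            dki = d2[(min(k, i), max(k, i))]
            dkj = d2[(min(k, j), max(k, j))]
            updated[k] = ((ni + nk) * dki + (nj + nk) * dkj - nk * dij) / (ni + nj + nk)

        # Drop distances involving the two joined clusters, then add the new one.
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        members[next_id] = members.pop(i) + members.pop(j)
        for k, v in updated.items():
            d2[(k, next_id)] = v  # next_id is always the larger id
        next_id += 1

    return sorted(sorted(animals) for animals in members.values())


# Hypothetical distances: animals 0-1 and 2-3 form two closely related pairs.
toy_dist = [
    [0.0, 0.5, 1.0, 1.0],
    [0.5, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.5],
    [1.0, 1.0, 0.5, 0.0],
]
print(ward_clusters(toy_dist, 2))  # [[0, 1], [2, 3]]
```

Under the fixed-count approach described for pig breeds, the same call would simply be made with `n_clusters=100`. This sketch rescans all pairs at every step, so it is only suitable for small illustrations.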
Once the number of clusters is determined, the relationship within and between clusters is computed. These values are a good indicator of how well the number of clusters was chosen: within-cluster relationships should be high and between-cluster relationships should be low. How high these relationships are will also depend on the overall level of relationship within the breed.
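One way to compute such a summary is sketched below in Python. It assumes the within- and between-cluster values are simple means of pairwise relationships, and that an animal's relationship with itself is excluded from the within-cluster mean; the source does not specify either detail. The function name and the toy values are hypothetical.

```python
def cluster_relationship_summary(A, clusters):
    """Mean pairwise relationship within and between clusters.

    A: symmetric relationship matrix (list of lists).
    clusters: list of clusters, each a list of animal indices into A.
    Returns a k x k matrix: diagonal entries are within-cluster means
    (None for a cluster with a single animal), off-diagonal entries are
    between-cluster means.
    """
    k = len(clusters)
    summary = [[None] * k for _ in range(k)]
    for p in range(k):
        for q in range(p, k):
            if p == q:
                # Distinct pairs only, so self-relationships are excluded.
                values = [A[i][j] for i in clusters[p] for j in clusters[p] if i < j]
            else:
                values = [A[i][j] for i in clusters[p] for j in clusters[q]]
            mean = sum(values) / len(values) if values else None
            summary[p][q] = summary[q][p] = mean
    return summary


# Hypothetical relationships for four animals in two clusters.
toy_A = [
    [1.0, 0.5, 0.1, 0.0],
    [0.5, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.25],
    [0.0, 0.1, 0.25, 1.0],
]
summary = cluster_relationship_summary(toy_A, [[0, 1], [2, 3]])
```

Here the within-cluster means (0.5 and 0.25) are well above the between-cluster mean (0.05), which is the pattern expected when the clusters separate the breed well.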
For each cluster analysis, the following is displayed:
- A graph of the clusters
- A list of within-cluster relationships, number of animals per cluster, and number of repository males per cluster
- A matrix of within- and between-cluster relationships
Literature Cited:
SAS Institute Inc. 2009. Base SAS® 9.2 Procedures Guide. Cary, NC: SAS Institute Inc.