Publication : USDA ARS

Publication #150080

Title: WOLFPAC: BUILDING A HIGH-PERFORMANCE DISTRIBUTED COMPUTING NETWORK FOR PHYLOGENETIC ANALYSIS USING "OBSOLETE" COMPUTATIONAL RESOURCES

Author

	Reeves, Patrick
	FRIEDMAN, PHILIP - CSU, DEPT. OF NATURAL SCI
	Richards, Christopher

Submitted to: Applied Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 5/28/2004
Publication Date: 7/1/2005
Citation: Reeves, P.A., Friedman, P.H., Richards, C.M. 2005. Wolfpac: Building a high-performance distributed computing network for phylogenetic analysis using "obsolete" computational resources. Applied Bioinformatics 4:61-64.

Interpretive Summary: Mathematically complex, computer-intensive procedures are necessary to reconstruct historical relationships among species or populations from genetic data. Rigorous analyses of large data sets may require months or even years of processing time on a single computer. This software allows a researcher to use many computers simultaneously, during times of the day when they would otherwise be idle, to dramatically reduce the overall time necessary to complete an analysis. The software can be made available to the research community at large.

Technical Abstract: The reconstruction of phylogenetic trees under the maximum parsimony criterion is an NP-complete (nondeterministic polynomial time complete) problem (Foulds and Graham, 1982). For unrooted, bifurcating trees, the number of possible solutions increases as (2n-5)!/[2^{n-3}(n-3)!], where n is the number of terminals; thus, a significant computational effort is required to discover optimal solutions. Other commonly used optimality criteria, such as maximum likelihood, share a similar level of computational complexity (Sanderson and Kim, 2000). While heuristic methods have been developed to expedite searches, local optima and islands of equivalent optima are common on both the parsimony and likelihood surfaces (Maddison, 1991; Salter, 2001). Therefore, simple hill-climbing algorithms alone cannot guarantee that global optima will be discovered (Sanderson and Kim, 2000). To increase the probability of finding all globally optimal trees, the use of multiple independent searches, with a random starting point for each search, has been suggested (Maddison, 1991; Salter, 2001). For very large data sets (e.g. > 500 terminals), the computer processor time necessary to complete multiple independent searches can be prohibitive. While refinements in search strategy (Salter and Pearl, 1997; Goloboff, 1999; Nixon, 1999), new search algorithms (Dopazo and Carazo, 1997; Lewis, 1998), and parallel computing approaches (Snell et al., 2000; Brauer et al., 2002) have shown promise for decreasing overall search times, phylogenetic data sets continue to grow, and the computational resources available to a typical researcher may not be adequate to complete rigorous analyses in a timely manner.
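The growth in the number of unrooted, bifurcating trees can be made concrete with a short calculation. The sketch below is illustrative only: it assumes the standard closed form (2n-5)!/[2^(n-3)(n-3)!] for n >= 3 labeled terminals, and the function name unrooted_tree_count is a hypothetical helper, not part of wolfPAC or PAUP*.

```python
from math import factorial


def unrooted_tree_count(n: int) -> int:
    """Count distinct unrooted, bifurcating trees on n labeled terminals.

    Uses (2n-5)! / [2^(n-3) * (n-3)!], which is defined for n >= 3.
    """
    if n < 3:
        raise ValueError("the formula applies only for n >= 3 terminals")
    # The division is exact, so integer floor division loses nothing.
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    # Print the count for a few terminal numbers to show how fast it grows.
    for n in (4, 5, 10, 20, 50):
        print(f"n = {n:>2}: {unrooted_tree_count(n):.3e} trees")
```

With only 10 terminals there are already 2,027,025 candidate trees, which is why exhaustive evaluation is not an option for data sets with hundreds of terminals.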
To facilitate rapid and thorough searches of phylogenetic tree space, we have developed a distributed computing environment, wolfPAC, that utilizes the batch processing capability of the phylogenetic analysis software PAUP* (Swofford, 1999) to perform multiple, independent searches on numerous networked Macintosh computers. The wolfPAC analysis environment is a hierarchically organized network of processors that communicate via AppleScript using the Apple Filing Protocol (AFP) over an IP connection (Figure 1A). Core elements include a server-side directory structure that mediates job acquisition and acts as a repository for result files generated by the client processors. Two AppleScripts are used in conjunction with third-party script scheduling software to trigger a job query from the client, commence a search, and establish search duration. Searches may be scheduled to utilize recurring idle periods (e.g. overnight) in large Macintosh computer labs. In addition to streamlining the initiation and termination of PAUP* runs on numerous computers, wolfPAC offers features that facilitate sharing of computational resources among researchers. First, because it operates over IP, users can queue jobs and recover results from any Macintosh with internet access. Second, jobs may be submitted with two levels of priority. A researcher with a local wolfPAC may permit colleagues to access secondary priority processing time with the knowledge that, should the need arise, that access can be preempted by submission of a primary priority job. An optional third priority level is available to provide anonymous user access when no higher priority jobs have been submitted. This allows a wolfPAC to be shared with the phylogenetics research community at large when it is not in use by its owner. Existing parallel processing systems for phylogenetic analysis have shown decreasing marginal performance as processors are added (Snell et al., 2000; Brauer et al., 2002).
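The priority scheme described above can be sketched as a directory-based job queue. This is a minimal illustration in Python, not the AppleScript that wolfPAC uses; the directory names (primary, secondary, anonymous, running), the function claim_next_job, and the choice of the alphabetically first file within a queue are all assumptions made for the example, since the abstract does not describe the server layout.

```python
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical queue directories, highest priority first. The abstract does
# not name the actual wolfPAC server directories.
PRIORITY_DIRS = ("primary", "secondary", "anonymous")


def claim_next_job(server_root: Path, client_id: str) -> Optional[Path]:
    """Claim one job file for a client, honoring queue priority.

    Scans the queues from highest to lowest priority, takes the first file
    (sorted by name) from the first non-empty queue, moves it into
    running/<client_id>/, and returns the new path. Returns None if every
    queue is empty.
    """
    for level in PRIORITY_DIRS:
        queue = server_root / level
        if not queue.is_dir():
            continue
        jobs = sorted(p for p in queue.iterdir() if p.is_file())
        if jobs:
            dest_dir = server_root / "running" / client_id
            dest_dir.mkdir(parents=True, exist_ok=True)
            dest = dest_dir / jobs[0].name
            # Moving the file out of the queue marks the job as taken.
            shutil.move(str(jobs[0]), str(dest))
            return dest
    return None
```

Because each client asks for work only when it starts a new search, a newly submitted primary job takes precedence at the next query without any need to interrupt searches already running. This sketch does not guard against two clients claiming the same file at the same moment.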
Systems which use a "massively serial" approach should exhibit near-linear scaling properties for replicated procedures (until a fixed minimum time, equal