Publication : USDA ARS

Publication #167615

Title: APPLICATION OF MACHINE LEARNING PROGRAMS TOWARDS ACCELERATING POLYMORPHISMS DISCOVERY

Author

	MATUKUMALLI, LAKSHMI - GEORGE MASON UNIVERSITY
	GREFENSTETTE, JOHN - GEORGE MASON UNIVERSITY
	Van Tassell, Curtis - Curt
	CHOII, IK-YOUNG - 1275-31-00
	Cregan, Perry

Submitted to: Meeting Abstract
Publication Type: Abstract Only
Publication Acceptance Date: 8/3/2004
Publication Date: 10/21/2004
Citation: Matukumalli, L.K., Grefenstette, J.J., Van Tassell, C.P., Choii, I., Cregan, P.B. 2004. Application of machine learning programs towards accelerating polymorphisms discovery [abstract]. 7th Annual Conference on Computational Genomics. p. 30.

Technical Abstract: Alongside whole-genome sequencing projects, major efforts are now directed at identifying sequence variations and haplotypes among individuals or species. Results from computational tools that identify single nucleotide polymorphisms (SNPs) from sequence data must be annotated by experts to reject false SNPs. A machine learning (ML) program for confirming polymorphisms can reduce this expert intervention, and with it cost and time. The PolyBayes program was used to analyze polymorphisms across several genotypes of soybean, an inbred species. Its prediction accuracy was only 50%, even for polymorphisms to which PolyBayes assigned a probability of 1.00. We carefully selected a set of 10 parameters that can influence the expert's decision and used 2,417 expert-evaluated polymorphisms identified by PolyBayes (1,066 true, 1,351 false) to train an ML program called C4.5. The prediction accuracy was 90.6%. We then optimized the parameters and re-evaluated the polymorphisms that the ML program had misclassified, which increased the prediction accuracy to 97.7%. The optimized parameters were tested on a large data set of 17,590 expert-evaluated polymorphisms (2,445 true, 15,145 false), where the average prediction accuracy in 5-fold cross-validation was 97.3%. This program, along with a web interface for viewing sequence assemblies, was implemented as part of an SNP pipeline.
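
The abstract does not list the 10 parameters or the C4.5 settings that were used, so the following Python sketch is only a toy illustration of the general idea, not the authors' program. It picks a single threshold split by a simplified C4.5-style gain ratio (a one-level tree, without C4.5's pruning or its other refinements) and estimates its accuracy by 5-fold cross-validation. The two features (`quality`, `depth`), all function names, and the synthetic data are assumptions made for the example.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Simplified C4.5-style gain ratio for the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    total = len(labels)
    weights = (len(left) / total, len(right) / total)
    gain = entropy(labels) - weights[0] * entropy(left) - weights[1] * entropy(right)
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info


def fit_stump(rows, labels):
    """Choose the (feature, threshold) with the highest gain ratio: a one-level tree."""
    best_score, best_f, best_t = -1.0, 0, rows[0][0]
    for f in range(len(rows[0])):
        values = [r[f] for r in rows]
        # Dropping the largest value keeps the right-hand branch non-empty.
        for t in sorted(set(values))[:-1]:
            score = gain_ratio(values, labels, t)
            if score > best_score:
                best_score, best_f, best_t = score, f, t
    majority = Counter(labels).most_common(1)[0][0]
    left = Counter(y for r, y in zip(rows, labels) if r[best_f] <= best_t)
    right = Counter(y for r, y in zip(rows, labels) if r[best_f] > best_t)
    left_label = left.most_common(1)[0][0] if left else majority
    right_label = right.most_common(1)[0][0] if right else majority
    return best_f, best_t, left_label, right_label


def cross_val_accuracy(rows, labels, k=5, seed=0):
    """Mean held-out accuracy of the stump over k shuffled folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        f, t, low, high = fit_stump([rows[i] for i in train], [labels[i] for i in train])
        correct = sum((low if rows[i][f] <= t else high) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


# Synthetic stand-in data (NOT the soybean data): a hypothetical `quality`
# feature that separates the classes and a hypothetical `depth` feature
# that is pure noise.
rng = random.Random(42)
rows, labels = [], []
for _ in range(100):
    rows.append((rng.gauss(40, 4), rng.gauss(10, 3)))  # expert-confirmed SNP
    labels.append("True")
    rows.append((rng.gauss(20, 4), rng.gauss(10, 3)))  # rejected SNP
    labels.append("False")

accuracy = cross_val_accuracy(rows, labels, k=5)
print(f"mean 5-fold accuracy: {accuracy:.3f}")
```

A full C4.5 tree applies this kind of split recursively and then prunes the result. On a data set as imbalanced as the larger one described above (15,145 of 17,590 false), accuracy is best read against the share of the majority class.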