Publication : USDA ARS

Publication #185172

Title: MICROARRAY DATA ANALYSIS USING MACHINE LEARNING METHODS

Author

	RESSOM, HABTOM - GEORGETOWN UNIV, WASH, DC
	Lakshman, Dilip
	YUN, SONG JOONG - INST OF AG SCI/TECH,KOREA
	PRAMANIK, SAROJ - MORGAN STATE UNIV,BALT,MD
	REYES, B.G. DE LOS - UNIV OF MAINE, ORONO, ME

Submitted to: Biosystems Engineering
Publication Type: Book / Chapter
Publication Acceptance Date: 6/1/2010
Publication Date: 7/29/2010
Citation: Ressom, H., Lakshman, D.K., Yun, S., Pramanik, S.K., Reyes, B. 2010. Microarray data analysis using machine learning methods. In: Nag, A., editor. Biosystems Engineering. India: Prentice Hall of India Learning Private Limited. p. 1-32.

Interpretive Summary: In a microarray, DNA molecules representing many genes are placed in discrete spots on a microscope slide to study gene expression patterns. Microarray technology allows us to look at many genes at once and determine which are expressed in a particular cell type. Microarrays are increasingly used to study genome-wide changes in gene expression, genome organization, and chromatin structure. Using this technology, we can now classify genes based on their expression profiles, infer their physiological roles, and construct databases of genes involved in specific functions. Microarray data can be organized as matrices in which the rows represent genes (or clones) and the columns represent sample phenotypes or experimental conditions. Each entry in the matrix corresponds to the expression level of a gene for a given condition or sample. The set of entries in a row or a column forms an expression pattern. A gene expression data matrix may consist of tens of thousands of rows (genes) and tens to hundreds of columns (samples). Often the “raw” microarray data are not the best data for discovering biological knowledge. Low-level analysis methods are applied to the raw data to reduce background noise, normalize, and transform the data into a form acceptable to a selected analysis method. Once the data are properly pre-processed, high-level analysis methods are applied to extract biologically significant information, such as identifying differentially expressed genes, clustering genes to identify a new transduction pathway or novel genes that may be co-regulated through the same known pathway, discovering or predicting unknown phenotypic classes, selecting genes that may have a functional role in specific phenotypes, and deciphering gene regulatory networks. Many computational methods have been proposed to perform low- and high-level analysis of microarray data. Machine learning methods have gained momentum in recent years for use in high-level microarray data analysis. This review discusses these aspects of microarray data analysis.
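
The matrix layout and the low-level steps described in the summary can be sketched in a few lines of Python. This is a minimal illustration only, not the chapter's method: the gene names, sample names, and intensity values are invented, and log2 transformation followed by per-sample median centering is just one simple normalization choice among many.

```python
import math
from statistics import median

# Hypothetical toy data: rows are genes, columns are samples (values invented).
genes = ["geneA", "geneB", "geneC", "geneD"]
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
raw = [
    [120.0, 150.0, 480.0, 610.0],
    [300.0, 280.0, 310.0, 295.0],
    [900.0, 1100.0, 220.0, 260.0],
    [60.0, 75.0, 64.0, 70.0],
]

def log2_transform(matrix, floor=1.0):
    """Log2-transform intensities, flooring values so the log is defined."""
    return [[math.log2(max(v, floor)) for v in row] for row in matrix]

def median_center_columns(matrix):
    """Subtract each sample's (column's) median so every sample has median 0."""
    n_cols = len(matrix[0])
    col_medians = [median(row[j] for row in matrix) for j in range(n_cols)]
    return [[row[j] - col_medians[j] for j in range(n_cols)] for row in matrix]

normalized = median_center_columns(log2_transform(raw))

# A row is one gene's expression pattern across samples;
# a column is one sample's expression pattern across genes.
gene_pattern = dict(zip(samples, normalized[genes.index("geneA")]))
sample_pattern = dict(
    zip(genes, [row[samples.index("treat_1")] for row in normalized])
)
```

After this step every column of `normalized` has a median of zero, so samples are on a comparable scale before any high-level analysis is applied.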

Technical Abstract: This chapter introduces computational methods for the analysis of microarray data, including gene clustering, marker gene selection, prediction of phenotypic classes, and modeling of genetic networks. As rapidly expanding microarray technology generates large-volume, high-dimensional data, the number of reported applications of machine learning methods is expected to increase. With the increasing demand, however, comes the need for further improvements that can make the implementation of machine learning algorithms in microarray data analysis more efficient. Key improvements include: (i) enhanced computational power to handle high-dimensional, large-volume data; (ii) improved microarray technology with high-resolution scanners, low background noise, low technical variability, etc.; (iii) enhanced quality control and protocols; (iv) well-designed low-level analysis methods for background correction, cross-talk removal, normalization, outlier screening, and summary measures; (v) improved visualization tools to assess data quality and interpret results; (vi) better data storage and retrieval mechanisms; and (vii) advances in machine learning methods to enhance their speed and make them more accessible to the user.
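
Two of the tasks named in the abstract, marker gene selection and prediction of phenotypic classes, can be illustrated with a small sketch. The abstract does not say which algorithms the chapter uses, so the choices here are assumptions made for illustration: genes are ranked by the absolute Welch t-statistic between two classes, and a new sample is assigned to the nearest class centroid. All gene names, class labels, and expression values are invented.

```python
import math
from statistics import mean, variance

# Hypothetical log-scale expression values (invented for illustration).
# The first three columns belong to class "A", the last three to class "B".
expr = {
    "gene1": [2.1, 2.3, 1.9, 5.0, 5.2, 4.8],
    "gene2": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],
    "gene3": [6.5, 6.1, 6.3, 2.0, 2.4, 2.2],
    "gene4": [1.0, 1.4, 0.8, 1.1, 0.9, 1.3],
}
labels = ["A", "A", "A", "B", "B", "B"]

def welch_t(values, labels, pos="A", neg="B"):
    """Welch t-statistic comparing one gene's values between two classes."""
    a = [v for v, lab in zip(values, labels) if lab == pos]
    b = [v for v, lab in zip(values, labels) if lab == neg]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def select_markers(expr, labels, k):
    """Return the k genes with the largest absolute t-statistic."""
    ranked = sorted(
        expr, key=lambda g: abs(welch_t(expr[g], labels)), reverse=True
    )
    return ranked[:k]

def class_centroids(expr, labels, markers):
    """Mean expression of each marker gene within each class."""
    result = {}
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        result[cls] = [mean(expr[g][i] for i in idx) for g in markers]
    return result

def predict(sample, centroids):
    """Assign the sample to the class with the nearest centroid (Euclidean)."""
    def distance(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return min(centroids, key=lambda cls: distance(sample, centroids[cls]))

markers = select_markers(expr, labels, k=2)
centroids = class_centroids(expr, labels, markers)

# Hypothetical unlabeled sample, restricted to the selected marker genes.
new_sample = {"gene1": 4.9, "gene2": 3.0, "gene3": 2.1, "gene4": 1.0}
predicted = predict([new_sample[g] for g in markers], centroids)
```

Restricting the classifier to a few marker genes is one common way to cope with the high dimensionality the abstract mentions, since real matrices have far more genes than samples. In practice the selection step would need to be cross-validated together with the classifier to avoid optimistic error estimates.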