Cluster Analysis for Gene Expression Data: A Survey

Daxin Jiang, Chun Tang, and Aidong Zhang

Reviewer

Zhan Gao

zgao014@aucklanduni.ac.nz

Keywords

Microarray, Genes, Samples, Clustering, Unsupervised, Distance, Homogeneity, Separation

Related Papers

[1] P.-N. Tan, M. Steinbach, and V. Kumar, "Introduction to Data Mining," Chapter 8, 2006.

[2] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On Clustering Validation Techniques," Intelligent Information Systems J., 2001.

[3] A.W. Moore, "http://www.cs.cmu.edu/afs/andrew/course/15/381-f08/www/lectures/clustering.pdf", Carnegie Mellon University, Lecture Notes.

[4] D. Jiang, J. Pei, and A. Zhang, "DHC: A Density-Based Hierarchical Clustering Method for Time-Series Gene Expression Data," Proc. BIBE2003: Third IEEE Int'l Symp. Bioinformatics and Bioeng., 2003.

Summary

This paper first introduces the concepts of microarray technology and lays out the basic elements of clustering gene expression data. For example, clustering is a form of unsupervised classification, and the proximity between data objects in gene expression data can be measured by Euclidean distance. The authors then divide cluster analysis into three categories: gene-based clustering, sample-based clustering, and subspace clustering. They present the specific challenges of each category and introduce several representative approaches and clustering algorithms, including K-means, hierarchical clustering, graph-theoretical clustering, model-based clustering, and density-based hierarchical clustering. At the end of the paper, they discuss the validation of clustering results from three aspects: clustering quality, agreement with a reference partition, and reliability.

Evaluation

A good introduction to measuring the proximity between object points with Euclidean distance, which is a basis for the clustering techniques.
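
As a small illustration of that measure, here is a minimal sketch of Euclidean distance between two expression profiles. The function name and the sample values are hypothetical and not taken from the paper.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same number of conditions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical expression levels of two genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean_distance(gene_a, gene_b))
```

Note that these two made-up profiles have the same shape but different magnitudes, and Euclidean distance still treats them as far apart.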

Briefly introduces a variety of clustering algorithms, such as K-means and hierarchical clustering.

The algorithms are introduced only very briefly in this paper, so more research is needed to understand them in detail.
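
To give a feel for what such an algorithm looks like in detail, below is a rough sketch of the textbook K-means loop. It is not the exact variant described in the paper. Seeding the centroids with the first k points is my own simplifying assumption, and all names and data are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points, k, max_iter=100):
    # Assumption: the first k points seed the centroids (real
    # implementations usually pick the seeds at random).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical 2-D profiles forming two obvious groups.
points = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.2, 4.9]]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```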

This paper introduces good validation measures for clustering quality: homogeneity and separation.
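
The following sketch shows one way these two quantities could be computed. It assumes average pairwise Euclidean distance as the underlying measure. The paper's own formulas may be stated in terms of a similarity measure instead, in which case the direction of "better" is reversed. Function names and data are hypothetical.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def within_cluster_distance(cluster):
    """Average pairwise distance inside one cluster.

    Smaller values mean a more homogeneous cluster (assumption: distance,
    not similarity, is used as the proximity measure).
    """
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # a single-object cluster has no pairs to compare
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def between_cluster_distance(cluster_1, cluster_2):
    """Average distance over all cross-cluster pairs.

    Larger values mean the two clusters are better separated.
    """
    total = sum(euclidean(a, b) for a in cluster_1 for b in cluster_2)
    return total / (len(cluster_1) * len(cluster_2))


# Hypothetical clusters of 2-D profiles.
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[10.0, 0.0], [10.0, 2.0]]
print(within_cluster_distance(cluster_a))
print(between_cluster_distance(cluster_a, cluster_b))
```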

Interesting approaches for evaluating the agreement between clustering results and the "ground truth": the Rand index, the Jaccard coefficient, and the Minkowski measure.

However, how to implement the "ground truth" binary matrix is not specified in this paper, and I could not find the answer even after further research.
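
One plausible construction, offered as an assumption and not as something confirmed from the paper: build an n x n binary matrix whose entry (i, j) is 1 when objects i and j carry the same label, once for the clustering result and once for the reference labels, then count how the object pairs agree. The index formulas below follow the usual pair-counting definitions. The denominator I use for the Minkowski measure (pairs that are together in the ground truth) is my reading and should be checked against the paper. All names and labels are hypothetical.

```python
import math
from itertools import combinations


def same_cluster_matrix(labels):
    """n x n binary matrix: entry (i, j) is 1 if i and j share a label."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def pair_counts(clustering, truth):
    """Count object pairs by agreement between the two binary matrices."""
    c = same_cluster_matrix(clustering)
    p = same_cluster_matrix(truth)
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(clustering)), 2):
        if c[i][j] and p[i][j]:
            n11 += 1  # together in both
        elif c[i][j]:
            n10 += 1  # together in the clustering only
        elif p[i][j]:
            n01 += 1  # together in the ground truth only
        else:
            n00 += 1  # separated in both
    return n11, n10, n01, n00


# Degenerate cases with a zero denominator are not handled in this sketch.
def rand_index(n11, n10, n01, n00):
    return (n11 + n00) / (n11 + n10 + n01 + n00)


def jaccard_coefficient(n11, n10, n01, n00):
    return n11 / (n11 + n10 + n01)


def minkowski_measure(n11, n10, n01, n00):
    return math.sqrt((n10 + n01) / (n11 + n01))


# Hypothetical labels for five objects.
clustering = [0, 0, 1, 1, 1]
truth = [0, 0, 0, 1, 1]
counts = pair_counts(clustering, truth)
print(counts)
print(rand_index(*counts), jaccard_coefficient(*counts), minkowski_measure(*counts))
```

Under this reading, the "ground truth" matrix needs nothing more than a reference label per object, for example known functional categories of genes or known sample phenotypes.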