Cluster Analysis for Gene Expression Data: A Survey
Daxin Jiang, Chun Tang, and Aidong Zhang
Reviewer
Zhan Gao
zgao014@aucklanduni.ac.nz
Reference
<bibliographic reference to the paper>
Keywords
Microarray, Genes, Samples, Clustering, Unsupervised, Distance, Homogeneity, Separation
Related Papers
[1] P.-N. Tan, M. Steinbach, and V. Kumar, "Introduction to Data Mining," Chapter 8, 2006.
[2] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On Clustering
Validation Techniques," Intelligent Information Systems J., 2001.
[3] A.W. Moore, "http://www.cs.cmu.edu/afs/andrew/course/15/381-f08/www/lectures/clustering.pdf", Carnegie Mellon University, Lecture Notes.
[4] D. Jiang, J. Pei, and A. Zhang, "DHC: A Density-Based
Hierarchical Clustering Method for Time-Series Gene Expression
Data," Proc. BIBE2003: Third IEEE Int'l Symp. Bioinformatics and
Bioeng., 2003.
Summary
This paper first introduces the concepts of microarray technology and lays out the basic elements of clustering gene expression data. For example, clustering is a form of unsupervised classification, and the proximity between objects in gene expression data can be measured by Euclidean distance. The authors then divide cluster analysis into three categories: gene-based clustering, sample-based clustering, and subspace clustering. They present the specific challenges of each category and introduce several representative approaches and clustering algorithms, including K-means, hierarchical clustering, graph-theoretical clustering, model-based clustering, and density-based hierarchical clustering. At the end of the paper, they discuss the validation of clustering results in three aspects: clustering quality, agreement with a reference partition, and reliability.
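To make the concepts in this summary concrete, the following is a minimal Python sketch (my own illustration, not code from the paper) of Euclidean distance between expression profiles, a basic K-means loop, and distance-based stand-ins for the two clustering quality measures, homogeneity and separation. The function names, the toy data, and the farthest-first seeding (used only to keep the example deterministic) are assumptions; the survey's own definitions may differ in detail.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(points):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def farthest_first(data, k):
    """Assumed seeding: start at data[0], then repeatedly take the profile
    farthest from all centers chosen so far."""
    centers = [list(data[0])]
    while len(centers) < k:
        nxt = max(data, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(data, k, iters=100):
    """Basic K-means (Lloyd's iteration); returns (labels, centers)."""
    centers = farthest_first(data, k)
    labels = [-1] * len(data)
    for _ in range(iters):
        # Assignment step: each profile joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are final
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return labels, centers


def homogeneity(data, labels, centers):
    """Mean distance from each profile to its own center (lower = tighter)."""
    total = sum(euclidean(p, centers[lab]) for p, lab in zip(data, labels))
    return total / len(data)


def separation(labels, centers):
    """Size-weighted mean distance between pairs of centers (higher = better)."""
    sizes = [labels.count(c) for c in range(len(centers))]
    num = den = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            weight = sizes[i] * sizes[j]
            num += weight * euclidean(centers[i], centers[j])
            den += weight
    return num / den if den else 0.0


# Toy data: six hypothetical genes measured under three conditions.
data = [
    [1.0, 1.1, 0.9], [0.9, 1.0, 1.0], [1.1, 0.9, 1.0],
    [5.0, 5.2, 4.9], [5.1, 4.9, 5.0], [4.9, 5.0, 5.1],
]
labels, centers = kmeans(data, k=2)
print(labels, homogeneity(data, labels, centers), separation(labels, centers))
```

On this toy data the two groups are far apart, so the average distance to the own center is much smaller than the distance between centers, which is the pattern a good clustering should show.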
Evaluation
The paper gives a good introduction to Euclidean distance as the proximity measure between objects, which is the basis of the clustering techniques it covers.
It briefly introduces a variety of clustering algorithms, such as K-means and hierarchical clustering.
The algorithms are described only briefly, so readers must consult other sources for the details.
The paper introduces good validation measures for clustering quality: homogeneity and separation.
It presents interesting approaches for evaluating the agreement between a clustering result and the "ground truth": the Rand index, the Jaccard coefficient, and the Minkowski measure.
However, the paper does not specify how to construct the "ground truth" binary matrix, and I could not find the answer even after further research.
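One plausible reading, offered here as an assumption rather than as the paper's stated procedure, is that the "ground truth" matrix is derived from known class labels: entry (i, j) is 1 when objects i and j carry the same label and 0 otherwise. The clustering result is converted to a matrix in the same way, and the three indices are then computed from counts of object pairs on which the two matrices agree or disagree. The sketch below follows that reading; the function names and the sample labels are hypothetical.

```python
import math
from itertools import combinations


def comembership(labels):
    """n x n binary matrix: entry (i, j) is 1 iff objects i and j share a label."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def pair_counts(C, P):
    """Classify every object pair (i < j) by clustering matrix C and reference P.

    n11: together in both; n10: together only in C;
    n01: together only in P; n00: apart in both.
    """
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(C)), 2):
        if C[i][j] and P[i][j]:
            n11 += 1
        elif C[i][j]:
            n10 += 1
        elif P[i][j]:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def agreement(cluster_labels, truth_labels):
    """Rand index, Jaccard coefficient, and Minkowski measure.

    Assumes at least one pair is co-clustered in the reference partition;
    otherwise the Jaccard and Minkowski denominators can be zero.
    """
    C = comembership(cluster_labels)
    P = comembership(truth_labels)
    n11, n10, n01, n00 = pair_counts(C, P)
    return {
        "rand": (n11 + n00) / (n11 + n10 + n01 + n00),
        "jaccard": n11 / (n11 + n10 + n01),
        "minkowski": math.sqrt((n10 + n01) / (n11 + n01)),
    }


# Hypothetical example: six samples with known phenotypes, and a
# clustering result that misplaces the third sample.
truth = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
clusters = [0, 0, 1, 1, 1, 1]
print(agreement(clusters, truth))
```

Under these definitions a clustering identical to the reference partition gives a Rand index and Jaccard coefficient of 1 and a Minkowski measure of 0, so higher is better for the first two and lower is better for the third.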