An Era of Enhanced Subspace Clustering in High-Dimensional Data

J. Rama Devi * M. Venkateswara Rao **

* Research Scholar, Department of Computer Science and Engineering, GITAM Institute of Technology, Visakhapatnam, Andhra Pradesh, India.

** Professor, Department of Information Technology, GITAM Institute of Technology, Visakhapatnam, Andhra Pradesh, India.

Abstract

In many real world problems, data are collected in high dimensional space. Detecting clusters in high dimensional spaces are a challenging task in the data mining problem. Subspace clustering is an emerging method, which instead of finding clusters in the full space, it finds clusters in different subspace of the original space. Subspace clustering has been successfully applied in various domains. Recently, the proliferation of high-dimensional data and the need for quality clustering results have moved the research era to enhanced subspace clustering, which targets on problems that cannot be handled or solved effectively through traditional subspace clustering. These enhanced clustering techniques involves in handling the complex data and improving clustering results in various domains like social networking, biology, astronomy and computer vision. The authors have reviewed on the enhanced subspace clustering paradigms and their properties. Mainly they have discussed three main problems of enhanced subspace clustering, first: overlapping clusters mined by significant subspace clusters. Second: overcome the parameter sensitivity problems of the state-of-the-art subspace clustering algorithms. Third: incorporate the constraints or domain knowledge that can make to improve the quality of clusters. They also discuss the basic subspace clustering, the relevant high-dimensional clustering approaches, and describes how they are related.

Keywords :

Subspace Clustering,
High-dimensional Data,
Enhanced Subspace Clustering.

Introduction

Clustering is a descriptive task in the data mining process which aims at grouping similar data sets into clusters (A.K. Jain, et al., 1988; Berkhin P, 2006). Traditional clustering techniques were designed to deliver the clusters in fulldimensional original space, by estimating the proximity between the data objects using all of the attributes of a full data set (Zhang T, et al., 1996; Han J, et al., 2011). Now-adays, as the dimensionality of data increases in advances in data collection management. But as the attributes of data increases, some of the attributes become irrelevant to some of the cluster groups. These traditional clustering algorithms are inefficient for high-dimensional data. The high dimensional data suffers several problems from the curse of dimensionality (Houle ME, et al., 2010). A two fold problem determines the summarization of the curse of dimensionality with respect to clusters. (i) The dimensionality of data grows, several dimensions may not be relevant to define cluster and may, the distance computations usually performed in full dimensional space. The cluster may form in subset of attributes, but it may not be possible to detect the clusters perfectly with the full dimensional space. (ii) The subset of dimensions related to some other cluster may be different from the subset of dimensions relate to another cluster. These two clusters may differ from the relevant subspace for a third cluster. So it may be defined as overlapping clusters in high-dimensional data, i.e. one data point belongs to cluster C_d1 in a certain subset, but the cluster C_d2 in another subset. This is not accounted in traditional clustering methods such as partitioning and density based algorithms. Figure 1 represents Existence of redundant and irrelevant attributes (Aggarwal R, et al., 1998).

Figure 1. Existence of redundant attributes and Existence of irrelevant attributes, (a) Original Dataset, (b) Projection on the subspace {x,y}, (c) Projection on the subspace {x₁ , z}, (d) Projection on the subspace {y₁ , z}

Aggarwal R, et al. (1998) proposed Subspace clustering to overcome the problems in state-of-art methods. Research in subspace clustering has been a challenging area for the past period. Traditional subspace clustering focuses on objects which are nearly together in their subspaces. Aggarwal CC and Reddy (2013) described although these subspace clustering techniques are efficient and effective in solving their clustering applications, their limitations are exhibited by the up-todate proliferation of complex data and the need for higher quality cluster results. The new era enhanced methods have solved proliferation problems. In this paper, the authors focus on enhanced methods to subspace clustering techniques, which is the main attention of this survey.

K. Sim, et al. (2012) scrutinized that enhanced subspace clustering can be mainly categorized in two ways, one is handling complex data and the second is improving the cluster results. Handling complex data: The fundamental subspace clustering methods only handles the 2D (two dimensional) data (object x attribute), and not the complex data of 3D data (object x attribute x time), categorical data or noisy data, streaming data. Therefore, some enhanced subspace clustering methods handle all 3D data, such as categorization, noisy, streaming data, etc. Improving the clustering results: Defined from the base class of fundamental subspace clustering approaches, which can overcome the overlapping of subspace clusters, to set tuning parameters-sensitivity problem of the basic algorithms, user may incorporate the extra information such as constraints or domain knowledge. This leads to improve quality of the clusters. In this survey, first they present subspace clustering algorithms, problems, and definitions. Second, enhanced subspace clustering algorithms and their properties, that uses improving the clustering results and their efficiency. Finally, they discuss potential problems to future research areas in enhanced subspace clustering.

In this study, they present enhanced subspace clustering methods that handle complex data and improving the clustering results, which to the best of our idea. This scenario is a challenging task. The usual surveys present a specific presentation of how the algorithms work. This survey style is different from those, as they decompose the subspace clustering problems and the results.

Motivating Examples

The domain of subspace clustering is important with massive applications for high dimensional data mining: In web text mining through document clustering (Li T, et al., 2004). There are billions of digital documents applicable today. Li T., et al. (2004) proposed an iterative subspace clustering algorithm for web text mining. In online social networks, the determination of communities in both sociologists and target marketers were done (Vidal R, et al., 2008; Vidal R. (2011), and in radio astronomy (Jang W, et al., 2007). In all of the applications considered above, fruitful knowledge is maintained in the subspaces of data.

1. Subspace Clustering Algorithms

Subspace clustering evaluate features on a subset of the data, representing a cluster.

Let DB =O_p X A_p be a dataset presented in a matrix,where O_p is the set of objects and A_p is the set of attributes. A Subspace cluster C_s is a sub matrix O_p x A_p , where the set of objects O ⊆ O_p is homogeneous in the set of dimensions A ⊆ A_p .

Three major variants are developed, via the window based, grid based and density based subspace clustering algorithms (Kriegel HP, et al., 2005). These approaches are very efficient and effective in solving their clustering problems. Lance Parsons, et al., (2004) defined the hierarchy of subspace clustering algorithms as Top down approach and Bottom up approach shown in Figure 2.

Figure 2. Hierarchy of Subspace Clustering Algorithms

1.1 Grid based Subspace Clustering

In grid based clustering (Agrawal, et al. 1998), data space is partitioned into grids. Dense grids containing significant objects are used to form subspace clusters. It is done in three steps. One, find subspaces with good clustering. Second, identify the clusters in the selected subspaces. Third present the results to the users. Some efficient algorithms are CLIQUE, ENCLUS, MAFIA. Agrawal, et al. (1998) defined accuracy and efficiency of grid based methods depending on the granularity and the positioning of grids as shown in Figure 3.

Figure 3. Accuracy and Efficiency of Grid Based Methods

1.2 Window based Subspace Clustering

A sliding window is slid over the domain of each attribute to obtain overlapping intervals. These overlapping windows of length δ are then used as one dimensional units of the grid. It overcomes the weakness of grid based methods (Liu, et al., 2009). Cluster is an efficient approach.

1.3 Density based Subspace Clustering

Kailing, et al. (2004), and Muller E, et al. (2011) defined cluster as a maximal set of density, connected points in Density based subspace clustering. In this approach, two input parameters, e and m used. An object with at least m points in its e neighborhood denotes a core object. The density, associated core objects along with border objects form a specific cluster. Density based subspace clustering algorithms can detect arbitrary shaped clusters.

Some efficient Density subspace clustering algorithms are SUBCLU, INSCY, FIRES, DUSC, DENCOS, DisH, and DENCLU. Figure 4 represents Clusters by notion (Sunita Jahirabadkar, et al. 2013).

Figure 4. Clusters by Notion

2. Enhanced Subspace Clustering Problems

K. Sim, et al (2012), proposed enhanced subspace clustering can be categorized into two important approaches, namely one, handling complex data. second, improving the cluster results. The solving problems beyond the scope of subspace clustering are maintained by enhanced subspace clustering. Important note that the desired properties of the fundamental clusters still grip for the enhanced subspace clusters and each enhanced subspace cluster has its retain properties referring that it is solved.

2.1 Methods and Desired Properties

2.1.1 Handling Complex Data

The fundamental clustering problem are represented on 2D datasets, and not complex data, such as a stream or noisy data, categorical.

Three-Dimensional data (3D): Gathering over the years, 3D datasets have been collected such as gene-sample time microarray in biology, stock financial ratio data (Sim, et al., 2006), item –time region data in market analysis (Ji, et al., 2006) . So, a valuable solution is to let out the subspace clusters to 3D, which are subspace clusters that across time. 3D subspace cluster is determined as follows C=((O,A,T)); let O be a set of object, A be group of attributes, and T is a set of time stamps T.
Categorical Data: Zaki MJ, et al. (2005) In a subspace cluster C=(O,D) of categorical data, the set of objects O the same value for each dimension D. Then the subspace cluster is formed by a set of dimension values and a group of objects.
Noisy Data: Noisy data consists of erroneous, missing, or uncertain values and real–applicable data are being noisy, such as sensor data in smartphone, financial ratio data in finance, gene expression data in biology, etc. These subspaces clustering techniques should be able to handle the uncertainty of the data. The derived properties are like complete, stable result, efficiency same as basic subspace clusters.

2.2 Improving Cluster Results

The clustering result can be augmented in different ways from mining significant subspace clusters to using parameter-in sensitive clustering methods. They provide “true” subspace clusters.

Various Approaches and Definitions with Properties in Improving Clustering Results are given below.

2.2.1 Significant Subspace Clustering

Mainly two approaches are assembled in significance to subspace clustering, the one is to determine significant subspaces and then mine the subspace clusters. Determining significant subspaces can be observed as a preprocessing step and is similar to filter model of attribute selection (Dash, et al., 2002). The second is to define the significant subspace cluster and mine them.

2.2.1.1 Entropy based Subspaces:

Cheng, et al., (1999) proposed mining subspace clusters from significant subspaces to detect an exponential number of clusters being generated. Let A be the subspace in the analysis. By estimating each attribute, a ∈ A as a random variable, the entropy of A can be determined as H (a₁ , . . . , a| A|). Cheng, et al. (1999) also computed the correlation of the attributes in A by the preceding equalization.

Intersest (A)= ∑ H( a ) – H (a , . . . , a| A|) a ∈A;

Properties of Entropy Subspace: K. Sim et al. (2012) :

Significant: Ch, et al. observed that a subspace with clusters naturally has lower entropy than a subspace without clusters. High entropy of subspace of objects is systematically distributed.
Number of Parameters: w, ∈ are tuning parameters to set the significance subspace.
Parameter Sensitivity: The cluster is sensitive to the tuning parameters.

2.2.1.2 Interesting Subspaces:

Kailing, et al. (2003) stated a density based approach to detect interesting subspaces, and traditional subspace clusters are then mined from these subspaces.

Properties of Interesting Subspace: K. Sim, et al. (2012):

Significant: The number of objects cloud together is high with respect to volume of subspace.
Number of Parameters : There are two parameters m, ∈, that controls the density & closeness of the objects on subspace.
Parameter Sensitivity: The cluster is semantic to make parameters.

2.2.1.3 Overlapping Subspace Clusters:

The relevance model (Muller, et al. 2009) and orthogonal model (Gunnemann, et al., 2009) have been proposed to the overlapping subspace clusters. (mine high quality but minimal). These two models reduce the overlapping clusters.

In the relevance model : Let us assume the type of subspace clusters to mine and Let M{c₁ ------- c_n }={(O₁ ,A₁ ),------(O_n ,A_n } ⊆ ALL be a set of subspace clusters. Here the function cluster gain is proposed to measure the signifcance i.e.

We denote Cov (M) = Uni=1 O_i as the union of the objects in the set of clusters M. K( c) as the cost function of cluster C.

Figure 5 shows the objects in subspace {a₁ a₂ } and {a₃ a₄ } successively.

Figure 5. Set of Significant Clusters under (a) Relevance Model and (b) Orthogonal Model

{c₁ ,c₂ } is a set of significant clusters under the relevance model, causing the cost function dependent on the size of subspace.

2.2.1.4 Orthogonal Model:

It aims to detect a set of subspace clusters, that is most of the objects and attributes are involved by the clusters, but there is not much overlapping of the clusters. Each subspace cluster represents a concept (Gunnemann, et al., 2009) which is defined by the attributes of the cluster. Gunnemann, et al., (2009) first defined a concept set, which is a set of subspace clusters that have high similar set of attributes w.r.t a subspace cluster c=(O,A).

Concept Group (C, M) = {C_i ∈ M\{C}|| A_i ∩ A| ≥ β.| A|}

In Figure 3 (b) m = {c₁ c₂ c₇ } is a set of significant subspace cluster by orthogonal Model. The set of objects of c₇ is highly overlapping with those of c₁ and c₂ , c₇ is different set as c₁ and c₂ .

Properties of Cluster (Relevance and Orthogonal Model): K. Sim, et al. (2012):

Homogenetity and size of clusters are dependent on the subspace cluster.
Significant Subspace Cluster: The subspace clusters are significant when the quality of clusters are high.
Number of Parameters: Both models are tuning parameters, that depends on the definition of the subspace cluster used in the models.
Parameter Sensitivity: Both models are sensitive, but difference to set as they are intuitive.

2.2.2 Semi - Supervised Subspace Clustering

2.2.2.1 Constrained based Subspace Clustering:

Subspace clusters are fully dependent of the user. For example: window based or density based subspace clusters can be used. constraints are the 'add-on' to the clusters. Object level constraints are incorporated such as must-link and cannot-link. must-link defines that a pair of objects must be in the same cluster. cannot-link defines that a pair of objects cannot be in the same cluster.

Properties (Sim, et al., 2012):

Homogeneity, size, sensitivity to parameter, and number of parameters are dependent on the definition of the subspace cluster. (Fromont, et al., 2009) used grid, window based subspace clusters.
The must-link and cannot link constraints satisfies the cluster.

2.2.2.2 Actionable Subspace Clustering

The definition of actionable is generalized from actionable patis to meaterns. The ability to suggest a profitable action for the decision-makers (Kleinberg, et al., 1998). Sim, et al., (2010) proposed actionable subspace clusters. They are useful in financial data mining. The continuous attribute known as utility to measure the actionability of the cluster utility is similar to the objectlevel constraints, but it is continuous that mainly actionable subspace clusters are mined 3D datasets. D=O x A x T.

Properties (K. Sim, et al. 2012):

Homogeneity: The Homogeneity is dependent on the weights of the objects.
Size: The size of the cluster is dependent on the object weights.
Semisupervised: The objects have high and correlated utility.
Number of Parameters: Use centroids, based on the domain knowledge of the user.

2.2.3Overcoming Parameter Sensitive Subspace Clustering

The present subspace clustering requires the user to set parameters and the clusters that satisfy these parameters are returned. Mainly two categories are classified: cluster parameters and algorithm parameters where cluster parameters are used in cluster definitions. Algorithm parameters are used in guiding the clustering process. Consider, the parameter k in k-means clustering algorithm. On the other hand, the user selected centroid is also a cluster parameter. Some cases, both clusters and algorithm parameters can be applied. For example: constraints as must-link and cannot–link (Wagstaff, et al., 2001) are cluster parameters. They can be used in improving the quality of the cluster process. In other example, the minsize threshold as a cluster parameter and algorithm parameters. We can also categorize the parameters as semantical parameters and tuning parameters based on their usage (Kriegel, et al., 2007). It is used to tune the efficiency of the quality of the clusters or the algorithm. Semantic parameters are meaningful parameters, describing the semantics of the clusters and domain knowledge of the user can be incorporated into these parameters. Table 1, describes the parameter matrix (K. Sim, et al., 2012).

Table 1. The Matrix Parameter

3. Summary

Table 2 presents the desired properties of the subspace cluster approaches have: Compared to the traditional subspace clustering approaches, the enhanced subspace clustering approaches have desired properties (K. Sim, et al., 2012).

Table 2. Comparison of Desired Properties

4. Discussions

Here we present the desired properties (K. Sim, et al., 2012) of different subspace approaches. So the reader will first follow the problems and what desired properties they explore in their clustering results. Later the reader will understand the results and their properties, and if the properties of the solutions do fulfill the desired properties that the problems seek. In future many variations of subspace clusters have been proposed to deal with various types of data and solve different problems. We will use deriving models to represent the results. All these proposed techniques have the potential to be future research areas of enhanced subspace clustering.

Conclusion

Nowadays, due to the advancement of data collection and increase of data is a challenging task. So the research trend has moved from traditional clustering to enhanced subspace clustering. In this review, the authors have presented the enhanced subspace clustering paradigm. As with any approach, enhanced subspace techniques finds useful and improved results depending on the selection of a particular approach and proper tuning of the algorithm with i/p parameters. This paper has discussed the concept of traditional subspace clustering and enhanced subspace clustering is desired properties of different types of approaches. Further, they will study the implementations and compare their solutions. Research in enhanced Subspace clustering is an exciting area to mine the high dimensional data sets efficiently.

References

[1]. Agrawal R, Gehrke J, Gunopulos D, and Raghavan P, (1998). “Automatic subspace clustering of highdimensional data for data mining applications”. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp.94-105.

[2]. Aggarwal CC, and Reddy CK, (2013). “Data clustering: algorithms and applications”. Data Mining Knowledge and Discovery Series 1^st Eds. CRC Press.

[3]. Berkhin P, (2006). “A survey of clustering data mining techniques”. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping Multidimensional Data, chap 2. Springer, New York, pp.25-71.

[4]. Cheng CH, Fu AW, and Zhang Y, (1999). “Entropybased subspace clustering for mining numerical data”. In: Proceedings of the 5^th ACM International Conference on Knowledge Discovery and Data Mining (KDD), pp. 84- 93.

[5]. Dash M, Choi K, Scheuermann P, and Liu H, (2002). “Feature selection for clustering-a filter solution. In: Proceedings of the 2^nd IEEE International Conference on Data Mining (ICDM), pp. 115-122.

[6]. Fromont É, Prado A, and Robardet C, (2009). “Constraint-based subspace clustering”. In Proceedings of the 9^th SIAM International Conference on Data Mining (SDM), pp.26-37.

[7]. Günnemann S, Müller E, Färber I, and Seidl T, (2009). “Detection of orthogonal concepts in subspaces of high dimensional data”. In: Proceedings of the 18^th ACM Conference on Information and Knowledge Management (CIKM), pp.1317-1326.

[8]. Houle ME, Kriegel HP, Kröger P, Schubert E, and Zimek A, (2010). Can shared-neighbor distances defeat the curse of dimensionality?. In: Proceedings of the 22^nd International Conference on Scientific and Statistical Database Management (SSDBM).

[9]. Sim K, Poernomo AK, and Gopalkrishnan V, (2010). “Mining actionable subspace clusters in sequential data”. In: Proceedings of the 10^th SIAM International Conference on Data Mining (SDM), pp.442-453.

[10]. Han J, Pei J, and Kamber M, (2011). Data Mining: Concepts and Techniques. Elsevier.

[11]. Ji L, Tan KL, and Tung AKH, (2006). Mining frequent closed cubes in 3D datasets. In Proceedings of the 32^nd International Conference on Very Large Databases (VLDB), pp.811-822.

[12]. Jang W, Jang W, and Hendry M, (2007). “Cluster analysis of massive datasets in astronomy”. Stat Comput, Vol. 17, No. 3, pp. 253-262.

[13]. Kleinberg J, Papadimitriou C, and Raghavan P, (1998). “A microeconomic view of data mining”. Data Mining Knowl Discov, Vol.2, No.4, pp.311-324.

[14]. Kailing, Kailing K, Kriegel HP, Kröger P, and Wanka S, (2003). “Ranking interesting subspaces for clustering high dimen-sional data”. In: Proceedings of the 7^th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp.241-252.

[15]. Kailing, Kailing K, Kröger P, and Kriegel HP, (2004). “Density-connected subspace clustering for high dimensional data”. In: Proceedings of the 4^th SIAM International Conference on Data Mining (SDM), pp.246- 257.

[16]. Kriegel HP, Kroger P, Renz M, and Wurst S, (2005). “A generic framework for efficient subspace clustering of high-dimensional data”. In fifth IEEE International Conference on Data Mining (ICDM'05), pp.1-8.

[17]. Kriegel, Kriegel HP, Borgwardt KM, Kröger P, Pryakhin A, Schubert M, and Zimek A, (2007). “Future trends in data mining”. Data Mining Knowl Discov, Vol. 15, No. 1, pp. 87- 97.

[18]. K. Sim, Sim K, Gopalkrishnan V, Zimek A, and Cong G, (2012). “A survey on enhanced subspace clustering”. Data Min Knowl Disc, Vol.26, No.2, pp.332-397.

[19]. Lance Parsons, Parsons L, Haque E, and Liu H, (2004). “Subspace clustering for high dimensional data: A review”. ACM SIGKDD Explor Newsl, Vol.6, No.1, pp.90- 105.

[20]. Li T, Ma S, and Ogihara M, (2004). “Document clustering via adaptive subspace iteration”. In: Proceedings of the 27^th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, USA. pp.218-225.

[21]. Liu G, Sim K, Li J, and Wong L, (2009). “Efficient mining of distance-based subspace clusters”. Stat Anal Data Mining, (00975-8887), Vol.2, No.5-6, pp.427-444.

[22]. Müller E, Assent I, Günnemann S, Krieger R, and Seidl T, (2009). “Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data”. In: Proceedings of the 9^th IEEE International Conference on Data Mining (ICDM), pp.377-386.

[23]. Müller E, Assent I, Günnemann S, and Seidl T, (2011). “Scalable density-based subspace clustering”. In tProceedings of the 20^th ACM International Conference on Information and Knowledge Management, pp.1077-86, ACM.

[24]. Sim K, Li J, Gopalkrishnan V, and Liu G, (2006). “Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment”. In: Proceedings of tthe 6^th IEEE International Conference on Data Mining (ICDM), pp.1059-1063.

[25]. Sunita Jahirabadkar, and Parag Kulkarni, (2013). “Clustering for High Dimensional Data: Density based subspace Clustering Algorithms”. International Journal of Computer Applications, Vol.63, No.20, pp.00975-8887.

[26]. Vidal R, Tron R, and Hartley R, (2008). “Multiframe motion segmentation with missing data using Power Factorization and GPCA”. Int J Comput Vis, Vol.79, No.1, pp.85-105.

[27]. Vidal R, (2011). “Subspace Clustering”. IEEE Signal Proc Mag, Vol.28, No.2, pp.52-68.

[28]. Wagstaff K, Cardie C, Rogers S, and Schrödl S, (2001). “Constrained k-means clustering with background knowl-edge”. In: Proceedings of the 18^th International Conference on Machine Learning (ICML), pp.577-584.

[29]. Zhang T, Ramakrishnan R, and Livny M, (1996). “BIRCH: An efficient data clustering method for very large databases”. In: Proc. of the ACM SIGMOD International Conference on Management of Data, Vol.1, ACM Press, USA. pp.103-114.

[30]. Zaki MJ, Peters M, Assent I, and Seidl T, (2005). “CLICKS: An effective algorithm for mining subspace clus ters in categorical datasets”. In: Proceedings of the 11^th ACM International Conference on Knowledge Discovery and Data Mining (KDD), pp.736-742.