Initial Cluster Centers Based on Moving Two Lines Approximation in K-means Algorithm

Abstract: The main shortcoming of the k-means clustering algorithm is its strong dependence on the initial cluster centers. Based on the moving two lines approximation model, this paper gives a method to pick the initial cluster centers for k-means clustering. Numerical experiments and comparison criteria show that this method achieves a better clustering effect.


Introduction
The k-means algorithm is a hard-clustering algorithm and a representative prototype-based clustering method. It takes a distance between the data points and the prototypes as the objective function, and uses iterative updates that seek an extremum of this function. K-means clustering minimizes the distance between objects within a class while maximizing the distance between classes.
K-means clustering algorithm is highly dependent on the selection of initial values. An inappropriate initial value causes the algorithm to converge to a local minimum, so a lot of work has been done on the selection of initial clustering centers.
Fayyad et al. [1] give a fast and efficient algorithm, operated over small subsamples of a given database, for refining an initial starting point. Using the global optimization ability of genetic algorithms to improve the traditional k-means algorithm and avoid merely local optimal solutions is described in [2,3]. Likas et al. present the global k-means algorithm, an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure [4]. Khan and Ahmad propose an algorithm to compute initial cluster centers for k-means clustering; it is based on the observation that some of the patterns are very similar to each other and therefore have the same cluster membership irrespective of the choice of initial cluster centers [5]. In [6], to obtain the initial cluster centers for the k-means algorithm, cells are partitioned one at a time until the number of cells equals the predefined number of clusters, k; the centers of the k cells become the initial cluster centers. Later, a new algorithm for choosing initial cluster centers in the k-means algorithm is given in [7]: two principal variables are selected according to the maximum coefficient of variation and the minimum absolute value of the correlation.
The rest of this paper is organized as follows. Section 2 introduces the computation of initial cluster centers based on moving two lines approximation. Section 3 introduces the comparison criteria, namely the error percentage and the rand index. Section 4 describes the implementation of the algorithm and presents experiments and results. Section 5 concludes the paper.

Initial cluster centers computing based on moving two lines approximation
Moving multiple curves approximation is based on [8], and moving two lines approximation is a particular case of moving multiple curves approximation [9].
Here we use two lines in the moving multiple curves approximation model and modify the model as follows.
q_i is the projected point of p_i on an underlying curve, and r is a fixed point, called a reference point, near the underlying shape, which is obtained by partitioning the space occupied by {p_i}_{i=1}^n. (x_i, y_i) is the coordinate of p_i, i = 1, 2, ..., n.
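The full model is given in [8,9]. As a rough illustrative sketch only (not the authors' formulation), the idea of approximating data with two moving lines can be mimicked by alternately assigning each point to the nearer of two horizontal lines y = c1 and y = c2 and refitting each offset; the function name and the restriction to horizontal lines are assumptions for illustration.

```python
def fit_two_horizontal_lines(points, iters=20):
    """Sketch: alternately assign each point to the nearer of two
    horizontal lines y = c1 and y = c2, then refit each offset as the
    mean ordinate of its assigned points (an assumed simplification
    of the moving two lines model)."""
    ys = [y for _, y in points]
    c1, c2 = min(ys), max(ys)  # crude initial offsets
    for _ in range(iters):
        g1 = [y for y in ys if abs(y - c1) <= abs(y - c2)]
        g2 = [y for y in ys if abs(y - c1) > abs(y - c2)]
        if g1:
            c1 = sum(g1) / len(g1)
        if g2:
            c2 = sum(g2) / len(g2)
    return c1, c2
```

A point on each fitted line (for instance at the mean abscissa of the data) could then serve as an initial cluster center.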

Comparison criteria
To compare the clustering results, two clustering criteria are presented here. One is error percentage [7] and another is the rand index [10].
Error percentage is defined as follows: Error = (e / n) × 100, (3) where e is the number of misclassified observations and n is the total number of observations in the dataset.
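Formula (3) can be computed directly; the following minimal Python helper assumes the predicted cluster labels have already been matched to the true class labels.

```python
def error_percentage(true_labels, pred_labels):
    """Error = e / n * 100, where e is the number of misclassified
    observations and n the total number of observations.
    Assumes pred_labels uses the same label names as true_labels
    (otherwise match clusters to classes first)."""
    n = len(true_labels)
    e = sum(t != p for t, p in zip(true_labels, pred_labels))
    return e / n * 100
```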
The lower the value of Error, the better the result.
To define the rand index, let C1 and C2 be two clusterings of the data and, for every pair of points p_i and p_j, count:
a: p_i and p_j are in the same cluster in C1 and in the same cluster in C2;
b: p_i and p_j are in different clusters in C1 but in the same cluster in C2;
c: p_i and p_j are in the same cluster in C1 but in different clusters in C2;
d: p_i and p_j are in different clusters in C1 and in different clusters in C2.
The rand index is given as follows: Rand = (a + d) / (a + b + c + d), (4) and the higher the value of Rand, the better the result.
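The pair counts a, b, c, d and formula (4) translate directly into a short Python function:

```python
from itertools import combinations

def rand_index(labels1, labels2):
    """Rand = (a + d) / (a + b + c + d) over all point pairs, where
    a and d count pairs grouped consistently in both clusterings and
    b and c count pairs grouped inconsistently."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1
        elif not same1 and same2:
            b += 1
        elif same1 and not same2:
            c += 1
        else:
            d += 1
    return (a + d) / (a + b + c + d)
```

Identical clusterings give Rand = 1, the best possible value.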

Random initial cluster centers
We consider sampling points {p_i}_{i=1}^n from two curves ℓ1: y = 0.02 and ℓ2: y = −0.02; see the pink circles in Figure 1, Figure 2 and Figure 3. We pick the initial centers (0.2, 0), (0.3, 0) in Figure 1 and (0.4, 0.02), (0.1, −0.02) in Figure 2, and use the k-means algorithm to cluster {p_i}_{i=1}^n into two classes; see Figure 1(b) and Figure 2(b). Both results are unreasonable: we want the points with ordinate 0.02 to form one class and the points with ordinate −0.02 to form the other.
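This failure can be reproduced with a plain Lloyd's k-means in Python; the exact sample abscissas below are an assumption, since the paper only specifies the two ordinates.

```python
def kmeans(points, centers, iters=50):
    """Plain Lloyd's k-means: assign each point to the nearest center,
    then move each center to the mean of its assigned points."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        labels = [min(range(len(centers)),
                      key=lambda k: (x - centers[k][0]) ** 2
                                  + (y - centers[k][1]) ** 2)
                  for x, y in points]
        for k in range(len(centers)):
            cluster = [p for p, l in zip(points, labels) if l == k]
            if cluster:
                centers[k] = (sum(x for x, _ in cluster) / len(cluster),
                              sum(y for _, y in cluster) / len(cluster))
    return labels, centers

# Points sampled from y = 0.02 and y = -0.02 (assumed abscissas).
points = [(0.05 * i, 0.02) for i in range(11)] + \
         [(0.05 * i, -0.02) for i in range(11)]
# Initial centers (0.2, 0) and (0.3, 0) as in Figure 1.
labels, _ = kmeans(points, [(0.2, 0.0), (0.3, 0.0)])
```

With these initial centers, the assignment depends only on the abscissa, so k-means splits the data vertically and each cluster mixes points from both curves, which is the unreasonable result described above.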

Initial cluster center given by moving two lines approximation
It can be seen from the above results that if the initial cluster centers are not appropriate, no good clustering result can be obtained. Therefore, the initial cluster centers are obtained by moving two lines approximation. We use all the points {p_i}_{i=1}^n in the moving two lines approximation model (2) without partition. The reference point is the mean of {p_i}_{i=1}^n, denoted by a pink star in Figure 3. Then, from (2), two blue lines are obtained, also shown in Figure 3. The two computed blue target points (0.25, 0.02) and (0.25, −0.02) given by (2) are taken as the two initial centers of the k-means algorithm; see Figure 4(a). Based on these two initial centers, we get a reasonable clustering result; see Figure 4(b). Moreover, with formulas (3) and (4), we can compute the error percentage and rand index for the different initial cluster centers; see Table 1. We found that the initial centers from the moving two lines approximation correspond to the smallest Error and the largest rand index, so their clustering effect is the best.
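The effect of the computed target points can be checked directly: with initial centers (0.25, 0.02) and (0.25, −0.02), the very first nearest-center assignment already separates the two curves by ordinate, and (for symmetric samples like those assumed below) the centers are stationary, so k-means converges immediately.

```python
# Initial centers computed by moving two lines approximation.
centers = [(0.25, 0.02), (0.25, -0.02)]
# Points sampled from y = 0.02 and y = -0.02 (assumed abscissas).
points = [(0.05 * i, 0.02) for i in range(11)] + \
         [(0.05 * i, -0.02) for i in range(11)]

def nearest(p):
    """Index of the center closest to point p (squared distance)."""
    x, y = p
    d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
    return d.index(min(d))

labels = [nearest(p) for p in points]
```

Every point with ordinate 0.02 falls to the first center and every point with ordinate −0.02 to the second, i.e. the desired two-class result.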

Conclusion
A new method to select the initial cluster centers of the k-means algorithm, namely moving two lines approximation, has been presented in this paper. Moving two lines approximation is a particular case of moving multiple curves approximation in which only the algebraic equations of two lines are used in the model. For a given dataset, moving two lines approximation finds better initial centers for the k-means algorithm and achieves a better clustering effect than random initial cluster centers. Here, moving two lines approximation is used only for binary classification problems. In the future, moving multiple curves approximation can be considered for multi-class problems, and it may be applied to practical datasets.