Volume 3, Issue 6 e1206
RESEARCH ARTICLE

Euclidean distance stratified random sampling based clustering model for big data mining

Kamlesh Kumar Pandey (Corresponding Author)

Department of Computer Science & Applications, Dr. Hari Singh Gour Vishwavidyalaya, Sagar, India

Correspondence

Kamlesh Kumar Pandey, Department of Computer Science & Applications, Dr. Hari Singh Gour Vishwavidyalaya, Sagar, Madhya Pradesh, India.

Email: [email protected]

Diwakar Shukla

Department of Computer Science & Applications, Dr. Hari Singh Gour Vishwavidyalaya, Sagar, India

First published: 15 October 2021

Funding information: No Funding.

Abstract

Big data mining is related to large-scale data analysis and faces computational cost-related challenges due to the exponential growth of digital technologies. Classical data mining algorithms suffer from computational deficiency, memory utilization, resource optimization, scale-up, and speed-up related challenges in big data mining. Sampling is one of the most effective data reduction techniques; it reduces the computational cost and improves scalability and computational speed with high efficiency for any data mining algorithm in single and multiple machine execution environments. This study proposes a Euclidean distance-based stratum method for stratum creation and a stratified random sampling-based big data mining model using the K-Means clustering algorithm (SSK-Means) in a single machine execution environment. The SSK-Means algorithm achieves better cluster quality, speed-up, scale-up, and memory utilization than the random sampling-based K-Means and classical K-Means algorithms, as assessed by the silhouette coefficient, Davies-Bouldin index, Calinski-Harabasz index, execution time, and speedup ratio internal measures.

INTRODUCTION

Nowadays, massive data is being generated through the use of digital and communication technologies, especially sensor networks, the Internet of Things, social media, healthcare, e-commerce, cloud computing, cyber-physical systems, and so on. In May 2018, Forbes released an article on the speed of data production. It stated that Google serves 3.5 billion search requests per day and 1.2 trillion requests per year worldwide, and that every minute worldwide YouTube users watch 4,146,600 videos, Twitter users write 456,000 tweets, Instagram and Snapchat users share 46,740 and 527,760 photos respectively, and Facebook users write 510,000 comments, post 293,000 status updates, and upload 300 million photos.1 These digital platforms draw on a variety of sources that generate heterogeneous data at high speed. Complex sources, data formats, and large-scale data production define big data. It is a revolution beyond conventional and batch data, as internet users grew by 7.5% from 2016 to 2018.1 These statistics characterize big data in terms of volume, variety, and velocity, and show that traditional data mining algorithms, tools, and techniques are incompatible with it.2 The essential characteristics of big data are recognized as volume, variety, and velocity, whereas value, veracity, variability, and visualization are identified as supporting characteristics.3-6

One way of summarizing this is: “Heterogeneous (variety) data are generated, created, and updated by heterogeneous (variety) sources through batch, real-time, and stream environments at high speed (velocity). Variety and velocity increase the data size (volume) to multiple terabytes and petabytes.” Veracity, value, variability, and visualization are recognized as supporting characteristics that rest on volume, variety, and velocity. Based on the existing research,3-6 these supporting characteristics are described as follows: “Veracity encourages accuracy through variety. Value supports the results of data mining through volume, variety, and velocity. Variability boosts sentiment analysis via volume and variety. Visualization supports graphical representation in a user-readable form by using the other big data characteristics.”

Data mining algorithms need to improve their computational cost, speed, scalability, flexibility, and efficiency to match the essential characteristics of big data.7 Big data mining is the extraction of appropriate hidden predictive information, patterns, and relations from heterogeneous large-scale datasets,8 and it requires high transparency with respect to volume, variety, and velocity because large-scale data contains valuable knowledge and information.9 Mining techniques have limited applicability to high-volume datasets containing millions of tuples and records. To overcome these issues, two alternatives are available: the first is to scale up the data mining algorithm, and the second is to reduce the dataset size. These two alternatives make data mining effective and feasible.10 Intelligent big data mining combines statistics and mining techniques with data and process management.9

Sampling is a data reduction technique based on a statistical model.11, 12 It provides more effective evaluation with accuracy13 and is recognized as a “bag of little bootstraps”.14 Reference 15 described a standard framework of the sampling process for data mining using clustering. According to this framework, the entire dataset is split into a sample component and an un-sample component. The data mining algorithm uses the sample component and gives partial results, referred to as approximate results. Thereafter, the partial results and the un-sample component are merged using a sample extension strategy, which produces the final approximate result.

Cluster analysis is adopted in various applications such as social network analysis, ontology, customer segmentation, scientific data analysis, bioinformatics, natural classification, underlying structure discovery, data compression, mobile ad-hoc networks, target marketing, texture segmentation, and vector quantization.15, 16 Clustering is one of the essential techniques in data mining for discovering the classes of unlabeled data based on distance, similarity, and dissimilarity. It discovers hidden relations, patterns, and information between classes, and improves the efficiency and effectiveness of the data mining model.17 Some authors16, 18 evaluated conventional clustering algorithms for big data mining against volume, variety, and velocity, and suggested that K-Means, BFR, CURE, BIRCH, CLARA, DBSCAN, DENCLUE, WaveCluster, and FC are robust for big data mining.

Past contributions7, 15, 16, 19, 20 addressed cluster creation techniques for big data mining in single and multiple machine execution environments. Clustering techniques for big data mining are categorized into divide-and-conquer, parallel, center reduction, efficient nearest neighbor (NN) search, sampling, dimension reduction, incremental, and condensation methods. The parallel method is implemented only on multiple machines, whereas the other cluster construction methods can be executed in either environment.21 This article uses a sampling-based method for cluster creation in a single machine execution environment. The sampling mechanism increases the speed of the clustering algorithm and reduces computation time. The efficiency of sampling depends on the sample selection strategy and the sample size.

Reference 15 discussed the CURE hierarchical clustering algorithm, which is based on uniform random sampling and improves computational efficiency. This method first selects the sample and thereafter carries out hierarchical clustering. Kollios et al.22 used a density-biased sampling technique for cluster construction and outlier detection within the K-Means and K-medoids algorithms. David et al.23 presented a sample-based clustering framework for the K-Means and K-Median clustering algorithms. This sampling framework achieves approximate results and depends on the sample size and accuracy parameters, because the sample bounds the convergence to the optimal clustering.

Aggarwal et al.24 proposed a bi-criteria approximation approach for the initial centroid selection of K-Means clustering using adaptive sampling. This approach suffers from excessive Euclidean distance computations and distance comparisons, which make the triangle inequalities ineffective. Bejarano et al.25 suggested a sampler algorithm for K-Means clustering based on random sampling. This algorithm reduces the distance estimation and computational cost in each iteration, but it increases the computation time for massive multi-dimensional data due to sample size estimation in each iteration. Cui et al.26 suggested an optimized K-Means clustering using random sampling and parallel execution strategies. This strategy eliminates the iterations of K-Means inside parallel execution and yields effective performance. Xu et al.27 described a random instance sampling-based approximate clustering approach for text data, compared it against K-Means, and observed that sampling-based clustering reduces the computational expense.

Zhan et al.28 introduced a spectral clustering algorithm using an incremental Nyström sampling method for large-scale and high-dimensional datasets. This algorithm achieves optimal approximation results and shows that maximizing the sampling trials efficiently reduces the sampling error. Ros et al.29 used a combination of distance and density sampling for K-Means and hierarchical clustering. This method eliminates multiple distance computations in both clustering algorithms and is insensitive to data size, cluster noise, and cluster initialization. However, the combination of both sampling techniques increases the computation cost in the first iteration of the clustering. Zhao et al.30 presented Cluster Sampling Arithmetic (CSA), an efficient heuristic approach for clustering through random sampling. This algorithm reduces the computation time by obtaining a minimum sample set and avoids the local optimum problem. The CSA uses a parallel execution strategy and discovers the initial centroids of K-Means clustering.

Aloise et al.31 used iterative sampling to solve the minimax diameter clustering problem, which concerns minimizing the maximum intra-cluster dissimilarity; solving it also addresses the partitional clustering objective. Ben Hajkacem et al.32 used reservoir random sampling and proposed the sampling-based STiMR K-Means clustering algorithm. The STiMR K-Means algorithm uses the triangle inequality and MapReduce acceleration techniques for fast execution, but it suffers from scalability-related issues. Li et al.33 proposed sampling-based K-Means (SKMeans) and reported excellent results compared with the K-Means algorithm; this algorithm effectively reduces the data size and increases cluster effectiveness and efficiency. Luchi et al.34 enhanced the DBSCAN clustering algorithm through sampling in terms of clustering time and scalability, producing the Rough-DBSCAN and I-DBSCAN algorithms. These algorithms are influenced by parameters such as the sample size.

Based on the above research, sampling has been used for different kinds of clustering problems such as outlier detection, initial centroid identification, cluster construction, data size selection, computation cost, sampling error, number of sampling trials, local optima, the clustering objective, enormous distance computation, scalability, speed-up, and so on. References 26, 27, 30, and 33 used multiple machines, and References 22-25, 28, 29, 31, 32, and 34 used a single machine execution environment for clustering; most of these clustering algorithms achieved approximate results. This study considers the improvement of random sampling for cluster creation in a single machine execution environment.

Some other sampling-based data mining techniques are density biased sampling,22 CURE (Clustering Using REpresentatives), RSEFCM (Random Sampling plus Extension Fuzzy C-Means), CLARANS (Clustering Large Applications based on Randomized Sampling),15 eNERF (extended non-Euclidean relational fuzzy c-means clustering),35 progressive sampling-based mining of association rules,36 GOFCM (geometric progressive fuzzy c-means),15 EM clustering,35 STiMR K-Means,32 AdROIT (Adaptive Reservoir sampling Of stream In Time),37 random pairing,38 adaptive-size reservoir sampling, Stratified Reservoir Sampling,39 Monte Carlo based uncertainty quantification and Refined Stratified Sampling,40 SSEFCM (Stratified sampling Plus Extension Fuzzy C Means),15 BIRCH,16 and so forth.

The objective of this article is to improve the computational efficiency and computing speed of conventional clustering algorithms without affecting cluster quality in big data mining using stratified random sampling. Section 1 describes the problem of data mining in large-scale data and presents big data clustering techniques based on existing computational methods. Section 2 presents the stratified sampling method, the stratum creation algorithm, and the stratified sampling-based data mining algorithm. Section 3 contains the implementation of the proposed work using the K-Means algorithm and validates it using internal measures. Section 4 concludes the study and identifies further research directions.

PROPOSED WORK

This section describes the clustering objective and presents the stratified sampling-based K-Means algorithm (SSK-Means) with the Euclidean distance-based stratum (EDS) method for cluster construction in big data mining using single machine execution. The proposed method improves computational efficiency, reduces computation cost, and increases computing speed without affecting cluster quality or the clustering objective.

Objective function

Suppose the dataset $N$ consists of $n$ data points with $D$ dimensions, and the required number of clusters is $k$. A partitional clustering algorithm such as K-Means divides $N = \{N_1, N_2, \ldots, N_n\}$ into $k$ exhaustive and mutually exclusive clusters $C = \{C_1, C_2, \ldots, C_k\}$. The partition must satisfy the conditions $C_1 \cup C_2 \cup \ldots \cup C_k = N$ and $C_1 \cap C_2 \cap \ldots \cap C_k = \emptyset$.16 The K-Means algorithm creates the clusters with the objective of minimizing the sum of squared error (SSE) function through an iterative process. The criterion optimization function of K-Means clustering is given in Equation (1).19
$SSE: J(N, C) = \sum_{k=1}^{K} \sum_{n_i \in C_k} \lVert n_i - \mu_k \rVert^2$  (1)
where $n_i$ denotes a data point and $\mu_k$ is the centroid of cluster $C_k$. Equation (1) refers to the minimum SSE problem, which can be solved by Equation (2). The content of $C_k$ that constrains the minimum SSE is defined as follows.41
$C_k = \{ n_i \in N : k = \arg\min_{j \in \{1, 2, \ldots, K\}} \lVert n_i - \mu_j \rVert^2 \}$  (2)
$\mu_k = \frac{\sum_{n_i \in C_k} n_i}{\lvert C_k \rvert}$  (3)
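To make the objective concrete, the following NumPy sketch computes the SSE of Equation (1) and the two iterative K-Means steps of Equations (2) and (3); the function and variable names are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def sse(data, centroids, labels):
    """Sum of squared errors J(N, C) of Equation (1)."""
    return sum(np.sum((data[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))

def assign(data, centroids):
    """Equation (2): assign every point to its nearest centroid."""
    dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def update(data, labels, k):
    """Equation (3): recompute each centroid as the mean of its cluster."""
    return np.array([data[labels == j].mean(axis=0) for j in range(k)])
```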

Proposed model for big data mining

This subsection presents the stratified sampling-based data mining model, which has the capability to scale up and speed up any conventional data mining algorithm in a big data environment. The proposed model works in three phases: stratified sampling, a data mining technique (here, clustering), and sample extension. Details of stratified sampling and sample extension are described separately below.

Stratified sampling (SS)

A successful sampling technique must be scalable under the high volume, variety, and velocity characteristics of big data. SS reduces the data volume, categorizes data from heterogeneous to homogeneous strata, and can change the sample data according to time and user requirements.15, 30 From a statistical point of view, SS is an unbiased sampling approach that improves the convergence rate and reduces the variance.40 These capabilities illustrate that SS is more scalable for big data mining than other sampling methods.

The objective of SS is to improve the precision of results based on homogeneous strata. It conducts the sampling process in two phases. In the first step, the entire dataset is grouped into small strata based on similarity; this is known as the stratification process. The second step extracts relevant data points from each stratum for data mining or analysis; this is called sample allocation. The data points of the sample are chosen by any sampling method according to the objective function.42, 43

Suppose the dataset $N$ is grouped into $L$ strata and the $h$th stratum consists of $N_h$ data points, where $\sum_{h=1}^{L} N_h = N$ and $h = 1, 2, 3, \ldots, L$. The sample of size $n_h$ for each stratum is drawn by any sampling technique suitable for the research problem, where $\sum_{h=1}^{L} n_h = n$. The conceptual representation of SS is shown in Figure 1.

Figure 1. Conceptual representation of SS
Let $Y_{hi}$ represent a stratified dataset unit, where $i$ indexes the selected study attribute values in the $h$th stratum, $i = 1, 2, 3, \ldots, N_h$. Thereafter, random sampling extracts homogeneous data points from each stratum $h$ according to the sample size. Let $y_{hi}$ be a random sample unit, where the $i$th data unit is selected from the $h$th stratum for the mining model. The mean of the $h$th stratum, $\bar{Y}_h$, is based on the $N_h$ data units, and the mean of the random sample of the $h$th stratum, $\bar{y}_h$, is based on the $n_h$ data units.
$\bar{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi}$  (4)
$\bar{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$  (5)
The stratum weight $W_h$ of stratified sampling is $W_h = N_h / N$. SS is an unbiased sampling process and produces higher precision than biased sampling.
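As a small illustration of Equations (4) and (5) and the stratum weight $W_h$, the sketch below computes these quantities for strata held as NumPy arrays; the data layout, sampling fraction, and function name are our assumptions, not the paper's.

```python
import numpy as np

def stratum_statistics(strata, sample_fraction=0.1, seed=0):
    """Per-stratum weight W_h, stratum mean (Eq. 4), and sample mean (Eq. 5)."""
    rng = np.random.default_rng(seed)
    N = sum(len(s) for s in strata)               # total number of data points
    stats = []
    for s in strata:                               # s: (N_h, D) array for one stratum
        n_h = max(1, int(sample_fraction * len(s)))
        sample = s[rng.choice(len(s), size=n_h, replace=False)]
        stats.append({"W_h": len(s) / N,                 # N_h / N
                      "Y_bar_h": s.mean(axis=0),         # Equation (4)
                      "y_bar_h": sample.mean(axis=0)})   # Equation (5)
    return stats
```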

Stratification

In computer science, numerous stratification techniques are available for stratum formation, such as locality-sensitive hashing,15 the greedy stratification method,44 Latin hypercube sampling,40 Hamming distance, and other collision-related hashing and bucket-based strategies. This article considers a new method of stratum creation using the Euclidean distance, known as the Euclidean distance based stratum (EDS) method. The EDS algorithm uses a max-min data range heuristic for constructing k strata, which produces higher homogeneity within each stratum compared with the mid-square, division, multiplication, and folding hashing techniques. The method first determines k centroids of the dataset based on the max and min range of the data points, and thereafter assigns each data point to one of the k strata according to the Euclidean distance between the data point and the k centroids. This approach yields high density and compactness within each stratum. The EDS method is described in Algorithm 1.

Algorithm 1. Proposed Euclidean distance based stratum (EDS)

Input:

1. data = Dataset with N data points.

2. i = Attributes of the dataset.

3. k = Required number of strata.

Output:

1. L = {N_1, N_2, …, N_k}, the set of homogeneous strata.

Method

Data set centroid identification

1. v = (max_value(i) − min_value(i)) / (k + 1)

2. if v is greater than max_value(i) then

3. the dataset has a noise data point; exit()

4. else

Stratum centroid identification

5. c_1 = min_value(i) + v

6. c_2 = c_1 + v

7. c_k = c_(k−1) + v

8. End if

Assign data point into stratum

9. for i = 1 to length of data(i)
dis_euclidean(data(i), c_k) = ‖data(i) − c_k‖²

10. Assign each data point to the closest stratum k.

11. end for

12. Exit()
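A minimal Python sketch of Algorithm 1 follows, assuming the max-min range heuristic is applied to every attribute at once so that the k stratum centroids are D-dimensional points; the function name eds_strata and this vectorized reading are our choices, not the authors'.

```python
import numpy as np

def eds_strata(data, k):
    """Split data (n x D array) into k strata around equally spaced centroids."""
    v = (data.max(axis=0) - data.min(axis=0)) / (k + 1)          # step 1
    if np.any(v > data.max(axis=0)):                              # steps 2-3: noise guard
        raise ValueError("dataset has a noise data point")
    # steps 5-7: c_1 = min + v, c_2 = c_1 + v, ..., c_k = c_(k-1) + v
    centroids = np.array([data.min(axis=0) + (j + 1) * v for j in range(k)])
    # steps 9-10: assign every point to its closest centroid (Euclidean distance)
    dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    strata = [data[labels == j] for j in range(k)]
    return strata, centroids, labels
```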

Sample allocation techniques

The efficiency of stratification depends upon the sample size taken from the hth stratum. Optimum allocation is a sample allocation technique that decides the sample size according to the stratum size and variability, at the lowest cost, and it obeys the sample allocation characteristics of SS. The Neyman allocation method is a specific case of optimum allocation that minimizes the variance and total cost of the sample allocation. The number of sample units (sample size) to be selected from the hth stratum is defined in Equation (6).
$n_h = n \cdot \frac{W_h \sigma_h}{\sum_{h=1}^{L} W_h \sigma_h}$  (6)
where $n$ is the total sample size, $\sigma_h$ is the standard deviation of the $h$th stratum, and $\sigma_h^2$ is the variance of the $h$th stratum based on $N_h$ units.
$\sigma_h^2 = \frac{1}{N_h} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2$  (7)
$\sigma_h = \sqrt{\sigma_h^2}$  (8)

After deciding the optimum sample size of each stratum, the next step is to draw data from the stratum through random sampling. Each data point in a stratum has an equal chance of selection, with probability $1/n_h$. Random sampling is used as the probability distribution with a without-replacement procedure. The sample pool contains the $n$ data units drawn from the strata, and this sample pool is used for mining.
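The sketch below illustrates Neyman allocation (Equation (6)) followed by simple random sampling without replacement from each stratum; for simplicity it uses a single pooled standard deviation per stratum, which is an assumption on our part (the paper's $\sigma_h$ is defined on the study attribute).

```python
import numpy as np

def neyman_sample(strata, n, seed=0):
    """Draw about n points across strata in proportion to W_h * sigma_h (Equation (6))."""
    rng = np.random.default_rng(seed)
    N = sum(len(s) for s in strata)
    W = np.array([len(s) / N for s in strata])                      # stratum weights W_h
    sigma = np.array([s.std() if len(s) else 0.0 for s in strata])  # pooled sigma_h per stratum
    n_h = np.rint(n * (W * sigma) / np.sum(W * sigma)).astype(int)  # Neyman allocation
    sample, unsample = [], []
    for s, size in zip(strata, n_h):
        if len(s) == 0:
            continue
        size = int(min(max(size, 1), len(s)))
        idx = rng.choice(len(s), size=size, replace=False)          # equal-chance, no replacement
        mask = np.zeros(len(s), dtype=bool)
        mask[idx] = True
        sample.append(s[mask])                                      # goes to the sample pool Ns
        unsample.append(s[~mask])                                   # goes to the un-sample pool Us
    return np.vstack(sample), np.vstack(unsample)
```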

Sample extension

The sample extension method produces the final result by merging the sample pool results with the un-sample data points. This study uses the K-NN classifier as the sample extension technique, defined in Equation (9). The K-NN classifier relies on the Euclidean distance to the result centroids and assigns each un-sample data point to the nearest sample pool result. In the K-NN formulation, $A_i$ is a member of the un-sample data unit and $B_i$ is the mean of the sample pool result for the $i$th variable.15
$\mathrm{dis}_{euclidean}(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$  (9)
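A short sketch of this extension step under the above formulation: each un-sample data point is assigned to the nearest centroid of the partial clustering result (the function name is ours).

```python
import numpy as np

def extend_sample(unsampled, centroids):
    """Assign every un-sample point to the nearest partial-result centroid (Equation (9))."""
    dist = np.linalg.norm(unsampled[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)          # cluster label for each un-sample point
```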

Algorithm description

This subsection presents the stratified sampling-based data mining algorithm for K-Means (SSK-Means) clustering in a single machine execution environment, which reduces the computation cost and memory resources without impairing the effectiveness of conventional K-Means. The proposed model first constructs the strata through the EDS algorithm and then collects samples from each stratum into the sample pool with the help of Neyman allocation and random sampling. Hereafter, the data mining technique is applied to the sample pool, producing partial results. The last step merges the un-sample data into the partial results using the sample extension method to produce the final result, which is validated through data mining measures. The proposed SSK-Means algorithm is described in Algorithm 2 and its flowchart in Figure 2; the flowchart explains the essential concept of the SSK-Means algorithm.

Figure 2. Flow chart of the proposed model (SSK-Means)

Algorithm 2. Stratified sampling-based K-Means using proposed big data mining model

Input:

1. Data N = {N_1, N_2, …, N_n} points in D-dimensional space.

2. i = Attributes of the dataset.

3. k = Required number of clusters.

Output:

1. AFc = {C_1, C_2, …, C_k}, the approximate final clustering result.

Method

1. Call the EDS(N, i, k) algorithm to create L strata, where EDS returns a number of strata L equal to k.

2. Determine the number of data objects n h from each stratum using Equation 6.

3. Using random sampling, collect n_h data objects from each stratum, assign them to the Ns sample pool, and assign the leftover data objects to the Us un-sample pool.

4. Apply the required clustering technique to the Ns pool, such as PFc = KMeans(k, Ns).

5. Use Equation (9) for sample extension to obtain the final clustering result from the centroids of PFc and the Us pool, such as AFc = dis_euclidean(Us, Mean(PFc)).

6. Exit
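Putting the pieces together, the following sketch mirrors the steps of Algorithm 2 using the eds_strata, neyman_sample, and extend_sample functions sketched earlier and scikit-learn's KMeans; the 10% sample fraction and the function signatures are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def ssk_means(data, k, sample_fraction=0.10, seed=0):
    """Sketch of Algorithm 2 built on eds_strata / neyman_sample / extend_sample."""
    strata, _, _ = eds_strata(data, k)                        # step 1: EDS strata
    n = int(sample_fraction * len(data))                      # step 2: total sample size
    Ns, Us = neyman_sample(strata, n, seed=seed)              # step 3: sample / un-sample pools
    km = KMeans(n_clusters=k, random_state=seed).fit(Ns)      # step 4: partial result PFc
    labels_Us = extend_sample(Us, km.cluster_centers_)        # step 5: sample extension
    points = np.vstack([Ns, Us])                              # approximate final result AFc
    labels = np.concatenate([km.labels_, labels_Us])
    return points, labels, km.cluster_centers_
```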

COMPUTATIONAL ANALYSIS

This section first explains the experimental setup and details of datasets for evaluation and then computes the performance of the SSK-Means algorithm using effectiveness and efficiency-related evaluation criteria.

Experiment environment (tools) and dataset

The experimental environment of the proposed model used the Jupyter Notebook computing environment, Python 3.5.3, an Intel i3 processor (CPU [email protected] GHz), a 320 GB hard disk, 4 GB DDR3 RAM, and the Windows 7 operating system, with two real datasets from the UC Irvine Machine Learning Repository (archive.ics.uci.edu/ml/datasets/seeds). Details of these datasets are given in Table 1. The evaluation on the real datasets used 5%, 10%, 15%, and 20% sample data and thereafter merged the sample results with the 95%, 90%, 85%, and 80% un-sample data, respectively.

TABLE 1. Details of real datasets
Datasets Objects Attributes
Skin segmentation 245,057 3
Poker 800,000 10

Evaluation criteria of clustering validation

Cluster validation assesses cluster quality using internal and external measurements. An excellent clustering method always has higher intra-class similarity and lower inter-class similarity. This study uses the silhouette coefficient, Davies-Bouldin score, Calinski-Harabasz score, execution time, and speedup as internal metrics for cluster validation.15, 45 The silhouette coefficient, Calinski-Harabasz score, and speedup metrics are better when maximized, whereas the Davies-Bouldin score and execution time are better when minimized. A brief sketch of how these measures can be computed follows the list below.
  • The silhouette coefficient (SC) validates clustering performance through the pairwise difference of within-cluster (compactness) and between-cluster (separation) distances; it measures the similarity within a cluster. In the SC formula, a(x) is the average distance of x to all other data points in the same cluster C, and b(x) denotes the average distance of x to the data points in the other clusters $C_i$.
$SC = \sum_{x \in C_i} \frac{b(x) - a(x)}{\max[b(x), a(x)]}$  (10)
  • The Davies-Bouldin score (DB) helps to evaluate within-cluster dispersion and between-cluster similarity, without dependence on the number of clusters. In the DB formulation, k is the total number of clusters, $\lvert C_j \rvert$ is the total number of data points $x_i$ inside cluster $C_j$, and $C_i$ is another cluster.
$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} \left( \frac{\mathrm{within}_i + \mathrm{within}_j}{\mathrm{between}_{ij}} \right)$  (11)
$\mathrm{within}_j = \frac{1}{\lvert C_j \rvert} \sum_{i=1}^{\lvert C_j \rvert} \lVert x_i - C_j \rVert^2$  (12)
$\mathrm{between}_{ij} = \lVert C_i - C_j \rVert^2$  (13)
  • The Calinski-Harabasz score (CH) is based on the ratio of the between-cluster and within-cluster sums of squares; it measures cluster variance and is referred to as the variance ratio criterion. In the CH formulation, n is the total number of data points, k is the total number of clusters, x is a data point inside cluster $C_i$, m denotes the mean of the entire dataset, and $m_i$ is the mean of cluster $C_i$.
$CH = \frac{(n - k) \sum_{i=1}^{k} \lvert C_i \rvert (m_i - m)^T (m_i - m)}{(k - 1) \sum_{i=1}^{k} \sum_{x \in C_i} (x - m_i)^T (x - m_i)}$  (14)
  • Execution time (ET) computes the total execution time of an algorithm/model, obtained as the difference between the exit time EXT and the entry time ENT of the data mining algorithm.
$ET = EXT - ENT$  (15)
  • The speedup ratio (SR) estimates the ratio of the execution time of the conventional algorithm $T_{TCS}$ to that of the sampling-based algorithm $T_{SCS}$.
$SR = \frac{T_{TCS}}{T_{SCS}}$  (16)
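The sketch below shows one way these internal measures and the timing could be computed with scikit-learn, reusing the ssk_means sketch from the previous section and a small synthetic stand-in dataset; both are our assumptions, not the paper's experimental code.

```python
import time
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def internal_measures(data, labels):
    """SC, DB, and CH as implemented in scikit-learn."""
    return {"SC": silhouette_score(data, labels),          # Equation (10); higher is better
            "DB": davies_bouldin_score(data, labels),      # Equation (11); lower is better
            "CH": calinski_harabasz_score(data, labels)}   # Equation (14); higher is better

# Timing and speedup, Equations (15) and (16), on a synthetic stand-in dataset.
data = np.random.default_rng(0).random((5000, 3))
start = time.time()
points, labels, _ = ssk_means(data, k=3)          # from the sketch in the previous section
et_scs = time.time() - start                      # ET of the sampling-based run
print(internal_measures(points, labels))
# sr = et_tcs / et_scs   # et_tcs: ET of conventional K-Means, measured the same way
```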

Algorithms used for big data clustering

This study compares the performance of the proposed stratified random sampling-based K-Means algorithm (SSK-Means) against the random sampling-based K-Means (RSK-Means)25 and traditional K-Means19 algorithms on a single machine, based on the clustering objective.

Results

The reported results of the internal measures are shown in Tables 2-6, based on the average of 10 trials. Optimal values for each investigation are marked in boldface. The number of clusters is set to three for all experiments. Tables 2-4 present the comparative examination of SC, DB, and CH values on the experimental datasets using the RSK-Means and SSK-Means algorithms, respectively. Table 2 illustrates that the proposed SSK-Means algorithm attained better cluster compactness and separation than the RSK-Means algorithm by maximizing the SC value. The SC value of SSK-Means is better than that of RSK-Means except for the 10% sample size on the skin dataset, and the SC value of SSK-Means is better than that of RSK-Means for all sample sizes on the poker dataset.

TABLE 2. Comparison of SC values (mean ± std) of RSK-Means and SSK-Means
Dataset Sample size RSK-Means SSK-Means
Skin segmentation 5% 0.51524 ± 0.0015 0.51566 ± 0.00152
10% 0.51567 ± 0.0014 0.51482 ± 0.00130
15% 0.51468 ± 0.00145 0.51528 ± 0.0012
20% 0.51483 ± 0.00145 0.51501 ± 0.00136
Poker 5% 0.11876 ± 0.00269 0.11934 ± 0.00151
10% 0.11848 ± 0.00356 0.11895 ± 0.00235
15% 0.11892 ± 0.00261 0.11973 ± 0.00197
20% 0.11900 ± 0.00263 0.120020 ± 0.0020
  • Note: Optimal values for each investigation are marked in boldface.
TABLE 3. Comparison of DB values (mean ± std) of RSK-Means and SSK-Means
Dataset Sample size RSK-Means SSK-Means
Skin segmentation 5% 0.81746 ± 0.00969 0.81737 ± 0.00997
10% 0.81703 ± 0.00649 0.81634 ± 0.00774
15% 0.81684 ± 0.00771 0.81646 ± 0.00654
20% 0.81956 ± 0.00449 0.81584 ± 0.00744
Poker 5% 2.14451 ± 0.0418 2.13567 ± 0.01555
10% 2.15167 ± 0.03592 2.14005 ± 0.01956
15% 2.14843 ± 0.03371 2.13856 ± 0.02105
20% 2.14216 ± 0.03183 2.12686 ± 0.01957
  • Note: Optimal values for each investigation are marked in boldface.
TABLE 4. Comparison of CH values (mean ± std) of RSK-Means and SSK-Means
Dataset Sample size RSK-Means SSK-Means
Skin segmentation 5% 305228.77 ± 129.69 305245.70 ± 107.42
10% 305262.29 ± 71.75 305282.88 ± 87.54
15% 305288.51 ± 49.36 305279.05 ± 79.75
20% 305253.11 ± 76.16 305273.44 ± 88.24
Poker 5% 112017.24 ± 3764.91 112870.93 ± 2004.81
10% 111403.47 ± 3336.27 112326.03 ± 2308.14
15% 111794.62 ± 3351.86 112290.39 ± 2201.60
20% 112221.03 ± 2813.21 113468.85 ± 1909.35
  • Note: Optimal values for each investigation are marked in boldface.
TABLE 5. Comparison of SC, DB, and CH values (mean ± std) of K-Means, RSK-Means, and SSK-Means
Dataset Evaluation criteria K-Means RSK-Means SSK-Means
Skin segmentation SC 0.5123 ± 0.00291 0.51511 ± 0.00147 0.51519 ± 0.00134
DB 0.81712 ± 0.00583 0.81772 ± 0.00656 0.8165 ± 0.00645
CH 305269.71 ± 59.38 305258.17 ± 86.25 305270.27 ± 88.96
Poker SC 0.1167 ± 0.00402 0.11879 ± 0.00279 0.11951 ± 0.00195
DB 20.19172 ± 0.06467 2.14669 ± 0.03208 2.13528 ± 0.01895
CH 107753.84 ± 4693.99 111859.093 ± 3217.61 112739.056 ± 2086.43
  • Note: Optimal values for each investigation are marked in boldface.
TABLE 6. Comparison of ET values (mean ± std) of RSK-Means and SSK-Means
Dataset Sample size RSK-Means SSK-Means
Skin segmentation 5% 104.6961 ± 34.7258 72.0375 ± 18.3354
10% 69.3591 ± 25.4227 87.3679 ± 17.5049
15% 208.2860 ± 73.2868 70.5661 ± 15.3212
20% 219.1936 ± 105.7125 139.8222 ± 39.0222
Poker 5% 168.71725 ± 35.4200 162.05747 ± 25.58619
10% 233.82127 ± 79.3378 211.25498 ± 37.78515
15% 316.57931 ± 94.7075 235.40716 ± 54.46616
20% 413.60596 ± 75.4575 397.17432 ± 113.9999
  • Note: Optimal values for each investigation are marked in boldface.

Table 3 reveals that the proposed SSK-Means algorithm obtained better within- and between-cluster distances than the RSK-Means algorithm through minimization of the DB value. The DB value on the skin dataset is close to zero because the dataset uses the exact number of clusters; the DB value on the poker dataset is higher than one because that dataset requires more clusters. The DB value of SSK-Means is better than that of RSK-Means on both datasets. Table 4 shows that the proposed algorithm achieved an excellent variance ratio between the within- and between-cluster sums of squares by maximizing the CH value. On the CH score, SSK-Means obtained better values than RSK-Means on the skin dataset except for the 15% sample size, and better CH values than RSK-Means on the poker dataset for all sample sizes.

Table 5 presents the SC, DB, and CH examinations of K-Means, RSK-Means, and SSK-Means using the average results over the selected sample sizes, where SSK-Means obtained better SC, DB, and CH results than K-Means and RSK-Means on both datasets. This demonstrates that SSK-Means achieves better homogeneity, compactness, separation, and similarity than the K-Means and RSK-Means algorithms.

The execution time and speedup ratio validate the algorithm efficiency. Table 6 shows the execution time of the RSK-Means and SSK-Means algorithms for the selected sample sizes, and their speedup ratio relative to the K-Means execution time is shown in Figure 3. The execution time of SSK-Means is better than that of RSK-Means for each sample size. The speedup of SSK-Means shows higher scalability and speed than RSK-Means for each sample size except for the 10% sample size on the skin dataset. The 15% sample size gives the highest scalability and speed for SSK-Means compared with the other sample sizes on both datasets.

Figure 3. Speedup ratio (SR) of RSK-Means and SSK-Means using K-Means execution time

The average execution times of the K-Means, RSK-Means, and SSK-Means algorithms are shown in Figure 4, which indicates that SSK-Means consumes less execution time than K-Means and RSK-Means. The proposed SS-based algorithm reduces the execution time without affecting the quality of the conventional and random sampling-based algorithms. SSK-Means reduces the average execution time by 58% and 31% with respect to RSK-Means on the skin and poker datasets, respectively.

Figure 4. Execution time of K-Means, RSK-Means, and SSK-Means

The SSK-Means algorithm uses fewer iterations during clustering, which is why it reduces the execution time, computation cost, and resource consumption and improves the convergence speed of K-Means. The reported results of Table 5 and the average execution times of Figure 4 indicate that SSK-Means achieves better effectiveness and efficiency than the RSK-Means algorithm and is comparable to K-Means. This indicates that the proposed algorithm is readily scalable and robust for big data clustering.

CONCLUSION

This study has suggested two strategies to overcome the shortcomings of sample-based clustering algorithms in terms of computation time, resource utilization, and quality. The first is the EDS method, which is used for stratum creation by dividing the original data into different strata according to the required number of clusters. The second is a stratified random sampling-based data mining model for big data (the SS-based data mining algorithm, abbreviated SSK-Means). The SSK-Means algorithm has been compared with the random sampling-based K-Means (RSK-Means) using the SC, DB, CH, ET, and SR validation measures. Data objects for the 5%, 10%, 15%, and 20% sample sizes were chosen by EDS, sample allocation, and stratified sampling. The K-Means partial clustering results of the 5%, 10%, 15%, and 20% sample data were obtained and merged with the 95%, 90%, 85%, and 80% un-sample data, respectively, by sample extension to produce the final results. SSK-Means obtained a higher speedup ratio at the 15% sample size than at the other sample sizes and than RSK-Means. Experimental studies on real datasets show that the proposed algorithm has superior clustering performance to the RSK-Means and conventional K-Means algorithms in terms of efficiency and effectiveness. SSK-Means achieved better computing time, cluster quality, and efficiency at the 15% and 20% sample sizes. Sampling strategies significantly reduce the clustering time on large-scale data, but they may decrease the cluster quality when the selected sample size is too small. The sample selection, sample allocation, clustering algorithm, and sample extension all depend on the stratification process; therefore, better stratification determines the effectiveness and efficiency of the clustering algorithm. The further scope of this study is to address the stratification process, sample size selection, and sample allocation concerns for better cluster effectiveness, and to validate the model with other internal and external measures in single or multiple machine environments.

CONFLICT OF INTEREST

The authors declare that they have no conflict of interest.

Biographies


    Kamlesh Kumar Pandey is pursuing a Ph.D. at Dr. Hari Singh Gour Vishwavidyalaya (A Central University), Sagar, India, under the supervision of Prof. Diwakar Shukla. He is currently doing research on the design of big data mining algorithms with respect to clustering. He is the author and co-author of several research articles in international journals and conferences such as IEEE, Springer, and others. He has 8 years of teaching and research experience. He was awarded Training of Young Scientist in the 34th and 35th M.P. Young Scientist Congress.


    Diwakar Shukla is presently working as Head of the Department of Computer Science and Applications, Dr. Hari Singh Gour Vishwavidyalaya, Sagar, India, and has over 25 years of teaching and research experience. He obtained M.Sc. (Statistics) and Ph.D. (Statistics) degrees from Banaras Hindu University, Varanasi, served Devi Ahilya University, Indore, M.P. as a permanent Lecturer from 1989 for 9 years, and obtained an M.Tech. (Computer Science) degree from there. He joined Dr. Hari Singh Gour Vishwavidyalaya, Sagar as a Reader in Statistics in 1998. During his Ph.D. at BHU, he was a junior and senior research fellow of CSIR, New Delhi, through the Fellowship Examination (NET) of 1983. He has published more than 75 research articles in national and international journals and participated in more than 35 seminars/conferences at the national level. He also worked as a Professor at Lucknow University, Lucknow, U.P., for one year (June 2007 to 2008) and visited Sydney (Australia) and Shanghai (China) for conference participation and paper presentation. He has supervised 14 Ph.D. theses in Statistics and Computer Science, and seven students are presently enrolled for their doctoral degrees under his supervision. He is the author of two books and a member of 11 learned bodies of Statistics and Computer Science at the national level. His research areas are Sampling Theory, Graph Theory, Stochastic Modeling, Data Mining, Big Data, Operations Research, Computer Networks, and Operating Systems.
