Volume 3, Issue 6 e1206
RESEARCH ARTICLE

Euclidean distance stratified random sampling based clustering model for big data mining

Kamlesh Kumar Pandey (Corresponding Author)

Department of Computer Science & Applications, Dr. Hari Singh Gour Vishwavidyalaya, Sagar, India

Correspondence

Kamlesh Kumar Pandey, Department of Computer Science & Applications, Dr. Hari Singh Gour Vishwavidyalaya, Sagar, Madhya Pradesh, India.

Email: [email protected]

Diwakar Shukla

Department of Computer Science & Applications, Dr. Hari Singh Gour Vishwavidyalaya, Sagar, India

First published: 15 October 2021

Funding information: No Funding.

Abstract

Big data mining is related to large-scale data analysis and faces computational cost-related challenges due to the exponential growth of digital technologies. Classical data mining algorithms suffer from computational deficiency, memory utilization, resource optimization, scale-up, and speed-up related challenges in big data mining. Sampling is one of the most effective data reduction techniques; it reduces the computational cost and improves scalability and computational speed with high efficiency for any data mining algorithm in single and multiple machine execution environments. This study proposes a Euclidean distance-based stratum method for stratum creation and a stratified random sampling-based big data mining model using the K-Means clustering algorithm (SSK-Means) in a single machine execution environment. The SSK-Means algorithm achieves better cluster quality, speed-up, scale-up, and memory utilization than the random sampling-based K-Means and classical K-Means algorithms, as assessed by the silhouette coefficient, Davies-Bouldin index, Calinski-Harabasz index, execution time, and speedup ratio internal measures.

INTRODUCTION

Nowadays, massive data is being generated through the use of digital and communication technologies, especially sensor networks, the Internet of Things, social media, healthcare, e-commerce, cloud computing, cyber-physical systems, and so on. In May 2018, Forbes released an article on the speed of data production. It stated that Google serves 3.5 billion search requests per day and 1.2 trillion requests per year worldwide, and that every minute worldwide YouTube users watch 4,146,600 videos, Twitter users write 456,000 tweets, Instagram and Snapchat users share 46,740 and 527,760 photos respectively, and Facebook users write 510,000 comments, post 293,000 status updates, and upload 300 million photos.1 These digital platforms draw on a variety of sources that generate heterogeneous data at high speed. Complex sources, data formats, and large-scale data production define big data. It is a revolution beyond conventional and batch data, as internet users grew by 7.5% from 2016 to 2018.1 These statistics characterize big data in terms of volume, variety, and velocity, and show that traditional data mining algorithms, tools, and techniques are incompatible with it.2 The essential characteristics of big data are recognized as volume, variety, and velocity, whereas value, veracity, variability, and visualization are identified as supporting characteristics.3-6

One way of summarizing this is: “Heterogeneous (variety) data are generated, created, and updated by heterogeneous (variety) sources through batch, real-time, and stream environments at high speed (velocity). Variety and velocity increase the data size (volume) to multiple terabytes and petabytes.” Veracity, value, variability, and visualization are recognized as supporting characteristics that rest on volume, variety, and velocity. Based on the existing research,3-6 these supporting characteristics are described as follows: “Veracity encourages accuracy through variety. Value supports the results of data mining through volume, variety, and velocity. Variability boosts sentiment analysis via volume and variety. Visualization supports graphical representation in a user-readable form by using the other big data characteristics.”

Data mining algorithms need to improve their computational cost, speed, scalability, flexibility, and efficiency to match the essential characteristics of big data.7 Big data mining is the extraction of appropriate hidden predictive information, patterns, and relations from heterogeneous large-scale datasets,8 and it requires high transparency with respect to volume, variety, and velocity because large-scale data contains valuable knowledge and information.9 Mining techniques have limited applicability to high-volume datasets containing millions of tuples and records. To overcome these issues, two alternatives are available: the first is to scale up the data mining algorithm, and the second is to reduce the dataset size. These two alternatives make data mining effective and feasible.10 Intelligent big data mining combines statistics and mining techniques with data and process management.9

Sampling is a data reduction technique based on a statistical model.11, 12 It provides more effective evaluation with accuracy13 and is recognized as a “bag of little bootstraps”.14 Reference 15 described a standard framework of the sampling process for data mining using clustering. According to this framework, the entire dataset is split into a sample component and an un-sample component. The data mining algorithm uses the sample component and gives partial results, referred to as approximate results. Thereafter, the partial results and the un-sample component are merged using a sample extension strategy, which produces the final approximate result.

Cluster analysis is adopted in various applications such as social network analysis, ontology, customer segmentation, scientific data analysis, bioinformatics, natural classification, underlying structure discovery, data compression, mobile ad-hoc networks, target marketing, texture segmentation, and vector quantization.15, 16 Clustering is one of the essential techniques in data mining for discovering the classes of unlabeled data based on distance, similarity, and dissimilarity. It discovers hidden relations, patterns, and information between classes, and improves the efficiency and effectiveness of the data mining model.17 Some authors16, 18 evaluated conventional clustering algorithms for big data mining against volume, variety, and velocity, and suggested that K-Means, BFR, CURE, BIRCH, CLARA, DBSCAN, DENCLUE, WaveCluster, and FC are robust for big data mining.

Past contributions7, 15, 16, 19, 20 addressed cluster creation techniques for big data mining in single and multiple machine execution environments. Clustering techniques for big data mining are categorized into divide-and-conquer, parallel, center reduction, efficient nearest neighbor (NN) search, sampling, dimension reduction, incremental, and condensation methods. The parallel method is implemented only on multiple machines, whereas the other cluster construction methods can be executed in either environment.21 This article uses a sampling-based method for cluster creation in a single machine execution environment. The sampling mechanism increases the speed of the clustering algorithm and reduces computation time. The efficiency of sampling depends on the sample selection strategy and the sample size.

Reference 15 discussed the CURE hierarchical clustering algorithm, which is based on uniform random sampling and improves computational efficiency. This method first selects the sample and thereafter carries out hierarchical clustering. Kollios et al.22 used a density-biased sampling technique for cluster construction and outlier detection within the K-Means and K-medoids algorithms. David et al.23 presented a sample-based clustering framework for the K-Means and K-Median clustering algorithms. This sampling framework achieves approximate results and depends on the sample size and accuracy parameters, because the sample bounds the convergence to the optimal clustering.

Aggarwal et al.24 proposed a bi-criteria approximation approach for the initial centroid selection of K-Means clustering using adaptive sampling. This approach suffers from excessive Euclidean distance computations and distance comparisons, which make the triangle inequalities ineffective. Bejarano et al.25 suggested a sampler algorithm for K-Means clustering based on random sampling. This algorithm reduces the distance estimation and computational cost in each iteration, but it increases the computation time for massive multi-dimensional data due to sample size estimation in each iteration. Cui et al.26 suggested an optimized K-Means clustering using random sampling and parallel execution strategies. This strategy eliminates the iterations of K-Means inside parallel execution and yields effective performance. Xu et al.27 described a random instance sampling-based approximate clustering approach for text data, compared it against K-Means, and observed that sampling-based clustering reduces the computational expense.

Zhan et al.28 introduced a spectral clustering algorithm using an incremental Nyström sampling method for large-scale and high-dimensional datasets. This algorithm achieves optimal approximation results and shows that maximizing the sampling trials efficiently reduces the sampling error. Ros et al.29 used a combination of distance and density sampling for K-Means and hierarchical clustering. This method eliminates multiple distance computations in both clustering algorithms and is insensitive to data size, cluster noise, and cluster initialization. However, the combination of both sampling techniques increases the computation cost in the first iteration of the clustering. Zhao et al.30 presented Cluster Sampling Arithmetic (CSA), an efficient heuristic approach for clustering through random sampling. This algorithm reduces the computation time by obtaining a minimum sample set and avoids the local optimum problem. The CSA uses a parallel execution strategy and discovers the initial centroids of K-Means clustering.

Aloise et al.31 used iterative sampling to solve the minimax diameter clustering problem, which concerns minimizing the maximum intra-cluster dissimilarity; solving it also addresses the partitional clustering objective. Ben Hajkacem et al.32 used reservoir random sampling and proposed the sampling-based STiMR K-Means clustering algorithm. The STiMR K-Means algorithm uses the triangle inequality and MapReduce acceleration techniques for fast execution, but it suffers from scalability-related issues. Li et al.33 proposed sampling-based K-Means (SKMeans) and reported excellent results compared with the K-Means algorithm; this algorithm effectively reduces the data size and increases cluster effectiveness and efficiency. Luchi et al.34 enhanced the DBSCAN clustering algorithm through sampling in terms of clustering time and scalability, producing the Rough-DBSCAN and I-DBSCAN algorithms. These algorithms are influenced by parameters such as the sample size.

Based on the above research, sampling has been used for different kinds of clustering problems such as outlier detection, initial centroid identification, cluster construction, data size selection, computation cost, sampling error, number of sampling trials, local optima, the clustering objective, enormous distance computation, scalability, speed-up, and so on. References 26, 27, 30, and 33 used multiple machines, and References 22-25, 28, 29, 31, 32, and 34 used a single machine execution environment for clustering; most of these clustering algorithms achieved approximate results. This study considers the improvement of random sampling for cluster creation in a single machine execution environment.

Some other sampling-based data mining techniques are density biased sampling,22 CURE (Clustering Using REpresentatives), RSEFCM (Random Sampling plus Extension Fuzzy C-Means), CLARANS (Clustering Large Applications based on Randomized Sampling),15 eNERF (extended non-Euclidean relational fuzzy c-means clustering),35 progressive sampling-based mining of association rules,36 GOFCM (geometric progressive fuzzy c-means),15 EM clustering,35 STiMR K-Means,32 AdROIT (Adaptive Reservoir sampling Of stream In Time),37 random pairing,38 adaptive-size reservoir sampling, Stratified Reservoir Sampling,39 Monte Carlo based uncertainty quantification and Refined Stratified Sampling,40 SSEFCM (Stratified sampling Plus Extension Fuzzy C Means),15 BIRCH,16 and so forth.

The objective of this article is to improve the computational efficiency and computing speed of conventional clustering algorithms without affecting cluster quality in big data mining using stratified random sampling. Section 1 describes the problem of data mining in large-scale data and presents big data clustering techniques based on existing computational methods. Section 2 presents the stratified sampling method, the stratum creation algorithm, and the stratified sampling-based data mining algorithm. Section 3 contains the implementation of the proposed work using the K-Means algorithm and validates it using internal measures. Section 4 concludes the study and identifies further research directions.

PROPOSED WORK

This section describes the clustering objective and presents the stratified sampling-based K-Means algorithm (SSK-Means) with the Euclidean distance-based stratum (EDS) method for cluster construction in big data mining using single machine execution. The proposed method improves computational efficiency, reduces computation cost, and increases computing speed without affecting cluster quality or the clustering objective.

Objective function

Suppose the dataset $N$ consists of $n$ data points with $D$ dimensions, and the required number of clusters is $k$. A partitional clustering algorithm such as K-Means divides $N = \{N_1, N_2, \ldots, N_n\}$ into $k$ exhaustive and mutually exclusive clusters $C = \{C_1, C_2, \ldots, C_k\}$. The partition must satisfy the conditions $C_1 \cup C_2 \cup \ldots \cup C_k = N$ and $C_1 \cap C_2 \cap \ldots \cap C_k = \emptyset$.16 The K-Means algorithm creates the clusters with the objective of minimizing the sum of squared error (SSE) function through an iterative process. The criterion optimization function of K-Means clustering is given in Equation (1).19
$SSE: J(N, C) = \sum_{k=1}^{K} \sum_{n_i \in C_k} \lVert n_i - \mu_k \rVert^2$  (1)
where $n_i$ denotes a data point and $\mu_k$ is the centroid of cluster $C_k$. Equation (1) refers to the minimum SSE problem, which can be solved by Equation (2). The content of $C_k$ that constrains the minimum SSE is defined as follows.41
$C_k = \{ n_i \in N : k = \arg\min_{j \in \{1, 2, \ldots, K\}} \lVert n_i - \mu_j \rVert^2 \}$  (2)
$\mu_k = \frac{\sum_{n_i \in C_k} n_i}{\lvert C_k \rvert}$  (3)
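To make the objective concrete, the following NumPy sketch computes the SSE of Equation (1) and the two iterative K-Means steps of Equations (2) and (3); the function and variable names are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def sse(data, centroids, labels):
    """Sum of squared errors J(N, C) of Equation (1)."""
    return sum(np.sum((data[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))

def assign(data, centroids):
    """Equation (2): assign every point to its nearest centroid."""
    dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def update(data, labels, k):
    """Equation (3): recompute each centroid as the mean of its cluster."""
    return np.array([data[labels == j].mean(axis=0) for j in range(k)])
```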

Proposed model for big data mining

This subsection presents the stratified sampling-based data mining model, which has the capability to scale up and speed up any conventional data mining algorithm in a big data environment. The proposed model works in three phases: stratified sampling, a data mining technique (here, clustering), and sample extension. Details of stratified sampling and sample extension are described separately below.

Stratified sampling (SS)

A successful sampling technique must be scalable under the high volume, variety, and velocity characteristics of big data. SS reduces the data volume, categorizes data from heterogeneous to homogeneous strata, and can change the sample data according to time and user requirements.15, 30 From a statistical point of view, SS is an unbiased sampling approach that improves the convergence rate and reduces the variance.40 These capabilities illustrate that SS is more scalable for big data mining than other sampling methods.

The objective of SS is to improve the precision of results based on homogeneous strata. It conducts the sampling process in two phases. In the first step, the entire dataset is grouped into small strata based on similarity; this is known as the stratification process. The second step extracts relevant data points from each stratum for data mining or analysis; this is called sample allocation. The data points of the sample are chosen by any sampling method according to the objective function.42, 43

Suppose the dataset $N$ is grouped into $L$ strata and the $h$th stratum consists of $N_h$ data points, where $\sum_{h=1}^{L} N_h = N$ and $h = 1, 2, 3, \ldots, L$. The sample of size $n_h$ for each stratum is drawn by any sampling technique suitable for the research problem, where $\sum_{h=1}^{L} n_h = n$. The conceptual representation of SS is shown in Figure 1.

Figure 1. Conceptual representation of SS
Let $Y_{hi}$ represent a stratified dataset unit, where $i$ indexes the selected study attribute values in the $h$th stratum, $i = 1, 2, 3, \ldots, N_h$. Thereafter, random sampling extracts homogeneous data points from each stratum $h$ according to the sample size. Let $y_{hi}$ be a random sample unit, where the $i$th data unit is selected from the $h$th stratum for the mining model. The mean of the $h$th stratum, $\bar{Y}_h$, is based on the $N_h$ data units, and the mean of the random sample of the $h$th stratum, $\bar{y}_h$, is based on the $n_h$ data units.
$\bar{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi}$  (4)
$\bar{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$  (5)
The stratum weight $W_h$ of stratified sampling is $W_h = N_h / N$. SS is an unbiased sampling process and produces higher precision than biased sampling.
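As a small illustration of Equations (4) and (5) and the stratum weight $W_h$, the sketch below computes these quantities for strata held as NumPy arrays; the data layout, sampling fraction, and function name are our assumptions, not the paper's.

```python
import numpy as np

def stratum_statistics(strata, sample_fraction=0.1, seed=0):
    """Per-stratum weight W_h, stratum mean (Eq. 4), and sample mean (Eq. 5)."""
    rng = np.random.default_rng(seed)
    N = sum(len(s) for s in strata)               # total number of data points
    stats = []
    for s in strata:                               # s: (N_h, D) array for one stratum
        n_h = max(1, int(sample_fraction * len(s)))
        sample = s[rng.choice(len(s), size=n_h, replace=False)]
        stats.append({"W_h": len(s) / N,                 # N_h / N
                      "Y_bar_h": s.mean(axis=0),         # Equation (4)
                      "y_bar_h": sample.mean(axis=0)})   # Equation (5)
    return stats
```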

Stratification

In computer science, numerous stratification techniques are available for stratum formation, such as locality-sensitive hashing,15 the greedy stratification method,44 Latin hypercube sampling,40 Hamming distance, and other collision-related hashing and bucket-based strategies. This article considers a new method of stratum creation using the Euclidean distance, known as the Euclidean distance based stratum (EDS) method. The EDS algorithm uses a max-min data range heuristic for constructing k strata, which produces higher homogeneity within each stratum compared with the mid-square, division, multiplication, and folding hashing techniques. The method first determines k centroids of the dataset based on the max and min range of the data points, and thereafter assigns each data point to one of the k strata according to the Euclidean distance between the data point and the k centroids. This approach yields high density and compactness within each stratum. The EDS method is described in Algorithm 1.

Algorithm 1. Proposed Euclidean distance based stratum (EDS)

Input:

1. data = Dataset with N data points.

2. i = Attributes of the dataset.

3. k = Required number of strata.

Output:

1. L = {N_1, N_2, …, N_k}, the set of homogeneous strata.

Method

Data set centroid identification

1. v = (max_value(i) − min_value(i)) / (k + 1)

2. if v is greater than max_value(i) then

3. the dataset has a noise data point; exit()

4. else

Stratum centroid identification

5. c_1 = min_value(i) + v

6. c_2 = c_1 + v

7. c_k = c_(k−1) + v

8. End if

Assign data point into stratum

9. for i = 1 to length of data(i)
dis_euclidean(data(i), c_k) = ‖data(i) − c_k‖²

10. Assign each data point to the closest stratum k.

11. end for

12. Exit()
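A minimal Python sketch of Algorithm 1 follows, assuming the max-min range heuristic is applied to every attribute at once so that the k stratum centroids are D-dimensional points; the function name eds_strata and this vectorized reading are our choices, not the authors'.

```python
import numpy as np

def eds_strata(data, k):
    """Split data (n x D array) into k strata around equally spaced centroids."""
    v = (data.max(axis=0) - data.min(axis=0)) / (k + 1)          # step 1
    if np.any(v > data.max(axis=0)):                              # steps 2-3: noise guard
        raise ValueError("dataset has a noise data point")
    # steps 5-7: c_1 = min + v, c_2 = c_1 + v, ..., c_k = c_(k-1) + v
    centroids = np.array([data.min(axis=0) + (j + 1) * v for j in range(k)])
    # steps 9-10: assign every point to its closest centroid (Euclidean distance)
    dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    strata = [data[labels == j] for j in range(k)]
    return strata, centroids, labels
```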

Sample allocation techniques

The efficiency of stratification depends upon the sample size taken from the hth stratum. Optimum allocation is a sample allocation technique that decides the sample size according to the stratum size and variability, at the lowest cost, and it obeys the sample allocation characteristics of SS. The Neyman allocation method is a specific case of optimum allocation that minimizes the variance and total cost of the sample allocation. The number of sample units (sample size) to be selected from the hth stratum is defined in Equation (6).
$n_h = n \cdot \frac{W_h \sigma_h}{\sum_{h=1}^{L} W_h \sigma_h}$  (6)
where $n$ is the total sample size, $\sigma_h$ is the standard deviation of the $h$th stratum, and $\sigma_h^2$ is the variance of the $h$th stratum based on $N_h$ units.
$\sigma_h^2 = \frac{1}{N_h} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2$  (7)
$\sigma_h = \sqrt{\sigma_h^2}$  (8)

After deciding the optimum sample size of each stratum, the next step is to draw data from the stratum through random sampling. Each data point in a stratum has an equal chance of selection, with probability $1/n_h$. Random sampling is used as the probability distribution with a without-replacement procedure. The sample pool contains the $n$ data units drawn from the strata, and this sample pool is used for mining.
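The sketch below illustrates Neyman allocation (Equation (6)) followed by simple random sampling without replacement from each stratum; for simplicity it uses a single pooled standard deviation per stratum, which is an assumption on our part (the paper's $\sigma_h$ is defined on the study attribute).

```python
import numpy as np

def neyman_sample(strata, n, seed=0):
    """Draw about n points across strata in proportion to W_h * sigma_h (Equation (6))."""
    rng = np.random.default_rng(seed)
    N = sum(len(s) for s in strata)
    W = np.array([len(s) / N for s in strata])                      # stratum weights W_h
    sigma = np.array([s.std() if len(s) else 0.0 for s in strata])  # pooled sigma_h per stratum
    n_h = np.rint(n * (W * sigma) / np.sum(W * sigma)).astype(int)  # Neyman allocation
    sample, unsample = [], []
    for s, size in zip(strata, n_h):
        if len(s) == 0:
            continue
        size = int(min(max(size, 1), len(s)))
        idx = rng.choice(len(s), size=size, replace=False)          # equal-chance, no replacement
        mask = np.zeros(len(s), dtype=bool)
        mask[idx] = True
        sample.append(s[mask])                                      # goes to the sample pool Ns
        unsample.append(s[~mask])                                   # goes to the un-sample pool Us
    return np.vstack(sample), np.vstack(unsample)
```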

Sample extension

The sample extension method produces the final result by merging the sample pool results with the un-sample data points. This study uses the K-NN classifier as the sample extension technique, defined in Equation (9). The K-NN classifier relies on the Euclidean distance to the result centroids and assigns each un-sample data point to the nearest sample pool result. In the K-NN formulation, $A_i$ is a member of the un-sample data unit and $B_i$ is the mean of the sample pool result for the $i$th variable.15
$\mathrm{dis}_{euclidean}(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$  (9)
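A short sketch of this extension step under the above formulation: each un-sample data point is assigned to the nearest centroid of the partial clustering result (the function name is ours).

```python
import numpy as np

def extend_sample(unsampled, centroids):
    """Assign every un-sample point to the nearest partial-result centroid (Equation (9))."""
    dist = np.linalg.norm(unsampled[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)          # cluster label for each un-sample point
```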

Algorithm description

This subsection presents the stratified sampling-based data mining algorithm for K-Means (SSK-Means) clustering in a single machine execution environment, which reduces the computation cost and memory resources without impairing the effectiveness of conventional K-Means. The proposed model first constructs the strata through the EDS algorithm and then collects samples from each stratum into the sample pool with the help of Neyman allocation and random sampling. Hereafter, the data mining technique is applied to the sample pool, producing partial results. The last step merges the un-sample data into the partial results using the sample extension method to produce the final result, which is validated through data mining measures. The proposed SSK-Means algorithm is described in Algorithm 2 and its flowchart in Figure 2; the flowchart explains the essential concept of the SSK-Means algorithm.

Figure 2. Flow chart of the proposed model (SSK-Means)

Algorithm 2. Stratified sampling-based K-Means using proposed big data mining model

Input:

1. Data N = {N_1, N_2, …, N_n} points in D-dimensional space.

2. i = Attributes of the dataset.

3. k = Required number of clusters.

Output:

1. AFc = {C_1, C_2, …, C_k}, the approximate final clustering result.

Method

1. Call the EDS(N, i, k) algorithm to create L strata, where EDS returns a number of strata L equal to k.

2. Determine the number of data objects n h from each stratum using Equation 6.

3. Using random sampling, collect n_h data objects from each stratum, assign them to the Ns sample pool, and assign the leftover data objects to the Us un-sample pool.

4. Apply the required clustering technique to the Ns pool, such as PFc = KMeans(k, Ns).

5. Use Equation (9) for sample extension to obtain the final clustering result from the centroids of PFc and the Us pool, such as AFc = dis_euclidean(Us, Mean(PFc)).

6. Exit
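Putting the pieces together, the following sketch mirrors the steps of Algorithm 2 using the eds_strata, neyman_sample, and extend_sample functions sketched earlier and scikit-learn's KMeans; the 10% sample fraction and the function signatures are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def ssk_means(data, k, sample_fraction=0.10, seed=0):
    """Sketch of Algorithm 2 built on eds_strata / neyman_sample / extend_sample."""
    strata, _, _ = eds_strata(data, k)                        # step 1: EDS strata
    n = int(sample_fraction * len(data))                      # step 2: total sample size
    Ns, Us = neyman_sample(strata, n, seed=seed)              # step 3: sample / un-sample pools
    km = KMeans(n_clusters=k, random_state=seed).fit(Ns)      # step 4: partial result PFc
    labels_Us = extend_sample(Us, km.cluster_centers_)        # step 5: sample extension
    points = np.vstack([Ns, Us])                              # approximate final result AFc
    labels = np.concatenate([km.labels_, labels_Us])
    return points, labels, km.cluster_centers_
```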

COMPUTATIONAL ANALYSIS

This section first explains the experimental setup and details of datasets for evaluation and then computes the performance of the SSK-Means algorithm using effectiveness and efficiency-related evaluation criteria.

Experiment environment (tools) and dataset

The experimental environment of the proposed model used the Jupyter Notebook computing environment, Python 3.5.3, an Intel i3 processor (CPU [email protected] GHz), a 320 GB hard disk, 4 GB DDR3 RAM, and the Windows 7 operating system, with two real datasets from the UC Irvine Machine Learning Repository (archive.ics.uci.edu/ml/datasets/seeds). Details of these datasets are given in Table 1. The evaluation on the real datasets used 5%, 10%, 15%, and 20% sample data and thereafter merged the sample results with the 95%, 90%, 85%, and 80% un-sample data, respectively.

TABLE 1. Details of real datasets
Datasets Objects Attributes
Skin segmentation 245,057 3
Poker 800,000 10

Evaluation criteria of clustering validation

Cluster validation assesses cluster quality using internal and external measurements. An excellent clustering method always has higher intra-class similarity and lower inter-class similarity. This study uses the silhouette coefficient, Davies-Bouldin score, Calinski-Harabasz score, execution time, and speedup as internal metrics for cluster validation.15, 45 The silhouette coefficient, Calinski-Harabasz score, and speedup metrics are better when maximized, whereas the Davies-Bouldin score and execution time are better when minimized. A brief sketch of how these measures can be computed follows the list below.
  • The silhouette coefficient (SC) validates clustering performance through the pairwise difference of within-cluster (compactness) and between-cluster (separation) distances; it measures the similarity within a cluster. In the SC formula, a(x) is the average distance of x to all other data points in the same cluster C, and b(x) denotes the average distance of x to the data points in the other clusters $C_i$.
$SC = \sum_{x \in C_i} \frac{b(x) - a(x)}{\max[b(x), a(x)]}$  (10)
  • The Davies-Bouldin score (DB) helps to evaluate within-cluster dispersion and between-cluster similarity, without dependence on the number of clusters. In the DB formulation, k is the total number of clusters, $\lvert C_j \rvert$ is the total number of data points $x_i$ inside cluster $C_j$, and $C_i$ is another cluster.
$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} \left( \frac{\mathrm{within}_i + \mathrm{within}_j}{\mathrm{between}_{ij}} \right)$  (11)
$\mathrm{within}_j = \frac{1}{\lvert C_j \rvert} \sum_{i=1}^{\lvert C_j \rvert} \lVert x_i - C_j \rVert^2$  (12)
$\mathrm{between}_{ij} = \lVert C_i - C_j \rVert^2$  (13)
  • The Calinski-Harabasz score (CH) is based on the ratio of the between-cluster and within-cluster sums of squares; it measures cluster variance and is referred to as the variance ratio criterion. In the CH formulation, n is the total number of data points, k is the total number of clusters, x is a data point inside cluster $C_i$, m denotes the mean of the entire dataset, and $m_i$ is the mean of cluster $C_i$.
$CH = \frac{(n - k) \sum_{i=1}^{k} \lvert C_i \rvert (m_i - m)^T (m_i - m)}{(k - 1) \sum_{i=1}^{k} \sum_{x \in C_i} (x - m_i)^T (x - m_i)}$  (14)
  • Execution time (ET) computes the total execution time of an algorithm/model, obtained as the difference between the exit time EXT and the entry time ENT of the data mining algorithm.
$ET = EXT - ENT$  (15)
  • The speedup ratio (SR) estimates the ratio of the execution time of the conventional algorithm $T_{TCS}$ to that of the sampling-based algorithm $T_{SCS}$.
$SR = \frac{T_{TCS}}{T_{SCS}}$  (16)
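The sketch below shows one way these internal measures and the timing could be computed with scikit-learn, reusing the ssk_means sketch from the previous section and a small synthetic stand-in dataset; both are our assumptions, not the paper's experimental code.

```python
import time
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def internal_measures(data, labels):
    """SC, DB, and CH as implemented in scikit-learn."""
    return {"SC": silhouette_score(data, labels),          # Equation (10); higher is better
            "DB": davies_bouldin_score(data, labels),      # Equation (11); lower is better
            "CH": calinski_harabasz_score(data, labels)}   # Equation (14); higher is better

# Timing and speedup, Equations (15) and (16), on a synthetic stand-in dataset.
data = np.random.default_rng(0).random((5000, 3))
start = time.time()
points, labels, _ = ssk_means(data, k=3)          # from the sketch in the previous section
et_scs = time.time() - start                      # ET of the sampling-based run
print(internal_measures(points, labels))
# sr = et_tcs / et_scs   # et_tcs: ET of conventional K-Means, measured the same way
```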

Algorithms used for big data clustering

This study compares the performance of the proposed stratified random sampling-based K-Means algorithm (SSK-Means) against the random sampling-based K-Means (RSK-Means)25 and traditional K-Means19 algorithms on a single machine, based on the clustering objective.

Results

The reported results of the internal measures are shown in Tables 2-6, based on the average of 10 trials. Optimal values for each investigation are marked in boldface. The number of clusters is set to three for all experiments. Tables 2-4 present the comparative examination of SC, DB, and CH values on the experimental datasets using the RSK-Means and SSK-Means algorithms, respectively. Table 2 illustrates that the proposed SSK-Means algorithm attained better cluster compactness and separation than the RSK-Means algorithm by maximizing the SC value. The SC value of SSK-Means is better than that of RSK-Means except for the 10% sample size on the skin dataset, and the SC value of SSK-Means is better than that of RSK-Means for all sample sizes on the poker dataset.

TABLE 2. Comparison of SC values (mean ± std) of RSK-Means and SSK-Means
Dataset Sample size RSK-Means SSK-Means
Skin segmentation 5% 0.51524 ± 0.0015 0.51566 ± 0.00152
10% 0.51567 ± 0.0014 0.51482 ± 0.00130
15% 0.51468 ± 0.00145 0.51528 ± 0.0012
20% 0.51483 ± 0.00145 0.51501 ± 0.00136
Poker 5% 0.11876 ± 0.00269 0.11934 ± 0.00151
10% 0.11848 ± 0.00356 0.11895 ± 0.00235
15% 0.11892 ± 0.00261 0.11973 ± 0.00197
20% 0.11900 ± 0.00263 0.120020 ± 0.0020
  • Note: Optimal values for each investigation are marked in boldface.
TABLE 3. Comparison of DB values (mean ± std) of RSK-Means and SSK-Means
Dataset Sample size RSK-Means SSK-Means
Skin segmentation 5% 0.81746 ± 0.00969 0.81737 ± 0.00997
10% 0.81703 ± 0.00649 0.81634 ± 0.00774
15% 0.81684 ± 0.00771 0.81646 ± 0.00654
20% 0.81956 ± 0.00449 0.81584 ± 0.00744
Poker 5% 2.14451 ± 0.0418 2.13567 ± 0.01555
10% 2.15167 ± 0.03592 2.14005 ± 0.01956
15% 2.14843 ± 0.03371 2.13856 ± 0.02105
20% 2.14216 ± 0.03183 2.12686 ± 0.01957
  • Note: Optimal values for each investigation are marked in boldface.
TABLE 4. Comparison of CH values (mean ± std) of RSK-Means and SSK-Means
Dataset Sample size RSK-Means SSK-Means
Skin segmentation 5% 305228.77 ± 129.69 305245.70 ± 107.42
10% 305262.29 ± 71.75 305282.88 ± 87.54
15% 305288.51 ± 49.36 305279.05 ± 79.75
20% 305253.11 ± 76.16 305273.44 ± 88.24
Poker 5% 112017.24 ± 3764.91 112870.93 ± 2004.81
10% 111403.47 ± 3336.27 112326.03 ± 2308.14
15% 111794.62 ± 3351.86 112290.39 ± 2201.60
20% 112221.03 ± 2813.21 113468.85 ± 1909.35
  • Note: Optimal values for each investigation are marked in boldface.
TABLE 5. Comparison of SC, DB, and CH values (mean ± std) of K-Means, RSK-Means, and SSK-Means
Dataset Evaluation criteria K-Means RSK-Means SSK-Means
Skin segmentation SC 0.5123 ± 0.00291 0.51511 ± 0.00147 0.51519 ± 0.00134
DB 0.81712 ± 0.00583 0.81772 ± 0.00656 0.8165 ± 0.00645
CH 305269.71 ± 59.38 305258.17 ± 86.25 305270.27 ± 88.96
Poker SC 0.1167 ± 0.00402 0.11879 ± 0.00279 0.11951 ± 0.00195
DB 20.19172 ± 0.06467 2.14669 ± 0.03208 2.13528 ± 0.01895
CH 107753.84 ± 4693.99 111859.093 ± 3217.61 112739.056 ± 2086.43
  • Note: Optimal values for each investigation are marked in boldface.
TABLE 6. Comparison of ET values (mean ± std) of RSK-Means and SSK-Means
Dataset Sample size RSK-Means SSK-Means
Skin segmentation 5% 104.6961 ± 34.7258 72.0375 ± 18.3354
10% 69.3591 ± 25.4227 87.3679 ± 17.5049
15% 208.2860 ± 73.2868 70.5661 ± 15.3212
20% 219.1936 ± 105.7125 139.8222 ± 39.0222
Poker 5% 168.71725 ± 35.4200 162.05747 ± 25.58619
10% 233.82127 ± 79.3378 211.25498 ± 37.78515
15% 316.57931 ± 94.7075 235.40716 ± 54.46616
20% 413.60596 ± 75.4575 397.17432 ± 113.9999
  • Note: Optimal values for each investigation are marked in boldface.

Table 3 reveals that the proposed SSK-Means algorithm obtained better within- and between-cluster distances than the RSK-Means algorithm through minimization of the DB value. The DB value on the skin dataset is close to zero because the dataset uses the exact number of clusters; the DB value on the poker dataset is higher than one because that dataset requires more clusters. The DB value of SSK-Means is better than that of RSK-Means on both datasets. Table 4 shows that the proposed algorithm achieved an excellent variance ratio between the within- and between-cluster sums of squares by maximizing the CH value. On the CH score, SSK-Means obtained better values than RSK-Means on the skin dataset except for the 15% sample size, and better CH values than RSK-Means on the poker dataset for all sample sizes.

Table 5 presents the SC, DB, and CH examinations of K-Means, RSK-Means, and SSK-Means using the average results over the selected sample sizes, where SSK-Means obtained better SC, DB, and CH results than K-Means and RSK-Means on both datasets. This demonstrates that SSK-Means achieves better homogeneity, compactness, separation, and similarity than the K-Means and RSK-Means algorithms.

The execution time and speedup ratio validate the algorithm efficiency. Table 6 shows the execution time of the RSK-Means and SSK-Means algorithms for the selected sample sizes, and their speedup ratio relative to the K-Means execution time is shown in Figure 3. The execution time of SSK-Means is better than that of RSK-Means for each sample size. The speedup of SSK-Means shows higher scalability and speed than RSK-Means for each sample size except for the 10% sample size on the skin dataset. The 15% sample size gives the highest scalability and speed for SSK-Means compared with the other sample sizes on both datasets.

Figure 3. Speedup ratio (SR) of RSK-Means and SSK-Means using K-Means execution time

The average execution times of the K-Means, RSK-Means, and SSK-Means algorithms are shown in Figure 4, which indicates that SSK-Means consumes less execution time than K-Means and RSK-Means. The proposed SS-based algorithm reduces the execution time without affecting the quality of the conventional and random sampling-based algorithms. SSK-Means reduces the average execution time by 58% and 31% with respect to RSK-Means on the skin and poker datasets, respectively.

Figure 4. Execution time of K-Means, RSK-Means, and SSK-Means

The SSK-Means algorithm uses fewer iterations during clustering, which is why it reduces the execution time, computation cost, and resource consumption and improves the convergence speed of K-Means. The reported results of Table 5 and the average execution times of Figure 4 indicate that SSK-Means achieves better effectiveness and efficiency than the RSK-Means algorithm and is comparable to K-Means. This indicates that the proposed algorithm is readily scalable and robust for big data clustering.

CONCLUSION

This study has suggested two strategies to overcome the shortcomings of sample-based clustering algorithms in terms of computation time, resource utilization, and quality. The first is the EDS method, which is used for stratum creation by dividing the original data into different strata according to the required number of clusters. The second is a stratified random sampling-based data mining model for big data (the SS-based data mining algorithm, abbreviated SSK-Means). The SSK-Means algorithm has been compared with the random sampling-based K-Means (RSK-Means) using the SC, DB, CH, ET, and SR validation measures. Data objects for the 5%, 10%, 15%, and 20% sample sizes were chosen by EDS, sample allocation, and stratified sampling. The K-Means partial clustering results of the 5%, 10%, 15%, and 20% sample data were obtained and merged with the 95%, 90%, 85%, and 80% un-sample data, respectively, by sample extension to produce the final results. SSK-Means obtained a higher speedup ratio at the 15% sample size than at the other sample sizes and than RSK-Means. Experimental studies on real datasets show that the proposed algorithm has superior clustering performance to the RSK-Means and conventional K-Means algorithms in terms of efficiency and effectiveness. SSK-Means achieved better computing time, cluster quality, and efficiency at the 15% and 20% sample sizes. Sampling strategies significantly reduce the clustering time on large-scale data, but they may decrease the cluster quality when the selected sample size is too small. The sample selection, sample allocation, clustering algorithm, and sample extension all depend on the stratification process; therefore, better stratification determines the effectiveness and efficiency of the clustering algorithm. The further scope of this study is to address the stratification process, sample size selection, and sample allocation concerns for better cluster effectiveness, and to validate the model with other internal and external measures in single or multiple machine environments.

CONFLICT OF INTEREST

The authors declare that they have no conflict of interest.

Biographies


    Kamlesh Kumar Pandey is pursuing a Ph.D. at Dr. Hari Singh Gour Vishwavidyalaya (A Central University), Sagar, India, under the supervision of Prof. Diwakar Shukla. He is currently doing research on the design of big data mining algorithms with respect to clustering. He is the author and co-author of several research articles in international journals and conferences such as IEEE, Springer, and others. He has 8 years of teaching and research experience. He was awarded Training of Young Scientist in the 34th and 35th M.P. Young Scientist Congress.


    Diwakar Shukla is presently working as Head of the Department of Computer Science and Applications, Dr. Hari Singh Gour Vishwavidyalaya, Sagar, India, and has over 25 years of teaching and research experience. He obtained M.Sc. (Statistics) and Ph.D. (Statistics) degrees from Banaras Hindu University, Varanasi, served Devi Ahilya University, Indore, M.P. as a permanent Lecturer from 1989 for 9 years, and obtained an M.Tech. (Computer Science) degree from there. He joined Dr. Hari Singh Gour Vishwavidyalaya, Sagar as a Reader in Statistics in 1998. During his Ph.D. at BHU, he was a junior and senior research fellow of CSIR, New Delhi, through the Fellowship Examination (NET) of 1983. He has published more than 75 research articles in national and international journals and participated in more than 35 seminars/conferences at the national level. He also worked as a Professor at Lucknow University, Lucknow, U.P., for one year (June 2007 to 2008) and visited Sydney (Australia) and Shanghai (China) for conference participation and paper presentation. He has supervised 14 Ph.D. theses in Statistics and Computer Science, and seven students are presently enrolled for their doctoral degrees under his supervision. He is the author of two books and a member of 11 learned bodies of Statistics and Computer Science at the national level. His research areas are Sampling Theory, Graph Theory, Stochastic Modeling, Data Mining, Big Data, Operations Research, Computer Networks, and Operating Systems.
