Volume 18, Issue 4 e70032
RAPID PUBLICATION
Open Access

Computational Improvements to the Kernel k-Means Clustering Algorithm

Joshua D. Berlinski
Department of Statistics, Iowa State University, Ames, Iowa, USA

Ranjan Maitra (Corresponding Author)
Department of Statistics, Iowa State University, Ames, Iowa, USA

Correspondence: Ranjan Maitra ([email protected])
First published: 09 July 2025

Funding: This work was supported by the National Institute of Food and Agriculture.

ABSTRACT

Kernel k-means (kk-means) extends the standard k-means clustering method to identify generally shaped clusters by employing the k-means algorithm in a higher-dimensional space. Current implementations of kk-means are rather naive. We present simplifications that reduce calculations, performing only those that are absolutely necessary, thereby improving the overall algorithm's run time. We also present a sampling-based clustering and classification strategy for employing kk-means clustering on large datasets. Results on 13 datasets illustrate the computational benefits of our new algorithm. Our sampling-based strategy is also applied to investigate the use of kk-means in the color quantization of images in human perception-based colorspaces.

Abbreviations

  • AAC
  • alternating assignment and centering
  • ARI
  • adjusted Rand index
  • kk-means
  • kernel k-means
  • MATR
  • MAx-TRace
  • MKNN
  • mean k-nearest neighbors
  • OT
  • optimal transfer
  • OTQT
  • optimal transfer and quick transfer
  • QT
  • quick transfer
  • SCAC
  • sampling-based clustering and classification
  • SDSS
  • Sloan Digital Sky Survey
  • SRSWOR
  • simple random sampling without replacement
  • VIF
  • visual information fidelity
  • WSS
  • within-sums-of-squares
    1 Introduction

    Cluster analysis [1-4] partitions observations into homogeneous groups without label information. There are many clustering applications [5-9] and methods [10-16], of which the most commonly used is the model-agnostic k-means algorithm [17-19], which partitions observations into a given number of groups by locally minimizing the WSS. By design, the preference here is for homogeneous spherical clusters, even though attempts [20-23] have been made to address this limitation. Alternatively, the kernel k-means (kk-means) algorithm [24, 25] performs k-means-type operations after nonlinearly mapping the data into a higher-dimensional space, but is slow on larger datasets.

    This paper proposes three sets of computational improvements to the kk-means algorithm. We recognize that current kk-means implementations adapt [17]'s k-means algorithm which, upon starting with initial cluster centers in the transformed space, assigns each transformed observation to the cluster with the closest center, and then updates the centers at the conclusion of the cycle. Our first improvement borrows ideas from [26]'s k-means algorithm and simply updates cluster centers with every observation reassignment. Our second improvement forms the crux of our contributions: it simplifies and reduces computations by utilizing an "optimal transfer" (OT) stage and a more heuristic "quick transfer" (QT) stage, as also done [27, 28] for the k-means and k-modes [12] algorithms. Indeed, the OT steps are proved never to worsen solutions. Finally, we explore a computationally feasible clustering and classification framework for kk-means on large datasets.

    This paper has four more sections. Section 2 details the methodological improvements mentioned above. Section 3 evaluates performance on some common synthetic and real datasets. Section 4 employs the modification for large datasets to study the use of kk-means for image color quantization in human perception-based colorspaces. This article concludes with a discussion. A supplement, with sections, tables, and figures prefixed by "S", is available, as is an R [29] package, kkmeans, that implements our methods.

    2 Methodology

    2.1 Background and Preliminaries

    Consider a dataset of p-variate observations X = {x_1, x_2, ..., x_n}. The usual k-means setting partitions X into K disjoint sets C_1, C_2, ..., C_K by minimizing the WSS \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \bar{x}_k\|^2, where \bar{x}_k is the mean of the observations in C_k. Kernel methods generalize this framework.

    Let \Phi : \mathbb{R}^p \to \mathcal{H} be a nonlinear, continuous map into a feature space \mathcal{H} of arbitrary dimension. Then, letting \bar{\Phi}_k = n_k^{-1} \sum_{x_j \in C_k} \Phi(x_j) with n_k = |C_k|, the objective function for kk-means is

        \sum_{k=1}^{K} \sum_{x_i \in C_k} \| \Phi(x_i) - \bar{\Phi}_k \|^2.   (1)

    Let \langle \cdot, \cdot \rangle denote the inner product in \mathcal{H}. Then

        \| \Phi(x_i) - \bar{\Phi}_k \|^2 = \langle \Phi(x_i), \Phi(x_i) \rangle - \frac{2}{n_k} \sum_{x_j \in C_k} \langle \Phi(x_i), \Phi(x_j) \rangle + \frac{1}{n_k^2} \sum_{x_j, x_l \in C_k} \langle \Phi(x_j), \Phi(x_l) \rangle.

    So, given \Phi, (1) depends only on inner products computed in \mathcal{H}. A kernel function satisfying Mercer's condition [30] can ensure that both \mathcal{H} and the inner product exist, without explicitly using \Phi [31]. Also, unlike [20], (1) ignores the transformation \Phi, allowing it to be complex and infinite-dimensional. Let \mathcal{K} be a symmetric kernel function satisfying Mercer's condition such that \mathcal{K}(x, y) = \langle \Phi(x), \Phi(y) \rangle. The norm in the objective function (1) can then be written as

        \| \Phi(x_i) - \bar{\Phi}_k \|^2 = \mathcal{K}(x_i, x_i) - \frac{2}{n_k} \sum_{x_j \in C_k} \mathcal{K}(x_i, x_j) + \frac{1}{n_k^2} \sum_{x_j, x_l \in C_k} \mathcal{K}(x_j, x_l).   (2)
    Among popular kernels [3] are the Gaussian kernel with bandwidth parameter \sigma, given by

        \mathcal{K}(x, y) = \exp\{-\|x - y\|^2 / (2\sigma^2)\},

    the Laplacian kernel with parameter \sigma, given by

        \mathcal{K}(x, y) = \exp\{-\|x - y\| / \sigma\},

    the polynomial kernel with offset c and degree d, which has the form

        \mathcal{K}(x, y) = (x^\top y + c)^d,

    and the sigmoid kernel with parameters \gamma and c, given by

        \mathcal{K}(x, y) = \tanh(\gamma \, x^\top y + c).

    These kernels have parameters that must be specified or estimated, as discussed in Section 2.2.3.3. This paper uses the Gaussian kernel, but our methods apply to other kernels as well.
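
    To make these choices concrete, the short R sketch below evaluates the four kernels for a pair of observations. The parameterizations shown (for example, the 1/(2σ²) factor in the Gaussian exponent and the Euclidean norm in the Laplacian) are common conventions and should be read as assumptions rather than the exact forms used in our implementation.

      # Sketch of the four kernels for numeric vectors x and y. Parameterizations
      # are assumed conventions, not necessarily those of the kkmeans package.
      gaussian_kernel   <- function(x, y, sigma)         exp(-sum((x - y)^2) / (2 * sigma^2))
      laplacian_kernel  <- function(x, y, sigma)         exp(-sqrt(sum((x - y)^2)) / sigma)
      polynomial_kernel <- function(x, y, offset, deg)   (sum(x * y) + offset)^deg
      sigmoid_kernel    <- function(x, y, gamma, offset) tanh(gamma * sum(x * y) + offset)

      # Example: Gram matrix of a small dataset under the Gaussian kernel.
      X <- matrix(rnorm(20), nrow = 10, ncol = 2)
      G <- outer(seq_len(nrow(X)), seq_len(nrow(X)),
                 Vectorize(function(i, j) gaussian_kernel(X[i, ], X[j, ], sigma = 1)))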

    Current implementations of kk-means simply mimic [17]'s k-means algorithm, which naively assigns each observation to its closest current centroid and updates the centroids only at the end of all the assignments. We suggest improvements to this approach in the remainder of this section.
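
    For reference, the following R sketch implements this naive batch algorithm directly from (2), using only a precomputed Gram matrix G; the function name and interface are ours for illustration and do not correspond to the kkmeans package.

      # Naive (batch) kernel k-means sketch using a precomputed Gram matrix G.
      # All observations are compared against the current clusters, and the
      # memberships are updated only at the end of each full pass.
      naive_kkmeans <- function(G, K, max_iter = 100) {
        n  <- nrow(G)
        cl <- sample(rep_len(seq_len(K), n))       # random balanced start
        for (iter in seq_len(max_iter)) {
          # Distance of observation i to cluster k via Equation (2); the
          # K(x_i, x_i) term is constant across clusters and can be dropped.
          d <- sapply(seq_len(K), function(k) {
            idx <- which(cl == k)
            -2 * rowSums(G[, idx, drop = FALSE]) / length(idx) +
              sum(G[idx, idx]) / length(idx)^2
          })
          new_cl <- max.col(-d)                    # closest cluster per row
          if (all(new_cl == cl) || any(tabulate(new_cl, nbins = K) == 0L)) break
          cl <- new_cl                             # simplistic guard against empty clusters above
        }
        cl
      }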

    2.2 Efficient kk-Means Algorithms

    2.2.1 Alternating Assignment and Centering (AAC)

    Our first improvement updates the cluster centers with every reassignment of an observation. Indeed, [26]'s k-means algorithm fits this AAC framework, and we shall see that this seemingly simple fix improves both the accuracy and the running time of the kk-means algorithm.

    2.2.2 Optimal Transfer and Quick Transfer (OTQT)

    Our second improvement classifies iterations into optimal transfer (OT) and quick transfer (QT) steps, similar to [27, 28] or [20]'s suggestions for different k-means scenarios, or [12]'s k-modes algorithm for clustering categorical datasets. Our OTQT algorithm has the following steps (a simplified code sketch follows the list):
    1. For each observation, find and store its closest and second-closest cluster centers, and assign the observation to its closest cluster.
    2. Initially, all clusters are in the live set.
    3. Optimal transfer. Consider each observation and its current cluster in turn. A cluster belongs to the live set only if it was updated in the last QT stage or in any of the OT steps of the most recent pass through the data.

      a. If the observation's current cluster, C_k say, is in the live set, find, over all other clusters, the minimum of the size-adjusted transfer criterion

        \min_{l \neq k} \; \frac{n_l}{n_l + 1} \, \| \Phi(x_i) - \bar{\Phi}_l \|^2.   (3)

        If this minimum is no smaller than \frac{n_k}{n_k - 1} \| \Phi(x_i) - \bar{\Phi}_k \|^2, no reallocation is necessary. Otherwise, transfer the observation to the minimizing cluster and update its stored closest and second-closest clusters. Both the donor and the recipient clusters are now in the live set.

      b. If the observation's current cluster is not in the live set, proceed as in Step 3a but with (3) minimized only over clusters in the live set.

    4. Stop if the live set is empty. Otherwise, continue to Step 5.
    5. Quick transfer. For each observation, consider only its stored closest and second-closest clusters. If neither has been updated since the observation was last visited, skip the observation. Otherwise, transfer the observation to its second-closest cluster, swapping the stored closest and second-closest clusters, if doing so reduces the objective function; both clusters involved are then marked as updated.
    6. If no transfers occurred in the last pass of QT steps, go to Step 3; otherwise, go to Step 5.
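
    The following simplified R sketch conveys the control flow of a transfer pass. It recomputes kernel distances from the Gram matrix rather than caching them, reassigns an observation whenever another center is strictly closer (instead of using the size-adjusted criterion in (3)), and omits the live-set bookkeeping, so it should be read as an illustration rather than as the kkmeans implementation.

      # Distance from observation i to the center of cluster k via Equation (2);
      # the K(x_i, x_i) term is kept so distances are comparable across clusters.
      kernel_dist <- function(G, cl, i, k) {
        idx <- which(cl == k)
        G[i, i] - 2 * sum(G[i, idx]) / length(idx) + sum(G[idx, idx]) / length(idx)^2
      }

      # One transfer pass: each observation is moved to its closest cluster as
      # soon as a closer one is found, so later observations see the update.
      transfer_pass <- function(G, cl, K) {
        changed <- FALSE
        for (i in seq_len(nrow(G))) {
          d    <- vapply(seq_len(K), function(k) kernel_dist(G, cl, i, k), numeric(1))
          best <- which.min(d)
          if (best != cl[i] && sum(cl == cl[i]) > 1L) {  # keep the donor cluster non-empty
            cl[i]   <- best
            changed <- TRUE
          }
        }
        list(cl = cl, changed = changed)
      }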
    Theoretical benefits: We now show that a kk-means algorithm based only on the OT steps can yield performance gains over the naive method. We state and prove the following claim.

    Claim 1. Let \bar{\Phi}_k be the center of C_k in feature space after initialization. The kk-means algorithm using only OT steps, applied to distinct observations with distinct initial centers, has the following properties:

    1. Each pass through the dataset never increases the objective function (1).
    2. The resulting partition has no empty clusters.
    3. Each pass through the dataset keeps the cluster centers distinct.
    Proof. We prove each statement in turn.
    1. If no observation can reduce the objective function by transferring to a different cluster, the algorithm terminates. Additionally, an observation is only transferred to a different cluster if the objective function decreases, proving Part 1.
    2. Suppose a cluster is empty at the beginning of the OT stage. Then reassigning any observation to it reduces the objective function, and so the empty cluster will pull in at least one observation. Alternatively, suppose a cluster is emptied during the OT stage. Then the algorithm will iterate again, because an observation was transferred, and a point will again be assigned to the empty cluster as above. So, Part 2 holds.
    3. Suppose that, upon entering the OT stage, two cluster centers coincide. Since the observations are distinct, at least one of the two clusters contains an observation that does not lie at the common center. For such an observation, the distances to the two coinciding centers are equal, while the size adjustment in (3) strictly favors the transfer, so the reallocation condition in the OT algorithm is met and the observation will be transferred, separating the centers.□

    The OT steps can be implemented in parallel to further speed up computations. Further, we find that the QT steps substantially improve performance in our experiments, but as with k-means [32] or k-modes [12], these gains have not been possible to prove theoretically.

    2.2.3 Additional Implementation Aspects

    We address a few issues that arise while implementing our algorithms.

    2.2.3.1 Initialization

    As with k-means [33], initialization of kk-means algorithms impacts clustering performance. A simple approach is to use SRSWOR to choose the initial centers from among the observations in feature space. This method was not outperformed by our kk-means++ approach (Section S1.1), which extends [34]'s k-means++ algorithm. Our initializer is stochastic, so we run our algorithm from many initializations and choose the result with the lowest terminating value of (1).

    2.2.3.2 Storing Distances

    A large part of the OTQT algorithm's efficiency comes from storing the distance of each observation to each cluster center and updating only the affected distances whenever an observation is reassigned. While this is straightforward with k-means, it is more challenging in the kk-means context because the cluster centers \bar{\Phi}_k are not available explicitly, since \Phi is unknown. However, (2) suggests that the distance to a cluster center can still be updated without knowing the center itself. To see this, note that the last term in (2), n_k^{-2} \sum_{x_j, x_l \in C_k} \mathcal{K}(x_j, x_l), is the same for every observation compared against cluster C_k. So these K values, one per cluster, can be stored and updated as observations are added to and removed from each cluster. To update such a value, let S_k = \sum_{x_j, x_l \in C_k} \mathcal{K}(x_j, x_l). When an observation x is added to C_k, the sum becomes

        S_k + 2 \sum_{x_j \in C_k} \mathcal{K}(x, x_j) + \mathcal{K}(x, x),

    and the cluster size becomes n_k + 1. Similarly, when x is removed from C_k, the sum becomes

        S_k - 2 \sum_{x_j \in C_k, \, x_j \neq x} \mathcal{K}(x, x_j) - \mathcal{K}(x, x),

    and the cluster size becomes n_k - 1. So, if an observation is transferred from one cluster to another, only the stored sums and sizes of the donor and recipient clusters need to be updated, each requiring a single pass over the corresponding cluster, after which the constant terms follow as S_k / n_k^2.

    For the Gaussian kernel, \mathcal{K}(x, x) is constant and can be ignored in these calculations. Additionally, the distances of each observation to its closest and second-closest cluster centers can be stored. If the observation has been transferred, or if either of these two clusters has changed since the observation was last visited, these values must be recomputed. This recomputation requires evaluating the kernel against every member of the affected clusters and can be expensive, so if computer memory permits, it is better to store the Gram matrix for the dataset and thus reduce the number of kernel evaluations. Such an approach can also be adopted for our AAC algorithm, since the centers are updated whenever an observation moves to a new cluster.
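
    A minimal R sketch of this bookkeeping is given below. It maintains, for each cluster, the size and the double sum of kernel values over its members, and touches only the donor and recipient clusters when an observation moves; the state layout and function name are ours for illustration.

      # Move observation i from cluster `from` to cluster `to`, updating the
      # cached sizes n[k] and double sums S[k] = sum_{j,l in C_k} G[j, l], so
      # the constant term in (2), S[k] / n[k]^2, is never recomputed from scratch.
      move_obs <- function(i, from, to, cl, n, S, G) {
        donors     <- setdiff(which(cl == from), i)   # members of `from` other than i
        recipients <- which(cl == to)
        S[from] <- S[from] - 2 * sum(G[i, donors]) - G[i, i]
        S[to]   <- S[to]   + 2 * sum(G[i, recipients]) + G[i, i]
        n[from] <- n[from] - 1L
        n[to]   <- n[to]   + 1L
        cl[i]   <- to
        list(cl = cl, n = n, S = S)                   # per-cluster constants: S / n^2
      }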

    2.2.3.3 Parameter Selection for the Gaussian Kernel

    The Gaussian kernel needs a choice for the bandwidth parameter σ. Figure 1 illustrates the effect of σ on kk-means clustering of the Bullseye dataset [21, 22]. We see that smaller values of σ produce fragmented groups in Euclidean space, while larger values produce more compact clusters. We now discuss some methods for choosing σ. Luxburg [23] suggested estimating σ using the mean squared distance of each observation to its k nearest neighbors (MKNN). Ordozgoiti [35] provided a hierarchical search method that requires storing the full Gram matrix and is impractical for larger datasets. The MAx-TRace (MATR) algorithm [36] performs a grid search over possible values of σ. The algorithm needs a similarity matrix of the data, for which we use the Gaussian kernel itself, with σ preliminarily estimated by MKNN. Section S1.2 shows our MATR implementation to outperform MKNN. The method also extends to non-Gaussian kernels, and so we use it henceforth.
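
    One plausible implementation of the MKNN estimate is sketched below: the mean squared Euclidean distance of each observation to its k nearest neighbors is averaged over the dataset and converted to a bandwidth. The final averaging and square root are assumptions about the exact form, so the result should be treated only as a starting value, for example as the preliminary estimate fed to MATR.

      # MKNN-style bandwidth estimate: mean squared distance of each observation
      # to its k nearest neighbours, averaged over the dataset. The exact scaling
      # used in the paper may differ; this is a plausible default.
      mknn_bandwidth <- function(X, k = 5) {
        D2 <- as.matrix(dist(X))^2                 # squared Euclidean distances
        knn_means <- apply(D2, 1, function(row) mean(sort(row)[2:(k + 1)]))  # drop self
        sqrt(mean(knn_means))                      # return on the scale of sigma
      }

      sigma_hat <- mknn_bandwidth(iris[, 1:4], k = 5)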

    Figure 1. Effect of σ on kk-means clustering of Bullseye.

    2.3 The kk-Means Algorithm for Large Datasets

    Even with our improvements, our kk-means algorithms can still be slow for large datasets. This is because, beyond the in-place calculation of the Gram matrix needed when n is large, the kernel function has to be computed for each observation against each cluster, even before any feature-space calculations. We propose a sampling-based clustering-and-classification (SCAC) strategy for large datasets. Specifically, we randomly split the original data into a subset used for clustering and the remainder used for classification. We cluster the first subset using kk-means to obtain a partition into K clusters. The remaining observations are then assigned to their closest cluster according to (2). For illustration, Figure 2 displays the use of our SCAC strategy on the bullseye dataset. While our kk-means algorithms can easily be applied to this dataset in full (as is done in Section 3), our illustration, which uses only 20% of the dataset for clustering and the remaining 80% for classification, performs quite well and almost at par with the clustering of the entire dataset (as seen a little later in Section 3).
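
    The classification step of SCAC amounts to evaluating (2) between each held-out observation and each cluster found in the sampled subset. A sketch is given below, with kfun standing for an arbitrary kernel function (such as the Gaussian kernel defined earlier) and the clustering of the subset assumed to have been obtained already, for example with one of the algorithms above.

      # SCAC classification step: assign each held-out observation to its closest
      # cluster, where clusters were found on the sampled subset Xs with labels cls.
      scac_classify <- function(X_holdout, Xs, cls, kfun) {
        K <- max(cls)
        apply(X_holdout, 1, function(x) {
          d <- vapply(seq_len(K), function(k) {
            idx <- which(cls == k)
            Kxc <- vapply(idx, function(j) kfun(x, Xs[j, ]), numeric(1))
            Kcc <- outer(idx, idx, Vectorize(function(a, b) kfun(Xs[a, ], Xs[b, ])))
            kfun(x, x) - 2 * mean(Kxc) + mean(Kcc)   # Equation (2)
          }, numeric(1))
          which.min(d)
        })
      }

    In practice, the within-cluster double sums would be computed once per cluster rather than once per held-out observation.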

    Figure 2. (Left) kk-means clustering of a sample comprising 20% of the Bullseye data. (Right) Using the clustering results to classify the remaining observations, and thence the full dataset.

    The proportion of the dataset included in the clustering subset can impact the variability of the clustering performance. When the dataset is large, this proportion is necessarily small for computational practicality, and there is far more variability in the clustering. To alleviate the effects of this variation, we propose running the partitioning algorithm on several different sampled subsets, which can be done in parallel, and choosing the result that minimizes (1) on the entire dataset after classifying the held-out observations. We study these suggestions in Section 3.4.
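
    Putting the pieces together, a sketch of SCAC with repeated samples follows; cluster_subset() is a placeholder for whichever kk-means routine is run on the sampled subset, scac_classify() is the sketch above, and the winning replicate is the one with the smallest value of (1) on the full dataset.

      # Run SCAC on B independent subsamples of a data matrix X and keep the
      # replicate with the lowest value of the objective (1) on the full dataset.
      # `cluster_subset(Xs, K)` is a placeholder for a kk-means run on the subset.
      scac <- function(X, K, prop = 0.2, B = 5, kfun, cluster_subset) {
        n <- nrow(X)
        best <- NULL
        for (b in seq_len(B)) {
          idx  <- sample.int(n, size = ceiling(prop * n))     # SRSWOR clustering subset
          cls  <- cluster_subset(X[idx, , drop = FALSE], K)   # cluster the subset
          full <- integer(n)
          full[idx]  <- cls
          full[-idx] <- scac_classify(X[-idx, , drop = FALSE],
                                      X[idx, , drop = FALSE], cls, kfun)
          obj <- 0                                            # objective (1) on all of X
          for (k in seq_len(K)) {
            members <- which(full == k)
            if (length(members) == 0L) next
            Gk  <- outer(members, members, Vectorize(function(a, b) kfun(X[a, ], X[b, ])))
            obj <- obj + sum(diag(Gk)) - sum(Gk) / length(members)
          }
          if (is.null(best) || obj < best$obj) best <- list(labels = full, obj = obj)
        }
        best
      }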

    3 Performance Evaluations

    We evaluated our improvements to kk-means on six synthetic 2D datasets from [37], shown in Figure 3, the subset of the Sloan Digital Sky Survey (SDSS) dataset in [38], and six real-world classification datasets from the UCI Machine Learning Repository [39], with details in Table 1.

    Figure 3. Partitions of our 2D datasets obtained using the kk-means OTQT algorithm. Numerals represent the true group for each observation, while color denotes the kk-means group.
    TABLE 1. Results of kk-means partitions obtained after minimizing (1) over the initializations for each method and dataset. K, n, and p denote the number of groups, observations, and variables in each dataset; the last three columns report the adjusted Rand index of each method's partition against the true grouping.

    Data set        K     n     p    Naive    AAC    OTQT
    spherical 7 500 2 1.000 1.000 1.000
    banana-clump 2 200 2 0.980 0.980 0.980
    bullseye 2 400 2 0.710 0.710 0.710
    pathbased 3 300 2 0.939 0.939 0.939
    bananas-arcs 4 4515 2 0.966 0.966 0.966
    SSS 3 5015 2 0.685 0.685 0.685
    SDSS 2 1507 5 0.286 0.286 0.286
    anuran 10 7195 22 0.407 0.408 0.408
    glass 7 214 9 0.268 0.206 0.217
    libras 15 360 90 0.302 0.314 0.322
    wine 3 178 13 0.416 0.416 0.416
    segmentation 7 2310 19 0.436 0.437 0.436
    iris 3 150 4 0.744 0.744 0.744

    Our evaluations compared the naive approach with our AAC and OTQT approaches, all efficiently coded in C with R [29] interfaces in our kkmeans package. Clustering performance was assessed numerically by the adjusted Rand index (ARI) [40], which compares the grouping obtained by each algorithm with the true partition. The ARI takes values no greater than unity, with values closer to unity indicating better clustering performance.

    3.1 Performance of AAC and OTQT Algorithms

    3.1.1 Clustering Performance

    The clustering performance of all three methods is summarized numerically for all datasets in Table 1, and visually for the 2D datasets in Figure 3. For each of the datasets from [37] and for the SDSS data, all three methods arrived at the same final partition after the initializations. The naive, AAC, and OTQT methods each performed best on one UCI dataset, and all three yielded the same result on two others. On the remaining dataset, anuran, the AAC and OTQT methods yielded the same solution, which is marginally better than that of the naive method.

    3.1.2 Timing Results

    Figure 4 displays the change in time taken by AAC and OTQT, relative to the naive method, to arrive at the terminating solution for the datasets in Table 1, with all three methods started from the same initializations. Excepting the spherical and glass datasets, the naive method typically takes the longest time to terminate. In general, the AAC and OTQT methods take similar amounts of time, with AAC narrowly faster in slightly more cases. The naive method is fastest for datasets with fewer average observations per cluster, while OTQT is at the other end of the spectrum. Because computational overhead increases from the naive to the AAC to the OTQT algorithm, our results suggest using the algorithms in that order as the average number of observations per cluster increases. In essence, the AAC or OTQT calculations make sense only when their overhead can be justified, which happens when the average number of observations per cluster is larger.

    Figure 4. Increase in time taken by AAC and OTQT, relative to the naive method, to arrive at the terminating solution, with each method started from the same initial seeds for each dataset.

    3.2 Comparison of Different Kernel Functions

    Table 2 compares results obtained using the four kernels described in Section 2.1. We see that the Gaussian kernel generally performs well across the variety of datasets used in this evaluation. However, in some instances, the Laplace kernel is the best performer and, indeed, on the SDSS dataset it vastly outperforms the others. The polynomial and sigmoid kernels generally underperform relative to the other two, but there are some datasets where the sigmoid kernel performs well: indeed, it is a runaway winner on the SSS dataset.

    TABLE 2. Results of kk-means partitions after multiple initializations for each dataset and kernel function. The kernel parameters for each dataset were chosen manually to yield the highest adjusted Rand index.
    Adjusted Rand Index
    Data set Gaussian Laplace Polynomial Sigmoid
    spherical 1.000 0.996 1.000 0.898
    banana-clump 1.000 1.000 0.980 0.980
    bullseye 0.619 0.080 0.013 −0.002
    pathbased 0.930 0.929 0.425 0.401
    bananas-arcs 0.951 0.520 0.378 0.463
    SSS 0.673 0.741 0.749 0.972
    SDSS 0.893 0.985 0.051 0.353
    glass 0.268 0.219 0.229 0.021
    libras 0.315 0.274 0.289 0.272
    wine 0.416 0.416 0.382 0.171
    segmentation 0.436 0.502 0.413 0.346
    iris 0.758 0.731 0.640 0.530

    3.3 Computational Complexity Analysis

    We next empirically evaluated the effect of n and K on the time taken for each algorithm to converge. We systematically generated datasets via R's MixSim [41] package. In all, 400 datasets were generated for each combination of n and K, with the dimension p evenly split across these 400 datasets. Note that p does not affect the comparisons, as the Gram matrix for each dataset is computed and stored before clustering and therefore adds the same constant time for each method. All datasets were generated with a small amount of overlap among the clusters, in terms of the generalized overlap measure of [42]. As in the comparisons between algorithms in the previous sections, all three clustering algorithms began at the same initial values within each dataset.
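
    The data-generation step can be reproduced along the following lines. The MixSim() and simdataset() calls reflect our understanding of the MixSim package interface, and the specific BarOmega, K, p, and n values are illustrative placeholders rather than the settings used in our study; MixSim controls the average pairwise overlap, which serves here as a stand-in for the generalized overlap.

      library(MixSim)

      # Generate one synthetic dataset with a prescribed (small) overlap among
      # clusters. The parameter values are placeholders, not the study settings.
      set.seed(1)
      Q   <- MixSim(BarOmega = 0.01, K = 4, p = 2)   # mixture with small average overlap
      dat <- simdataset(n = 1000, Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
      X   <- dat$X                                   # n x p data matrix
      id  <- dat$id                                  # true cluster labels, for reference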

    Figure 5 displays the median time to converge for each parameter combination. In general, the AAC and OTQT methods outperform the naive method, with the improvement becoming more pronounced for larger datasets. When there are only two clusters, we see that the added overhead present in the OTQT method means that it takes longer than the AAC method. Since there are only two clusters, the QT step is unnecessary, and thus it is recommended to use the AAC method in these situations. When the number of clusters increases, however, the time efficiency of the OTQT method is realized. In particular, as the number of clusters increases, the OTQT method continues to converge faster than the AAC method, as the QT step is designed to reduce the total number of comparisons required in each pass through the dataset.

    Figure 5. Median time to converge for all simulated datasets under each parameter combination. Error bars indicate two standard deviations calculated using 1000 bootstrap samples.

    3.4 Evaluating the SCAC Strategy for Larger Datasets

    We evaluated SCAC on the datasets of Table 1, with kk-means clustering done using OTQT. We can perform kk-means clustering on the entire datasets here (as was done in Section 3.1), so our investigation here determines how SCAC performs relative to clustering the entire sample. Section S2 illustrates performance on the bananas-arcs dataset. Figure S1 shows performance of the SCAC strategy on one realization, with the clustering subset set at 5% of the data. We see that the first SCAC run happens to do quite well (ARI = 0.954), but the next five SCAC runs, with ARIs ranging from 0.606 to 0.964, indicate substantial variability in the clustering performance. We therefore evaluate the performance of SCAC over different sampling proportions and numbers of repeated samples. Figure 6 displays the results on each of the datasets. We see that as both the sampling proportion and the number of repeated samples increase, the resulting ARI more closely matches that from the partitioning obtained using the complete dataset, with decreasing variance. This effect depends on the dimension of the dataset, as datasets with higher dimension require a larger proportion of the original dataset to achieve better results. For many low-dimensional datasets, we see that even the smallest clustering subset yields results similar to clustering the entire dataset, especially with a large number of repeated samples. While the choice of the number of repeated samples will likely be dictated by the availability of computing resources, we recommend using more repeated samples for smaller sampling proportions to achieve better results. We apply SCAC in the next section.

    Figure 6. Results for various combinations of the sampling proportion and number of repeated samples used for clustering. The dashed blue line corresponds to the ARI obtained by kk-means clustering of the entire dataset.

    4 Color Quantization of Images

    Color quantization is used to represent digital images by means of a few colors without appreciably degrading their visual quality [19]. It is useful for storing and transmitting data. Traditionally, color quantization has been done in the RGB colorspace, where k-means has been a method of choice, but some authors [43] have used it in human perception-based colorspaces. Indeed, [44] showed in an expansive study that k-means color quantization is, in many cases, better done in these colorspaces. However, they also expressed some misgivings on the use of k-means in these colorspaces, given that these spaces have features that are very different from each other. We therefore investigate the use of kk-means for the color quantization of five different publicly available images.

    We performed color quantization using both the k-means and kk-means algorithms in the RGB, HCL, and CIEXYZ colorspaces, for several palette sizes, and with each variable scaled prior to clustering. For k-means, the entire set of colors was used, while for kk-means, we used our SCAC strategy.
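
    A sketch of the quantization pipeline in one colorspace follows, using grDevices::convertColor for the RGB-to-CIEXYZ conversion and, for brevity, stats::kmeans in place of the kk-means SCAC procedure; the png package, the file name, and the 64-color palette are assumptions made for the example.

      library(png)

      # Color-quantize a 3-channel RGB image in the CIEXYZ colorspace with a
      # 64-color palette. stats::kmeans stands in for the clustering step; the
      # kk-means version would instead apply the SCAC strategy of Section 2.3.
      img  <- readPNG("statlab.png")                       # h x w x 3 array in [0, 1]
      px   <- matrix(img, ncol = 3)                        # one row per pixel (R, G, B)
      xyz  <- convertColor(px, from = "sRGB", to = "XYZ")  # move to CIEXYZ
      xyz  <- scale(xyz)                                   # scale each variable, as in the text
      fit  <- kmeans(xyz, centers = 64, nstart = 5)
      # Map every pixel to its cluster center, undo the scaling, and return to sRGB.
      quant_xyz <- fit$centers[fit$cluster, ]
      quant_xyz <- sweep(sweep(quant_xyz, 2, attr(xyz, "scaled:scale"), "*"),
                         2, attr(xyz, "scaled:center"), "+")
      quant_rgb <- convertColor(quant_xyz, from = "XYZ", to = "sRGB")
      quant_img <- array(pmin(pmax(quant_rgb, 0), 1), dim = dim(img))
      writePNG(quant_img, "statlab_quantized.png")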

    Figure 7 illustrates performance on Statlab. We display only the results with 64 colors in the CIEXYZ space, which is the space where quantization performs best using both k-means and kk-means. While both algorithms perform fairly similarly, the 64 colors chosen to describe the image differ and emphasize different features. In particular, the kk-means quantized image has fewer colors in the background of the signpost, creating a less pixelated look and a clearer distinction from the text compared to the k-means result. Beyond visual displays, we also assessed the quality of the quantized image in faithfully representing the original in terms of the Visual Information Fidelity (VIF) criterion [45], which takes values in (0,1), with 1 representing perfect fidelity and values farther from unity representing increased degradation. VIF compares two images using natural scene statistics and reflects the perceptual quality difference between them; in particular, it is better suited for images of natural scenes. The VIF for the kk-means quantized image is slightly higher than that of the k-means quantized image.

    Figure 7. The (a) original, (b) k-means, and (c) kk-means color quantized Statlab images. The quantized images use 64 colors in the (best-performing) CIEXYZ color space.

    We also explored the use of kk-means in the different colorspaces on a few more images of different sizes, displayed in Figure 8, with the VIF results obtained using the two methods in the three colorspaces displayed in Figure 9 and Table S3. For smaller palettes, there is not much benefit to leaving the RGB colorspace, but for larger numbers of colors, CIEXYZ seems better for quantization. In all cases, HCL is the worst performer. Between k-means and kk-means, the honors are shared, so we now try to see whether we can explain performance in terms of the characteristics of the images.

    Figure 8. The digital images used in our color quantization assessment: (top left) IceBed, 1200 × 800 pixels; (bottom left) Bandhavgarh, 1848 × 1981 pixels; (top right) Pujatram, 1600 × 1066 pixels; and (bottom right) Ultadanga, 1600 × 1200 pixels. All image displays are to scale.
    Figure 9. VIFs of images quantized using k-means and kk-means in the three colorspaces.

    Figure 10 displays circular-linear boxplots of the distributions of the unique colors in our images in terms of hue (H), chroma (C), and luminance (L). Hue being an angular quantity, we use circular boxplots [46, 47] to display its distribution, while linear boxplots are used for the C and L bands. In general, we see that Bandhavgarh is the worst performer for kk-means. Its hue distribution is highly skewed in the clockwise direction, with high kurtosis but low IQR; the other two distributions are fairly symmetric, however. The best performer, IceBed, has relatively symmetric distributions for H and C, with L only slightly skewed toward smaller values. Pujatram and Statlab each perform best for two of the four palette sizes, but overall show relatively equal performance between the methods; indeed, their distributions are only slightly skewed (but in opposite directions), and they have similar-looking hue distributions, both skewed toward higher values. Lastly, Ultadanga, which is best suited for kk-means at the three larger palette sizes, has symmetric distributions for H and L and is only slightly skewed in C. In summary, it seems that kk-means performs best when the distributions of each of H, C, and L are symmetric, and poorly with increased circular skewness in H, but is more robust to skewness in the L band. Our initial findings here call for a deeper analysis that is beyond the scope of this paper.

    Figure 10. Distribution of our image colors in HCL space.

    5 Conclusions

    This paper delivers computational improvements to the kk-means algorithm using the AAC and OTQT approaches. As with the k-means [32] and k-modes [12] algorithms, we provide theoretical guarantees for the OT-only version, while the OTQT algorithm's generally good performance is demonstrated through experiments on 2D synthetic and higher-dimensional real-world classification datasets. We also use our approach to study the performance of kk-means in the color quantization of images in human perception-based colorspaces. An R package, kkmeans, implements the naive, AAC, and OTQT methods, as well as methods for estimating the bandwidth parameter of the Gaussian kernel using both MKNN and MATR.

    There are a number of issues that can benefit from further study. For instance, the problem of selecting an optimal number of clusters using kk-means remains unresolved. Also unresolved are penalization methods for (1) based on the complexity of the transformation, where the challenge is accentuated by the fact that the Jacobian is unclear, since \Phi is a potentially infinite-dimensional function of complicated form. Thus, while we have made important computational improvements to kk-means algorithms, there remain questions worthy of further attention.

    Author Contributions

    Joshua D. Berlinski wrote the R package, developed the methodology, proved the claim, and performed the computations. Ranjan Maitra conceptualized the problem and planned the framework, the experiments, and the applications. Both authors shared in the writing of the article.

    Acknowledgments

    We thank the Editor-in-Chief for comments on an earlier version that improved this article. All original digital images were converted to the Tagged Image File Format (TIFF). Statlab is a 2023 digital photograph of a plaque that stands outside the building housing Iowa State University's Department of Statistics. The plaque commemorates the 1933 establishment there of the first statistical laboratory in the United States of America. IceBed is thanks to Nima Sarikhani's dreamy photograph, named “Ice Bed”, of a young polar bear drifting to sleep on a bed carved into an iceberg, first published on February 7, 2024. Bandhavgarh is courtesy of Staffan Widstrand of the World Wildlife Foundation and shows a Royal Bengal tiger in Bandhavgarh National Park, India. Pujatram is a photograph, released by the transport department of the Government of West Bengal, India, of a specially decorated 2023 tram commemorating both the 150th anniversary of Calcutta Tramways and the then-recent award of UNESCO's intangible cultural heritage tag to Kolkata's (formerly, Calcutta) famed Durga Puja festival. Ultadanga is a photograph in December 2010 of the multimodal traffic chaos at the Ultadanga-VIP Road intersection in Kolkata, India. Statlab and Ultadanga are from the second author's personal photographs collection. The research of the second author was supported in part by the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Hatch project IOW03717. The content of this paper is however solely the responsibility of the authors and does not represent the official views of the USDA. Open access funding provided by the Iowa State University Library.

      Conflicts of Interest

      The authors declare no conflicts of interest.

      Data Availability Statement

      The data that support the findings of this study are publicly available and are also provided in the Supporting Information section of this article.
