Volume 18, Issue 4 e70032
RAPID PUBLICATION
Open Access

Computational Improvements to the Kernel k-Means Clustering Algorithm

Joshua D. Berlinski
Department of Statistics, Iowa State University, Ames, Iowa, USA

Ranjan Maitra (Corresponding Author)
Department of Statistics, Iowa State University, Ames, Iowa, USA

Correspondence: Ranjan Maitra ([email protected])
First published: 09 July 2025

Funding: This work was supported by the National Institute of Food and Agriculture.

ABSTRACT

Kernel k-means (kk-means) extends the standard k-means clustering method to identify generally shaped clusters by employing the k-means algorithm in a higher-dimensional space. Current implementations of kk-means are rather naive. We present simplifications that reduce calculations, performing only those that are absolutely necessary, thereby improving the overall algorithm's run time. We also present a sampling-based clustering and classification strategy for employing kk-means clustering on large datasets. Results on 13 datasets illustrate the computational benefits of our new algorithm. Our sampling-based strategy is also applied to investigate the use of kk-means in the color quantization of images in human perception-based colorspaces.

Abbreviations

  • AAC
  • alternating assignment and centering
  • ARI
  • adjusted Rand index
  • kk-means
  • kernel k-means
  • MATR
  • MAx-TRace
  • MKNN
  • mean k-nearest neighbors
  • OT
  • optimal transfer
  • OTQT
  • optimal transfer and quick transfer
  • QT
  • quick transfer
  • SCAC
  • sampling-based clustering and classification
  • SDSS
  • Sloan Digital Sky Survey
  • SRSWOR
  • simple random sampling without replacement
  • VIF
  • visual information fidelity
  • WSS
  • within-sums-of-squares
    1 Introduction

    Cluster analysis [1-4] partitions observations into homogeneous groups without label information. There are many clustering applications [5-9] and methods [10-16], of which the most commonly used is the model-agnostic k-means algorithm [17-19], which partitions observations into a given number of groups by locally minimizing the WSS. By design, the preference here is for homogeneous spherical clusters, even though attempts [20-23] have been made to address this limitation. Alternatively, the kernel k-means (kk-means) algorithm [24, 25] performs k-means-type operations after nonlinearly mapping the data into a higher-dimensional space, but is slow on larger datasets.

    This paper proposes three sets of computational improvements to the kk-means algorithm. We recognize that current kk-means implementations adapt [17]'s k-means algorithm which, upon starting with initial cluster centers in the transformed space, assigns each transformed observation to the cluster with the closest center, and then updates the centers at the conclusion of the cycle. Our first improvement borrows ideas from [26]'s k-means algorithm and simply updates cluster centers with every observation reassignment. Our second improvement forms the crux of our contributions: it simplifies and reduces computations by utilizing an "optimal transfer" (OT) stage and a more heuristic "quick transfer" (QT) stage, as also done [27, 28] for the k-means and k-modes [12] algorithms. Indeed, the OT steps are proved never to worsen solutions. Finally, we explore a computationally feasible clustering and classification framework for kk-means on large datasets.

    This paper has four more sections. Section 2 details the methodological improvements mentioned above. Section 3 evaluates performance on some common synthetic and real datasets. Section 4 employs the modification for large datasets to study the use of kk-means for image color quantization in human perception-based colorspaces. This article concludes with a discussion. A supplement, with sections, tables, and figures prefixed by "S", is available, as is an R [29] package, kkmeans, that implements our methods.

    2 Methodology

    2.1 Background and Preliminaries

    Consider a dataset of p-variate observations X = {x_1, x_2, ..., x_n}. The usual k-means setting partitions X into K disjoint sets C_1, C_2, ..., C_K by minimizing the WSS \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \bar{x}_k\|^2, where \bar{x}_k is the mean of the observations in C_k. Kernel methods generalize this framework.

    Let \Phi : \mathbb{R}^p \to \mathcal{H} be a nonlinear, continuous map into a feature space \mathcal{H} of arbitrary dimension. Then, letting \bar{\Phi}_k = n_k^{-1} \sum_{x_j \in C_k} \Phi(x_j) with n_k = |C_k|, the objective function for kk-means is

        \sum_{k=1}^{K} \sum_{x_i \in C_k} \| \Phi(x_i) - \bar{\Phi}_k \|^2.   (1)

    Let \langle \cdot, \cdot \rangle denote the inner product in \mathcal{H}. Then

        \| \Phi(x_i) - \bar{\Phi}_k \|^2 = \langle \Phi(x_i), \Phi(x_i) \rangle - \frac{2}{n_k} \sum_{x_j \in C_k} \langle \Phi(x_i), \Phi(x_j) \rangle + \frac{1}{n_k^2} \sum_{x_j, x_l \in C_k} \langle \Phi(x_j), \Phi(x_l) \rangle.

    So, given \Phi, (1) depends only on inner products computed in \mathcal{H}. A kernel function satisfying Mercer's condition [30] can ensure that both \mathcal{H} and the inner product exist, without explicitly using \Phi [31]. Also, unlike [20], (1) ignores the transformation \Phi, allowing it to be complex and infinite-dimensional. Let \mathcal{K} be a symmetric kernel function satisfying Mercer's condition such that \mathcal{K}(x, y) = \langle \Phi(x), \Phi(y) \rangle. The norm in the objective function (1) can then be written as

        \| \Phi(x_i) - \bar{\Phi}_k \|^2 = \mathcal{K}(x_i, x_i) - \frac{2}{n_k} \sum_{x_j \in C_k} \mathcal{K}(x_i, x_j) + \frac{1}{n_k^2} \sum_{x_j, x_l \in C_k} \mathcal{K}(x_j, x_l).   (2)
    Among popular kernels [3] are the Gaussian kernel with bandwidth parameter \sigma, given by

        \mathcal{K}(x, y) = \exp\{-\|x - y\|^2 / (2\sigma^2)\},

    the Laplacian kernel with parameter \sigma, given by

        \mathcal{K}(x, y) = \exp\{-\|x - y\| / \sigma\},

    the polynomial kernel with offset c and degree d, which has the form

        \mathcal{K}(x, y) = (x^\top y + c)^d,

    and the sigmoid kernel with parameters \gamma and c, given by

        \mathcal{K}(x, y) = \tanh(\gamma \, x^\top y + c).

    These kernels have parameters that must be specified or estimated, as discussed in Section 2.2.3.3. This paper uses the Gaussian kernel, but our methods apply to other kernels as well.
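
    To make these choices concrete, the short R sketch below evaluates the four kernels for a pair of observations. The parameterizations shown (for example, the 1/(2σ²) factor in the Gaussian exponent and the Euclidean norm in the Laplacian) are common conventions and should be read as assumptions rather than the exact forms used in our implementation.

      # Sketch of the four kernels for numeric vectors x and y. Parameterizations
      # are assumed conventions, not necessarily those of the kkmeans package.
      gaussian_kernel   <- function(x, y, sigma)         exp(-sum((x - y)^2) / (2 * sigma^2))
      laplacian_kernel  <- function(x, y, sigma)         exp(-sqrt(sum((x - y)^2)) / sigma)
      polynomial_kernel <- function(x, y, offset, deg)   (sum(x * y) + offset)^deg
      sigmoid_kernel    <- function(x, y, gamma, offset) tanh(gamma * sum(x * y) + offset)

      # Example: Gram matrix of a small dataset under the Gaussian kernel.
      X <- matrix(rnorm(20), nrow = 10, ncol = 2)
      G <- outer(seq_len(nrow(X)), seq_len(nrow(X)),
                 Vectorize(function(i, j) gaussian_kernel(X[i, ], X[j, ], sigma = 1)))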

    Current implementations of kk-means simply mimic [17]'s k-means algorithm, which naively assigns each observation to its closest current centroid and updates the centroids only at the end of all the assignments. We suggest improvements to this approach in the remainder of this section.
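
    For reference, the following R sketch implements this naive batch algorithm directly from (2), using only a precomputed Gram matrix G; the function name and interface are ours for illustration and do not correspond to the kkmeans package.

      # Naive (batch) kernel k-means sketch using a precomputed Gram matrix G.
      # All observations are compared against the current clusters, and the
      # memberships are updated only at the end of each full pass.
      naive_kkmeans <- function(G, K, max_iter = 100) {
        n  <- nrow(G)
        cl <- sample(rep_len(seq_len(K), n))       # random balanced start
        for (iter in seq_len(max_iter)) {
          # Distance of observation i to cluster k via Equation (2); the
          # K(x_i, x_i) term is constant across clusters and can be dropped.
          d <- sapply(seq_len(K), function(k) {
            idx <- which(cl == k)
            -2 * rowSums(G[, idx, drop = FALSE]) / length(idx) +
              sum(G[idx, idx]) / length(idx)^2
          })
          new_cl <- max.col(-d)                    # closest cluster per row
          if (all(new_cl == cl) || any(tabulate(new_cl, nbins = K) == 0L)) break
          cl <- new_cl                             # simplistic guard against empty clusters above
        }
        cl
      }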

    2.2 Efficient kk-Means Algorithms

    2.2.1 Alternating Assignment and Centering (AAC)

    Our first improvement updates the cluster centers with every reassignment of an observation. Indeed, [26]'s k-means algorithm fits this AAC framework, and we shall see that this seemingly simple fix improves both the accuracy and the running time of the kk-means algorithm.

    2.2.2 Optimal Transfer and Quick Transfer (OTQT)

    Our second improvement classifies iterations into optimal transfer (OT) and quick transfer (QT) steps, similar to [27, 28] or [20]'s suggestions for different k-means scenarios, or [12]'s k-modes algorithm for clustering categorical datasets. Our OTQT algorithm has the following steps (a simplified code sketch follows the list):
    1. For each observation, find and store its closest and second-closest cluster centers, and assign the observation to its closest cluster.
    2. Initially, all clusters are in the live set.
    3. Optimal transfer. Consider each observation and its current cluster in turn. A cluster belongs to the live set only if it was updated in the last QT stage or in any of the OT steps of the most recent pass through the data.

      a. If the observation's current cluster, C_k say, is in the live set, find, over all other clusters, the minimum of the size-adjusted transfer criterion

        \min_{l \neq k} \; \frac{n_l}{n_l + 1} \, \| \Phi(x_i) - \bar{\Phi}_l \|^2.   (3)

        If this minimum is no smaller than \frac{n_k}{n_k - 1} \| \Phi(x_i) - \bar{\Phi}_k \|^2, no reallocation is necessary. Otherwise, transfer the observation to the minimizing cluster and update its stored closest and second-closest clusters. Both the donor and the recipient clusters are now in the live set.

      b. If the observation's current cluster is not in the live set, proceed as in Step 3a but with (3) minimized only over clusters in the live set.

    4. Stop if the live set is empty. Otherwise, continue to Step 5.
    5. Quick transfer. For each observation, consider only its stored closest and second-closest clusters. If neither has been updated since the observation was last visited, skip the observation. Otherwise, transfer the observation to its second-closest cluster, swapping the stored closest and second-closest clusters, if doing so reduces the objective function; both clusters involved are then marked as updated.
    6. If no transfers occurred in the last pass of QT steps, go to Step 3; otherwise, go to Step 5.
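
    The following simplified R sketch conveys the control flow of a transfer pass. It recomputes kernel distances from the Gram matrix rather than caching them, reassigns an observation whenever another center is strictly closer (instead of using the size-adjusted criterion in (3)), and omits the live-set bookkeeping, so it should be read as an illustration rather than as the kkmeans implementation.

      # Distance from observation i to the center of cluster k via Equation (2);
      # the K(x_i, x_i) term is kept so distances are comparable across clusters.
      kernel_dist <- function(G, cl, i, k) {
        idx <- which(cl == k)
        G[i, i] - 2 * sum(G[i, idx]) / length(idx) + sum(G[idx, idx]) / length(idx)^2
      }

      # One transfer pass: each observation is moved to its closest cluster as
      # soon as a closer one is found, so later observations see the update.
      transfer_pass <- function(G, cl, K) {
        changed <- FALSE
        for (i in seq_len(nrow(G))) {
          d    <- vapply(seq_len(K), function(k) kernel_dist(G, cl, i, k), numeric(1))
          best <- which.min(d)
          if (best != cl[i] && sum(cl == cl[i]) > 1L) {  # keep the donor cluster non-empty
            cl[i]   <- best
            changed <- TRUE
          }
        }
        list(cl = cl, changed = changed)
      }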
    Theoretical benefits: We now show that a kk-means algorithm based only on the OT steps can yield performance gains over the naive method. We state and prove the following claim.

    Claim 1. Let \bar{\Phi}_k be the center of C_k in feature space after initialization. The kk-means algorithm using only OT steps, applied to distinct observations with distinct initial centers, has the following properties:

    1. Each pass through the dataset never increases the objective function (1).
    2. The resulting partition has no empty clusters.
    3. Each pass through the dataset keeps the cluster centers distinct.
    Proof. We prove each statement in turn.
    1. If no observation can reduce the objective function by transferring to a different cluster, the algorithm terminates. Additionally, an observation is only transferred to a different cluster if the objective function decreases, proving Part 1.
    2. Suppose a cluster is empty at the beginning of the OT stage. Then reassigning any observation to it reduces the objective function, and so the empty cluster will pull in at least one observation. Alternatively, suppose a cluster is emptied during the OT stage. Then the algorithm will iterate again, because an observation was transferred, and a point will again be assigned to the empty cluster as above. So, Part 2 holds.
    3. Suppose that, upon entering the OT stage, two cluster centers coincide. Since the observations are distinct, at least one of the two clusters contains an observation that does not lie at the common center. For such an observation, the distances to the two coinciding centers are equal, while the size adjustment in (3) strictly favors the transfer, so the reallocation condition in the OT algorithm is met and the observation will be transferred, separating the centers.□

    The OT steps can be implemented in parallel to further speed up computations. Further, we find that the QT steps substantially improve performance in our experiments, but as with k-means [32] or k-modes [12], these gains have not been possible to prove theoretically.

    2.2.3 Additional Implementation Aspects

    We address a few issues that arise while implementing our algorithms.

    2.2.3.1 Initialization

    As with k-means [33], initialization of kk-means algorithms impacts clustering performance. A simple approach is to use SRSWOR to choose the initial centers from among the observations in feature space. This method was not outperformed by our kk-means++ approach (Section S1.1), which extends [34]'s k-means++ algorithm. Our initializer is stochastic, so we run our algorithm from many initializations and choose the result with the lowest terminating value of (1).

    2.2.3.2 Storing Distances

    A large part of the OTQT algorithm's efficiency comes from storing the distance of each observation to each cluster center and updating only the affected distances whenever an observation is reassigned. While this is straightforward with k-means, it is more challenging in the kk-means context because the cluster centers \bar{\Phi}_k are not available explicitly, since \Phi is unknown. However, (2) suggests that the distance to a cluster center can still be updated without knowing the center itself. To see this, note that the last term in (2), n_k^{-2} \sum_{x_j, x_l \in C_k} \mathcal{K}(x_j, x_l), is the same for every observation compared against cluster C_k. So these K values, one per cluster, can be stored and updated as observations are added to and removed from each cluster. To update such a value, let S_k = \sum_{x_j, x_l \in C_k} \mathcal{K}(x_j, x_l). When an observation x is added to C_k, the sum becomes

        S_k + 2 \sum_{x_j \in C_k} \mathcal{K}(x, x_j) + \mathcal{K}(x, x),

    and the cluster size becomes n_k + 1. Similarly, when x is removed from C_k, the sum becomes

        S_k - 2 \sum_{x_j \in C_k, \, x_j \neq x} \mathcal{K}(x, x_j) - \mathcal{K}(x, x),

    and the cluster size becomes n_k - 1. So, if an observation is transferred from one cluster to another, only the stored sums and sizes of the donor and recipient clusters need to be updated, each requiring a single pass over the corresponding cluster, after which the constant terms follow as S_k / n_k^2.

    For the Gaussian kernel, \mathcal{K}(x, x) is constant and can be ignored in these calculations. Additionally, the distances of each observation to its closest and second-closest cluster centers can be stored. If the observation has been transferred, or if either of these two clusters has changed since the observation was last visited, these values must be recomputed. This recomputation requires evaluating the kernel against every member of the affected clusters and can be expensive, so if computer memory permits, it is better to store the Gram matrix for the dataset and thus reduce the number of kernel evaluations. Such an approach can also be adopted for our AAC algorithm, since the centers are updated whenever an observation moves to a new cluster.
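
    A minimal R sketch of this bookkeeping is given below. It maintains, for each cluster, the size and the double sum of kernel values over its members, and touches only the donor and recipient clusters when an observation moves; the state layout and function name are ours for illustration.

      # Move observation i from cluster `from` to cluster `to`, updating the
      # cached sizes n[k] and double sums S[k] = sum_{j,l in C_k} G[j, l], so
      # the constant term in (2), S[k] / n[k]^2, is never recomputed from scratch.
      move_obs <- function(i, from, to, cl, n, S, G) {
        donors     <- setdiff(which(cl == from), i)   # members of `from` other than i
        recipients <- which(cl == to)
        S[from] <- S[from] - 2 * sum(G[i, donors]) - G[i, i]
        S[to]   <- S[to]   + 2 * sum(G[i, recipients]) + G[i, i]
        n[from] <- n[from] - 1L
        n[to]   <- n[to]   + 1L
        cl[i]   <- to
        list(cl = cl, n = n, S = S)                   # per-cluster constants: S / n^2
      }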

    2.2.3.3 Parameter Selection for the Gaussian Kernel

    The Gaussian kernel needs a choice for the bandwidth parameter σ. Figure 1 illustrates the effect of σ on kk-means clustering of the Bullseye dataset [21, 22]. We see that smaller values of σ produce fragmented groups in Euclidean space, while larger values produce more compact clusters. We now discuss some methods for choosing σ. Luxburg [23] suggested estimating σ using the mean squared distance of each observation to its k nearest neighbors (MKNN). Ordozgoiti [35] provided a hierarchical search method that requires storing the full Gram matrix and is impractical for larger datasets. The MAx-TRace (MATR) algorithm [36] performs a grid search over possible values of σ. The algorithm needs a similarity matrix of the data, for which we use the Gaussian kernel itself, with σ preliminarily estimated by MKNN. Section S1.2 shows our MATR implementation to outperform MKNN. The method also extends to non-Gaussian kernels, and so we use it henceforth.
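
    One plausible implementation of the MKNN estimate is sketched below: the mean squared Euclidean distance of each observation to its k nearest neighbors is averaged over the dataset and converted to a bandwidth. The final averaging and square root are assumptions about the exact form, so the result should be treated only as a starting value, for example as the preliminary estimate fed to MATR.

      # MKNN-style bandwidth estimate: mean squared distance of each observation
      # to its k nearest neighbours, averaged over the dataset. The exact scaling
      # used in the paper may differ; this is a plausible default.
      mknn_bandwidth <- function(X, k = 5) {
        D2 <- as.matrix(dist(X))^2                 # squared Euclidean distances
        knn_means <- apply(D2, 1, function(row) mean(sort(row)[2:(k + 1)]))  # drop self
        sqrt(mean(knn_means))                      # return on the scale of sigma
      }

      sigma_hat <- mknn_bandwidth(iris[, 1:4], k = 5)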

    Figure 1. Effect of σ on kk-means clustering of Bullseye.

    2.3 The kk-Means Algorithm for Large Datasets

    Even with our improvements, our kk-means algorithms can still be slow for large datasets. This is because, beyond the in-place calculation of the Gram matrix needed when n is large, the kernel function has to be computed for each observation against each cluster, even before any feature-space calculations. We propose a sampling-based clustering-and-classification (SCAC) strategy for large datasets. Specifically, we randomly split the original data into a subset used for clustering and the remainder used for classification. We cluster the first subset using kk-means to obtain a partition into K clusters. The remaining observations are then assigned to their closest cluster according to (2). For illustration, Figure 2 displays the use of our SCAC strategy on the bullseye dataset. While our kk-means algorithms can easily be applied to this dataset in full (as is done in Section 3), our illustration, which uses only 20% of the dataset for clustering and the remaining 80% for classification, performs quite well and almost at par with the clustering of the entire dataset (as seen a little later in Section 3).
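
    The classification step of SCAC amounts to evaluating (2) between each held-out observation and each cluster found in the sampled subset. A sketch is given below, with kfun standing for an arbitrary kernel function (such as the Gaussian kernel defined earlier) and the clustering of the subset assumed to have been obtained already, for example with one of the algorithms above.

      # SCAC classification step: assign each held-out observation to its closest
      # cluster, where clusters were found on the sampled subset Xs with labels cls.
      scac_classify <- function(X_holdout, Xs, cls, kfun) {
        K <- max(cls)
        apply(X_holdout, 1, function(x) {
          d <- vapply(seq_len(K), function(k) {
            idx <- which(cls == k)
            Kxc <- vapply(idx, function(j) kfun(x, Xs[j, ]), numeric(1))
            Kcc <- outer(idx, idx, Vectorize(function(a, b) kfun(Xs[a, ], Xs[b, ])))
            kfun(x, x) - 2 * mean(Kxc) + mean(Kcc)   # Equation (2)
          }, numeric(1))
          which.min(d)
        })
      }

    In practice, the within-cluster double sums would be computed once per cluster rather than once per held-out observation.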

    Figure 2. (Left) kk-means clustering of a sample comprising 20% of the Bullseye data. (Right) Using the clustering results to classify the remaining observations, and thence the full dataset.

    The proportion of the dataset included in the clustering subset can impact the variability of the clustering performance. When the dataset is large, this proportion is necessarily small for computational practicality, and there is far more variability in the clustering. To alleviate the effects of this variation, we propose running the partitioning algorithm on several different sampled subsets, which can be done in parallel, and choosing the result that minimizes (1) on the entire dataset after classifying the held-out observations. We study these suggestions in Section 3.4.
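
    Putting the pieces together, a sketch of SCAC with repeated samples follows; cluster_subset() is a placeholder for whichever kk-means routine is run on the sampled subset, scac_classify() is the sketch above, and the winning replicate is the one with the smallest value of (1) on the full dataset.

      # Run SCAC on B independent subsamples of a data matrix X and keep the
      # replicate with the lowest value of the objective (1) on the full dataset.
      # `cluster_subset(Xs, K)` is a placeholder for a kk-means run on the subset.
      scac <- function(X, K, prop = 0.2, B = 5, kfun, cluster_subset) {
        n <- nrow(X)
        best <- NULL
        for (b in seq_len(B)) {
          idx  <- sample.int(n, size = ceiling(prop * n))     # SRSWOR clustering subset
          cls  <- cluster_subset(X[idx, , drop = FALSE], K)   # cluster the subset
          full <- integer(n)
          full[idx]  <- cls
          full[-idx] <- scac_classify(X[-idx, , drop = FALSE],
                                      X[idx, , drop = FALSE], cls, kfun)
          obj <- 0                                            # objective (1) on all of X
          for (k in seq_len(K)) {
            members <- which(full == k)
            if (length(members) == 0L) next
            Gk  <- outer(members, members, Vectorize(function(a, b) kfun(X[a, ], X[b, ])))
            obj <- obj + sum(diag(Gk)) - sum(Gk) / length(members)
          }
          if (is.null(best) || obj < best$obj) best <- list(labels = full, obj = obj)
        }
        best
      }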

    3 Performance Evaluations

    We evaluated our improvements to kk-means on six synthetic 2D datasets from [37], shown in Figure 3, the subset of the Sloan Digital Sky Survey (SDSS) dataset in [38], and six real-world classification datasets from the UCI Machine Learning Repository [39], with details in Table 1.

    Figure 3. Partitions of our 2D datasets obtained using the kk-means OTQT algorithm. Numerals represent the true group for each observation, while color denotes the kk-means group.
    TABLE 1. Results of kk-means partitions obtained after minimizing (1) over the initializations for each method and dataset. K, n, and p denote the number of groups, observations, and variables in each dataset; the last three columns report the adjusted Rand index of each method's partition against the true grouping.

    Data set        K     n     p    Naive    AAC    OTQT
    spherical 7 500 2 1.000 1.000 1.000
    banana-clump 2 200 2 0.980 0.980 0.980
    bullseye 2 400 2 0.710 0.710 0.710
    pathbased 3 300 2 0.939 0.939 0.939
    bananas-arcs 4 4515 2 0.966 0.966 0.966
    SSS 3 5015 2 0.685 0.685 0.685
    SDSS 2 1507 5 0.286 0.286 0.286
    anuran 10 7195 22 0.407 0.408 0.408
    glass 7 214 9 0.268 0.206 0.217
    libras 15 360 90 0.302 0.314 0.322
    wine 3 178 13 0.416 0.416 0.416
    segmentation 7 2310 19 0.436 0.437 0.436
    iris 3 150 4 0.744 0.744 0.744

    Our evaluations compared the naive approach with our AAC and OTQT approaches, all efficiently coded in C with R [29] interfaces in our kkmeans package. Clustering performance was assessed numerically by the adjusted Rand index (ARI) [40], which compares the grouping obtained by each algorithm with the true partition. The ARI takes values no greater than unity, with values closer to unity indicating better clustering performance.

    3.1 Performance of AAC and OTQT Algorithms

    3.1.1 Clustering Performance

    The clustering performance of all three methods is summarized numerically for all datasets in Table 1, and visually for the 2D datasets in Figure 3. For each of the datasets from [37] and for the SDSS data, all three methods arrived at the same final partition after the initializations. The naive, AAC, and OTQT methods each performed best on one UCI dataset, and all three yielded the same result on two others. On the remaining dataset, anuran, the AAC and OTQT methods yielded the same solution, which is marginally better than that of the naive method.

    3.1.2 Timing Results

    Figure 4 displays the change in time taken by AAC and OTQT, relative to the naive method, to arrive at the terminating solution for the datasets in Table 1, with all three methods started from the same initializations. Excepting the spherical and glass datasets, the naive method typically takes the longest time to terminate. In general, the AAC and OTQT methods take similar amounts of time, with AAC narrowly faster in slightly more cases. The naive method is fastest for datasets with fewer average observations per cluster, while OTQT is at the other end of the spectrum. Because computational overhead increases from the naive to the AAC to the OTQT algorithm, our results suggest using the algorithms in that order as the average number of observations per cluster increases. In essence, the AAC or OTQT calculations make sense only when their overhead can be justified, which happens when the average number of observations per cluster is larger.

    Figure 4. Increase in time taken by AAC and OTQT, relative to the naive method, to arrive at the terminating solution, with each method started from the same initial seeds for each dataset.

    3.2 Comparison of Different Kernel Functions

    Table 2 compares results obtained using the four kernels described in Section 2.1. We see that the Gaussian kernel generally performs well across the variety of datasets used in this evaluation. However, in some instances, the Laplace kernel is the best performer and, indeed, on the SDSS dataset it vastly outperforms the others. The polynomial and sigmoid kernels generally underperform relative to the other two, but there are some datasets where the sigmoid kernel performs well: indeed, it is a runaway winner on the SSS dataset.

    TABLE 2. Results of kk-means partitions after multiple initializations for each dataset and kernel function. The kernel parameters for each dataset were chosen manually to yield the highest adjusted Rand index.
    Adjusted Rand Index
    Data set Gaussian Laplace Polynomial Sigmoid
    spherical 1.000 0.996 1.000 0.898
    banana-clump 1.000 1.000 0.980 0.980
    bullseye 0.619 0.080 0.013 −0.002
    pathbased 0.930 0.929 0.425 0.401
    bananas-arcs 0.951 0.520 0.378 0.463
    SSS 0.673 0.741 0.749 0.972
    SDSS 0.893 0.985 0.051 0.353
    glass 0.268 0.219 0.229 0.021
    libras 0.315 0.274 0.289 0.272
    wine 0.416 0.416 0.382 0.171
    segmentation 0.436 0.502 0.413 0.346
    iris 0.758 0.731 0.640 0.530

    3.3 Computational Complexity Analysis

    We next empirically evaluated the effect of n and K on the time taken for each algorithm to converge. We systematically generated datasets via R's MixSim [41] package. In all, 400 datasets were generated for each combination of n and K, with the dimension p evenly split across these 400 datasets. Note that p does not affect the comparisons, as the Gram matrix for each dataset is computed and stored before clustering and therefore adds the same constant time for each method. All datasets were generated with a small amount of overlap among the clusters, in terms of the generalized overlap measure of [42]. As in the comparisons between algorithms in the previous sections, all three clustering algorithms began at the same initial values within each dataset.
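
    The data-generation step can be reproduced along the following lines. The MixSim() and simdataset() calls reflect our understanding of the MixSim package interface, and the specific BarOmega, K, p, and n values are illustrative placeholders rather than the settings used in our study; MixSim controls the average pairwise overlap, which serves here as a stand-in for the generalized overlap.

      library(MixSim)

      # Generate one synthetic dataset with a prescribed (small) overlap among
      # clusters. The parameter values are placeholders, not the study settings.
      set.seed(1)
      Q   <- MixSim(BarOmega = 0.01, K = 4, p = 2)   # mixture with small average overlap
      dat <- simdataset(n = 1000, Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
      X   <- dat$X                                   # n x p data matrix
      id  <- dat$id                                  # true cluster labels, for reference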

    Figure 5 displays the median time to converge for each parameter combination. In general, the AAC and OTQT methods outperform the naive method, with the improvement becoming more pronounced for larger datasets. When there are only two clusters, we see that the added overhead present in the OTQT method means that it takes longer than the AAC method. Since there are only two clusters, the QT step is unnecessary, and thus it is recommended to use the AAC method in these situations. When the number of clusters increases, however, the time efficiency of the OTQT method is realized. In particular, as the number of clusters increases, the OTQT method continues to converge faster than the AAC method, as the QT step is designed to reduce the total number of comparisons required in each pass through the dataset.

    Figure 5. Median time to converge for all simulated datasets under each parameter combination. Error bars indicate two standard deviations calculated using 1000 bootstrap samples.

    3.4 Evaluating the SCAC Strategy for Larger Datasets

    We evaluated SCAC on the datasets of Table 1, with kk-means clustering done using OTQT. We can perform kk-means clustering on the entire datasets here (as was done in Section 3.1), so our investigation here determines how SCAC performs relative to clustering the entire sample. Section S2 illustrates performance on the bananas-arcs dataset. Figure S1 shows performance of the SCAC strategy on one realization, with the clustering subset set at 5% of the data. We see that the first SCAC run happens to do quite well (ARI = 0.954), but the next five SCAC runs, with ARIs ranging from 0.606 to 0.964, indicate substantial variability in the clustering performance. We therefore evaluate the performance of SCAC over different sampling proportions and numbers of repeated samples. Figure 6 displays the results on each of the datasets. We see that as both the sampling proportion and the number of repeated samples increase, the resulting ARI more closely matches that from the partitioning obtained using the complete dataset, with decreasing variance. This effect depends on the dimension of the dataset, as datasets with higher dimension require a larger proportion of the original dataset to achieve better results. For many low-dimensional datasets, we see that even the smallest clustering subset yields results similar to clustering the entire dataset, especially with a large number of repeated samples. While the choice of the number of repeated samples will likely be dictated by the availability of computing resources, we recommend using more repeated samples for smaller sampling proportions to achieve better results. We apply SCAC in the next section.

    Figure 6. Results for various combinations of the sampling proportion and number of repeated samples used for clustering. The dashed blue line corresponds to the ARI obtained by kk-means clustering of the entire dataset.

    4 Color Quantization of Images

    Color quantization is used to represent digital images by means of a few colors without appreciably degrading their visual quality [19]. It is useful for storing and transmitting data. Traditionally, color quantization has been done in the RGB colorspace, where k-means has been a method of choice, but some authors [43] have used it in human perception-based colorspaces. Indeed, [44] showed in an expansive study that k-means color quantization is, in many cases, better done in these colorspaces. However, they also expressed some misgivings on the use of k-means in these colorspaces, given that these spaces have features that are very different from each other. We therefore investigate the use of kk-means for the color quantization of five different publicly available images.

    We performed color quantization using both the k-means and kk-means algorithms in the RGB, HCL, and CIEXYZ colorspaces, for several palette sizes, and with each variable scaled prior to clustering. For k-means, the entire set of colors was used, while for kk-means, we used our SCAC strategy.
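
    A sketch of the quantization pipeline in one colorspace follows, using grDevices::convertColor for the RGB-to-CIEXYZ conversion and, for brevity, stats::kmeans in place of the kk-means SCAC procedure; the png package, the file name, and the 64-color palette are assumptions made for the example.

      library(png)

      # Color-quantize a 3-channel RGB image in the CIEXYZ colorspace with a
      # 64-color palette. stats::kmeans stands in for the clustering step; the
      # kk-means version would instead apply the SCAC strategy of Section 2.3.
      img  <- readPNG("statlab.png")                       # h x w x 3 array in [0, 1]
      px   <- matrix(img, ncol = 3)                        # one row per pixel (R, G, B)
      xyz  <- convertColor(px, from = "sRGB", to = "XYZ")  # move to CIEXYZ
      xyz  <- scale(xyz)                                   # scale each variable, as in the text
      fit  <- kmeans(xyz, centers = 64, nstart = 5)
      # Map every pixel to its cluster center, undo the scaling, and return to sRGB.
      quant_xyz <- fit$centers[fit$cluster, ]
      quant_xyz <- sweep(sweep(quant_xyz, 2, attr(xyz, "scaled:scale"), "*"),
                         2, attr(xyz, "scaled:center"), "+")
      quant_rgb <- convertColor(quant_xyz, from = "XYZ", to = "sRGB")
      quant_img <- array(pmin(pmax(quant_rgb, 0), 1), dim = dim(img))
      writePNG(quant_img, "statlab_quantized.png")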

    Figure 7 illustrates performance on Statlab. We display only the results with 64 colors in the CIEXYZ space, which is the space where quantization performs best using both k-means and kk-means. While both algorithms perform fairly similarly, the 64 colors chosen to describe the image differ and emphasize different features. In particular, the kk-means quantized image has fewer colors in the background of the signpost, creating a less pixelated look and a clearer distinction from the text compared to the k-means result. Beyond visual displays, we also assessed the quality of the quantized image in faithfully representing the original in terms of the Visual Information Fidelity (VIF) criterion [45], which takes values in (0,1), with 1 representing perfect fidelity and values farther from unity representing increased degradation. VIF compares two images using natural scene statistics and reflects the perceptual quality difference between them; in particular, it is better suited for images of natural scenes. The VIF for the kk-means quantized image is slightly higher than that of the k-means quantized image.

    Figure 7. The (a) original, (b) k-means, and (c) kk-means color quantized Statlab images. The quantized images use 64 colors in the (best-performing) CIEXYZ color space.

    We also explored the use of kk-means in the different colorspaces on a few more images of different sizes, displayed in Figure 8, with the VIF results obtained using the two methods in the three colorspaces displayed in Figure 9 and Table S3. For smaller palettes, there is not much benefit to leaving the RGB colorspace, but for larger numbers of colors, CIEXYZ seems better for quantization. In all cases, HCL is the worst performer. Between k-means and kk-means, the honors are shared, so we now try to see whether we can explain performance in terms of the characteristics of the images.

    Figure 8. The digital images used in our color quantization assessment: (top left) IceBed, 1200 × 800 pixels; (bottom left) Bandhavgarh, 1848 × 1981 pixels; (top right) Pujatram, 1600 × 1066 pixels; and (bottom right) Ultadanga, 1600 × 1200 pixels. All image displays are to scale.
    Figure 9. VIFs of images quantized using k-means and kk-means in the three colorspaces.

    Figure 10 displays circular-linear boxplots of the distributions of the unique colors in our images in terms of hue (H), chroma (C), and luminance (L). Hue being an angular quantity, we use circular boxplots [46, 47] to display its distribution, while linear boxplots are used for the C and L bands. In general, we see that Bandhavgarh is the worst performer for kk-means. Its hue distribution is highly skewed in the clockwise direction, with high kurtosis but low IQR; the other two distributions are fairly symmetric, however. The best performer, IceBed, has relatively symmetric distributions for H and C, with L only slightly skewed toward smaller values. Pujatram and Statlab each perform best for two of the four palette sizes, but overall show relatively equal performance between the methods; indeed, their distributions are only slightly skewed (but in opposite directions), and they have similar-looking hue distributions, both skewed toward higher values. Lastly, Ultadanga, which is best suited for kk-means at the three larger palette sizes, has symmetric distributions for H and L and is only slightly skewed in C. In summary, it seems that kk-means performs best when the distributions of each of H, C, and L are symmetric, and poorly with increased circular skewness in H, but is more robust to skewness in the L band. Our initial findings here call for a deeper analysis that is beyond the scope of this paper.

    Figure 10. Distribution of our image colors in HCL space.

    5 Conclusions

    This paper delivers computational improvements to the kk-means algorithm using the AAC and OTQT approaches. As with the k-means [32] and k-modes [12] algorithms, we provide theoretical guarantees for the OT-only version, while the OTQT algorithm's generally good performance is demonstrated through experiments on 2D synthetic and higher-dimensional real-world classification datasets. We also use our approach to study the performance of kk-means in the color quantization of images in human perception-based colorspaces. An R package, kkmeans, implements the naive, AAC, and OTQT methods, as well as methods for estimating the bandwidth parameter of the Gaussian kernel using both MKNN and MATR.

    There are a number of issues that can benefit from further study. For instance, the problem of selecting an optimal number of clusters using kk-means remains unresolved. Also unresolved are penalization methods for (1) based on the complexity of the transformation, where the challenge is accentuated by the fact that the Jacobian is unclear, since \Phi is a potentially infinite-dimensional function of complicated form. Thus, while we have made important computational improvements to kk-means algorithms, there remain questions worthy of further attention.

    Author Contributions

    Joshua D. Berlinski wrote the R package, developed the methodology, proved the claim, and performed the computations. Ranjan Maitra conceptualized the problem and planned the framework, the experiments, and the applications. Both authors shared in the writing of the article.

    Acknowledgments

    We thank the Editor-in-Chief for comments on an earlier version that improved this article. All original digital images were converted to the Tagged Image File Format (TIFF). Statlab is a 2023 digital photograph of a plaque that stands outside the building housing Iowa State University's Department of Statistics. The plaque commemorates the 1933 establishment there of the first statistical laboratory in the United States of America. IceBed is thanks to Nima Sarikhani's dreamy photograph, named “Ice Bed”, of a young polar bear drifting to sleep on a bed carved into an iceberg, first published on February 7, 2024. Bandhavgarh is courtesy of Staffan Widstrand of the World Wildlife Foundation and shows a Royal Bengal tiger in Bandhavgarh National Park, India. Pujatram is a photograph, released by the transport department of the Government of West Bengal, India, of a specially decorated 2023 tram commemorating both the 150th anniversary of Calcutta Tramways and the then-recent award of UNESCO's intangible cultural heritage tag to Kolkata's (formerly, Calcutta) famed Durga Puja festival. Ultadanga is a photograph in December 2010 of the multimodal traffic chaos at the Ultadanga-VIP Road intersection in Kolkata, India. Statlab and Ultadanga are from the second author's personal photographs collection. The research of the second author was supported in part by the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Hatch project IOW03717. The content of this paper is however solely the responsibility of the authors and does not represent the official views of the USDA. Open access funding provided by the Iowa State University Library.

      Conflicts of Interest

      The authors declare no conflicts of interest.

      Data Availability Statement

      The data that support the findings of this study are publicly available and are also provided in the Supporting Information section of this article.
