A One-Class Classification-Based Control Chart Using the K-Means Data Description Algorithm
Abstract
This paper aims to enlarge the family of one-class classification-based control charts, referred to as OC-charts, and extend their applications. We propose a new OC-chart using the K-means data description (KMDD) algorithm, referred to as KM-chart. The proposed KM-chart gives the minimum closed spherical boundary around the in-control process data. It measures the distance between the center of KMDD-based sphere and the new incoming sample to be monitored. Any sample having a distance greater than the radius of KMDD-based sphere is considered as an out-of-control sample. Phase I and II analysis of KM-chart was evaluated through a real industrial application. In a comparative study based on the average run length (ARL) criterion, KM-chart was compared with the kernel-distance based control chart, referred to as K-chart, and the k-nearest neighbor data description-based control chart, referred to as KNN-chart. Results revealed that, in terms of ARL, KM-chart performed better than KNN-chart in detecting small shifts in mean vector. Furthermore, the paper provides the MATLAB code for KM-chart, developed by the authors.
1. Introduction
In recent years, several attempts have been proposed to integrate data mining with statistical process control (SPC) [1–6]. The objective was to overcome the limitations of traditional parametric control charts especially the normality assumption, which may not be applicable in the case of modern manufacturing systems. The most commonly applied data mining technique in SPC is one-class classification. Actually, one-class classification methods have been widely used for process monitoring [7–12]. The principle of one-class consists in constructing a sphere which contains the maximum of data with minimum volume. This sphere distinguishes in-control process data, also known as target data, from out-of-control process data. The shape and volume of the one-class depend on the used one-class classifier, also known as data description algorithm. The one-class approach was applied to develop a new family of control charts called one-class classification-based control charts, referred to as OC-charts.
Several types of one-class classifiers exist in the literature. For instance, only two one-class classifiers were used to develop control charts, named the support vector data description (SVDD) algorithm [13] and the k-nearest neighbor data description (KNNDD) algorithm [14]. Sun and Tsung [15] used SVDD to develop the kernel distance-based multivariate control chart, also known as the K-chart, which is considered as the first OC-chart that uses support vector principles. When monitoring more than two variables, the K-chart uses kernel methods that provide the advantage of dealing with high-dimension data. During the last decade, the K-chart has received significant attention and has witnessed several improvements through many works [16–19].
The SVDD algorithm was also employed to develop other control charts such as the SVDD based multivariate cumulative sum control chart [20], monitor batch process [21], monitor nonlinear processes [22], and perform industrial calibration [23]. Despite its successful application, the high computational cost of SVDD remains its main drawback. Actually, the SVDD algorithm loses its efficiency when the size of the training data becomes large. To overcome this shortcoming, Sukchotrat et al. [24] proposed the use of KNNDD algorithm to develop the KNN-chart. KNNDD is a simple and fast algorithm that performs better with high dimensional data and does not consume much time during the training phase. Gani and Limam [25] compared the performance of the K-chart and the KNN-chart and demonstrated that the K-chart is sensitive to small shifts in mean vector, while the KNN-chart is sensitive to moderate shifts in mean vector.
This paper investigates the use of another one-class classifier which is the K-means data description (KMDD) algorithm to construct a new OC-chart, referred to as the KM-chart. The objective of this work is twofold. First, we aim to enlarge the family of OC-charts and extend their applications by showing the methodology of their construction and providing the necessary software codes. Second, we attempt to propose an OC-chart that can compete with K-chart and KNN-chart in terms of the average run length (ARL) criterion.
The rest of this paper is organized as follows. A review of OC-charts is presented in Section 2. The proposed KM-chart is introduced in Section 3. Construction methodology of the KM-chart using a real data example is shown in Section 4, while performance analysis of the proposed control chart is discussed in Section 5. Section 6 summarizes this paper.
2. Background on OC-Charts
In the literature, there are two common OC-charts which are the K-chart and the KNN-chart. In the following, we give a review of these control charts.
2.1. The K-Chart
Under H0 the process is considered as in-control and under H1 the process is considered as out-of-control, when sample z was taken.
2.2. The KNN-Chart
The KNN-chart uses an unsupervised one-class classifier called KNNDD to construct a one-class by estimating the local density of data. To understand the mechanism of the KNN-chart a brief description of the KNNDD algorithm, based on the work of Sukchotrat et al. [24], is presented below.
The K2 values are used as monitoring statistics.
3. The Proposed KM-Chart
The proposed KM-chart gives the minimum closed spherical boundary around the in-control process data using the KMDD algorithm. The latter is an unsupervised one-class classifier, based on the K-means algorithm which is a very popular clustering method. It measures the distance between the KMDD-based sphere and the new incoming sample to be monitored. The sphere is described by K clusters placed such that the average distance to a cluster center is minimized.
- (i)
Given cluster centers μ1, …, μK, assign each point to the cluster with the closest center.
- (ii)
Given a cluster assignment, update the cluster centers to be the sample mean of the observations in each cluster.
4. A Real Industrial Application
To demonstrate the efficacy of the proposed KM-chart, we applied it to the “Cristal Light” cigarettes data set. Actually, “Cristal Light” cigarettes are Tunisian trademark produced by the Kairouan Tobacco Manufacture. The production process of “Cristal Light” cigarettes comprises 12 sequences of operations which are humidification of tobacco leaves, threshing tobacco leaves, strip processing, hashing, drying, expansion of edges, casing, flavoring, introduction of expanded tobacco, confecting ion of cigarettes, packing and boxing, and conditioning. Details about the “Cristal Light” data set can be found in Hajlaoui [26].
- (1)
The weight of a cigarette, which is the made up of the tobacco, the filter, and the cigarette paper weights. It varies between 0.965 and 1 gram.
- (2)
The module of a cigarette, which corresponds to its diameter; it varies from 6.75 to 8.0 millimeters.
- (3)
The humidity rate of tobacco, which is the proportion of water contained in a cigarette. It is considered acceptable if it varies between 11.5% and 13.5%.
- (4)
Pulling resistance of a cigarette, which is defined by the difference in pressure between the two extremities of a cigarette when a quantity of air is passed through it. The pulling resistance is considered acceptable when it varies from 100 to 115 CE (colonne d’eau).
- (5)
The folding density, which corresponds to the volume occupied by the mass of the tobacco inside a cigarette. It is tolerable to belong to 450 ± 20 cm3.
The “Cristal Light” data set is composed of 65 observations. The first 60 cigarettes are used to construct OC-charts in phase I. Each cigarette took one minute to be collected. The five remainder cigarettes are used for testing out-of-control states in phase II. For the construction of KM-chart, we follow the same methodology of Gani and Limam [25], which consists of three main steps.
Step 1. The data set is analyzed using principal component analysis (PCA) method to obtain independent and identical distributed data, which is a fundamental assumption for one-class classification problem.
Step 2. The principal components (PCs) resulting from Step 1 are used to construct the one-class. In our application, we have three one-class classifiers which are SVDD, KNNDD, and KMDD.
Step 3. The optimal one-class obtained from Step 2 is used to construct OC-charts by computing the charting statistics which are KD for the K-chart, K2 for the KNN-chart, and D2 for the KM-chart.
All calculations were carried out with MATLAB software. For the construction of K-chart and KNN-chart, we used the MATLAB codes of Gani and Limam [25]. For the construction of KM-chart, we used the MATLAB code developed by the authors (see Algorithm 1).
-
Algorithm 1
-
%Let Q be the (n × m) matrix of quality variables where n is the number
-
%of observations and m the number of quality variables in the training phase.
-
%Define the target class.
-
T=target_class(+Q);
-
%Use the KMDD algorithm to fit a sphere around the defined target class above,
-
%where c1 is a fraction error on the target class and c2 is a parameter
-
%defining the number of clusters.
-
w = kmeans_dd(T,c1,c2);
-
%Show the results of KMDD classifier.
-
W=+w;
-
%Phase I of the KM-chart:
-
%Compute the Euclidean distance of each training observation and the UCL.
-
n = size(T,1);
-
D_training = [sqrt(min(sqeucldistm(+T,W.w),[],2)) repmat(W.threshold,n,1)];
-
%Phase II of the KM-chart:
-
%Let now R be the (k x p) matrix of quality variables where k is the number
-
%of observations and p the number of quality variables in the testing phase.
-
%In Phase II we repeat the same computation as in Phase I but here we use
-
%test data and compare it with the UCL to detect out-of-control states.
-
%Compute the Euclidean distance of each test observation.
-
m= size(R,1);
-
D_test = [sqrt(min(sqeucldistm(+R,W.w),[],2)) repmat(W.threshold,m,1)];
-
y1=D_training(:,1); y2=D_training(:,2);x1=(1:n)’;
-
y3=[D_training(:,1); D_test(:,1)];
-
y4 =[D_training(:,2); D_test(:,2)];x2=(1:n+m)’;
-
%Display the KM-chart for Phase I and II.
-
figure;
-
SUBPLOT(2,1,1), plot(x1,y1, ‘-o’, x1,y2, ‘-’); title(‘Phase I of the KM-chart’)
-
SUBPLOT(2,1,2), plot(x2,y3, ‘-o’, x2,y4, ‘-’); title(‘Phase II of the KM-chart’)
After performing PCA, two PCs explaining more than 90% of the variation were retained. Several numbers of clusters were tested for the construction of KMDD-based one-class and to determine the in-control state of the “Cristal Light” process. It is clear from Figure 1 that the number of clusters influences the shape of KMDD-based one-class and plays a pivotal role in determining the trade-off between oversmoothness and undersmoothness of the control boundary. In our application, KMDD-based one-class was constructed with K = 1, since the used sample size was not large.









The detection of an abnormal observation in the target class depends on the shape of the established one-class. The KMDD provided a spherical one-class, while SVDD gave a flexible nonspherical one-class due to the use of SVs. It is worth noticing that the shape of SVDD-based one-class depends on the width of the radial basis function while the shape of KNNDD-based one-class is function of the size of the nearest neighbor, denoted by k. Details about the characteristics of SVDD and KNNDD-based one-classes can be found in Gani and Limam [25].
The KM-chart exceeded its control limit of 6.001 at around the 19th, 25th, 28th, 40th, 48th, and 50th cigarettes, as shown in Figure 2. On the other hand, these out-of-control cigarettes have a distance greater than the radius of the established KMDD-based sphere with K = 1. For these six abnormal cigarettes, at least one of their five quality characteristics did not respect its tolerance interval, as discussed above. In comparison with the two other control charts, the proposed KM-chart succeeded to detect a new out-of-control observation which is cigarette number 48. Both K-chart and KNN-chart failed to detect this abnormal cigarette. Once these out-of-control observations are removed, no additional outliers were detected, and the in-control process was established.



In phase II, five “Cristal Light” cigarettes were used to detect out-of-control states. The KM-chart triggered an alarm at around the 62nd cigarette and remained below its control limit for the last three cigarettes. Cigarette number 62 was also declared by K-chart as an out-of-control observation, while cigarettes number 62 and 65 were declared by KNN-chart as out-of-control observations. Figure 3 shows the discussed OC-charts for phase II.



5. Performance Comparison
A simulation study was conducted to estimate the ARL of OC-charts. In order to be consistent with Gani and Limam [25], we follow their simulation procedure given by the following.
Step 1. Five multivariate normal variables were generated with a mean vector μ = (0.986; 7.650; 0.121; 107.183; 451.527) and a covariance matrix
Step 2. Multivariable shifts, denoted by δ, were introduced in the mean vector according to Table 1. Basically, large values of δ correspond to bigger shifts in the mean. The value δ = 0 is the in-control state.
δ | K-chart | KNN-chart | KM-chart |
---|---|---|---|
0.00 | 200.000 | 200.000 | 200.000 |
0.25 | 100.000 | 200.000 | 192.308 |
0.50 | 66.667 | 200.000 | 185.185 |
0.75 | 66.667 | 100.000 | 166.667 |
1.00 | 50.000 | 40.000 | 147.059 |
1.25 | 50.000 | 14.285 | 117.647 |
1.50 | 40.000 | 9.091 | 86.956 |
1.75 | 28.571 | 3.703 | 60.976 |
2.00 | 22.222 | 1.695 | 39.370 |
2.25 | 13.333 | 1.020 | 25.707 |
2.50 | 6.896 | 1.000 | 16.340 |
2.75 | 4.167 | 1.000 | 9.597 |
3.00 | 2.439 | 1.000 | 5.540 |
3.25 | 1.709 | 1.000 | 3.250 |
3.50 | 1.197 | 1.000 | 1.997 |
3.75 | 1.036 | 1.000 | 1.390 |
4.00 | 1.000 | 1.000 | 1.116 |
5.00 | 1.000 | 1.000 | 1.000 |
For detecting small shifts (δ = 0.25), KM-chart performed better than KNN-chart since it gave an ARL = 192.308, while KNN-chart gave an ARL = 200. For the same shift level, K-chart yielded an ARL = 100, which was better than that of KM-chart.
For detecting moderate shifts (δ = 1), KNN-chart behaved better than the other two control charts since it gave an ARL = 40 against an ARL of 50 and 147.059 of K-chart and KM-chart, respectively.
The difference in sensitivity to shifts in the mean vector between the three OC-charts is due to the difference in the nature of distance used by each control chart. The K-chart uses KD, whereas KM-chart and KNN-chart are based on the Euclidean distance. The advantage of KD in comparison with Euclidean distance lies essentially in the use of the kernel function. The latter is equivalent to the distance between two samples measured in a higher dimensional space. This allows K-chart to easily detect any small shift in the process. In terms of ARL and for small shifts in the mean vector, one can draw the conclusion that our proposed KM-chart is situated between KNN-chart and K-chart (KNN-chart < KM-chart < K-chart). Broadly speaking, each OC-chart has its advantages and disadvantages. For example, K-chart performs better than KM-chart and KNN-chart in quickly detecting changes in the process, while the computational cost of KM-chart and KNN-chart is lower than that of K-chart. Table 2 summarizes the characteristics of each OC-chart.
K-chart | KNN-chart | KM-chart | |
---|---|---|---|
Charting statistic | KD | K2 | D2 |
One-class classifier | SVDD | KNNDD | KMDD |
Parameters | RBF’s width σ | The size of nearest neighbor k | The number of clusters K |
Distance type | Kernel distance | Euclidean distance | Euclidean distance |
Center | Kernel center | KNN center | Cluster center |
Suitable cases | Known and unknown distribution | Known and unknown distribution | Known and unknown distribution |
6. Conclusion
In this paper, we have developed a new OC-chart using KMDD algorithm, called KM-chart. Construction methodology of KM-chart is demonstrated through a real industrial application. Performance analysis of KM-chart in phase I and II showed that our proposed control chart is a competitive SPC tool. In phase I, the proposed KM-chart detected a new abnormal observation which is cigarette number 48. Both K-chart and KNN-chart failed to detect this abnormal cigarette. Based on the ARL criterion, our proposed control chart outperformed KNN-chart in detecting small shifts in the mean vector.
The proposed KM-chart can be extended to monitor nonlinear processes by using the global kernel K-means algorithm instead of using the standard K-means algorithm. The global kernel K-means has the advantage to identify nonlinearly separable clusters and therefore allows KM-chart to monitor sophisticated manufacturing processes.
Conflict of Interests
The authors declare that they do not have a direct financial relation with the software mentioned in this paper and no competing interests.
Acknowledgment
The authors express their appreciation to LARODEC of ISG, University of Tunis, for supporting this paper.
Appendix
The MATLAB Code for KM-Chart
The MATLAB code for the KM-chart requires the PRtools toolbox available at http://www.prtools.org and the dd_tools toolbox available at http://prlab.tudelft.nl/david-tax/dd_tools.html.
For more details see Algorithm 1.