Volume 2018, Issue 1 9204854

Research Article

Open Access

Hand Detection Using Cascade of Softmax Classifiers

Yan-Guo Zhao

Shenzhen Key Laboratory of Virtual Reality and Human Interaction Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China cas.cn

Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen 518055, China ucas.ac.cn


Feng Zheng

Swanson School of Engineering, The University of Pittsburgh, Pittsburgh, PA 15261, USA pitt.edu


Corresponding Author

Zhan Song


orcid.org/0000-0003-3585-6522

Shenzhen Key Laboratory of Virtual Reality and Human Interaction Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China cas.cn

Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen 518055, China ucas.ac.cn

CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China cas.cn

First published: 10 July 2018

https://doi.org/10.1155/2018/9204854

Citations: 2

Academic Editor: Andreas Uhl

Abstract

Sliding-window-based multiclass hand posture detection is often performed by detecting the postures of each predefined category with an independent detector, which is inefficient and results in high posture confusion rates in real-time applications. To tackle these problems, this work investigates an efficient cascade detector that integrates multiple softmax-based binary (SftB) models and a softmax-based multiclass (SftM) model to perform multiclass posture detection in parallel. The SftB models distinguish the predefined postures from the background regions, and the SftM model discriminates among all the predefined hand posture categories. The cascade structure also decomposes the complexity of the background pattern space and therefore improves the detection accuracy. In addition, to balance detection accuracy and efficiency, classifiers at higher stage-levels of the cascade adopt HOG features of higher resolutions. The experiments are carried out under various scenarios with complicated backgrounds and challenging lighting. Results show the superiority of the proposed SftB classifiers over traditional binary classifiers such as logistic regression, as well as the accuracy and efficiency improvements brought by the softmax-based cascade architecture compared with noncascade multiclass softmax detectors.

1. Introduction

Hand detection refers to determining the locations of hands and their shapes. It is a prerequisite step for various hand gesture recognition systems [1, 2], which have been widely studied because of their potential applications in entertainment and virtual reality [3], medical systems, and assistive technologies, as well as in crisis management and disaster relief [4]. However, hand detection is difficult because of hand deformation [5], the sensitivity of skin colors to lighting conditions [6], and the complicated environments of practical applications. As a result, robust and efficient hand detection remains a challenging task in the computer vision community.

Multiclass hand posture detection is worth investigating for several reasons: different users may be accustomed to different postures for interaction; many application systems require multiple postures to realize different functions; and robust detection of human hands from multiple viewpoints can be achieved by letting different posture categories represent postures captured under different viewpoints. One way to deal with multiclass hand detection is first to locate the human hand and then to determine the hand shape by classification. Such methods are usually of low accuracy. For example, locating the human hand with skin color cues is easily affected by the lighting conditions and skin-like backgrounds, which leads to high miss and false detection rates and degrades the accuracy and speed of the follow-up classification. Another example is to train a binary classifier for sliding-window-based hand localization, in which all predefined postures are treated as a single positive class and the background is regarded as the negative class. In this method, the differences in posture shapes increase the pattern complexity of the positive space and consequently lead to a low rejection rate for the background. Another way to perform multiclass hand detection is to build an independent detector for each predefined posture and to detect the predefined postures sequentially with the corresponding posture detectors [7, 8]. This practice has several disadvantages: (a) the computing cost is high, because multiple rounds of detection are required to find the postures of multiple categories; (b) a window image may be assigned to multiple posture categories, which results in heavily overlapping detections; and (c) the detectors are trained independently rather than jointly, which easily causes confusion between different postures.

To improve the performance of multiclass hand posture detection, in this work we provide a softmax-based cascade detector that integrates several SftB classifiers at the early stages and an SftM classifier at the last stage. The advantages of the proposed method include the following: (a) the softmax-based structure makes it possible to perform multiclass posture detection in parallel; (b) the cascade structure helps decompose the complexity of the background pattern space and therefore improves the detection accuracy; (c) the pass-rate of postures and the false rates of background can be adjusted easily by using the binary SftB classifiers (adapted from softmax models) in the first few stages; (d) the SftB-based binary classification relies on the multiple decision surfaces implied by the softmax model and therefore excludes background more strongly than binary classifiers trained with the examples of all defined posture categories as a single positive class; and (e) with the cascaded softmax scheme, the prediction probabilities across multiple stages can be merged to make the final decision, which helps reduce the confusion rates between posture categories. Moreover, stage-classifiers at higher stage-levels take HOG features of higher resolutions to balance the detection accuracy and efficiency. To sum up, the major contributions of this work are as follows:

(1) A softmax-based cascade architecture is proposed to perform multiclass hand posture detection in parallel and, at the same time, to decompose the complexity of the background pattern space to improve the detection accuracy.
(2) The SftB classifier is proposed to better distinguish the predefined postures from the background regions, since it decomposes the complexity of the multiclass posture pattern space through multiclass decision boundaries that are learned jointly.
(3) The cascade is designed to take low-resolution HOG features at the lower stages and HOG features of higher resolutions for stage-classifiers of higher levels, which helps to balance the detection accuracy and efficiency.

The remainder of this paper is organized as follows. Section 2 briefly reviews the existing work on vision-based human hand detection problem. The proposed softmax-based cascade architecture is described in Section 3 in detail. Experimental results and discussions are provided in Section 4. Conclusions and future work are offered in Section 5.

2. Related Work

Vision-based hand detection methods can generally be separated into two groups: appearance-based methods and 3D-model-based methods [2, 7]. The appearance-based methods carry out the detection by directly comparing the image features with prebuilt appearance models. These methods are usually highly efficient, but their performance is easily affected by viewpoint variation and hand deformation. The 3D methods adopt a kinematic model with a high degree of freedom [5, 8]. Such methods offer a richer hand description and can therefore deal with more posture categories, but they are usually computationally expensive because of the complex model matching algorithms. In this work, an appearance-based method is explored to perform multiclass hand posture detection in parallel.

The key to appearance-based methods is to find effective features for hand posture representation and to develop an efficient and expressive posture classification model. Frequently used appearance features include Haar-like [2, 7, 9], HOG [10–12], SIFT [13, 14], and BRIEF [14, 15]. However, such features are seriously affected by cluttered backgrounds, which introduce noise into the feature encodings. For this reason, there is a recent trend to combine multiple feature descriptions, such as the integration of HOG and skin features in [16] and the association of Haar-like and HOG features in [2]. However, the accuracy improvements of such multifeature methods are usually gained at the expense of a considerable increase in computing cost. To improve the efficiency, a two-level classifier is presented in [1], in which the possible presence of hands is determined from a global perspective in the first level, and hand regions are then precisely delineated at the pixel level by a probabilistic model in the second level. In [17], the saliency map generated by a Bayesian model is first thresholded to localize the hand regions, and shape and texture features are then extracted from the saliency map of the hand regions for hand posture recognition. More recently, deep learning (DL) methods have also been investigated for hand posture detection, such as the integration of a CNN scheme with fast candidate generation [18], the multiscale deep feature approach [19], and the deep architecture with three networks sharing convolution layers [20]. However, DL-based methods are much slower than the classical methods when the algorithms run on a machine without advanced GPUs.

The multiclass posture detection problem is often addressed by two-stage methods [20–23], in which hand region proposals are first obtained by techniques such as skin, motion, or saliency detection, which are robust to hand deformation and viewpoint variation, and these regions are then classified by multiple binary models or a single multiclass model to achieve the final posture recognition. For such methods, precise region proposals are a prerequisite for satisfactory recognition rates, yet obtaining precise proposals is itself difficult if no specific posture models are utilized. As a result, the misdetection rate of such methods is often relatively high. The sliding-window-based methods usually perform multiclass posture detection with multiple posture-specific detectors [9, 24]. Such methods may have relatively high recall rates, but they lack efficiency, since each window needs to be classified by multiple detectors, and they suffer from heavy confusion between categories, because the detectors for different categories are trained independently rather than in a coordinated manner. There are also works that adopt a tree-type structure [7], but practical experiments show no significant improvement in accuracy or efficiency. In this work, we propose a softmax-based cascade detector that performs multiclass hand posture detection simultaneously rather than category by category. Moreover, owing to the multiclass objective function, the decision boundaries are obtained by seeking a balance among all categories and can therefore help reduce the confusion rates among different posture categories.

3. The Proposed Methodology

In this section, the softmax model is firstly presented for multiclass classification. Then, the softmax-based cascade architecture is introduced for multiclass hand posture detection. And, finally, we will show how to apply multiresolution features to the cascade architecture to balance the detection accuracy and efficiency.

3.1. Multiclass Hand Posture Classification by Softmax Regression

Instead of utilizing multiple independent binary classifiers, our method applies the softmax model [25] to discriminate among the background category and multiple hand posture categories. To be specific, given the feature vector x_z of image z, the distribution of the class label l(z) ∈ {0, 1, ⋯, P} can be modeled as

$$p(l(z) = p \mid x_z; \Theta) = \frac{\exp\left(\theta_p^{T} \varphi(x_z)\right)}{\sum_{j=0}^{P} \exp\left(\theta_j^{T} \varphi(x_z)\right)}, \quad p = 0, 1, \ldots, P, \tag{1}$$

where Θ = {θ_0, ⋯, θ_P} are model parameters and φ(·) represents the basis functions used for feature transformation. l(z) = p with p ∈ {1, ⋯, P} means that z is an image of the pth posture category, and l(z) = 0 indicates that z is an image of background or undefined postures. In this work, the identity basis functions are adopted; that is, φ(x) = x. For the kernelized softmax model, φ(x) = (k(x, x₁), ⋯, k(x, x_N)), where k(·, ·) is the kernel function and x₁, ⋯, x_N are the features of the training examples. To facilitate the subsequent discussions, the ground-truth label of z is reformulated into a (P + 1)-dimensional vector t = t(z) ∈ {0, 1}^(P+1), whose pth element t_p (0 ≤ p ≤ P) is equal to 1 if l(z) = p and 0 otherwise. Moreover, we use y(·; Θ) to denote the softmax model with parameter Θ, y_p(x; Θ) to denote the probability of class p given by (1), and y(x; Θ) to denote the vector (y₀(x; Θ), ⋯, y_P(x; Θ)) for simplicity. With these notations, the distribution of the label vector t can be formulated as

$$p(t \mid x; \Theta) = \prod_{p=0}^{P} y_p(x; \Theta)^{t_p}. \tag{2}$$

The model parameter Θ can be obtained by maximum likelihood estimation (MLE) [25, 26]. To be specific, given the training set {(z_n, t_n)}, n = 1, ⋯, N, and under the assumption that the examples are independent and identically distributed, the likelihood for the parameters Θ can be formulated as

$$L(\Theta) = \prod_{n=1}^{N} \prod_{p=0}^{P} y_p(x_n; \Theta)^{t_{np}}, \tag{3}$$

where x_n is the feature representation of z_n, t_n is the (P + 1)-dimensional label of example z_n, and t_np is the pth component of t_n. In implementation, Θ is acquired by minimizing the negative log-likelihood as follows:

$$E(\Theta) = -\ln L(\Theta) = -\sum_{n=1}^{N} \sum_{p=0}^{P} t_{np} \ln y_p(x_n; \Theta). \tag{4}$$

Since the loss function in (4) remains unchanged when the same offset vector is added to every θ_p, its minimizer is not unique, and a penalty on Θ should be added to the objective function to suppress the magnitude of the model parameters. Therefore, in practice, we take the loss function with a regularization term as follows:

$$\tilde{E}(\Theta) = E(\Theta) + \lambda \|\Theta\|^{2}, \tag{5}$$

where ‖Θ‖² is the sum of the squared norms ‖θ_p‖² over p = 0, ⋯, P and λ is the regularization coefficient. Finally, we take the efficient iterative BFGS algorithm [27, 28] to find the solution of (5). Once the model parameters Θ are obtained, the prediction of l(z) can be made with the softmax model by

$$\hat{l}(z) = \arg\max_{0 \le p \le P} y_p(x_z; \Theta). \tag{6}$$

This prediction formula will be slightly modified in the next subsection to carry out two-class classification.
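The softmax model of this subsection can be illustrated with a minimal pure-Python sketch. It assumes the identity basis φ(x) = x, stores the parameters as one weight list per class (class 0 is the background), and uses a squared-norm penalty weighted by λ; the function names and the exact form of the penalty are assumptions made for illustration and are not taken from the paper. The optimizer (BFGS in the paper) is omitted.

```python
import math

def softmax_probs(thetas, x):
    """Return [y_0(x), ..., y_P(x)] for the identity basis phi(x) = x."""
    scores = [sum(t_i * x_i for t_i, x_i in zip(theta, x)) for theta in thetas]
    m = max(scores)  # subtract the maximum score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def regularized_nll(thetas, xs, labels, lam):
    """Negative log-likelihood plus lam times the squared norm of all weights.

    The penalty form is an assumption; the paper's exact scaling is not
    recoverable from the text.
    """
    nll = -sum(math.log(softmax_probs(thetas, x)[l]) for x, l in zip(xs, labels))
    penalty = lam * sum(w * w for theta in thetas for w in theta)
    return nll + penalty

def predict(thetas, x):
    """Return the class index with the largest softmax probability."""
    probs = softmax_probs(thetas, x)
    return max(range(len(probs)), key=probs.__getitem__)
```

Adding the same offset to every class weight vector leaves `softmax_probs` unchanged, which is the non-uniqueness that the regularization term is meant to remove.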

3.2. Softmax-Based Cascade Architecture for Human Hand Detection

For multiscale sliding-window-based hand detection, the background pattern space is highly complicated because of the varied background window images. To decompose the complexity of the background space, a softmax-based cascade architecture is introduced, which comprises a set of softmax-based binary (SftB) classifiers B_1(·), ⋯, B_K(·) and a softmax-based multiclass (SftM) classifier B_K+1(·). These classifiers are obtained from the (K + 1) softmax regression models y(·; Θ_1), ⋯, y(·; Θ_K+1), which are learned with a cascade training procedure. The classifiers B_1(·), ⋯, B_K(·), with outputs in {0, 1}, are mainly used to distinguish the defined hand postures from the background window images, where the SftB B_k(·) is formulated as

$$B_k(x) = \begin{cases} 1, & \max_{1 \le p \le P} y_p(x; \Theta_k) - y_0(x; \Theta_k) \ge \xi_k, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$

That is to say, at stage k, the window z is accepted if and only if the maximal probability over the posture categories exceeds the probability of the background category by at least ξ_k. The parameters ξ_1, ⋯, ξ_K are set so that most windows that properly contain the defined postures can get through, and they are determined at the training stage from the preset posture-example pass-rates. Specifically, let H_k(x) denote the margin on the left-hand side of the condition in (7). H_k(·) is computed for each of the N_ps posture examples used for learning Θ_k, the resulting values are sorted in ascending order to produce the vector χ, and the value ξ_k = χ(floor((1 − β_k)N_ps)) is taken as the threshold, where β_k is the preset posture-example pass-rate for the kth stage SftB during the training period. The SftM classifier B_K+1(·), with output in {0, 1, ⋯, P}, takes the form described in (6) and is mainly used to discriminate among the (P + 1) categories, namely the P classes of defined postures and the difficult backgrounds. To speed up the classification, the classifiers B_k(·) can be replaced by the classifiers C_k(·), defined as follows:

(8)

The threshold ζ_k of C_k(·) can be determined in the same way as the threshold ξ_k, that is, from the preset posture-example pass-rate β_k of the kth stage.
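The acceptance rule of an SftB stage and the pass-rate-based choice of its threshold can be sketched as follows. This is an illustrative sketch rather than the authors' code: the helper names are hypothetical, probability vectors are plain lists with the background at index 0, and the sorted vector χ is indexed from zero (the paper does not state its indexing convention).

```python
import math

def margin(probs):
    """Largest posture probability minus the background probability (index 0)."""
    return max(probs[1:]) - probs[0]

def sftb_accept(probs, xi):
    """Return 1 if the window passes the stage, 0 if it is rejected."""
    return 1 if margin(probs) >= xi else 0

def calibrate_threshold(posture_probs, beta):
    """Pick xi so that at least a fraction beta of posture examples pass.

    posture_probs holds the softmax outputs of the posture training examples.
    """
    chi = sorted(margin(p) for p in posture_probs)  # ascending order
    idx = math.floor((1.0 - beta) * len(chi))
    idx = min(idx, len(chi) - 1)  # guard against beta = 0
    return chi[idx]
```

With zero-based indexing, every example whose margin is at least `chi[idx]` passes, so the achieved pass-rate on the calibration examples is never below `beta`.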

The classification of window image z is achieved by a two-step decision process. In the first step, the class label of z is predicted as

(9)

where

represents the feature representation used by B_k(·). The range of

is {0,1, ⋯, P}. When

is 0, the window z will be directly excluded, and the second step will not be carried out any more. In the second step, the class label of window image z accepted by (9) is reidentified as

(10)

where γ(z) is the (P + 1)-dimensional score vector calculated using the softmax models at the high-level stages:

(11)

In the experimental part, k₀ is set at 2. For ease of understanding, the flowchart for the window image classification is provided in Figure 1.

Figure 1: The flowchart of window image classification using the softmax-based cascade classifier.
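The two-step decision can be sketched in code under explicit assumptions, because the exact forms of (9) to (11) are not reproduced here. The sketch assumes that each stage model is a callable returning a probability vector with the background at index 0, that a window is rejected as soon as one SftB stage or the final SftM stage assigns it to the background, and that the score vector γ(z) is the element-wise sum of the stage outputs from stage k₀ onward. The summation rule and all names are hypothetical choices for illustration.

```python
def classify_window(window, models, thresholds, k0):
    """Two-step cascade decision for one window.

    models:     K + 1 callables, each mapping a window to [y_0, ..., y_P]
    thresholds: K margins, one per SftB stage
    k0:         first stage (1-based) whose output enters the merged score
    Returns 0 for background, or a posture label in 1..P.
    """
    outputs = []
    for k, model in enumerate(models, start=1):
        probs = model(window)
        outputs.append(probs)
        if k <= len(thresholds):
            # SftB stage: reject when the posture margin is below the threshold
            if max(probs[1:]) - probs[0] < thresholds[k - 1]:
                return 0
        else:
            # Final SftM stage: reject when background is the most likely class
            if max(range(len(probs)), key=probs.__getitem__) == 0:
                return 0
    # Second step (assumed rule): sum the stage outputs from stage k0 onward
    merged = [sum(column) for column in zip(*outputs[k0 - 1:])]
    return max(range(len(merged)), key=merged.__getitem__)
```

Because rejected windows return early, the later and more expensive stage models are evaluated only for windows that survive the earlier stages.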

3.3. Multiresolution HOG Feature for Different Stage-Classifiers

For sliding-window-based hand detection, there are tens of thousands of window images to be classified in a single frame, which makes the detection system inefficient. To improve the efficiency, multiresolution HOG features are adopted for posture representation in this work [24]. The cascade is designed so that HOG features with low resolutions are utilized by classifiers of lower stage-levels, and HOG features with high resolutions are utilized by classifiers of higher stage-levels. The varied feature resolutions can be achieved by adjusting the density of cell splits in window images, as discussed in [24]. With such a multiresolution scheme, a large number of background windows can be excluded by the classifiers using low-resolution HOG features, and only a few difficult background windows need to be further classified with the HOG features of high resolutions, which are more discriminative but more computationally costly. In this way, the detection speed can be greatly improved without sacrificing the detection accuracy. Concretely, let J_k denote the time consumption for single window classification with B_k(·), and denote the percentage of windows through the kth stage as follows: ρ_k = number of windows through the first k stage-classifiers/number of all windows generated from the full-sized image. Since every window is evaluated by the first stage and only a fraction ρ_k of the windows reaches stage k + 1, the average time expense for classifying one window image under the proposed multiresolution cascade scheme is E₁ = J_1 + ρ_1 J_2 + ⋯ + ρ_K J_K+1. However, if the detection system adopts a single softmax with HOG features of the highest resolution, the time expense would be J_K+1, which is usually several times as much as E₁.
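The average per-window cost can be made concrete with a small calculation. The sketch below assumes that every window is evaluated by stage 1 and that a fraction ρ_k of all windows reaches stage k + 1; the cost and pass-fraction values in the usage note are made-up numbers for illustration, not measurements from the paper.

```python
def cascade_cost(stage_costs, pass_fractions):
    """Average cost of classifying one window with the cascade.

    stage_costs:    [J_1, ..., J_{K+1}], cost of each stage for one window
    pass_fractions: [rho_1, ..., rho_K], fraction of all windows that pass
                    the first k stages
    """
    cost = stage_costs[0]  # every window is evaluated by the first stage
    for rho, stage_cost in zip(pass_fractions, stage_costs[1:]):
        cost += rho * stage_cost  # only surviving windows pay for later stages
    return cost
```

For example, with hypothetical costs `[1, 3, 5, 5]` and pass fractions `[0.1, 0.02, 0.01]`, the cascade costs about 1.45 units per window, compared with 5 units for a single classifier at the highest resolution.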

To aid understanding, the training process of the proposed method is described in detail in Algorithm 1. In Step (1), the training data is prepared and some hyperparameters are defined to control the training process. In Step (2), the first stage-classifier is trained, while the rest of the stage-classifiers are trained one by one in Step (3). During the training of the first stage, the initial N_t negatives are randomly cropped, and all the rest are acquired using hard example mining techniques (Step (2.4)). Such a strategy can enhance the discriminative ability of the first stage-classifier. For stages after the first, all the N_T negatives are directly mined based upon the previous stage-classifier (Step (3.2)). In the kth stage, the multiclass softmax model is learned first, and then, based upon the predefined pass-rate hyperparameter β_k, the modified SftB classifiers B_k(·) and C_k(·) are generated. Once the stage reaches the predefined K + 1, the procedure stops and returns the set of cascade components.

Algorithm 1: The procedure of training the softmax-based cascade.

(1) Prepare the multiclass posture example set and the set of full-sized background images. Specify the control factors {N_t, N_T}, the stage number K + 1, the HOG resolutions for the different stages, the posture-sample pass-rates β_1, ⋯, β_K for the first K stages, and the size of the training samples w × h. Set the current stage level to k = 1 and the set of stage-classifiers to ⌀. Note that all sub-images cropped from the full-sized background images are of size w × h in the training process.
(2) Train the first stage-classifier as follows:
    (2.1) Set X to the posture example set, Z = {N_t sub-images randomly cropped from the background images}, and S = ⌀.
    (2.2) Train a softmax model with the sample sets X, Z and HOG of the specified resolution, and modify the model into two SftB classifiers B₁(⋅) (Eq. (7)) and C₁(⋅) (Eq. (8)) based upon the pass-rate β₁.
    (2.3) Add B₁(⋅) and C₁(⋅) to the set of stage-classifiers, replacing any earlier version. If |Z| > N_T, go to step (3). Otherwise, go to step (2.4). Here |Z| represents the number of examples in Z.
    (2.4) Randomly crop a sub-image z from an image I queried from the background image set. Add z to S if the current first-stage classifier accepts z as a posture, that is, if z is a hard negative. Repeat this process until |S| reaches N_t.
    (2.5) Reset Z = Z ∪ S and S = ⌀, and go to step (2.2).
(3) Train the remaining K stage-classifiers:
    (3.1) Set X to the posture example set and Z = ⌀.
    (3.2) Randomly crop z from a background image. Add z to Z if it is accepted by the stage-classifiers trained so far. Repeat this process until |Z| reaches the predefined N_T.
    (3.3) Train a softmax model with the sample sets X, Z and HOG of the specified resolution. Then, if k + 1 ≤ K, modify this model into the SftB classifiers B_k+1(⋅) and C_k+1(⋅) based on the pass-rate β_k+1.
    (3.4) Add B_k+1(⋅) and C_k+1(⋅) to the set of stage-classifiers, and let k = k + 1.
    (3.5) If k ≤ K, go to (3.1). Otherwise, the cascade training is finished and the procedure stops.

4. Experimental Results and Discussions

The proposed method is evaluated on a dataset that is collected under various scenarios with complex background and challenging light conditions. In this section, we firstly describe the dataset and experimental settings. Then, performances of the proposed SftB classifier and softmax-based cascade are evaluated. And, finally, influences of the settings for posture example pass-rates are discussed.

4.1. Datasets and Experimental Settings

4.1.1. Datasets

The experimental dataset comprises four predefined posture categories. For each category, there are around 2000 positive examples with a normalized size of 80 × 80 pixels. The samples are obtained by cropping hand regions from full images that are collected from ten subjects under various backgrounds and lighting conditions. The negative samples are generated during the training process by randomly cropping image regions from 500 additional full-sized pictures with complicated content. These full-sized images contain various undefined hand postures but no hand posture of the predefined categories. In addition to the training samples, we prepare 4000 full-sized images to evaluate the performance of the proposed method, each of which contains at least one predefined posture instance. Examples of the defined posture categories are presented in Figure 2.

4.1.2. Experimental Settings

In the experiment, training samples are normalized to a resolution of 80 × 80 pixels. HOG features of various resolutions are utilized for classification, where different resolutions are achieved by adopting different cell splits. The cell splits for the 3 adopted resolutions are illustrated in Figure 3. The parameters for the HOG features of all resolutions are fixed as unsigned gradient orientation, 9 equally distributed angle bins, 2 × 2 cells per block, and a block step equal to the cell size. In total, four stage-classifiers are incorporated into the softmax structure. The first three are SftB classifiers, and the last one is the SftM classifier. The feature configuration for each stage-classifier is presented in Table 1.

In addition, to improve the detection efficiency, the window size is varied for the multiscale search rather than resizing the image itself. For example, window sizes of 64 × 64, 80 × 80, 96 × 96, and 120 × 120 could be used to detect hands of different scales in the frame. For a window of size s × s, the region of cell c(i, j) is taken as [x + x1, x + x2, y + y1, y + y2], where (x, y) are the top-left coordinates of the window image, x1 = floor((i − 1)∗s/rc) + 1, x2 = ceil(i∗s/rc), y1 = floor((j − 1)∗s/rc) + 1, y2 = ceil(j∗s/rc), and rc is the number of cells in the horizontal or vertical direction (rc∗rc cells in total, as shown in Figure 3). The cell size therefore changes with the window size. Although this calculation of the cell location is not exact when s is not divisible by rc, the feature is still effective. In video-based detection, if the application scenario requires the users to be near the camera, the window sizes should be larger, while if the users are required to stay far from the camera, the window sizes should be smaller. The window step is set to 0.05 times the window size. For live hand detection, the web camera is set to capture images with 320 × 240 resolution. All experiments are conducted on a PC equipped with an Intel(R) Pentium(R) G3220 CPU @ 3.00 GHz and 4.00 GB RAM, using the Visual Studio 2013 platform.
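The cell-region formulas can be checked with a short sketch. It follows the expressions for x1, x2, y1, and y2 given in the experimental settings, with 1-based cell indices i and j; the function name and the integer-arithmetic form of floor and ceil are choices made here for illustration.

```python
def cell_region(x, y, s, rc, i, j):
    """Pixel region [x + x1, x + x2, y + y1, y + y2] of cell c(i, j).

    (x, y): top-left corner of the s-by-s window
    rc:     number of cells per side
    i, j:   1-based cell indices
    """
    x1 = ((i - 1) * s) // rc + 1   # floor((i - 1) * s / rc) + 1
    x2 = -((-i * s) // rc)         # ceil(i * s / rc)
    y1 = ((j - 1) * s) // rc + 1   # floor((j - 1) * s / rc) + 1
    y2 = -((-j * s) // rc)         # ceil(j * s / rc)
    return (x + x1, x + x2, y + y1, y + y2)
```

For an 80 × 80 window with rc = 5 the cells tile the window exactly (columns 1 to 16, 17 to 32, and so on). For a 64 × 64 window the first two cells span columns 1 to 13 and 13 to 26, so neighbouring cells share a pixel column, which is the inexactness mentioned for window sizes that are not divisible by rc.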

Table 1. The feature configuration for each cascade stage.

Cascade stage	Cell split	Block layout	Feature dimension	Output domain
Stage 1	5×5	4×4	576	{0,1}
Stage 2	8×8	7×7	1764	{0,1}
Stage 3	10×10	9×9	2916	{0,1}
Stage 4	10×10	9×9	2916	{0, ⋯, P}
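The feature dimensions in Table 1 follow from the stated HOG parameters, as the short check below illustrates. It assumes square cell splits, blocks of 2 × 2 cells that slide by one cell (block step equal to the cell size), and 9 orientation bins per cell; the function name is introduced here for illustration.

```python
def hog_dimension(cells_per_side, cells_per_block_side=2, bins=9):
    """Length of the HOG vector for a square cell split.

    With a block step of one cell there are
    (cells_per_side - cells_per_block_side + 1) block positions per side.
    """
    blocks_per_side = cells_per_side - cells_per_block_side + 1
    cells_per_block = cells_per_block_side ** 2
    return blocks_per_side ** 2 * cells_per_block * bins
```

Calling `hog_dimension(5)`, `hog_dimension(8)`, and `hog_dimension(10)` gives 576, 1764, and 2916, which match the block layouts 4×4, 7×7, and 9×9 listed in the table.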

4.2. Effectiveness of the Proposed SftB Classifiers

To evaluate the proposed SftB classifier, we use the softmax and logistic regression (LR) techniques, respectively, to train the first three binary stage-classifiers and produce the final four-stage cascade. During the SftB cascade training period, all samples prepared for the pth stage-classifier C_(p,SftB)(·) are divided into a training set and a testing set. C_(p,SftB)(·) is learned from the training set, and its ROC curve is calculated on the testing set. The ROC curve describes the relation between the false positive rate (FPR) and the true positive rate (TPR): different TPR values on the testing set are achieved by adjusting the threshold ζ_p, and each value of ζ_p in turn yields an FPR on the testing set. Similarly, we can train C_(p,LR)(·) (1 ≤ p ≤ 3). In this way, six ROC curves in total are produced. In addition, an extra ROC curve is generated for an SftB classifier trained on the stage-2 samples using HOG features of the first resolution. All seven ROC curves are displayed in Figure 4, where the notations “stage2&Reso2&LR” and “stage2&Reso2&SftB,” respectively, represent the LR and SftB classifiers trained with HOG features of the second resolution. Other notations can be explained in a similar way.

From Figure 4, we can see that, with the same HOG resolution and for a fixed TPR (Table 4), the FPR (Table 4) of the SftB classifier is much smaller than that of the LR classifier. This is because the SftB is modified from a multiclass classifier, which provides the decision boundaries among different posture categories and can therefore decompose the complex space formed by multiclass posture examples. Moreover, we find that the classifier “Stage2&Reso1&SftB” seriously underperforms the others, which indicates that increasing the resolution of the HOG features is crucial to guarantee the classification accuracy.
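The construction of an ROC curve by sweeping the acceptance threshold can be sketched as follows. The sketch assumes that each test example has a scalar score and is accepted when the score is at least the threshold; the function name and the toy scores in the usage note are hypothetical.

```python
def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold.

    pos_scores: scores of posture test examples
    neg_scores: scores of background test examples
    An example is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = []
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points
```

Lowering the threshold can only accept more examples, so both rates are non-decreasing along the returned list, and the last point is always (1.0, 1.0). For instance, posture scores `[0.9, 0.8, 0.4]` and background scores `[0.7, 0.2, 0.1]` reach a TPR of 2/3 before the first background example is accepted.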

In addition, the histograms of the outputs of the score function in (8) are calculated and presented in Figure 5 to give more insight into the proposed softmax-based binary classification. In the illustration, the upper histogram is calculated from the background examples and the bottom one from the predefined hand posture examples.

4.3. Effectiveness of the Proposed Softmax-Based Cascade Detector

To fully evaluate the proposed method, we compare the performance of the softmax-based cascade and noncascade detectors based on their confusion matrices. The three noncascade softmax detectors used for comparison are trained, respectively, with each of the three HOG feature resolutions illustrated in Figure 3. For the cascade detector, the posture pass-rates for the first three stage-classifiers are set to 98.0%, 98.5%, and 99.0%, respectively. In practice, multiclass posture detection is carried out on the full-sized testing images with each of the four detectors (one cascade and three noncascade) using the multiscale sliding window scheme. For each detector, all rectangular regions that are classified into the same category are postprocessed by non-maximum suppression to determine the final locations of the posture instances. The (P + 1)×(P + 1) confusion matrix W for a detector D is computed from the final results produced by detector D. With zero-based indexes, the elements of W are defined as follows:

(12)

where the elements are defined in terms of the following counts: the number of instances that belong to the pth posture category but are predicted into the jth category; the number of instances from the pth posture category that are not predicted into any of the defined categories; the number of posture instances from the pth posture category; the number of background regions that are predicted into the pth posture category; the number of full-sized pure background images used for evaluation; and the number of full-sized pure background images that do not contain false detections. A pure background image is an image that does not contain instances of the predefined posture categories. A detected region R is a correct detection of a posture instance if and only if (a) the predicted class of R equals the ground-truth class of the instance and (b) the overlap ratio between R and the ground-truth region of the instance is larger than 0.6.
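The correct-detection criterion can be sketched in code. The text does not specify how the overlap ratio is computed, so the sketch assumes intersection over union (IoU) of axis-aligned boxes given as (x1, y1, x2, y2); both the IoU choice and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2), x2 > x1, y2 > y1."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_correct_detection(det_box, det_label, gt_box, gt_label, min_overlap=0.6):
    """True when the labels agree and the overlap ratio exceeds min_overlap."""
    return det_label == gt_label and iou(det_box, gt_box) > min_overlap
```

A detection whose label matches but whose overlap is 0.6 or lower is counted as incorrect, consistent with the strict inequality in criterion (b).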

The four confusion matrices corresponding to the four detectors are presented in Table 2, where Softmax+Resolution1, Softmax+Resolution2, and Softmax+Resolution3, respectively, denote the confusion matrices computed from the three noncascade detectors. Note that the confusion matrix here is different from that of a classification problem. In sliding-window-based detection, one target instance may be covered by many windows, and the postprocessing is only applied to windows that are classified into the same category. As a result, one region can finally be predicted into more than one posture category. For this reason, the sum of the elements in each row does not necessarily equal one.

Table 2. Confusion matrices of the detection results computed with the single-resolution softmax detectors and the multiresolution cascade detector. Note that the row elements of a confusion matrix for a detection problem need not sum to 1.

	Softmax+Resolution1					Softmax+Resolution2
	BK	vict	close	open	fist	BK	vict	close	open	fist
BK	0.7104	0.1480	0.1211	0.0628	0.1469	0.8205	0.0843	0.0291	0.0326	0.1145
vict	0.0420	0.9136	0.2006	0.2157	0.2585	0.0706	0.9001	0.0492	0.0856	0.1776
close	0.0272	0.5438	0.9251	0.5447	0.4443	0.0017	0.4477	0.9626	0.4077	0.4911
open	0.0532	0.3291	0.2181	0.9070	0.3531	0.0630	0.1987	0.0532	0.9055	0.2234
fist	0.0183	0.7114	0.4862	0.5146	0.9233	0.0342	0.4370	0.1935	0.2494	0.9391

	Softmax+Resolution3					The Proposed Softmax Cascade
	BK	vict	close	open	fist	BK	vict	close	open	fist
BK	0.8369	0.0889	0.0090	0.0287	0.0983	0.9884	0.0061	0.0022	0.0033	0.0053
vict	0.0761	0.8993	0.0214	0.0928	0.2292	0.0547	0.9294	0.0032	0.0389	0.0151
close	0.0026	0.4323	0.9574	0.3872	0.4655	0.0153	0.0383	0.9838	0.1319	0.0502
open	0.0862	0.1717	0.0187	0.9033	0.2219	0.0300	0.0682	0.0090	0.9700	0.0787
fist	0.0350	0.3361	0.0984	0.1435	0.9299	0.0609	0.1093	0.0459	0.0300	0.8957

From Table 2, we can see that hand detection with the noncascade softmax detectors causes high false detection rates in background areas and high confusion rates among different posture categories. By contrast, the proposed softmax-based cascade significantly suppresses all kinds of false detections without sacrificing the mean recall rate. This is because the complexity of the background space is effectively decomposed by the use of multiple stage-classifiers, so it becomes much easier for the final multiclass softmax model to discriminate among the predefined postures and the small number of remaining background regions.

To make more direct and intuitive comparisons, several summary measures are also computed and provided in Table 3. The mean recall rate and the mean confusion rate represent, respectively, the averaged recall rates and the averaged confusion rates among the four predefined posture categories. For the definitions of FPPI and mean correct rate, please refer to Table 4.

Table 3. Performance comparison between the noncascade softmax detectors (rows 1 to 3) and the cascade softmax detector (row 4).

Detector	Mean recall rate	Mean confusion rate	FPPI	Mean correct rate	Time consumption for a 240×320 image
Softmax+Resolution1	0.9172	0.4017	0.4788	0.3548	25.16 ms
Softmax+Resolution2	0.9268	0.2512	0.2605	0.4812	63.39 ms
Softmax+Resolution3	0.9225	0.2182	0.2248	0.5155	99.45 ms
The proposed softmax cascade	0.9448	0.0515	0.0169	0.8475	27.17 ms
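As a cross-check, three of the summary measures in Table 3 can be recomputed from the cascade confusion matrix in Table 2. The formulas below are assumptions inferred from the numbers rather than definitions quoted from the text: the mean recall rate is taken as the average of the posture diagonal, the mean confusion rate as the average of the off-diagonal posture-to-posture entries, and the FPPI as the sum of the posture columns in the BK row. The function name `summary_measures` is hypothetical.

```python
# Confusion matrix of the proposed softmax cascade, copied from Table 2.
# Row and column order: BK, vict, close, open, fist.
W_CASCADE = [
    [0.9884, 0.0061, 0.0022, 0.0033, 0.0053],
    [0.0547, 0.9294, 0.0032, 0.0389, 0.0151],
    [0.0153, 0.0383, 0.9838, 0.1319, 0.0502],
    [0.0300, 0.0682, 0.0090, 0.9700, 0.0787],
    [0.0609, 0.1093, 0.0459, 0.0300, 0.8957],
]


def summary_measures(w):
    """Return (mean recall, mean confusion, FPPI) under the ASSUMED definitions."""
    postures = range(1, len(w))  # index 0 is the background class
    # Average of the diagonal entries of the posture rows.
    mean_recall = sum(w[p][p] for p in postures) / len(postures)
    # Average of the posture-to-posture off-diagonal entries.
    confusions = [w[p][j] for p in postures for j in postures if j != p]
    mean_confusion = sum(confusions) / len(confusions)
    # Background row, posture columns: false detections per background image.
    fppi = sum(w[0][j] for j in postures)
    return mean_recall, mean_confusion, fppi


mean_recall, mean_confusion, fppi = summary_measures(W_CASCADE)
print(f"mean recall    {mean_recall:.4f}")
print(f"mean confusion {mean_confusion:.4f}")
print(f"FPPI           {fppi:.4f}")
```

Under these assumptions the recomputed values agree with the last row of Table 3 (0.9448, 0.0515, and 0.0169) to within about 0.0001, with the small differences attributable to the rounding of the matrix entries. The mean correct rate is not recomputed here because its definition is not available in the text.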

Table 4. List of acronyms, definitions, and terminology interpretation.

FPPI	the abbreviation of false positives per image
FPPW	the abbreviation of false positives per window
Mean correct rate
Detection rate
Case1	the case in which pass-rate of higher stage is larger than the pass-rate of lower stage
Case2	the case in which the pass-rate of higher stage is smaller than the pass-rate of lower stage
TPR	the abbreviation of true positive rate
FPR	the abbreviation of false positive rate
LR	the abbreviation of logistic regression

From Table 3, we can see that Softmax+Resolution3 achieves the highest mean correct rate and the lowest mean confusion rate and FPPI among the three noncascade detectors. By comparison, the proposed multiclass cascade detector further improves the mean recall rate from 0.9225 to 0.9448 and boosts the mean correct rate from 0.5155 to 0.8475. Meanwhile, the mean confusion rate is reduced from 0.2182 to 0.0515, and the FPPI is reduced from 0.2248 to 0.0169. In addition, the proposed detector is about 3.7 times faster than Softmax+Resolution3 (27.17 ms versus 99.45 ms per 240×320 image).

Figure 6 shows some hand posture detection results obtained with an ordinary web camera. From the results, we can see that the proposed method can detect the defined hand postures in various environments. The system reaches a real-time running speed of 27 FPS under our experimental setup.

4.4. The Influences of the Settings for Posture Example Pass-Rates

The performance of the proposed cascade is directly affected by the thresholds ζ_k of its stage-classifiers, as shown in (8). The thresholds affect not only the detection results but also the training process, since the background samples for the pth stage are acquired by the previous (p − 1) stage-classifiers. These thresholds are determined by the pass-rates of posture samples, which are set at the training stage to control the training process. Specifically, for H_k(·), the outputs can be computed on the posture example set used for learning Θ_k; sorting them in ascending order produces a vector χ, and the value ξ_k = χ(floor((1 − β_k)N_ps)) is taken as the threshold, where β_k is the preset pass-rate of posture examples for the kth stage SftB classifier during training. To obtain a better cascade detector, we prepare multiple groups of pass-rate settings and train the four-stage cascade classifier with each group. After that, the FPPW and detection rate (Table 4) are computed for each of these cascade detectors, and the best group of settings is selected by comparing the FPPW and detection rate values. Note that the detection rate does not necessarily equal the mean correct rate, since confused detections may exist among different posture categories.
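The threshold rule ξ_k = χ(floor((1 − β_k)N_ps)) can be sketched as follows. This is a minimal illustration under stated assumptions: the scores are taken to be the stage-classifier outputs on the posture examples, the index into the sorted vector is treated as zero-based, and an example is taken to pass the stage when its score is at least the threshold. The function names are hypothetical.

```python
import math


def threshold_from_pass_rate(scores, beta):
    """Pick the stage threshold so that roughly a fraction beta of examples pass.

    scores: stage-classifier outputs on the posture examples (ASSUMED input).
    beta:   preset pass-rate of posture examples, for example 0.98.
    """
    chi = sorted(scores)                       # ascending order
    n_ps = len(chi)                            # number of posture examples
    index = math.floor((1.0 - beta) * n_ps)    # zero-based index (ASSUMED)
    index = min(index, n_ps - 1)               # guard for beta == 0
    return chi[index]


def pass_rate(scores, threshold):
    """Fraction of examples whose score reaches the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


# Toy scores 0.00, 0.01, ..., 0.99 standing in for classifier outputs.
toy_scores = [i / 100 for i in range(100)]
xi = threshold_from_pass_rate(toy_scores, beta=0.98)
print(xi, pass_rate(toy_scores, xi))
```

With 100 distinct toy scores and β = 0.98, the two lowest-scoring posture examples fall below the threshold and the remaining 98 pass, so the realized pass-rate matches the preset value. With tied scores the realized pass-rate can be slightly higher than β.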

The six groups of pass-rates being compared are, respectively, [97%#98%#99%], [99%#98%#97%], [94%#96%#98%], [98%#96%#94%], [95%#97%#99%], and [99%#97%#95%]. The notation [97%#98%#99%] means that, for the first three stage-classifiers, the pass-rate of posture examples is successively set to 97%, 98%, and 99%. Each group contains exactly three pass-rate settings because there are exactly four stage-classifiers in each cascade detector, and the fourth stage is a multiclass softmax model whose setting is not modified. The curves showing how FPPW varies with the stage-level are presented in Figure 7, and the detection rates are illustrated in Figure 8. Besides the fact that FPPW and detection rate both increase with the product of the three pass-rates, we make another important observation: when the product of the three pass-rates is fixed, the detection rate in Case1 (Table 4) is significantly higher than that in Case2 (Table 4), while the FPPW values in the two cases are very close. This indicates that the detectors trained in Case1 are more discriminative than those trained in Case2. This observation suggests that, to achieve good performance, it is better to set lower pass-rates for classifiers at low stages and higher pass-rates for classifiers at higher stages.
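The pairing of the six groups can be made explicit with a short calculation. The sketch below only reproduces the arithmetic on the pass-rates listed above; the function name `describe` and the label "mixed" are hypothetical, and Case1 and Case2 follow the wording of Table 4 (pass-rates increasing or decreasing with the stage-level).

```python
# Pass-rates (in percent) for the first three stage-classifiers of each group.
GROUPS = [
    (97, 98, 99), (99, 98, 97),
    (94, 96, 98), (98, 96, 94),
    (95, 97, 99), (99, 97, 95),
]


def describe(group):
    """Return (product of the three pass-rates, case label) for one group."""
    first, second, third = group
    # Multiply the integer percentages first so that reordered groups
    # give exactly the same product.
    product = first * second * third / 100 ** 3
    if first < second < third:
        case = "Case1"   # higher stages have larger pass-rates
    elif first > second > third:
        case = "Case2"   # higher stages have smaller pass-rates
    else:
        case = "mixed"
    return product, case


for group in GROUPS:
    product, case = describe(group)
    print(group, f"product = {product:.6f}", case)
```

Each Case1 group shares its product with the Case2 group that lists the same pass-rates in reverse order (0.941094, 0.884352, and 0.912285), which is what allows the two cases to be compared at a fixed product.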

5. Conclusion and Future Work

In this work, a softmax-based cascade detector is proposed to perform multiclass hand posture detection in parallel. The cascade contains several SftB classifiers, which distinguish all predefined postures from the backgrounds, and an SftM classifier, which mainly discriminates among the predefined hand postures. Moreover, HOG features of increasing resolution are adopted by the stage-classifiers at increasing stage-levels so as to further reduce the computational cost without sacrificing the detection accuracy. Experimental comparison of ROC curves demonstrates the superiority of the proposed SftB classifier, and evaluation results on a challenging dataset indicate that the proposed model structure improves both accuracy and efficiency compared with the noncascade multiclass posture detection methods. In future work, we will replace the softmax-based stage-classifiers in the cascade with more expressive classification models, such as convolutional neural networks, to further improve the accuracy of single-stage classification.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (no. 2017YFB1103602), the National Natural Science Foundation of China (nos. 51705513, U1613213, and U1713213), Shenzhen Science Plan (KQJSCX20170731165108047 and JCYJ20170413152535587), and Shenzhen Engineering Laboratory for 3D Content Generating Technologies (no. [2017]476).

Appendix

Acronyms, Definitions, and Terminology

See Table 4.
