[Retracted] Scene Classification Using Deep Networks Combined with Visual Attention
Abstract
In view of the complexity and diversity of scenes in scene classification, this paper makes full use of the contextual semantic relationships between objects to describe the visual attention regions of a scene and combines them with deep convolutional neural networks to construct a scene classification model based on visual attention and deep networks. First, the visual attention regions in the scene image are marked using a context-based saliency detection algorithm. Then, the original image and the visual attention region detection image are superimposed to obtain a visual attention region enhancement image. Next, the deep convolutional features of the original image, the visual attention region detection image, and the visual attention region enhancement image are extracted using deep convolutional neural networks pretrained on the large-scale scene image dataset Places. Finally, deep visual attention features are constructed from the multilayer deep convolutional features of the networks, and a classification model is built. To verify the effectiveness of the proposed model, experiments are carried out on four standard scene datasets: LabelMe, UIUC-Sports, Scene-15, and MIT67. The results show that the proposed model improves classification performance well and has good adaptability.
1. Introduction
As a basic problem in the fields of computer vision and image understanding, scene image classification has received extensive attention and research [1–7]. The most important problem to be solved in scene classification is to express the content of the scene properly. To improve the accuracy of scene classification, researchers have constantly explored new approaches: combining global, local, and mid-level features to form a bag of visual words that represents the scene image [8]; obtaining discriminative image blocks through iteration and cross-validation as a mid-level image representation [9]; and using the mean-shift algorithm to find distinguishing patterns in the image block distribution space so as to create a mid-level scene representation [10]. By establishing metric learning formulas and learning the best metric parameters, online metric learning and parallel optimization for large-scale, high-dimensional data can also be addressed [2]. Although these methods have achieved certain classification results, their performance degrades when the scene image contains many objects or complex content.
In recent years, the proposal of deep convolutional networks has made it possible to obtain richer high-level semantics of images [11–13]. Donahue et al. directly used convolutional neural networks (CNNs) pretrained on the ImageNet dataset for scene classification [14]. Zhou et al. constructed a large-scale scene-centric dataset and trained a convolutional neural network on it [15], which significantly improved scene classification performance. Bai used CNN transfer learning to obtain deep features that express specific scene targets for classification [16]. Zou et al. built a fusion method based on nonnegative matrix factorization, which preserves the nonnegativity of features and improves their representation performance; furthermore, an adaptive feature fusion and boosting algorithm was developed to improve the efficiency of image features, with two versions of the proposed fusion method for nonnegative single-feature fusion and multifeature fusion [1].
Zhang et al. proposed a spatially aware aggregation network for scene classification, which detects a set of visually and semantically significant regions from each scene through a semisupervised, structure-preserving nonnegative matrix factorization (NMF). A gaze shift path (GSP) is used to characterize how humans perceive each scene image, and a spatially aware CNN called SA-NET is developed to describe each GSP in depth. Finally, the deep GSP features learned from the whole scene image are integrated into an image kernel, which is combined with a kernel SVM to classify the scene [3]. Yee et al. proposed a DeepScene model that uses a convolutional neural network as the base architecture, converts grayscale scene images to RGB images, and incorporates Spatial Pyramid Pooling into the network [17]. These methods have greatly improved scene classification.
However, most current algorithms regard a scene as a combination of multiple objects [18–20] and lack a description of the contextual semantic relations between objects, which restricts the accuracy of scene classification [21]. For this purpose, this paper uses a context-based saliency detection algorithm [22] to annotate the visual attention regions of the scene; these regions contain the main targets in the scene together with part of the background that expresses their context. Combined with deep convolutional neural networks, a scene classification model that fuses deep features with visual attention is constructed. It overcomes the limitation of classifying with object and structural features alone and obtains good scene classification performance.
2. The Construction of Scene Classification Model
To adapt to the diversity of images, this paper treats the context-based saliency of an image as its visual attention feature, superimposes it onto the original image, and feeds the result into a deep convolutional network to build the scene classification model. In this way, the model expresses the deep intrinsic characteristics of the image and, at the same time, the contextual semantic relationships between the targets in the scene.
2.1. Detection of Visual Attention Regions
The area that has a major influence on visual judgment is called the visual attention region. Here, the context-based saliency detection algorithm proposed by Goferman et al. is used to extract the visual attention regions of the image [22]. The extracted salient regions take full account of global and local features at different scales and mark salient targets together with their adjacent areas to varying degrees. This reflects well the contextual relationship between the objects in the scene and the surrounding scenery and filters out some repetitive texture information.
Figure 1 is an example of detecting the visual attention region. The brightness value in Figure 1(b) indicates the visual attention at each position. It can be seen that the attention of the background area varies with its closeness to the target.
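To make the detection step concrete, the following is a minimal, single-scale sketch in the spirit of the context-aware saliency of Goferman et al. [22]: a patch is salient when even its most similar patches differ strongly in appearance, with dissimilarity damped by spatial distance. The patch size, the number of neighbors `k`, and the constant `c` are illustrative assumptions, and the full method additionally averages over several scales and applies a context-based refinement step that is omitted here.

```python
# A minimal, single-scale sketch of context-aware saliency in the spirit of [22].
import numpy as np
from scipy.spatial.distance import cdist

def context_saliency(image, patch=8, k=64, c=3.0):
    """image: HxWx3 float array in [0, 1]; returns a coarse saliency grid."""
    h, w, _ = image.shape
    ph, pw = h // patch, w // patch
    feats, coords = [], []
    for i in range(ph):
        for j in range(pw):
            block = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            feats.append(block.reshape(-1))      # appearance of the patch
            coords.append((i / ph, j / pw))      # normalized patch position
    feats, coords = np.stack(feats), np.array(coords)

    d_color = cdist(feats, feats)                # appearance dissimilarity
    d_pos = cdist(coords, coords)                # spatial distance between patches
    d = d_color / (1.0 + c * d_pos)              # nearby patches weigh more

    # a patch is salient if even its k most similar patches are dissimilar
    nearest = np.sort(d, axis=1)[:, 1:k + 1]     # drop the zero self-distance
    return (1.0 - np.exp(-nearest.mean(axis=1))).reshape(ph, pw)
```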


2.2. Construction of Visual Attention Region Enhancement Images
Although the scene information contained in the original image is comprehensive, it does not distinguish effective information from invalid information. In order to enhance, to different degrees, the scene areas that carry different information, the visual attention region detection map is superimposed on the original image to obtain the visual attention region enhancement image.
Figure 2 shows the enhancement of the visual attention region. It can be seen that different areas in the scene receive different visual attention, and some repeated textures and less distinctive regional information in the scene are effectively suppressed, such as the passage in the airport scene and the decorative paintings in the bedroom. While preserving visual attention, the superimposed image supplements the information of some gray areas (transitional areas between attended and unattended regions).
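The paper does not spell out the exact superposition rule; the sketch below shows one plausible blending in which low-attention regions are attenuated but not removed, so the gray transitional areas mentioned above are preserved. The blending weight `alpha` is an assumption.

```python
# One plausible superposition rule: attenuate low-attention regions while
# keeping part of every region; the blending weight alpha is an assumption.
import numpy as np

def enhance_with_attention(image, saliency, alpha=0.5):
    """image: HxWx3 float in [0, 1]; saliency: HxW float in [0, 1]."""
    weight = alpha + (1.0 - alpha) * saliency[..., None]
    return np.clip(image * weight, 0.0, 1.0)
```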

2.3. Deep Feature Fusion
In order to describe the content attributes of scene images effectively, the AlexNet network model pretrained on the large-scale scene dataset Places is used to extract the deep convolutional features of the original image, the visual attention region detection image, and the visual attention region enhancement image. In addition, since different layers of a deep convolutional network give different abstract expressions of the input data, the outputs of multiple fully connected layers are used in this paper to form the deep fusion feature as the final expression of the scene image.
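A sketch of the feature extraction and multilayer fusion step is given below using torchvision's AlexNet; the Places-pretrained weights are assumed to be loaded separately (the line is only indicated), and concatenating the fc6 and fc7 activations is an illustrative fusion choice rather than the paper's exact recipe.

```python
# Sketch of multilayer fully connected feature fusion with torchvision's AlexNet;
# Places-pretrained weights are assumed to be available separately.
import torch
import torchvision.models as models
import torchvision.transforms as T

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
# alexnet.load_state_dict(torch.load("alexnet_places.pth"))  # assumed Places weights

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fused_fc_features(pil_image):
    """Return concatenated fc6 and fc7 activations (8192-D) for one image."""
    x = preprocess(pil_image).unsqueeze(0)
    x = alexnet.avgpool(alexnet.features(x)).flatten(1)        # 9216-D conv output
    fc6 = alexnet.classifier[2](alexnet.classifier[1](x))      # Linear + ReLU
    fc7 = alexnet.classifier[5](alexnet.classifier[4](fc6))    # Linear + ReLU
    return torch.cat([fc6, fc7], dim=1).squeeze(0)
```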

Finally, the deep visual attention features of the training images in the target dataset are fed into a random forest to train the classifier, and the trained classifier is used for scene classification. Because the extracted features capture both the contextual semantic relationships between objects in the image and the intrinsic deep characteristics of the scene, the effectiveness of scene classification is greatly improved.
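The final step can be sketched as follows: the fused features of the original, detection, and enhancement images are concatenated per image and passed to a random forest. The toy data, feature dimensions, and tree count below are illustrative assumptions (the real fused features would be 8192-D per image version).

```python
# Sketch of the classification step with a random forest; dimensions and data are toys.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def deep_attention_feature(orig_f, sal_f, enh_f):
    """Concatenate features of the original, detection, and enhancement images."""
    return np.hstack([orig_f, sal_f, enh_f])

rng = np.random.default_rng(0)
X = np.stack([deep_attention_feature(rng.normal(size=512),
                                     rng.normal(size=512),
                                     rng.normal(size=512)) for _ in range(80)])
y = rng.integers(0, 8, size=80)                 # e.g., 8 scene categories as in OT

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```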
3. Experimental Results and Analysis
3.1. Datasets and Experimental Settings
- (i)
The LabelMe (OT) dataset contains 2688 color images of 8 categories, all 256 × 256 in size. In each category, 200 images were randomly selected for training, and the remaining images were used for testing
- (ii)
UIUC-Sports (SE) dataset contains 1579 color sports scene images of different sizes in 8 categories. Each category was randomly assigned 70 images for training and 60 images for testing
- (iii)
Scene-15 (LS) dataset contains a total of 4485 indoor and outdoor scene images of 15 categories, of which 8 categories are the same as the LabelMe dataset. 100 images were randomly used in each category for training, and the rest of the images were used as test images
- (iv)
MIT67 (IS) is a challenging indoor scene image dataset containing a total of 15,620 images in 67 categories. In each category, 80 images were randomly selected for training and 20 images for testing; a sketch of this per-class random split is given after this list
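As referenced above, the per-class random split common to all four datasets can be sketched as follows; the sample representation (path, label) and the function name are illustrative assumptions.

```python
# Sketch of the per-class random train/test split used in the experimental settings.
import random
from collections import defaultdict

def split_per_class(samples, n_train, seed=0):
    """samples: list of (path, label) pairs; returns (train, test) lists."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))
    rng = random.Random(seed)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        train.extend(items[:n_train])   # 200 (OT), 70 (SE), 100 (LS), 80 (IS)
        test.extend(items[n_train:])
    return train, test
```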

3.2. Classification Performance Evaluation
Figure 5 compares, on the four datasets and with the same classification method, the classification results of deep learning features extracted without visual attention region detection against the features proposed in this paper.

It can be seen that the features proposed in this paper improve performance on all the test datasets, and the classification accuracy is most significantly improved on the LS dataset, mainly because that dataset contains both indoor and outdoor scenes, which indicates that the algorithm is general. In addition, the classification of simple outdoor scenes is also significantly improved. However, the improvement on the SE and IS datasets is limited, mainly because their scenes contain many objects and the context relationships of prominent objects are complex. People in many scenes are not the main objects for distinguishing the scene, but they are sometimes enhanced as prominent objects, which interferes with the discrimination of the scene content. In particular, scene discrimination in the SE dataset is mainly determined by the relationship between the characters' actions and the scene, while the characters' actions are sometimes very similar across scenes. Therefore, the effect of the visual attention region detection algorithm on these datasets is limited.
Precision and recall were used to evaluate and analyze the OT, SE, and LS datasets, respectively.
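For reference, the per-class precision and recall reported in Tables 1–3 can be computed from a confusion matrix whose rows are true classes and whose columns are predicted classes; the small 3 × 3 matrix below is a toy example only.

```python
# Per-class precision and recall from a confusion matrix (rows: true, columns: predicted).
import numpy as np

def precision_recall(conf):
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    precision = tp / conf.sum(axis=0)   # column sums: everything predicted as the class
    recall = tp / conf.sum(axis=1)      # row sums: all true samples of the class
    return precision, recall

p, r = precision_recall([[154, 2, 4], [0, 127, 1], [7, 5, 197]])
print(np.round(100 * p, 1), np.round(100 * r, 1))
```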
Table 1 shows the confusion matrix obtained in one test of the proposed method on the OT dataset. It can be seen that the method achieves 100% precision and recall in the "MITinsidecity" class and also achieves a good classification effect for the other categories. The "MITopencountry" class is easily confused with the "MITcoast" class. Figure 6 shows some misclassified images from the OT dataset. Both images of "MITopencountry" were misclassified as "MITcoast," because the context relationship between sky and land in "MITopencountry" is similar to that between sky and coast in "MITcoast," and lawns and deserts on the slopes have a texture similar to that of the sea surface.
Category | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Recall (%) |
---|---|---|---|---|---|---|---|---|---|
MITcoast (1) | 154 | 2 | — | — | — | 4 | — | — | 96.3 |
MITforest (2) | — | 127 | — | — | 1 | — | — | — | 99.2 |
MIThighway (3) | — | — | 59 | — | — | 1 | — | — | 98.3 |
MITinsidecity (4) | — | — | — | 108 | — | — | — | — | 100 |
MITmountain (5) | — | 1 | — | — | 171 | 2 | — | — | 98.3 |
MITopencountry (6) | 7 | 5 | — | — | 1 | 197 | — | — | 93.8 |
MITstreet (7) | — | — | — | — | — | — | 91 | 1 | 98.9 |
MITtallbuilding (8) | — | — | — | — | — | — | — | 156 | 100 |
Precision (%) | 95.7 | 94.1 | 100 | 100 | 98.8 | 96.6 | 100 | 99.4 |

Table 2 shows the confusion matrix obtained by the proposed method in one test on the LS dataset. The most confused category pairs are "Bedroom" with "Livingroom" and "MITtallbuilding" with "Industrial." Figure 7 shows some misclassified images from the LS dataset. In Figure 7(a), a scene image of "Industrial" is misclassified as "MITtallbuilding," because the high-rise structures in the image are very similar in appearance to tall buildings and their context relationship with the surrounding environment is similar to that of "MITtallbuilding." In Figure 7(b), a scene image of "PARoffice" is misclassified as "Kitchen," because the cabinet in the upper part of the image has a position and appearance similar to a kitchen cabinet, and its context relationship with the desktop is similar to that in "Kitchen," thus causing the misclassification.
Category | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | Recall (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bedroom (1) | 107 | — | — | — | 9 | — | — | — | — | — | — | — | — | — | — | 92.2 |
CALsuburb (2) | — | 141 | — | — | — | — | — | — | — | — | — | — | — | — | — | 100 |
Industrial (3) | — | — | 192 | 1 | 1 | — | — | — | 3 | — | — | 1 | 5 | — | 8 | 91.0 |
Kitchen (4) | — | — | — | 103 | 2 | — | — | — | — | — | — | — | — | 4 | 1 | 93.6 |
Livingroom (5) | 4 | — | — | 2 | 179 | — | — | — | 1 | — | — | — | — | 2 | 1 | 94.7 |
MITcoast (6) | — | — | — | — | — | 252 | 2 | — | — | 1 | 5 | — | — | — | — | 96.9 |
MITforest (7) | — | — | — | — | — | — | 221 | — | — | 6 | 1 | — | — | — | — | 96.9 |
MIThighway (8) | — | — | 1 | — | — | — | — | 158 | — | — | 1 | — | — | — | — | 98.8 |
MITinsidecity (9) | — | 2 | 5 | 3 | — | — | — | — | 194 | — | — | 3 | 1 | — | — | 93.3 |
MITmountain (10) | — | — | — | — | — | — | — | — | — | 273 | 1 | — | — | — | — | 99.6 |
MITopencountry (11) | — | — | — | — | — | 6 | 1 | — | — | 6 | 297 | — | — | — | — | 95.8 |
MITstreet (12) | — | — | 3 | — | — | — | — | — | 2 | — | — | 186 | — | — | 1 | 96.9 |
MITtallbuilding (13) | — | — | 9 | — | — | — | 2 | — | 1 | — | — | — | 243 | — | 1 | 94.9 |
PARoffice (14) | 1 | — | — | 3 | 1 | — | — | — | 1 | — | — | — | — | 109 | — | 94.8 |
Store (15) | — | — | 1 | 1 | 1 | — | — | — | — | — | — | — | 1 | — | 211 | 98.1 |
Precision (%) | 95.5 | 98.6 | 91.0 | 91.2 | 92.7 | 97.7 | 97.8 | 100 | 96.0 | 95.5 | 97.4 | 97.9 | 97.2 | 94.8 | 94.6 |


Table 3 shows the confusion matrix obtained in one test of the proposed method on the SE dataset. It can be seen that the method achieves 100% precision and recall in the "Sailing" class. "Bocce," however, has the lowest precision and recall and is most easily confused with "Croquet," mainly because identifying these two scenes depends chiefly on the characters' movements and their relationship to the scene, and the characters' movements and surrounding environments of these two sports are very similar, so misjudgment is easy.
Category | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Recall (%) |
---|---|---|---|---|---|---|---|---|---|
Badminton (1) | 57 | 3 | — | — | — | — | — | — | 95.0 |
Bocce (2) | 3 | 45 | 9 | — | 2 | — | — | 1 | 75.0 |
Croquet (3) | — | 5 | 54 | — | — | 1 | — | — | 90.0 |
Polo (4) | 1 | 3 | — | 56 | — | — | — | — | 93.3 |
RockClimbing (5) | — | — | — | — | 59 | — | — | 1 | 98.3 |
Rowing (6) | — | — | — | — | — | 60 | — | — | 100 |
Sailing (7) | — | — | — | — | — | — | 60 | — | 100 |
Snowboarding (8) | 1 | — | — | — | — | — | — | 59 | 98.3 |
Precision (%) | 91.9 | 80.4 | 85.7 | 100 | 96.7 | 98.4 | 100 | 96.7 |
3.3. Comparison of Experimental Results
The experimental results of the proposed method on the four standard scene datasets were compared with those of reference methods.
The comparison test results on the OT dataset are shown in Table 4. It can be seen that the algorithms using deep convolutional networks have obvious advantages over traditional feature extraction algorithms. The GECMCT method [28] adds far-neighborhood information to the nonparametric transformation calculation and spatial information, combining the Gist feature and a spatially corrected census transform into a new image descriptor, but it lacks a deep description of scene images. The HGD algorithm [29] uses pLSA to train a multichannel classifier on the topic distribution vector of each image, which is not only complex in modeling but also limited in classification effect. Compared with other algorithms using deep convolutional networks, the deep classification model constructed in this paper based on the visual attention regions of images achieves an obviously better classification effect.
The comparative test results on the SE dataset are shown in Table 5. It can be seen that the classification performance of the model in this paper is significantly better than that of the other classification algorithms. Among them, the SKDES+Grad+color+shape method [8] embeds image and label information into block-level kernel descriptors to form supervised kernel descriptors and uses a bag of visual words to learn low-level block expressions; the implementation is relatively complex, and the representational ability of the bag of visual words is limited. The LGF method classifies with global and local image features but does not analyze the image content [30]; in contrast, the proposed method uses the visual attention region detection algorithm to enhance different areas of the image effectively and, combined with the convolutional network pretrained on the Places dataset, captures the spatial structure information of the image better. AdaNSFF-Color boost [1] proposes a novel adaptive nonnegative feature fusion (AdaNFF) framework for scene classification, integrating nonnegative matrix factorization, adaptive feature fusion, and feature fusion boosting into an end-to-end process. However, although this method fuses and enhances features, its training data lack pertinence, which affects its generalization ability.
The comparison test results on the LS dataset are shown in Table 6. The algorithm in this paper still has a good classification effect, better not only than traditional classification methods but also than many deep learning classification methods. The SDO+fc algorithm [4] uses the cooccurrence patterns of objects across scene categories and the correlations between different objects in scene configurations to select representative and discriminative objects, thereby enhancing the discriminability between classes; it represents an image by the occurrence probabilities of the discriminative objects in its image blocks and eliminates the influence of objects common to many scenes. Although the algorithm considers the correlations between objects in the scene, it is still limited to the objects themselves and does not consider the background regions adjacent to them, so its classification effect is limited. DeepScene [17] integrates Spatial Pyramid Pooling into the convolutional neural network to perform multilevel pooling on the scene image and uses a weighted-average ensemble of convolutional neural networks to fuse the class scores, improving overall scene classification performance. However, this method still treats the scene as a whole and does not enhance the representative information of the scene, so its classification effect is also limited.
Methods | Accuracy (%) |
---|---|
Bow [25] | 74.80 |
CMN [36] | 77.20 |
SPMSM [37] | 82.50 |
GECMCT [28] | 82.96 |
EMFS [38] | 85.70 |
DLGB (saliency) [5] | 87.40 |
O2C kernels [18] | 88.80 |
ISPR+IFV [39] | 91.06 |
Hybrid-CNN [15] | 91.59 |
DGSK [3] | 92.30 |
MKL [30] | 92.50 |
DDSFL+Caffe [11] | 92.81 |
DeepScene [17] | 95.60 |
SDO+fc features [4] | 95.88 |
Ours | 96.01 |
The comparative test results on the IS dataset are shown in Table 7. It can be seen that, similar to the results on the other datasets, using a convolutional neural network is in general significantly better than using traditional features, and the classification result in this paper still has obvious advantages. Among them, the Places-CNN algorithm [15], which uses a network pretrained on the Places dataset, outperforms the ImageNet-CNN algorithm [11], which uses a network pretrained on the ImageNet dataset, by nearly 12%, mainly because a network trained on scenes is more effective in judging scene categories. The hybrid-CNN algorithm uses networks pretrained on both the ImageNet and Places datasets to extract deep convolutional features of images [15], so its classification effect is improved compared with the first two algorithms. The algorithm in this paper uses only the convolutional neural network pretrained on the Places dataset, yet by combining the context information of the visual attention regions of the image, it achieves a better classification effect than the hybrid-CNN algorithm.
4. Conclusions
This paper proposes a scene classification model based on deep features of visual attention regions. A context-based saliency detection algorithm is used to detect the visual attention regions in the scene image; the detection result is superimposed on the original image to obtain the attention-enhanced image; the three images are then fed separately into AlexNet to extract deep visual attention features; and finally the features are sent to a random forest classifier for training and classification.
Since the deep visual attention feature extraction model in this paper describes both the target information in the image and the contextual semantic information between the targets and the surrounding scene, different levels of visual attention are captured and the expressive ability of the scene features is improved. Combined with multilayer deep convolutional features, a deep visual expression of the scene image is constructed. The test results on four standard scene image datasets verify the effectiveness of the proposed method, which outperforms several methods with good characteristics.
The method achieves a good classification effect on the test datasets as a whole, but misclassification still occurs when the content of an individual scene image is itself ambiguous or when the visual attention region does not fully express the scene. A further step in future research will therefore be to mine deeper contextual information of the visual attention regions in order to obtain a better classification effect.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This research was funded by the Natural Science Basic Research Program of Shaanxi Province, Grant Number 2021JQ-487.
Open Research
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.