Wildfire Smoke Detection System: Model Architecture, Training Mechanism, and Dataset
Abstract
Vanilla Transformers focus on semantic relevance between mid- to high-level features and are not good at extracting smoke features, as they overlook subtle changes in low-level features like color, transparency, and texture, which are essential for smoke recognition. To address this, we propose the cross contrast patch embedding (CCPE) module based on the Swin Transformer. This module leverages multiscale spatial contrast information in both vertical and horizontal directions to enhance the network’s discrimination of underlying details. By combining cross contrast with the transformer, we exploit the advantages of the transformer in the global receptive field and context modeling while compensating for its inability to capture very low-level details, resulting in a more powerful backbone network tailored for smoke recognition tasks. In addition, we introduce the separable negative sampling mechanism (SNSM) to address supervision signal confusion during training and release the SKLFS-WildFire test dataset, the largest real-world wildfire test set to date, for systematic evaluation. Extensive testing and evaluation on the benchmark dataset FIgLib and the SKLFS-WildFire test dataset show significant performance improvements of the proposed method over baseline detection models.
1. Introduction
Visual wildfire detection refers to the use of imaging devices to capture image or video signals, followed by the application of intelligent algorithms to detect potential wildfire cues. Two critical indicators for determining the presence of wildfires are smoke and flames. Smoke often manifests earlier in the course of a wildfire than flames and has the advantage of being less susceptible to vegetation obstruction, making it of increasing interest to researchers [1]. In terms of imaging methods, wildfire detection typically involves using visible light cameras or infrared cameras to capture images or videos. Visible light-based wildfire detection is favored for its cost-effectiveness and broad coverage and has gained popularity among researchers [2, 3]. However, because of the complex background interference in open scenes, visible light-based wildfire detection places higher demands on intelligent algorithms, necessitating further research to advance this technology.
Over the past decade, researchers have developed a large number of deep learning–based wildfire recognition, detection, and even segmentation models. For example, References [4, 5] proposed fire image classification networks based on deep convolutional neural networks, while References [3, 6] added classical detector heads on top of CNN backbones to identify or locate smoke or flame. In recent years, transformer-based models have demonstrated comparable or superior performance to CNNs on many tasks. Therefore, some researchers have tried to apply transformers to fire detection. For example, Khudayberdiev et al. [7] used the Swin Transformer [8] as the backbone network for the classification of fire images. In Reference [9], a variety of classical backbone networks, including ResNet [10], MobileNet [11], Swin Transformer [8], and ConvNeXt [12], are combined with self-designed detection heads for wildfire detection, and the experiments show that transformer models have no obvious advantage over CNNs. We have observed a similar phenomenon in our experiments. Why do transformer-based models work well in other tasks but fail in wildfire detection?
Through the analysis of a large amount of real fire data, we found that smoke, one of the most typical early cues of fire, has special properties that differ from those of solid objects. An important basis for judging the presence of fire smoke is the spatial distribution of transparency, color, and texture. These features are generally extracted in the shallow layers of deep neural networks. The transformer establishes correlations between different areas through the attention mechanism and has unique advantages in modeling long-distance dependencies and contextual correlation, but its ability to capture low-level details is poor. Based on this observation, this article proposes the cross contrast patch embedding (CCPE) module to promote the Swin Transformer’s ability to distinguish the underlying smoke texture. Specifically, we sequentially cascade a vertical multispatial frequency contrast structure and a horizontal multispatial frequency contrast structure within the patch embedding, use the cascaded spatial contrast results to enhance the original embedding results, and feed them into the subsequent network. We found that this simple design brings extremely significant performance improvements with an almost negligible increase in computational effort.
The main difference between wildfire smoke detection and general object detection tasks is the ambiguity of smoke object boundaries. On the one hand, wildfire smoke detection in open scenes often suffers from false alarms and requires the addition of a large number of error-prone negative image samples, that is, images that contain no wildfire but do contain objects with a high appearance similarity to smoke. The number of negative image samples far exceeds the number of wildfire images, and error-prone objects account for only a small proportion of the image area. The natural idea is to use online hard example mining (OHEM) [13] to focus on confusing areas when sampling negative proposals, thereby improving the accuracy of the detection model. On the other hand, smoke objects show different transparency at different spatial locations because of different concentrations. It is difficult to clearly define the density boundary between the foreground and background of smoke during manual annotation, which leads to ambiguity in the labeling range of the smoke bounding boxes. In the classic object detection framework, the label assignment of proposals is determined by the ground-truth boxes. As shown in Figure 1, the green box is the manually labeled smoke foreground, and the red boxes represent controversial proposals, which are too numerous to enumerate and cannot be annotated one by one. During the model training phase, it is inappropriate to assign background labels to proposals represented by the red boxes. When the OHEM strategy encounters ambiguous smoke objects, most of the negative proposals acquired during training will be ambiguous. To solve this problem, this paper proposes a separable negative sampling mechanism (SNSM). Specifically, the positive and negative images in a batch are separated during training: a small number of negative proposals are collected from the positive images with wildfire smoke, while OHEM is used to collect the confusing areas in the negative images without wildfire smoke. Separable negative instance sampling increases recall and improves model performance in the high-recall interval.

Open-scene wildfire detection lacks a large-scale test dataset for comparison and validation. For instance, the widely employed public smoke detection dataset [14] comprises 149 videos, of which 74 feature fire smoke. The recently released FIgLib dataset [15] encompasses 24.8K images from 315 videos (including omitted images), while the test set comprises only 4880 images sourced from 45 videos. Moreover, some existing studies [6, 16] only report results on undisclosed datasets. The insufficient scale of the public datasets leads to large fluctuations in test results, and the credibility of the experimental results is questionable. This article discloses a large-scale test set, named SKLFS-WildFire Test, which contains 3309 short video clips, including 340 real wildfire videos, with the remainder being negative examples with no fire incidents. We obtained a total of 50,735 images by sampling frames at intervals, of which 3588 contain smoke. False positives are the most critical issue affecting the user experience in wildfire detection, so our test set contains a large number of negative images with many interference objects that resemble smoke, providing a benchmark for testing model performance in unpredictable environments. A preprint has previously been published [17]. The main contributions of this article are summarized as follows:
- 1.
We propose a new transformer-based wildfire smoke detector. The issue of the insufficient ability of the transformer backbone to capture smoke details is addressed through the CCPE module. In addition, the challenge of difficult label assignment caused by the unclear boundaries of smoke objects is mitigated by the SNSM.
- 2.
We release SKLFS-WildFire Test, the largest real-wildfire test dataset to date, which surpasses even most published wildfire training sets in scale.
- 3.
Our algorithm underwent extensive evaluation on 1200 ground-mounted cameras, affirming its efficacy and reliability over prolonged periods.
2. Related Work
2.1. Wildfire Detection Methods
Traditional wildfire detection methods can be broadly divided into two categories: sensor-based approaches, which rely on physical and chemical sensing of combustion byproducts, and vision-based approaches, which analyze image or video data. Sensor-based wildfire detection systems, encompassing ionization [22], photoelectric [23], and gas sensors [24], have been widely implemented in residential and industrial environments because of their rapid response in detecting both flaming and smoldering combustion events. However, this technology incurs prohibitive deployment and operational costs when applied to vast and unstructured wildland areas, thereby restricting its scalability for large-scale fire monitoring. Vision-based methodologies demonstrate significant potential for wildland fire monitoring because of their broad coverage and intuitive visualization. Pre-deep-learning approaches employed handcrafted feature engineering, leveraging color, texture, and motion characteristics to discriminate smoke signatures from complex backgrounds [25–27]. However, these methodologies faced critical limitations in addressing smoke’s intraclass heterogeneity (e.g., morphological variations and chromatic dispersion) and environmental confounders (e.g., reflective water surfaces and atmospheric haze), substantially restricting their operational reliability in heterogeneous wildland environments. In recent years, deep learning techniques have revolutionized wildfire detection by automating feature extraction and improving detection robustness, though practical challenges remain concerning computational efficiency, environmental variability, and false-alarm reduction.
There have been a large number of public reports of deep learning–based wildfire detection methods, and remarkable progress has been made. However, existing wildfire detection methods are still not satisfactory in practical applications, and the characteristics of wildfire detection missions, the challenges they face, and the obstacles to their practical application still lack in-depth and adequate discussion. Most studies focus on the application of mainstream backbones, parameter adjustments, and performance-efficiency trade-offs. For example, References [28–30] use CNNs to extract deep features instead of handcrafted features for image smoke detection, and Reference [28] emphasizes the use of batch normalization (BN) in the network. Some researchers [31–33] use classical backbone networks or detectors such as MobileNetV2 [34], YOLOV3 [35], SSD [36], and Faster R-CNN [37] for smoke/flame recognition or detection. For model efficiency, References [21, 38, 39] all designed new CNN backbone networks for fire recognition and emphasize that the newly designed networks are highly efficient. Although model efficiency is of great significance, the performance of wildfire smoke detection models has not yet reached a satisfactory level, that is, there are too many false positives or the recall is too low. Xue et al. [3] try to improve the model’s small object detection capability by improving YOLOV5 [40]. However, in the general object detection field, there are already numerous discussions of and solutions for small object detection. While the detection of small smoke targets in the early stages of wildfires is challenging, it is not the primary issue in wildfire detection. Dewangan et al. [15] proposed wildfire smoke detection methods utilizing spatiotemporal modeling, and Bhamra et al. [41] proposed incorporating multimodal data such as weather and satellite data, achieving remarkable performance. However, the baseline approach of relying on static images for smoke detection still faces unresolved issues. Recently, researchers have also tried to use transformers to improve the expressive ability of fire recognition models. Khudayberdiev et al. [7] use the Swin Transformer as the backbone network to classify fire images without making any improvements. Hong et al. [9] use various CNN and transformer networks as backbones for fire recognition tasks, including Swin Transformer, DeiT [42], ResNet, MobileNet, EfficientNet [43], and ConvNeXt [12], to compare the effects of different backbones. Experimental results show that the transformer backbones have no significant advantage over the CNN backbones, which is consistent with our observations.
2.2. Wildfire Datasets
Publicly available wildfire datasets are few and generally small. Table 1 summarizes the commonly used public datasets for wildfire recognition or detection. The fire detection dataset [14] published by Pasquale et al. is one of the most commonly used fire test sets, containing 31 video clips, 14 of which contain flames. It is worth mentioning that videos with only smoke but no flame are considered negative samples in this dataset. The same authors also published the smoke detection dataset, which contains 149 video clips, 74 of which contain smoke. However, to the best of our knowledge, no published papers report experimental results on the smoke detection dataset. Chino et al. [18] made a dataset publicly available, called Bowfire, which contains 226 images, of which 119 contain flames and 107 are negative samples without flames, and is equipped with pixel-level semantic segmentation labels. Ali et al. [19] collected the forest-fire dataset, which contains a total of 1900 images across the training and test sets, of which 950 are images with fire. In Reference [20], a dataset named FireFlame is released, which contains three categories, fire (namely flame in this paper), smoke, and neutral, with 1000 images each, for a total of 3000 images. Arpit et al. [21] collected a publicly available dataset, named FireNet, which contains a total of 108 video clips and 160 images that are prone to false detection. Dewangan et al. [15] introduced FIgLib, which to our knowledge is the largest wildfire recognition dataset, comprising 315 videos, of which 270 are usable. Building upon FIgLib, Bhamra et al. [41] incorporated multimodal information and removed some samples with missing information, resulting in a final set of 255 videos. However, both studies suffered from relatively small test set sizes, which affects the stability of their test results. Figure 2 shows samples from several commonly used datasets. We observe two problems with the existing datasets. First, by the time a fire has developed and broken out, smoke and flame are already very intense, so automatic fire detection and alarm at that stage has limited value and is not in line with the original intention of early warning and disaster loss reduction. Second, the data styles of laboratory ignition scenes and realistic wildfires are quite different. In the SKLFS-WildFire test dataset proposed in this paper, we collect real early wildfire data, where the fire is in its initial stage. This makes the dataset more consistent with real-world wildfire detection scenarios, allowing for a more direct evaluation of the performance of wildfire detection models in practical applications.
| Datasets | Data format | Flame/smoke | Positive number | Total number | Annotation form |
|---|---|---|---|---|---|
| Fire detection dataset [14] | Video | Flame | 14 videos | 31 videos | Video level |
| Smoke detection dataset [14] | Video | Smoke | 74 videos | 149 videos | Video level |
| Bowfire [18] | Image | Flame | 119 images | 226 images | Pixel level |
| Forest fire dataset [19] | Image | No distinction | 950 images | 1900 images | Image level |
| FireFlame [20] | Image | Flame & smoke | 2000 images | 3000 images | Image level |
| FireNet dataset [21] | Image & video | No distinction | 46 videos | 62 videos & 160 negative images | Image level |
| FIgLib dataset [15] | Video | Smoke | 315 videos | 315 videos | Image level |
| SKLFS-WildFire test dataset (OURS) | Image | Flame & smoke | 3588 images | 50,735 images | Box level |

3. Method
Wildfire smoke detection is an extension of general object detection. General-purpose object detectors can be roughly divided into two categories: single stage and multistage. Multistage object detectors are generally slightly better than single-stage detectors because of feature alignment, which fine-tunes the proposals from the first stage using aligned features. However, Figure 1 shows that the location and range of smoke are very ambiguous, so the location fine-tuning in the second stage is of little benefit and may even degrade detector performance by introducing excessive localization cost. Therefore, based on the classic single-stage detector YOLOX [44], this paper improves the model structure and training strategy according to the particularities of smoke detection. The reason this paper is not based on more recent detectors such as the newest YOLOV8 [45] is that these methods incorporate a large number of tricks developed for general object detection tasks, which are unverified in the field of wildfire detection. Furthermore, the CCPE and SNSM proposed in this paper can be readily applied to newer detectors, since the conclusions of this paper are general and can be extended.
3.1. System Framework
The wildfire smoke detection system is an integral component of forest fire monitoring systems. As illustrated in Figure 3, the smoke detector constitutes the core of this system. This paper proposes a novel transformer-based smoke detection model from three perspectives: model architecture, training mechanisms, and dataset composition. The proposed model is deployed on intelligent gimbal cameras or edge AI boxes to perform real-time video analysis, transmitting suspicious video segments and detection results back to the fire comprehensive management platform. Maintenance personnel on duty manually verify whether the detected incidents constitute actual fires. Upon confirmation of high-risk levels, the platform sends alarm signals via text messages, WeChat Mini Programs, voice calls, and other means to notify wildfire prevention personnel for on-site investigation. The system employs a human-machine coupling approach, pursuing a high recall rate while tolerating a certain number of false alarms. However, an increase in false alarms significantly increases the workload of the on-duty personnel.

3.2. Model Pipeline
The overall network structure is shown in Figure 4. In this paper, the Swin Transformer backbone network is used to extract multiscale features; the PAFPN [46] is used for integrating multiscale features by adding a bottom-up fusion pathway after the top-down fusion pathway of FPN [47]. Finally, the YOLOX detection head is used to identify and locate smoke instances.

The input of the backbone network is the RGB image group, denoted as $I \in \mathbb{R}^{B \times 3 \times H \times W}$, where B, W, and H are the batch size, the width, and the height of the image, respectively. After the backbone network and neck, three scales of features $F_1 \in \mathbb{R}^{B \times 2C \times \frac{H}{8} \times \frac{W}{8}}$, $F_2 \in \mathbb{R}^{B \times 4C \times \frac{H}{16} \times \frac{W}{16}}$, and $F_3 \in \mathbb{R}^{B \times 8C \times \frac{H}{32} \times \frac{W}{32}}$ are obtained, where C is a hyperparameter that refers to the output channels of the Swin Transformer’s first stage. Finally, the multiscale features are fed into the YOLOX head to predict confidence, class, and bounding box. The transformer has strong modeling ability for contextual correlation and long-distance dependence, but its ability to capture detailed texture information is weak. To address this problem, this paper redesigns the patch embedding of the Swin Transformer for smoke detection. To solve the problem of ambiguous label assignment caused by the difficulty of determining the smoke boundary, an SNSM is added to the loss computation, and the obtained positive and negative sample masks are used to weight the costs of classification, regression, and confidence.
3.3. CCPE
Smoke has a unique property: an important clue to its presence lies in the spatial contrast of color and transparency. Capturing spatial contrast can highlight the smoke foreground against the background and distinguish fire smoke from homogeneous blurriness such as poor air visibility, motion blur, and zoom blur. However, the transformer architecture is weak at capturing low-level visual cues. To solve this problem, we propose a novel CCPE module, which is composed of a horizontal contrast component and a vertical contrast component in series.
3.3.1. Horizontal Contrast


Finally, the original feature F and the contrast masks of all scales are concatenated along the channel dimension, and a 3 × 3 2D convolution is applied to obtain the feature map FH with 48 channels.
3.3.2. Vertical Contrast
Finally, F and FV are concatenated along the channel dimension to obtain the final patch embedding result. Compared with vanilla patch embedding, the proposed CCPE can quickly capture the multiscale contrast of smoke images. Moreover, the spatial shift operation is nearly free; only a small number of additional convolution operators increase the computational cost. Therefore, CCPE makes up for the natural defects of transformers in smoke detection and obtains a significant improvement in effect at a controllable computational cost.
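To make the cascaded design concrete, the following PyTorch sketch illustrates one possible reading of CCPE. It is a minimal sketch, not the authors' released implementation: it assumes the contrast mask at each stride is the difference between the embedded feature map and a spatially shifted copy of itself, and that the base patch embedding outputs 48 channels so that concatenating it with the 48-channel contrast branch yields the 96-channel embedding expected by Swin-T.

```python
import torch
import torch.nn as nn


class CrossContrastPatchEmbedding(nn.Module):
    """Sketch of CCPE: horizontal then vertical multiscale spatial contrast."""

    def __init__(self, in_ch=3, embed_ch=48, patch=4,
                 strides=(1, 2, 4, 8, 16, 32, 64, 128)):
        super().__init__()
        self.strides = strides
        # Vanilla patch embedding (4x4 non-overlapping patches)
        self.proj = nn.Conv2d(in_ch, embed_ch, kernel_size=patch, stride=patch)
        n = len(strides)
        # 3x3 convolutions that fuse the feature map with its contrast masks
        self.fuse_h = nn.Conv2d(embed_ch * (n + 1), embed_ch, 3, padding=1)
        self.fuse_v = nn.Conv2d(embed_ch * (n + 1), embed_ch, 3, padding=1)

    @staticmethod
    def _contrast(x, shift, dim):
        # Spatial-shift contrast: difference between x and a copy rolled by
        # `shift` positions along width (dim=3) or height (dim=2).
        return x - torch.roll(x, shifts=shift, dims=dim)

    def forward(self, img):
        f = self.proj(img)                                         # (B, 48, H/4, W/4)
        h = [self._contrast(f, s, dim=3) for s in self.strides]    # horizontal masks
        f_h = self.fuse_h(torch.cat([f] + h, dim=1))               # FH, 48 channels
        v = [self._contrast(f_h, s, dim=2) for s in self.strides]  # vertical masks
        f_v = self.fuse_v(torch.cat([f_h] + v, dim=1))             # FV, 48 channels
        return torch.cat([f, f_v], dim=1)                          # (B, 96, H/4, W/4)
```

Because the shifts are implemented with `torch.roll`, the extra cost is dominated by the two 3 × 3 convolutions, which is consistent with the claim of a near-negligible overhead.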
3.4. Separable Negative Sampling
In this paper, images with smoke objects are called positive image samples, and images without smoke are called negative image samples. Following the detector’s assignment rules (e.g., an IoU threshold), we refer to image regions containing smoke as positive instance samples and regions without smoke as negative instance samples.
Smoke detection applications in open scenes often suffer from missed detections and false detections. Figure 1 has shown that the smoke boundary is difficult to define. One reason for missed smoke detections is that, in positive image samples, a large number of image regions may contain smoke yet are assigned negative labels during training, while only the manually labeled regions are assigned positive labels. The smoke-containing regions that receive negative labels often contribute large losses, which increases the difficulty of training and in turn leads to a high false negative rate of the detector. Meanwhile, the cause of false smoke detections lies in the complex backgrounds of open scenes: there are a large number of image regions with a high appearance similarity to smoke. In engineering practice, false detections can be suppressed by adding error-prone images. Since the error-prone regions account for only a small part of an image, their loss share can be strengthened by OHEM [13], which selects the top-K negative samples with the highest scores during training, typically the hard negatives. If traditional OHEM is used to highlight the difficult regions in all images, the label ambiguity problem in the positive image samples becomes even more prominent. Therefore, we propose the SNSM to alleviate this problem.
As shown in Figure 4, one scale of the feature map is taken as an example. The YOLOX head has three branches for confidence prediction, classification, and regression, respectively. The positive mask $\mathrm{mask}_{pos}$ is determined using the manually annotated ground truth together with the positive sample assignment rules. YOLOX sets the 3 × 3 center region as positive, and this paper follows this design. The classification loss and the regression loss are weighted using $\mathrm{mask}_{pos}$. In the vanilla YOLOX head, all spatial locations contribute to the confidence loss. However, because of the addition of a large number of error-prone negative images, we use all positive locations, namely $\mathrm{mask}_{pos}$, together with part of the negative locations after sampling, to weight the confidence loss by location.
Considering the importance of sampling negative locations for loss computation, we propose the SNSM. We denote the initial mask containing all negative locations as $\mathrm{initMask}_{neg} = 1 - \mathrm{mask}_{pos}$. We divide $\mathrm{initMask}_{neg}$ into two groups according to the presence or absence of smoke, that is, the positive image group $\mathrm{initMask}_{neg}^{p}$ and the negative image group $\mathrm{initMask}_{neg}^{n}$, where $B_p$ and $B_n$ are the numbers of positive and negative images, respectively. In $\mathrm{initMask}_{neg}^{p}$, negative locations are randomly sampled according to the number of positive locations, and the result is denoted as $\mathrm{mask}_{neg}^{p}$; the number of randomly selected negative samples is α1 times the number of positive samples. In $\mathrm{initMask}_{neg}^{n}$, OHEM is used to collect negative samples: all negative samples are ranked in descending order of their scores, and the top-K negative samples are selected for training, denoted as $\mathrm{mask}_{neg}^{n}$. Here, K is set to α2 times the number of positive samples. We set α1 ≪ α2, so that the negative samples learned by the model come mostly from negative images, and more attention is paid to the image regions prone to misdetection because of OHEM.
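The following is a minimal sketch of the SNSM sampling step, assuming a flattened (B, N) layout of prediction locations for one feature level; the sample counts are taken relative to the number of positive locations in the batch, and the exact bookkeeping in the authors' code may differ.

```python
import torch


def snsm_confidence_mask(scores, mask_pos, has_smoke, alpha1=10, alpha2=190):
    """Return a (B, N) mask of locations contributing to the confidence loss.

    scores:    (B, N) predicted confidences for one feature level
    mask_pos:  (B, N) 0/1 float mask of positive locations (YOLOX 3x3-center rule)
    has_smoke: (B,)   bool, True for positive image samples
    """
    num_pos = int(mask_pos.sum().item())
    init_mask_neg = (mask_pos == 0)
    selected = torch.zeros_like(mask_pos)

    # Positive images: randomly sample alpha1 * num_pos negative locations.
    cand = torch.nonzero(init_mask_neg & has_smoke[:, None], as_tuple=False)
    if len(cand) > 0 and num_pos > 0:
        keep = cand[torch.randperm(len(cand))[: min(alpha1 * num_pos, len(cand))]]
        selected[keep[:, 0], keep[:, 1]] = 1

    # Negative images: OHEM, keep the alpha2 * num_pos highest-scoring negatives.
    cand = torch.nonzero(init_mask_neg & ~has_smoke[:, None], as_tuple=False)
    if len(cand) > 0 and num_pos > 0:
        k = min(alpha2 * num_pos, len(cand))
        top = scores[cand[:, 0], cand[:, 1]].topk(k).indices
        keep = cand[top]
        selected[keep[:, 0], keep[:, 1]] = 1

    # Positive locations plus the sampled negatives weight the confidence loss.
    return mask_pos + selected
```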
4. Experiments
4.1. Datasets
FIgLib [15], introduced by Dewangan et al. in 2022, stands as the largest forest fire dataset to date. It encompasses 315 wildfire videos, of which 270 are usable. The dataset is divided into three subsets, training, validation, and testing, comprising 144, 64, and 62 videos, respectively. Notably, FIgLib’s distinguishing feature lies in its reliance on temporal information for smoke identification: even humans struggle to discern smoke solely from single-frame images within FIgLib. The SKLFS-WildFire dataset introduced in this paper has the following characteristics:
- 1.
High diversity. Most SKLFS-WildFire data come from videos, but we do not extract too many images from any single video; about 10 frames are extracted from each video on average to ensure data diversity.
- 2.
Strict train-test split design. We split the test dataset according to the administrative region where the cameras are deployed, and all the test videos are in completely different geographical locations from the training dataset.
- 3.
Difficult negative samples. We added a large amount of nonsmoke data to both the training and test sets. These nonsmoke data come from the false detections of classical detectors such as Faster R-CNN [37], YOLOV3 [35], and YOLOV5 [40].
- 4.
Realistic early fire scenes. All SKLFS-WildFire data come from data flowing back from deployed applications and mainly cover the initial stage of fires, rather than laboratory ignition scenarios or the middle and late stages of wildfires.
To protect user privacy, the training set of SKLFS-WildFire is not available for the time being. We release the SKLFS-WildFire test set to facilitate academic research.
4.2. Metrics for SKLFS-WildFire
4.3. Implementation Details
All models are implemented using the MMDetection [48] framework with the PyTorch [49] backend. We used NVIDIA A100 GPUs to train all models, while a single NVIDIA GeForce RTX 3080Ti was used to compare inference efficiency. The input of the wildfire smoke detection models is a batch of images, and we set the batch size B to 64. In the training phase, we follow the data augmentation strategy of YOLOX and use mosaic, random affine, random HSV, and random flip to enhance the diversity of the data and improve the generalization ability of the models. Finally, through resize and padding operations, the input width W and height H are both 640. The set of horizontal strides SH and the set of vertical strides SV in CCPE are both set to {1, 2, 4, 8, 16, 32, 64, 128} to capture contrast information at eight different scales. In SNSM, the negative sampling ratio α1 of positive image samples is set to 10, and the negative sampling ratio α2 of negative image samples is set to 190. The proposed model is optimized with the SGD optimizer with an initial learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005, and a total of 40 epochs are trained. Mixed-precision training with float32 and float16 is used, with the loss scale set to 512.
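For reference, the optimization hyperparameters above translate into roughly the following plain-PyTorch setup; the actual experiments use the MMDetection training loop, and `model` here stands for the YOLOX-ContrastSwin detector defined elsewhere.

```python
import torch

# Sketch of the training hyperparameters from this section (not the full MMDetection config).
optimizer = torch.optim.SGD(
    model.parameters(),   # `model` is assumed to be the YOLOX-ContrastSwin detector
    lr=0.01,              # initial learning rate
    momentum=0.9,
    weight_decay=5e-4,
)
scaler = torch.cuda.amp.GradScaler(init_scale=512)  # float16/float32 mixed precision, loss scale 512

num_epochs = 40
batch_size = 64
input_size = 640          # images are resized and padded to 640 x 640
ccpe_strides = [1, 2, 4, 8, 16, 32, 64, 128]
alpha1, alpha2 = 10, 190  # SNSM sampling ratios
```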
4.4. Comparison on FIgLib
We did not find bounding box-level labels for FIgLib, so we reannotated the boxes in the training subset. Since smoke from FIgLib is visually more challenging to discern, we increased the input size of the images from 640 to 1024. Because of FIgLib’s emphasis on temporal features for smoke detection, we not only retrained the proposed image model on its training set but also concatenated the current frame It with the past frame It−2 in the image channels to obtain a temporal model, where t and t − 2 represent the current time and the time 2 minutes ago, respectively (the frame interval in FIgLib is 1 minute). By implementing the simplest spatiotemporal modeling framework (channel-level concatenation), we provide empirical evidence that the exceptional performance of our proposed model on FIgLib primarily derives from the efficacy of its image-centric architectural design, rather than relying on sophisticated spatiotemporal processing mechanisms. We adopt the metrics in SmokeyNet [15] and compare the results in P (precision), R (recall), F1 score, Acc (accuracy), and TTD (time to detection, minutes) against those reported in Reference [15]. Accuracy and F1 score are the primary composite metrics of importance, with other metrics serving as supplementary references. Table 2 presents the comparison of the effects. In the context of single-frame input models, the proposed model has achieved comparable results, although not the best. This is attributed to the smoke in the FIgLib dataset, which is prone to being confused with other objects in single-frame images, thereby hindering more expressive models from achieving superior performance. A testament to this observation is the comparison between the results of ResNet34 in row 5 and ResNet50 in row 6. Conversely, in the realm of multiframe input models, even with our model’s simple utilization of concatenation to model temporal sequences, we have still attained state-of-the-art (SOTA) results. This underscores the efficacy of the proposed approach.
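The two-frame temporal variant amounts to nothing more than channel-level concatenation before the detector; a sketch (with hypothetical tensor names) is shown below. The detector's patch embedding must be widened to accept six input channels.

```python
import torch

# Frames sampled 2 minutes apart (the FIgLib frame interval is 1 minute), each (B, 3, H, W).
frame_t   = torch.randn(4, 3, 1024, 1024)  # current frame I_t
frame_tm2 = torch.randn(4, 3, 1024, 1024)  # past frame I_{t-2}

# Simplest spatiotemporal modeling: concatenate along the channel dimension -> (B, 6, H, W).
temporal_input = torch.cat([frame_tm2, frame_t], dim=1)
```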
| Models | Backbone | Params (M) | Acc (%) | F1 (%) | Precision (%) | Recall (%) | TTD (min) |
|---|---|---|---|---|---|---|---|
| Single-frame models | | | | | | | |
| OURS | ContrastSwin | 35.9 | 81.48 | 81.12 | 83.74 | 78.66 | 2.73 |
| SmokeyNet [15] | ResNet34 + ViT | 40.3 | 82.53 | 81.30 | 88.58 | 75.19 | 2.95 |
| SmokeyNet [15] | ResNet34 | 22.3 | 79.40 | 78.90 | 81.62 | 76.58 | 2.81 |
| SmokeyNet [15] | ResNet50 | 26.1 | 68.51 | 74.30 | 63.35 | 89.89 | 1.01 |
| FasterRCNN [15] | ResNet50 | 41.3 | 71.56 | 66.92 | 81.34 | 56.88 | 5.01 |
| MaskRCNN [15] | ResNet50 | 43.9 | 73.24 | 69.94 | 81.08 | 61.51 | 4.18 |
| Multiframe models | | | | | | | |
| OURS | 2 frames ContrastSwin | 35.9 | 84.67 | 84.05 | 88.74 | 79.83 | 2.26 |
| SmokeyNet [15] | 3 frames ResNet34 + LSTM + ViT | 56.9 | 83.62 | 82.83 | 90.85 | 76.11 | 2.94 |
| SmokeyNet [15] | 2 frames ResNet34 + LSTM + ViT | 56.9 | 83.49 | 82.59 | 89.84 | 76.45 | 3.12 |
| SmokeyNet [15] | 2 frames MobileNet + LSTM + ViT | 36.6 | 81.79 | 80.71 | 88.34 | 74.31 | 3.92 |
| SmokeyNet [15] | 2 frames TinyDeiT + LSTM + ViT | 22.9 | 79.74 | 79.01 | 84.25 | 74.44 | 3.61 |
- Note: The proposed model demonstrates strong performance in single-frame input models and achieves the best results in multiframe input models on the FIgLib dataset. Bold values represent the best values among all methods of their category under each metric (column).
4.5. Ablation Studies
The CCPE module is proposed to make up for the insufficient ability of the transformer backbone network to capture the details of smoke texture changes in space. A separable negative sampling strategy is proposed to alleviate the ambiguity of smoke negative sample assignment and improve the recall rate, that is, the model performance under the premise of high recall. In this section, the effectiveness of these two methods is verified by ablation experiments on the SKLFS-WildFire test dataset.
4.5.1. Effectiveness of CCPE
As shown in Table 3, after replacing the CNN-based backbone network CSPDarknet [50] with the Swin Transformer Tiny (denoted Swin-T) in YOLOX, the bounding box-level AP decreased, while the image-level and video-level AUC significantly improved. This supports our observation that the transformer architecture has a stronger ability to model contextual relevance, while CNNs are better at capturing detailed texture information for more accurate location prediction. In the last rows of the table, the patch embedding of Swin is replaced by the proposed CCPE. The experimental results show that CCPE consistently outperforms both the original YOLOX model and the model with the Swin backbone at the bounding box, image, and video levels. Figure 6 presents the PR curves at the bounding box level and the PR and ROC curves at the image and video levels, respectively. These curves support the same conclusions as the numerical metrics, proving that the proposed CCPE can effectively make up for the transformer's weak ability to model the low-level details of smoke, and significantly improves the comprehensive performance of the smoke detection model with only a small increase in parameters and computation.
| Models | Backbone | BBox AP@0.5 | Image AUC | Video AUC | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOX | CSPDarknet large | 0.506 | 0.712 | 0.900 | 54.209 | 155.657 |
| YOLOX-SwinT | Swin tiny | 0.476 | 0.732 | 0.909 | 35.840 | 50.524 |
| YOLOX-SwinS | Swin small | 0.484 | 0.735 | 0.914 | 57.158 | 75.642 |
| YOLOX-ContrastSwinT | CCPE + Swin tiny | 0.537 | 0.765 | 0.908 | 35.893 | 53.250 |
| YOLOX-ContrastSwinS | CCPE + Swin small | 0.560 | 0.776 | 0.915 | 57.211 | 78.368 |
- Note: CNN-based YOLOX significantly outperforms Swin-based YOLOX on BBox-level metric. After using CCPE to improve Swin, model performance significantly exceeds the baseline on all metrics. Bold values represent the best values under each metric (column).
4.5.2. Effectiveness of Separable Negative Sampling
Separable negative sampling is designed for the problem of the uncertain boundary of smoke instances. As shown in Table 4, the baseline model YOLOX-ContrastSwin does not employ any negative sampling strategy, that is, all locations participate in the training of the confidence branch. We first experimented with randomly sampling negative locations at 200 times the number of positive locations, denoted "Random." Similarly, we also sampled negative locations at 200 times the number of positive locations using OHEM, ordered from the highest score to the lowest, denoted "OHEM." Finally, SNSM is used to select the negative locations participating in training, denoted "SNSM."
| Models | Sampling | BBox AP@0.5 | Image AUC | Video AUC | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOX-ContrastSwin | None (baseline) | 0.537 | 0.765 | 0.908 | 35.893 | 53.250 |
| YOLOX-ContrastSwin | Random | 0.527 | 0.847 | 0.933 | 35.893 | 53.250 |
| YOLOX-ContrastSwin | OHEM | 0.574 | 0.783 | 0.924 | 35.893 | 53.250 |
| YOLOX-ContrastSwin | SNSM (OURS) | 0.503 | 0.900 | 0.934 | 35.893 | 53.250 |
- Note: Although there is a decrease in box level AP, the proposed SNSM is able to significantly enhance the comprehensive metrics at image level and video level. Bold values represent the best values under each metric (column).
As can be seen from the results in Table 4, after adopting SNSM, the bounding box AP decreases from 0.537 to 0.503, while the image-level and video-level AUC values increase by 13.5 and 2.6 percentage points, respectively. The bounding box-level AP decreases because, although SNSM alleviates ambiguity and increases recall, it provides fewer negative training samples in positive images, which leads to less accurate box positions. It can also be seen from Figure 7 that SNSM does not perform well in the high-precision, low-recall interval, but it gives the model relatively high precision in the high-recall interval. The results reveal that our model exhibits discrepancies between predicted and human-annotated bounding boxes in positive frames (because of localization and quantity variations), while generating fewer false positives in negative frames. This trade-off holds greater practical value for smoke detection systems, where alarm accuracy for video clips is prioritized over box precision.
4.6. Comparison to Classical Detectors
It is customary to compare with classical methods. Since the code and datasets used in existing wildfire detection studies are rarely publicly available, we did not compare against existing wildfire detection methods directly. Instead, we reproduce several classic general-purpose object detectors, such as YOLOV3 [35], YOLOX [44], RetinaNet [51], Faster R-CNN [37], and Sparse R-CNN [52], on our own training and testing sets. As can be seen from Table 5, the proposed model outperforms all existing models in terms of image- and video-level classification metrics without a significant increase in the number of parameters or the amount of computation. The metrics at the bounding box level are slightly worse than those of the YOLOX models with the CSPDarknet backbone network. There are two reasons for this phenomenon. First, the location and range of smoke objects are controversial, so they are difficult to detect accurately. Second, SNSM reduces this controversy by reducing the contribution of negative instances in positive image samples, which improves the recall rate of fire alarms but degrades the localization of fire clues inside positive image samples. However, we believe that in the wildfire detection task, detecting fire events and raising alarms is the first priority, while locating the position of smoke or flame in the image, although also important, is a lower priority.
| Models | Backbone | BBox AP@0.5 | Image AUC | Video AUC | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOV3 | DarkNet-53 | 0.421 | 0.588 | 0.823 | 61.529 | 139.972 |
| YOLOX | CSPDarknet small | 0.504 | 0.679 | 0.882 | 8.938 | 26.640 |
| YOLOX | CSPDarknet large | 0.506 | 0.712 | 0.900 | 54.209 | 155.657 |
| YOLOX | Swin tiny | 0.476 | 0.732 | 0.909 | 35.840 | 50.524 |
| RetinaNet | ResNet101 | 0.365 | 0.810 | 0.879 | 56.961 | 662.367 |
| Faster R-CNN | ResNet101 | 0.432 | 0.598 | 0.858 | 60.345 | 566.591 |
| Sparse R-CNN | ResNet101 | 0.505 | 0.852 | 0.914 | 125.163 | 472.867 |
| OURS | Swin tiny | 0.503 | 0.900 | 0.934 | 35.893 | 53.250 |

- Note: The proposed model outperforms all existing models in both image- and video-level metrics. Bounding box-level AP@0.5 is slightly worse than the YOLOX model with the CSPDarknet backbone network. Bold values represent the best values under each metric (column).
4.7. Visualization Analysis
Figure 8 shows the detection results of the proposed model and other baseline models on samples from the SKLFS-WildFire test dataset. The score threshold is set to 0.5. A green image border indicates that the wildfire smoke objects are correctly detected, an orange border indicates that the wildfire smoke objects are missed, and the red bounding boxes in the images represent the smoke detection results. As can be seen from the figure, the proposed model has a large advantage in recall. This advantage derives precisely from SNSM's alleviation of the ambiguity of smoke box location and range. However, as can be seen from the results in the last row, the high recall rate comes at the cost of multiple repeated detection boxes in the smoke images, and these detection boxes are difficult to suppress with NMS. This phenomenon also indirectly confirms that the positional ambiguity of smoke objects does exist. Generating multiple detection boxes for a single smoke instance leads to relatively low bounding box-level metrics such as AP@0.5, which is exactly the problem shown in Tables 4 and 5.

5. Deployment and Limitations
This section details our systematic integration of the wildfire smoke detection model (Swin Tiny variant) into the operational pipeline, including methodology for converting model outputs into actionable platform alerts. We also conduct a critical evaluation of the system’s inherent limitations. Such comprehensive technical disclosure not only facilitates a deeper comprehension of our framework’s practical implementation but also offers concrete implementation guidance for researchers and engineers aiming to adapt our algorithm into real-world surveillance systems operating under resource-constrained conditions.
5.1. Deployment
5.1.1. Postprocessing Strategies for Reliable Smoke Alarm Delivery
The raw output of a wildfire smoke detection model is not directly suitable for end-user applications. On the one hand, the complex backgrounds and numerous distractors in wildland scenes can result in a high false alarm rate when model predictions are used as is. This leads to a significant waste of firefighting resources, as each detection would require manual verification. A typical forest fire monitoring PTZ (pan-tilt-zoom) camera covers a range of 3 to 10 km, with 5 km being a common value, which implies that a single camera may monitor an area of approximately 80 square kilometers; the coverage of inspection drones is even larger. Within such vast areas, the likelihood of false positives caused by confusing objects is nonnegligible. If detection results were directly forwarded to users, the system could generate hundreds or even thousands of false alarms per day. On the other hand, even in the event of a real wildfire, users do not expect to receive frequent or redundant alerts. Typically, users prefer not to be alerted repeatedly within 10 minutes of a reliable alarm. Therefore, the high-frequency outputs of object detection models, often several detections per second, are unacceptable in practical deployments. To address these issues and enhance system reliability, we encapsulate the object detection model with lightweight postprocessing algorithms.

First, a temporal smoothing strategy is employed to reduce false positives. For PTZ cameras, our system adopts a 6 s temporal sliding window, where alerts are triggered only if over 60% of frames within the window contain smoke detections. For drones, the model analyzes every frame in real time during inspection. When smoke is detected, the drone suspends its flight and performs a "gaze" operation by observing the suspicious area for 6 s; alerts are likewise triggered only if over 60% of frames within the window contain smoke detections.

Second, we introduce a regional cooling mechanism to suppress repeated alarms. After an alarm is issued, the system estimates the approximate location of the fire source based on the camera's longitude, latitude, altitude, orientation, and a geographic information system (GIS) map. No additional alerts are triggered within a predefined radius (e.g., 500 m) around this location for the next 30 minutes.

Lastly, several false alarm suppression rules are integrated into the system. In certain scenarios, we incorporate semantic segmentation models to identify and mask regions such as sky, water bodies, buildings, and factories; smoke detection bounding boxes that exhibit high IoU with these regions (above certain thresholds) are classified as false positives. For frequently misidentified static objects such as walls, rocks, or roads, we define "ignore zones" in the system, which are excluded from consideration based on either their geographic coordinates or image-based matching results.
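The temporal smoothing and regional cooling rules can be summarized by a small stateful filter; the sketch below uses assumed parameter values from this section (2 analyzed frames per second, a 6 s window, a 60% threshold, a 500 m cooling radius, and 30 minutes of cooling) and a hypothetical planar fire-location estimate.

```python
from collections import deque


class SmokeAlarmFilter:
    """Sketch of the temporal smoothing + regional cooling post-processing."""

    def __init__(self, fps=2, window_s=6, ratio=0.6, cool_radius_m=500.0, cool_minutes=30):
        self.window = deque(maxlen=fps * window_s)  # frame-level smoke flags
        self.ratio = ratio
        self.cool_radius_m = cool_radius_m
        self.cool_s = cool_minutes * 60
        self.recent_alarms = []  # (timestamp, (x, y)) in a local metric coordinate frame

    def _in_cooling_zone(self, now, loc):
        self.recent_alarms = [(t, p) for t, p in self.recent_alarms if now - t < self.cool_s]
        return any(((loc[0] - p[0]) ** 2 + (loc[1] - p[1]) ** 2) ** 0.5 < self.cool_radius_m
                   for _, p in self.recent_alarms)

    def update(self, now, smoke_detected, est_location):
        """Feed one frame result; return True when an alarm should be sent."""
        self.window.append(bool(smoke_detected))
        if len(self.window) < self.window.maxlen:
            return False                      # wait until the 6 s window is full
        if sum(self.window) / len(self.window) <= self.ratio:
            return False                      # fewer than 60% smoke frames
        if self._in_cooling_zone(now, est_location):
            return False                      # suppress repeated alarms near a recent one
        self.recent_alarms.append((now, est_location))
        return True
```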
5.1.2. Deployment on Embedded AI Platforms
Our model is primarily deployed on intelligent PTZ cameras and edge computing boxes (for drones, data are transmitted to a ground-based edge box for processing). The smart PTZ cameras are integrated with the Rockchip RK3588 SoC, which features a 6 tera operations per second (TOPS) NPU. The edge computing boxes are equipped with the Jetson Orin Nano module, providing 36 TOPS of AI compute. Both platforms support INT8 and INT16 inference modes. We apply posttraining quantization to convert the original floating-point model into a hybrid INT8/INT16 version. The quantized model maintains performance fidelity, with all metrics deviating by no more than 0.08% from the original float model. On the RK3588 platform, we employ the RKNN inference backend to enable parallel processing. Empirical testing shows a per-frame inference latency of 116 ms with an input resolution of 640 × 640, and memory usage remains under 1.2 GB. On the Jetson Orin Nano, we use ONNX Runtime as the frontend, paired with TensorRT/CUDA as the backend for acceleration. This configuration achieves a per-frame inference latency of 89 ms at the same input resolution. Our wildfire smoke detection system operates with a frame-skipping strategy, analyzing two frames per second. Under this setting, the inference speed comfortably meets real-time processing requirements. Field tests demonstrate that during continuous 24-hour operation, the average power consumption—including both model inference and postprocessing—is 3.9 W on the RK3588 and 7.7 W on the Jetson Orin Nano.
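As an illustration of the Jetson Orin Nano path (ONNX Runtime frontend with TensorRT/CUDA backends), a minimal inference setup might look like the sketch below; `smoke_detector.onnx` is a hypothetical export of the quantized model, and the available execution providers depend on the installed ONNX Runtime build.

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order; TensorRT is preferred, with CUDA and CPU as fallbacks.
session = ort.InferenceSession(
    "smoke_detector.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Stand-in for a preprocessed 640 x 640 frame (the system analyzes two frames per second).
frame = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: frame})
```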
5.2. Limitations
5.2.1. SNSM-Induced Inaccuracies
The proposed SNSM mitigates the assignment of negative labels to regions perceived as smoke-containing by humans but conflicting with ground truth annotations during training, thereby enhancing smoke box recall. However, a limitation of SNSM lies in the generation of numerous plausible yet annotation-inconsistent detection boxes within positive images. This phenomenon stems from a lack of negative instance supervision within positive training images. Figure 9(a) illustrates typical samples exhibiting such prediction inaccuracies. Theoretical analysis reveals that when the sum of parameters α1 and α2 remains constant (i.e., fixed positive-negative instance ratio during training), increasing the α1/α2 ratio implies more negative instances sampled from positive images. This improves spatial prediction accuracy at the cost of heightened training ambiguity and reduced recall for smoke-containing images/videos. Conversely, decreasing the α1/α2 ratio amplifies SNSM’s impact, leading to less accurate box localization in positive images while enhancing image/video-level recall. This trade-off offers configurable solutions for diverse application requirements: smaller α1/α2 ratios should be prioritized when timely wildfire alerting is critical, whereas larger ratios are preferable when precise box localization is needed to derive fire point coordinates.


5.2.2. Fog
The most critical issue is false alarms caused by fog. As shown in Figure 9, the model may produce incorrect detections in images containing fog. Although expanding the training dataset can alleviate this problem, the high visual similarity between fog and smoke prevents a fundamental solution through data augmentation alone. We propose the following solutions: (1) integrate meteorological information (precipitation, atmospheric humidity) to filter wildfire alarms, as fog typically forms under specific weather conditions, and (2) adopt spatiotemporal smoke detection models; while fog and smoke appear similar visually, their motion characteristics differ, which spatiotemporal approaches can exploit to reduce false positives. Since a single paper cannot resolve all issues, we leave these comprehensive solutions to future work.
5.2.3. Smoke-Like Objects
Besides fog, specific objects may also trigger false alarms. Figure 9 illustrates typical examples. Water surface reflections are common interference—under special angles, reflections can resemble smoke. Plant leaves and flowers may also cause false detections. Though humans perceive them as distinct from smoke, models face challenges in open scenarios where generalization of performance remains limited. Notably, such errors occur across detection algorithms. Based on statistical metrics and subjective observations, our method exhibits fewer such errors, primarily because of our more diverse training dataset and the transformer model’s robust scene understanding capabilities.
5.2.4. Small Objects
The proposed model demonstrates strong small object detection capabilities. As shown in Figure 9, small objects often appear at long distances (several kilometers or more). Our model effectively detects even faint smoke that is undetectable to human eyes. However, the weak visual cues of small objects mean that detections may drop out in some frames of a video sequence. We recommend: (1) model improvements, such as using high-resolution networks for detail extraction or motion information to compensate for spatial feature limitations, and (2) engineering solutions, such as increasing the focal length to confirm small smoke objects.
5.2.5. Diffused Smoke
Image-based smoke detection relies on color and transparency contrasts between smoke and background. Fully diffused, contour-less smoke is inherently undetectable. Figure 9 shows detection results under varying diffusion levels. Detection capability depends critically on diffusion degree: our CCPE module effectively captures spatial contrast, enabling robust detection even in severe diffusion cases (left). However, completely homogeneous smoke remains undetectable (right).
5.2.6. Divergent Perspectives
To evaluate performance under different viewpoints (particularly vertical overhead), we deployed drones to fly vertically over smoke (As shown in Figure 9(b)). The model consistently detected smoke. This occurs because natural wind-driven smoke moves obliquely upward rather than vertically, making complete smoke contours highly probable across viewpoints.
5.2.7. Lighting Conditions
Since cameras use overhead viewpoints with automatic exposure control, natural daylight intensity has no impact. Our smoke detection model applies only to daytime normal lighting conditions and cannot operate at night. Wildfire detection at night typically employs thermal imaging devices to detect open flames rather than smoke.
6. Conclusion
The performance of the transformer model in wildfire detection is not significantly better than that of CNN-based models, which deviates from the mainstream understanding of general computer vision tasks. Through analysis, we find that the advantage of the transformer model is to capture the long-distance global context, but the ability to model the texture, color, and transparency of images is poor. Furthermore, the main clue of fire smoke discrimination lies in these low-level details. Therefore, this paper proposes a CCPE module to improve the Swin Transformer. Experiments show that the CCPE module can significantly improve the performance of Swin in smoke detection.
Another main difference between smoke detection and general object detection is that the range of smoke is difficult to determine, and it is difficult to distinguish single or multiple instances of smoke. The assignment mechanism of negative instances in traditional object detectors can lead to contradictory supervision signals during training. Therefore, this paper proposes SNSM, which separates positive and negative image samples and uses different mechanisms to sample negative instances to participate in training. Experiments show that SNSM can effectively improve the recall rate of smoke, especially in the image-level and video-level metrics. However, the method proposed in this paper detects smoke based on static images, and through analysis, it is found that with this method, it is difficult to eliminate the interference of cloud and fog between mountains. Building a fire detection model based on spatiotemporal information is the next step.
Further, we released a new early wildfire dataset of real scenes, the SKLFS-WildFire test, which can comprehensively evaluate the performance of wildfire detection models at three levels: bounding box, image, and video. Its publication provides a fair comparison benchmark for future research, whether detection or classification, static-image schemes or video spatiotemporal schemes, and will boost the development of the field of wildfire detection.
Disclosure
The preprint is available on arXiv and can be accessed at https://arxiv.org/pdf/2311.10116.
Conflicts of Interest
The authors declare no conflicts of interest.
Author Contributions
Chong Wang: conceptualization, formal analysis, investigation, methodology, software, validation, visualization, and writing – original draft. Chen Xu: data curation, validation, visualization, and writing – review and editing. Adeel Akram: validation and writing – review and editing. Zhong Wang: project administration. Zhilin Shan: resources. Qixing Zhang: conceptualization, funding acquisition, resources, and supervision.
Funding
This research was supported by National Natural Science Foundation of China, 32471866; Anhui Provincial Science and Technology Major Project, 202203a07020017; Central Government-Guided Local Science and Technology Development Funds of Anhui Province, 2023CSJGG1100; Key Project of Emergency Management Department of China, 2024EMST010101.
Acknowledgments
The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of University of Science and Technology of China. We thank the center for providing computational resources and technical support.
Open Research
Data Availability Statement
The data that support the findings of this study are openly available in Github at https://github.com/WCUSTC/CCPE.