Fast Visual Tracking with Enhanced and Gradient-Guided Network
Abstract
The existing Siamese trackers formulate visual tracking as a cross-correlation operation between the features of two neural network branches. Although they have dominated the tracking field, this pattern causes two main problems. One is the adoption of deep architectures, which drives Siamese trackers to sacrifice speed for performance; the other is that the template is fixed to the initial features, meaning it cannot be updated in a timely manner, making performance entirely dependent on the Siamese network's matching ability. In this work, we propose a tracker called SiamMLG. Firstly, we adopt the lightweight ResNet-34 as the backbone to improve the proposed tracker's speed by reducing computational complexity; then, to compensate for the performance loss caused by the lightweight backbone, we embed the SKNet attention mechanism to filter out valueless features; and finally, we utilize a gradient-guided strategy to update the template in a timely manner. Extensive experiments on four large tracking datasets, including VOT-2016, OTB100, GOT-10k, and UAV123, confirm that SiamMLG satisfactorily balances performance and efficiency: it scores 0.515 on GOT-10k while running at 55 frames per second, nearly 3.6 times the speed of the state-of-the-art method.
1. Introduction
Visual object tracking is a fundamental and crucial topic in computer vision. With its extensive range of applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], it has emerged as a highly sought-after research direction.
Visual tracking has recently evolved along three branches. The first depends on correlation filters [11, 12, 13, 14], and the second mainly utilizes strong deep architectures [15, 16]. Both have their limitations: the first lacks generalization across various scenarios and objects, while the second cannot meet real-time requirements.
As the third branch, Siamese trackers [17, 18, 19, 20] have garnered great attention from the academic community due to their simple structure and acceptable generalization capability. Yet, these Siamese-based trackers still have two major areas for improvement, which might enlarge the gap between academic approaches and industrial applications. Specifically, first, ever since SiamRPN++ [21] broke the limitation that deep neural networks [22] could not serve as the backbone, an increasing number of Siamese trackers have utilized deep neural networks to improve their performance, while ignoring that the trackers are becoming heavy and expensive, as shown in Figure 1. Second, Siamese trackers are usually given the object in the initial frame of the video sequence and adopt this object as the template to perform similarity matching in subsequent frames. Although such a pattern expresses tracking effectively, it ignores the fact that the object may seriously deform in later frames, as shown in Figure 2, because the manually selected object in the initial frame is the only template used throughout the entire tracking process. Some trackers [23, 24] have tried to address the first shortcoming, but their simplistic structures cause performance to drop dramatically. Other methods [25, 26, 27] have tried to update templates through different mechanisms, but they simply focus on combining previous object features while neglecting the discriminative information in the background clutter. These limited solutions leave the main challenge of Siamese tracking unresolved: how to design an effective model that can update the template in a timely manner. They also cause Siamese trackers to struggle in complicated scenarios and leave potential for further improvement.


Inspired by this observation, in this paper, we first adopt a modified, lightweight ResNet-34 [22] as the backbone; as the light version of ResNet-50, it guarantees the tracking speed. However, precisely because ResNet-34 is a lightweight network, it may cause some performance drop. Accordingly, we embed the selective kernel network (SKNet) [28] into the lightweight ResNet-34 to offset the performance loss, giving ResNet-34 a channel-discriminative capability over the input features. Enlightened by GradNet [29], we then utilize the gradients in ResNet-34 to execute template updates. Specifically, we exploit the discriminative information in the gradients, obtained via feed-forward and backpropagation, to complete template updates. Because the gradients are calculated from the final loss, which takes into account both positive and negative candidates, they not only carry discriminative information that reflects the object's deformation but can also distinguish the object from background clutter. Besides, unlike classic gradient-based but hand-designed trackers [30, 31], utilizing the gradients in the backbone to update the templates improves speed by further reducing computation. In summary, the main contributions of this work are as follows:
- (1) We design a lightweight backbone with an embedded attention mechanism based on ResNet-34 to improve tracking speed.
- (2) Utilizing the gradients in the designed backbone, we promptly update the template in our proposed tracker.
- (3) Extensive experiments conducted on four popular datasets demonstrate that our proposed tracker achieves competitive results while running at 55 frames per second (FPS).
The structure of this paper is as follows: Section 2 briefly reviews related works, Section 3 illustrates our proposed tracker, Section 4 gives the results and analysis of the experiments, and Section 5 concludes this paper.
2. Related Works
2.1. Siamese Network-Based Trackers
Since SiamFC [18] introduced AlexNet [32] as the backbone, an increasing number of Siamese trackers, such as SiamRPN [33] and DaSiamRPN [23], have also utilized AlexNet as their backbone, which allows them to run at nearly 100 FPS. Yet these trackers' performance still has room for improvement due to their shallow backbones. Noticing this, SiamRPN++ [21] successfully introduced the deep neural network ResNet-50 [22] into visual object tracking, improving performance further. Since then, while the latest Siamese trackers [34, 35] endeavor to tackle finer-grained issues, such as anchor-free designs [24, 36, 37], unsupervised learning [38], 3D object tracking [39, 40, 41], rotation [42], redetection [43], and nighttime tracking [44], most of them still employ a deep architecture that fails to reach an acceptable balance between performance and speed. SiamSERPN [45] uses the lightweight network MobileNet-v2 [46] as the backbone and designs a squeeze-and-excitation region proposal network to compensate for the performance loss, attempting to attend to both, but it is still incapable of updating the template in a timely manner due to its inherent structure.
2.2. Model Updating in Visual Tracking
Ensuring timely template updating is essential for improving the performance and robustness of trackers. Based on this consideration, DSiam [25] proposes a rapid transformation learning method that facilitates efficient online learning from historical frames. Later, FlowTrack [26] utilizes optical flow information to convert templates and integrate them according to their weights. Yet both of these methods perform template updating by combining templates, which not only drives them to exhibit high computational complexity but also ignores the impact of background clutter, especially when handling large-scale video sequences. Although some gradient-descent-based trackers [30, 31] emerged, they need many training iterations to capture appearance deformation, which makes them less effective and distant from real-time requirements. Later, UpdateNet [47] proposed a compact method to update templates in a timely manner; specifically, it employs the initial, accumulated, and current templates to estimate the optimal template for the subsequent video frame. Although UpdateNet can be integrated into different Siamese trackers, such as SiamFC and DaSiamRPN, prior knowledge about parameter tuning is required. Compared to UpdateNet, our proposed tracker updates the template within an end-to-end network.
3. Proposed Tracker
The structure of our proposed tracker is shown in Figure 3. The upper branch is the template branch, which receives historical frames as the input, and the bottom branch is the search branch, receiving current frames as the input. Both branches are the ResNet-34 embedded with SKNet [28], and the SKNet has three stages (Split, Fuse, and Select), as illustrated in Section 3.3. Then, the template updating strategy also consists of three stages (Initial Embedding, Gradient Calculation, and Template Updating), provided in Section 3.4 in detail.

3.1. Basic Tracker
The previous SiamMask [48] is the first tracker to integrate tracking and semantic segmentation. It has two versions, a three-branch (anchor-based) and a two-branch (fully convolutional) version; we modify the latter and utilize the modified SiamMask as our basic tracker.
Although our modified basic tracker uses the lightweight ResNet-34 embedded with SKNet as the backbone, the working pattern remains the same as the original two-branch SiamMask. Namely, the backbone first extracts the input features and directly conducts cross-correlation operations to obtain the similarity map. Then, the mask is obtained by semantic segmentation according to the scores generated in the similarity map. However, integrating the mask function with visual tracking poses several challenges for the basic tracker. One is that the basic tracker cannot balance acceptable performance and speed; the other is that its performance declines rapidly in complicated scenarios. In short, the robustness of our basic tracker still has room for improvement because it cannot update the template promptly.
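To make this working pattern concrete, below is a minimal sketch of the depthwise cross-correlation step in PyTorch, in which the template features act as convolution kernels slid over the search-region features. The feature shapes are illustrative assumptions rather than the exact ones used by SiamMask.

```python
import torch
import torch.nn.functional as F

# Depthwise cross-correlation: each template channel is correlated with the
# matching search-region channel, giving one response channel per feature
# channel.
def xcorr_depthwise(template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
    b, c, h, w = template.shape
    out = F.conv2d(
        search.reshape(1, b * c, search.shape[2], search.shape[3]),
        template.reshape(b * c, 1, h, w),
        groups=b * c,
    )
    return out.reshape(b, c, out.shape[2], out.shape[3])

z = torch.randn(1, 256, 7, 7)       # template features
x = torch.randn(1, 256, 31, 31)     # search-region features
print(xcorr_depthwise(z, x).shape)  # torch.Size([1, 256, 25, 25])
```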
3.2. Siamese Lightweight Backbone
The backbone of our basic tracker consists of two identical, parameter-shared ResNet-34 networks, but unlike the version utilized in other computer vision tasks, the ResNet-34 we adopt is modified. Specifically, following the previous work SiamRPN++ [21], we reduce the original stride from 32 to 8 to enhance the resolution of the features, allowing SiamMLG to locate small objects more accurately.
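As a rough illustration, the stride reduction can be sketched as follows in PyTorch. This is a simplified assumption of the modification: torchvision's ResNet-34 is used, and the dilated convolutions that SiamRPN++ adds to preserve the receptive field are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision

# Set the spatial strides of the first blocks of stages 3 and 4 to 1, so the
# total stride of the network drops from 32 to 8 (conv1 and maxpool contribute
# a factor of 4, stage 2 contributes the remaining factor of 2).
def build_stride8_resnet34() -> nn.Module:
    net = torchvision.models.resnet34()
    for stage in (net.layer3, net.layer4):
        stage[0].conv1.stride = (1, 1)          # main path of the first block
        stage[0].downsample[0].stride = (1, 1)  # 1x1 conv on the shortcut
    return nn.Sequential(*list(net.children())[:-2])  # drop avgpool and fc

backbone = build_stride8_resnet34()
search = torch.randn(1, 3, 255, 255)
print(backbone(search).shape)  # torch.Size([1, 512, 32, 32]), i.e., stride 8
```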
Moreover, we crop the historical images that are input to the template branch. Precisely, we retain the central 7 × 7 region as the template feature and fill the remaining areas with the average RGB values, effectively capturing the entire target region while alleviating the substantial computational load [18].
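The cropping itself can be sketched as below; this is a hypothetical NumPy version of the SiamFC-style crop [18], where pixels falling outside the frame are filled with the per-image average RGB value (the function name and sizes are illustrative).

```python
import numpy as np

# Crop a square window centered on (cx, cy); regions outside the image are
# filled with the average RGB value of the frame.
def crop_with_mean_fill(image: np.ndarray, cx: int, cy: int, size: int) -> np.ndarray:
    out = np.tile(image.mean(axis=(0, 1)), (size, size, 1)).astype(image.dtype)
    half = size // 2
    x0, y0 = cx - half, cy - half
    ix0, iy0 = max(x0, 0), max(y0, 0)  # intersection with the image bounds
    ix1 = min(x0 + size, image.shape[1])
    iy1 = min(y0 + size, image.shape[0])
    out[iy0 - y0 : iy1 - y0, ix0 - x0 : ix1 - x0] = image[iy0:iy1, ix0:ix1]
    return out

frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
template_patch = crop_with_mean_fill(frame, cx=600, cy=50, size=127)
print(template_patch.shape)  # (127, 127, 3)
```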
Owing to its computational complexity of only 3.6 × 10⁹ FLOPs, ResNet-34 is more efficient than the deeper ResNet-50 (3.8 × 10⁹ FLOPs) [22]. At the same time, having a deeper structure than the shallow AlexNet [32] and ResNet-18, the ResNet-34 we adopt can extract more information. This ability to balance performance and efficiency makes ResNet-34 suitable as the backbone of the proposed tracker.
3.3. The Embedded Attention Mechanism
Attention mechanisms have been widely adopted in computer vision tasks, especially image classification, to allow models to focus adaptively on the valuable features among rough input features. Inspired by this, we introduce SKNet, a branch attention mechanism, into the tracking field to compensate for the performance drop caused by adopting the lightweight ResNet-34 as the backbone of our tracker. The introduced SKNet has three stages, Split, Fuse, and Select, and its structure is shown in Figure 4. Note that the illustration in this subsection takes the template branch as an example.

3.3.1. Split Stage
The rough feature ξ(z) input from the template branch is decomposed into two copies by convolution kernels of size 3 × 3 and 5 × 5, in which the 5 × 5 kernel is realized as a dilated convolution [49] for efficiency; the obtained features are denoted by L1 and L2, respectively.
3.3.2. Fuse Stage
In the standard SKNet design [28], the two branch features are fused by element-wise summation, and global average pooling squeezes the fused feature into a channel-wise statistic, which a fully connected layer then compresses into a compact feature that guides the selection.
3.3.3. Select Stage
Guided by the compact feature, a channel-wise attention vector is computed for each branch, a softmax across the branches normalizes the two vectors, and the final feature is obtained as the attention-weighted sum of L1 and L2.
Compared to SENet [50], the SKNet we adopt possesses fewer parameters and keeps the original structure of the backbone network. Therefore, adopting SKNet as the filter has the potential to enhance the performance of the tracker significantly.
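For reference, the three stages can be summarized in the following sketch of a two-branch SK unit. It follows the published SKNet design [28], while the channel count and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        d = max(channels // r, 32)  # size of the compact feature z
        # Split: a 3x3 kernel and a dilated 3x3 (effective 5x5 receptive field)
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Fuse: global statistics -> compact feature z
        self.fc_z = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        # Select: one attention vector per branch, softmax across branches
        self.fc_a = nn.Linear(d, channels)
        self.fc_b = nn.Linear(d, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        l1, l2 = self.branch3(x), self.branch5(x)      # Split
        s = (l1 + l2).mean(dim=(2, 3))                 # Fuse: GAP over H, W
        z = self.fc_z(s)
        ab = torch.stack([self.fc_a(z), self.fc_b(z)], dim=1)
        ab = torch.softmax(ab, dim=1)                  # Select: a_c + b_c = 1
        a = ab[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = ab[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * l1 + b * l2                         # weighted aggregation

feats = torch.randn(2, 256, 16, 16)
print(SKUnit(256)(feats).shape)  # torch.Size([2, 256, 16, 16])
```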
3.4. Gradient-Guided Strategy
In the two ResNet-34 networks that serve as the backbone of our tracker, we implement the gradient-guided strategy on the template branch. By utilizing gradient information in a timely manner, our proposed method executes template updates and improves performance. Similar to the SKNet described in Section 3.3, the gradient-guided strategy is also divided into three stages: Initial Embedding, Gradient Calculation, and Template Updating.
3.4.1. Initial Embedding Stage
3.4.2. Gradient Calculation Stage
At last, the updated template features U(z) are passed to the next stage to generate the optimal template.
3.4.3. Template Updating Stage
Specifically, in the first stage, SiamMLG performs a convolution operation using the refined features to generate the score map P. In the second stage, SiamMLG employs the generated score map P and the training label M to calculate the initial loss l, backpropagates it through the network D2 to obtain the gradient information G, and combines G with the refined template features Ω(z) to generate the updated template features U(z). In the last stage, the updated template features U(z) are input into the template branch to obtain the optimal template Qbest, and the final similarity map P′ is obtained by convolving the optimal template Qbest with the refined search-region features Ω(x). By repeating this process, SiamMLG achieves timely template updates and eventually completes the object tracking.
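A minimal sketch of one such update step is given below. Note that it is a simplification under stated assumptions: the sub-network that maps gradients to a template adjustment (D2 above) is reduced to a single gradient step with a hand-set rate, and a binary cross-entropy loss stands in for the tracker's actual loss.

```python
import torch
import torch.nn.functional as F

def update_template(template_feat, search_feat, label, step=1e-3):
    q = template_feat.clone().requires_grad_(True)            # refined template Ω(z)
    score = F.conv2d(search_feat, q)                          # cross-correlation -> P
    loss = F.binary_cross_entropy_with_logits(score, label)   # initial loss l
    (grad,) = torch.autograd.grad(loss, q)                    # gradient information G
    return (q - step * grad).detach()                         # updated template U(z)

# Toy shapes: template 256x7x7, search region 256x31x31, score map 1x25x25.
t = torch.randn(1, 256, 7, 7)
s = torch.randn(1, 256, 31, 31)
y = torch.zeros(1, 1, 25, 25)
t_new = update_template(t, s, y)
```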
Compared to UpdateNet [47], our proposed tracker utilizes backpropagated information to update the template in a timely manner, which reduces time consumption. Besides, unlike GradNet [29], our SiamMLG is based on SiamMask, which has more competitive performance and robustness than the early SiamFC, making our tracker stable in complex scenarios.
4. Experiments and Analysis
Our comparison experiments are conducted on four extensive and mainstream datasets: VOT-2016 [51], OTB100 [52], UAV123 [53], and GOT-10k [54]. In addition, we conduct ablation experiments on the UAV123 dataset, using SiamFC++ as the benchmark tracker for comparison. Before providing the results and analysis, we describe the training datasets adopted for our tracker and the implementation details of our experiments in Section 4.1 and Section 4.2, respectively.
4.1. Training Dataset
We adopt a pretrained backbone to accelerate convergence and further improve the tracker's generalization to general objects in visual tracking. Namely, we use as the backbone a ResNet-34 trained on several large datasets, including ImageNet-DET [55], ImageNet-VID [56], LaSOT [57], and the GOT-10k training split. In both training and testing, we use single-scale images of 127 × 127 pixels for template patches and 255 × 255 pixels for search regions.
4.2. Implementation Details
4.2.1. Computer Configuration
Our SiamMLG is implemented on a PC with the following configuration: Intel Xeon(R) E5-2667 v3 (dual CPU), NVIDIA RTX 2080 Ti (GPU), 32 GB RAM, and Ubuntu 18.04 LTS (OS).
4.2.2. Operation Settings
The proposed tracker uses the SGD optimizer to optimize the loss function for a total of 70 epochs, where the initial learning rate of 0.001 is applied over the first 20 epochs through a warm-up mechanism, and the learning rate decays to 0.0005 for the subsequent epochs. The training batch size is set to 64, the weight decay to 0.0005, and the momentum to 0.9.
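In PyTorch terms, this setup corresponds roughly to the following sketch; since the exact warm-up curve is not specified above, a simple linear warm-up is assumed here.

```python
import torch

model = torch.nn.Conv2d(3, 8, 3)  # stands in for the full tracker
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)

def lr_at_epoch(epoch: int) -> float:
    if epoch < 20:                        # warm-up phase, epochs 0-19
        return 0.001 * (epoch + 1) / 20   # linear ramp to the initial rate
    return 0.0005                         # decayed rate afterwards

for epoch in range(70):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    # ... iterate over batches of size 64 and call optimizer.step() ...
```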
4.3. Comparison Experiments
4.3.1. VOT-2016 Dataset
We test the proposed tracker on the VOT-2016 dataset and compare it with several mainstream Siamese trackers, including SiamRPN [33], DaSiamRPN [23], and SiamMask [48]. The VOT-2016 dataset includes 60 video sequences containing many complex scenarios and object deformations. Its evaluation protocol comprises Accuracy, Robustness, and Expected Average Overlap (EAO), where Robustness is based on the failure rate (the higher the value, the more unstable the tracker) and EAO, which combines Accuracy and Robustness, is the primary metric for evaluating performance. The comparison results on the VOT-2016 dataset are shown in Table 1 and Figure 5.

Trackers | MLDF | SSAT | TCNN | CCOT | SiamFC | SiamRPN | DaSiamRPN | SiamMask | SiamMLG |
---|---|---|---|---|---|---|---|---|---|
Accuracy | 0.490 | 0.577 | 0.554 | 0.539 | 0.568 | 0.618 | 0.612 | 0.623 | 0.632 |
Robustness | 0.233 | 0.291 | 0.268 | 0.238 | 0.262 | 0.238 | 0.221 | 0.233 | 0.182 |
EAO | 0.311 | 0.321 | 0.325 | 0.331 | 0.387 | 0.393 | 0.411 | 0.412 | 0.430 |
- The best two results are highlighted in italics and bold fonts.
In the results, our proposed tracker achieves the best scores, 0.430 on EAO and 0.632 on Accuracy, exceeding SiamMask by 4.2% and 1.5%, respectively. The core reason is that our tracker, based on SiamMask, embeds SKNet to filter valueless features and adopts the gradient-guided strategy to update the template in a timely manner. Despite our tracker's backbone being the lightweight ResNet-34, the embedded SKNet and the adopted gradient-guided strategy still enhance our tracker's performance. It is also worth noting that our tracker is the most robust due to its ability to update templates promptly, improving on SiamMask by 22%.
4.3.2. OTB100 Dataset
The OTB100 dataset, which contains 100 video sequences with different challenging factors, was first presented as the OTB50 dataset in 2013 and was supplemented with 50 additional video sequences in 2015. Its evaluation methods are success plots and precision plots, and tracker performance is ranked by the area under the curve (AUC) of the generated plots. In this comparison experiment, we compare our proposed tracker with several trackers, including Ocean [58] and SiamFC++ [24]; the results are shown in Figure 6.

In the success and precision plots, our SiamMLG achieves the best scores in both. Compared to SiamFC++, our tracker leads by 1.1% and 1.7% in the success and precision plots, respectively. Though our tracker and SiamFC++ are both lightweight trackers, ours performs better because SKNet filters out useless features entering the backbone. Also, compared to GradNet [29], which likewise updates the template through a gradient-guided strategy, our tracker achieves improvements of 6.9% in success and 4.3% in precision. The reason is that our tracker adopts the modified SiamMask, whose backbone is ResNet-34, as the base tracker, whereas GradNet builds on the earlier SiamFC. In addition, compared with other Siamese trackers such as SiamRPN and DaSiamRPN, which rely on the template from the initial frame for similarity matching, our SiamMLG is more capable of handling the challenging factors in the OTB100 dataset.
4.3.3. GOT-10k Dataset
The GOT-10k dataset was proposed by the Chinese Academy of Sciences (CAS) in 2018, and it includes a large and highly diverse collection of wild objects, with over 10,000 video segments containing real-world moving objects. To ensure fairness in the evaluation protocol, GOT-10k contains two splits, one for training and one for testing, with zero overlap between their classes. Furthermore, trackers must upload their results to the official website, which computes the performance scores through official evaluation. The official metrics comprise average overlap (AO), success rate (SR), and FPS. AO denotes the average overlap between all predicted bounding boxes and the ground-truth boxes; SR0.5 is the ratio of successfully tracked frames whose overlap exceeds 0.5, while SR0.75 is the ratio of frames whose overlap surpasses 0.75.
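The relation between these metrics can be illustrated with a short sketch that computes AO, SR0.5, and SR0.75 from per-frame overlaps; the IoU values used here are made-up examples.

```python
import numpy as np

# Compute the GOT-10k style metrics from per-frame IoU values (overlaps)
# between predicted and ground-truth boxes.
def ao_and_sr(overlaps: np.ndarray):
    ao = overlaps.mean()            # average overlap
    sr50 = (overlaps > 0.5).mean()  # success rate at threshold 0.5
    sr75 = (overlaps > 0.75).mean() # success rate at threshold 0.75
    return ao, sr50, sr75

ious = np.array([0.82, 0.64, 0.48, 0.91, 0.77])
print(ao_and_sr(ious))  # (0.724, 0.8, 0.6)
```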
We compare our SiamMLG with several trackers, including CGACD, DaSiamRPN, and SiamMask; the comparison results are shown in Table 2. Compared to the original SiamMask, which uses the deep ResNet-50 architecture, our tracker outperforms it on all metrics except SR0.75, where it scores slightly lower; on the FPS metric in particular, which expresses the speed, our tracker is approximately 3.6 times faster than the deep SiamMask. The core reason is that our tracker utilizes the lightweight ResNet-34 as the backbone to reduce the parameters and improves performance by embedding SKNet to filter out useless features. Though DaSiamRPN's speed is almost 2.4 times ours, its performance is considerably weaker. Like the original SiamMask, our tracker is slightly behind CGACD on the SR0.75 metric but substantially ahead in speed. Moreover, we also compare against the state-of-the-art tracker SiamRCNN [43], a two-stage tracker. It performs better than ours because it can detect the trajectory of the object while tracking. Yet, precisely because SiamRCNN is a two-stage tracker, its speed is weighed down by the heavy model, reaching only 2.79 FPS, as shown in Table 2. In comparison, our tracker runs at around 55 FPS, about 19 times faster than SiamRCNN. The explanation is that our tracker adopts the more lightweight ResNet-34 as the backbone, and the lighter backbone makes our method more real-time. Within an acceptable range of performance loss, the comparison experiments on the GOT-10k dataset show that our tracker balances satisfactory performance and acceptable speed.
Trackers | MDNet | DaSiamRPN | THOR | SiamRPN | CGACD | SiamMask | SiamMLG | SiamRCNN |
---|---|---|---|---|---|---|---|---|
mAO | 0.299 | 0.444 | 0.447 | 0.463 | 0.511 | 0.514 | 0.515 | 0.649 |
mSR0.5 | 0.303 | 0.536 | 0.538 | 0.549 | 0.612 | 0.587 | 0.604 | 0.728 |
mSR0.75 | 0.100 | 0.220 | 0.204 | 0.253 | 0.323 | 0.366 | 0.352 | 0.597 |
FPS | 1.52 | 134.40 | 1.00 | 74.46 | 37.73 | 15.37 | 55.63 | 2.79 |
- The top 2 results are highlighted in italics and boldface fonts.
4.3.4. UAV123 Dataset
The UAV123 dataset holds 123 video segments captured by UAVs from high altitudes, with over 110k frames. It is mainly characterized by challenging factors such as small objects and occlusion, which makes it demanding for the trackers tested. Additionally, all videos in the dataset are fully annotated with upright bounding boxes. The evaluation metrics of the UAV123 dataset are the same as those of the OTB100 dataset, namely success plots and precision plots.
We compare our proposed SiamMLG with several mainstream trackers; the comparison results are shown in Figure 7. The proposed tracker scores 0.621 in the success plots, ranking first, and 0.810 in the precision plots, ranking second. Compared with the latest tracker SiamFC++, which adopts a shallow neural network as the backbone, our tracker improves by 0.7% in the success plots and 1.4% in the precision plots. The core reason for this result is that our SiamMLG's backbone is ResNet-34, which has a deeper architecture than the shallow AlexNet and discriminates adaptively among the rough features. Besides, compared with CGACD, a two-stage tracker, although our proposed tracker does not perform as well in the precision plots, our SiamMLG still takes the lead in specific challenging scenarios, indicating that our tracker gains robustness from the template updating strategy.

4.4. Ablation Study
We conduct an ablation study on the UAV123 dataset to investigate the contribution of each improvement by designing multiple variants of the proposed tracker. The results are presented in Table 3.
Tracker | Compare operations | Success | ΔSuc (%) | Precision | ΔPre (%) |
---|---|---|---|---|---|
SiamFC++ | — | 0.617 | — | 0.799 | — |
Variant 1 | Only adopts ResNet-34 as the backbone | 0.601 | −2.6 | 0.787 | −1.6 |
Variant 2 | Original backbone + gradient-guided strategy | 0.606 | −1.8 | 0.797 | −0.3 |
Variant 3 | Backbone with SKNet | 0.608 | −1.5 | 0.798 | −0.2 |
SiamMLG | Backbone with SKNet + gradient-guided strategy | 0.621 | +0.7 | 0.810 | +1.4 |
- ΔSuc and ΔPre denote the relative improvement in success and precision, and the latest tracker SiamFC++ is used as the benchmark method. Italic values are the results of SiamMLG on the UAV123 dataset.
Variant 1 only uses the lightweight ResNet-34 as the backbone. Compared with SiamFC++, this variant is 2.6% and 1.6% behind in success and precision, respectively. The reason is that the lightweight ResNet-34 backbone, as a relatively shallow architecture, inherently has weaker feature extraction capability.
Variant 2 utilizes the original backbone but performs the gradient-guided strategy to update the template. Although it performs slightly better than variant 1, it is still weaker than SiamFC++. The reason might be the quality branch proposed by SiamFC++, which contributes more to performance than our strategy alone.
Variant 3 uses the backbone embedded with SKNet but does not perform the gradient-guided strategy. Since SKNet allows the backbone to discriminate adaptively among the channels of the rough features, it slightly improves performance over the first two variants. Nevertheless, this variant is still inferior to SiamFC++.
After combining all the improvements, our proposed tracker outperforms SiamFC++ by 0.7% in success and 1.4% in precision. This is because SKNet filters out the valueless features from the rough features, allowing the retained valuable ones to be updated through the gradient-guided strategy. As shown in Figure 8, our proposed SiamMLG is robust and produces more accurate bounding boxes.

5. Conclusion
In this paper, we argue that current Siamese trackers still suffer from two main problems: one is that a deep neural network serves as the backbone, dragging down the trackers' speed; the other is that the templates cannot be updated in a timely manner, limiting the trackers' performance. To address these problems, we proposed a Siamese tracker named SiamMLG based on the modified two-branch SiamMask. Specifically, our proposed tracker adopts the lightweight ResNet-34 embedded with SKNet as the backbone, in which the lightweight neural network enhances the tracker's speed and the attention mechanism compensates for the performance loss induced by the lightweight backbone by retaining valuable features. On this basis, our proposed tracker promptly updates the template by using the valuable features and gradient-guided information, which allows it to perform stably. Extensive comparison and ablation experiments on four large datasets, VOT-2016, OTB100, GOT-10k, and UAV123, show that our tracker achieves competitive performance while running at nearly 55 FPS. However, the proposed tracker can only reliably track 2D objects in daytime scenarios; in more complex nighttime scenarios, SiamMLG tends to degrade in performance.
Conflicts of Interest
The authors have no competing interests to declare that are relevant to the content of this article.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 62272063, No. 62072056, and No. 61902041); the Natural Science Foundation of Hunan Province (No. 2022JJ30617 and No. 2020JJ2029); the Standardization Project of Transportation Department of Hunan Province (B202108); the Hunan Provincial Key Research and Development Program (2022GK2019); and the Scientific Research Fund of Hunan Provincial Transportation Department (No. 202042).
Open Research
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.