MalFSLDF: A Few-Shot Learning-Based Malware Family Detection Framework
Abstract
The evolution of malware has led to the development of increasingly sophisticated evasion techniques, significantly escalating the challenges for researchers in obtaining and labeling new instances for analysis. Conventional deep learning detection approaches struggle to identify new malware variants with limited sample availability. Recently, researchers have proposed few-shot detection models to address the above issues. However, existing studies predominantly focus on model-level improvements, overlooking the potential of domain adaptation to leverage the unique characteristics of malware. Motivated by these challenges, we propose a few-shot learning-based malware family detection framework (MalFSLDF). We introduce a novel method for malware representation using structural features and a feature fusion strategy. Specifically, our framework employs contrastive learning to capture the unique textural features of malware families, enhancing the identification capability for novel malware variants. In addition, we integrate entropy graphs (EGs) and gray-level co-occurrence matrices (GLCMs) into the feature fusion strategy to enrich sample representations and mitigate information loss. Furthermore, a domain alignment strategy is proposed to adjust the feature distribution of samples from new classes, enhancing the model’s generalization performance. Finally, comprehensive evaluations on the MaleVis and BIG-2015 datasets show significant performance improvements in both 5-way 1-shot and 5-way 5-shot scenarios, demonstrating the effectiveness of the proposed framework.
1. Introduction
The exponential growth of malware poses a significant threat to cybersecurity. According to a recent AV-TEST report [1], more than 307,000 new malware samples are generated daily, including numerous newly evolved malware families. This issue is further exacerbated by the increasing complexity of modern computing environments, where diverse devices and systems are interconnected. Many devices, particularly those with limited computational resources and inadequate security measures, are highly vulnerable to malware attacks. Therefore, accurate malware detection is critical for ensuring robust cybersecurity.
Traditional malware detection methodologies predominantly employ static, dynamic, and hybrid analysis techniques to extract discriminative features from samples, which are subsequently classified using machine learning or deep learning models [2]. Nevertheless, static analysis is inherently limited by its vulnerability to obfuscation techniques, including packing and encryption, whereas dynamic analysis demands substantial computational resources and time. Recent advances in visualization-based classification [3, 4] have introduced a novel paradigm by converting malware binaries into visual representations, enabling direct analysis through neural networks. The visualization-based approach [5] not only simplifies the classification process but also circumvents many limitations associated with traditional methods. However, many detection models [6–9] require large-scale datasets to achieve high accuracy, which is often impractical given the limited availability of labeled malware samples. This situation exemplifies the few-shot learning (FSL) problem and is a key challenge in modern malware detection.
FSL, as outlined in research [10], addresses data scarcity by leveraging prior knowledge, such as domain expertise or data from related tasks, and employs techniques such as data augmentation, model regularization, and pretraining followed by fine-tuning to enhance classification performance. Current FSL approaches focus on three key components: data, models, and algorithms [11]. Data augmentation methods, including translation, flipping, cropping, and rotation, as well as advanced techniques such as label propagation [12] and generative adversarial networks (GANs) [13], are commonly used to generate synthetic samples. Transfer learning approaches [14] train models on a source domain with abundant labeled data, using learned model parameters or features as prior knowledge to initialize and fine-tune the model in a target domain with limited labeled data. Algorithmic limitations are mainly addressed using meta-learning approaches [15] involving feature parameter optimization, model architecture refinement, and similarity measurement techniques. In malware family classification, FSL can extract essential feature patterns from labeled datasets and apply them to small sets of newly captured malware, facilitating feature extraction and the classification of new malware samples into corresponding families [16–19].
Despite significant advancements in FSL methods, several limitations persist that impede their effectiveness in malware detection. First, many commonly used FSL data augmentation methods face domain misadaptation issues in the malware field. For instance, flipping, cropping, rotation, and mixup often compromise the semantic coherence of malware grayscale images [20]. This distortion obscures critical family-specific patterns, undermining the model’s ability to discern discriminative features and resulting in suboptimal classification performance. Moreover, existing data preprocessing techniques that standardize sample sizes frequently lead to the loss of critical information, thereby impairing the model’s capacity to capture and learn intrinsic data characteristics. Furthermore, existing meta-learning–based FSL methods typically assume that the distribution of pretraining samples with known labels aligns with that of new class samples [21, 22]. This assumption, known as distribution consistency, often fails in practical applications, especially in few-shot malware classification, where the distribution of new class samples is unknown in advance. These issues collectively degrade the generalization performance of FSL methods.
To address the aforementioned issues, we propose a few-shot learning-based malware family detection framework (MalFSLDF). Specifically, our approach addresses domain misadaptation by introducing a novel method for malware representation using structural features and a feature fusion strategy. We propose a contrastive learning (CL)–based feature extraction method for malware grayscale images under few-shot conditions. We also employ augmentation techniques such as Gaussian blur to retain semantic information while improving the model’s ability to identify malware variants. In addition, we introduce a multifeature construction and fusion strategy to mitigate information loss during data processing, extracting features such as entropy graphs (EGs) and gray-level co-occurrence matrices (GLCMs) from original malware grayscale images. These features are fused with representations obtained through CL, enriching the information of the samples and facilitating the model’s ability to capture the salient characteristics of malware. Finally, we propose a domain alignment (DA) method to address distribution inconsistency. This method adjusts the feature distribution of new class malware samples to align with that of known samples from the training set, thereby improving the model’s generalization performance. The main contributions of this work are summarized as follows:
- We propose a novel few-shot visualization-based malware family detection framework, MalFSLDF, which improves existing FSL methods by incorporating malware-specific enhancements in data augmentation and malware representation, thus achieving efficient few-shot malware detection.
- We propose a novel malware representation method based on structural features. Specifically, we employ CL to extract the texture features of malware grayscale images and integrate GLCM and EG through a feature fusion strategy, achieving a comprehensive representation of malware samples.
- We utilize a DA strategy to adjust the feature distribution of samples from new classes, thereby enhancing the model’s generalization performance.
- Comprehensive evaluations on the MaleVis and BIG-2015 datasets demonstrate significant performance improvements in both 5-way 1-shot and 5-way 5-shot scenarios, validating the effectiveness of the proposed framework.
The rest of this paper is organized as follows. Section 2 summarizes related work. Section 3 provides preliminaries and the problem formulation. Section 4 details the proposed method. Section 5 presents experimental results. Section 6 discusses the findings. Section 7 concludes the study and outlines future work.
2. Related Work
2.1. Visualization-Based Malware Detection
Traditional malware detection methods utilize static, dynamic, or hybrid analysis techniques to extract features, applying deep learning or machine learning models to classify malware families. Detection techniques that rely on dynamic and static analyses depend heavily on known behavioral patterns, making it challenging to detect variants of malware with unique behaviors. In contrast, visual techniques provide an advantage by transforming complex patterns into visual representations. This allows for quick and effective identification of various malware patterns, making them particularly useful for detecting malware variants.
Nataraj et al. [5] pioneered research in this domain by observing that malware grayscale images from the same family share similar layouts and textures. Based on these visual similarities, they proposed an image-based method for malware family classification. Xiao et al. [23] introduced colored label boxes (CoLabs) to mark different segments of PE files, highlighting the distribution of certain parts within transformed malware images. Jeon et al. [6] used Shannon entropy to detect obfuscation, followed by dynamic analysis to extract application programming interface (API) call sequences, which were converted into images for detection. Later, they proposed Mal3S [7], extracting bytes, opcodes, API calls, strings, and dynamic-link libraries (DLLs) via static analysis to generate five image types, trained with a Multi–SPP-net model for malware detection.
Sharma et al. [8] introduced MIGAN, a GAN–based model, to address class imbalance by generating high-quality synthetic malware images, achieving strong classification performance. Vasan et al. [24] combined visualization, feature decomposition, and broad learning systems to improve malware detection, leveraging random vector functional link neural networks for enhanced classification. He et al. [25] developed an attention mechanism based on the ResNeXt model, improving classification accuracy by capturing channel-specific fields in malware images. Research [26] extracted features from assembly files, converting them into binary form and creating images from the generated binary representation. Zhong et al. [9] proposed a method that utilizes self-similarity techniques to extract local semantics and similarities within binary malware blocks while maintaining correlations between these blocks. Zhang et al. [27] designed a deep polarized network based on ResNet50, maximizing the Hamming distance between hash values of malware samples from different families.
In summary, existing research demonstrates technical strengths in image processing, feature extraction, and model algorithms. However, significant challenges persist, including high computational resource demands, processing complexity, and limited scalability, particularly when addressing obfuscated malware. In addition, deep learning models rely heavily on large labeled datasets for effective classifier training, raising concerns about overfitting with limited labeled samples. Therefore, this research focuses on malware detection in few-shot scenarios.
2.2. Few-Shot Malware Detection
Most existing few-shot malware detection research directly employs classical FSL methods [28–30], such as prototypical networks (ProtoNets) [16], matching networks [17], Siamese networks [18], and memory-augmented networks. However, these methods lack generalizability, showing inconsistent performance across different datasets.
Some recent approaches combine various data augmentation and meta-learning techniques. Chai et al. [21] proposed a dynamic prototype network based on sample adaptation, using dynamic convolution for sample-adaptive dynamic feature extraction. Class features (prototypes) are defined as the mean of dynamically embedded malware samples in the support set. The method also introduces a dual-sample dynamic activation function to minimize the impact of irrelevant features between samples, and a metric-based approach calculates the distance between query samples and prototypes for malware detection. Conti et al. [19] introduced a few-shot malware classification method utilizing malware feature visualization, representing malware binaries as three-channel images. They employed convolutional Siamese neural networks and shallow FSL architectures to address the retraining issues of traditional deep learning classifiers.
Barros et al. [22] proposed Malware-SMELL, which incorporates two data representation spaces: latent feature space and similarity space (S-Space). In S-Space, a kernel-based discrete Cauchy distribution is used to quantify the similarity between input pairs, measuring the distance between labels and new data representations to address the challenges of the FSL paradigm. Liu et al. [31] proposed A2-CLM, a self-supervised malware detection framework that integrates adversarial heterogeneous graph augmentation with CL. By evaluating the security semantics of each behavior, A2-CLM constructs a heterogeneous graph of malware execution contexts. Multiple adversarial attacks, including attribute masking attacks, meta-graph–guided sampling attacks, direct system call attacks, and obfuscation attacks, are designed to generate more robust contrastive pairs. In addition, a momentum strategy is proposed to train multiple graph encoders, mitigating the workload associated with CL.
In summary, current few-shot malware detection methods still face several limitations. First, existing approaches necessitate extensive feature engineering and substantial expert knowledge. Second, conventional FSL techniques that are not directly applicable to malware detection may disrupt the texture features of malware samples, leading to confusion, introducing noise, and reducing model accuracy. Furthermore, most model improvement strategies do not adapt to the unique characteristics of malware, leading to inconsistent performance across datasets and limited generalization capabilities. Hence, this study proposes enhancements in data augmentation, parameter optimization, and similarity measurement.
3. Preliminaries and Problem Formulation
3.1. Processing of the Malware Grayscale Image
This section presents the conversion of source binary malware files into grayscale images. As shown in Figure 1, the original binary file is stored in byte format, with the data starting address on the left side of each line in 8 bits and the file content represented in hexadecimal on the right.

The process of converting bytecode to grayscale images is depicted in Figure 2. Specifically, hexadecimal numbers are initially converted to binary numbers, subsequently grouped into 8-bit vectors, and then transformed into decimal integers within the range of 0–255. This process generates a grayscale image, where 0 corresponds to black and 255 corresponds to white.
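As a concrete illustration, the sketch below converts a binary file into a grayscale image following this byte-to-pixel mapping; the fixed row width and the use of NumPy/Pillow are illustrative assumptions rather than details specified above.

```python
import numpy as np
from PIL import Image

def binary_to_grayscale(path, width=256):
    """Map each byte (0-255) of a malware binary to one grayscale pixel.

    The fixed `width` is an illustrative assumption; actual widths vary with
    file size before the images are resized for the network.
    """
    data = np.fromfile(path, dtype=np.uint8)           # one byte -> one pixel value
    height = len(data) // width                        # drop the trailing partial row
    pixels = data[: height * width].reshape(height, width)
    return Image.fromarray(pixels, mode="L")           # "L" = 8-bit grayscale, 0 black, 255 white
```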

As illustrated in Figure 3, malware samples from the same family appear visually similar in grayscale images, whereas those from different families are distinct. This phenomenon is due to the reuse of malware construction patterns within families, with minimal changes to texture features. Thus, image classification can effectively distinguish malware categories based on grayscale images.

In addition, grayscale malware images must be standardized to be adapted for neural network analysis. Motivated by the work [32], this study uses bilinear interpolation to resize images to 224 × 224 pixels, which optimally preserves texture features. Figure 4 shows the transformation effect of the Dontovo family images, where A1, A2, and A3 are the original images with sizes of 64 × 456, 128 × 244, and 128 × 320, respectively, and B1, B2, and B3 are the corresponding transformed grayscale images, all with a size of 224 × 224.
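A minimal resizing step consistent with this description is shown below; the 224 × 224 target size follows the text, while the use of Pillow’s bilinear filter is an implementation assumption.

```python
from PIL import Image

def resize_to_224(img: Image.Image) -> Image.Image:
    # Bilinear interpolation to the fixed 224 x 224 input size expected by the backbone.
    return img.resize((224, 224), resample=Image.BILINEAR)
```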

3.2. CL
CL is a critical form of self-supervised learning, which learns effective data representations by constructing and comparing positive and negative sample pairs. Positive samples are typically different views or augmented versions of the same data, whereas negative samples come from views of different data points. The training objective is to bring positive samples closer in the feature space while pushing negative samples further apart. This method enables the model to learn features that differentiate between categories, allowing it to acquire useful data representations without labeled data.
Specifically, for each original image x, two augmented samples x_i and x_j are generated, and a neural network, such as a convolutional neural network, is used as an encoder f to encode the two augmented samples into corresponding feature representations h_i and h_j. Further projection layers enhance the quality of these learned features, producing more compact and discriminative feature representations. Contrastive loss ensures that the representations of the two augmented samples are close in the feature space. This approach does not require label information and relies entirely on the intrinsic structure of the input data to learn meaningful representations.
3.3. Problem Formulation
In traditional deep learning frameworks, datasets are divided into training and test sets, which share the same category space. However, in FSL scenarios, the training and test sets contain different category spaces. The test set is further divided into support and query sets, which share a common category space. As shown in Figure 5, different colored squares represent different malware families.

The core of FSL lies in using a large labeled training set and a small labeled support set to classify unlabeled query samples. In the context of malware classification, the training set consists of malware samples from known categories. In contrast, the support set of the test set includes a small number of new category samples, and the query set contains the new samples to be classified.
First, the model learns transferable knowledge from the training set, such as a feature extraction function f(·), and then leverages this knowledge to extract features from new test samples. It combines the small sample set in the support set (S) and their labels to classify the samples in the query set (Q). In typical FSL settings, if the support set consists of N categories and K samples per category, this is called an N-way K-shot problem. Generally, N and K are small values, reflecting the nature of FSL, where the model must recognize N categories based on a limited set of N × K samples.
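To make the N-way K-shot protocol concrete, the sketch below samples one episode (support set S and query set Q) from a class-indexed sample pool; the data layout and the 15-query default are illustrative assumptions.

```python
import random

def sample_episode(pool, n_way=5, k_shot=1, n_query=15):
    """Draw one N-way K-shot episode from {class_label: [samples]}."""
    classes = random.sample(list(pool.keys()), n_way)   # pick N novel classes
    support, query = [], []
    for label, cls in enumerate(classes):
        samples = random.sample(pool[cls], k_shot + n_query)
        support += [(x, label) for x in samples[:k_shot]]   # K labeled shots per class
        query += [(x, label) for x in samples[k_shot:]]     # samples to be classified
    return support, query
```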
In deep learning, classification tasks are typically approached with two learning paradigms: inductive learning and transductive learning. Inductive learning focuses on constructing and training models from training datasets to achieve effective generalization on unseen test data. Conversely, transductive learning leverages the distributional information of test data during training, providing additional guidance to the model despite the absence of labels, but necessitates continuous model updates with new samples. Considering the unpredictability of malware categories and their distribution, coupled with the resource expenditure associated with frequent model updates, this study posits that inductive learning is more suitable for classifying malware.
4. Methodology
4.1. Overview of MalFSLDF
To address the challenges of malware family classification in few-shot scenarios, this paper introduces MalFSLDF. As shown in Figure 6, MalFSLDF comprises four core modules: feature extraction, feature fusion, DA, and classification, each serving as a key component for effective malware detection.

Feature Extraction Module. This module focuses on processing grayscale images of malware, generating a richer representation by profiling malware from multiple dimensions. A CL method tailored for malware is proposed, which enhances the family-level feature representation of grayscale images. The EG and GLCM of the original grayscale images are also constructed as auxiliary features.
Feature Fusion Module. This module assigns weights to the three types of features extracted from the feature extraction module, ensuring that the model comprehensively captures the family-level traits of malware samples. In addition, a custom loss function is designed to allow backpropagation of the feature loss, optimizing the backbone network and generating more accurate feature representations.
DA Module. This module addresses the distribution mismatch between training and test samples, ensuring that knowledge extracted from the source domain can be transferred effectively to the target domain. A feature mapping approach is introduced to assess the correlation between test sample features and the prototype features from the training samples, reducing the gap between their feature distributions.
Classification Module. This module begins by extracting the class mean feature representations of the support set samples. Then, it calculates the distance between the query sample features and the class centers in the support set. Each query sample is classified based on proximity to the nearest class center.
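A minimal sketch of this nearest-class-center rule, assuming the support and query features have already been extracted as tensors, is given below.

```python
import torch

def classify_by_nearest_center(support_feats, support_labels, query_feats, n_way):
    """Assign each query feature to the closest support-class mean (class center)."""
    centers = torch.stack(
        [support_feats[support_labels == c].mean(dim=0) for c in range(n_way)]
    )                                                   # [n_way, d] class centers
    dists = torch.cdist(query_feats, centers)           # Euclidean distance to every center
    return dists.argmin(dim=1)                          # nearest center wins
```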
4.2. CL Enhanced Feature Extraction
This section describes the CL-enhanced feature extraction method. To preserve the texture features of malware grayscale images, a data augmentation strategy based on Gaussian blur and noise injection is designed to generate the high-quality sample views required for CL. In addition, a CL branch is added to the backbone network to capture the features of malware family samples, improving the accuracy of the learned feature representations and thereby the overall performance of the system.
4.2.1. Data Augmentation Strategy
In order to preserve the texture features of malware grayscale images, this paper introduces a novel data augmentation strategy, which leverages Gaussian blur and noise injection to generate high-quality samples essential for CL. Specifically, three enhancement techniques are utilized: horizontal flipping, color jittering, and Gaussian blur. Horizontal flipping preserves the vertical texture patterns of the original images.
Color jittering is a data augmentation technique characterized by stochastic adjustments to brightness and contrast [33, 34]. Malware variants often exhibit minor payload alterations, such as instruction modifications or the addition/removal of functional modules, which manifest as subtle variations in pixel intensity within grayscale image representations. By applying color jittering, we simulate these pixel-level changes, enhancing the model’s ability to generalize across different malware instances with similar structures but slight differences. Figure 7 illustrates examples of color jittering, featuring a 50% increase in both brightness and contrast, using samples from the Ramnit and Androm families in the BIG-2015 and MaleVis datasets. These examples demonstrate that color jittering induces subtle changes in pixel intensity, reflecting the minor payload modifications observed in malware variants, while preserving the overall structural integrity of the grayscale representation.


Furthermore, Gaussian blur reduces image noise by applying a Gaussian filter, helping the model focus more on the overall structure of the image rather than on fine details. This allows the model to handle images of varying quality in real-world scenarios, reduces reliance on high-quality images, lowers the risk of overfitting, and improves model performance during testing.
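Assuming the augmentation pipeline is built with torchvision (an implementation detail not stated above), the three techniques could be composed as follows; the jitter strength, kernel size, and sigma range are illustrative values, not the paper’s exact settings.

```python
from torchvision import transforms

# Texture-preserving view generation for the CL branch (parameter values are illustrative).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # keeps vertical byte-stripe textures
    transforms.ColorJitter(brightness=0.5, contrast=0.5),      # simulates small payload changes
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # suppresses fine-grained noise
    transforms.ToTensor(),
])
```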
The Gaussian blur implementation is outlined in Algorithm 1. The core computation employs a convolution operation where each pixel’s blurred value derives from a Gaussian-weighted summation of its neighbors. We first initialize a zero matrix I′ of the same size as each malware image. Taking each pixel (x, y) of the original image as the center, the Gaussian weights of its k × k neighborhood are computed from the standard deviation σ; each weight is multiplied by the pixel at the corresponding neighborhood position, and the weighted values are summed to obtain the blurred pixel value I′(x, y).
Algorithm 1: Gaussian blur algorithm implementation.

Require: I: original image; k: size of the Gaussian kernel; σ: standard deviation
Ensure: I′: blurred image
1. image_size ← I.size()
2. I′ ← zero_array(image_size)
3. for (x, y) ∈ I do
4.   temp ← 0
5.   for (i, j) ∈ [−k/2, k/2] × [−k/2, k/2] do
6.     w ← (1/(2πσ²)) · exp(−(i² + j²)/(2σ²))
7.     if (x + i, y + j) ∈ I then
8.       temp ← temp + w · I(x + i, y + j)
9.     end if
10.   end for
11.   I′(x, y) ← temp
12. end for
13. return I′
Specifically, boundary regions require special treatment as incomplete neighborhoods disrupt standard convolution. First, a mirroring technique is employed, where the pixels near the boundary are reflected, creating a virtual extension of the image. This process ensures that the Gaussian filter has neighboring pixels to perform the convolution. Second, the edge pixels are handled by adjusting the weight distribution to minimize the impact of missing neighbors, ensuring that the boundary areas are smoothly blurred without introducing artificial artifacts. These mechanisms ensure artifact-free blurring across all image regions while maintaining the Gaussian kernel’s mathematical integrity.
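A compact NumPy realization of Algorithm 1 with the mirrored-boundary handling described above might look as follows; it is written for clarity rather than speed.

```python
import numpy as np

def gaussian_blur(img, k=5, sigma=1.0):
    """Blur a 2D grayscale array with a k x k Gaussian kernel and mirrored borders."""
    r = k // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    kernel /= kernel.sum()                                       # normalize the weights
    padded = np.pad(img.astype(np.float64), r, mode="reflect")   # mirror boundary pixels
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):                                          # sum of shifted, weighted copies
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out
```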
4.2.2. Design of the CL Network
Pretext tasks such as rotation and crop prediction in classical CL are ineffective for malware classification because they disrupt key texture features in grayscale images, resulting in unstable feature representations. To address these limitations, we propose a novel CL network architecture that incorporates both supervised and unsupervised learning. The network architecture and layer-specific parameters are illustrated in Figure 8. ResNet12 [35] is adopted as the backbone architecture. Specifically, a classification head, structurally identical to the projection head, is added at the same hierarchical level.

Consistent with classical methods, the projection head maps features extracted by the backbone network into a lower-dimensional space, enabling the accurate measurement of similarities and differences between samples. The classification head, in turn, leverages the limited malware category labels for supervised learning, enhancing the model’s ability to capture structural features and improving classification performance. Moreover, unlike conventional CL that relies on two augmented samples as inputs, our method incorporates the original sample as one of the inputs. This configuration not only reflects the structural characteristics of the real samples more accurately but also allows the original sample labels to be used for supervised learning.
The proposed network facilitates the simultaneous execution of CL and classification tasks. The projection head conducts CL by maximizing the contrastive loss between different samples while minimizing it for identical sample pairs, thereby optimizing feature representations. Furthermore, the classification head enhances the model’s capability for classification by minimizing the classification loss. This approach empowers the model to develop more comprehensive and robust feature representations.
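The sketch below outlines this dual-head design in PyTorch. The feature and projection dimensions, the number of known training classes, and the use of a generic backbone placeholder in place of the exact ResNet12 implementation are assumptions for illustration.

```python
import torch.nn as nn

class DualHeadCLNet(nn.Module):
    """Backbone with a projection head (contrastive branch) and a classification head."""

    def __init__(self, backbone: nn.Module, feat_dim=640, proj_dim=128, n_classes=15):
        super().__init__()
        self.backbone = backbone                       # e.g., a ResNet12 feature extractor
        self.proj_head = nn.Sequential(                # maps features into the contrastive space
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, proj_dim))
        self.cls_head = nn.Sequential(                 # supervised branch over known families
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, n_classes))

    def forward(self, x):
        h = self.backbone(x)                           # shared feature representation
        return self.proj_head(h), self.cls_head(h)
```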
4.2.3. Design of the Loss Function
In CL, the objective is to minimize the distance between similar samples and maximize the distance between dissimilar ones. In practice, augmented variants of the same sample are considered positive pairs due to their semantic consistency. Conversely, negative pairs are constructed from other samples within the same batch, as these samples typically belong to different classes and thus provide effective contrast.
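Although the paper’s exact loss expression is not reproduced in this excerpt, a standard InfoNCE-style contrastive term consistent with this description (cosine similarity $\mathrm{sim}(\cdot,\cdot)$, temperature $\tau$, batch size $N$, and $z_i^{+}$ denoting the projection of the positive counterpart of sample $i$) would take the form

$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(\mathrm{sim}(z_i, z_i^{+})/\tau\right)}{\sum_{j \neq i}\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)},$$

which pulls a sample and its augmented view together in the projection space while pushing apart the projections of different samples within the batch.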
4.3. Malware Multifeature Fusion (MF) Module
To learn richer feature representations of malware, the multifeature construction and fusion process is introduced. This process includes EG generation, GLCM calculation, and feature fusion.
4.3.1. Construction of Malware EG
Malware commonly employs packing and compression techniques to conceal its true structure and functionality, complicating static analysis and evading detection by security mechanisms. Compared to benign code, packed and compressed sections typically exhibit higher entropy values, making entropy analysis an effective tool for rapidly identifying packed and encrypted malware samples.
The process of constructing the EG of a grayscale image consists of three steps: first, segmenting the original malware grayscale image, then calculating the information entropy for each segment using Shannon’s formula, and finally, plotting the calculated entropy values as a line graph, which results in the EG.
To ensure comparability of the EGs, the map’s height is fixed to the maximum possible entropy, which is attained when all grayscale values occur with equal probability (log2 256 = 8 bits for 256 gray levels).
As shown in Figure 9, EGs derived from the same malware family display a high degree of similarity, whereas those from distinct families exhibit clear differences.
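A minimal sketch of the EG construction (segmentation and per-segment entropy) is given below; the segment length is an assumed parameter, and the base-2 logarithm bounds each segment’s entropy by log2(256) = 8, matching the fixed graph height described above.

```python
import numpy as np

def entropy_graph(img, segment_len=1024):
    """Per-segment Shannon entropy of a flattened grayscale image, scaled to [0, 8]."""
    flat = np.asarray(img, dtype=np.uint8).ravel()
    entropies = []
    for start in range(0, len(flat) - segment_len + 1, segment_len):
        seg = flat[start:start + segment_len]
        counts = np.bincount(seg, minlength=256)
        p = counts[counts > 0] / segment_len              # probabilities of observed values
        entropies.append(-(p * np.log2(p)).sum())         # Shannon entropy of this segment
    return np.array(entropies)                            # plotted as a line graph -> the EG
```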


4.3.2. Construction of GLCM
Malware grayscale images exhibit distinct vertical texture features, and common texture feature extraction algorithms include GLCM, local binary pattern, and Gabor transform [37]. GLCM has the lowest computational complexity, making it an ideal choice for this paper as an auxiliary feature for malware grayscale images.
Algorithm 2: GLCM construction.

Require: Igray: grayscale image; Levels: number of quantized gray levels; Distance: distance parameter; Direction: direction parameter
Ensure: GLCM, Features
1. Levels_before ← max(Igray) + 1
2. R ← Levels/Levels_before = L″/L
3. for (x, y) ∈ Igray do
4.   gray1 ← Igray(x, y)
5.   Igray(x, y) ← R · gray1
6. end for
7. GLCM ← zero_array(Levels, Levels)
8. for (x, y) ∈ Igray do
9.   gray1 ← Igray(x, y)
10.   use Distance and Direction to find (x′, y′)
11.   gray2 ← Igray(x′, y′)
12.   GLCM(gray1, gray2) ← GLCM(gray1, gray2) + 1
13. end for
14. calculate Features of GLCM
15. return GLCM, Features
As shown in Algorithm 2, the construction of GLCM involves three main steps: grayscale compression, co-occurrence matrix generation, and feature computation. First, pixel values are extracted from the grayscale image to form a grayscale matrix, and grayscale compression is applied to the original matrix to reduce dimensionality and computational load. If the original grayscale level is L and the target level is L″, then the compression ratio R is L″/L. Each pixel in the original image is multiplied by the compression ratio and rounded down to obtain the compressed grayscale value. Next, adjacent pixels are identified in the compressed image using a defined direction and distance. The grayscale levels of the current and adjacent pixels are used as indices to update the co-occurrence matrix. For instance, if the grayscale level of the current pixel is i and the adjacent pixel is j, the value at position (i, j) in the co-occurrence matrix is incremented by 1. Considering the substantial manual effort required for feature selection when employing these metrics, this paper will utilize the GLCM derived directly from the original grayscale levels for model training. The neural network automatically captures the latent features within the GLCM. Since the initial grayscale level is 256, the size of the GLCM is set to 256 × 256.
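Following Algorithm 2, the sketch below counts co-occurrences for a single offset (distance 1, horizontal direction; both illustrative choices) at the full 256 gray levels, in which case the quantization step reduces to the identity.

```python
import numpy as np

def glcm(img, levels=256, dx=1, dy=0):
    """Gray-level co-occurrence matrix for one (distance, direction) offset."""
    img = np.asarray(img, dtype=np.int64)
    h, w = img.shape
    out = np.zeros((levels, levels), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:                # skip neighbors outside the image
                out[img[y, x], img[ny, nx]] += 1           # count the (current, neighbor) pair
    return out
```

Libraries such as scikit-image provide equivalent vectorized routines, which are preferable when processing large batches of images.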
4.3.3. Multifeature Fusion
This paper integrates EGs and GLCM of malware grayscale images as auxiliary features to construct a feature fusion network, producing fused feature representations that enhance classification performance.
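Equation (7), referenced in Section 5.2.5, is not reproduced in this excerpt; based on the weight definitions given there ($\alpha$ for the CL loss, $\beta$ for the classification loss on contrastive representations, $\gamma$ for the EG branch, and $\delta$ for the GLCM branch), the fused training objective presumably takes the weighted form

$$\mathcal{L}_{\mathrm{total}} = \alpha\,\mathcal{L}_{\mathrm{con}} + \beta\,\mathcal{L}_{\mathrm{cls}} + \gamma\,\mathcal{L}_{\mathrm{EG}} + \delta\,\mathcal{L}_{\mathrm{GLCM}},$$

so that backpropagating the total loss jointly optimizes the backbone and all auxiliary branches.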

4.4. DA–Based Feature Distribution Adjustment Strategy
This paper introduces a feature mapping–driven DA strategy that aims to adjust the distribution of test data during the testing phase to align it more closely with the training data distribution, thereby improving the classification accuracy of test data.
After adjusting the feature distribution, the feature representations of the test sample align closely with the distribution of the training samples. Spatially, this manifests as the test samples clustering around the training prototypes, as depicted in Figure 11.
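The exact mapping used by the DA module is not reproduced in this excerpt. As a rough illustration of the idea, the sketch below re-expresses each test feature as a similarity-weighted combination of the training class prototypes, pulling the test distribution toward the prototype space; the cosine-similarity weighting, temperature, and blending factor are all assumptions, not the paper’s exact procedure.

```python
import torch
import torch.nn.functional as F

def align_to_training_prototypes(test_feats, train_prototypes, temperature=0.1):
    """Map novel-class test features toward the span of training-class prototypes (illustrative).

    test_feats:       [n, d] features of support/query samples from novel classes
    train_prototypes: [c, d] mean features of the known training classes
    """
    sims = F.cosine_similarity(                        # correlation between each test feature
        test_feats.unsqueeze(1), train_prototypes.unsqueeze(0), dim=-1)  # and each prototype
    weights = F.softmax(sims / temperature, dim=-1)    # soft assignment over training classes
    mapped = weights @ train_prototypes                # re-expression in prototype space
    return 0.5 * (test_feats + mapped)                 # partial alignment keeps novel-class cues
```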

5. Evaluation
5.1. Experimental Settings
5.1.1. Dataset
This study evaluates the model on two publicly accessible Windows malware datasets: MaleVis [42] and BIG-2015 [43]. MaleVis contains 26 categories (1 benign and 25 malware families) with RGB images of 224 × 224 and 300 × 300 resolutions, comprising 9100 training and 5126 validation images. A balanced subset of 25 malware classes is used for evaluation, partitioned by the FSL method into 15 training, 5 validation, and 5 testing classes. BIG-2015, released by Microsoft, contains 21,741 samples from 9 malware families, with 10,868 labeled and 10,873 unlabeled samples. Due to imbalanced sample distribution, this study focuses on four categories, Ramnit, Lollipop, Kelihos_Ver3, and Gatak, as the training set, while the remaining five categories are reserved for the test set.
Baseline models (BLs) and improved approaches are applied to both the MaleVis and BIG-2015 datasets to validate their effectiveness. Since the MaleVis dataset comprises RGB images, they are treated as three overlapping grayscale images, and the model’s input channels are adjusted from 1 to 3 to handle the RGB format. As for the BIG-2015 dataset, bilinear interpolation is used to standardize the sizes of samples, and random undersampling or oversampling is employed to balance the number of samples in each category to 500, thereby mitigating the effects of data imbalance on the final results.
5.1.2. Experimental Setup
This study divides the dataset into training, validation, and test sets. The validation set is used to assess the model’s generalization performance, and the model achieving the highest validation accuracy is preserved for final testing. Training is set to a default of 300 epochs, and early stopping is triggered if validation accuracy ceases to improve after 50 epochs to prevent overfitting. The batch size is set to 12 during training, and the stochastic gradient descent (SGD) optimizer is employed with an initial learning rate of 0.1. Cosine annealing is used to dynamically adjust the learning rate, preventing improper learning rate adjustments. During testing, 15 samples are queried each time, and the process is repeated 10,000 times, with the average taken as the final evaluation result.
5.1.3. Few-Shot Scenarios and Evaluation Metrics
This paper employs classification accuracy in the 5-way 1-shot and 5-way 5-shot scenarios as the primary evaluation metric. In a 5-way 5-shot task, each class provides 5 labeled support samples, whereas in a 5-way 1-shot task, each class provides only 1. These settings effectively reflect the model’s generalization and learning capacity with limited data.
The malware family classification problem addressed in this paper, designed for FSL, is a multiclass classification problem, so standard multiclass evaluation metrics such as precision, recall, and F1-score are considered.
5.2. Results and Analysis
5.2.1. Comparison With Other Works
To verify the performance of MalFSLDF, we select several state-of-the-art works as the BLs and make a comprehensive comparison during the experiments.
5.2.1.1. BLs
ProtoNet [38] introduced a method that leverages the feature representations of a small number of samples in the support set, calculating the mean feature representation of each class for classification. New query samples are categorized based on the similarity or distance between their features and the prototypes of each class.
Relation network (RN) [39] designed an embedding module that embeds samples into a low-dimensional feature space to extract their deep feature representations. A few-shot classification is accomplished by comparing the deep nonlinear distance metrics between query samples and support set samples calculated by the relation module.
RFS [40] trained a high-performance feature extractor on a large-scale dataset to obtain robust embedded representations. These pretrained embeddings are employed directly for few-shot classification, eliminating the need for additional adjustments.
DeepEMD [41] partitioned images into multiple patches and introduced earth mover’s distance (EMD) as a novel metric to calculate the optimal matching flow between patches in the query and support set images. Classification is performed by comparing the similarity of different local regions within the images.
DeepBDC [32] used neural networks to learn the embedded features of samples, calculating the Brownian distance covariance (BDC) distance between these features to measure dependencies among samples, thereby capturing subtle distinctions between them.
EASY [20] enhanced the initial samples using a feature generator, generating diversified virtual samples to augment the training data. An adaptive feature aggregation mechanism effectively merges the feature information from both original and augmented samples, improving the model’s generalization capabilities.
5.2.1.2. Comparison With BLs
This study evaluates the classification performance of existing models and the proposed MalFSLDF on the MaleVis and BIG-2015 datasets under the 5-way 1-shot and 5-way 5-shot configurations. As shown in Table 1, the MalFSLDF framework achieves a significant enhancement in accuracy, ranging from an increase of 4.71% to 14.31% in the 5-way 1-shot classification paradigm and from 4.47% to 19.3% in the 5-way 5-shot classification paradigm. The comparative experimental results across both datasets confirm the effectiveness of the MalFSLDF, which outperforms BLs and demonstrates strong generalization capability.
Models | MaleVis 5-way 1-shot Acc | Pre | Rec | F1 | MaleVis 5-way 5-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 1-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 5-shot Acc | Pre | Rec | F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
ProtoNet [38] | 65.39 | 66.71 | 65.39 | 66.04 | 68.72 | 70.55 | 68.72 | 69.63 | 55.44 | 57.54 | 55.44 | 56.47 | 62.85 | 64.83 | 62.85 | 63.82 |
RN [39] | 59.95 | 59.07 | 59.95 | 59.51 | 66.31 | 65.98 | 66.31 | 66.14 | 54.98 | 50.37 | 54.98 | 52.23 | 61.14 | 56.17 | 61.14 | 58.26 |
RFS [40] | 68.88 | 70.81 | 68.88 | 69.83 | 73.69 | 75.07 | 73.69 | 74.38 | 50.35 | 53.50 | 50.35 | 51.64 | 65.42 | 67.77 | 65.42 | 66.54 |
DeepEMD [41] | 67.94 | 69.91 | 67.94 | 68.91 | 71.47 | 73.42 | 71.47 | 72.43 | 51.27 | 52.46 | 51.27 | 51.86 | 66.18 | 67.32 | 66.18 | 66.75 |
DeepBDC [32] | 68.75 | 70.74 | 68.75 | 69.73 | 72.90 | 74.95 | 72.90 | 73.91 | 52.40 | 55.33 | 55.33 | 53.63 | 68.14 | 70.19 | 68.14 | 69.15 |
EASY [20] | 65.19 | 65.92 | 65.19 | 65.55 | 69.25 | 70.67 | 69.25 | 69.96 | 56.43 | 53.73 | 56.43 | 54.68 | 67.53 | 69.79 | 67.53 | 68.61 |
MalFSLDF | 74.26 | 74.29 | 74.26 | 74.27 | 87.56 | 87.63 | 87.56 | 87.52 | 64.33 | 64.57 | 64.33 | 64.45 | 73.92 | 73.97 | 73.92 | 73.94 |
The classification accuracies of malware families in the BIG-2015 and MaleVis datasets under various few-shot settings are shown in Figures 12 and 13, respectively. It is observed that the performance under the 5-way 5-shot setting is markedly superior to that under the 5-way 1-shot setting. In addition, the classification performance for some families is comparatively poor. This can be attributed to the fact that these families belong to the same broader malware category, which induces similarity in their grayscale image texture features. For instance, in the BIG-2015 dataset, both “Vundo” and “Tracur” are categorized as Trojans, and in the MaleVis dataset, both “Allaple.A” and “Hlux!IK” are classified as worms.


By introducing a CL module, the model proposed in this paper is capable of effectively capturing the texture features of malicious sample images, thereby achieving a more accurate preliminary representation of their global structural features. Unlike traditional methods such as ProtoNet and RN, which rely solely on single-scale feature matching, and the RFS method that uses only single-view pretrained features, the MF mechanism proposed in this paper captures more complementary information from multiple analytical perspectives, thus achieving a comprehensive representation of the structural information of malicious samples. This multidimensional feature fusion significantly enhances the model’s ability to recognize complex samples. In addition, the sample augmentation strategy proposed in this paper, compared to traditional feature generation methods such as EASY, avoids disrupting the structural information embedded in malicious sample images. This strategy generates effective sample variants while preserving the original structural features of the samples, thereby enhancing the model’s learning ability in data-scarce situations and making it more suitable for malicious code detection tasks. Compared to traditional methods such as DeepEMD and DeepBDC, which rely on fixed distance metrics for sample matching, the dynamic feature adaptation mechanism based on DA proposed in this paper adjusts the features of detected samples dynamically according to the feature distribution of known samples.
Khan et al. [44] developed a RN–based framework for FSL malware detection, comprising a discriminative feature extractor for grayscale images and a relation module for similarity assessment. Our approach presents three fundamental advancements: First, our feature fusion strategy generates more robust feature representations compared to their raw grayscale feature construction. Second, our model incorporates DA to dynamically characterize test sample features for malware classification. Third, while their work focuses on binary detection, our framework addresses the more challenging task of multifamily classification.
Notably, our proposed model exhibits real-time processing capabilities. First, our approach directly converts binary malware bytecode into grayscale images for feature extraction and detection, eliminating the need for static or dynamic analysis and thereby improving processing efficiency while reducing latency. Second, the inductive learning mechanism of our model provides enhanced generalization capability, enabling effective handling of emerging novel samples in evolving malware scenarios.
5.2.2. Effectiveness of CL
This study evaluates the effectiveness of the improvement strategy based on CL. In the following, the baseline model is denoted as BL, and the contrastive learning variant is denoted as CL.
Table 2 illustrates that on the MaleVis dataset, the CL–based improvement strategy significantly enhances classification accuracy compared to the BL. In the 5-way 1-shot setting, the improvement achieved a 3.11% increase in accuracy, and in the 5-way 5-shot setting, the improvement margin was 1.29%. These outcomes demonstrate the effectiveness of contrastive learning–based improvements, particularly in accurately classifying unlabeled samples when each class has only one labeled example.
Method | MaleVis 5-way 1-shot Acc | Pre | Rec | F1 | MaleVis 5-way 5-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 1-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 5-shot Acc | Pre | Rec | F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Baseline (BL) | 69.55 | 69.71 | 69.55 | 69.63 | 83.09 | 83.19 | 83.09 | 83.14 | 56.43 | 53.73 | 56.43 | 54.68 | 67.53 | 69.79 | 67.53 | 68.61 |
Contrastive learning (CL) | 72.66 | 72.79 | 72.66 | 72.72 | 84.38 | 84.47 | 84.38 | 84.43 | 60.77 | 61.04 | 60.77 | 60.91 | 70.61 | 70.65 | 70.61 | 70.63 |
Table 2 also presents the experimental results for the BIG-2015 dataset. Since this dataset provides hexadecimal dumps generated by the disassembly tool IDA Pro, which are then converted into grayscale images, it contains many unknown values, resulting in a decrease in classification accuracy. Nevertheless, the CL–based improvement strategy outperforms the BL across all metrics. Notably, in the 5-way 1-shot configuration, the improvement strategy’s precision exceeds that of the BL by 7.31%, underscoring its low misclassification rate when accurately identifying specific malware categories.
5.2.3. Effectiveness of MF
This work proposes an improved approach based on multifeature construction and fusion at the data level. Specifically, two feature fusion strategies are compared: the first (fusion) strategy employs a separate network branch to process the EG and GLCM, and the features are fused by optimizing the total loss via backpropagation; the second (concatenation) strategy, used as a baseline, directly concatenates these two features with the CL representations, followed by classification using a multilayer perceptron.
As shown in Table 3, it is evident that adding the EG alone significantly boosts the model’s classification accuracy; performance is further improved by incorporating the GLCM. Specifically, in the MaleVis dataset, the classification accuracy increased by 1.77% and 2.69% in the 5-way 1-shot and 5-way 5-shot settings, respectively. On the BIG-2015 dataset, the accuracy increased by 4.86% and 3.22% under the same configurations. Moreover, the fusion strategy demonstrated a more notable effect on the BIG-2015 dataset, particularly in the 5-way 1-shot setting. This improvement could be attributed to the high proportion of unknown values in the dataset, where including the EG enhances the model’s sensitivity to information entropy in unknown values.
Method | Features | MaleVis 5-way 1-shot Acc | Pre | Rec | F1 | MaleVis 5-way 5-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 1-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 5-shot Acc | Pre | Rec | F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Single feature | CL | 72.66 | 72.79 | 72.66 | 72.72 | 84.38 | 84.47 | 84.38 | 84.43 | 60.77 | 61.04 | 60.77 | 60.91 | 70.61 | 70.65 | 70.61 | 70.63
Concatenation | CL + EG | 68.42 | 68.45 | 68.42 | 68.43 | 78.62 | 78.71 | 78.62 | 78.67 | 51.97 | 52.09 | 51.97 | 52.03 | 61.59 | 61.66 | 61.59 | 61.62
Concatenation | CL + EG + GLCM | 71.46 | 71.54 | 71.46 | 71.50 | 82.37 | 82.43 | 82.37 | 82.40 | 56.14 | 56.43 | 56.14 | 56.28 | 61.32 | 61.43 | 61.32 | 61.37
Fusion | CL + EG | 73.79 | 73.91 | 73.79 | 73.84 | 85.97 | 86.02 | 85.97 | 86.01 | 63.51 | 63.59 | 63.51 | 63.55 | 72.81 | 72.83 | 72.81 | 72.82
Fusion | CL + EG + GLCM | 74.43 | 74.56 | 74.43 | 74.49 | 87.07 | 87.12 | 87.07 | 87.09 | 65.63 | 65.73 | 65.63 | 65.68 | 73.83 | 73.96 | 73.83 | 73.90
In contrast, the concatenation-based strategy yielded suboptimal performance. According to Table 3, experiments on both datasets demonstrate that this approach leads to a decline in model performance, especially in the 5-way 5-shot configuration on the BIG-2015 dataset, where classification accuracy declined by 9.29%. This suggests that the concatenation-based fusion method may cause interference between the auxiliary features and those learned via CL.
5.2.4. Effectiveness of DA
We tackle the problem of distribution discrepancies between known and novel malware samples in few-shot scenarios by proposing a feature distribution adjustment method based on DA. Experiments on the MaleVis and BIG-2015 datasets confirmed the effectiveness of the DA–based feature distribution adjustment method, with the improved MF model serving as the comparison baseline. The results of the experiments are displayed in Table 4, marked as MF and DA.
Method | MaleVis 5-way 1-shot Acc | Pre | Rec | F1 | MaleVis 5-way 5-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 1-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 5-shot Acc | Pre | Rec | F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Multifeature fusion (MF) | 74.43 | 74.56 | 74.43 | 74.49 | 87.07 | 87.12 | 87.07 | 87.09 | 65.63 | 65.73 | 65.63 | 65.68 | 73.83 | 73.96 | 73.83 | 73.90 |
MF + domain alignment (DA) | 74.26 | 74.32 | 74.26 | 74.29 | 87.56 | 87.48 | 87.56 | 87.52 | 64.33 | 64.57 | 64.33 | 64.45 | 73.92 | 73.97 | 73.92 | 73.94 |
The DA method exhibits a differential impact on classification accuracy across varying FSL settings. Specifically, on the MaleVis dataset, the method resulted in a marginal decline in accuracy under the 5-way 1-shot configuration, yet it achieved a 0.49% improvement in the 5-way 5-shot setting. This dichotomy suggests that the DA method more effectively captures class-specific features when a greater number of labeled samples are available. On the BIG-2015 dataset, the DA module induced fluctuations in the 5-way 1-shot setting but yielded a 0.09% improvement in the 5-way 5-shot setting. The limited number of categories in the BIG-2015 dataset (9 in total, with only 4 used for training) may account for the diminished impact observed.
Figure 14 illustrates the adjustments in feature distribution before and after DA using the t-SNE algorithm. The visualization reveals that samples from Class 1 coalesced from two distinct clusters into a single, more cohesive cluster, thereby enhancing the differentiation from other classes and subsequently improving classification efficiency. The DA method operates by mapping the features of test samples to the class prototype features derived from the training set, thereby aligning the feature distribution of the test set to more closely resemble that of the training set.

5.2.5. Analysis of Parameter Sensitivity
In the MF strategy proposed in Section 4.3, distinct weights are assigned to the different branches constituting the total loss, as illustrated in (7). Specifically, α represents the weight of the CL loss, β signifies the weight of the classification loss based on contrastive feature representations, γ denotes the weight of the classification loss derived from EG features, and δ indicates the weight of the classification loss derived from GLCM features. This section investigates the impact of weight parameter combinations on few-shot malware classification performance using the MaleVis dataset.
As shown in Table 5, the first parameter set assigns balanced weights across branches as a baseline. Increasing the supervised loss weight β from the first to the second set improves classification performance, with a 0.32% increase in 5-way 1-shot accuracy and a 1.98% increase in 5-way 5-shot accuracy. However, further increasing β to 0.5 reduces accuracy by 0.98% in 5-way 1-shot and 0.31% in 5-way 5-shot scenarios, indicating potential overfitting with excessive supervision weight.
Setting of weights | 5-way 1-shot Acc | Pre | Rec | F1 | 5-way 5-shot Acc | Pre | Rec | F1
---|---|---|---|---|---|---|---|---
α = 0.3, β = 0.3, γ = 0.2, δ = 0.2 | 73.60 | 73.69 | 73.60 | 73.65 | 85.45 | 85.57 | 85.45 | 85.51 |
α = 0.2, β = 0.4, γ = 0.2, δ = 0.2 | 73.92 | 73.98 | 73.92 | 73.95 | 87.43 | 87.48 | 87.43 | 87.46 |
α = 0.3, β = 0.5, γ = 0.1, δ = 0.1 | 72.94 | 73.03 | 72.94 | 72.98 | 87.12 | 87.18 | 87.12 | 87.15 |
α = 0.3, β = 0.4, γ = 0.2, δ = 0.1 | 74.11 | 74.27 | 74.11 | 74.19 | 85.73 | 85.80 | 85.73 | 85.76 |
Experiments show that setting the supervised loss weight β to 0.4 improves classification performance in both the 5-way 1-shot and 5-way 5-shot configurations. It is inferred that the combination of contrastive learning and multifeature extraction can mitigate the overfitting problem. In addition, the study finds that among the three features, CL contributes most to model accuracy, followed by EG features and GLCM features. Fine-tuning the weights (fixing the supervised loss weight β at 0.4, increasing the CL loss weight by 0.1, and reducing the GLCM loss weight by 0.1) further enhances accuracy in the 5-way 1-shot scenario. Balancing both configurations, this paper sets the final branch loss weights as α = 0.2, β = 0.4, γ = 0.2, and δ = 0.2, which yield favorable classification performance in both the 5-way 1-shot and 5-way 5-shot settings.
6. Discussion
To better understand the contributions of each module, we conducted a comprehensive ablation study, with the results presented in Table 6. Here, “w/o” indicates the module that has been removed for analysis.
Setting of ablation | MaleVis 5-way 1-shot Acc | Pre | Rec | F1 | MaleVis 5-way 5-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 1-shot Acc | Pre | Rec | F1 | BIG-2015 5-way 5-shot Acc | Pre | Rec | F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
w/o CL & MF & DA | 69.55 | 69.71 | 69.55 | 69.63 | 83.09 | 83.19 | 83.09 | 83.14 | 56.43 | 53.73 | 56.43 | 54.68 | 67.53 | 69.79 | 67.53 | 68.61
w/o MF & DA | 72.66 | 72.79 | 72.66 | 72.72 | 84.38 | 84.47 | 84.38 | 84.43 | 60.77 | 61.04 | 60.77 | 60.91 | 70.61 | 70.65 | 70.61 | 70.63
w/o DA | 74.43 | 74.56 | 74.43 | 74.49 | 87.07 | 87.12 | 87.07 | 87.09 | 65.63 | 65.73 | 65.63 | 65.68 | 73.83 | 73.96 | 73.83 | 73.90
MalFSLDF | 74.26 | 74.32 | 74.26 | 74.29 | 87.56 | 87.48 | 87.56 | 87.52 | 64.33 | 64.57 | 64.33 | 64.45 | 73.92 | 73.97 | 73.92 | 73.94
In terms of individual module contributions, the CL module significantly enhanced the model’s performance in scenarios with very few labeled samples, proving to be particularly effective in fine-grained feature learning at the family level. By analyzing the similarities and differences between samples, CL helps the model capture more discriminative texture features. In addition, training the model on multiple backbones forces it to overcome biases between perspectives, further improving the transferability of the learned features.
The MF module demonstrated its strength when applied to larger datasets, leading to a substantial improvement in accuracy. This illustrates the robustness of the MF module in enhancing model performance under different few-shot conditions.
In addition, the DA module showed only a modest improvement in both datasets, suggesting that DA provides incremental benefits in scenarios where the model is already performing well. Despite this, the DA module is still valuable, as it helps normalize feature distributions across distinct malware families and categories, improving the model’s adaptability in few-shot settings.
In summary, the CL module is the most impactful component, enhancing the model’s ability to identify malware families by deeply exploring feature representations. The MF strategy complements this by improving robustness against obfuscated or unknown samples. At the same time, the DA module offers incremental improvements, particularly in scenarios with significant feature domain discrepancies across malware families.
7. Conclusion and Future Work
This paper presents MalFSLDF for the classification of few-shot malware families. This model optimizes feature extraction through a CL strategy and enhances its ability to recognize the characteristics of malware families by adopting MF technology. In addition, the model can adapt to the feature distribution of new types of malware through a DA feature distribution adjustment strategy, further enhancing its generalization capability. Experiments have verified the superior performance of MalFSLDF in the field of malware classification.
Future work can be pursued from two perspectives: on the one hand, more auxiliary features can be explored from the perspective of feature engineering to achieve a more comprehensive representation of malware samples; on the other hand, more advanced domain adaptation methods can be investigated to improve the model’s generalization ability further.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
This work was supported by the National Natural Science Foundation of China (Grant no. 62172042) and the CCF-NSFOCUS “Kunpeng” Research Fund (Grant no. CCF-NSFOCUS 2023002).
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 62172042) and the CCF-NSFOCUS “Kunpeng” Research Fund (Grant no. CCF-NSFOCUS 2023002).
Open Research
Data Availability Statement
Data are available upon request from the corresponding author.