Volume 2025, Issue 1 4450460
Research Article
Open Access

LDSGAN: Unsupervised Image-to-Image Translation With Long-Domain Search GAN for Generating High-Quality Anime Images

Hao Wang, School of Computer and Software Engineering, Huaiyin Institute of Technology, Huai’an, China

Chenbin Wang, School of Computer and Software Engineering, Huaiyin Institute of Technology, Huai’an, China

Xin Cheng, School of Computer and Software Engineering, Huaiyin Institute of Technology, Huai’an, China

Hao Wu, State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China

Jiawei Zhang, School of Cyber Security and Information Law, Chongqing University of Posts and Telecommunications, Chongqing, China

Jinwei Wang (Corresponding Author), College of Cyber Science, Nankai University, Tianjin, China

Xiangyang Luo, State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China

Bin Ma, Shandong Provincial Key Laboratory of Computer Networks; Shandong Computer Science Center; School of Cyber Security, Qilu University of Technology, Jinan, China
First published: 12 April 2025
Academic Editor: Stefano Cirillo

Abstract

Image-to-image (I2I) translation has emerged as a valuable tool for privacy protection in the digital age, offering effective ways to safeguard portrait rights in cyberspace. In addition, I2I translation is applied in real-world tasks such as image synthesis, super-resolution, virtual fitting, and virtual live streaming. Traditional I2I translation models demonstrate strong performance when handling similar datasets. However, when the domain distance between two datasets is large, translation quality may degrade significantly due to notable differences in image shape and edges. To address this issue, we propose Long-Domain Search GAN (LDSGAN), an unsupervised I2I translation network that employs a GAN structure as its backbone, incorporating a novel Real-Time Routing Search (RTRS) module and Sketch Loss. Specifically, RTRS aids in expanding the search space within the target domain, aligning feature projection with images closest to the optimization target. Additionally, Sketch Loss retains human visual similarity during long-domain distance translation. Experimental results indicate that LDSGAN surpasses existing I2I translation models in both image quality and semantic similarity between input and generated images, as reflected by its mean FID and LPIPS scores of 31.509 and 0.581, respectively.

1. Introduction

As cyberspace activities become increasingly frequent, portraits in cyberspace raise many security issues, so it is crucial to protect portrait privacy [1, 2]. Image-to-image (I2I) translation has become a popular research topic in recent years; it aims to learn image mapping functions from a source domain to a target domain. By converting personal portraits into anime or other stylized portraits, I2I protects portrait privacy and prevents facial recognition technology in cyberspace from infringing on personal identity. With the development of Generative Adversarial Networks (GANs) [3] and diffusion models [4], the powerful fitting ability of these models has further advanced real-life applications of I2I. In addition to portrait protection, image coloring [5–7], makeup transfer [8], fashion editing [9], style transfer [10], and game character face generation [11] also involve I2I technology.

Existing I2I translation methods are mainly categorized as supervised [7–9, 12–14] and unsupervised. Supervised I2I methods use paired data to train models. However, it is difficult to obtain paired data in realistic environments, so unsupervised I2I methods have received widespread attention. For example, CycleGAN [15] introduces two mirror-symmetric GANs to establish constraints between the source and target domains and preserves image information through skip connections for cross-domain image transformation. Similarly, Drit++ [16] proposes a multidomain unsupervised I2I model that uses an encoder to extract the content and feature information from an image and exploits it as input to a CycleGAN-style framework.

However, these models assume that the images in the source and target domains have high similarity in shape or texture. In an I2I task with a small inter-domain distance, shapes and textures can be treated as domain-independent features and preserved by skip-connection structures. In an I2I task with a large inter-domain distance, however, shape and texture can no longer be treated as domain-independent features because of the notable differences between the source and target domains. When skip connections are used in long-distance domain translation (LDDT), the source-domain image cannot be transferred to the target domain correctly, and artifacts of the source domain may remain on the generated image (as shown in Figure 1(b)). The image generated by the proposed method is shown in Figure 1(c). Figures 1(d), 1(e), and 1(f) show the feature maps passed by the skip connection, which contain too much source-domain texture, degrading the quality of the generated image. In addition, I2I models based on encoder–decoder architectures [16–18] often struggle to accurately match the source image in the target domain due to severe information loss. Therefore, it is difficult for existing methods to achieve satisfactory performance on LDDT.

Figure 1: Examples of I2I translation results and corresponding feature maps under skip connection. The generated image exhibits degradation in quality due to excessive texture from the source domain introduced by the skip connections. (a) The input image. (b) The image generated by using a skip connection during generation. (c) The image generated by the proposed method. (d–f) Feature maps passed by the skip connection.

In contrast to the above approaches, UI2I-via-StyleGAN2 [19] divides the I2I task into short-distance domain translation (SDDT) and LDDT according to inter-domain distance. The target-domain model is trained by fine-tuning the source-domain model to minimize the effect of model distance. Meanwhile, the layer-swap approach directly swaps high-level convolutional feature maps from the source-domain model into the target-domain model, thereby preserving more source-domain features. However, the color arrangement of the generated image still cannot fully match the input image, and there is still room to improve image quality.

To solve the problems in LDDT, we propose an unsupervised I2I translation method called Long-Domain Search GAN (LDSGAN). It introduces a novel Real-Time Route Search (RTRS) module for generating high-quality images. The RTRS module searches for a representation of the color arrangement and texture of the source domain and projects it into the target domain, helping the model preserve domain-independent features while mapping domain-related features to the nearest projection points in the target domain. In addition, we propose Sketch Loss, which helps the model maintain similarity between the generated image and the source-domain image without forcing the generated image to be identical to the source-domain image in terms of distance.

Thus, the contributions of this paper can be summarized as follows:
  • We propose LDSGAN, which achieves higher generation quality in LDDT while maintaining the similarity between the input image and the generated image.

  • We propose the RTRS module, which helps the model find the nearest-neighbor projection points of source-domain images in the target domain, thereby raising the upper bound of the network’s fitting ability.

  • We propose Sketch Loss to improve the visual similarity between the generated image and the source-domain image. This loss function does not limit the distance between the generated image and the source-domain image.

The remaining sections of this paper are organized as follows. Section 2 outlines the application of GANs to I2I. Section 3 describes the proposed I2I method in detail. Section 4 presents the experimental setup, verifies the performance of the proposed method on different datasets, and discusses the results. Finally, Section 5 concludes the paper and indicates future work.

2. Related Work

In recent years, with the development of GANs [3, 20], I2I translation research has achieved impressive synthetic image results. On the one hand, among supervised I2I translation methods, Pix2pix [13] utilizes the UNet [12] architecture; experiments have shown that introducing skip connections in image translation tasks can significantly improve translation performance. In addition, supervised I2I methods can be customized to specific scenarios, such as grayscale image coloring, fashion editing, and makeup transfer. Supervised I2I methods mostly use a similarity-based strategy that retains most of the information in the source domain and edits only the target region.

On the other hand, among unsupervised methods, CycleGAN [15] proposes a way to learn the input-to-output image mapping without paired samples. Drit++ [16] proposes a multidomain I2I translation model that achieves better image quality by using multiple encoders to encode the content and feature information of the image.

To further enhance the quality of the generated images, researchers have explored applying StyleGAN [21] to I2I translation, as it is capable of generating impressive, high-quality images. For example, UI2I-via-StyleGAN2 [19] introduces a method that encodes images end-to-end into the latent space of StyleGAN [21]. It defines model distances based on StyleGAN2 [22] and proposes a GAN embedding-based inversion approach to achieve higher-quality results in long-distance I2I translation.

Although the above supervised networks have made significant progress in I2I translation, they require paired data. When using unpaired data, unsupervised networks may be unable to generate high-quality color images in LDDT. In addition, in the I2I task, many studies have assumed a high degree of consistency in the distribution of images in the source and target domains over the content space. These studies usually use texture-based detectors such as LPIPS [23] as loss functions or similarity evaluation metrics but ignore the differences in image texture types in LDDT.

To overcome these problems, we propose LDSGAN, which has a wider search space in the target domain and can therefore map source-domain images to the target domain more efficiently. It builds RTRS modules that search for representations of the color arrangement and texture of the source domain and aggregate these features. Without limiting the distance between the generated image and the source-domain image, Sketch Loss is used to improve their visual similarity, which enables LDSGAN to generate high-quality and more appealing color images during long-distance domain transformation.

3. The Proposed Method

The proposed unsupervised LDSGAN searches the target-domain space for images close to the source-domain input and improves the performance of long-distance domain translation. As shown in Figure 2, the main components of LDSGAN are the source distill network (SDNet), the long-distance transform network (LDTNet), and the target domain search network (TDSNet).

Figure 2: The architecture of LDSGAN. Generator G takes the source-domain image x as input and generates the target-domain image y. We do not use a cycle consistency loss to constrain the generated images because we discard textures and information that are present in the source domain but not in the target domain. Rich Init is used to preserve the high-frequency features and its output channels are set to 64, and RTRS extracts 512 Route Codes. The Route Code of the next layer is extracted from the previous layer’s ToRGB output and the rich maps.

First, in SDNet, the model reduces the dimensionality of the source-domain image while preserving its details, projecting it onto a manifold closer to the target domain. Second, LDTNet discards textures that are present in the source domain but not in the target domain by separating features and weakening domain-related features of the source domain, and then translates the image to the target domain. Finally, TDSNet uses the RTRS module to search for the nearest-neighbor projection points of the source-domain image in the target domain, which raises the upper fitting limit of the network (shown in Figure 3). With Sketch Loss, the network preserves the perceived visual consistency between the generated image and the input, resulting in a high-quality generated image in the target domain that closely resembles the original.

Figure 3: The structures of RTRS and SABlock. RGBi and the feature maps produced by the Search Apply Block in the previous layer, together with the rich maps (x) bilinearly resized to the same size, are concatenated to extract the Route Code. The Search Apply Block uses the Route Code to find a suitable way to modulate the generation.

3.1. SDNet

It is more feasible to map a low-dimensional image to the target domain than to map a high-dimensional image directly. Lower-dimensional images have less detail, so the source-domain image is closer to the target domain in a lower-dimensional space. However, reducing the image from high to low dimensions leads to information loss in the source domain. We therefore first apply Rich Init to preserve the high-frequency features, ensuring that they can be transferred to the target domain’s high-frequency features. In this way, SDNet effectively preserves the source-domain information while reducing the image dimensionality.

SDNet uses successive convolutional layers as its encoder architecture and does not contain a decoder. Inspired by the model in [24], we use a convolutional layer to enhance the input image. The feature maps are then downsampled N times to bring them closer to the low-dimensional target domain. The value of N depends on prior knowledge: usually, the greater the distance between domains, the higher the value of N. As shown in Figure 3, the output feature maps of SDNet are fed into LDTNet for further source-to-target domain transformation.
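The following PyTorch sketch illustrates one way SDNet could be organized under the description above: a Rich Init convolution that lifts the input to 64 channels, followed by N stride-2 downsampling convolutions. Channel widths, kernel sizes, and the LeakyReLU placement are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SDNet(nn.Module):
    """Minimal sketch of SDNet: a Rich Init convolution followed by N
    stride-2 downsampling convolutions (channel widths are assumptions)."""
    def __init__(self, in_ch=3, rich_ch=64, n_down=4):
        super().__init__()
        # Rich Init: a single convolution that lifts the RGB input to 64
        # channels while keeping full spatial resolution (the rich maps).
        self.rich_init = nn.Conv2d(in_ch, rich_ch, kernel_size=3, padding=1)
        layers, ch = [], rich_ch
        for _ in range(n_down):                     # N downsampling steps
            layers += [nn.Conv2d(ch, min(ch * 2, 512), 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            ch = min(ch * 2, 512)
        self.down = nn.Sequential(*layers)

    def forward(self, x):
        rich = self.rich_init(x)   # high-frequency "rich maps", kept for RTRS
        low = self.down(rich)      # low-dimensional feature maps for LDTNet
        return rich, low
```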

3.2. LDTNet

In LDDT, significant differences in texture and detailed shape exist between the source- and target-domain images. For example, when a human face portrait is transformed into an anime-style portrait, there are substantial differences in the detailed shapes of the facial features. The low-dimensional feature maps of the source domain obtained by SDNet therefore cannot be used directly to generate images in the target domain. For this purpose, we introduce LDTNet, which transforms the source-domain feature maps into feature maps suitable for the target domain in a low-dimensional space.

LDTNet uses multiple residual blocks [25] to transform the low-dimensional feature maps from the source domain to the target domain. It ensures that the information content of the feature maps before and after conversion is the same without changing their number or size. Let xs denote the source-domain features extracted by SDNet and ys the corresponding target-domain representation. LDTNet tries to learn a mapping function f such that f(xs) ≈ ŷs, where ŷs is the nearest-neighbor mapping of the source image in the target domain.
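As a rough sketch of this design, the residual stack below keeps the channel count and spatial size fixed, matching the constraint that the number and size of the feature maps do not change; the block count and width used here are assumptions.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block; stride and channel count stay fixed so the
    feature maps keep their number and size."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class LDTNet(nn.Module):
    """Sketch of LDTNet: a stack of residual blocks that maps the
    low-dimensional source features toward the target domain.
    The number of blocks (here 6) is an assumption."""
    def __init__(self, ch=512, n_blocks=6):
        super().__init__()
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])

    def forward(self, x_s):
        return self.blocks(x_s)   # same shape in, same shape out
```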

3.3. TDSNet

To restore the feature projecting to the high dimensional space and to make its projecting location close to the nearest neighbor projection point in the target domain, TDSNet consists of N times RTRSs and N times Search Application Blocks (SABlocks). Each RTRS and its corresponding SABlock projects the low-dimensional feature maps to a high-dimensional feature space with appropriate routes, and these routes are applied to each upsampling layer. The rich maps contain the shape and texture information of the source domain, while the low-dimensional feature maps are a low-dimensional representation of the target domain and contain less shape and texture information. The combination of the two generates a Route Code, which is used to control the strength of the convolutional kernel so that the convolutional layer can find the best path to project the low-dimensional feature maps to the nearest projection point. Thus, textures that are not decoupled and discarded in SDNet can be recovered by RTRS.

The structure of RTRS is shown in Figure 3. RTRS uses the feature maps of the TDSNet layers and the rich maps from SDNet to extract Route Codes. Driven by the objective function, these Route Codes can capture domain-dependent features of the source-domain image. Instead of downsampling the feature maps to a size of 1 × 1, we encode the Route Code with a PatchGAN-style encoder to obtain features that retain spatial information. Downsampling the feature maps to 1 × 1 focuses more on the confidence of a particular shape or texture, whereas the PatchGAN encoder focuses more on the spatial layout of colors. In principle, features should be extracted from all three RGB channels; however, given the effectiveness of the PatchGAN discriminator, we extract features from only one channel and use it to encode the full information of the color image. In addition, to extend the route search, we propose a Scale Block to increase the variance of the Route Code. The Scale Block can be represented as
()
where Ci is the final Route Code of the i-th RTRS, Ffc1i, Ffc2i, and Ffc3i are the three fully connected layers of the i-th RTRS, and x is obtained by flattening the output of the PatchGAN encoder.
The full RTRS structure can be represented as
()
where x represents the rich maps from SDNet, the i-th feature maps are those output by the corresponding SABlock with i ∈ {0, 1, 2, …, N}, and the initial feature maps are the output of LDTNet.
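The sketch below shows one plausible reading of the RTRS pipeline: resize the rich maps to the previous layer's RGB resolution, concatenate, run a small PatchGAN-style convolutional encoder that keeps a spatial grid rather than pooling to 1 × 1, flatten, and feed a Scale Block of three fully connected layers. The exact Scale Block equation is not reproduced above, so the simple chained form here, along with all channel and grid sizes, is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RTRS(nn.Module):
    """Hypothetical Real-Time Route Search block (a sketch, not the
    paper's exact layer configuration)."""
    def __init__(self, rich_ch=64, code_dim=512):
        super().__init__()
        # PatchGAN-style encoder ending in a single-channel spatial map.
        self.encoder = nn.Sequential(
            nn.Conv2d(rich_ch + 3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1))
        # Keep a small spatial grid (not 1x1) before flattening; 4x4 is an assumption.
        self.pool = nn.AdaptiveAvgPool2d(4)
        # Scale Block: three fully connected layers producing the Route Code.
        self.fc1 = nn.Linear(16, code_dim)
        self.fc2 = nn.Linear(code_dim, code_dim)
        self.fc3 = nn.Linear(code_dim, code_dim)

    def forward(self, rich_maps, rgb_prev):
        # Bilinearly resize rich maps to the previous layer's RGB resolution.
        rich = F.interpolate(rich_maps, size=rgb_prev.shape[-2:],
                             mode="bilinear", align_corners=False)
        h = self.encoder(torch.cat([rich, rgb_prev], dim=1))
        h = self.pool(h).flatten(1)
        return self.fc3(self.fc2(self.fc1(h)))   # Route Code C_i
```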

SABlock customizes the convolution kernels using the Route Code obtained by RTRS to generate the target image. The structure of SABlock is shown in Figure 3. Inspired by the modulated convolution kernel in StyleGAN2, SABlock is used to customize the image generation route. During generation, noise is no longer added, as it would lead to inaccurate optimization of the route search. To make the route search focus more on the colors of the RGB channels, the RGB maps obtained through ToRGB are combined with the Route Code extracted by RTRS. In SABlock, using the same Route Code does not cause deviation from the translation route because the translation path of the feature maps is short. Therefore, we use the same Route Code for all modulated convolution kernels to reduce computational overhead.

Given the i-th input feature maps and RGBi of SABlock, the Search Apply Block can be represented as
()
where the output feature maps and RGBi+1 of this layer’s SABlock are also the inputs of the next layer, i ∈ {0, 1, 2, …, N}. RGB0 is obtained by converting the initial feature maps through ToRGB.

After each convolutional layer of the generator, a Leaky ReLU [26] with a slope of 0.2 is used as the activation function. In addition, to avoid checkerboard artifacts, we use an upsampling convolution kernel similar to that of StyleGAN2.
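As a concrete but non-authoritative illustration of this step, the sketch below implements a StyleGAN2-style modulated convolution driven by the Route Code (with no noise injection), wrapped in a minimal SABlock that upsamples, applies two modulated convolutions with the same Route Code, and accumulates a ToRGB output. Channel sizes, the bilinear upsampling mode, and the two-convolution layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """Simplified StyleGAN2-style modulated convolution: the Route Code is
    projected to per-input-channel scales that modulate the kernel,
    followed by demodulation. No noise is added."""
    def __init__(self, in_ch, out_ch, code_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.affine = nn.Linear(code_dim, in_ch)
        self.pad = k // 2

    def forward(self, x, code):
        b, c, h, w = x.shape
        s = self.affine(code).view(b, 1, c, 1, 1) + 1.0        # per-channel scales
        w_mod = self.weight.unsqueeze(0) * s                   # modulate
        demod = torch.rsqrt((w_mod ** 2).sum(dim=(2, 3, 4)) + 1e-8)
        w_mod = w_mod * demod.view(b, -1, 1, 1, 1)             # demodulate
        x = x.reshape(1, b * c, h, w)                          # grouped-conv trick
        w_mod = w_mod.reshape(b * w_mod.shape[1], c, *w_mod.shape[-2:])
        out = F.conv2d(x, w_mod, padding=self.pad, groups=b)
        return out.reshape(b, -1, h, w)

class SABlock(nn.Module):
    """Sketch of a Search Apply Block: upsample, two modulated convolutions
    driven by the same Route Code, and a ToRGB branch accumulated onto the
    upsampled RGB from the previous layer."""
    def __init__(self, in_ch, out_ch, code_dim=512):
        super().__init__()
        self.conv1 = ModulatedConv2d(in_ch, out_ch, code_dim)
        self.conv2 = ModulatedConv2d(out_ch, out_ch, code_dim)
        self.to_rgb = nn.Conv2d(out_ch, 3, 1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, feat, rgb, code):
        feat = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)
        feat = self.act(self.conv1(feat, code))
        feat = self.act(self.conv2(feat, code))
        rgb = F.interpolate(rgb, scale_factor=2, mode="bilinear", align_corners=False)
        return feat, rgb + self.to_rgb(feat)
```

In TDSNet, N such blocks would be chained, each paired with an RTRS module that recomputes the Route Code from the rich maps and the previous layer's RGB output, matching the structure described above.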

3.4. Loss Function

Due to the differences in image textures and the diversity of colors and strokes in LDDT, network training becomes challenging. To stabilize the training process, the network employs a multiobjective weighting strategy.

LDSGAN uses the loss function in WGAN-GP [27] as its adversarial loss:
()
()
where x denotes the input images, G(x) denotes the generated images, and the gradient penalty term is computed on randomly sampled interpolations between them.
In addition, in the generator, a pixel-level loss is used to keep the regional hue of the generated image G(x) close to that of the input image x.
()
Finally, LPIPS is a metric that incorporates both texture and structure, but most textures in the source and target domains do not coincide in LDDT. To maintain the image distribution in LDDT, we therefore employ Sketch Loss instead of LPIPS. The proposed Sketch Loss can be divided into two parts. The first is Soft-Sketch Loss, which drives the generated image to have the sketch complexity of the target domain. Soft-Sketch Loss can be expressed as follows:
()
where F is a pretrained sketch extractor in [28].
The other part is Hard-Sketch Loss, which constrains the sketch of the generated image to have complexity similar to that of the target-domain image and can be expressed as
()
Overall, the loss of discriminator LD and the loss of generator LG in the proposed network can be expressed as follows:
()
()
where λ1, λ2, λ3, and λ4 are hyperparameters to balance the relationship between different losses. By default, we set λ1 = 1, λ2 = 10, λ3 = 10, and λ4 = 1.
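A minimal PyTorch sketch of how these terms could be combined is given below. The paper's equations are not reproduced above, so the concrete forms of the pixel and sketch terms (L1 distances, with aggregate sketch magnitude as a proxy for "complexity") and the assignment of λ1–λ4 to individual terms are assumptions; `sketch_extractor` stands in for the pretrained sketch network F from [28], and `y_ref` is a target-domain reference batch.

```python
import torch
import torch.nn.functional as F

def generator_loss(D, G, sketch_extractor, x, y_ref,
                   lam1=1.0, lam2=10.0, lam3=10.0, lam4=1.0):
    """Illustrative combination of the generator's loss terms
    (a sketch, not the paper's exact equations)."""
    fake = G(x)
    l_adv = -D(fake).mean()                              # WGAN adversarial term
    l_pix = F.l1_loss(fake, x)                           # pixel-level hue constraint
    # Soft-Sketch: compare full sketch maps of the generated and input images.
    l_soft = F.l1_loss(sketch_extractor(fake), sketch_extractor(x))
    # Hard-Sketch: match the overall sketch complexity of target-domain references.
    l_hard = F.l1_loss(sketch_extractor(fake).abs().mean(dim=(1, 2, 3)),
                       sketch_extractor(y_ref).abs().mean(dim=(1, 2, 3)))
    return lam1 * l_adv + lam2 * l_pix + lam3 * l_soft + lam4 * l_hard

def discriminator_loss(D, real, fake, gp):
    """WGAN-GP style critic loss; `gp` is the (lazy) gradient penalty term."""
    return D(fake.detach()).mean() - D(real).mean() + gp
```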

4. Experiments

4.1. Baseline and Dataset

To evaluate the proposed LDSGAN model, we have selected several state-of-the-art works as baselines. CycleGAN [15] introduces Cycle Consistency Loss (CCL) to ensure that the identity of the input and output images remains consistent when trained with unpaired data. Drit++ [16] supports I2I conversion through high-quality multimodal translation. UI2I-via-StyleGAN2 [19] proposes a GAN embedding-based approach to obtain higher image quality and similarity in I2I tasks by fine-tuning the pretrained StyleGAN2.

The experiments are evaluated in two I2I translation scenarios with large inter-domain differences. For the Yellow2Anime task, the Netflix face dataset and Danbooru2018 [29] are used; the Netflix face dataset contains 136,723 images of size 512 × 512. For the Sketch2Anime task, we use the DCS dataset collected in [7]; each image is cropped to a square along its shorter side and then scaled to 512 × 512.
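For reference, the short snippet below shows how this square-crop-and-resize step could be done with torchvision; the center-crop choice and the default interpolation are assumptions, since the paper does not state them.

```python
from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as TF

def preprocess(path, size=512):
    """Crop an image to a square along its shorter side, then rescale to
    512 x 512, as described for the Sketch2Anime data (a sketch)."""
    img = Image.open(path).convert("RGB")
    side = min(img.size)                 # PIL size is (width, height)
    img = TF.center_crop(img, side)      # square crop on the shorter side
    img = TF.resize(img, [size, size])
    return transforms.ToTensor()(img)    # float tensor in [0, 1]
```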

4.2. Experimental Settings

The framework is implemented in PyTorch [30], and all experiments are performed on an NVIDIA Tesla V100 GPU. The method uses the Adam optimizer with the learning rate set to 1e-4 and a total of 400k iterations. For both the generator and the discriminator, the Adam momentum parameters are set to β1 = 0 and β2 = 0.99.

The generator’s output is the nearest-neighbor estimate of the input source-domain image in the target domain. To accelerate convergence, we adopt WGAN-GP [27] for the generator and discriminator and impose no gradient penalty on the generator. The gradient penalty of the discriminator is executed every 16 iterations with the penalty strength amplified 16 times, which reduces network training time. In addition, in all convolutional and fully connected layers, we use Kaiming normal initialization [31] to improve convergence speed while avoiding gradient explosion during training. Similar to NAH’s approach [32], we do not use any normalization layer in the network.
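The snippet below sketches this configuration: the optimizers, Kaiming-normal initialization, and a lazy gradient penalty evaluated every 16 discriminator steps with 16× strength. The base penalty weight of 10 and the interpolation-based sampling are standard WGAN-GP choices assumed here rather than details stated above.

```python
import torch
from torch import nn

def init_weights(m):
    # Kaiming normal initialization for all conv and fully connected layers.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def make_optimizers(G, D, lr=1e-4, betas=(0.0, 0.99)):
    G.apply(init_weights)
    D.apply(init_weights)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=betas)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=betas)
    return opt_g, opt_d

def gradient_penalty(D, real, fake, step, every=16, base_weight=10.0):
    """Lazy gradient penalty: evaluated only every `every` discriminator steps
    and scaled by `every` to keep the expected strength."""
    if step % every != 0:
        return real.new_zeros(())
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mix = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(mix).sum(), mix, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return base_weight * every * gp
```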

4.3. Comparison Experiment

To evaluate the quality of the images generated by the proposed method and their similarity to the input images, we use the following metrics. Fréchet Inception Distance (FID) [33] measures the quality of the output images by computing the distance between the feature vectors of real and generated images. LPIPS [23] is a deep-feature-based metric for evaluating structural and textural similarity between images; we use it to measure the structural similarity of images despite the textural differences between source- and target-domain images in LDDT.
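For completeness, a small evaluation sketch using the `lpips` and `torchmetrics` packages is shown below; these library choices and the [0, 1] input convention are assumptions, since the paper does not name its metric implementations.

```python
import torch
import lpips                                                   # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance    # pip install torchmetrics

lpips_fn = lpips.LPIPS(net="alex")                      # expects images in [-1, 1]
fid = FrechetInceptionDistance(feature=2048, normalize=True)   # floats in [0, 1]

def evaluate(inputs, generated, targets):
    # inputs / generated / targets: float tensors of shape (N, 3, H, W) in [0, 1].
    fid.update(targets, real=True)
    fid.update(generated, real=False)
    lp = lpips_fn(inputs * 2 - 1, generated * 2 - 1).mean()
    return fid.compute().item(), lp.item()
```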

To evaluate the performance of LDSGAN, we compared it with several other methods in the Yellow2Anime task, including CycleGAN [15], Drit++ [16], UI2I-via-StyleGAN2 (UI2I-StGAN2) [19], AttentionGAN [34], and E2 GAN [35]. Specifically, we retrained these models on the same training dataset using the same settings. For CycleGAN [15] and Drit++ [16], the training period is set to 10 and the decay period is set to 5. Then, the performance of these models is evaluated by FID and LPIPS. The experimental results are shown in Table 1, where 2000 test images are randomly selected as inputs. Figure 4 shows some comparison examples.

Table 1. Comparative results of this method with existing methods.
Method  Yellow2Anime FID↓  Yellow2Anime LPIPS↓  Sketch2Anime FID↓  Sketch2Anime LPIPS↓
CycleGAN [15] 185.812 0.547 153.61 0.509
Drit++ [16] 60.156 0.605 59.995 0.583
UI2I-StGAN2 LS = 1 [19] 46.771 0.670 44.781 0.631
UI2I-StGAN2 LS = 3 [19] 71.361 0.644 63.056 0.627
UI2I-StGAN2 LS = 5 [19] 102.405 0.618 81.551 0.594
AttentionGAN [34] 62.983 0.586 65.276 0.598
E2 GAN [35] 73.03 0.603 72.615 0.618
LDSGAN (ours) 31.437 0.582 29.582 0.579
  • Note: Best results are in italics; second-best results are in bold.
Figure 4: Visualization of comparison experiments on Yellow2Anime. (a) Input images, (b) results of CycleGAN, (c) results of Drit++, (d) results of UI2I-via-StyleGAN2 with layer-swap = 1, (e) results of UI2I-via-StyleGAN2 with layer-swap = 3, (f) results of AttentionGAN, and (g) results of LDSGAN.

As shown in Table 1, the FID scores of the proposed method are better than those of the other methods. Although some methods perform better on LPIPS, they mainly rely on directly migrating textures of the input images from the nontarget domain. As can be seen in Figure 4, the images generated by CycleGAN and Drit++ contain too much texture from the input image, which is an advantage of skip connections in the SDDT task but a major drawback in the LDDT task. The results of UI2I-via-StyleGAN2 are poor when the number of swapped layers is 3 or 5, mainly because the resulting image is more prone to corruption when higher-level convolutional layers are swapped, as the swapped features must then pass through more convolutional layers that are not in the original model; the FID score is therefore worse for 3 or 5 swapped layers. For layer-swap = 1, although the generated image is similar to the input image in facial orientation and expression, it fails to accurately preserve the color layout of the input image and loses the decorations in the input image. In summary, LDSGAN obtains the best image quality and effectively maintains the similarity between the input and output images.

4.4. Ablation Study and Analysis

4.4.1. Ablation Study of RTRS

To analyze the role of RTRS in LDSGAN, we designed w/o RTRS and one RTRS variants and compared their effectiveness with full RTRS. w/o RTRS means that the model uses neither RTRS nor SABlock, so the model is equivalent to an autoencoder. One RTRS generates all the Route Codes at the first call to RTRS, whereas full RTRS is the complete method proposed above. The results are shown in Table 2, and some examples from Sketch2Anime are illustrated in Figure 5.

Table 2. The ablation results of RTRS.
Setting  Yellow2Anime FID↓  Yellow2Anime LPIPS↓  Sketch2Anime FID↓  Sketch2Anime LPIPS↓
w/o RTRS 47.662 0.664 34.571 0.609
One RTRS 34.621 0.604 46.156 0.628
Full RTRS 31.437 0.582 29.582 0.579
  • Italic values indicate the best results and bold values the second-best.
Figure 5: Visualization of RTRS ablation experiments on Sketch2Anime. (a) Input sketch images, (b) w/o RTRS, (c) w/o Soft-Sketch Loss and Hard-Sketch Loss, (d) Soft-Sketch Loss, (e) Hard-Sketch Loss, and (f) Soft-Sketch + Hard-Sketch Loss.

As can be seen in Table 2, in the Yellow2Anime task the w/o RTRS model performs worst (highest FID and LPIPS), the one RTRS model scores in the middle, and the full RTRS model performs best.

In addition, the trend of the LPIPS scores in the Sketch2Anime task is the same as in the Yellow2Anime task. The one RTRS model significantly improves the FID score by exploiting the network’s StyleGAN2-style architecture, whereas the full RTRS model specifically improves the similarity between the input and generated images by searching over the layout. For the Sketch2Anime task, the FID scores are relatively stable across the three settings because the input images do not have a complex color layout.

4.4.2. Ablation Study of Sketch Loss

The effectiveness of using Sketch Loss to maintain image consistency is analyzed by ablation studies. First, we conduct comparative experiments by excluding both the Soft-Sketch Loss and Hard-Sketch Loss. The comparison results are shown in Table 3, and their visualization is shown in Figure 5.

Table 3. The ablation results of Sketch Loss.
Loss setting  Yellow2Anime FID↓  Yellow2Anime LPIPS↓  Sketch2Anime FID↓  Sketch2Anime LPIPS↓
w/o Soft-Sketch + Hard-Sketch 36.931 0.631 36.028 0.594
Soft-Sketch only 32.566 0.668 28.043 0.592
Hard-Sketch only 33.676 0.649 34.752 0.609
Soft-Sketch + Hard-Sketch 31.437 0.582 29.582 0.579

Table 3 shows that although using only Soft-Sketch Loss or only Hard-Sketch Loss yields better FID scores than using neither, both perform worse on LPIPS. This is because the optimization process tries to drive the generated image toward the target domain, which is further away from the source domain. We also observed that the LPIPS score is low at the beginning of optimization and increases at later stages, as shown in Figure 6.

Figure 6: Curve of the LPIPS score on the Yellow2Anime task, with full RTRS + Hard-Sketch Loss.

The main reason for the above phenomenon is that using either Soft-Sketch Loss or Hard-Sketch Loss alone causes the network to focus on only some of the features of the source domain, and the network may become overly focused on these specific features as training progresses. To overcome this shortcoming, we finally use the combination of Soft-Sketch Loss and Hard-Sketch Loss to constrain the network training.

5. Conclusion

In this paper, we propose a novel method, LDSGAN, to translate images to a distant target domain and generate high-quality images while maintaining the similarity between the input and generated images. The proposed LDSGAN uses SDNet to distill the input image’s information while retaining its rich maps. LDTNet is then adopted to further translate the feature maps to the target domain. Furthermore, TDSNet restores the color layout of the input images by searching the rich maps and the generated maps in the target domain. Finally, Sketch Loss is used to maintain the image identity between input and generated images. Comprehensive comparisons demonstrate that the proposed LDSGAN generates more vivid images in LDDT while maintaining color and layout, showing clear superiority over state-of-the-art work in LDDT. Since the proposed method still suffers from blurred detailed shapes, we aim to improve the correspondence of detailed shapes between input and generated images in future work.

Conflicts of Interest

    The authors declare no conflicts of interest.

    Author Contributions

    Hao Wang: conceptualization, methodology, software, formal analysis, and writing – original draft.

    Chenbin Wang: conceptualization, supervision, formal analysis, and writing – review and editing.

    Xin Cheng: conceptualization, supervision, formal analysis, and writing – review and editing.

    Hao Wu: conceptualization, supervision, formal analysis, and writing – review and editing.

    Jiawei Zhang: methodology, validation, investigation, writing – review and editing, and supervision.

    Jinwei Wang: project administration, methodology, validation, investigation, writing – review and editing, and supervision.

    Xiangyang Luo: funding acquisition, resources, and methodology.

    Bin Ma: data curation and visualization.

    Funding

    This work was supported by the National Key R and D Program of China (Grant No. 2021QY0700), National Natural Science Foundation of China (Grant Nos. 62472229, 62072250, U23A20305, U23B2022, 62371145, 62072480, 62172435, 62302249, 62272255, 62302248, and U20B2065), Zhongyuan Science and Technology Innovation Leading Talent Project of China (Grant No. 214200510019), Open Foundation of Henan Key Laboratory of Cyberspace Situation Awareness (Grant No. HNTS2022002), and Graduate Student Scientific Research Innovation Projects of Jiangsu Province (Grant No. KYCX24_1513).


      Data Availability Statement

      The data that support the findings of this study are available from the corresponding author upon reasonable request.
