Volume 7, Issue 7 e70287
RESEARCH ARTICLE
Open Access

The Subword-Character Multi-Scale Transformer With Learnable Positional Encoding for Machine Translation

Wenjing Yao

Corresponding Author

Department of Foreign Languages, Jinzhong University, Jinzhong, China

Correspondence: Wenjing Yao ([email protected])

Wei Zhou

Department of Information Technology and Engineering, Jinzhong University, Jinzhong, China

First published: 12 July 2025

Funding: This work was supported by the 2024 High-Quality Sharing Course program "Cross-Cultural Communication" of Jinzhong University, Shanxi Province (Grant K2024402), and by the Teaching Reform and Innovation Project of Colleges and Universities in Shanxi Province, "Research and Application of a VR Immersive Cross-Cultural Communication Teaching and Training Platform: A Case Study of Jinzhong University," from the Shanxi Provincial Education Department (Grant J20241332).

ABSTRACT

The transformer model addresses the efficiency bottleneck caused by sequential computation in traditional recurrent neural networks (RNN) by leveraging the self-attention mechanism to parallelize the capture of global dependencies. However, the subword-level modeling units and fixed-pattern positional encoding adopted by mainstream methods struggle to adequately capture fine-grained feature information in morphologically rich languages, limiting the model's flexible learning of target-side word order patterns. To address these challenges, this study innovatively constructs a subword-character multi-scale transformer architecture integrated with a learnable positional encoding mechanism. The model abandons traditional fixed-pattern positional encodings, enabling autonomous optimization of the positional representation space for source and target languages through end-to-end training mechanisms, significantly enhancing dynamic adaptability in cross-linguistic positional mapping. While preserving the global semantic modeling advantages of subword units, the framework introduces a lightweight character-level branch to supplement fine-grained features. For the fusion of the subword and character branches, we employ context-aware cross-attention to enable dynamic integration of linguistic information at different granularities. Our model achieves notable improvements in BLEU scores on the WMT'14 English-German (En-De), WMT'17 Chinese-English (Zh-En), and WMT'16 English-Romanian (En-Ro) benchmark tasks. These results demonstrate the synergistic effects of fine-grained multi-scale modeling and learnable positional encoding in enhancing translation quality and linguistic adaptability.

1 Introduction

Machine translation (MT), as a cornerstone of natural language processing (NLP), aims to eliminate language barriers by automatically converting source language text into the target language. In recent years, the exceptional performance of transformers in MT [1, 2] has significantly improved translation quality, driving a series of innovations in transformer architectures. Consider the German source sentence "Die Katzen fangen die Mäuse ein." with the English target "The cats catch the mice." German exhibits rich morphological variation (e.g., the plural noun "Mäuse"), while traditional transformers employ Byte Pair Encoding (BPE)-based subword segmentation. If BPE splits "Mäuse" into "Mäu" and "se," models that rely heavily on subword positional encoding may fail to adjust word order dynamically. Similarly, consider the Chinese source sentence "我喜欢吃苹果。" with the English target "I like eating apples." The Chinese character "吃" (chī, "eat") is mapped to an independent subword during segmentation. When "吃" neighbors contextual characters such as "苹" (píng, "apple"), models may misinterpret its semantic role, disrupting fine-grained semantic associations.

To address the limitations of traditional transformers [3] in handling long sequences or complex-structured data, multi-scale transformer models have been proposed to capture features at different hierarchical levels, better preserving both local and global information. Regarding research on multi-scale transformers, Refs. [4-6] proposed fixed-length window-based approaches that reduce computational complexity by restricting the scope of self-attention mechanisms while retaining local context; Ref. [7] introduced linguistically inspired local patterns to enhance the capture of language-specific features. However, these methods suffer from limited long-range dependency modeling due to fixed window sizes. Reference [8] designed a hybrid model integrating convolutional neural networks (CNNs) with self-attention, leveraging convolutional operations for local feature extraction followed by fusion with self-attention to achieve multi-scale feature modeling; Ref. [9] proposed a parallel architecture enabling simultaneous operation of convolution and self-attention to balance local and global information extraction. These hybrid models exhibit high structural complexity, making it challenging to balance complexity with computational efficiency. Other studies on multi-scale transformers generate sequences through subword segmentation methods such as Byte Pair Encoding (BPE) [10], WordPiece (WP) [11], and SentencePiece (SP) [12]. However, sequences generated by different subword methods vary in granularity (character, subword, word), posing challenges for cross-unit interaction modeling. SentencePiece [13] improved sequence generation by combining the advantages of subword segmentation but relied solely on subword units, neglecting character-level fine-grained features. To address this, we employ a multi-scale transformer with dual-path subword-character encoding to extract semantic information at different granularities, enabling dynamic interaction between subword and character-level fine-grained features.

Character-level features are influenced by subword structure and position, while the encoded sequential information of subwords can enhance translation quality. Therefore, positional encoding plays a crucial role in word order control. Traditional transformers inject positional signals by adding fixed positional encoding to word embeddings, assuming strict alignment between the absolute positions in the source language and ideal positions in the target language. Large language models (LLMs) [14] not only inherit sinusoidal positional encoding but also introduce trainable embedding parameters to enable dynamic positional weight adjustment. Shaw et al. [15] pioneered relative positional encoding but incurred substantial computational and memory overhead for long sequences while ignoring contextual dynamics. To address this, RoFormer [16] proposed Rotary Positional Encoding (RoPE), which integrates positional information through rotation angles mapped from position indices into geometric transformations in complex space, making attention scores explicitly reflect relative positional differences between tokens. However, while computationally efficient, the fixed-frequency rotation mechanism may restrict dynamic adaptation to complex positional patterns, particularly under non-uniform or semantically sensitive positional relationships, thereby limiting performance in fine-grained contextual tasks. To overcome this, we propose learnable positional encoding (LPE), which learns reordered positional information of target languages to dynamically adjust word order between source and target languages, further improving translation quality.

Traditional transformer models based on single subword segmentation face significant challenges when processing morphologically rich languages (e.g., German, Russian) or languages with complex character structures (e.g., Chinese). While subword segmentation alleviates out-of-vocabulary issues, its inherent coarse-grained nature may lead to the loss of morphological features such as roots and affixes (e.g., when the German plural noun "Mäuse" is split into "Mäu" and "se," models struggle to capture the semantic impact of morphological variations). Additionally, for languages like Chinese, single subword units may fragment semantic associations between characters. To address these limitations, we propose a dual-path architecture: the subword branch inherits the transformer's global modeling capability to capture long-range semantic dependencies, while the character branch supplements fine-grained morphological features through local graph convolutional networks. The two paths collaborate through dynamic cross-attention to achieve multi-scale semantic fusion from local to global levels, effectively resolving information loss and structural ambiguity inherent in single-path approaches. Furthermore, this work embeds LPE into the dual-path character-subword multi-scale transformer architecture. The contributions of this article are as follows:
  1. This article adopts a dual-path subword-character multi-scale transformer (SCMT). Through its dual-path design with parallel subword and character branches, the architecture breaks through the limitations of traditional single-granularity encoding, compensating for the fine-grained information loss caused by subword segmentation dependency in existing methods.
  2. We propose integrating LPE with trainable mechanisms to enhance positional modeling. By incorporating dedicated neural modules between subword/character embeddings and transformer encoders, we dynamically infuse target language word order reordering signals into the source language encoding process, effectively resolving translation deviations caused by cross-linguistic word order misalignment.

2 Related Work

2.1 Machine Translation

MT is an important research direction in the field of NLP. In its early stages, statistical machine translation (SMT) was the mainstream approach, with its core idea being to learn translation models from large-scale bilingual corpora. The proposal of neural machine translation (NMT) marked the entry of machine translation into the era of deep learning. NMT adopts an end-to-end neural network architecture, directly mapping source language sentences to target language sentences. The sequence-to-sequence (Seq2Seq) model based on the attention mechanism, proposed by Bahdanau et al. [17], dynamically captures the alignment relationship between the source and target languages through the attention mechanism, addressing the issue of long-range dependencies. Subsequently, the transformer model proposed by Vaswani et al. [18] further revolutionized NMT. It is entirely based on the self-attention mechanism and feedforward networks, abandoning the traditional recurrent structure, significantly improving translation performance and reducing training costs. The transformer and its derivative models (such as BERT [19] and GPT) have not only excelled in machine translation but also have driven the rapid development of other NLP tasks.

2.2 Multi-Transformer

In the field of machine translation, research on multi-scale transformers aims to enhance the modeling capability of complex linguistic structures by capturing multi-level semantic information, thereby improving translation quality. Traditional transformer models face limitations when processing long sequences or complex structural data, while multi-scale transformers effectively address this deficiency through collaborative modeling of local and global features. Local attention methods based on fixed window sizes reduce computational complexity by restricting the scope of self-attention. For instance, Shaw et al. [15] proposed a local self-attention mechanism that confines computations within fixed windows to reduce processing overhead for long sequences. Yang et al. [20] further introduced dynamic window adjustment mechanisms to enhance local feature capture capabilities. Hao et al. [5] designed a multi-granularity self-attention approach that assigns different heads to handle hierarchical phrase structures. Hybrid architecture methods achieve multi-scale collaborative modeling by integrating CNNs with self-attention mechanisms. For example, Zhao et al. [9] developed a CNN-transformer hybrid model that extracts local features through convolutional operations and subsequently fuses them with self-attention to capture global contextual information. Similarly, Li et al. [7] proposed the UMST model, which redefines multi-scale correlations among subwords, words, and phrases. However, such models often encounter challenges in balancing structural complexity and computational efficiency.

Recent studies demonstrate that hybrid granularity modeling can significantly enhance the representation of complex linguistic structures. For instance, Ref. [21] employs subword-phrase multi-level attention to capture syntactic hierarchies, but its fixed-level partitioning limits the utilization of character-level features. Reference [22] adopts a CNN-transformer hybrid architecture to extract local–global features, yet fixed convolutional kernel sizes result in long-range dependency loss. In contrast, our dual-path design achieves dynamic multi-granularity interaction through differentiable cross-attention, preserving the global modeling strengths of subwords while enhancing local morphological awareness via character-level GCNs, thereby extending the parallel convolution-self-attention architecture established in [9].

2.3 Positional Encoding

Positional encoding, as a core component of the transformer architecture, directly influences the model's capability to model word order and syntactic structures. The traditional transformer employs absolute positional encoding [18], which generates fixed vectors for each position using sine/cosine functions to directly inject positional information into word embeddings. However, this static encoding approach struggles to adapt to the misalignment of word order between source and target languages in cross-lingual scenarios. Shaw et al. [15] proposed relative positional encoding, which dynamically adjusts attention weights by computing the relative distances between queries and keys, effectively capturing local word order relationships. Music transformer [23] integrated dependency syntactic distances into relative positional encoding to enhance the model's perception of syntactic structures. Wang et al. [24] introduced the Dynamic Position Reordering mechanism (DynPOS), which maps source-language positions to ideal target-language positions through learnable affine transformations to alleviate translation deviations caused by word order misalignment. Additionally, Ke et al. [25] designed hierarchical positional encoding to model positional relationships at multi-granular levels of characters, subwords, and phrases, strengthening the representation capability for complex linguistic structures.

3 Methods

As shown in Figure 1, the proposed SCMT introduces innovative improvements based on the traditional encoder-decoder architecture, with its core innovation lying in the construction of a dynamically adaptive dual-granularity encoding system. In the encoder section, the model employs a dual-stream architecture with parallel subword and character branches. The subword branch inherits the original transformer encoder design, capturing global semantic associations across positions through multi-layer self-attention mechanisms—a design philosophy theoretically aligned with the deep feature interaction mechanisms emphasized in the weighted transformer proposed by [26]. The character branch incorporates neural network blocks to extract fine-grained features, distinct from the stacked structure of the multi-unit transformer by [27], effectively capturing character-level morphological variations and local neighborhood association features. The dual-path design of subwords and characters is derived from the hierarchical nature of linguistic structures: the subword branch models global semantic dependencies through self-attention mechanisms, designed to capture phrase-level relationships (e.g., verb-object dependencies), while the character branch employs lightweight graph convolutional networks (GCNs) to replace traditional self-attention. By constraining local interactions between characters through adjacency matrices, this approach effectively reduces computational complexity. Additionally, the model innovatively implements an LPE layer, which dynamically generates word order positional information for source and target languages through end-to-end training, replacing the traditional fixed-pattern sinusoidal encoding. This mechanism enables the model to autonomously adjust word order. At the feature interaction level, a context-aware cross-granularity attention gating mechanism is employed, dynamically regulating the fusion ratio of subword semantic features and character morphological features via learnable weight allocators, thereby achieving cross-granularity semantic compensation.
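To make the overall data flow concrete, the following PyTorch-style skeleton sketches how the subword branch, the character branch, the LPE block, and the cross-granularity attention connect. This is a minimal sketch under our own assumptions about layer sizes and wiring (module names, depths, and the omission of the gating step are illustrative); it is not the authors' implementation.

import torch
import torch.nn as nn

class SCMTEncoderSketch(nn.Module):
    """Illustrative skeleton of the dual-branch SCMT encoder described above."""
    def __init__(self, vocab_sub, vocab_char, d_model=512, nhead=8, layers=6):
        super().__init__()
        self.sub_emb = nn.Embedding(vocab_sub, d_model)
        self.char_emb = nn.Embedding(vocab_char, d_model)
        # Learnable positional encoding: a small trainable block (detailed in Section 3.2).
        self.lpe = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        # Subword branch: standard Transformer encoder stack for global semantics.
        self.sub_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=layers)
        # Character branch: lightweight GCN-style layers in place of self-attention.
        self.char_gcn = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(layers)])
        # Cross-granularity attention: character queries attend to subword keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, sub_ids, char_ids, char_adj, sin_pe):
        h = self.sub_emb(sub_ids) + sin_pe          # subword embedding + sinusoidal PE
        z = h + self.lpe(h)                         # inject learned positional signal
        s = self.sub_branch(z)                      # global subword representations
        c = self.char_emb(char_ids)
        for lin in self.char_gcn:                   # local message passing over characters
            c = torch.relu(char_adj @ lin(c))
        fused, _ = self.cross_attn(c, s, s)         # cross-granularity fusion (gating omitted)
        return fused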

FIGURE 1. The proposed multi-scale subword-character transformer architecture with learnable positional encoding.

3.1 Subword-Character Multi-Scale Transformer

The encoder in the proposed SCMT architecture consists of parallel subword and character branches. The subword branch, based on self-attention mechanisms and feed-forward neural network structures, achieves global dependency modeling of the input sequence. Unlike traditional transformer encoders, the subword branch in SCMT incorporates target-language word order positional information in its inputs. This addresses the tendency of conventional transformers to generate ungrammatical translations when source and target languages exhibit word order discrepancies, owing to their inability to adaptively adjust positional relationships. It overcomes the original positional encoding's dependency on source-side absolute positions, enabling the model to more flexibly capture structural differences between languages. Given an input subword sequence $X = \{x_1, x_2, \cdots, x_n\}$, where $x_i$ represents the $i$th subword embedding, each subword embedding is combined with positional encoding $p_i$ to preserve sequential information. This step enables the model to comprehend lexical semantics while capturing positional relationships. The calculation formula is expressed as follows:
$$ h_i = x_i + p_i $$ (1)
where $h_i$ represents the enhanced embedding after superposition.
Next, using our proposed learnable positional encoding, the enhanced embedding $h_i$ is input into the LPE module to effectively learn positional information in an end-to-end manner, and the learned positional information is combined with the original positional information. The calculation formula is expressed as follows:
$$ l_i = \mathrm{LPE}(h_i) $$ (2)
$$ z_i = l_i + h_i $$ (3)
where $z_i$ is the input to the first layer of the transformer encoder.
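A minimal numeric sketch of Equations (1)-(3) follows (PyTorch). The sinusoidal encoding is the standard formulation; a single Transformer encoder layer stands in for the full LPE block purely for illustration, and the tensor sizes are assumptions.

import torch
import torch.nn as nn

d_model, n = 512, 6                       # illustrative sizes
x = torch.randn(1, n, d_model)            # subword embeddings x_i
pos = torch.arange(n).unsqueeze(1)
div = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model))
p = torch.zeros(1, n, d_model)            # sinusoidal positional encoding p_i
p[0, :, 0::2] = torch.sin(pos * div)
p[0, :, 1::2] = torch.cos(pos * div)

h = x + p                                 # Eq. (1): h_i = x_i + p_i

# The paper's LPE is a trainable two-layer block; one layer stands in here.
lpe = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
l = lpe(h)                                # Eq. (2): l_i = LPE(h_i)
z = l + h                                 # Eq. (3): z_i = l_i + h_i, the encoder input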
In each layer of the encoder, the dependencies between elements in the input subword sequence are computed through a multi-head self-attention mechanism. For the $i$th layer, the input $Z_i = \{z_1, z_2, \cdots, z_n\}$ is linearly projected into Query, Key, and Value matrices, with the formulas expressed as follows:
$$ Q = Z_i W_Q, \quad K = Z_i W_K, \quad V = Z_i W_V $$ (4)
where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices.
The attention computation is performed on Q, K, and V, and the attention score is calculated as follows:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $$ (5)
The feed-forward neural network consists of two linear transformations and an activation function, used to further extract features:
$$ \mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$ (6)
where $W_1$ and $W_2$ are the transformation matrices, and $b_1$ and $b_2$ are the bias terms.
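The sketch below walks through Equations (4)-(6) for a single attention head; multi-head attention simply concatenates several such heads, and the dimensions and random weights are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

d_model, d_k, n = 512, 64, 6
Z = torch.randn(1, n, d_model)                            # encoder layer input Z_i
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))

Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V                       # Eq. (4): linear projections
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
attn = F.softmax(scores, dim=-1) @ V                      # Eq. (5): scaled dot-product attention

W_1, b_1 = torch.randn(d_k, 2048), torch.zeros(2048)      # position-wise feed-forward weights
W_2, b_2 = torch.randn(2048, d_k), torch.zeros(d_k)
ffn = torch.clamp(attn @ W_1 + b_1, min=0) @ W_2 + b_2    # Eq. (6): max(0, xW1 + b1)W2 + b2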
As shown in Figure 1, the character branch in the encoder of the SCMT architecture adopts a lightweight branch design. Compared with traditional multi-scale transformers that ignore character-level feature information, it can learn local fine-grained features, compensate for the shortcomings of subword tokenization, exhibit stronger tolerance for subword tokenization errors, and enhance the model's robustness. In character-based research, the effective utilization of boundary information plays a critical role. Existing studies [7] have shown that introducing lexical boundary information can significantly suppress redundant associations between subword units. However, when the same character appears in different words, the model struggles to accurately distinguish their semantic boundaries, leading to feature confusion. To address this issue, we propose a character branch structure based on the GCN method [22]. In the character branch, the multi-head self-attention in the transformer module is replaced with a character-level graph convolutional network (GCN). The GCN enhances the perception of inter-character relationships and captures local detailed information. This structure treats characters as fundamental nodes and constructs a local connection graph based on lexical boundary information. Through a multi-layer message-passing process, it gradually integrates local morphological features of characters with global semantic information. This computation can be described by the following formula:
$$ \mathrm{GCN} = \sigma\left(\tilde{B}_f^{-\frac{1}{2}} \tilde{M}_f \tilde{B}_f^{-\frac{1}{2}} \, x W_f^{g}\right) $$ (7)
where $\tilde{M}_f$ is the undirected graph adjacency matrix, $\tilde{B}_f$ denotes the degree matrix of $\tilde{M}_f$, and $W_f^{g}$ represents a linear transformation layer.
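A small sketch of one character-branch GCN layer per Equation (7). The choice of ReLU for the generic activation $\sigma$ and the addition of self-loops to the character graph follow standard GCN practice and are our assumptions; the toy adjacency matrix only connects characters inside the same word, mirroring the lexical-boundary construction described above.

import torch

def char_gcn_layer(x, M, W_f):
    """One character-branch GCN layer: sigma(B^{-1/2} M~ B^{-1/2} x W), Eq. (7)."""
    M_tilde = M + torch.eye(M.size(0))        # add self-loops to the character graph
    deg = M_tilde.sum(dim=1)                  # node degrees (diagonal of B~)
    B_inv_sqrt = torch.diag(deg.pow(-0.5))
    return torch.relu(B_inv_sqrt @ M_tilde @ B_inv_sqrt @ x @ W_f)

# Toy example: 4 characters forming two words, edges only within word boundaries.
M = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 0., 0.],
                  [0., 0., 0., 1.],
                  [0., 0., 1., 0.]])
x = torch.randn(4, 512)                       # character embeddings
W_f = torch.randn(512, 512)                   # linear transformation W_f^g
out = char_gcn_layer(x, M, W_f)               # (4, 512) character representations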
The subword branch takes BPE-segmented sequences as input, while the character branch processes raw character sequences, with both independently encoded to prevent feature interference. Through context-aware cross-attention [28], character Query vectors dynamically aggregate subword Key-Value pairs, enabling complementary integration of semantic and morphological features. This mechanism addresses the limitation of traditional cross-attention, which relies solely on source-language sequence information for target-side decoding, by innovatively integrating the historical contextual information from previously generated target-side tokens into the attention calculation. Specifically, when generating the $t$th target token, the decoder not only attends to source-language encoded representations through the cross-attention mechanism but also dynamically integrates the current decoder hidden state and contextual information from the first $t-1$ generated target tokens via a gated neural network. This design enables the model to adaptively adjust attention weights for source-language relevant information based on the generated semantic fragments, effectively capturing long-distance dependencies. Each position in the character branch attends to all Key vectors of the subword branch through its Query vectors and performs weighted aggregation of the subword branch's Value vectors based on the attention weights. The formulas are as follows:
$$ g_t = \sigma\left(W_g[h_t; c_{t-1}]\right) $$ (8)
$$ c_t = g_t \odot h_t + (1 - g_t) \odot c_{t-1} $$ (9)
$$ \mathrm{Attention}(Q_c, K_s, V_s) = \mathrm{softmax}\left(\frac{Q_c K_s^{T}}{\sqrt{d_k}}\right)V_s $$ (10)
$$ O_t = \mathrm{LayerNorm}\left(\mathrm{Attention}(Q_c, K_s, V_s) + c_t\right) $$ (11)
where $c_{t-1}$ denotes the historical context vector generated in the previous step, $h_t$ represents the hidden state of the decoder at time step $t$, $W_g$ indicates the learnable parameter matrix of the gated neural network, $g_t$ stands for the gating vector generated through the sigmoid function $\sigma$, $Q_c$ corresponds to the query vector of the character branch, $K_s$ refers to the key vector of the subword branch, and $V_s$ signifies the value vector of the subword branch. $O_t$ represents the final context-aware cross-attention output.

The pseudo-code for the interaction mechanism between the subword and character branches is given in Algorithm 1.

ALGORITHM 1. Two-branch interactive flow.

def cross_branch_interaction(h_char, h_subword, c_prev):
    # Gating network: fuse the character hidden state with the previous context, Eqs. (8)-(9)
    gate = sigmoid(W_gate @ concatenate(h_char, c_prev))
    c_current = gate * h_char + (1 - gate) * c_prev

    # Cross-granularity attention: character queries attend to subword keys/values, Eq. (10)
    q = h_char @ W_Q
    K, V = h_subword @ W_K, h_subword @ W_V
    attn_weights = softmax(q @ K.T / sqrt(d_k))

    # Residual connection with the gated context and layer normalization, Eq. (11)
    output = LayerNorm(attn_weights @ V + c_current)
    return output, c_current

Through this cross-attention computation, the two branches can adaptively capture dependencies between character-level fine-grained features and subword-level semantic features, enabling dynamic fusion and complementary integration of contextual information.

To demonstrate the advantages of the dual-branch collaborative mechanism, we use Chinese-to-English translation as an example: when processing the Chinese compound word "银行" (yínháng, "bank"), the subword branch may segment it into "银" (yín) and "行" (háng), while the character branch reinforces the distinction between "银行" (bank) and phrases such as "很行" (hěn xíng, "very capable"), in which "行" takes a different reading, by leveraging the local morphological features of the "钅" radical (metal component) in "银" and the semantic context of "行." This cross-granularity feature complementarity enables the model to dynamically resolve subword segmentation ambiguities.

3.2 Learnable Positional Encoding

The proposed LPE mechanism breaks through the static limitations of traditional sinusoidal encoding, serving as an independent neural network block composed of two transformer layers. The first transformer layer is responsible for source-language positional feature extraction, learning relative positional relationships between words in the input sequence through self-attention while fusing the initial features of the subword embeddings ($w_i$) and sinusoidal positional encoding ($p_i$). The second transformer layer (the right-hand network in Figure 2) performs cross-lingual positional remapping, dynamically learning target-language word order patterns through end-to-end training to map the source-language positional encoding $p_i$ to the ideal target-language position $p_j$. This layer incorporates FastAlign alignment supervision signals to enforce alignment between the network outputs $l_i$ and the target-language positional distributions. It takes the sum of the sinusoidal positional encoding and the subword embeddings as input (cf. Equation (1)), enabling the learning of reordered positional information in target languages.

Unlike traditional transformers that rigidly rely on sinusoidal positional encoding to fix word order, our approach achieves flexible adjustment of positional information through the integration of sinusoidal positional encoding with subword embeddings. When a subword token $w_i$ at position $i$ in the source language optimally corresponds to position $j$ in the target-language order, the learnable positional encoder autonomously learns the distribution characteristics of the target position $p_j$. This learned information is then combined with $p_i$ to inject cross-linguistic pre-reordering signals into $w_i$, effectively mitigating translation errors caused by source-target word order misalignment.
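A compact sketch of this two-layer design is given below (PyTorch). The paper specifies only that LPE consists of two transformer layers; the layer type and hyperparameters here are assumptions for illustration.

import torch
import torch.nn as nn

class LearnablePositionalEncoder(nn.Module):
    """Sketch of the LPE block: the first layer extracts source-side positional
    features, the second remaps them toward the target-language order."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.src_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.reorder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, h):                  # h_i = subword embedding + sinusoidal PE
        feats = self.src_layer(h)          # source-language positional features
        return self.reorder_layer(feats)   # cross-lingual remapping -> l_i

lpe = LearnablePositionalEncoder()
h = torch.randn(1, 6, 512)                 # a toy 6-token sequence
l = lpe(h)                                 # learnable positional encoding l_i
z = l + h                                  # encoder input, as in Equation (3)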

As illustrated in Figure 2, the learnable positional encoder (LPE) consists of two sequentially connected transformer layers. While maintaining the parameter scale of standard transformers, the end-to-end joint optimization mechanism significantly enhances the system's ability to model syntactic ordering. Alignment supervision operates through a dual mechanism: after summing the sinusoidal positional encoding with the subword embeddings, the input is processed by LPE to generate the learnable positional encoding $l_i$, which is forced to align with the target positional encoding $p_j$ determined by FastAlign. This alignment is enforced via a mean squared error loss function:
$$ L_o = \frac{\sum_{i=1}^{|l_{th}|} \mathcal{F}(p_j, l_i)}{|l_{th}|} $$ (12)
where $p_j$ is the supervising position, $\mathcal{F}(\cdot)$ is the mean squared error function, and $|l_{th}|$ is the source sequence length.
FIGURE 2. Learnable positional encoding architecture. LPE is the learnable positional encoding module; $l_1$ is the output representation of the subword at the first position summed with the target position at the first position after passing through the LPE module; $w_i$ is the subword or character token; $p_i$ is the target position.
The final total loss function $L_{\mathrm{final}}$ is obtained through a weighted summation of the transformer loss function $L_T$ and $L_o$, with the calculation formula as follows:
$$ L_{\mathrm{final}} = \mu L_T + (1 - \mu) L_o $$ (13)
where $\mu$ is determined by minimizing the total loss during training.
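As a hedged sketch of how the two terms might be combined in practice (tensor shapes and the fixed value of μ are our assumptions; the paper tunes μ by minimizing the total loss during training, and the aligned target positions would come from an offline FastAlign pass):

import torch
import torch.nn.functional as F

def lpe_training_loss(l, p_aligned, logits, targets, mu=0.7):
    """Combine the alignment loss (Eq. 12) with the translation loss into Eq. (13).
    l         : LPE outputs l_i, shape (batch, src_len, d_model)
    p_aligned : sinusoidal encodings of the FastAlign target positions p_j (same shape)
    logits    : decoder output scores, shape (batch, tgt_len, vocab)
    targets   : reference token ids, shape (batch, tgt_len)
    """
    L_o = F.mse_loss(l, p_aligned)                           # Equation (12)
    L_T = F.cross_entropy(logits.transpose(1, 2), targets)   # standard NMT cross-entropy
    return mu * L_T + (1.0 - mu) * L_o                       # Equation (13)

loss = lpe_training_loss(torch.randn(2, 6, 512), torch.randn(2, 6, 512),
                         torch.randn(2, 8, 32000), torch.randint(0, 32000, (2, 8)))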

4 Experiments

To comprehensively validate the effectiveness of the proposed model, we conduct rigorous experiments across three widely recognized English-centric machine translation benchmarks: WMT'14 English-German (En-De) [29] with 4.5 M parallel sentence pairs, WMT'17 Chinese-English (Zh-En) [30] containing 20 M bilingual segments, and WMT'16 English-Romanian (En-Ro) [31] comprising 0.6 M training instances. All datasets undergo standardized preprocessing, including byte-pair encoding (BPE) with 32 K merge operations, sentence normalization, and length-based filtering to ensure data quality.

4.1 Datasets

The WMT'14 English-German (En-De) [29] dataset comprises 4.5 million standardized bilingual sentence pairs, constructed through high-quality training set preparation steps including sentence length filtering, punctuation normalization, and duplicate data cleaning, with newstest2016 serving as the validation benchmark for hyperparameter tuning and the more challenging newstest2014 as the final performance test set.

The WMT'16 English-Romanian (En-Ro) [31] dataset employs a preprocessing framework jointly developed by [32, 33], implementing a cross-lingual Byte Pair Encoding (BPE) fusion strategy to build a vocabulary containing 40,000 shared subword units, where newsdev2016 and newstest2016 are designated for model validation and testing respectively.

The WMT'17 Chinese-English (Zh-En) [30] dataset presents greater challenges, integrating full resources from the WMT'17 competition to cover formal texts from the United Nations multilingual conference parallel corpus (15.8 million sentence pairs), multi-domain data from the CWMT corpus (9 million sentence pairs), and time-sensitive content from news commentary corpora (332,000 sentence pairs). Utilizing 18 million high-quality bilingual instances, it strictly adopts the officially partitioned newsdev2017 and newstest2017 as validation and testing benchmarks to ensure experimental result comparability and reliability.
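For readers who want to reproduce the subword preprocessing step, the sketch below uses SentencePiece to train a joint BPE model. This is only an illustration: the paper reports BPE with 32 K merge operations and a 40,000-item shared En-Ro vocabulary, and the file names and exact tool choice here are assumptions rather than the authors' pipeline.

import sentencepiece as spm

# Train a joint BPE model on the concatenated training sides (placeholder file names).
spm.SentencePieceTrainer.train(
    input="train.en,train.ro", model_prefix="joint_bpe",
    vocab_size=40000, model_type="bpe", character_coverage=1.0)

sp = spm.SentencePieceProcessor(model_file="joint_bpe.model")
print(sp.encode("The cats catch the mice.", out_type=str))
# Output pieces depend on the training corpus, e.g. ['▁The', '▁cats', '▁catch', '▁the', '▁mice', '.']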

4.2 Experimental Setup

4.2.1 Evaluation Metrics

In our experiments, we employ the BLEU score, a widely used evaluation metric in machine translation, to assess translation quality. BLEU measures the n-gram overlap between machine-translated outputs and human reference translations. Known for its computational efficiency and ease of implementation, BLEU is particularly suited for large-scale evaluations. The formula is calculated as follows:
$$ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $$ (14)
Here, $p_n$ represents the $n$-gram precision, $\mathrm{BP}$ denotes the brevity penalty, and $w_n$ is the $n$-gram weight.
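A toy single-sentence implementation of Equation (14) with uniform weights $w_n = 1/N$ is shown below; large-scale evaluations typically use corpus-level tools such as sacreBLEU, and the smoothing constant used to avoid log(0) is an assumption for this toy case.

import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Single-sentence BLEU per Eq. (14) with uniform n-gram weights."""
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())          # clipped n-gram matches
        total = max(sum(cand_ngrams.values()), 1)
        p_n = max(overlap, 1e-9) / total                            # avoid log(0) in the toy case
        log_precisions.append(math.log(p_n))
    # Brevity penalty: 1 if the candidate is longer than the reference, else exp(1 - r/c).
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

print(bleu("the cats catch the mice".split(), "the cats catch the mice".split()))  # 1.0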

4.2.2 Implementation Details

The experimental configuration utilized uniform parameter initialization with all weights set to 0.1, a value deliberately chosen to maintain numerical stability during the initial training phase and prevent gradient explosion. Computational workloads were distributed across 8 NVIDIA Tesla V100 GPUs (each with 32 GB VRAM) through synchronous data parallelism, employing optimized CUDA kernel implementations and Tensor Core-enabled mixed-precision training. We adopted the Adam optimizer [34] with momentum parameters $\beta = (0.9, 0.997)$, slightly adjusting the second-moment decay rate from the conventional 0.999 to better accommodate the statistical characteristics of multilingual training corpora. The learning rate followed a cyclic scheduling strategy with a base value of 3e−4, incorporating a 10,000-step warmup phase and a cosine annealing mechanism over 300,000 training iterations. Batch sizes were dynamically adjusted through gradient accumulation (4 steps per update) to maintain effective batch sizes, while gradient clipping at a norm of 2.0 ensured stable optimization.
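The sketch below wires together the quoted optimizer and schedule settings in PyTorch. The stand-in model, the exact warmup/cosine formulation, and the shortened toy loop are our assumptions; only the numeric hyperparameters come from the text.

import math
import torch

model = torch.nn.Linear(512, 512)                      # stand-in for the SCMT model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.997))

warmup, total_steps = 10_000, 300_000
def lr_lambda(step):
    if step < warmup:                                  # linear warmup phase
        return step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine annealing

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

accum_steps = 4                                        # gradient accumulation (4 steps per update)
for step in range(16):                                 # toy loop; real training runs 300k iterations
    loss = model(torch.randn(8, 512)).sum() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 2.0)   # clip at norm 2.0
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()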

Additionally, the parameter configuration of the LPE module is as follows. The dimensionality of all input/output word embeddings is 512, with the output dimensions of the attention layers and feed-forward layers also maintained at 512. Each attention layer has 8 heads, with each head having a dimensionality of 64. The dropout rate is set to 0.1 for regularization. The Adam optimizer is employed with parameters $\beta_1 = 0.9$, $\beta_2 = 0.98$.

4.3 Loss Function

The loss function of this model adopts a multi-task joint training strategy, consisting of the cross-entropy loss for machine translation and the character-level morphological feature reconstruction loss, achieving joint optimization of dual-granularity information through a dynamic weight coefficient. The cross-entropy loss for machine translation serves as the primary task loss, calculating the standard cross-entropy loss between the model's final prediction results and the ground-truth target sequence to evaluate the decoding quality of the subword-character fused representations. The formula is expressed as:
$$ L_1 = -\sum_{t=1}^{T} \log P\left(y_t \mid y_{<t}, X_{\mathrm{sub}}, X_{\mathrm{char}}\right) $$ (15)
where $y_t$ represents the $t$th target token, and $X_{\mathrm{sub}}$ and $X_{\mathrm{char}}$ denote the encoded outputs from the subword branch and character branch, respectively.
The auxiliary task loss enhances the model's ability to capture fine-grained information by reconstructing the character-level morphological features of the input sequence through intermediate representations from the character branch, which is computed using the mean squared error (MSE):
$$ L_2 = \frac{1}{N} \sum_{i=1}^{N} \left\| h_{\mathrm{char}}^{(i)} - e_{\mathrm{char}}^{(i)} \right\|^2 $$ (16)
where $h_{\mathrm{char}}^{(i)}$ denotes the hidden state at the $i$th position of the character branch, $e_{\mathrm{char}}^{(i)}$ represents the corresponding character embedding vector, and $N$ is the number of character positions.
The total loss dynamically balances the contributions of the primary task and the auxiliary task by introducing a learnable weight coefficient $\lambda$, formulated as follows:
$$ L_0 = L_1 + \lambda L_2 $$ (17)
where $\lambda$ is automatically optimized via gradient descent to ensure the stability of dual-branch training.
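A minimal sketch of one way to realize the learnable weight λ is shown below; treating λ as a trainable positive scalar is our assumption, since the paper only states that λ is optimized by gradient descent, and the tensor shapes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualGranularityLoss(nn.Module):
    """Multi-task objective L_0 = L_1 + lambda * L_2 (Eqs. 15-17) with trainable lambda."""
    def __init__(self):
        super().__init__()
        self.log_lam = nn.Parameter(torch.zeros(()))    # lambda = exp(log_lam) > 0

    def forward(self, logits, targets, h_char, e_char):
        L1 = F.cross_entropy(logits.transpose(1, 2), targets)   # Eq. (15), translation loss
        L2 = F.mse_loss(h_char, e_char)                          # Eq. (16), reconstruction loss
        return L1 + torch.exp(self.log_lam) * L2                 # Eq. (17)

criterion = DualGranularityLoss()
loss = criterion(torch.randn(2, 7, 32000), torch.randint(0, 32000, (2, 7)),
                 torch.randn(2, 9, 512), torch.randn(2, 9, 512))
loss.backward()    # gradients also flow into log_lam, so lambda is learned jointly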

5 Results

5.1 Ablation Study

To validate the effectiveness of the LPE proposed in this article, we conducted ablation experiments on LPE using the WMT'14 English-German (En-De) dataset. As shown in Table 1, the BLEU score of the Transformer Base increased by 0.8 after incorporating LPE, indicating that LPE enhances machine translation performance. Our proposed dual-path multiscale transformer based on characters and subwords achieved a BLEU score of 27.8 without LPE, and this score increased by 1.4 after adding LPE. This demonstrates that the proposed SCMT outperforms the transformer, and compared with sinusoidal positional encoding, LPE can flexibly capture positional information in the target language, improving the model's dynamic adaptation capability for translation.

TABLE 1. Ablation study to validate the effectiveness of the LPE module on the WMT'14 English-German (En-De) task.
Model Param(M) BLEU
Transformer Base (−LPE) 64.6 26.9
Transformer Base (+LPE) 70.5 27.7
SCMT(−LPE) 74.3 27.8
SCMT(+LPE) 77.8 29.2

Furthermore, as shown in Table 2, we investigated the performance differences between LPE and mainstream relative positional encodings on the WMT'14 En-De task. LPE achieves a 0.7 BLEU improvement over RoPE on the standard metric and exhibits the smallest performance degradation in long-sequence scenarios (> 512 tokens). LPE outperforms ALiBi by 0.8 BLEU while also demonstrating superior performance in long-sequence contexts. T5-relative scores 1.0 BLEU lower than LPE and shows the most significant degradation in long-sequence scenarios, dropping to 25.3 BLEU. This demonstrates that LPE effectively mitigates long-distance dependency attenuation through end-to-end learning of target-language positional information.

TABLE 2. Ablation study on LPE and mainstream relative positional encodings in the WMT'14 En-De task.
Positional encodings types Param(M) BLEU Long-sequence (> 512) BLEU
T5-relative 72.1 28.2 25.3
ALiBi 70.8 28.4 26.7
RoPE 71.3 28.5 26.9
LPE (Ours) 77.8 29.2 27.8

The SCMT proposed in this article is composed of dual branches for subwords and characters. To validate that the character branch can learn local fine-grained features and enhance model performance when combined with the subword branch, we demonstrate the effectiveness of the character branch on the challenging WMT'17 Chinese-English (Zh-En) task. As shown in Table 3, with the addition of LPE, Sub-SCMT achieves an improvement of 0.8 BLEU over Sub-Transformer Big. Although the BLEU score of Char-Transformer Big is lower than that of Sub-Transformer Big, it confirms the effectiveness of the character branch. When the subword branch and character branch are combined, the translation performance of the model is significantly enhanced. This indicates that the dynamic interaction between character-level and subword-level feature information is key to improving machine translation quality.

TABLE 3. Ablation study to validate the effectiveness of the character branch in the SCMT model on the WMT'17 Chinese-English (Zh-En) task.
Model Param(M) BLEU
Sub-Transformer Big 260.1 24.4
Char-Transformer Big 257.1 23.6
Sub-SCMT 265.3 25.2
Char-SCMT 263.0 24.8
SCMT(Sub + Char) 292.2 26.2

The previous experiments have demonstrated the effectiveness of the character branch. To further investigate the impact of character width on model performance, we designed ablation experiments to explore the effects of character widths of 16, 32, 64, 128, and 256 on the WMT'14 English-German (En-De) translation task. Table 4 presents the ablation results of varying character widths on model performance for the WMT'14 En-De task. When the character width is 16, the character branch captures fine-grained features, increasing the BLEU score to 28.6. As the width expands from 16 to 256, the number of model parameters grows linearly, with the peak performance achieved at a width of 32—corresponding to 77.8 M parameters and a BLEU score of 29.2, which represents a 1.5-point improvement over the baseline model. Notably, when the width further increases to 64/128/256, despite the continued growth in parameter size, the BLEU scores show a declining trend, indicating that excessive model width may lead to over-parameterization. This experiment reveals that the SCMT architecture achieves optimal performance balance at a width of 32, effectively enhancing translation quality without significantly increasing model complexity.

TABLE 4. Ablation study on the impact of varying character widths on model performance on the WMT'14 English-German (En-De) task.
Model Width Param(M) BLEU
Transformer + LPE — 70.5 27.7
SCMT 16 74.9 28.6
SCMT 32 77.8 29.2
SCMT 64 79.2 28.6
SCMT 128 81.3 28.5
SCMT 256 83.5 28.4

5.2 Comparisons With Other Methods

We designed comparative experiments to evaluate the performance of SCMT against other advanced methods on the WMT'14 English-German (En-De), WMT'16 English-Romanian (En-Ro), and WMT'17 Chinese-English (Zh-En) tasks. Table 5 shows the performance comparison of models on the WMT'16 En-Ro task. The UMST model achieved the highest BLEU score of 35.5 with 61.3 M parameters, while the proposed SCMT model obtained a BLEU score of 34.8 with 71 M parameters, which, although slightly lower than UMST, significantly outperformed models like TNT (34.1) and DELIGHT (34.7). This is because Romanian exhibits flexible SVO/SOV word order with clitic doubling, necessitating dynamic positional adaptation. While our LPE mechanism operates under the target-side gradual reordering assumption, it may struggle with handling discontinuous dependencies prevalent in the language's non-configurational syntax. In contrast, UMST employs a fixed-window local attention mechanism that might better capture such local dislocation patterns through structural inductive biases. In the WMT'14 En-De task (Table 6), SCMT achieved a BLEU score of 28.7 with 75 M parameters, showing a 0.6 gap compared with the 130.3 M-parameter Multi-Unit (29.3 BLEU) but demonstrating superior parameter efficiency with 55.3 M fewer parameters. For the WMT'17 Zh-En task (Table 7), SCMT achieved state-of-the-art performance with 25.9 BLEU using 290.1 M parameters, outperforming UMST and TranSFormer by 0.7 and 0.3 BLEU points respectively.

TABLE 5. Performance comparison results of various state-of-the-art methods on the WMT'16 English-Romanian (En-Ro) task.
Model Param(M) BLEU
TNT [35] 75.3 34.1
ODE Transformer [36] 70.1 35.2
DELIGHT [7] 55.0 34.7
UMST [21] 61.3 35.5
TranSFormer [37] 60.1 35.4
SCMT 71.0 34.8
TABLE 6. Performance comparison results of various previous methods on the WMT'14 English-German (En-De) task.
Model Param(M) BLEU
UMST [21] 70.0 28.5
Multi-Unit [25] 130.3 29.3
ConvTransformer [38] 65.1 23.5
TranSFormer [37] 66.3 28.6
SCMT 75.0 28.7
TABLE 7. Performance comparison results of various previous methods on WMT'17 Chinese-English (Zh-En).
Model Param(M) BLEU
Localness [20] 307.4 25.0
UMST [7] 307.2 25.2
TranSFormer [37] 283.1 25.6
SCMT 290.1 25.9

Compared with previous methods, SCMT exhibits significant advantages due to its LPE design, which dynamically captures word order information in both source and target languages, effectively addressing the limitations of traditional fixed positional encoding in modeling word order. Additionally, the character branch design enables fine-grained multi-scale modeling, particularly demonstrating enhanced feature extraction capabilities for linguistic units with complex structures like Chinese characters. These two technologies form a synergistic effect, significantly improving the model's adaptability to different language pairs without excessively increasing parameter count (SCMT's parameter scale is controlled within ±7% of comparable models). Especially in the Chinese-English translation task, the character branch's modeling advantage for subword information such as Chinese radicals and strokes is fully unleashed, enabling the model to achieve remarkable performance improvements with only a 2.5% increase in parameters.

Our method demonstrates particularly outstanding performance in morphologically rich languages. As shown in Table 8, when the inflected form “învaţă” (he learns) of the Romanian verb “a învăța” (to learn) is segmented by BPE into “în + vaţă,” the character branch collaboratively corrects tense misjudgments caused by subword segmentation through morphological feature analysis of “ţ” (t with cedilla) and “ă” (a with breve).

TABLE 8. Comparison of translation accuracy for Romanian verb inflections.
Inflected verb form Reference translation (EN) Baseline output SCMT output
a învăța (infinitive) to learn to learn to learn
Învață (3sg present) he learns learn (missing subject) he learns
Învățau (3pl imperfect) they were learning they learned (tense error) they were learning
Învățând (gerund) learning to learn (POS error) learning
a fi învățat (perfect) to have learned to learn (aspect error) to have learned

6 Conclusion

This article proposes an innovative multi-scale transformer architecture—the subword-character multi-scale transformer with learnable positional encoding (SCMT)—which achieves significant performance breakthroughs in machine translation through deep integration of linguistic features at different granularities. The core innovation of SCMT lies in its synergistically operating dual-branch structure, constructing a unique bimodal feature fusion framework. The subword branch employs the standard transformer architecture to capture global semantic relationships between words, while the character branch utilizes fine-grained modeling techniques to precisely capture local morphological features at the character level. These two branches dynamically interact through a context-aware cross-attention mechanism, enabling the exchange of character-level and subword-level feature information, thereby modeling global relationships while capturing subtle features. This effectively addresses the issue of fine-grained information loss in traditional multi-scale transformers caused by the dominance of coarse-grained features and neglect of character-level features. Notably, the architecture introduces an LPE module, which adaptively captures the positional relationships of subword sequences in both source and target languages through end-to-end training. This enhances the model's ability to model positional information of source and target words, demonstrating significant advantages in long-range dependency modeling and fine-grained perception of morphologically complex languages. The synergistic effect between SCMT's cross-granularity feature extraction mechanism and dynamic positional modeling provides a new technical paradigm for processing morphologically complex languages. Experimental results show that SCMT achieves remarkable performance improvements in mainstream machine translation tasks such as WMT'14 English-German (En-De), WMT'17 Chinese-English (Zh-En), and WMT'16 English-Romanian (En-Ro).

Author Contributions

Wenjing Yao: writing – review and editing, writing – original draft, validation, data curation, resources. Wei Zhou: writing – original draft, writing – review and editing, software, formal analysis.

Ethics Statement

The authors have nothing to report.

Conflicts of Interest

The authors declare no conflicts of interest.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.
