Volume 2018, Issue 1 9269157
Research Article
Open Access

Efficient Nonrecursive Bit-Parallel Karatsuba Multiplier for a Special Class of Trinomials

Yin Li

Corresponding Author

Yin Li

Department of Computer Science and Technology, Xinyang Normal University, Nanhu Road 237, Xinyang, Henan, China xytc.edu.cn

Search for more papers by this author
Yu Zhang

Yu Zhang

Department of Computer Science and Technology, Xinyang Normal University, Nanhu Road 237, Xinyang, Henan, China xytc.edu.cn

Search for more papers by this author
Xiaoli Guo

Xiaoli Guo

Department of Computer Science and Technology, Xinyang Normal University, Nanhu Road 237, Xinyang, Henan, China xytc.edu.cn

Search for more papers by this author
First published: 11 January 2018
Citations: 6
Academic Editor: Junqing Sun

Abstract

Recently, we present a novel Mastrovito form of nonrecursive Karatsuba multiplier for all trinomials. Specifically, we found that related Mastrovito matrix is very simple for equally spaced trinomial (EST) combined with classic Karatsuba algorithm (KA), which leads to a highly efficient Karatsuba multiplier. In this paper, we consider a new special class of irreducible trinomial, namely, xm + xm/3 + 1. Based on a three-term KA and shifted polynomial basis (SPB), a novel bit-parallel multiplier is derived with better space and time complexity. As a main contribution, the proposed multiplier costs about 2/3 circuit gates of the fastest multipliers, while its time delay matches our former result. To the best of our knowledge, this is the first time that the space complexity bound is reached without increasing the gate delay.

1. Introduction

Efficient hardware implementation of the finite field arithmetic, especially for GF(2m), is frequently desired in coding theory and public-key cryptosystems [1, 2]. Among these arithmetic operations in GF(2m), multiplication is of the most importance, as other complicated field operations such as exponentiation and inversion can be carried out by iterative multiplications. Thus, it is necessary to design efficient multiplier.

The field elements are usually represented by a certain basis such as polynomial basis (PB), normal basis (NB), and dual basis (DB). In PB representation, the multiplication consists of multiplying two polynomials and reducing the result modulo an irreducible polynomial. The choice of such an irreducible polynomial is critical to perform the reduction operation efficiently. Irreducible trinomial is one of the most common considerations [3, 4]. During recent years, many bit-parallel multipliers using PB representation are proposed for GF(2m) defined by irreducible trinomials, some of which can be found in [3, 58]. The efficiency of the architecture is always evaluated by space and time complexity. The former one is expressed in terms of the number of logic gates (XOR and AND) and the latter one is expressed in terms of the sum of XOR and AND gates delay of the critical path. Among these multipliers, the fastest bit-parallel multipliers nowadays are proposed by Fan and Hasan [9] and Hariri and Reyhani-Masoleh [10]. If GF(2m) is defined by f(x) = xm + xk + 1,   1 < km/2, the corresponding multiplier requires m2 AND and m2 − 1 XOR gates with time delay TA + ⌈log2(2mk)⌉TX (for good fields, the time delay is TA + ⌈log2mTX), where TA and TX are the circuit delay of one AND gate and one XOR gate, respectively. Except for these multipliers for general trinomials, there are also several proposals for special types of irreducible trinomials [1113]. These multipliers usually utilize the special form of the trinomial to obtain efficient implementation.

The Karatsuba algorithm (KA) works recursively by breaking down one big multiplication into two or more submultiplications. It is a typical divide-and-conquer algorithm. Please note that the classic KA starts with a way to multiply two 2-term polynomials using three scalar multiplications. Some other variations are also investigated. More details can be found in [1416]. The KA can be adopted to design subquadratic complexity multiplier [14, 17] or hybrid multiplier [18, 19]. Specially, there is another type of hybrid multiplier, namely, nonrecursive Karatsuba multiplier, which only applies KA once in the polynomial multiplication [8, 20]. These multipliers regularly require 3/4 circuits gates compared to the fastest bit-parallel multipliers, while its time delay increased by a small number of TX. For example, Elia et al. [8] costs at least two more TX.

Recently, we proposed a novel nonrecursive Karatsuba multiplier that is based on Mastrovito approach [21]. It is shown that our multiplier only requires one more TX compared with the fastest multipliers [9, 10]. However, it costs a few more logic gates than Elia′s result. Except for the nonrecursive Karatsuba multiplier for general trinomials, Shen and Jin [13] proposed a new Karatsuba multiplier that fully exploited equally spaced trinomial and the classic KA to simplify the modular reduction. Consequently, the space complexity of their scheme matches Elia′s result. Meanwhile, the time complexity is TA + (1 + ⌈log2(m − 1)⌉)TX, which is roughly equal to the fastest results. Furthermore, we observe that the special case m = 2k of our multiplier coincides with their scheme. (Here, the trinomial xm + xk + 1  (m = 2k) is an equally spaced trinomial.)

In this paper, we explore another special case of our former scheme to obtain even more efficient nonrecursive Karatsuba multipliers. Our main idea is analogous to Shen and Jin [13], where a special type of trinomials and a KA variation are utilized to simplify the structure of corresponding Mastrovito matrix. More explicitly, we consider the irreducible trinomial xm + xm/3 + 1 and a three-term Karatsuba algorithm. It is demonstrated that the corresponding Mastrovito matrix can be simplified further under this condition. The shifted polynomial basis (SPB) [4] is also utilized to reduce the critical path delay further. Consequently, we proposed a bit-parallel multiplier that costs approximately 2/3 circuit gates of the fastest bit-parallel multipliers. On the other hand, the time complexity is TA + ⌈log2(8m/3)⌉TX, which almost matches the best known results.

The rest of this paper is organized as follows: In Section 2, we briefly review the Mastrovito approach based on SPB representation and some relevant notions. Then we introduce a three-term KA formula and investigate the structure of related Mastrovito matrix. A new bit-parallel multiplier architecture is then proposed in Section 3. Section 4 presents a comparison between the proposed multiplier and some others. Finally, some conclusions are drawn.

2. Preliminary

In this section, we briefly review some related notations and algorithms used throughout this paper. Consider the finite field GF(2m) generated with an irreducible trinomial xm + xm/3 + 1. Let x be a root of xm + xm/3 + 1 and the set M = {xm−1, xm−2, …, x, 1} constitute a polynomial basis (PB). Therefore, every element of GF(2m) can be represented as a polynomial over of degree less than m. The shifted polynomial basis (SPB) is a variation of the polynomial basis, which is obtained by multiplying the set M by certain exponentiation of x.

Definition 1 (see [4].)Let v be an integer and the ordered set M = {xm−1, …, x, 1} be a polynomial basis of GF(2m) over . The ordered set xvM≔{xiv∣0 ≤ im − 1} is called the shifted polynomial basis with respect to M.

Generally speaking, the optimal choice of v for irreducible trinomial is equal to the middle term degree or it minus one [4]. In this case, we have v = m/3 and use this denotation thereafter. It follows that the field element AGF(2m) can be expressed with respect to SPB as follows:
(1)
Given two elements of GF(2m) under SPB representation, that is, , , the field multiplication can be performed as
(2)
Obviously, the product D = AB is thus equal to
(3)
Analogous to ordinary polynomial multiplication, this product can be computed by a matrix-vector multiplication d = A · b, where b, d express the coefficient vectors of B(x) and D(x), and the matrix A is given by
(4)
The difference between the above matrix and the usual PB case [3] is simply the labels of the lines in left side, which indicate the exponent of indeterminate x for each line.
We then reduce the above matrix in view to obtain the field product expressed in SPB representation. The reduced matrix, denoted by M, is called Mastrovito matrix. Thus, the SPB field multiplication is rewritten as
(5)
where c denotes the coefficient vector of C(x). The structure of M relies on A and the modular reduction rule. In this case, we should obey the following reduction rule:
(6)
However, if we directly reduce the product matrix presented in (4) using the above formulae and perform matrix-vector multiplication, there is no difference between this computation and the general case. In the following section, we will construct a new Mastrovito matrix using a three-term Karatsuba algorithm and describe a highly efficient bit-parallel multiplier.

Moreover, one can check that the irreducible trinomial in the form of xm + xm/3 + 1 exists when m = 3 × 7i where i is a nonnegative integer [1]. Although the number of this type of irreducible trinomials is not that abundant, there still exist some trinomials in the range of interest for practical application.

In the end, we also introduce some notations pertaining to matrices and vectors, which are already proposed in [21, 23] and extensively used throughout this paper.
  • (i)

    Z(i, :) represents the ith row vector in matrix Z;

  • (ii)

    Z(:, j) represents the jth column vector in matrix Z;

  • (iii)

    Z(i, j) represents the entry with position (i, j) in matrix Z.

3. Mastrovito Multiplier Using a Three-Term Karatsuba Algorithm

The Karatsuba algorithm [2] has been applied to improve the efficiency of bit-parallel multiplier for GF(2m) generated by an AOP [20] and a trinomial [8, 13, 21]. It starts with a way to multiply two two-term polynomials using three scalar multiplications which can reduce the space complexity of the multipliers by approximately a factor of 3/4. Besides the classic algorithm, there exist several generalizations with respect to the Karatsuba algorithm [1416]. Here, we are only focus on a simple Karatsuba algorithm variation, three-term Karatsuba algorithm, which multiplies two three-term polynomials using six scalar multiplications. Given two three-term polynomials in , one can check that
(7)

In general, the Mastrovito multiplication utilizing the KA will increase the time complexity. Our former result shows that a Mastrovito multiplier using classic KA costs one more TX than the fastest ones. However, some literature sources [13] indicated that this result would be further improved for some special cases, for example, the EST xm + xm/2 + 1. In the following, we will show that for the trinomial xm + xm/3 + 1, applying the three-term Karatsuba-like formula will also simplify the reduction operation and lead to fast implementation.

Let f(x) = xm + xm/3 + 1 be an irreducible trinomial and , be two field elements in SPB representation. We partition A, B into three parts, with each part consisting of m/3 bits. In order to simplify related expressions, we denote m/3 as k. Then,
(8)
where , , for i = 0,1, 2. Then we multiply A and B using the three-term Karatsuba-like formula and do the following transformation:
(9)
where C2 = A2 + A1, C1 = A2 + A0, C0 = A1 + A0, D2 = B2 + B1, D1 = B2 + B0, D0 = B1 + B0. We divide (9) into two parts,
(10)
and compute each part modulo f(x) independently.

3.1. Computation of S1modf(x)

We first consider the computation of S1 in detail. Note that S1 actually consists of three different parts: A0B0, A1B1, A2B2 (others can be obtained by shift of these parts). When S1 is rewritten as a matrix-vector form, we have
(11)
For simplicity, we do not write the labels of the product matrix here, which indicate the degree of xi in S1. Note that these degrees are in the range [−2k, 2m − 2k − 2]. In the above expression, b0, b1, b2 represent the coefficient vectors of B0, B1, B2, respectively. 0k×k is a k × k zero matrix, Ai,L (i = 0,1, 2) are k × k lower-triangular Toeplitz matrices, and Ai,H (i = 0,1, 2) are k × k upper-triangular Toeplitz matrices. Please note that the matrix on the right side actually contains 6k = 6 · m/3 = 2m rows and the product matrix in fact contains 2m − 1 rows. However, the last row of the above matrix is 0, which does not affect the result. These submatrices have the following form:
(12)
for i = 0,1, 2. It is easy to check that the products S1 contain the terms of degrees out of the range [−k, mk − 1]; we have to perform the reduction operation for the product matrix in (21). According to Mastrovito scheme, the reduction can be regarded as the construction of product matrices from A using the reduction rule in (6). Denoted by MA, the Mastrovito matrix is related to S1. Then, we investigate the construction details for this matrix MA. We have the following proposition.

Proposition 2. The Mastrovito matrix MA can be constructed as

(13)
where
(14)

Proof. The proof is analogous with the proof of observation 3.1 in [21]. Note that the product matrix A contains 2m − 1 nonzero rows (the last row A(2m, :) is a zero vector), each of which corresponds to the polynomial degree from −2k to 2m − 2k − 2. It is easy to check that the first k rows and the last mk − 1 rows correspond to the degrees that are out of the range [−k, mk − 1]. Thus, we need to reduce these rows.

According to the reduction rule in (6), we have to reduce {−2k, −2k + 1, …, −k − 1} by adding them to the row {−k, …, −1} and {m − 2k, …, mk − 1} and reduce the rows {mk, …, 2m − 2k − 2} by adding them to the row {0, …, mk − 2} and {−k, …, m − 2k − 2}. Obviously, the first k row here is [A0,L, 0k×k, 0k×k] and the last mk − 1 rows constitute

(15)
We compare the line number and obtain the result immediately.

Based on Proposition 2, we can compute S1 as follows:
(16)
By swapping and combining some overlapped entries, expression (16) now can be rewritten as
(17)
We just compute two submatrix-vector multiplications and add them up to obtain S1. Some tricks can apply to save more logic gates. We mainly utilized the computation strategy presented in [7] and fully considered the overlapped parts of the two above matrices. The computation can be divided into two steps:
  • (i)

    Perform row-vector products:

    (18)

  • in parallel. The symbol “∗” represents only row-vector product related to Ai,L (or Ai,H) and bi, i = 0,1, 2. For example, A0,Hb0 represents computing the inner product [A0,H(i, 1) · b0, …, A0,H(i, k) · bk−1], for i = 1,2, …, k in parallel.

  • (ii)

    Sum up all the 2m entries of each row using binary XOR tree. Specially, consider some products of each row are zero; we compute the following summations:

    (19)

  • using binary XOR tree firstly and then add these results together.

Remarks 3. It is easy to see that the row-vector products (18) contain all the possible row-vector products in (17). In addition, A0,H, A1,L, A1,H, and A2,L are all triangular matrices; one can easily check that each row of both [A0,H, A1,L] and [A1,H, A2,L] consists of at most k nonzero entries. After the computation of (18) and (19), certain number of XOR gates is required to obtain the final result. Table 1 summarizes the space and time complexity of S1 for all the steps.

Table 1. Space and time complexities of S1modf(x).
Operation #AND #XOR Time delay
Inner products in (18) 3k2 - TA
Partial addition in (19) -  3k2 − 4k + 1 ⌈log2kTX
S1modf(x) - 4k − 1   2TX

3.2. Computation of S2modf(x)

Then we consider the computation of S2modf(x) in detail. Since S2 = (C2D2xk + C1D1 + C0D0xk) and Ci, Di(i = 0,1, 2) consist of k bits, we can follow similar line as the computation of S1 to obtain the result. More explicitly, we rewrite S2 in matrix-vector form:
(20)
Here, Ci,L (i = 0,1, 2) are k × k lower-triangular Toeplitz matrices and Ci,H (i = 0,1, 2) are k × k upper-triangular Toeplitz matrices, which are constructed from the coefficients of C0, C1, C2 and are similar to Ai,L and Ai,H. Vectors d0, d1, d2 represent the coefficient vectors of D0, D1, D2.
The reduction of S2 modulo f(x) is relatively simpler: we only need to eliminate the last k rows by adding them to the lines labeled with {−k, …, −2} and {0, …, k − 1}. Thus, we have
(21)
Analogous with the computation of S1modf(x), we first perform row-vector products:
(22)
in parallel. Then, we compute the following summations:
(23)
using binary XOR tree firstly, and then add related results together. Please note that each row of [C0,Hd0, C1,Ld1] and [C1,Hd1, C2,Ld2] consists of at most k nonzero entries. We can calculate C0,H · d0 + C1,L · d1 and C1,H · d1 + C2,L · d2 in ⌈logkTX. Finally, we have to add all these summations to obtain the result. It costs 2k − 2 more XOR gates with one TX delay. Related space and time complexities for the computation of S2modf(x) are summarized in Table 2.
Table 2. Space and time complexities of S2modf(x).
Operation #AND #XOR Time delay
C0, C1, C2 - 3k TX
D0, D1, D2 - 3k
Inner products in (22) 3k2 - TA
Partial addition in (23) -  3k2 − 4k + 1 ⌈log2kTX
S2modf(x) - 2k − 2 TX
From Tables 1 and 2, it is clear that the computations of S1, S2 modulo f(x) have the same time delay. So they can be implemented in parallel. Finally, another m XOR gates are needed to add the two results together, which also requires one TX delay. As a consequence, the total space and time complexity of proposed architecture are
(24)
Furthermore, if m = 2n + c where c is smaller relatively to 2n−1, we have ⌈log2(8m/3)⌉ = 1 + ⌈log2m⌉. In this case, the time delay of our architecture becomes TA + (1 + ⌈log2mTX), which is almost equal to the delay of the fastest bit-parallel multipliers [9].

4. Theoretic Comparison

Table 3 gives a comparison of different implementation methods of bit-parallel multipliers in the fields generated by trinomials xm + xm/3 + 1. From Table 3, we can see that our multiplier requires about 2/3 circuit gates compared with the previous architectures without using divide-and-conquer algorithm. On the other hand, the time complexity of the proposed multiplier is TA + ⌈log2(8m/3)⌉TX, which is very close to the fastest result. In fact, we have checked this type of trinomials with degree m = 3 · 7i, i = 1,2, …, 1000, and found that there are 585 such trinomials reaching the bound TA + (1 + ⌈log2mTX) (others require only one more TX).

Table 3. Comparison of bit-parallel multipliers for GF(2m) generated with xm + xm/3 + 1.
Multiplier #AND #XOR Time delay
Sunar and Koç [3] m2 m2 − 1 TA + (2 + ⌈log2m⌉)TX
Wu [5] m2 m2 − 1 TA + (2 + ⌈log2m⌉)TX
Wu [6] m2 m2 − 1 TA + (2 + ⌈log2m⌉)TX
Fan and Dai [4] m2 m2 − 1
Elia et al. [8] TA + (3 + ⌈log2m⌉)TX
Negre [7] m2
Fan [22] Type-A
Fan [22] Type-B
TA + ⌈log2(2m − 1)⌉TX
Li et al. [21]
This paper
  • Note. 2v−1 < m/3 ≤ 2v and W(∗) is the hamming weight of the number.

In Table 4, we give a small example of field GF(2147) defined by x147 + x49 + 1. It shows that, compared with other approaches, our architecture may be the best choice if the space and time complexity are both considered. In addition, compared with the fastest Karatsuba multiplier for general trinomials [21], it is argued that the space and time complexities can be reduced even further if special KA and irreducible polynomial are combined together.

Table 4. Complexity for practical field GF(2147).
Basis #AND #XOR Time
PB [3, 5] 21609 21608 TA + 10TX
PB [8] 16280 16838 TA + 11TX
SPB [4] 21609 21608 TA + 9TX
SPB [9] 21609 21608 TA + 8TX
SPB [7] 21609 27391 TA + 8TX
PB-CRT Type A [22] 21560 21560 TA + 9TX
PB-CRT Type B [22] 21560 21658 TA + 9TX
SPB [21] 16280 17394 TA + 9TX
SPB (this paper) 14406 14748 TA + 9TX

5. Conclusion

In this paper, a new Mastrovito multiplier architecture for trinomial of the form xm + xm/3 + 1 is proposed. We show that the space and time complexity of our former Mastrovito-Karatsuba multiplier can be further reduced for special form of trinomial combined with a KA variation. This multiplier can be used in some area-critical occasions because it has low space complexity but maintains a relatively low time delay. To find more polynomials which can use the proposed strategy will be the future work.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the Natural Science Foundation of China (nos. 61402393, 61601396) and Shanghai Key Laboratory of Integrated Administration Technologies for Information Security (no. AGK201607).

      The full text of this article hosted at iucr.org is unavailable due to technical difficulties.