N Point DCT VLSI Architecture for Emerging HEVC Standard
Abstract
This work presents a flexible VLSI architecture to compute the N-point DCT. Since HEVC supports different block sizes for the computation of the DCT, from 4 × 4 up to 32 × 32, a flexible architecture that supports all of them helps reduce the area overhead of hardware implementations. The proposed hardware is partially folded to save area and to achieve the throughput required by large video sequence sizes. The architecture relies on the decomposition of the DCT matrices into sparse submatrices in order to reduce the number of multiplications. Multiplications are then completely eliminated using the lifting scheme. The proposed architecture sustains real-time processing of 1080p HD video when running at 150 MHz.
1. Introduction
As technology evolves, hardware keeps shrinking while storage capacity increases. High-end video applications have become very demanding in our daily activities, for example, watching movies, video conferencing, and creating and saving videos with high-definition video cameras. A single device can now support all the multimedia applications that seemed like a dream before, for example, high-end mobile phones and smartphones. As a consequence, new highly efficient video coders are of paramount importance. However, high efficiency comes at the expense of computational complexity. As pointed out in [1, 2], several blocks of video codecs, including the transform stage [3], motion estimation, and entropy coding [4], are responsible for this high complexity. As an example, the discrete cosine transform (DCT), which is used in several standards for image and video compression, is a computation-intensive operation: its direct implementation requires a large number of additions and multiplications.
HEVC, the new and soon-to-be-released video coding standard, targets highly efficient video coding. One of the tools employed to improve coding efficiency is the DCT with different transform sizes; as an example, the 16-point DCT of HEVC is shown in [5]. In video compression, the DCT is widely used because it compacts the image energy at the low frequencies, making it easy to discard the high-frequency components. To meet the requirement of real-time processing, hardware implementations of the 2-D DCT/inverse DCT (IDCT) are adopted, for example, [6]. The 2-D DCT/IDCT can be implemented with the 1-D DCT/IDCT and a transpose memory in a row-column decomposition manner. In the direct implementation of the DCT, floating-point multiplications have to be handled, which causes precision problems in hardware. Hence, we propose a Walsh-Hadamard transform-based DCT implementation [7]. Then, inspired by the DCT factorizations proposed in [8, 9], we factorize the remaining rotations into simpler steps through the lifting scheme [10]. The resulting lifting-scheme-based architecture, inspired by [11–13], is simplified by exploiting the techniques proposed in [9, 14] to achieve a multiplierless implementation. Other techniques can be employed to achieve multiplierless solutions, such as the ones proposed in [8, 15–18], but they are not discussed in this work. The proposed multisize DCT architecture supports all the block sizes of HEVC and targets the real-time processing of 1080p HD video sequences.
The rest of this paper is organized as follows. Section 2 provides a review of the 2-D DCT. Section 3 shows the matrix decompositions for the different DCT sizes. Section 4 presents the proposed hardware architecture. The VLSI implementation and the simulation results are presented in Section 5. Finally, Section 6 concludes the paper.
2. Review of 2D Transform
The Walsh matrix W_N is obtained from the Hadamard matrix H_N in two steps (illustrated in the sketch below):
- (1) bit reverse the order of the rows of H_N;
- (2) apply Gray coding to the row indices.
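The following Python sketch is only an illustration of these two reordering steps: it builds the natural-order Hadamard matrix and applies the bit-reversal and Gray-code reorderings; the resulting rows are ordered by sequency (number of sign changes).

```python
# Illustrative sketch: natural-order Hadamard matrix H_N reordered by
# (1) bit reversal and (2) Gray coding of the row indices -> Walsh matrix W_N.
import numpy as np

def hadamard(N):
    """Natural-order (Sylvester) Hadamard matrix; N must be a power of two."""
    H = np.array([[1]])
    while H.shape[0] < N:
        H = np.block([[H, H], [H, -H]])
    return H

def bit_reverse(i, bits):
    return int(format(i, "0{}b".format(bits))[::-1], 2)

def gray(i):
    return i ^ (i >> 1)

def walsh(N):
    bits = N.bit_length() - 1
    H = hadamard(N)
    B = H[[bit_reverse(i, bits) for i in range(N)], :]  # step (1): bit-reverse the row order
    return B[[gray(i) for i in range(N)], :]            # step (2): Gray-code the row indices

# Rows of W_8 are now ordered by sequency (number of sign changes): 0, 1, ..., 7.
W = walsh(8)
print([int(np.sum(W[k, 1:] != W[k, :-1])) for k in range(8)])
```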
3. Matrix Decompositions for Different N
Matrices for the different DCT sizes are derived using the factorization presented in Section 2. In the following subsections, the factorizations for N ranging from 4 to 32 are shown explicitly.
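As a minimal numerical illustration of the decomposition principle (a sketch, not one of the paper's explicit factorizations), the 4-point DCT matrix can be written as a sparse conversion matrix times the 4-point Walsh matrix; the only nontrivial block of the conversion matrix is a scaled Givens rotation by π/8, which is exactly the kind of rotation later handled by the lifting scheme.

```python
# Illustrative sketch: C_4 = T_4 * W_4, where T_4 = C_4 * W_4^T / 4 is sparse
# (block diagonal), so the DCT can be computed as a WHT followed by a few rotations.
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0, :] /= np.sqrt(2.0)
    return C

W4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]])      # sequency-ordered Walsh matrix

C4 = dct_matrix(4)
T4 = C4 @ W4.T / 4                    # conversion matrix (W4 @ W4.T = 4*I)
print(np.round(T4, 4))                # block diagonal; rows/cols {1, 3} form
                                      # 0.5 * a Givens rotation by pi/8
print(np.allclose(C4, T4 @ W4))       # True
```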
3.1. 4 × 4 DCT
3.2. 8 × 8 DCT
3.3. 16 × 16 DCT
3.4. 32 × 32 DCT
4. Proposed Architecture
The complete hardware architecture of the DCT is shown in Figure 1. Each frame is loaded into the input frame memory and divided into N × N blocks. The control unit reads the rows of each block from the input memory and, at the same time, passes a “1” to the input multiplexers; the address and the other control signals are passed to the DCT block. After the DCT computation is complete, the transformed row is sent to the transpose memory along with its corresponding address. In this way, for the first N clock cycles, the rows from the input memory are fed to the DCT and written to the corresponding addresses in the transpose memory. For the next N clock cycles, the control unit passes a “0” to the input multiplexers, so that each column of the transpose memory is fed to the DCT block and the outputs of the DCT block are written back to the transpose memory at the same locations from which they were read. When all the columns have been read and processed by the DCT, the control unit starts reading the next N × N block from the input memory and, at the same time, each row of the transpose memory is written to the output transformed memory. In this way, all the N × N blocks are read, processed, and written to the output transformed memory.
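The behavioural Python sketch below (not RTL; scipy's exact DCT-II stands in for the hardware DCT block, and the function name is illustrative) summarizes this dataflow: N direct cycles write transformed rows into the transpose buffer, N transpose cycles transform the buffer columns in place, and the buffer rows are then moved to the output memory.

```python
# Behavioural sketch of the row-column dataflow described above (not RTL).
import numpy as np
from scipy.fft import dct

def transform_block(block):
    """2-D DCT of one N x N block using the transpose-memory scheme."""
    N = block.shape[0]
    buf = np.zeros((N, N))
    for r in range(N):                                 # N "direct" cycles:
        buf[r, :] = dct(block[r, :], norm="ortho")     # rows -> transpose buffer
    for c in range(N):                                 # N "transpose" cycles:
        buf[:, c] = dct(buf[:, c], norm="ortho")       # columns transformed in place
    return buf                                         # rows then go to the output memory

out = transform_block(np.random.randint(0, 255, (8, 8)).astype(float))
```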

When the last row is processed by the DCT, it is written to the transpose memory. At the same time, the first column of the transpose memory is read in order to be processed by the DCT block. Since the last row has not been written yet, the last element of the first column is not valid, so the “Data0” multiplexer is used for forwarding. In this way, the first output of the last transformed row of an N × N block is forwarded to the DCT input and also written to the transpose memory.
4.1. DCT Block
The DCT block is the main block of the complete architecture. It takes the input data, the corresponding control signals, and the corresponding addresses. The internal architecture of the DCT block is shown in Figure 2.

The DCT block has 4 pipeline stages. The data is first passed through the Hadamard block, which is designed as a fully parallel architecture. The Hadamard block takes 32 data words at its inputs and passes them to butterfly_32, while the first 16 are also fed to butterfly_16, the first 8 to butterfly_8, and the first 4 to butterfly_4. Multiplexers are placed at the inputs of each butterfly size in order to obtain the correct result from the Hadamard block; their select signals are driven by the control unit. The Hadamard block has 32 outputs. To obtain a Walsh transform from a Hadamard one, the bit_reversal and gray_code blocks are placed after the Hadamard block.
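A possible software model of the butterfly network is sketched below: log2(N) stages, each made of N/2 two-input/two-output butterflies, matching the butterfly count used in Section 5. The parallel hardware unrolls these loops; the code is only an illustration.

```python
# Illustrative model of the Hadamard butterfly network (natural order).
def hadamard_transform(x):
    x = list(x)
    N = len(x)                     # 4, 8, 16 or 32
    h = 1
    while h < N:
        for start in range(0, N, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b    # one butterfly: sum and difference
        h *= 2
    return x

print(hadamard_transform([1, 2, 3, 4]))          # [10, -2, -4, 0]
```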
In the bit_reversal block, the data at input port X is moved to output port Y, where Y is obtained by writing the input port number X in binary and reversing the bits. For example, in the case of DCT_16, the Hadamard block produces 16 valid outputs, so the 16 inputs of the bit_reversal block are shuffled according to the bit-reversal rule: X = 0 means X = “0000", whose bit reversal is Y = “0000", so input port 0 is connected to output port 0. Similarly, for X = “0001" the bit reversal is Y = “1000", which means that input port 1 is connected to output port 8. In this way, all the inputs are connected to the outputs according to the bit-reversal rule. Since the architecture supports four different DCT sizes, the bit-reversal rule is different for each size. For example, for DCT_4, the input X = “01" is connected to Y = “10", that is, output port 2, while for DCT_16 the same input port is connected to output port 8. Therefore, multiplexers are placed in the bit_reversal block in order to support all the DCT sizes.
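The size-dependent port mapping can be summarized by the following sketch (illustrative only; the function name is hypothetical):

```python
# Sketch of the bit_reversal port mapping; the number of address bits,
# and hence the mapping, depends on the DCT size.
def bit_reversal_port(x, dct_size):
    bits = dct_size.bit_length() - 1          # 2 bits for DCT_4 ... 5 bits for DCT_32
    return int(format(x, "0{}b".format(bits))[::-1], 2)

print(bit_reversal_port(1, 16))   # 8: for DCT_16, input port 1 feeds output port 8
print(bit_reversal_port(1, 4))    # 2: for DCT_4, the same input feeds output port 2
```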
The gray_code block works on the same principle as the bit_reversal block, but according to the Gray code rule: the output port is determined by applying the Gray code to the input port number. For example, for DCT_32, X = “01101" gives Y = “01011", so input port 13 is connected to output port 11. The Gray code mapping does not depend on the DCT size: for DCT_16, X = “1101" gives Y = “1011", which again connects input port 13 to output port 11, the same as for DCT_32.
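A corresponding sketch of the Gray-code mapping, which is indeed independent of the DCT size:

```python
# Sketch of the gray_code port mapping (size independent).
def gray_code_port(x):
    return x ^ (x >> 1)

print(gray_code_port(0b01101))    # 11: input port 13 feeds output port 11 (0b01011)
```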
The architecture of the memory block (mem. block) is shown in Figure 3. The memory block connects the first 16 inputs directly to the output ports, while the last 16 outputs are multiplexed between the latched inputs and the direct inputs. The last 16 outputs are used in the case of DCT_32, while the last 16 inputs are bypassed in the case of DCT_4, DCT_8, and DCT_16.

The permutation block is implemented using (27), (36), and (46). The block takes 32 inputs and sends 16 of them to the outputs according to the permutation law. The first 16 inputs of the permutation block are passed to the outputs in the first clock cycle, while the last 16 inputs are passed to the outputs in the second clock cycle. For DCT sizes smaller than 32, the selection line of the multiplexers is always set to “0”; for DCT_32, the selection line is “0” in the first clock cycle and “1” in the next clock cycle.
The lifting scheme is implemented using (19), (23), (31), and (40). The lifting block has a folded architecture: the fully parallel lifting block is used for DCT sizes of 4, 8, and 16, while for DCT_32 the block is reused, so each row of DCT_32 takes 2 clock cycles to complete. During the first clock cycle, the upper 16 inputs are processed by the lifting scheme and stored in the memory block. In the next clock cycle, the lower 16 inputs are processed by the lifting block, and the result, along with the previously calculated stored values, is forwarded to the next block. The lifting block is shown in Figure 4.


Each lifting structure takes two values at its inputs. For each Givens rotation, a, b, m, and n are integer approximations chosen so that the approximated DCT closely matches the exact DCT. Since a and b are integers, the multiplications are implemented using adders and shift operations. The result of each lifting structure is quantized to a 16-bit resolution to obtain a reasonable PSNR value, so the final outputs of the lifting block are 16 bits wide. The results of some lifting structures are bypassed using the multiplexers at their outputs. The select line of these multiplexers always remains “0” for DCT_4, DCT_8, and DCT_16, while for DCT_32 it remains “0” for the first clock cycle and “1” for the next clock cycle.
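A possible software model of one 3-lifting-step structure is sketched below. It assumes that the integer coefficients a and b of Table 2 approximate tan(θ/2) and −sin(θ) scaled by 2^m, with a single shift amount per rotation; this scaling convention is an assumption, since the text does not spell it out.

```python
# Illustrative model of one 3-lifting-step Givens rotation. Assumption:
# a/2^m ~ tan(theta/2) and b/2^m ~ -sin(theta); in hardware the products by
# a and b are realised with shifts and additions only.
import math

def lifting_rotation(x1, x2, a, b, m):
    y1 = x1 + ((a * x2) >> m)      # lifting step 1
    y2 = x2 + ((b * y1) >> m)      # lifting step 2
    z1 = y1 + ((a * y2) >> m)      # lifting step 3
    return z1, y2                  # ~ [cos(t)*x1 + sin(t)*x2, -sin(t)*x1 + cos(t)*x2]

# pi/8 rotation with the Table 2 values a = 51, b = -98, m = 8:
print(lifting_rotation(1000, 2000, 51, -98, 8))                 # (1689, 1464)
t = math.pi / 8
print(round(math.cos(t) * 1000 + math.sin(t) * 2000),
      round(-math.sin(t) * 1000 + math.cos(t) * 2000))          # 1689 1465
```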
In the case of DCT_32, the Hadamard block produces 32 results in parallel. The outputs of the Hadamard block pass through the bit reversal and gray code blocks, and the result is fed into the memory block. The memory block passes the upper 16 inputs directly to the permutation block, while the lower 16 inputs are stored in registers. The permutation block forwards the upper 16 inputs to the lifting scheme, with the select lines of the lifting-scheme multiplexers set to “0”, and the results of the lifting scheme are stored in the following memory block. In the next clock cycle, the lower 16 values stored in the first memory block are passed to the lifting scheme through the permutation block, with the selection lines of the lifting-scheme multiplexers set to “1”. The 16 results are calculated and passed to the memory block, which at the same time forwards the previously stored values along with the newly arrived ones.
In the case of DCT_4, DCT_8, and DCT_16, the lower 16 values from the first memory block are invalid and never used, so only the valid upper 16 inputs are fed into the lifting scheme via the permutation network. The selection line of the lifting-scheme multiplexers is always set to “0” for DCT sizes smaller than 32.
The hardware architecture of one square root block is shown in Figure 6. The input is divided by the appropriate factor, and the calculated values for the different DCT sizes are fed into the output multiplexer, where the valid result is sent to the output depending on the DCT size. Finally, the outputs of the square root block are quantized from 16 bits to 13 bits by the Q block.

4.2. Transpose Buffer
The transpose buffer is designed using registers. The buffer is sized for the maximum DCT size, that is, DCT_32, so it is N × N × B bits, where N = 32 and B = 13 is the width of each data word; a total of 13 kbits of memory is therefore used to implement the transpose buffer. The inputs of the buffer are the clock, the reset, the transpose signal, the row number, the column number, the read enable signal, and the write enable signal. During the direct cycles, the rows coming from the input frame memory are transformed by the DCT block, and the results are stored in the corresponding rows of the transpose buffer. When all the rows of a block have been transformed and written to the transpose buffer, the columns of the buffer are read and transformed by the DCT block, and the results are written back to the buffer in a transposed fashion, that is, column by column. When all the columns of the buffer have been read, transformed, and written back, the rows of the next block are read from the input frame memory and, at the same time, the rows of the transpose buffer are written to the output frame memory. In this way, the complete frame is transformed and written to the output memory.
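A behavioural register-array model of this buffer could look as follows (an illustrative sketch with a hypothetical class name, not the actual RTL):

```python
# Behavioural model of the 32 x 32 x 13-bit transpose buffer built from registers.
import numpy as np

class TransposeBuffer:
    def __init__(self, n=32):
        self.regs = np.zeros((n, n), dtype=np.int32)   # 32 * 32 * 13 = 13 kbits of registers

    def write_row(self, row, data):      # direct cycle: store one transformed row
        self.regs[row, :len(data)] = data

    def read_col(self, col):             # transpose cycle: feed one column to the DCT
        return self.regs[:, col].copy()

    def write_col(self, col, data):      # transpose cycle: write the result back in place
        self.regs[:len(data), col] = data

    def read_row(self, row):             # readout to the output frame memory
        return self.regs[row, :].copy()
```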
4.3. Input MUXes Block
During the direct cycles, the DCT transforms the rows of the block and the results are stored in the transpose buffer, so the select signal of the input MUXes block is set to “1”. After N clock cycles, where N is the DCT size, the select signal of the input MUXes is set to “0”, so that the columns of the transpose buffer are fed into the DCT block. The input MUXes block therefore switches the inputs of the DCT block between the direct and transpose cycles.
4.4. Data0 Multiplexer
During the direct cycles, the rows from the input memory are transformed and the results are written to the transpose buffer. When the last row from the input memory is being transformed, the first column is read from the transpose buffer in the next clock cycle. At this point, the last element of the first column is not valid, because the last transformed row has not yet been written to the memory. The data0 multiplexer is therefore used for forwarding: the first of the 32 transformed data words is selected by the data0 multiplexer. The select line of the data0 multiplexer is set to “1” for just one clock cycle during the transformation of an N × N block, that is, when the first column is read from the transpose buffer and the last row is being transformed by the DCT block; otherwise, the select signal is always “0”.
4.5. Control Unit
The control unit controls the activities of all the blocks in each clock cycle and is responsible for the correct sequence of operations. It is designed using 4 memories, where each memory contains the control signals for one DCT size, and 4 counters, where each counter generates the addresses of its corresponding memory. In response to the addresses, the memories output the control signals. The outputs of the memories are multiplexed, and the selection line of the multiplexer decides which of them is passed to the output. The hardware architecture of the control unit is shown in Figure 7. MEM_CU_4 is 8 × 128 = 1 kbit, MEM_CU_8 is 16 × 128 = 2 kbits, MEM_CU_16 is 32 × 128 = 4 kbits, and MEM_CU_32 is 128 × 128 = 16 kbits. Each memory contains N control words for the direct cycle and N control words for the transpose cycle, where N is the DCT size, except MEM_CU_32, which contains 2(N + N) control words, because each row or column takes two clock cycles to complete for DCT_32. The control unit thus generates a 128-bit wide control word for all the functional blocks of the complete DCT in each clock cycle.
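The memory sizes quoted above can be cross-checked with a short sketch (the word counts and the 128-bit control-word width are taken from the text; the 36-kbit total matches Table 7 once the 13-kbit transpose buffer is included):

```python
# Sketch: control-memory sizing implied by the text (one 128-bit control word
# per cycle; DCT_32 needs two cycles per row/column).
CONTROL_WORD_BITS = 128

def cu_memory_bits(n):
    words = 2 * (n + n) if n == 32 else n + n    # direct + transpose cycles
    return words * CONTROL_WORD_BITS

for n in (4, 8, 16, 32):
    print("MEM_CU_%d: %d kbits" % (n, cu_memory_bits(n) // 1024))   # 1, 2, 4, 16

transpose_buffer_bits = 32 * 32 * 13
total = sum(cu_memory_bits(n) for n in (4, 8, 16, 32)) + transpose_buffer_bits
print(total // 1024, "kbits")    # 36 kbits, in line with Table 7
```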

5. Results
The proposed N-point DCT architecture requires the following resources (the counts are evaluated in the sketch below):
- (1) (N/2) · log2(N) 2-input/2-output butterflies for the N-point WHT;
- (2) 1 + (N/2) · [log2(N) − 2] 2-input/2-output lifting-based (3-lifting-step) structures for the Givens rotations.
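A small sketch evaluating these two counts for the supported sizes (illustrative only):

```python
# Sketch: evaluate the two counts above for the DCT sizes supported by HEVC.
from math import log2

for N in (4, 8, 16, 32):
    butterflies = (N // 2) * int(log2(N))              # (1): WHT butterflies
    liftings = 1 + (N // 2) * (int(log2(N)) - 2)       # (2): 3-lifting-step structures
    print(N, butterflies, liftings)
# For N = 32 this gives 80 butterflies, i.e. the 160 Hadamard adders used below.
```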
a | Adders for a | b | Adders for b |
---|---|---|---|
51 | 2 | −98 | 2 |
101 | 3 | −569 | 2 |
311 | 3 | −200 | 3 |
25 | 2 | −50 | 2 |
152 | 2 | −297 | 3 |
64 | 0 | −121 | 2 |
183 | 3 | −325 | 3 |
25 | 2 | −50 | 2 |
19 | 2 | −38 | 2 |
63 | 1 | −124 | 1 |
178 | 3 | −345 | 4 |
115 | 3 | −219 | 3 |
71 | 2 | −132 | 1 |
169 | 3 | −305 | 3 |
99 | 3 | −172 | 3 |
According to [14], multiplications such as those by a1,3 = 51 and b1,3 = −98 can be implemented with a minimum number of additions by resorting to the n-dimensional reduced adder graph (RAG-n) technique. The total number of adders required for all the coefficients is shown in Table 1.
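For instance, one possible shift-and-add decomposition that meets the two-adder budgets of Table 1 for these two coefficients is sketched below (the actual RAG-n networks used in the design may differ; the minus sign of b = −98 is absorbed as a subtraction in the lifting step):

```python
# Illustrative shift-and-add realisations of x*51 and x*98 with two adders each.
def mul_51(x):
    t = (x << 1) + x             # x * 3            (adder 1)
    return (t << 4) + t          # x * 51 = 3 * 17  (adder 2)

def mul_98(x):
    return (x << 6) + (x << 5) + (x << 1)   # 98 = 64 + 32 + 2 (adders 1 and 2)

print(mul_51(7), 7 * 51)   # 357 357
print(mul_98(7), 7 * 98)   # 686 686
```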
The DCT block contains 2 × (N/2) × log2(N) = 160 adders (for N = 32) to implement the Hadamard block. Using Tables 1 and 2, the number of adders used to implement the rest of the DCT can be calculated. The lifting stage for the π/8 rotation requires 3 × 8 = 24 adders for the lifting structures themselves and a further 6 × 8 = 48 adders for the steps involving a1,3 and b1,3. The lifting stage for the π/16 rotation requires 3 × 4 = 12 adders for the lifting structures and a further 8 × 4 = 32 adders for the steps involving a1,4 and b1,4. The lifting stage for the 3π/16 rotation requires 3 × 4 = 12 adders for the lifting structures and a further 9 × 4 = 36 adders for the steps involving a3,4 and b3,4. The lifting stage for the π/32 rotation requires 3 × 2 = 6 adders for the lifting structures and a further 6 × 2 = 12 adders for the steps involving a1,5 and b1,5. The lifting stage for the 3π/32 rotation requires 3 × 2 = 6 adders for the lifting structures and a further 7 × 2 = 14 adders for the steps involving a3,5 and b3,5.
Givens rotations | m, n | a | b |
---|---|---|---|
π/8 | 8 | 51 | −98 |
π/16 | 10 | 101 | −569 |
3π/16 | 10 | 311 | −200 |
π/32 | 9 | 25 | −50 |
3π/32 | 10 | 152 | −297 |
5π/32 | 8 | 64 | −121 |
7π/32 | 9 | 183 | −325 |
π/64 | 10 | 25 | −50 |
3π/64 | 8 | 19 | −38 |
5π/64 | 9 | 63 | −124 |
7π/64 | 10 | 178 | −345 |
9π/64 | 9 | 115 | −219 |
11π/64 | 8 | 71 | −132 |
13π/64 | 9 | 169 | −305 |
15π/64 | 8 | 99 | −172 |
The lifting stage for the 5π/32 rotation requires 3 × 2 = 6 adders for the lifting structures and a further 2 × 2 = 4 adders for the steps involving a5,5 and b5,5. The lifting stage for the 7π/32 rotation requires 3 × 2 = 6 adders for the lifting structures and a further 9 × 2 = 18 adders for the steps involving a7,5 and b7,5. The lifting stage for the π/64 rotation requires 3 adders for the lifting structure and a further 6 adders for the steps involving a1,6 and b1,6. The lifting stage for the 3π/64 rotation requires 3 adders for the lifting structure and a further 6 adders for the steps involving a3,6 and b3,6. The lifting stage for the 5π/64 rotation requires 3 adders for the lifting structure and a further 3 adders for the steps involving a5,6 and b5,6. The lifting stage for the 7π/64 rotation requires 3 adders for the lifting structure and a further 10 adders for the steps involving a7,6 and b7,6. The lifting stage for the 9π/64 rotation requires 3 adders for the lifting structure and a further 9 adders for the steps involving a9,6 and b9,6. The lifting stage for the 11π/64 rotation requires 3 adders for the lifting structure and a further 5 adders for the steps involving a11,6 and b11,6. The lifting stage for the 13π/64 rotation requires 3 adders for the lifting structure and a further 9 adders for the steps involving a13,6 and b13,6. The lifting stage for the 15π/64 rotation requires 3 adders for the lifting structure and a further 9 adders for the steps involving a15,6 and b15,6. The square root block contains 2 adders, and there are 32 square root blocks, so 64 adders calculate square roots in parallel. The total number of adders required for the Hadamard and lifting-scheme blocks is 160 + 72 + 44 + 48 + 18 + 20 + 10 + 24 + 9 + 9 + 6 + 13 + 12 + 15 + 12 + 12 + 64 = 548.

Sequences (PSNR in dB) | N = 4, Y | N = 4, U | N = 4, V | N = 8, Y | N = 8, U | N = 8, V | N = 16, Y | N = 16, U | N = 16, V | N = 32, Y | N = 32, U | N = 32, V |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BQ terrace_1920 × 1080_60 | 47.1 | 44.4 | 44.1 | 48.8 | 45.9 | 44.1 | 48.9 | 45.4 | 45.3 | 48.8 | 45.7 | 44.1 |
BQ square_416 × 240_60 | 47.4 | 45.1 | 45.2 | 48.1 | 45.2 | 45.2 | 48.1 | 45.7 | 44.1 | 48.2 | 45.2 | 44.7 |
BQ mall_832 × 480_60 | 46.9 | 44.1 | 44.5 | 47.9 | 44.9 | 45.5 | 48.7 | 44.6 | 44.9 | 48.4 | 45.7 | 45.3 |
Basketball drive_1920 × 1080_50 | 47.8 | 44.5 | 44.1 | 48.2 | 44.8 | 45.7 | 48.5 | 45.1 | 45.7 | 48.3 | 45.1 | 45.3 |
Basketball drill_832 × 480_50 | 47.0 | 45.4 | 45.6 | 48.6 | 45.0 | 44.8 | 48.2 | 45.9 | 45.0 | 48.7 | 45.4 | 45.5 |
Tables 4, 5, and 6 show the number of multiplications, additions, and shifts required to calculate the different DCT sizes. The proposed architecture uses no multipliers, since all the multiplications are implemented using shifts and additions. As can be observed, the number of additions required to compute the 32-point DCT with the proposed architecture is lower than in the direct DCT implementation and in the other designs considered.
N | M | A | S |
---|---|---|---|
4 | 0 | 17 | 5 |
8 | 0 | 74 | 39 |
16 | 0 | 232 | 132 |
32 | 0 | 548 | 249 |
- *N is the DCT size.
- *M is the number of multiplications.
- *A is the number of additions.
- *S is the number of shifts.
Design | M | A | S |
---|---|---|---|
Original | 1024 | 992 | 0 |
Proposed | 0 | 548 | 249 |
Design | M | A | S |
---|---|---|---|
[5] | 0 | 242 | 58 |
Proposed | 0 | 232 | 132 |
- *N is the DCT size.
- *M is the number of multiplications.
- *A is the number of additions.
- *S is the number of shifts.
The design is described in VHDL, and Synopsys Design Vision is used for synthesis. The code is synthesized on a 90 nm standard-cell library at a clock frequency of 150 MHz. Table 7 shows the synthesis results.
Parameter | Value |
---|---|
Technology | 90 nm |
Frequency | 150 MHz |
Area | 0.42 mm² |
Power | 884.1 μW |
Memory | 36 kbits |
6. Conclusion
In this work, a flexible N-point DCT architecture for HEVC has been proposed. A partially folded architecture is adopted to maintain speed and to save area. The DCT supports 4, 8, 16, and 32 points. The simulation results show that the PSNR is very close to 50 dB, which is reasonably good. Multiplications are removed from the architecture by introducing the lifting scheme and approximating the coefficients.