A reconfigurable and compact subpipelined architecture for AES encryption and decryption

Li, Ke; Li, Hua; Mund, Graeme

doi:10.1186/s13634-022-00963-3

Research
Open access
Published: 09 January 2023

EURASIP Journal on Advances in Signal Processing volume 2023, Article number: 5 (2023)

Abstract

AES has been used in many applications to provide data confidentiality. A new 32-bit reconfigurable and compact architecture for AES encryption and decryption is presented and implemented in a non-BRAM FPGA in this paper. It can be reconfigured for different key sizes, which makes it flexible for users applying AES in various application environments. The proposed design employs a single-round architecture and subpipelining to minimize the hardware cost. The fully composite field GF((2⁴)²)-based encryption/decryption and keyschedule lead to lower hardware complexity and efficient subpipelining for the 32-bit data path. In addition, a new subpipelined on-the-fly keyschedule over composite field GF((2⁴)²) is proposed for all standard key sizes (128-, 192-, 256-bit), which generates the roundkeys simultaneously and efficiently. This feature is useful when the main key changes, since AES is a symmetric-key cipher and the session key usually changes frequently. The proposed reconfigurable and compact design has higher throughput and lower hardware cost. It achieves throughputs of 375 Mbits/s with a 128-bit key, 318 Mbits/s with a 192-bit key and 275 Mbits/s with a 256-bit key on VIRTEX XC4VSX25-12, and the total number of slices is 1766. The proposed reconfigurable and compact AES architecture can be efficiently applied in computing-restricted environments such as wireless and embedded devices.

1 Introduction

Advanced Encryption Standard (AES), based on the Rijndael encryption algorithm, has been used to replace DES in security services [1,2,3]. Hardware AES implementations are attractive because they provide better throughput as well as higher physical security. Compared with Application-Specific Integrated Circuits (ASICs), field programmable gate arrays (FPGAs) have become increasingly popular because of their scalability, re-programmability, and fast development.

Numerous FPGA [4,5,6,7,8,9,10] and ASIC [7, 11, 12] implementations of the AES have been presented and evaluated. Other AES implementations have also been proposed, such as GPU-based [13], Multicore Processor-based [14], and Rapid Single-Flux-Quantum Circuits-based [15] implementations. Fully unrolled schemes [6, 8] can achieve high throughput, but they incur much higher area and energy costs, which makes them suitable only for high-end applications. Another approach is to implement only a single-round unit and reuse the same unit in different rounds.

In this paper, a compact and reconfigurable design of AES with low hardware cost and adequate throughput is proposed and implemented in a non-BRAM FPGA. This design applies a 32-bit single-round unit, which costs much less hardware area than the 128-bit fully unrolled schemes. In order to reduce the hardware complexity further, we convert the arithmetic operations of AES from field GF(2⁸) to field GF((2⁴)²). Unlike the previous designs in [6, 8, 12, 16] where partial-composite field AES is applied, we conduct the entire AES operations in GF((2⁴)²) to minimize the overhead of isomorphic mapping functions. In our design, only two forward mapping functions and one backward mapping function are used. In addition, subpipelining is applied to improve the throughput/area ratio.

The standard announced by NIST [2] specifies AES as a block cipher with a 128-bit block size and 128-, 192-, and 256-bit key sizes. These three key sizes are specified for various security levels. The capability to deal with all key sizes makes reconfigurability an important feature of AES implementations. The previous work of [6, 8, 12, 17,18,19,20] applied an on-the-fly key generator to support instant key changing. The design in [8] used a subpipelined keyschedule, but it only supported the 128-bit key size. When a subpipelined on-the-fly keyschedule is employed in an AES implementation, the stages in the keyschedule must be synchronized with the stages in the cipher, because they share the same clock. In this design, we propose a subpipelined on-the-fly keyschedule over field GF((2⁴)²), which supports all three key sizes.

The issue of secure communication in computing-restricted environments, such as personal digital assistants (PDAs), wireless devices, and many other embedded devices, has become more important recently. In order to apply AES in these devices, the AES implementations must be cost efficient. The objective of this research work is to design a reconfigurable and compact AES architecture which can be applied to computing-restricted devices. The proposed architecture can be reconfigured to three different AES key sizes. This is useful when users change the main key and also change the key size for different security levels, because AES is a symmetric-key cipher and the session key usually changes frequently. We also propose a subpipelined on-the-fly keyschedule for the three key sizes, which allows the proposed architecture to be easily implemented on a non-BRAM FPGA.

The remainder of the paper is organized as follows. In Sect. 2, the AES algorithm is introduced. The proposed compact and reconfigurable AES architecture is presented in Sect. 3. Implementation and performance are included in Sect. 4. Sections 5 and 6 present the conclusion and future work.

2 AES algorithm

AES is a symmetric block cipher with a block size of 128 bits and three key sizes (128, 192, or 256 bits) [1,2,3]. The AES parameters depend on the key size (Table 1; the word size is 32 bits). AES runs iteratively on four transformations (inv-/Subbytes, inv-/ShiftRows, inv-/MixColumns and addroundkey) with different sequences in encryption and decryption. Figure 1 illustrates the basic architecture of AES. In the initial round (r = 0), only addroundkey is performed; in the final round (r = Nr), inv-/MixColumns is skipped. The keyschedule module expands the cipherkey to (Nr + 1) × 4 words of roundkeys. Each round applies a unique 128-bit roundkey in the addroundkey operation [1, 2].

Table 1 AES parameters [2]
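
As a hedged illustration, the parameters can be captured in a small lookup. The values of Nk, Nb and Nr below are the ones given in FIPS-197 [2]; the names `AES_PARAMS` and `roundkey_words` are illustrative and do not come from the paper.

```python
# AES parameters from FIPS-197 (one word = 32 bits).
# Nk: key length in words, Nb: block size in words, Nr: number of rounds.
AES_PARAMS = {
    128: {"Nk": 4, "Nb": 4, "Nr": 10},
    192: {"Nk": 6, "Nb": 4, "Nr": 12},
    256: {"Nk": 8, "Nb": 4, "Nr": 14},
}


def roundkey_words(key_bits: int) -> int:
    """Number of 32-bit roundkey words the keyschedule produces: (Nr + 1) * 4."""
    return (AES_PARAMS[key_bits]["Nr"] + 1) * 4


for bits in (128, 192, 256):
    print(bits, AES_PARAMS[bits], roundkey_words(bits))
```

The word counts follow directly from the (Nr + 1) × 4 expression in the text.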

2.1 Subbytes

Subbytes, also called S-Box, is the only non-linear transformation in AES. The S-Box is a 16 × 16 matrix containing all 256 possible 8-bit values, which is used to perform a non-linear byte-by-byte substitution of the state.

Considering a byte {x₇x₆x₅x₄x₃x₂x₁x₀}, Subbytes transformation has two steps [1]:

(i)
{x’₇ x’₆ x’₅ x’₄ x’₃ x’₂ x’₁ x’₀} is its multiplicative inverse in the GF(2⁸) field, modulo the irreducible polynomial m(x) = x⁸ + x⁴ + x³ + x + 1; the multiplicative inverse of {00000000} is defined as itself;
(ii)
An affine transformation over GF(2) is conducted on the inverse of {x₇x₆x₅x₄x₃x₂x₁x₀} (Eq. 1 [1]).
$$\left[ {\begin{array}{*{20}c} {y_{0} } \\ {y_{1} } \\ {y_{2} } \\ {y_{3} } \\ {y_{4} } \\ {y_{5} } \\ {y_{6} } \\ {y_{7} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\ \end{array} } \right]\;\;\left[ {\begin{array}{*{20}c} {x^{\prime}_{0} } \\ {x_{1}^{^{\prime}} } \\ {x_{2}^{^{\prime}} } \\ {x_{3}^{^{\prime}} } \\ {x_{4}^{^{\prime}} } \\ {x_{5}^{^{\prime}} } \\ {x_{6}^{^{\prime}} } \\ {x_{7}^{^{\prime}} } \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \\ \end{array} } \right]$$
(1)
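
The two steps can be sketched in software as follows. This is a plain GF(2⁸) reference model, not the composite-field hardware proposed later; the function names are illustrative, and the inverse is computed as a²⁵⁴, which is one simple choice among several.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf256_inv(a: int) -> int:
    """Step (i): multiplicative inverse as a^254; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a)
    return result


def affine(x: int) -> int:
    """Step (ii), Eq. 1: y_i = x_i ^ x_(i+4) ^ x_(i+5) ^ x_(i+6) ^ x_(i+7) ^ c_i, c = 0x63."""
    y = 0
    for i in range(8):
        bit = (0x63 >> i) & 1
        for k in (0, 4, 5, 6, 7):
            bit ^= (x >> ((i + k) % 8)) & 1
        y |= bit << i
    return y


def sbox(x: int) -> int:
    """Subbytes of one byte: inversion followed by the affine transformation."""
    return affine(gf256_inv(x))


print(hex(sbox(0x00)), hex(sbox(0x53)))
```

The row pattern of the matrix in Eq. 1 is what the index set (0, 4, 5, 6, 7) encodes, with indices taken modulo 8.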

2.2 ShiftRows

This transformation circularly shifts each row of the state to the left on encryption. As in Fig. 2 [1], the top row of the state is denoted row(0), and the bottom row is denoted row(3). ShiftRows performs an i-byte circular left shift on row(i) (i = 0, 1, 2, 3).
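
A minimal sketch of this rule, assuming the state is held as a list of four rows (`state[r][c]` is the byte in row r, column c); the names are illustrative.

```python
def shift_rows(state):
    """Rotate row(i) of the state left by i bytes, for i = 0, 1, 2, 3."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Fill the state with byte indices in column-major order, as in FIPS-197.
state = [[4 * c + r for c in range(4)] for r in range(4)]
for row in shift_rows(state):
    print(row)
```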

2.3 MixColumns

This transformation treats each column of the state as a four-term polynomial over GF(2⁸) and transforms each column to a new one by multiplying it with a constant polynomial a(x) = {03}x³ + {01}x² + {01}x + {02} modulo x⁴ + 1. Equation 2 [1] is the matrix form of MixColumns.

$$\left[ {\begin{array}{*{20}c} {02} & {03} & {01} & {01} \\ {01} & {02} & {03} & {01} \\ {01} & {01} & {02} & {03} \\ {03} & {01} & {01} & {02} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {s_{0,0} } & {s_{0,1} } & {s_{0,2} } & {s_{0,3} } \\ {s_{1,0} } & {s_{1,1} } & {s_{1,2} } & {s_{1,3} } \\ {s_{2,0} } & {s_{2,1} } & {s_{2,2} } & {s_{2,3} } \\ {s_{3,0} } & {s_{3,1} } & {s_{3,2} } & {s_{3,3} } \\ \end{array} } \right]$$

(2)
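
The matrix product in Eq. 2 can be sketched per column as below. The helper names are illustrative; multiplication by {02} is done with the usual shift-and-reduce step and {03} is formed as {02} ⊕ {01}.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def mix_column(col):
    """Apply the MixColumns matrix of Eq. 2 to one 4-byte column."""
    s0, s1, s2, s3 = col

    def mul2(v):
        return xtime(v)

    def mul3(v):
        return xtime(v) ^ v

    return [
        mul2(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ mul2(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ mul2(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ mul2(s3),
    ]


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Because each output byte depends on all four bytes of a column, a 32-bit data path that delivers one full column per clock suits this operation well.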

2.4 Addroundkey

The addroundkey is a simple logical XOR of the current state with a roundkey which is generated by the keyschedule.

2.5 Keyschedule

Keyschedule derives roundkeys from the cipherkey. It consists of key expansion and roundkey selections. Figure 3 [2] shows the keyschedule algorithm which generates roundkeys for AES-128, AES-192 and AES-256. The functions used in keyschedule are the following [1, 2]:

Rotword: One-byte circular left shift on a word.
Subword: Using S-Box to perform a byte substitution on each byte.
XOR with Rcon: XORing with a round constant Rcon[j], Rcon[j] = (RC[j], 0, 0, 0) with RC[1] = 1, RC[j] = 2 · RC[j − 1].
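
The three functions combine into the key expansion of FIPS-197 [2]. The sketch below is a software reference model in GF(2⁸), not the subpipelined composite-field keyschedule proposed in this paper; all function names are illustrative, and the S-Box is rebuilt from the inverse and affine steps so that the block is self-contained.

```python
def xtime(b: int) -> int:
    """Multiply by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gf256_mul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def sbox_byte(x: int) -> int:
    """S-Box entry: inverse (as x^254, 0 -> 0) followed by the affine map of Eq. 1."""
    inv = 1
    for _ in range(254):
        inv = gf256_mul(inv, x)
    y = 0
    for i in range(8):
        bit = (0x63 >> i) & 1
        for k in (0, 4, 5, 6, 7):
            bit ^= (inv >> ((i + k) % 8)) & 1
        y |= bit << i
    return y


SBOX = [sbox_byte(x) for x in range(256)]


def rot_word(w: int) -> int:
    """Rotword: one-byte circular left shift on a 32-bit word."""
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)


def sub_word(w: int) -> int:
    """Subword: S-Box substitution on each byte of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def round_constants(n: int):
    """RC[1] = 1, RC[j] = 2 * RC[j - 1] in GF(2^8)."""
    rc, out = 1, []
    for _ in range(n):
        out.append(rc)
        rc = xtime(rc)
    return out


def expand_key(key: bytes):
    """Expand a 16-, 24- or 32-byte cipherkey into (Nr + 1) * 4 roundkey words."""
    nk = len(key) // 4
    nr = nk + 6
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rc = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rc << 24)  # Rcon = (RC, 0, 0, 0)
            rc = xtime(rc)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)  # extra Subword used only for 256-bit keys
        w.append(w[i - nk] ^ temp)
    return w


words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(len(words), hex(words[4]))
```

The example key is the AES-128 key used in the FIPS-197 key expansion appendix.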

3 32-bit subpipelined reconfigurable and compact architecture for AES

In this section, the 32-bit reconfigurable and compact AES architecture is proposed. In our design, the data path is 32 bits wide. That is, each operation (for example, the S-Box) is applied four times to process the 128 bits of one plaintext block. The subpipelined on-the-fly keyschedule for different key sizes is also presented to provide the roundkeys simultaneously and efficiently. In addition, the equivalent cipher [1] is adopted so that encryption and decryption follow the same data flow and share the reusable units.

3.1 32-bit single-round unit

The unrolled (unfolded) architecture is widely used to achieve high throughput. It conducts multiple rounds on one block by implementing more than one round unit in hardware. The more round units the architecture includes, the higher the hardware cost. The alternative scheme, called the single-round unit architecture, can be applied to reduce the hardware complexity. Instead of unfolding all the round units in the device, it implements a single-round unit, which costs approximately 1/N_r of the area of the unfolded scheme.

We propose a 32-bit single-round unit for a compact AES architecture. It iterates four times to perform a round on a 128-bit block, processing 32 bits per iteration.

3.2 Full composite field architecture with keyschedule

Many high-end FPGA devices possess Block RAMs (BRAMs), which are efficient for the implementation of the S-Box. The S-Box, also referred to as Subbytes, is an important and complicated operation in both the encryptor/decryptor and keyschedule modules. However, these BRAM-based designs cannot be implemented in low-cost devices which do not have BRAMs. An alternative approach for S-Box implementation is to use combinational logic, but this method may lead to high hardware complexity because of the mathematical operations in AES over the finite field GF(2⁸).

The key step of S-Box is calculating multiplicative inverse of each byte. Since the introduction of composite field GF((2⁴)²), the calculation of multiplicative inverse over GF((2⁴)²) has been investigated [6, 8, 12, 16]. The architectures in [5, 18, 21] applied the field GF((2⁴)²) to affine transformation in S-Box. By decomposing these operations from GF(2⁸) to its subfield GF(2⁴), the hardware complexity of S-Box can be decreased dramatically.

In Fig. 4a, each round needs an isomorphic mapping function (MAP) from GF(2⁸) to GF((2⁴)²) before the S-Box and the inverse mapping (MAP⁻¹) afterwards. If the key size is 128 bits, the S-Box is applied 10 times to the plaintext and 10 times to the cipherkey, which means that 20 MAPs and 20 MAP⁻¹s are needed for the encryption of 128-bit data. In order to save the cost of MAP and MAP⁻¹, we propose a new 32-bit complete composite field approach (Fig. 4b). The GF((2⁴)²) field is used in all transformations in the encryptor/decryptor and keyschedule. As illustrated in Fig. 4b, one MAP and one MAP⁻¹ are applied in encryption, and one MAP is applied in the keyschedule. This is a constant overhead which is not affected by the number of rounds.

We use the composite field defined by Wolkerstorfer et al. [21]. There are two irreducible polynomials (Eqs. 3 and 4) involved in multiplication and inversion in GF((2⁴)²).

$$n\left( x \right) = x^{2} + \left\{ 1 \right\}x + \left\{ e \right\}, \left\{ e \right\} \;{\text{denotes }}\;\left\{ {1110} \right\}$$

(3)

$$m\left( x \right) = x^{4} + x + 1$$

(4)

The irreducible polynomial for the field GF(2⁸) in AES is:

$$m\left( x \right) = x^{8} + x^{4} + x^{3} + x + 1$$

(5)

The isomorphic mapping functions between field GF(2⁸) and field GF((2⁴)²) are determined by the irreducible polynomials of field GF(2⁸) (Eq. 5) and field GF((2⁴)²) (Eqs. 3 and 4). We use the following mapping formulas in [21] to convert the representations between GF(2⁸) and GF((2⁴)²).

$$\begin{gathered} a_{h} x + a_{l} = {\text{MAP}}\left( a \right), a_{h} ,a_{l} \in {\text{GF}}\left( {2^{4} } \right), a \in {\text{GF}}\left( {2^{8} } \right) \hfill \\ a_{A} = a_{1} \oplus a_{7} , a_{B} = a_{5} \oplus a_{7} , a_{C} = a_{4} \oplus a_{6} \hfill \\ a_{l0} = a_{C} \oplus a_{0} \oplus a_{5} , a_{l1} = a_{1} \oplus a_{2} , a_{l2} = a_{A} , a_{l3} = a_{2} \oplus a_{4} \hfill \\ a_{h0} = a_{C} \oplus a_{5} , a_{h1} = a_{A} \oplus a_{C} , a_{h2} = a_{B} \oplus a_{2} \oplus a_{3} , a_{h3} = a_{B} \hfill \\ \end{gathered}$$

(6)

In Eq. 6, a is an element in field GF(2⁸). MAP(a) converts a to its isomorphic element in GF((2⁴)²), which is represented as a_hx + a_l.

$$\begin{gathered} a = {\text{MAP}}^{ - 1} \left( {a_{h} x + a_{l} } \right), a \in {\text{GF}}\left( {2^{8} } \right), a_{h} ,a_{l} \in {\text{GF}}\left( {2^{4} } \right) \hfill \\ a_{A} = a_{l1} \oplus a_{h3} , a_{B} = a_{h0} \oplus a_{h1} \hfill \\ a_{0} = a_{l0} \oplus a_{h0} , a_{1} = a_{B} \oplus a_{h3} , a_{2} = a_{A} \oplus a_{B} \hfill \\ a_{3} = a_{B} \oplus a_{l1} \oplus a_{h2} , a_{4} = a_{A} \oplus a_{B} \oplus a_{l3} , a_{5} = a_{B} \oplus a_{l2} \hfill \\ a_{6} = a_{A} \oplus a_{l2} \oplus a_{l3} \oplus a_{h0} , a_{7} = a_{B} \oplus a_{l2} \oplus a_{h3} \hfill \\ \end{gathered}$$

(7)

In Eq. 7, a_hx + a_l is an element in field GF((2⁴)²). MAP⁻¹(a_hx + a_l) converts a_hx + a_l to its isomorphic element in GF(2⁸), which is represented as a.
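
Eqs. 6 and 7 translate directly into bit-level code. The sketch below assumes that a byte in GF((2⁴)²) is packed with a_l in the low nibble and a_h in the high nibble (the same ordering as p₀…p₇ in the later derivation); the function names are illustrative. It only checks that the two mappings are mutual inverses and XOR-linear, not the full field isomorphism.

```python
def _bits(v: int):
    """Bits of a byte, least significant first: [v0, v1, ..., v7]."""
    return [(v >> i) & 1 for i in range(8)]


def _pack(bits) -> int:
    return sum(b << i for i, b in enumerate(bits))


def map_to_composite(a: int) -> int:
    """MAP of Eq. 6: GF(2^8) -> GF((2^4)^2), result packed as (a_h << 4) | a_l."""
    a0, a1, a2, a3, a4, a5, a6, a7 = _bits(a)
    aA, aB, aC = a1 ^ a7, a5 ^ a7, a4 ^ a6
    al = [aC ^ a0 ^ a5, a1 ^ a2, aA, a2 ^ a4]
    ah = [aC ^ a5, aA ^ aC, aB ^ a2 ^ a3, aB]
    return _pack(al + ah)


def map_from_composite(p: int) -> int:
    """MAP^-1 of Eq. 7: GF((2^4)^2) -> GF(2^8)."""
    l0, l1, l2, l3, h0, h1, h2, h3 = _bits(p)
    aA, aB = l1 ^ h3, h0 ^ h1
    return _pack([
        l0 ^ h0,
        aB ^ h3,
        aA ^ aB,
        aB ^ l1 ^ h2,
        aA ^ aB ^ l3,
        aB ^ l2,
        aA ^ l2 ^ l3 ^ h0,
        aB ^ l2 ^ h3,
    ])


round_trip_ok = all(map_from_composite(map_to_composite(a)) == a for a in range(256))
print(round_trip_ok)
```

Both mappings are networks of XOR gates only, which is why applying them once at the input and once at the output is cheap compared with repeating them in every round.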

3.3 Subpipelined encryptor/decryptor and keyschedule

Pipelining is applied in the design to optimize the speed/area ratio of AES. By inserting registers into the combinational logic, multiple blocks of hardware can run simultaneously. The frequency of the design is determined by the maximum delay between registers. We reduce the maximum delay and increase the frequency by optimizing the balance between stages. A 32-bit single-round subpipelined architecture in full composite field is proposed, where one round unit is implemented and subpipelined into eight substages. To generate the roundkeys synchronously, we present an on-the-fly keyschedule. The encryption/decryption unit and the key expansion unit share the same clock, so the clock frequency is determined by the maximum delay in both units. This makes the balance of substages in the keyschedule as important as in the encryptor/decryptor. We propose a new subpipelined keyschedule on composite field for all standard key sizes. The most costly part of the keyschedule is still the S-Box. We divide it into the same substages as in the encryptor/decryptor.

3.4 Double-block subpipelined architecture

The proposed architecture for the encryptor is illustrated in Fig. 5. Decryption can be easily implemented by the equivalent cipher [1]. The eight 32-bit registers (four in ShiftRows, three in Subbytes and one between Subbytes and MixColumns) are used to cut one round unit into eight substages, which leads to an initial delay of eight clock cycles before the first 32-bit ciphertext word is generated. The clk counter is a clock register counter generated in the keyschedule. It is used to synchronize the encryptor/decryptor and the keyschedule. We use a double-block (block A and B) data flow in our subpipelined architecture.

Figure 5a illustrates the subpipelining in ShiftRows operation, and Fig. 5b shows the subpipelining in Subbytes operation. We can see that the mappings from GF(2⁸) to GF((2⁴)²) are only required once after the inputs of plaintext and cipherkey. The inverse mapping ( GF((2⁴)²) to GF(2⁸)) is applied to the final output in order to get the cipher text.

The 3-to-1 multiplexer (“mul”) is controlled by the clk counter:

Case a In the initial round, where 0 ≤ clk counter < 8, the 128-bit plaintext is mapped (MAP) into GF((2⁴)²) and XORed with the corresponding roundkey in four clock cycles, 32 bits per clock. The result is the outcome of the initial round (r = 0), which is the input of the next round;
Case b In normal rounds, where 8 ≤ clk counter < Nr × 8, the output of MixColumns is XORed with the corresponding roundkey;
Case c In the last round, where Nr × 8 ≤ clk counter < (Nr + 1) × 8, the output of Subbytes is XORed with the corresponding roundkey, and the ciphertext is obtained.
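
The selection rule of the three cases can be written as a small behavioural model. This is only a sketch of the counter ranges stated above; the function name and the returned labels are illustrative and are not signal names from the design.

```python
def mux_select(clk_counter: int, nr: int) -> str:
    """Return which input the 3-to-1 multiplexer forwards to addroundkey.

    nr is the number of rounds Nr (10, 12 or 14 depending on the key size).
    """
    if 0 <= clk_counter < 8:
        return "plaintext"      # Case a: initial round
    if 8 <= clk_counter < nr * 8:
        return "mixcolumns"     # Case b: normal rounds
    if nr * 8 <= clk_counter < (nr + 1) * 8:
        return "subbytes"       # Case c: last round, MixColumns is skipped
    raise ValueError("clk counter outside the range of one encryption")


print([mux_select(c, 10) for c in (0, 7, 8, 79, 80, 87)])
```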

The detailed operations of ShiftRows, Subbytes and MixColumns are presented in the following.

3.4.1 ShiftRows

We use our proposed ShiftRows operation [22] in the design. It includes sixteen 8-bit registers and three 2-to-1 multiplexers. The block of data is shifted column by column. Two blocks of data are processed in the pipeline.

Our ShiftRows operation is designed in a column fashion (Fig. 6). In the architecture, the data (32 bits) are shifted column by column instead of row by row. Each column is composed of four shift registers, and each register has 8 bits. Transforming the ShiftRows operation into a column-wise operation simplifies the design of the MixColumns operation, since all the data in one column are required by MixColumns.

The ShiftRows procedure for encryption is as follows.

(1)
First row No shift. We just let the data flow through.
(2)
Second row Circular left shift operation. In this case, we connect the output of register R1C2 and the output of R1C3 to a multiplexer in order to select the output.
(3)
Third row Switch data. Switch the data between first element and third element, second element and fourth element in the row. The outputs of R2C1 and R2C3 are connected to a Multiplexer.
(4)
Fourth row Circular right shift operation. Similar to the case of second row, we connect the output of register R3C0 and the output of R3C3 to a Multiplexer.

Similarly, we can derive the procedures for Inverse ShiftRows (Inv-ShiftRows) operations:

(1)
First row No shift.
(2)
Second row Circular right shift operation. We connect the output of register R1C0 and the output of R1C3 to a Multiplexer.
(3)
Third row Switch Data. Same as the operation in ShiftRows of encryption.
(4)
Fourth row Circular left shift operation. We connect the output of register R3C2 and the output of R3C3 to a multiplexer in order to select the output.

The multiplexers are controlled by some clock counters and the encryption/decryption signals.
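
The row rules above can be checked against the standard definition of ShiftRows with a short behavioural model. It does not model the registers R*C* or the multiplexer timing; the function names are illustrative. It also shows the column view: output column c takes its row-r byte from input column (c + r) mod 4.

```python
def rotl(row, n):
    """Circular left rotation by n positions (a negative n rotates right)."""
    n %= len(row)
    return row[n:] + row[:n]


def swap_halves(row):
    """Switch first/third and second/fourth elements (third-row rule)."""
    return [row[2], row[3], row[0], row[1]]


def shift_rows_rules(state):
    """Encryption: no shift, left shift, switch, right shift."""
    r0, r1, r2, r3 = state
    return [list(r0), rotl(r1, 1), swap_halves(r2), rotl(r3, -1)]


def inv_shift_rows_rules(state):
    """Decryption: no shift, right shift, switch, left shift."""
    r0, r1, r2, r3 = state
    return [list(r0), rotl(r1, -1), swap_halves(r2), rotl(r3, 1)]


def shift_rows_reference(state):
    """Standard ShiftRows: row(i) rotated left by i bytes."""
    return [rotl(row, r) for r, row in enumerate(state)]


def output_column(state, c):
    """Column c after ShiftRows, read directly from the unshifted state."""
    return [state[r][(c + r) % 4] for r in range(4)]


state = [[4 * c + r for c in range(4)] for r in range(4)]
shifted = shift_rows_rules(state)
print(shifted == shift_rows_reference(state))
```

A circular right shift by one byte equals a left shift by three, and the switch equals a left shift by two, which is why the four row rules reproduce the standard transformation.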

3.4.2 Subpipelined Subbytes

The key step of Subbytes is the calculation of the multiplicative inverse. Figure 7 illustrates the architecture of Subbytes proposed in [8]. It uses multiplication in GF(2⁴)² three times. It also needs one inversion (x⁻¹), one constant multiplier with {e} (× e, {e} is in hexadecimal notation, which is ‘1110’ in binary notation), one squarer and two 4-bit XORs ( ⊕).

We proposed a 32-bit subpipelined compact S-Box architecture in the composite field GF(2⁴)² with balanced substages and efficient performance [23]. Considering x, y, z ∈ GF(2⁴), x, y and z are represented in binary notation where $x=\left\{{x}_{3}{x}_{2}{x}_{1}{x}_{0}\right\}, y=\left\{{y}_{3}{y}_{2}{y}_{1}{y}_{0}\right\},z=\left\{{z}_{3}{z}_{2}{z}_{1}{z}_{0}\right\}$. Let a, b, c, d, e and f be 1-bit values, each equal to 0 or 1. ⊕ stands for the XOR operation, and x₀y₁ means x₀∧y₁. Equations 8, 9, 10 and 11 [21] are used to calculate squaring, constant multiplication with {e}, multiplication and multiplicative inversion.

$${\varvec{y}} = {\varvec{x}}^{2} :y_{0} = x_{0} \oplus x_{2} , y_{1} = x_{2} , y_{2} = x_{1} \oplus x_{3} , y_{3} = x_{3 }$$

(8)

$${\varvec{y}} = {\varvec{x}} \times \left\{ {\varvec{e}} \right\}:a = x_{0} \oplus x_{1} , b = x_{2} \oplus x_{3} , y_{0} = x_{1} \oplus b, y_{1} = a, y_{2} = a \oplus x_{2} , y_{3} = a \oplus b$$

(9)

$$\begin{gathered} {\varvec{z}} = {\varvec{x}} \times {\varvec{y}}: a = x_{0} \oplus x_{3} , b = x_{2} \oplus x_{3} , c = x_{1} \oplus x_{2} \hfill \\ z_{0} = \left( {x_{0} y_{0} } \right) \oplus \left( {x_{3} y_{1} } \right) \oplus \left( {x_{2} y_{2} } \right) \oplus \left( {x_{1} y_{3} } \right), z_{1} = \left( {x_{1} y_{0} } \right) \oplus \left( {ay_{1} } \right) \oplus \left( {by_{2} } \right) \oplus \left( {cy_{3} } \right) \hfill \\ z_{2} = \left( {x_{2} y_{0} } \right) \oplus \left( {x_{1} y_{1} } \right) \oplus \left( {ay_{2} } \right) \oplus \left( {by_{3} } \right), z_{3} = \left( {x_{3} y_{0} } \right) \oplus \left( {x_{2} y_{1} } \right) \oplus \left( {x_{1} y_{2} } \right) \oplus \left( {ay_{3} } \right) \hfill \\ \end{gathered}$$

(10)

$$\begin{gathered} {\varvec{y}} = {\varvec{x}}^{ - 1} : a = x_{1} \oplus x_{2} \oplus x_{3} \oplus \left( {x_{1} x_{2} x_{3} } \right), y_{0} = a \oplus x_{0} \oplus \left( {x_{0} x_{2} } \right) \oplus \left( {x_{1} x_{2} } \right) \oplus \left( {x_{0} x_{1} x_{2} } \right) \hfill \\ y_{1} = \left( {x_{0} x_{1} } \right) \oplus \left( {x_{0} x_{2} } \right) \oplus \left( {x_{1} x_{2} } \right) \oplus x_{3} \oplus \left( {x_{1} x_{3} } \right) \oplus \left( {x_{0} x_{1} x_{3} } \right) \hfill \\ y_{2} = \left( {x_{0} x_{1} } \right) \oplus x_{2} \oplus \left( {x_{0} x_{2} } \right) \oplus x_{3} \oplus \left( {x_{0} x_{3} } \right) \oplus \left( {x_{0} x_{2} x_{3} } \right) \hfill \\ y_{3} = a \oplus \left( {x_{0} x_{3} } \right) \oplus \left( {x_{1} x_{3} } \right) \oplus x_{2} x_{3} \hfill \\ \end{gathered}$$

(11)
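
Equations 8 to 11 can be exercised with a bit-level model of GF(2⁴) under m(x) = x⁴ + x + 1 (Eq. 4). The sketch below transcribes the four formulas as written; the function names are illustrative.

```python
def _b4(x: int):
    """Bits of a nibble, least significant first: [x0, x1, x2, x3]."""
    return [(x >> i) & 1 for i in range(4)]


def _p4(bits) -> int:
    return sum(b << i for i, b in enumerate(bits))


def gf16_square(x: int) -> int:
    """Eq. 8: y = x^2."""
    x0, x1, x2, x3 = _b4(x)
    return _p4([x0 ^ x2, x2, x1 ^ x3, x3])


def gf16_mul_e(x: int) -> int:
    """Eq. 9: y = x * {e}."""
    x0, x1, x2, x3 = _b4(x)
    a, b = x0 ^ x1, x2 ^ x3
    return _p4([x1 ^ b, a, a ^ x2, a ^ b])


def gf16_mul(x: int, y: int) -> int:
    """Eq. 10: z = x * y."""
    x0, x1, x2, x3 = _b4(x)
    y0, y1, y2, y3 = _b4(y)
    a, b, c = x0 ^ x3, x2 ^ x3, x1 ^ x2
    z0 = (x0 & y0) ^ (x3 & y1) ^ (x2 & y2) ^ (x1 & y3)
    z1 = (x1 & y0) ^ (a & y1) ^ (b & y2) ^ (c & y3)
    z2 = (x2 & y0) ^ (x1 & y1) ^ (a & y2) ^ (b & y3)
    z3 = (x3 & y0) ^ (x2 & y1) ^ (x1 & y2) ^ (a & y3)
    return _p4([z0, z1, z2, z3])


def gf16_inv(x: int) -> int:
    """Eq. 11: y = x^-1 (0 maps to 0)."""
    x0, x1, x2, x3 = _b4(x)
    a = x1 ^ x2 ^ x3 ^ (x1 & x2 & x3)
    y0 = a ^ x0 ^ (x0 & x2) ^ (x1 & x2) ^ (x0 & x1 & x2)
    y1 = (x0 & x1) ^ (x0 & x2) ^ (x1 & x2) ^ x3 ^ (x1 & x3) ^ (x0 & x1 & x3)
    y2 = (x0 & x1) ^ x2 ^ (x0 & x2) ^ x3 ^ (x0 & x3) ^ (x0 & x2 & x3)
    y3 = a ^ (x0 & x3) ^ (x1 & x3) ^ (x2 & x3)
    return _p4([y0, y1, y2, y3])


print([gf16_inv(x) for x in range(16)])
```

The four formulas are mutually consistent: squaring agrees with x × x, the constant multiplier agrees with x × {e}, and x × x⁻¹ = 1 for every non-zero x.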

In our design which is illustrated in Fig. 5, Subbytes should be cut into four substages. The key to an efficient subpipelining technology is to balance the delays of these substages.

We derive a new Eq. 12 from Eq. 11 to reduce the delay caused by x⁻¹.

Equation 12 is derived in three steps:

(1)
In Eq. 11, replace “a” by its expression:
$$y_{0} = x_{1} \oplus x_{2} \oplus x_{3} \oplus \left( {x_{1} x_{2} x_{3} } \right) \oplus x_{0} \oplus \left( {x_{0} x_{2} } \right) \oplus \left( {x_{1} x_{2} } \right) \oplus \left( {x_{0} x_{1} x_{2} } \right)$$
$$y_{1} = \left( {x_{0} x_{1} } \right) \oplus \left( {x_{0} x_{2} } \right) \oplus \left( {x_{1} x_{2} } \right) \oplus x_{3} \oplus \left( {x_{1} x_{3} } \right) \oplus \left( {x_{0} x_{1} x_{3} } \right)$$
$$y_{2} = \left( {x_{0} x_{1} } \right) \oplus x_{2} \oplus \left( {x_{0} x_{2} } \right) \oplus x_{3} \oplus \left( {x_{0} x_{3} } \right) \oplus \left( {x_{0} x_{2} x_{3} } \right)$$
$$y_{3} = x_{1} \oplus x_{2} \oplus x_{3} \oplus \left( {x_{1} x_{2} x_{3} } \right) \oplus \left( {x_{0} x_{3} } \right) \oplus \left( {x_{1} x_{3} } \right) \oplus x_{2} x_{3}$$
(2)
The expressions in step 1 can be equally changed to:
$$y_{0} = x_{1} \oplus x_{2} \oplus \left( {x_{1} x_{2} } \right) \oplus \left( {x_{0} x_{2} } \right) \oplus \left( {x_{0} \oplus x_{3} } \right)\left( {1 \oplus \left( {x_{1} x_{2} } \right)} \right)$$
$$y_{1} = \left( {x_{0} x_{1} } \right) \oplus \left( {x_{0} x_{2} } \right) \oplus \left( {x_{1} x_{2} } \right) \oplus x_{3} \left( {1 \oplus x_{1} \oplus \left( {x_{0} x_{1} } \right)} \right)$$
$$y_{2} = \left( {x_{0} x_{1} } \right) \oplus x_{2} \oplus \left( {x_{0} x_{2} } \right) \oplus x_{3} \left( {1 \oplus x_{0} \oplus \left( {x_{0} x_{2} } \right)} \right)$$
$$y_{3} = x_{1} \oplus x_{2} \oplus x_{3} \left( {1 \oplus x_{0} \oplus x_{1} \oplus x_{2} \oplus \left( {x_{1} x_{2} } \right)} \right)$$
(3)
Let $a = x_{1} x_{2} , b = x_{0} x_{2} , c = x_{0} x_{1} , d = x_{1} \oplus x_{2} , e = 1 \oplus a, f = b \oplus c$, we have:
$$\begin{gathered} y_{0} = a \oplus b \oplus d \oplus \left( {\left( {x_{0} \oplus x_{3} } \right)e} \right) \hfill \\ y_{1} = a \oplus f \oplus x_{3} \left( {1 \oplus x_{1} \oplus c} \right) \hfill \\ y_{2} = f \oplus x_{2} \oplus x_{3} \left( {1 \oplus x_{0} \oplus b} \right) \hfill \\ y_{3} = d \oplus x_{3} \left( {e \oplus x_{0} \oplus d} \right) \hfill \\ \end{gathered}$$
(12)
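
Since the derivation only regroups terms, Eq. 12 must give the same result as Eq. 11 for all sixteen inputs. The following sketch checks this exhaustively; the function names are illustrative.

```python
def _b4(x: int):
    return [(x >> i) & 1 for i in range(4)]


def _p4(bits) -> int:
    return sum(b << i for i, b in enumerate(bits))


def inv_eq11(x: int) -> int:
    """Inversion in GF(2^4) as written in Eq. 11."""
    x0, x1, x2, x3 = _b4(x)
    a = x1 ^ x2 ^ x3 ^ (x1 & x2 & x3)
    y0 = a ^ x0 ^ (x0 & x2) ^ (x1 & x2) ^ (x0 & x1 & x2)
    y1 = (x0 & x1) ^ (x0 & x2) ^ (x1 & x2) ^ x3 ^ (x1 & x3) ^ (x0 & x1 & x3)
    y2 = (x0 & x1) ^ x2 ^ (x0 & x2) ^ x3 ^ (x0 & x3) ^ (x0 & x2 & x3)
    y3 = a ^ (x0 & x3) ^ (x1 & x3) ^ (x2 & x3)
    return _p4([y0, y1, y2, y3])


def inv_eq12(x: int) -> int:
    """Inversion in GF(2^4) using the shared terms a..f of Eq. 12."""
    x0, x1, x2, x3 = _b4(x)
    a, b, c = x1 & x2, x0 & x2, x0 & x1
    d, e, f = x1 ^ x2, 1 ^ a, b ^ c
    y0 = a ^ b ^ d ^ ((x0 ^ x3) & e)
    y1 = a ^ f ^ (x3 & (1 ^ x1 ^ c))
    y2 = f ^ x2 ^ (x3 & (1 ^ x0 ^ b))
    y3 = d ^ (x3 & (e ^ x0 ^ d))
    return _p4([y0, y1, y2, y3])


print(all(inv_eq11(x) == inv_eq12(x) for x in range(16)))
```

Eq. 12 needs no three-input AND terms: every output is built from two-input products of x₃ (or x₀ ⊕ x₃) with a short XOR expression, which is the source of the shorter critical path.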

According to Eq. 12, we design the logic circuit illustrated in Fig. 8 to perform x⁻¹ over GF(2⁴)². Besides multiplicative inversion, the other operations in Fig. 7 are the three multiplications (×₁, ×₂ and ×₃). In order to decrease the maximum delay caused by multiplication, we separate each multiplication into two steps and put each step in a different substage. The register between the substages stores the result of the first step of the multiplication and passes it to the second step. We decompose these three multipliers in two different ways (AB-type and MN-type) to achieve the best balance.

AB-type The AB-type multiplication is based on Eq. 13, which is derived from Eq. 10. Step A calculates the values of all the binomials; Step B XORs every four values to generate z₀, z₁, z₂ and z₃. A register is inserted between Step A and Step B to store p₀, p₁, …, p₁₅. The multiplication “×₁” in Fig. 7 is separated into ×_1A and ×_1B in Fig. 8;

$$z = x \times y \left( {\text{AB-type}} \right)$$

Step A:

$$a = x_{0} \oplus x_{3} , b = x_{2} \oplus x_{3} , c = x_{1} \oplus x_{2} , p_{0} = x_{0} y_{0} , p_{1} = x_{3} y_{1} , p_{2} = x_{2} y_{2} , p_{3} = x_{1} y_{3}$$

$$p_4 = x_{1} y_{0} , p_5 = ay_{1} , p_6 = by_{2} , p_{7} = cy_{3} , p_{8} = x_{2} y_{0} , p_{9} = x_{1} y_{1} , p_{10} = ay_{2} , p_{11} = by_{3}$$

$$p_{12} = x_{3} y_{0} , p_{13} = x_{2} y_{1} , p_{14} = x_{1} y_{2} , p_{15} = ay_{3}$$

Step B:

$$\begin{gathered} z_{0} = p_{0} \oplus p_{1} \oplus p_{2} \oplus p_{3} , z_{1} = p_{4} \oplus p_{5} \oplus p_{6} \oplus p_{7} \hfill \\ z_{2} = p_{8} \oplus p_{9} \oplus p_{10} \oplus p_{11} , z_{3} = p_{12} \oplus p_{13} \oplus p_{14} \oplus p_{15} \hfill \\ \end{gathered}$$

(13)

MN-type The MN-type multiplication is based on Eq. 14, which is also derived from Eq. 10. Step M creates the values of a, b and c; Step N implements the rest of Eq. 10. A register is inserted between Step M and Step N to store a, b and c. The multiplications “×₂” and “×₃” in Fig. 7 are separated into ×_2M and ×_2N, and ×_3M and ×_3N, in Fig. 8.

$$z = x \times y \left( {\text{MN-type}} \right)$$

Step M:

$$a = x_{0} \oplus x_{3} , b = x_{2} \oplus x_{3} , c = x_{1} \oplus x_{2}$$

Step N:

$$\begin{gathered} z_{0} = x_{0} y_{0} \oplus x_{3} y_{1} \oplus x_{2} y_{2} \oplus x_{1} y_{3} , z_{1} = x_{1} y_{0} \oplus ay_{1} \oplus by_{2} \oplus cy_{3} \hfill \\ z_{2} = x_{2} y_{0} \oplus x_{1} y_{1} \oplus ay_{2} \oplus by_{3} , z_{3} = x_{3} y_{0} \oplus x_{2} y_{1} \oplus x_{1} y_{2} \oplus ay_{3} \hfill \\ \end{gathered}$$

(14)
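
Both decompositions compute the same product; they differ only in where the pipeline register cuts Eq. 10. The sketch below models each step as a separate function, with the value passed between them standing in for the register, and compares both against a direct polynomial multiplication modulo x⁴ + x + 1. The function names are illustrative.

```python
def _b4(x: int):
    return [(x >> i) & 1 for i in range(4)]


def _p4(bits) -> int:
    return sum(b << i for i, b in enumerate(bits))


def step_a(x: int, y: int):
    """AB-type, Step A: the sixteen product terms p0..p15 of Eq. 13."""
    x0, x1, x2, x3 = _b4(x)
    y0, y1, y2, y3 = _b4(y)
    a, b, c = x0 ^ x3, x2 ^ x3, x1 ^ x2
    return [
        x0 & y0, x3 & y1, x2 & y2, x1 & y3,
        x1 & y0, a & y1, b & y2, c & y3,
        x2 & y0, x1 & y1, a & y2, b & y3,
        x3 & y0, x2 & y1, x1 & y2, a & y3,
    ]


def step_b(p) -> int:
    """AB-type, Step B: XOR each group of four registered terms."""
    return _p4([p[4 * i] ^ p[4 * i + 1] ^ p[4 * i + 2] ^ p[4 * i + 3] for i in range(4)])


def step_m(x: int):
    """MN-type, Step M: only a, b and c are computed and registered."""
    x0, x1, x2, x3 = _b4(x)
    return x0 ^ x3, x2 ^ x3, x1 ^ x2


def step_n(x: int, y: int, abc) -> int:
    """MN-type, Step N: the rest of Eq. 10, using the registered a, b, c."""
    x0, x1, x2, x3 = _b4(x)
    y0, y1, y2, y3 = _b4(y)
    a, b, c = abc
    z0 = (x0 & y0) ^ (x3 & y1) ^ (x2 & y2) ^ (x1 & y3)
    z1 = (x1 & y0) ^ (a & y1) ^ (b & y2) ^ (c & y3)
    z2 = (x2 & y0) ^ (x1 & y1) ^ (a & y2) ^ (b & y3)
    z3 = (x3 & y0) ^ (x2 & y1) ^ (x1 & y2) ^ (a & y3)
    return _p4([z0, z1, z2, z3])


def gf16_mul_ref(x: int, y: int) -> int:
    """Reference: carry-less multiply, then reduce modulo x^4 + x + 1."""
    r = 0
    for i in range(4):
        if (y >> i) & 1:
            r ^= x << i
    for i in (6, 5, 4):
        if (r >> i) & 1:
            r ^= 0b10011 << (i - 4)
    return r


pairs = [(x, y) for x in range(16) for y in range(16)]
print(all(step_b(step_a(x, y)) == gf16_mul_ref(x, y) for x, y in pairs))
```

The AB-type cut places most of the logic before the register and stores sixteen bits, while the MN-type cut stores only three bits and leaves most of the logic after the register, which gives two different delay profiles to balance the substages.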

The last operation in Subbytes is the affine transformation. We derive Eq. 21 to perform the affine transformation in $GF\left( {2^{4} } \right)^{2}$ based on Eqs. 1, 6 and 7.

Consider $p \in {\text{GF}}\left( {2^{4} } \right)^{2} , \;q \in {\text{GF}}\left( {2^{8} } \right):p = \left\{ {p_{7} p_{6} p_{5} p_{4} p_{3} p_{2} p_{1} p_{0} } \right\}, q = \left\{ {q_{7} q_{6} q_{5} q_{4} q_{3} q_{2} q_{1} q_{0} } \right\}$

For Eq. 6:

(1)
Replace $a_{A}$, $a_{B}$ and $a_{C}$ with their expressions:
$$a_{l0} = a_{4} \oplus a_{6} \oplus a_{0} \oplus a_{5} , a_{l1} = a_{1} \oplus a_{2} , a_{l2} = a_{1} \oplus a_{7} , a_{l3} = a_{2} \oplus a_{4}$$
$$a_{h0} = a_{4} \oplus a_{6} \oplus a_{5} , a_{h1} = a_{1} \oplus a_{7} \oplus a_{4} \oplus a_{6} , a_{h2} = a_{5} \oplus a_{7} \oplus a_{2} \oplus a_{3} , a_{h3} = a_{5} \oplus a_{7}$$
(2)
Let $p$ replace $a_{h} x + a_{l}$ and $q$ replace $a$; we derive Eq. 15:
$$\begin{gathered} {\varvec{p}} = {\varvec{MAP}}\left( {\varvec{q}} \right), {\varvec{p}} \in {\varvec{GF}}\left( {2^{4} } \right)^{2} , {\varvec{q}} \in {\varvec{GF}}\left( {2^{8} } \right) \hfill \\ p_{0} = q_{0} \oplus q_{4} \oplus q_{5} \oplus q_{6} , p_{1} = q_{1} \oplus q_{2} , p_{2} = q_{1} \oplus q_{7} ,p_{3} = q_{2} \oplus q_{4} \hfill \\ p_{4} = q_{4} \oplus q_{5} \oplus q_{6} , p_{5} = q_{1} \oplus q_{4} \oplus q_{6} \oplus q_{7} , p_{6} = q_{2} \oplus q_{3} \oplus q_{5} \oplus q_{7} , p_{7} = q_{5} \oplus q_{7} \hfill \\ \end{gathered}$$
(15)

Similar steps are applied in Eq. 7. Equation 16 is derived:

$$\begin{gathered} {\varvec{q}} = {\varvec{MAP}}^{ - 1} \left( {\varvec{p}} \right), {\varvec{p}} \in {\varvec{GF}}\left( {2^{4} } \right)^{2} , {\varvec{q}} \in {\varvec{GF}}\left( {2^{8} } \right) \hfill \\ q_{0} = p_{0} \oplus p_{4} , q_{1} = p_{4} \oplus p_{5} \oplus p_{7} , q_{2} = p_{1} \oplus p_{4} \oplus p_{5} \oplus p_{7} ,q_{3} = p_{1} \oplus p_{4} \oplus p_{5} \oplus p_{6} \hfill \\ q_{4} = p_{1} \oplus p_{3} \oplus p_{4} \oplus p_{5} \oplus p_{7} , q_{5} = p_{2} \oplus p_{4} \oplus p_{5} , q_{6} = p_{1} \oplus p_{2} \oplus p_{3} \oplus p_{4} \oplus p_{7} , \hfill \\ q_{7} = p_{2} \oplus p_{4} \oplus p_{5} \oplus p_{7} \hfill \\ \end{gathered}$$

(16)

In the following, we derive Eq. 21 based on Eqs. 1, 15 and 16.

Let $x^{\prime}$ and $y$ be elements of ${\text{GF}}\left( {2^{8} } \right)$: $x^{\prime} = \left\{ {x^{\prime}_{7} x^{\prime}_{6} x^{\prime}_{5} x^{\prime}_{4} x^{\prime}_{3} x^{\prime}_{2} x^{\prime}_{1} x^{\prime}_{0} } \right\}, y = \left\{ {y_{7} y_{6} y_{5} y_{4} y_{3} y_{2} y_{1} y_{0} } \right\}$.

According to Eq. 1:

$$\begin{gathered} y_{0} = x_{0}^{^{\prime}} \oplus x_{4}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus x_{6}^{^{\prime}} \oplus x_{7}^{^{\prime}} \oplus 1, y_{1} = x_{0}^{^{\prime}} \oplus x_{1}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus x_{6}^{^{\prime}} \oplus x_{7}^{^{\prime}} \oplus 1 \hfill \\ y_{2} = x_{0}^{^{\prime}} \oplus x_{1}^{^{\prime}} \oplus x_{2}^{^{\prime}} \oplus x_{6}^{^{\prime}} \oplus x_{7}^{^{\prime}} , y_{3} = x_{0}^{^{\prime}} \oplus x_{1}^{^{\prime}} \oplus x_{2}^{^{\prime}} \oplus x_{3}^{^{\prime}} \oplus x_{7}^{^{\prime}} \hfill \\ y_{4} = x_{0}^{^{\prime}} \oplus x_{1}^{^{\prime}} \oplus x_{2}^{^{\prime}} \oplus x_{3}^{^{\prime}} \oplus x_{4}^{^{\prime}} , y_{5} = x_{1}^{^{\prime}} \oplus x_{2}^{^{\prime}} \oplus x_{3}^{^{\prime}} \oplus x_{4}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus 1 \hfill \\ y_{6} = x_{2}^{^{\prime}} \oplus x_{3}^{^{\prime}} \oplus x_{4}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus x_{6}^{^{\prime}} \oplus 1, y_{7} = x_{3}^{^{\prime}} \oplus x_{4}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus x_{6}^{^{\prime}} \oplus x_{7}^{^{\prime}} \hfill \\ \end{gathered}$$

(17)

We convert $y$ to ${\text{GF}}\left( {2^{4} } \right)^{2}$ and also represent $x^{\prime}$ in ${\text{GF}}\left( {2^{4} } \right)^{2}$ to derive the affine transformation in ${\text{GF}}\left( {2^{4} } \right)^{2} .$

(1)
Let $w$ represent $y$ in $GF\left( {2^{4} } \right)^{2}$. By Eq. 15 (map from $GF\left( {2^{8} } \right)$ to $GF\left( {2^{4} } \right)^{2}$):
$$\begin{gathered} w_{0} = y_{0} \oplus y_{4} \oplus y_{5} \oplus y_{6} , w_{1} = y_{1} \oplus y_{2} , w_{2} = y_{1} \oplus y_{7} ,w_{3} = y_{2} \oplus y_{4} ,w_{4} = y_{4} \oplus y_{5} \oplus y_{6} \hfill \\ w_{5} = y_{1} \oplus y_{4} \oplus y_{6} \oplus y_{7} , w_{6} = y_{2} \oplus y_{3} \oplus y_{5} \oplus y_{7} , w_{7} = y_{5} \oplus y_{7} \hfill \\ \end{gathered}$$
(18)
(2) Let $z$ be the $GF\left( {2^{4} } \right)^{2}$ format of $x^{\prime}.$ From Eq. 16:
$$\begin{gathered} x^{\prime}_{0} = z_{0} \oplus z_{4} , x^{\prime}_{1} = z_{4} \oplus z_{5} \oplus z_{7} , x^{\prime}_{2} = z_{1} \oplus z_{4} \oplus z_{5} \oplus z_{7} ,x^{\prime}_{3} = z_{1} \oplus z_{4} \oplus z_{5} \oplus z_{6} \hfill \\ x^{\prime}_{4} = z_{1} \oplus z_{3} \oplus z_{4} \oplus z_{5} \oplus z_{7} , x^{\prime}_{5} = z_{2} \oplus z_{4} \oplus z_{5} , x^{\prime}_{6} = z_{1} \oplus z_{2} \oplus z_{3} \oplus z_{4} \oplus z_{7} , \hfill \\ x^{\prime}_{7} = z_{2} \oplus z_{4} \oplus z_{5} \oplus z_{7} \hfill \\ \end{gathered}$$
(19)
(3) Substitute Eq. 17 for $y$ in Eq. 18, then replace $x^{\prime}$ with its $GF\left( {2^{4} } \right)^{2}$ format $z$ (Eq. 19):
$$\begin{aligned} w_{0} & = y_{0} \oplus y_{4} \oplus y_{5} \oplus y_{6} \\ & = (x_{0}^{^{\prime}} \oplus x_{4}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus x_{6}^{^{\prime}} \oplus x_{7}^{^{\prime}} \oplus 1) \oplus (x_{0}^{^{\prime}} \oplus x_{1}^{^{\prime}} \oplus x_{2}^{^{\prime}} \oplus x_{3}^{^{\prime}} \oplus x_{4}^{^{\prime}} ) \\ & \quad \quad \oplus \left( {x_{1}^{^{\prime}} \oplus x_{2}^{^{\prime}} \oplus x_{3}^{^{\prime}} \oplus x_{4}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus 1} \right) \\ & \quad \quad \oplus \left( { x_{2}^{^{\prime}} \oplus x_{3}^{^{\prime}} \oplus x_{4}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus x_{6}^{^{\prime}} \oplus 1} \right)\;\;\left( {\text{by Equation 17}} \right) \\ & = x_{2}^{^{\prime}} \oplus x_{3}^{^{\prime}} \oplus x_{5}^{^{\prime}} \oplus x_{7}^{^{\prime}} \oplus 1 = \left( {z_{1} \oplus z_{4} \oplus z_{5} \oplus z_{7} } \right) \\ & \quad \quad \oplus \left( {z_{1} \oplus z_{4} \oplus z_{5} \oplus z_{6} } \right) \oplus \left( { z_{2} \oplus z_{4} \oplus z_{5} } \right) \\ & \quad \quad \oplus \left( { z_{2} \oplus z_{4} \oplus z_{5} \oplus z_{7} } \right) \oplus 1\;\;\left( {\text{by Equation 19}} \right) \\ & = z_{6} \oplus 1 = \left( {z_{6} } \right)^{\prime} \\ \end{aligned}$$

Similarly, we can get:
$$\begin{gathered} w_{1} = (z_{1} \oplus z_{2} \oplus z_{7} )^{\prime}, w_{2} = (z_{0} \oplus z_{5} \oplus z_{6} \oplus z_{3} )^{\prime}, w_{3} = z_{1} \oplus z_{5} \oplus z_{6} \oplus z_{7} \hfill \\ w_{4} = z_{0} \oplus z_{2} \oplus z_{4} \oplus z_{5} \oplus z_{6} \oplus z_{7} , w_{5} = z_{1} \oplus z_{5} \oplus z_{6} , w_{6} = (z_{2} \oplus z_{6} \oplus z_{7} )^{\prime} \hfill \\ w_{7} = (z_{3} \oplus z_{5} )^{\prime} \hfill \\ \end{gathered}$$
(20)
(4) For consistency with the other equations in this paper, we replace $w$ by $y$ and $z$ by $x$ ($x, y \in GF\left( {2^{4} } \right)^{2}$) in Eq. 20 and let $a = x_{5} \oplus x_{6} \oplus x_{7}$, which gives
$$\begin{gathered} {\mathbf{y}}\,{\mathbf{ = }}\,{\mathbf{AFF\_TRAN}}\left( {\mathbf{x}} \right){\mathbf{:}} \hfill \\ a = x_{5} \oplus x_{6} \oplus x_{7} \hfill \\ y_{0} = (x_{6} )^{\prime}, y_{1} = (x_{1} \oplus x_{2} \oplus x_{7} )^{\prime},y_{2} = (x_{0} \oplus x_{3} \oplus x_{5} \oplus x_{6} )^{\prime},y_{3} = x_{1} \oplus a \hfill \\ y_{4} = x_{0} \oplus x_{2} \oplus x_{4} \oplus a, y_{5} = x_{1} \oplus x_{5} \oplus x_{6} ,y_{6} = \left( {x_{2} \oplus x_{6} \oplus x_{7} } \right)^{\prime},y_{7} = \left( {x_{3} \oplus x_{5} } \right)^{\prime} \hfill \\ \end{gathered}$$
(21)
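The derivation of Eq. 21 can be cross-checked numerically. The Python sketch below is illustrative only and is not part of the proposed hardware: it transcribes Eqs. 17, 18, 19 and 21 bit by bit, assuming that bit 0 is the least significant bit, and compares Eq. 21 with the composition of Eqs. 19, 17 and 18 for all 256 inputs. The function names (`bits`, `pack`, `affine_gf28`, `to_composite`, `from_composite`, `aff_tran`, `check_eq21`) are hypothetical labels chosen for this example.

```python
def bits(v):
    """Return [v0, ..., v7], with v0 the least significant bit."""
    return [(v >> i) & 1 for i in range(8)]

def pack(b):
    """Rebuild a byte from its bit list [b0, ..., b7]."""
    return sum(bit << i for i, bit in enumerate(b))

def affine_gf28(v):
    """AES affine transformation in GF(2^8): Eq. 17 written in cyclic form."""
    x, c = bits(v), bits(0x63)
    return pack([x[i] ^ x[(i + 4) % 8] ^ x[(i + 5) % 8]
                 ^ x[(i + 6) % 8] ^ x[(i + 7) % 8] ^ c[i] for i in range(8)])

def to_composite(v):
    """Map from GF(2^8) to GF((2^4)^2), transcribed from Eq. 18."""
    y = bits(v)
    return pack([y[0] ^ y[4] ^ y[5] ^ y[6], y[1] ^ y[2], y[1] ^ y[7],
                 y[2] ^ y[4], y[4] ^ y[5] ^ y[6], y[1] ^ y[4] ^ y[6] ^ y[7],
                 y[2] ^ y[3] ^ y[5] ^ y[7], y[5] ^ y[7]])

def from_composite(v):
    """Map from GF((2^4)^2) back to GF(2^8), transcribed from Eq. 19."""
    z = bits(v)
    return pack([z[0] ^ z[4], z[4] ^ z[5] ^ z[7], z[1] ^ z[4] ^ z[5] ^ z[7],
                 z[1] ^ z[4] ^ z[5] ^ z[6], z[1] ^ z[3] ^ z[4] ^ z[5] ^ z[7],
                 z[2] ^ z[4] ^ z[5], z[1] ^ z[2] ^ z[3] ^ z[4] ^ z[7],
                 z[2] ^ z[4] ^ z[5] ^ z[7]])

def aff_tran(v):
    """Affine transformation in GF((2^4)^2), transcribed from Eq. 21."""
    x = bits(v)
    a = x[5] ^ x[6] ^ x[7]
    return pack([x[6] ^ 1, x[1] ^ x[2] ^ x[7] ^ 1,
                 x[0] ^ x[3] ^ x[5] ^ x[6] ^ 1, x[1] ^ a,
                 x[0] ^ x[2] ^ x[4] ^ a, x[1] ^ x[5] ^ x[6],
                 x[2] ^ x[6] ^ x[7] ^ 1, x[3] ^ x[5] ^ 1])

def check_eq21():
    """True if Eq. 21 equals MAP(affine(MAP^-1(z))) for every byte z."""
    return all(aff_tran(z) == to_composite(affine_gf28(from_composite(z)))
               for z in range(256))

print(check_eq21())
```

Checking all 256 values is exhaustive for an 8-bit transformation, so agreement here covers every term of Eq. 21.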

Figure 8 describes the proposed subpipelined architecture of Subbytes in GF((2⁴)²). The dashed lines stand for the registers.

We cut an AES round unit into eight substages, with the maximum delay determined by part II (Fig. 8) of Subbytes. The inverse S-box uses the same multiplicative inverse as encryption, except that the inverse affine transformation is applied before the multiplicative inverse. We also derive the following formula for the inverse affine transformation in GF((2⁴)²):

$$\begin{gathered} {\varvec{y}}\,\user2{ = }\,\user2{Inv\_AFF\_TRAN}\left( {\varvec{x}} \right) \hfill \\ a = x_{1} \oplus x_{5} \oplus x_{6} , b = x_{0} \oplus x_{2} \oplus x_{7} \hfill \\ y_{0} = b^{\prime}, y_{1} = (x_{0} \oplus x_{1} \oplus x_{6} )^{\prime}, y_{2} = x_{0} \oplus x_{3} \oplus x_{5} \oplus x_{6} , y_{3} = (x_{7} \oplus a)^{\prime} \hfill \\ y_{4} = x_{1} \oplus x_{4} \oplus x_{5} \oplus b, y_{5} = a, y_{6} = x_{0}^{^{\prime}} , y_{7} = x_{3} \oplus x_{5} \hfill \\ \end{gathered}$$

(22)
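Equation 22 should undo Eq. 21 for every input. The following Python sketch, given only as an illustration, transcribes both equations (bit 0 is assumed to be the least significant bit) and tests the round trip over all 256 values. The names `aff_tran`, `inv_aff_tran` and `round_trip_ok` are hypothetical labels for this example.

```python
def bits(v):
    """Return [v0, ..., v7], with v0 the least significant bit."""
    return [(v >> i) & 1 for i in range(8)]

def pack(b):
    """Rebuild a byte from its bit list [b0, ..., b7]."""
    return sum(bit << i for i, bit in enumerate(b))

def aff_tran(v):
    """Affine transformation in GF((2^4)^2), transcribed from Eq. 21."""
    x = bits(v)
    a = x[5] ^ x[6] ^ x[7]
    return pack([x[6] ^ 1, x[1] ^ x[2] ^ x[7] ^ 1,
                 x[0] ^ x[3] ^ x[5] ^ x[6] ^ 1, x[1] ^ a,
                 x[0] ^ x[2] ^ x[4] ^ a, x[1] ^ x[5] ^ x[6],
                 x[2] ^ x[6] ^ x[7] ^ 1, x[3] ^ x[5] ^ 1])

def inv_aff_tran(v):
    """Inverse affine transformation in GF((2^4)^2), transcribed from Eq. 22."""
    x = bits(v)
    a = x[1] ^ x[5] ^ x[6]
    b = x[0] ^ x[2] ^ x[7]
    return pack([b ^ 1, x[0] ^ x[1] ^ x[6] ^ 1,
                 x[0] ^ x[3] ^ x[5] ^ x[6], x[7] ^ a ^ 1,
                 x[1] ^ x[4] ^ x[5] ^ b, a,
                 x[0] ^ 1, x[3] ^ x[5]])

def round_trip_ok():
    """True if Eq. 22 inverts Eq. 21 (and vice versa) for every byte."""
    return all(inv_aff_tran(aff_tran(v)) == v and aff_tran(inv_aff_tran(v)) == v
               for v in range(256))

print(round_trip_ok())
```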

Figure 9 illustrates the design of the S-box for encryption and decryption. Each unit processes an 8-bit input in GF((2⁴)²), so four units are required for the 32-bit data path.
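The order of operations in the two directions can be modelled in software. The Python sketch below is a behavioural illustration under stated assumptions, not the circuit of Fig. 9: the GF((2⁴)²) multiplicative inverter is replaced by a stand-in that maps back to GF(2⁸) and computes a²⁵⁴ there, and bit 0 is taken as the least significant bit. All function names are hypothetical.

```python
def bits(v):
    return [(v >> i) & 1 for i in range(8)]

def pack(b):
    return sum(bit << i for i, bit in enumerate(b))

def to_composite(v):          # Eq. 18
    y = bits(v)
    return pack([y[0] ^ y[4] ^ y[5] ^ y[6], y[1] ^ y[2], y[1] ^ y[7],
                 y[2] ^ y[4], y[4] ^ y[5] ^ y[6], y[1] ^ y[4] ^ y[6] ^ y[7],
                 y[2] ^ y[3] ^ y[5] ^ y[7], y[5] ^ y[7]])

def from_composite(v):        # Eq. 19
    z = bits(v)
    return pack([z[0] ^ z[4], z[4] ^ z[5] ^ z[7], z[1] ^ z[4] ^ z[5] ^ z[7],
                 z[1] ^ z[4] ^ z[5] ^ z[6], z[1] ^ z[3] ^ z[4] ^ z[5] ^ z[7],
                 z[2] ^ z[4] ^ z[5], z[1] ^ z[2] ^ z[3] ^ z[4] ^ z[7],
                 z[2] ^ z[4] ^ z[5] ^ z[7]])

def aff_tran(v):              # Eq. 21
    x = bits(v)
    a = x[5] ^ x[6] ^ x[7]
    return pack([x[6] ^ 1, x[1] ^ x[2] ^ x[7] ^ 1,
                 x[0] ^ x[3] ^ x[5] ^ x[6] ^ 1, x[1] ^ a,
                 x[0] ^ x[2] ^ x[4] ^ a, x[1] ^ x[5] ^ x[6],
                 x[2] ^ x[6] ^ x[7] ^ 1, x[3] ^ x[5] ^ 1])

def inv_aff_tran(v):          # Eq. 22
    x = bits(v)
    a = x[1] ^ x[5] ^ x[6]
    b = x[0] ^ x[2] ^ x[7]
    return pack([b ^ 1, x[0] ^ x[1] ^ x[6] ^ 1, x[0] ^ x[3] ^ x[5] ^ x[6],
                 x[7] ^ a ^ 1, x[1] ^ x[4] ^ x[5] ^ b, a, x[0] ^ 1, x[3] ^ x[5]])

def gf28_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def composite_inverse(v):
    """Assumed stand-in for the hardware inverter: a^254 computed in GF(2^8)."""
    a, r = from_composite(v), 1
    for _ in range(254):
        r = gf28_mul(r, a)
    return to_composite(r)

def sbox_enc(v):
    """Encryption: multiplicative inverse first, then the affine transformation."""
    return aff_tran(composite_inverse(v))

def sbox_dec(v):
    """Decryption: inverse affine transformation first, then the same inverse."""
    return composite_inverse(inv_aff_tran(v))

# AES S-box entry for {53}, evaluated through the composite field path.
print(hex(from_composite(sbox_enc(to_composite(0x53)))))
```

Both directions call the same `composite_inverse`, which mirrors the sharing of the multiplicative inverse between encryption and decryption described above.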

3.4.3 MixColumns on GF((2⁴)²)

MixColumns is another transformation that involves mathematical operations in GF((2⁴)²). We derive the following formulas to perform MixColumns in the composite field.

Since GF((2⁴)²) is isomorphic to GF(2⁸), and {02}, {03}, {01} in GF(2⁸) are mapped to {26}, {27}, {01}, respectively, in GF((2⁴)²), the MixColumns operation described by Eq. 2 can be mapped directly to Eq. 23.

$$\left[ {\begin{array}{*{20}c} {26} & {27} & {01} & {01} \\ {01} & {26} & {27} & {01} \\ {01} & {01} & {26} & {27} \\ {27} & {01} & {01} & {26} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {s_{0,0} } & {s_{0,1} } & {s_{0,2} } & {s_{0,3} } \\ {s_{1,0} } & {s_{1,1} } & {s_{1,2} } & {s_{1,3} } \\ {s_{2,0} } & {s_{2,1} } & {s_{2,2} } & {s_{2,3} } \\ {s_{3,0} } & {s_{3,1} } & {s_{3,2} } & {s_{3,3} } \\ \end{array} } \right]$$

(23)
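The matrix entries in Eq. 23 follow from applying the map of Eq. 18 to the GF(2⁸) constants. As a small illustration (the function names are hypothetical, and bit 0 is assumed to be the least significant bit), the mapping can be evaluated in Python:

```python
def bits(v):
    """Return [v0, ..., v7], with v0 the least significant bit."""
    return [(v >> i) & 1 for i in range(8)]

def pack(b):
    """Rebuild a byte from its bit list [b0, ..., b7]."""
    return sum(bit << i for i, bit in enumerate(b))

def to_composite(v):
    """Map from GF(2^8) to GF((2^4)^2), transcribed from Eq. 18."""
    y = bits(v)
    return pack([y[0] ^ y[4] ^ y[5] ^ y[6], y[1] ^ y[2], y[1] ^ y[7],
                 y[2] ^ y[4], y[4] ^ y[5] ^ y[6], y[1] ^ y[4] ^ y[6] ^ y[7],
                 y[2] ^ y[3] ^ y[5] ^ y[7], y[5] ^ y[7]])

# MixColumns coefficients in GF(2^8) and their GF((2^4)^2) images.
for const in (0x01, 0x02, 0x03):
    print(f"{const:02x} -> {to_composite(const):02x}")
```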

Since {27} = {26} ⊕ {01} in GF((2⁴)²), Eq. 23 is equivalent to Eq. 24, where j = 0, 1, 2, 3:

$$\begin{gathered} S_{0,j}^{^{\prime}} = \left\{ {26} \right\} \times \left( {S_{0,j} + S_{1,j} } \right) + S_{1,j} + S_{2,j} + S_{3,j} \hfill \\ S_{1,j}^{^{\prime}} = \left\{ {26} \right\} \times \left( {S_{1,j} + S_{2,j} } \right) + S_{0,j} + S_{2,j} + S_{3,j} \hfill \\ S_{2,j}^{^{\prime}} = \left\{ {26} \right\} \times \left( {S_{2,j} + S_{3,j} } \right) + S_{0,j} + S_{1,j} + S_{3,j} \hfill \\ S_{3,j}^{^{\prime}} = \left\{ {26} \right\} \times \left( {S_{0,j} + S_{3,j} } \right) + S_{0,j} + S_{1,j} + S_{2,j} \hfill \\ \end{gathered}$$

(24)
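As a worked step, the first row of Eq. 24 follows from the first row of the matrix in Eq. 23 by substituting {27} = {26} + {01} and collecting the terms multiplied by {26} (here + denotes addition in GF((2⁴)²), that is, bitwise XOR):

$$\begin{aligned} S_{0,j}^{^{\prime}} & = \left\{ {26} \right\} \times S_{0,j} + \left\{ {27} \right\} \times S_{1,j} + S_{2,j} + S_{3,j} \\ & = \left\{ {26} \right\} \times S_{0,j} + \left( {\left\{ {26} \right\} + \left\{ {01} \right\}} \right) \times S_{1,j} + S_{2,j} + S_{3,j} \\ & = \left\{ {26} \right\} \times \left( {S_{0,j} + S_{1,j} } \right) + S_{1,j} + S_{2,j} + S_{3,j} \\ \end{aligned}$$

The other three rows follow in the same way, so each output byte needs only one multiplication by {26}.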

Equation 24 presents the MixColumn transformation of one column of a state. The MixColumn transformation can be implemented by the parallel structure in Fig. 10.
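A software model of Eq. 24 for one column is sketched below in Python. It is an illustration under stated assumptions rather than the structure of Fig. 10: multiplication by {26} uses the bit equations of Eq. 28, the maps of Eqs. 18 and 19 are used only to feed in and read out a GF(2⁸) test column, bit 0 is the least significant bit, and all function names are hypothetical. The test column {db, 13, 53, 45} is a commonly used MixColumns example and does not come from this paper.

```python
def bits(v):
    return [(v >> i) & 1 for i in range(8)]

def pack(b):
    return sum(bit << i for i, bit in enumerate(b))

def to_composite(v):          # Eq. 18
    y = bits(v)
    return pack([y[0] ^ y[4] ^ y[5] ^ y[6], y[1] ^ y[2], y[1] ^ y[7],
                 y[2] ^ y[4], y[4] ^ y[5] ^ y[6], y[1] ^ y[4] ^ y[6] ^ y[7],
                 y[2] ^ y[3] ^ y[5] ^ y[7], y[5] ^ y[7]])

def from_composite(v):        # Eq. 19
    z = bits(v)
    return pack([z[0] ^ z[4], z[4] ^ z[5] ^ z[7], z[1] ^ z[4] ^ z[5] ^ z[7],
                 z[1] ^ z[4] ^ z[5] ^ z[6], z[1] ^ z[3] ^ z[4] ^ z[5] ^ z[7],
                 z[2] ^ z[4] ^ z[5], z[1] ^ z[2] ^ z[3] ^ z[4] ^ z[7],
                 z[2] ^ z[4] ^ z[5] ^ z[7]])

def times26(v):
    """Multiply by {26} in GF((2^4)^2), transcribed from Eq. 28."""
    x = bits(v)
    a, b, c = x[2] ^ x[4], x[3] ^ x[6] ^ x[7], x[1] ^ x[5]
    return pack([a ^ b ^ x[5], a ^ x[0], c ^ x[0] ^ x[3] ^ x[4], c ^ a ^ x[6],
                 x[3] ^ x[6], b ^ x[0], x[1] ^ x[4] ^ x[7], x[2] ^ x[5]])

def mix_column(col):
    """Eq. 24 for one column [S0, S1, S2, S3] of composite field bytes."""
    s0, s1, s2, s3 = col
    return [times26(s0 ^ s1) ^ s1 ^ s2 ^ s3,
            times26(s1 ^ s2) ^ s0 ^ s2 ^ s3,
            times26(s2 ^ s3) ^ s0 ^ s1 ^ s3,
            times26(s0 ^ s3) ^ s0 ^ s1 ^ s2]

# Feed a GF(2^8) column through the composite field datapath and map it back.
column = [to_composite(b) for b in (0xDB, 0x13, 0x53, 0x45)]
mixed = [from_composite(v) for v in mix_column(column)]
print([hex(b) for b in mixed])
```

The four output bytes are independent of each other, which is what allows the parallel evaluation described above.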

In the following, we derive Eq. 28 to calculate x × {26} in GF((2⁴)²), that is, the GF((2⁴)²) representation of x × {02}.

(1) Let $x, y \in GF\left( {2^{8} } \right),$ $y = x \times \left\{ {02} \right\}:$
$$y_{0} = x_{7} , y_{1} = x_{0} \oplus x_{7} , y_{2} = x_{1} , y_{3} = x_{2} \oplus x_{7} , y_{4} = x_{3} \oplus x_{7} , y_{5} = x_{4} , y_{6} = x_{5} , y_{7} = x_{6}$$
(25)
(2) Convert y to an element of GF((2⁴)²). Let w represent y in GF((2⁴)²), as given by Eq. 18.
(3) Let $z$ be the $GF\left( {2^{4} } \right)^{2}$ format of x:
$$\begin{gathered} x_{0} = z_{0} \oplus z_{4} , x_{1} = z_{4} \oplus z_{5} \oplus z_{7} , x_{2} = z_{1} \oplus z_{4} \oplus z_{5} \oplus z_{7} ,x_{3} = z_{1} \oplus z_{4} \oplus z_{5} \oplus z_{6} \hfill \\ x_{4} = z_{1} \oplus z_{3} \oplus z_{4} \oplus z_{5} \oplus z_{7} , x_{5} = z_{2} \oplus z_{4} \oplus z_{5} , x_{6} = z_{1} \oplus z_{2} \oplus z_{3} \oplus z_{4} \oplus z_{7} , \hfill \\ x_{7} = z_{2} \oplus z_{4} \oplus z_{5} \oplus z_{7} \hfill \\ \end{gathered}$$
(26)
(4) Replace x and y with their corresponding GF((2⁴)²) formats z and w:
$$\begin{aligned} w_{0} & = y_{0} \oplus y_{4} \oplus y_{5} \oplus y_{6} \;\;\left( {{\text{by Equation}}\;18} \right) \\ & = x_{7} \oplus \left( {x_{3} \oplus x_{7} } \right) \oplus x_{4} \oplus x_{5} \;\;\left( {{\text{by Equation}}\;25} \right) \\ & = x_{3} \oplus x_{4} \oplus x_{5} \\ & = (z_{1} \oplus z_{4} \oplus z_{5} \oplus z_{6} ) \oplus \left( {z_{1} \oplus z_{3} \oplus z_{4} \oplus z_{5} \oplus z_{7} } \right) \oplus \left( {z_{2} \oplus z_{4} \oplus z_{5} } \right)\;\;\left( {{\text{by Equation}}\;26} \right) \\ & = z_{2} \oplus z_{3} \oplus z_{4} \oplus z_{5} \oplus z_{6} \oplus z_{7} \\ \end{aligned}$$

Through the same procedures, we can derive:
$$\begin{gathered} w_{0} = z_{2} \oplus z_{3} \oplus z_{4} \oplus z_{5} \oplus z_{6} \oplus z_{7} , w_{1} = z_{0} \oplus z_{2} \oplus z_{4} , w_{2} = z_{0} \oplus z_{1} \oplus z_{3} \oplus z_{4} \oplus z_{5} \hfill \\ w_{3} = z_{1} \oplus z_{2} \oplus z_{4} \oplus z_{5} \oplus z_{6} , w_{4} = z_{3} \oplus z_{6} , w_{5} = z_{0} \oplus z_{3} \oplus z_{6} \oplus z_{7} \hfill \\ w_{6} = z_{1} \oplus z_{4} \oplus z_{7} , w_{7} = z_{2} \oplus z_{5} \hfill \\ \end{gathered}$$
(27)

(5) For consistency, replace z with x and w with y (x,y $\in GF\left( {2^{4} } \right)^{2}$):
$$\begin{gathered} {\varvec{y}} = {\varvec{x}} \times 26,\user2{ }\;\user2{x,y} \in {\varvec{GF}}\left( {2^{4} } \right)^{2} \hfill \\ a = x_{2} \oplus x_{4} , b = x_{3} \oplus x_{6} \oplus x_{7} , c = x_{1} \oplus x_{5} \hfill \\ y_{0} = a \oplus b \oplus x_{5} , y_{1} = a \oplus x_{0} , y_{2} = c \oplus x_{0} \oplus x_{3} \oplus x_{4} , y_{3} = c \oplus a \oplus x_{6} \hfill \\ y_{4} = x_{3} \oplus x_{6} , y_{5} = b \oplus x_{0} , y_{6} = x_{1} \oplus x_{4} \oplus x_{7} , y_{7} = x_{2} \oplus x_{5} \hfill \\ \end{gathered}$$
(28)
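Equation 28 can be checked against its definition, MAP(xtime(MAP⁻¹(x))), for every byte. The Python sketch below is an illustrative check only: `xtime_gf28` implements Eq. 25 with shifts, the maps follow Eqs. 18 and 26, bit 0 is taken as the least significant bit, and the function names are hypothetical.

```python
def bits(v):
    return [(v >> i) & 1 for i in range(8)]

def pack(b):
    return sum(bit << i for i, bit in enumerate(b))

def to_composite(v):          # Eq. 18
    y = bits(v)
    return pack([y[0] ^ y[4] ^ y[5] ^ y[6], y[1] ^ y[2], y[1] ^ y[7],
                 y[2] ^ y[4], y[4] ^ y[5] ^ y[6], y[1] ^ y[4] ^ y[6] ^ y[7],
                 y[2] ^ y[3] ^ y[5] ^ y[7], y[5] ^ y[7]])

def from_composite(v):        # Eq. 26 (same map as Eq. 19)
    z = bits(v)
    return pack([z[0] ^ z[4], z[4] ^ z[5] ^ z[7], z[1] ^ z[4] ^ z[5] ^ z[7],
                 z[1] ^ z[4] ^ z[5] ^ z[6], z[1] ^ z[3] ^ z[4] ^ z[5] ^ z[7],
                 z[2] ^ z[4] ^ z[5], z[1] ^ z[2] ^ z[3] ^ z[4] ^ z[7],
                 z[2] ^ z[4] ^ z[5] ^ z[7]])

def xtime_gf28(v):
    """Multiply by {02} in GF(2^8): shift left, reduce by {1b} on overflow (Eq. 25)."""
    return ((v << 1) & 0xFF) ^ (0x1B if v & 0x80 else 0x00)

def times26(v):
    """Multiply by {26} in GF((2^4)^2), transcribed from Eq. 28."""
    x = bits(v)
    a, b, c = x[2] ^ x[4], x[3] ^ x[6] ^ x[7], x[1] ^ x[5]
    return pack([a ^ b ^ x[5], a ^ x[0], c ^ x[0] ^ x[3] ^ x[4], c ^ a ^ x[6],
                 x[3] ^ x[6], b ^ x[0], x[1] ^ x[4] ^ x[7], x[2] ^ x[5]])

def check_eq28():
    """True if Eq. 28 matches MAP(xtime(MAP^-1(x))) for all 256 inputs."""
    return all(times26(x) == to_composite(xtime_gf28(from_composite(x)))
               for x in range(256))

print(check_eq28())
```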

In this design, for both encryption and decryption, we modify the MixColumn and InvMixColumn architecture proposed by Fischer et al. [24] by mapping it from GF(2⁸) to GF((2⁴)²). Only the “xtime” operation needs to be modified; that is, “xtime” is calculated in GF((2⁴)²) as the multiplication by {26} given in Eq. 28.

3.4.4 Subpipelined keyschedule

There are two approaches to implement keyschedule: (1) pre-calculated keyschedule and (2) on-the-fly keyschedule. In the pre-calculated keyschedule, the (Nr + 1) 128-bit roundkeys are generated before the encryption or decryption begins and stored in the memory. The addroundkey operation accesses the roundkeys by referring to the corresponding address in the memory. The advantage of this approach is that the keyschedule only needs to be performed once; however, the drawbacks include:

(i) The (Nr + 1) roundkeys cost (Nr + 1) × 128 bits of memory space;
(ii) The cipherkey should not change frequently. Every time it changes, the roundkeys must be recalculated.

In this paper, we propose a new 32-bit pipelined on-the-fly keyschedule in the full composite field GF((2⁴)²) for 128-, 192- and 256-bit key sizes, in which one 128-bit roundkey is generated every four clock cycles (32 bits per clock cycle). The following shows the 32-bit roundkeys at each clock cycle (KA(i) and KB(i) represent the 32-bit round keys for block A and block B, respectively, with 0 ≤ i ≤ 4Nr + 3).

The roundkeys for block A:

roundkey[0]={KA(0), KA(1), KA(2), KA(3)}

roundkey[1]={KA(4), KA(5), KA(6), KA(7)}

……

roundkey[Nr]={KA(4Nr), KA(4Nr+1), KA(4Nr+2), KA(4Nr+3)}

The roundkeys for block B:

roundkey[0]={KB(0), KB(1), KB(2), KB(3)}

roundkey[1]={KB(4), KB(5), KB(6), KB(7)}

……

roundkey[Nr]={KB(4Nr), KB(4Nr+1), KB(4Nr+2), KB(4Nr+3)}
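The word-per-cycle ordering above can be modelled in software. The Python sketch below is a reference model under stated assumptions, not the proposed pipeline: it works in GF(2⁸) rather than GF((2⁴)²), follows the standard AES key expansion, keeps only the last Nk words (as an on-the-fly schedule would) and yields one 32-bit word per step, corresponding to K(0), K(1), ..., K(4Nr + 3) for a single block. The names `roundkey32_stream`, `aes_sbox` and the other helpers are hypothetical.

```python
from collections import deque

def gf28_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def rotl8(v, s):
    return ((v << s) | (v >> (8 - s))) & 0xFF

def aes_sbox(b):
    """AES S-box: inverse (b^254) followed by the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf28_mul(inv, b)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

SBOX = [aes_sbox(b) for b in range(256)]

def sub_word(w):
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def rot_word(w):
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)

def roundkey32_stream(key):
    """Yield the 4 * (Nr + 1) 32-bit roundkey words, one per step."""
    nk = len(key) // 4                 # 4, 6 or 8 words
    nr = nk + 6                        # 10, 12 or 14 rounds
    window = deque(maxlen=nk)          # only the last Nk words are kept
    rcon = 0x01
    for i in range(4 * (nr + 1)):
        if i < nk:                     # the first Nk words are the cipherkey
            word = int.from_bytes(key[4 * i:4 * i + 4], "big")
        else:
            temp = window[-1]
            if i % nk == 0:            # start of a keyschedule round
                temp = sub_word(rot_word(temp)) ^ (rcon << 24)
                rcon = gf28_mul(rcon, 0x02)
            elif nk == 8 and i % nk == 4:
                temp = sub_word(temp)  # extra subword for 256-bit keys
            word = window[0] ^ temp
        window.append(word)
        yield word

key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
words = list(roundkey32_stream(key))
# Group four consecutive words into each 128-bit roundkey.
roundkeys = [words[4 * r:4 * r + 4] for r in range(len(words) // 4)]
print(len(roundkeys), [f"{w:08x}" for w in roundkeys[1]])
```

The example key is the 128-bit key used in the FIPS-197 key expansion example [2].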

Because we use the on-the-fly keyschedule, the keyschedule and the encryptor/decryptor share the same clock, and the overall clock frequency is determined by the maximum delay across both modules. To achieve efficient pipelining, a proper division of the keyschedule is as important as that of the encryptor/decryptor. Subword is the most costly component in the keyschedule. To balance the delays of the two modules, we implement subword in the same way as Subbytes in the encryptor/decryptor.

All mathematical operations in the keyschedule are transformed into the field GF((2⁴)²). Subword shares the same structure as Subbytes. Xorrcon is a simple XOR operation with a round constant, which is initially {01} and is multiplied by {02} at each keyschedule round. A keyschedule round begins when the clk counter = 0, and its length depends on the key size: four clock cycles for a 128-bit key, six for a 192-bit key and eight for a 256-bit key. In GF((2⁴)²), {01} is still {01} and {02} is mapped to {26}, so Eq. 28 can be used to generate the round constant for each keyschedule round.
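The round constant sequence in the composite field can be illustrated with a few lines of Python. This is a sketch only: `times26` transcribes Eq. 28, `from_composite` transcribes Eq. 19 and is used solely to compare the result with the familiar GF(2⁸) constants, bit 0 is the least significant bit, and the function names are hypothetical.

```python
def bits(v):
    return [(v >> i) & 1 for i in range(8)]

def pack(b):
    return sum(bit << i for i, bit in enumerate(b))

def from_composite(v):
    """Map from GF((2^4)^2) back to GF(2^8), transcribed from Eq. 19."""
    z = bits(v)
    return pack([z[0] ^ z[4], z[4] ^ z[5] ^ z[7], z[1] ^ z[4] ^ z[5] ^ z[7],
                 z[1] ^ z[4] ^ z[5] ^ z[6], z[1] ^ z[3] ^ z[4] ^ z[5] ^ z[7],
                 z[2] ^ z[4] ^ z[5], z[1] ^ z[2] ^ z[3] ^ z[4] ^ z[7],
                 z[2] ^ z[4] ^ z[5] ^ z[7]])

def times26(v):
    """Multiply by {26} in GF((2^4)^2), transcribed from Eq. 28."""
    x = bits(v)
    a, b, c = x[2] ^ x[4], x[3] ^ x[6] ^ x[7], x[1] ^ x[5]
    return pack([a ^ b ^ x[5], a ^ x[0], c ^ x[0] ^ x[3] ^ x[4], c ^ a ^ x[6],
                 x[3] ^ x[6], b ^ x[0], x[1] ^ x[4] ^ x[7], x[2] ^ x[5]])

def composite_rcon(rounds):
    """Round constants in GF((2^4)^2): start at {01}, multiply by {26} per round."""
    rc, out = 0x01, []
    for _ in range(rounds):
        out.append(rc)
        rc = times26(rc)
    return out

rcons = composite_rcon(10)            # ten keyschedule rounds for a 128-bit key
print([f"{rc:02x}" for rc in rcons])
print([f"{from_composite(rc):02x}" for rc in rcons])
```

Mapping the sequence back to GF(2⁸) gives a direct way to compare it with the round constants of the standard key expansion.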

The proposed keyschedule has three key size options: Key128, Key192 and Key256. The notation roundkey32 denotes the 32-bit roundkey produced in each clock cycle, and roundkey denotes the 128-bit roundkey for one round of AES.

For decryption, the roundkey32s must be generated in reverse order. The last Nk roundkey32s from encryption are stored in a 256-bit register and used as the initial decipherkey for decryption. For a given cipherkey, at least one encryption operation must therefore be performed before decryption so that the final Nk roundkey32s are available. Multiplexers then select the cipherkey in encryption mode and the decipherkey in decryption mode. Since the decipherkey roundkey32s are already in GF((2⁴)²), they do not pass through the MAP operation.
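The reverse-order generation can be sketched in software by running the key expansion recurrence backwards from the last Nk words. The Python model below is an assumption-laden illustration, not the circuit of the proposed design: it works in GF(2⁸) instead of GF((2⁴)²), uses the standard AES recurrence w[i] = w[i − Nk] ⊕ temp, and all function names are hypothetical.

```python
def gf28_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def rotl8(v, s):
    return ((v << s) | (v >> (8 - s))) & 0xFF

def aes_sbox(b):
    inv = 1
    for _ in range(254):
        inv = gf28_mul(inv, b)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

SBOX = [aes_sbox(b) for b in range(256)]

def sub_word(w):
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def rot_word(w):
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)

def rcon(n):
    """Round constant {02}^(n-1) for keyschedule round n >= 1."""
    r = 0x01
    for _ in range(n - 1):
        r = gf28_mul(r, 0x02)
    return r

def temp_word(prev, i, nk):
    """The term XORed with w[i - Nk] to form w[i]."""
    if i % nk == 0:
        return sub_word(rot_word(prev)) ^ (rcon(i // nk) << 24)
    if nk == 8 and i % nk == 4:
        return sub_word(prev)
    return prev

def expand(key):
    """Forward key expansion: all 4 * (Nr + 1) words, with Nr = Nk + 6."""
    nk = len(key) // 4
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    for i in range(nk, 4 * (nk + 7)):
        w.append(w[i - nk] ^ temp_word(w[i - 1], i, nk))
    return w

def reverse_schedule(last_words, nk):
    """Regenerate every word, last to first, from the final Nk words."""
    total = 4 * (nk + 7)
    w = [None] * (total - nk) + list(last_words)
    for i in range(total - 1, nk - 1, -1):
        w[i - nk] = w[i] ^ temp_word(w[i - 1], i, nk)   # w[i-Nk] = w[i] XOR temp
    return w[::-1]

key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
forward = expand(key)
backward = reverse_schedule(forward[-4:], 4)
print(backward == forward[::-1])
```

Only the last Nk words are needed as a starting point, which is consistent with storing them in a single 256-bit register (Nk is at most eight).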

Figure 11 illustrates the keyschedule architecture. The multiplexers mul1 and mul2 are used to reconfigure the pipeline for each of the three key sizes.

SA, SB, SC and SD are the four sections of the subword operation, with interspersed registers. RW is the outcome of rotword. RC generates the round constant for xorrcon in GF((2⁴)²). Multiplexer mul3 selects the correct previous roundkey32 as input to the subword operation. Multiplexer mul4 selects the appropriate calculated result to serve as the next roundkey32.

Table 2 summarizes the reconfigurable control of the multiplexers to generate three key sizes (• represents that the multiplexer is enabled for the corresponding key size, and the numbers represent the input selections of the multiplexer depending on the corresponding clock cycles).

Table 2 Summary of reconfigurable control for keyschedule-128/192/256


When the key size is 128 bits, the number of encryption rounds is ten, so the two blocks A and B need 22 roundkeys in total. In our design, the first step maps (MAP) the cipherkey from GF(2⁸) to GF((2⁴)²). The keyschedule then performs its isomorphic functions in GF((2⁴)²), and its outputs are roundkey32s represented in GF((2⁴)²). This is exactly the format required in encryption, where the message blocks are also represented in GF((2⁴)²), so no inverse MAP is required in the keyschedule. Three registers are placed among the four substages (SA, SB, SC and SD) of the subword operation.

4 Implementation performance and comparison

Many studies of hardware AES implementations have been published. Table 3 summarizes the functions provided by different FPGA implementations.

Table 3 Function comparisons of different AES architectures


We do not use BRAM in our design, in order to make the architecture suitable for wireless and embedded devices. Our proposed architecture has been simulated and synthesized with Xilinx Synthesis Technology (XST) ISE 10, and implemented on a Xilinx Virtex-4 device. Based on the synthesis results, we also optimized the delays between the different stages of our design to improve the performance. Table 4 lists the synthesis results on the Virtex-4 XC4VSX25 and the performance comparison.

Table 4 Design synthesis results and performance comparison


Compared with the previous architectures, our design focuses on low-cost, non-BRAM implementations. Pramstaller et al. [5] proposed a compact design costing 1125 slices, with throughputs of 215 Mbps for 128-bit, 180 Mbps for 192-bit and 156 Mbps for 256-bit keys at a maximum frequency of 161 MHz. However, the round keys were pre-calculated by the key generator, and RAM was required to store them. We generate the round keys on-the-fly, which is useful and efficient when the key changes (AES is a symmetric-key cipher, and the session key usually changes frequently). In addition, our throughput is higher for each of the three key sizes (375, 318 and 275 Mbits/s for 128-, 192- and 256-bit keys, respectively). Furthermore, we propose a new subpipelined keyschedule that supports all three key sizes (128, 192 and 256 bits). The delays between the stages in encryption/decryption and keyschedule have been optimized in our architecture. We also present a new 32-bit full composite field approach in which GF((2⁴)²) arithmetic is applied in all transformations of the encryptor/decryptor and keyschedule, which greatly reduces the cost of mapping between GF(2⁸) and GF((2⁴)²). In addition, the 32-bit data path greatly reduces the hardware cost, so the design can be applied efficiently in resource-restricted environments such as wireless and embedded devices.

5 Conclusion

AES is an important and popular cryptographic algorithm to secure the information and data transmission. In this paper, we propose a compact reconfigurable FPGA architecture for the AES implementation. The 32-bit single-round unit design results in low area cost, which makes it suitable for low-end devices. The combinational logic approach of S-Box eliminates the need for BRAMs.

In our architecture, full GF((2⁴)²) composite field arithmetic is applied in all transformations of encryption/decryption and keyschedule, which greatly reduces the cost of mapping: only one MAP and one MAP⁻¹ are applied in encryption/decryption, and one MAP is applied in the keyschedule. The full composite field-based design decreases the hardware complexity of the arithmetic operations in AES. In addition, we apply subpipelining in both the encryptor/decryptor and keyschedule modules to optimize the speed/area ratio. The capability to handle three key sizes makes our design an efficient reconfigurable AES architecture. The performance comparison indicates that the proposed AES architecture achieves better performance than previous work.

In conclusion, the proposed compact and reconfigurable AES architecture has high throughput and low area cost, which makes it well suited to computing-restricted environments such as wireless devices.

6 Future work

In the future, we will synthesize our FPGA prototype, optimize the design and implement it in VLSI. We believe that the performance of the proposed architecture can be increased with current VLSI design tools and technology, leading to a new reconfigurable and efficient AES encryption/decryption chip that can be easily embedded into wireless and computing-restricted devices to provide security services.

References

W. Stallings, Cryptography and Network Security-Principles and Practices, 4th edn. (Pearson Prentice hall, 2006)
NIST. Announcing the advanced encryption standard (AES). Available at https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf, 2001.
Daemen J, Rijmen V. AES proposal: Rijndael. Technical report, National Institute of Standards and Technology (NIST). Available at http://www.nic.funet.fi/pub/crypt/cryptography/symmetric/aes/nist/Rijndael.pdf, 2000.
P. Chodowiec, K. Gaj, Very compact FPGA implementation of the AES algorithm. Cryptogr Hardw Embed Syst CHES 2003, 319–333 (2003)
N. Pramstaller, J. Wolkerstorfer, A universal and efficient AES co-processor for field programmable logic arrays, in Field programmable logic and application. ed. by J. Becker, M. Platzner, S. Vernalde (Springer Berlin Heidelberg, Berlin, Heidelberg, 2004), pp.565–574. https://doi.org/10.1007/978-3-540-30117-2_58
T. Good, M. Benaissa, AES on FPGA from the fastest to the smallest, in Cryptographic hardware and embedded systems – CHES 2005. ed. by J.R. Rao, B. Sunar (Springer Berlin Heidelberg, Berlin, Heidelberg, 2005), pp.427–440. https://doi.org/10.1007/11545262_31
N. Pramstaller, S. Mangard, S. Dominikus, J. Wolkerstorfer, Efficient AES implementations on ASICs and FPGAs, in Advanced encryption standard – AES. ed. by H. Dobbertin, V. Rijmen, A. Sowa (Springer Berlin Heidelberg, Berlin, Heidelberg, 2005), pp.98–112. https://doi.org/10.1007/11506447_9
X. Zhang, K.K. Parhi, High-speed VLSI architectures for the AES algorithm. IEEE Trans VLSI Syst 12(9), 957–967 (2004)
Gaj K, Chodowiec P. Comparison of the hardware performance of the AES candidates using reconfigurable hardware. In: AES candidate conference, pp. 40–54, 2000.
Liberatori M, Otero F, Bonadero JC, Castineira J. AES-128 Cipher. high speed, low cost FPGA implementation. In: 2007 3rd southern conference on programmable logic, pp. 195–198, 2007.
A. Rudra, P.K. Dubey, C.S. Jutla, V. Kumar, J.R. Rao, P. Rohatgi, Efficient rijndael encryption implementation with composite field arithmetic, in Cryptographic hardware and embedded systems — CHES 2001. ed. by Ç.K. Koç, D. Naccache, C. Paar (Springer Berlin Heidelberg, Berlin, Heidelberg, 2001), pp.171–184. https://doi.org/10.1007/3-540-44709-1_16
A. Satoh, S. Morioka, K. Takano, S. Munetoh, A compact Rijndael hardware Architecture with S-Box Optimization, in Advances in Cryptology. ed. by C. Boyd (Springer Berlin Heidelberg, Berlin, Heidelberg, 2001), pp.239–254. https://doi.org/10.1007/3-540-45682-1_15
W.-K. Lee, H.J. Seo, S.C. Seo, S.O. Hwang, Efficient implementation of AES-CTR and AES-ECB on GPUs with applications for high-speed FrodoKEM and exhaustive key search. IEEE Trans Circuits Syst II Express Briefs 69(6), 2962–2966 (2022)
A.A. Pammu, W.-G. Ho, N.K.Z. Lwin, K.-S. Chong, B.-H. Gwee, A high throughput and secure authentication-encryption AES-CCM algorithm on asynchronous multicore processor. IEEE Trans Inf Forensics Secur 14(4), 1023–1036 (2019)
Y. Zhou, G.-M. Tang, J.-H. Yang, P.-S. Yu, C. Peng, Logic design and simulation of a 128-b AES encryption accelerator based on rapid single-flux-quantum circuits. IEEE Trans Appl Supercond 31(6), 1–11 (2021)
Hodjat A, Verbauwhede I. A 21.54 Gbits/s fully pipelined AES processor on FPGA. In: 12th annual IEEE symposium on field-programmable custom computing machines, pp. 308–309, 2004.
McLoone M, McCanny JV. High performance single-chip FPGA Rijndael algorithm implementations. In: Cryptographic hardware and embedded systems – CHES 2001, pp. 65–76, 2001.
N. Yu, H.M. Heys, Investigation of compact hardware implementation of the advanced encryption standard. Can Conf Electr Comput Eng 2005, 1069–1072 (2005)
Järvinen K, Tommiska M, Skyttä J. A fully pipelined memoryless 17.8 Gbps AES-128 encryptor. In: ACM/SIGDA eleventh international symposium on field programmable gate arrays, pp. 207–215, 2003.
Chang C-J, Huang C-W, Tai H-Y, Lin M-Y. 8-bit AES implementation in FPGA by multiplexing 32-bit AES operation. In: The first international symposium on data, privacy, and E-commerce (ISDPE 2007), pp. 505–507, 2007.
J. Wolkerstorfer, E. Oswald, M. Lamberger, An ASIC implementation of the AES SBoxes, in Topics in cryptology — CT-RSA 2002. ed. by B. Preneel (Springer Berlin Heidelberg, Berlin, Heidelberg, 2002), pp.67–78. https://doi.org/10.1007/3-540-45760-7_6
H. Li, J. Li, A new compact architecture for AES with optimized ShiftRows operation. In: Proceedings of 2007 IEEE international symposium on circuits and systems, pp. 1851–1854, New Orleans, USA, May 27–30, 2007.
K. Li, H. Li, An efficient and compact subpipelined s-box architecture for AES. In: Proceedings of the ISCA 2nd international conference on advanced computing and communications, pp 45–49, Los Angeles, USA, 2012.
V. Fischer, M. Drutarovsky, P. Chodowiec, F. Gramain, InvMixColumn decomposition and multilevel resource sharing in AES implementations. IEEE Trans VLSI Syst 13(8), 989–992 (2005). https://doi.org/10.1109/TVLSI.2005.853606
P. Bulens, F.-X. Standaert, J.-J. Quisquater, P. Pellegrin, G. Rouvroy, Implementation of the AES-128 on virtex-5 FPGAs, in Progress in cryptology – AFRICACRYPT 2008. ed. by S. Vaudenay (Springer Berlin Heidelberg, Berlin, Heidelberg, 2008), pp.16–26. https://doi.org/10.1007/978-3-540-68164-9_2


Author information

Authors and Affiliations

Department of Mathematics and Computer Science, University of Lethbridge, Lethbridge, AB, Canada
Ke Li, Hua Li & Graeme Mund


Contributions

KL and HL proposed the reconfigurable and compact AES architecture for encryption and decryption. GM implemented the design with Xilinx FPGA. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hua Li.

Ethics declarations

Competing interests

There are no conflict/competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Li, K., Li, H. & Mund, G. A reconfigurable and compact subpipelined architecture for AES encryption and decryption. EURASIP J. Adv. Signal Process. 2023, 5 (2023). https://doi.org/10.1186/s13634-022-00963-3


Received: 28 July 2022
Accepted: 14 December 2022
Published: 09 January 2023
DOI: https://doi.org/10.1186/s13634-022-00963-3
