A reconfigurable and compact subpipelined architecture for AES encryption and decryption

AES is used in many applications to provide data confidentiality. This paper presents a new 32-bit reconfigurable and compact architecture for AES encryption and decryption, implemented in a non-BRAM FPGA. It can be reconfigured for different key sizes, which gives users the flexibility to apply AES in various application environments. The proposed design employs a single-round architecture and subpipelining to minimize the hardware cost. The fully composite field GF((2^4)^2)-based encryption/decryption and keyschedule lead to lower hardware complexity and efficient subpipelining for the 32-bit data path. In addition, a new subpipelined on-the-fly keyschedule over the composite field GF((2^4)^2) is proposed for all standard key sizes (128-, 192-, 256-bit), which generates the roundkeys simultaneously and efficiently. This feature is useful when the main key changes, since AES is a symmetric-key cipher and the session key usually changes frequently. The proposed design combines relatively high throughput with low hardware cost: it achieves throughputs of 375 Mbits/s with a 128-bit key, 318 Mbits/s with a 192-bit key and 275 Mbits/s with a 256-bit key on a Virtex XC4VSX25-12, with a total of 1766 slices. The proposed reconfigurable and compact AES architecture can be efficiently applied in computing-restricted environments such as wireless and embedded devices.

throughput, but it incurs much higher area and energy costs, which makes it suitable only for high-end applications. Another approach is to implement only a single-round unit and reuse the same unit in different rounds.
In this paper, a compact and reconfigurable design of AES with low hardware cost and adequate throughput is proposed and implemented in a non-BRAM FPGA. This design applies a 32-bit single-round unit, which costs much less hardware area than the 128-bit fully unrolled schemes. In order to reduce the hardware complexity further, we convert the arithmetic operations of AES from the field GF(2^8) to the field GF((2^4)^2). Unlike the previous designs in [6,8,12,16], where partial-composite field AES is applied, we conduct all AES operations in GF((2^4)^2) to minimize the overhead of the isomorphic mapping functions. In our design, only two forward mapping functions and one backward mapping function are used. In addition, subpipelining is applied to improve the throughput/area ratio.
The standard announced by NIST [2] specifies AES as a block cipher with a 128-bit block size and 128-, 192- and 256-bit key sizes. These three key sizes are specified for various security levels. The capability to deal with all key sizes makes reconfigurability an important feature of AES implementations. The previous work of [6,8,12,17–20] applied an on-the-fly key generator to support instant key changing. The design in [8] used a subpipelined keyschedule, but it only supported the 128-bit key size. When a subpipelined on-the-fly keyschedule is employed in an AES implementation, the stages in the keyschedule must be synchronized with the stages in the cipher, because they share the same clock. In this design, we propose a subpipelined on-the-fly keyschedule over the field GF((2^4)^2), which supports all three key sizes.
Secure communication in computing-restricted environments, such as personal digital assistants (PDAs), wireless devices and many other embedded devices, has become more important recently. In order to apply AES in these devices, the AES implementations must be cost efficient. The objective of this research work is to design a reconfigurable and compact AES architecture which can be applied to computing-restricted devices. The proposed architecture can be reconfigured for the three AES key sizes. This is useful when users change the main key, or change the key size for a different security level, because AES is a symmetric-key cipher and the session key usually changes frequently. We also propose a subpipelined on-the-fly keyschedule for the three key sizes, which allows the proposed architecture to be implemented easily on a non-BRAM FPGA.
The remainder of the paper is organized as follows. In Sect. 2, the AES algorithm is introduced. The proposed compact and reconfigurable AES architecture is presented in Sect. 3. Implementation and performance are covered in Sect. 4. Sections 5 and 6 present the conclusion and future work.

AES algorithm
AES is a symmetric block cipher with a block size of 128 bits and three key sizes (128, 192 or 256 bits) [1–3]. The AES parameters depend on the key size (Table 1; the word size is 32 bits). AES runs iteratively on four transformations (inv-/Subbytes, inv-/ShiftRows, inv-/MixColumns and addroundkey) with different sequences in encryption and decryption. Figure 1 illustrates the basic architecture of AES. In the initial round (r = 0),

ShiftRows
This transformation circularly shifts each row of the state to the left on encryption. As in Fig. 2 [1], the top row of the state is denoted row(0), and the bottom row is denoted row(3). ShiftRows performs an i-byte circular left shift on row(i) (i = 0, 1, 2, 3).

MixColumns
This transformation treats each column of the state as a four-term polynomial over GF(2^8) and transforms each column to a new one by multiplying it with a constant polynomial, c(x) = {03}x^3 + {01}x^2 + {01}x + {02}, modulo x^4 + 1. MixColumns can equivalently be written in matrix form.
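As a minimal illustration of the column transformation, the following Python sketch applies MixColumns to a single 4-byte column using the standard AES coefficients ({02}, {03}, {01}, {01} in the first matrix row, rotated for the following rows). The helper names xtime and mix_column are illustrative and do not come from the paper; the sketch works in GF(2^8), not in the composite field used by the proposed hardware.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def mix_column(col):
    """Apply MixColumns to one column [s0, s1, s2, s3] of the state."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ (xtime(s1) ^ s1) ^ s2 ^ s3,  # {02}s0 + {03}s1 + s2 + s3
        s0 ^ xtime(s1) ^ (xtime(s2) ^ s2) ^ s3,  # s0 + {02}s1 + {03}s2 + s3
        s0 ^ s1 ^ xtime(s2) ^ (xtime(s3) ^ s3),  # s0 + s1 + {02}s2 + {03}s3
        (xtime(s0) ^ s0) ^ s1 ^ s2 ^ xtime(s3),  # {03}s0 + s1 + s2 + {02}s3
    ]


if __name__ == "__main__":
    print([hex(v) for v in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Because every output byte depends on all four bytes of one column, a 32-bit data path can process exactly one column per clock cycle.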

Addroundkey
The addroundkey is a simple logical XOR of the current state with a roundkey which is generated by the keyschedule.

32-bit subpipelined reconfigurable and compact architecture for AES
In this section, the 32-bit reconfigurable and compact AES architecture is proposed. In our design, the data path is 32 bits wide; that is, each operation (for example, the S-Box) is applied four times to process the 128 bits of one plaintext block. The subpipelined on-the-fly keyschedule for different key sizes is also presented to provide the roundkeys simultaneously and efficiently. In addition, the equivalent cipher [1] is adopted so that encryption and decryption share the same data flow and the reusable units.

32-bit single-round unit
The roll-unfolded architecture is widely used to achieve high throughput. It conducts multiple rounds on one block by implementing more than one round unit in hardware. The more round units the architecture includes, the higher the hardware cost. The alternative scheme, called the single-round unit architecture, can be applied to reduce the hardware complexity. Instead of unfolding all the round units in the device, it implements a single round unit, which costs approximately 1/Nr of the area of the unfolded scheme.
We propose a 32-bit single-round unit for a compact AES architecture. It iterates four times to perform one round on a 128-bit block, processing 32 bits per iteration.

Full composite field architecture with keyschedule
Many high-end FPGA devices possess Block-RAMs (BRAMs), which are efficient for the implementation of the S-Box. The S-Box, also referred to as Subbytes, is the most important and complicated operation in both the encryptor/decryptor and keyschedule modules. However, BRAM-based designs cannot be implemented in low-cost devices which do not have BRAMs. An alternative approach for the S-Box implementation is to use combinational logic. However, this method may lead to high hardware complexity because of the mathematical operations of AES over the finite field GF(2^8).
In Fig. 4a, each round needs an isomorphic mapping function (MAP) from GF(2^8) to GF((2^4)^2) before the S-Box, and the inverse mapping (MAP^-1) afterwards. If the key size is 128 bits, the S-Box is applied 10 times to the plaintext and to the cipherkey, which means that 20 MAPs and 20 MAP^-1s are needed for the encryption of 128-bit data. In order to save the cost of MAP and MAP^-1, we propose a new 32-bit complete composite field approach (Fig. 4b). The field GF((2^4)^2) is used in all transformations in the encryptor/decryptor and the keyschedule. As illustrated in Fig. 4b, one MAP and one MAP^-1 are applied in encryption, and one MAP is applied in the keyschedule. This is a constant overhead which is not affected by the number of rounds.
In Eq. 6, a is an element of the field GF(2^8). MAP(a) converts a to its isomorphic element in GF((2^4)^2), which is represented as a_h x + a_l.
In Eq. 7, a_h x + a_l is an element of the field GF((2^4)^2). MAP^-1(a_h x + a_l) converts a_h x + a_l to its isomorphic element in GF(2^8), which is represented as a.
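The following Python sketch shows one way such a pair of mappings can be constructed and checked. It is only an illustration under stated assumptions: GF(2^4) is taken modulo x^4 + x + 1, the extension polynomial is taken as y^2 + y + {e}, a composite-field byte packs a_h in the high nibble and a_l in the low nibble, and the mapping is built from the smallest root of the AES polynomial found by search. The paper's Eqs. 6 and 7 may correspond to a different root and therefore to different matrices; all function names here are hypothetical.

```python
LAMBDA = 0xE  # assumed constant of the extension polynomial y^2 + y + {e}


def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed field polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def cf_mul(p, q):
    """Multiply two GF((2^4)^2) elements packed as (a_h << 4) | a_l."""
    ph, pl, qh, ql = p >> 4, p & 0xF, q >> 4, q & 0xF
    hh = gf16_mul(ph, qh)
    high = hh ^ gf16_mul(ph, ql) ^ gf16_mul(pl, qh)  # uses y^2 = y + LAMBDA
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(pl, ql)
    return (high << 4) | low


def find_root():
    """Return the smallest composite-field root of x^8 + x^4 + x^3 + x + 1."""
    for beta in range(2, 256):
        powers = [1]
        for _ in range(8):
            powers.append(cf_mul(powers[-1], beta))
        if powers[8] ^ powers[4] ^ powers[3] ^ powers[1] ^ powers[0] == 0:
            return beta
    raise ValueError("no root found; the field assumptions are inconsistent")


def build_maps():
    """Build lookup tables for MAP (GF(2^8) -> GF((2^4)^2)) and its inverse."""
    beta = find_root()
    basis = [1]
    for _ in range(7):
        basis.append(cf_mul(basis[-1], beta))  # image of x^i is beta^i
    forward = []
    for a in range(256):
        image = 0
        for i in range(8):
            if (a >> i) & 1:
                image ^= basis[i]
        forward.append(image)
    backward = [0] * 256
    for a, image in enumerate(forward):
        backward[image] = a
    return forward, backward


MAP, MAP_INV = build_maps()
```

The defining property is that multiplication commutes with the mapping, MAP(a·b) = MAP(a)·MAP(b), which is what allows every round operation to stay in GF((2^4)^2) between a single MAP at the input and a single MAP^-1 at the output.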

Subpipelined encryptor/decryptor and keyschedule
Pipelining is applied in the designs to optimize speed/area ratio of AES. By inserting registers in combinational logic circuits, multiple blocks of hardware are running simultaneously. The frequency of the design is determined by the maximum delay between registers. We reduce the maximum delay and increase the frequency by optimizing the balance between stages. A 32-bit single-round subpipelined architecture in full composite field is proposed where one round unit is implemented and subpipelined into eight substages. To generate the roundkeys synchronously, we present an on-the-fly keyschedule. The encryption/decryption unit and the key expansion unit share the same clock which leads to the fact that the clock frequency is determined by the maximum delay in both units. This makes the balance of substage in keyschedule as important as in encryptor/decryptor. We propose a new subpipelined keyschedule on composite field for all standard key sizes. The most costly part of keyschedule is still S-Box. We divide it into the same substages as in encryptor/decryptor.

Double-block subpipelined architecture
The proposed architecture for the encryptor is illustrated in Fig. 5. Decryption can be easily implemented by the equivalent cipher [1]. Eight 32-bit registers (four in ShiftRows, three in Subbytes and one between Subbytes and MixColumns) are used to cut one round unit into eight substages, which leads to an initial delay of eight clock cycles before the first 32-bit ciphertext word is generated. The clk counter is a clock counter register generated in the keyschedule. It is used to synchronize the encryptor/decryptor and the keyschedule. We use a double-block (block A and B) data flow in our subpipelined architecture. Figure 5a illustrates the subpipelining in the ShiftRows operation, and Fig. 5b shows the subpipelining in the Subbytes operation. The mappings from GF(2^8) to GF((2^4)^2) are only required once, after the inputs of plaintext and cipherkey. The inverse mapping (GF((2^4)^2) to GF(2^8)) is applied to the final output in order to get the ciphertext.
The 3-to-1 multiplexer ("mul") is controlled by the clk counter:
• Case a: In the initial round, where 0 ≤ clk counter < 8, the 128-bit plaintext is mapped (MAP) into GF((2^4)^2) and XORed with the corresponding roundkey in four clock cycles, 32 bits at a time.
The detailed operations of ShiftRows, Subbytes and MixColumns are presented in the following.

ShiftRows
We use our proposed ShiftRows operation [22] in the design. It includes sixteen 8-bit registers and three 2-to-1 multiplexers. The block of data is shifted column by column. Two blocks of data are processed in the pipeline.
Our ShiftRows operation is designed in a column fashion (Fig. 6). In the architecture, the data (32 bits) are shifted in the order of columns instead of rows. Each column is composed of four shift registers, and each register has 8 bits. By transforming the ShiftRows operation to a column fashion operation, we make the design of the MixColumns operation easier, since all the data in one column are required in the MixColumns operation.
The following is the ShiftRows procedure for encryption:
(1) First row: No shift. We just let the data flow through.
(2) Second row: Circular left shift operation. In this case, we connect the output of register R1C2 and the output of R1C3 to a multiplexer in order to select the output.
(3) Third row: Switch data. Switch the data between the first element and the third element, and between the second element and the fourth element in the row. The outputs of R2C1 and R2C3 are connected to a multiplexer.
(4) Fourth row: Circular right shift operation. Similar to the case of the second row, we connect the output of register R3C0 and the output of R3C3 to a multiplexer.
Similarly, we can derive the procedure for the inverse ShiftRows (Inv-ShiftRows) operation:
(1) First row: No shift.
(2) Second row: Circular right shift operation. We connect the output of register R1C0 and the output of R1C3 to a multiplexer.
(3) Third row: Switch data. Same as the operation in ShiftRows for encryption.
(4) Fourth row: Circular left shift operation. We connect the output of register R3C2 and the output of R3C3 to a multiplexer in order to select the output.
The multiplexers are controlled by some clock counters and the encryption/decryption signals.
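The per-row procedures above can be checked with a small behavioural model. The Python sketch below is not a model of the register and multiplexer network in Fig. 6; it only applies the four row operations to a 4x4 state (a list of rows, an assumed representation) and then reads the result out column by column, as a 32-bit data path would. The function names are illustrative.

```python
def rotate_left(row, n):
    """Circularly shift a 4-element row left by n positions (n may be negative)."""
    n %= len(row)
    return row[n:] + row[:n]


def shift_rows_by_row(state, decrypt=False):
    """Apply the four row procedures; decrypt=True gives Inv-ShiftRows."""
    r0, r1, r2, r3 = state
    switched = [r2[2], r2[3], r2[0], r2[1]]  # swap elements 0<->2 and 1<->3
    if decrypt:
        # second row: circular right shift; fourth row: circular left shift
        return [r0[:], rotate_left(r1, -1), switched, rotate_left(r3, 1)]
    # second row: circular left shift; fourth row: circular right shift
    return [r0[:], rotate_left(r1, 1), switched, rotate_left(r3, -1)]


def columns(state):
    """Return the state as four columns, one 32-bit word per clock cycle."""
    return [[state[r][c] for r in range(4)] for c in range(4)]


if __name__ == "__main__":
    state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    for word in columns(shift_rows_by_row(state)):
        print(word)
```

A circular right shift by one byte on the fourth row is the same as the 3-byte circular left shift that the AES definition prescribes for row(3), so the procedure matches the standard transformation.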

Subpipelined Subbytes
The key step of Subbytes is the calculation of the multiplicative inverse. Figure 7 illustrates the architecture of Subbytes proposed in [8]. It uses multiplication in GF(2^4)^2 three times. It also needs one inversion (x^-1), one constant multiplier with {e} (×e, where {e} is in hexadecimal notation, i.e. '1110' in binary notation), one squarer and two 4-bit XORs (⊕).
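To make the operation count concrete, the following Python sketch computes the multiplicative inverse of an element a_h y + a_l using exactly three GF(2^4) multiplications, one GF(2^4) inversion, one squaring, one constant multiplication by {e} and two 4-bit XORs. It assumes GF(2^4) modulo x^4 + x + 1 and the extension polynomial y^2 + y + {e}; these polynomials, the packing of a_h into the high nibble, and the function names are assumptions for illustration and are not taken from Fig. 7.

```python
LAMBDA = 0xE  # the constant {e}, assuming the polynomial y^2 + y + {e}


def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed field polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def gf16_inv(a):
    """Inverse in GF(2^4) by exhaustive search; 0 is mapped to 0."""
    for c in range(1, 16):
        if gf16_mul(a, c) == 1:
            return c
    return 0


def cf_mul(p, q):
    """Reference multiplication in GF((2^4)^2), used only to check the inverse."""
    ph, pl, qh, ql = p >> 4, p & 0xF, q >> 4, q & 0xF
    hh = gf16_mul(ph, qh)
    high = hh ^ gf16_mul(ph, ql) ^ gf16_mul(pl, qh)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(pl, ql)
    return (high << 4) | low


def cf_inverse(v):
    """Invert a_h*y + a_l, packed as (a_h << 4) | a_l; 0 is mapped to 0."""
    ah, al = v >> 4, v & 0xF
    s = ah ^ al                                # first 4-bit XOR
    sq_e = gf16_mul(gf16_mul(ah, ah), LAMBDA)  # squarer, then multiplier by {e}
    d = sq_e ^ gf16_mul(s, al)                 # first multiplication, second XOR
    d_inv = gf16_inv(d)                        # inversion in GF(2^4)
    high = gf16_mul(ah, d_inv)                 # second multiplication
    low = gf16_mul(s, d_inv)                   # third multiplication
    return (high << 4) | low
```

The value d is the product of the element with its conjugate a_h y + (a_h ⊕ a_l), which always lies in GF(2^4); this is why only a 4-bit inversion is needed.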
We proposed a 32-bit subpipelined compact S-Box architecture in the composite field GF(2^4)^2 with balanced substages and efficient performance [23]. Consider x, y, z ∈ GF(2^4), where x, y and z are represented in binary notation with bits x0 to x3, y0 to y3 and z0 to z3. Let a, b, c, d, e and f be 1-bit values, each equal to 0 or 1. ⊕ stands for the XOR operation, and x0y1 means x0 ∧ y1. Equations 8, 9, 10 and 11 [21] are used to calculate squaring, constant multiplication with {e}, multiplication and multiplicative inverse. In our design, which is illustrated in Fig. 5, Subbytes should be cut into four substages. The key to efficient subpipelining is to balance the delays of these substages.
We derive a new Eq. 12 from Eq. 11 to reduce the delay caused by x^-1. Equation 12 is derived in three steps:
(1) In Eq. 11, replace "a" by its expression:
(2) The expressions in step 1 can be equally changed to:
According to Eq. 12, we design the logic circuit illustrated in Fig. 8 to perform x^-1 over GF(2^4)^2. Besides the multiplicative inversion, the other operations in Fig. 7 are the three multiplications (×1, ×2 and ×3). In order to decrease the maximum delay caused by multiplication, we separate each multiplication into two steps and put each step in a different substage. The registers between the substages store the result of the first step of the multiplication and pass it to the second step. We decompose these three multipliers in two different manners (AB-type and MN-type) to achieve the best balance.

AB-type
The AB-type multiplication is based on Eq. 13, which is derived from Eq. 10.
Step A calculates the value of all the binomials; Step B conducts the XOR of every four values to generate z0, z1, z2 and z3. A register is inserted between Step A and Step B to store p0, p1, …, p15. The multiplication "×1" in Fig. 7 is separated as ×1A and ×1B in Fig. 8.

MN-type
The MN-type multiplication is based on Eq. 14, which is also derived from Eq. 10.
Step M creates the values of a, b and c; Step N implements the rest of Eq. 10. A register is inserted between Step M and Step N to store a, b and c. The multiplications "×2" and "×3" in Fig. 7 are separated as ×2M and ×2N, ×3M and ×3N in Fig. 8.
According to Eq. 1, we convert y to GF((2^4)^2) and also represent x′ in GF((2^4)^2) to derive the affine transformation in GF((2^4)^2).
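The two decompositions can be illustrated with a bit-level Python sketch of a GF(2^4) multiplier. It assumes the field polynomial x^4 + x + 1 and the intermediate terms a = x0 ⊕ x3, b = x2 ⊕ x3 and c = x1 ⊕ x2, which follow from reducing the product modulo that polynomial; the exact form and ordering of p0 to p15 in Eq. 13, and of a, b and c in Eq. 14, may differ in the paper. The function names are hypothetical.

```python
def bits(v):
    """Return the four bits of v, least significant first."""
    return [(v >> i) & 1 for i in range(4)]


def step_m(x):
    """MN-type, first substage: compute the shared XOR terms a, b, c."""
    x0, x1, x2, x3 = bits(x)
    return x0 ^ x3, x2 ^ x3, x1 ^ x2


def products(x, y, abc):
    """Sixteen AND terms, grouped so that each group of four XORs to one z bit."""
    x0, x1, x2, x3 = bits(x)
    y0, y1, y2, y3 = bits(y)
    a, b, c = abc
    return [
        x0 & y0, x3 & y1, x2 & y2, x1 & y3,  # terms of z0
        x1 & y0, a & y1, b & y2, c & y3,     # terms of z1
        x2 & y0, x1 & y1, a & y2, b & y3,    # terms of z2
        x3 & y0, x2 & y1, x1 & y2, a & y3,   # terms of z3
    ]


def xor_groups(p):
    """XOR every four terms to produce z0..z3 and pack them into one value."""
    z = 0
    for i in range(4):
        g = p[4 * i: 4 * i + 4]
        z |= (g[0] ^ g[1] ^ g[2] ^ g[3]) << i
    return z


def step_a(x, y):
    """AB-type, first substage: everything up to the 16 registered AND terms."""
    return products(x, y, step_m(x))


def step_b(p):
    """AB-type, second substage: only the final XOR trees."""
    return xor_groups(p)


def step_n(x, y, abc):
    """MN-type, second substage: AND terms and XOR trees after a, b, c."""
    return xor_groups(products(x, y, abc))


def gf16_mul(a, b):
    """Reference shift-and-add multiplication modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r
```

In this sketch the AB-type split registers sixteen 1-bit values and leaves a short second stage, while the MN-type split registers only a, b and c (alongside the operands, which this model simply passes through) and leaves more logic for the second stage. Choosing between the two per multiplier is what lets the substage delays be balanced.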

Subpipelined keyschedule
There are two approaches to implement keyschedule: (1) pre-calculated keyschedule and (2) on-the-fly keyschedule. In the pre-calculated keyschedule, the (Nr + 1) 128-bit roundkeys are generated before the encryption or decryption begins and stored in the memory. The addroundkey operation accesses the roundkeys by referring to the corresponding address in the memory. The advantage of this approach is that the keyschedule only needs to be performed once; however, the drawbacks include: (i) The (Nr + 1) roundkeys cost (Nr + 1) × 128 bits memory space; (ii) The cipherkey should not change frequently. Every time it changes, the roundkeys must be recalculated.
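As a quick worked example of drawback (i), the short Python sketch below evaluates the (Nr + 1) × 128-bit storage requirement of a pre-calculated keyschedule for the three key sizes, using the standard round counts Nr = 10, 12 and 14. The names are illustrative.

```python
# Standard AES round counts per key size in bits.
NR = {128: 10, 192: 12, 256: 14}


def roundkey_memory_bits(key_bits):
    """Memory needed to store all (Nr + 1) 128-bit roundkeys."""
    return (NR[key_bits] + 1) * 128


if __name__ == "__main__":
    for key_bits in (128, 192, 256):
        print(key_bits, roundkey_memory_bits(key_bits))
```

This gives 1408, 1664 and 1920 bits for one block's roundkeys, which is the storage that an on-the-fly keyschedule avoids.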
In this paper, we propose a new 32-bit pipelined on-the-fly keyschedule in the fully composite field GF((2^4)^2) for 128-, 192- and 256-bit key sizes, where each 128-bit roundkey is generated every four clock cycles (32 bits at each clock). The following shows the 32-bit roundkeys at each clock cycle (KA(i) and KB(i) represent the 32-bit roundkeys for block A and block B, 0 ≤ i ≤ 4Nr + 3).
The roundkeys for block A: roundkey
Because we use the on-the-fly keyschedule, the keyschedule and the encryptor/decryptor share the same clock, and the overall frequency is determined by the maximum delay in both modules. To achieve efficient pipelining, proper division of the keyschedule is as important as in the encryptor/decryptor. Subword is the most costly component in the keyschedule. In order to balance the delay in both modules, we implement subword in the same way as Subbytes in the encryptor/decryptor. All mathematical operations in the keyschedule are transformed into the field GF((2^4)^2). Subword shares the same structure as Subbytes.
Xorrcon is a simple XOR operation with a round constant, which is initially {01} and is multiplied by {02} at each keyschedule round. A keyschedule round is defined as follows. It begins when clk counter = 0. If the key size is 128 bits, the keyschedule round cycle is four; if the key size is 192 bits, the keyschedule round cycle is six; if the key size is 256 bits, the keyschedule round cycle is eight. In GF((2^4)^2), {01} is still {01} and {02} is mapped to {26}. We can use Eq. 28 to generate the round constant for each keyschedule round.
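The round-constant recurrence can be sketched in a few lines of Python. For simplicity this sketch stays in GF(2^8), where multiplication by {02} is the usual xtime operation; in the proposed design the same recurrence would be carried out in GF((2^4)^2) with the mapped constant. The helper names, and the formula used to count how many constants each key size consumes, are illustrative assumptions rather than the paper's Eq. 28.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def round_constants(count):
    """Return the first `count` round constants: {01}, {02}, {04}, ..."""
    rc, out = 0x01, []
    for _ in range(count):
        out.append(rc)
        rc = xtime(rc)
    return out


# Keyschedule round cycle in clock cycles (equal to Nk, the key length in words).
ROUND_CYCLE = {128: 4, 192: 6, 256: 8}
NR = {128: 10, 192: 12, 256: 14}


def keyschedule_rounds(key_bits):
    """Number of keyschedule rounds needed to produce 4 * (Nr + 1) words."""
    nk = ROUND_CYCLE[key_bits]
    remaining = 4 * (NR[key_bits] + 1) - nk  # words beyond the cipherkey itself
    return -(-remaining // nk)               # ceiling division


if __name__ == "__main__":
    print([hex(rc) for rc in round_constants(10)])
    print({k: keyschedule_rounds(k) for k in (128, 192, 256)})
```

Because a keyschedule round is longer for the larger keys, fewer round constants are consumed: ten for a 128-bit key, eight for a 192-bit key and seven for a 256-bit key.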
The proposed keyschedule has three key size options: Key128, Key192 and Key256. The notation of roundkey32 stands for 32-bit roundkey for each clock cycle, roundkey stands for 128-bit roundkey for a round of AES.
For decryption, the roundkey32 values must be created in the reverse order. The last Nk roundkey32 values from encryption are stored in a 256-bit register to be used as the initial decipherkey for decryption. For a given cipherkey, at least one encryption operation must therefore be performed in order to store the final Nk roundkey32 values for use during decryption. Multiplexers are then used to select between the cipherkey and the decipherkey, based on encryption or decryption mode, respectively. Since the decipherkey roundkey32 values are already in GF((2^4)^2), they do not pass through the MAP operation. Figure 11 illustrates the keyschedule architecture. The multiplexers mul1 and mul2 are used to reconfigure the pipeline for each of the three key sizes.
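The reason the last Nk words suffice for decryption is that the key expansion recurrence can be run backwards. The Python sketch below illustrates this for the 128-bit key size only (Nk = 4), in ordinary GF(2^8) arithmetic rather than the composite field, and as a software model rather than the pipelined hardware of Fig. 11. All function names are hypothetical; the S-Box is computed from its standard definition (field inverse followed by the affine transformation).

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def build_sbox():
    """Build the AES S-Box: multiplicative inverse, then affine transformation."""
    sbox = []
    for v in range(256):
        inv = next((c for c in range(1, 256) if gf256_mul(v, c) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF  # rotate left by k
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def key_g(word, rcon):
    """rotword, subword and xorrcon applied to one 32-bit word."""
    rotated = ((word << 8) | (word >> 24)) & 0xFFFFFFFF
    substituted = sum(SBOX[(rotated >> s) & 0xFF] << s for s in (0, 8, 16, 24))
    return substituted ^ (rcon << 24)


def expand_forward(key_words):
    """Generate all 44 roundkey words from the four cipherkey words."""
    w = list(key_words)
    for i in range(4, 44):
        temp = key_g(w[i - 1], RCON[i // 4 - 1]) if i % 4 == 0 else w[i - 1]
        w.append(w[i - 4] ^ temp)
    return w


def expand_backward(last_words):
    """Recover all 44 words from the last four, in reverse order."""
    w = [0] * 40 + list(last_words)
    for i in range(43, 3, -1):
        temp = key_g(w[i - 1], RCON[i // 4 - 1]) if i % 4 == 0 else w[i - 1]
        w[i - 4] = w[i] ^ temp  # inverse of w[i] = w[i - 4] ^ temp
    return w


if __name__ == "__main__":
    key = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
    words = expand_forward(key)
    assert expand_backward(words[40:]) == words
```

Each backward step needs only words that are already known, so the reverse schedule can also be generated on the fly, one 32-bit word at a time.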
SA, SB, SC and SD are the four sections of the subword operation with interspersed registers. RW is the outcome of rotword. RC generates the round constant for xorrcon in GF((2^4)^2). Multiplexer mul3 is used to select the correct previous roundkey32 as input to the subword operation. Multiplexer mul4 selects the appropriate calculated result to serve as the next roundkey32. Table 2 summarizes the reconfigurable control of the multiplexers to generate the three key sizes (• represents that the multiplexer is enabled for the corresponding key size, and the numbers represent the input selections of the multiplexer depending on the corresponding clock cycles).
When the key size is 128 bits, the encryptor round number is ten. The two blocks A and B need 22 roundkeys. In our design, the first step is to map (MAP) the cipherkey from GF(2^8) to GF((2^4)^2). After that, the keyschedule performs its isomorphic functions in GF((2^4)^2). The output of the keyschedule is a sequence of roundkey32 values represented in GF((2^4)^2). This is exactly the format required in encryption, where the message blocks are also represented in GF((2^4)^2). No inverse MAP is required in the keyschedule. We place three registers among the four substages of subword.

Implementation performance and comparison
Many studies of hardware AES implementations have been published. Table 3 summarizes the functions provided by different FPGA implementations.
We do not use BRAM in our design in order to make the architecture suitable for wireless and embedded devices. Our proposed architecture has been simulated and synthesized with Xilinx Synthesis Technology (XST) ISE 10, and implemented on a Xilinx Virtex-4 device. Based on the synthesis results, we also optimized the delay between the different stages of our design to improve the performance. Table 4 presents the synthesis results on a Virtex-4 XC4VSX25 and the performance comparison. Compared with the previous architectures, our design focuses on low-cost, non-BRAM implementations. Pramstaller et al. proposed a compact design costing 1125 slices in [5], with throughputs of 215 Mbps for 128-bit, 180 Mbps for 192-bit and 156 Mbps for 256-bit keys at a maximum frequency of 161 MHz. However, the round keys were pre-calculated by the key generator, and RAM was required to store those keys. We generate the round keys on-the-fly, which is useful and efficient when the key changes (AES is a symmetric-key cipher, and the session key usually changes frequently). In addition, our throughput is higher for each of the three key sizes (375, 318 and 275 Mbits/s, respectively). Furthermore, we propose a new subpipelined keyschedule which supports all three key sizes (128, 192 and 256 bits). The delays between the stages in encryption/decryption and the keyschedule have been optimized in our architecture. We also present a new 32-bit complete composite field approach, where GF((2^4)^2) field arithmetic is used in all transformations in the encryptor/decryptor and the keyschedule, which removes most of the cost of mapping between GF(2^8) and GF((2^4)^2). Finally, the 32-bit data path in our design reduces the hardware cost and can be efficiently applied in resource-restricted computing environments, such as wireless devices and embedded devices.

Conclusion
AES is an important and popular cryptographic algorithm for securing information and data transmission. In this paper, we propose a compact reconfigurable FPGA architecture for the AES implementation. The 32-bit single-round unit design results in low area cost, which makes it suitable for low-end devices. The combinational logic approach to the S-Box eliminates the need for BRAMs.
In our architecture, fully composite field GF((2^4)^2) arithmetic is applied in all transformations in encryption/decryption and the keyschedule, which keeps the mapping cost constant. That is, only one MAP and one MAP^-1 are applied in encryption/decryption, and one MAP is applied in the keyschedule. The full composite field-based design decreases the hardware complexity of the arithmetic operations in AES. In addition, we apply subpipelining in both the encryptor/decryptor and keyschedule modules to optimize the speed/area ratio. The capability to deal with three key sizes makes our design an efficient reconfigurable architecture for AES. The performance comparison indicates that the proposed AES architecture achieves better performance than previous work. In conclusion, the proposed compact and reconfigurable AES architecture has high throughput and low area cost, which is useful in computing-restricted environments and wireless devices.

Future work
In the future, we will synthesize our FPGA prototype, optimize the design and implement it in VLSI. We believe that the performance of the proposed architecture could be increased with current VLSI design tools and technology. We also plan to develop a new reconfigurable and efficient AES encryption/decryption chip which can be easily embedded into wireless and computing-restricted devices to provide security services.