Deeply Pipelined AES Implementation on FPGA
AES is composed of five main components: SubBytes, ShiftRows, MixColumn, AddRoundKey, and KeyExpansion. The authors focused only on SubBytes & KeyExpansion components with respect to encryption. SubBytes is responsible of replacing a byte by another one to introduce dissimilarity between the original text and the cipher text. AES algorithm for encryption and decryption can be shown in Figure 1.

AES Components
- SubBytes
There are multiple ways to implement the SubBytes on FPGA such as LUT, BRAM and Mathematically using Addition & Multiplication in Galois Fields. In this research, we are focusing only on the use of Galois Field to implement SubBytes.
The SubBytes proposed by the authors is based on the multiplicative inverse in the Galois Field GF (28) of the input byte. The authors’ approach is a 7-stage pipelined calculation of the SubBytes is shown in Figure 2 [2]. There are 8 sub components beside registers to store intermediate values. The first components are Delta (δ) and Delta Inverse (δ-1) which are responsible to transform the input from GF (28) to GF (24) and vice versa. In encryption, we need to take the output of δ-1 and perform Affine transformation, we can combine both matrices into one, similarly with Affine Inverse and Delta in decryption. Then, we need a component that multiplies two inputs by each other in GF (24) denoted by (×16). There are two other components that are special case of the previous multiplier, they are (×2) which square the input (multiply it by itself), and (×λ) which multiplies the input by a fixed 4-bit value (1100)2. Another multiplier is needed for GF(22) denoted by (×). There is a component derived from the previous multiplier (×φ) which multiplies the input by a fixed 2-bit input (10)2. Lastly, we have the inverse component (× −1) which computes the inverse of input in the GF(24). The internal of the above components can be found in Figure 3 [3].
Figure 2: Proposed 7-stage Pipeline SubByte. Figure 3: SubBytes Components. - MixColumn
The MixColumn and InvMixColumn operates at the column level of the input (Figure 9). The authors did not improve on the MixColumn component. Hence, I used a ready-made one from open-source online file [4]
- ShiftRows
The ShiftRows and InvShiftRows operates at the Row level of the input (Figure 10). ShiftRows is just rewiring component, there are no improvements to be done here. I also took this module from the online source file [4].
- Add Round Key
AddRoundKey stage is simply Xoring the Round_Text with the corresponding Round_Key
- Key Expansion
Key Expansion stage uses the SubBytes internally. The authors suggested to create 9-stage pipeline key expander using the previous 7-stage pipelined SubBytes (Figure2) [2]. Rcon is a constant byte Xored with the most significant byte of the word only. The first 10 Rcon values used in AES are {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36}. For decryption, we need to use the Round Keys in reverse order (i.e. from 10 to 0). In order to do that, we have two options. In the first place, we start with the initial key and stall the entire system for 90 cycles. This solution is bad since it harms the system performance very badly. Second solution is to give Round Key 10 as the initial key and reverse the operation of the key expander. This contradicts with the symmetric key algorithm. From Figure 4 below, given the output as (Q1, Q2, Q3, Q4) we can get the initial key (W1, W2, W3, W4) using the following equations:
W4 = Q4 ⊕ Q3
W3 = Q3 ⊕ Q2
W2 = Q2 ⊕ Q1
W1 = Q1 ⊕ [(Rcon(j) ⊕ SubWord(W4)(31: 24)), SubWord(W4)(23 ∶ 0)]
Where j is (10 – decryption_round) Figure 4: Proposed Key Expansion. - Round
Round internal components depend on the Round Number. If the Round number is between 1 and 9, it will have 4 components: SubBytes, ShiftRows, MixColumn, and AddRoundKey, which will take 7, 1, 1 and 1 cycles respectively. Hence, it will take 10 cycles to finish. On the other hand, if it is the last round (i.e. Round number = 10), then it will not have the MixColumn component and will take only 9 cycles to finish. To reduce the number of registers needed, each round has an output called encrypt signal that should be connected to the next round. Figure 5 below shows how the Round entity is pipelined [2].
Figure 5: Pipelined Round. - AES (top-level)
This entity is the interface for the whole algorithm, it consists of 10 instances of the Round entity, 10 instances of Key Expansion, and registers to hold intermediate values. These intermediate values include Round_Text, Round_Key, and encrypt signals. Figure 6 shows the internal components of the AES algorithm.
Figure 6: AES Algorithm Internal Components. Performance Evaluation
In order to evaluate the performance of the proposed solution, we need to count how many cycles each step takes. Table 1 shows how many cycles each component needs for computation and register respectively. The total number of cycles to get the first valid result is: (1 + (9 * (10+1)) + 9) = 109 A more important metric is the average clock per operation (CPO) which is equal to (108 + k))/k, where k is the number of operations. It is clear that as we increase the number of operations (k), CPO gets closer to 1.
Table 1: Component Timing Component Time (cycles)
computation + registerPreRound 0 + 1 Rounds(1-9) 10 + 1 Round(10) 9 + 0 The FPGA board used was Xilinx Virtex-6 XC6VLX550T. The maximum frequency reported by Xilinx software is 462.515 MHz where the critical delay is 2.126 ns. Table 2 compares our design performance with the proposed one[2].
Table 2: Performance Results Design Device Slices Critical Delay Max Freq
(MHz)Throughput
(Gbps)Efficiency
(Mbps/slice)[2] Vertex-6 XC6VLX240T 6361 1.17 849.185 108.69 17.08 Our Design Virtex-6 XC6VLX550T 8258 1.385 721.891 92.42 11.19 Future Work
There are a couple of improvements that the authors did not consider. First, the multiplicative inverse (x-1) in GF (24) uses 3 of (x2) and 2 of (x); however, it is not pipelined. Pipelining this component might reduce the critical delay. Another improvement should be considered is to generate RoundKey(10) from an initial RoundKey(0) in the decryption operation.
References
- [1] T. Wada, "SubBytes Transform circuit for AES Cipher," [Online]. Available: http://www.ie.u- ryukyu.ac.jp/~wada/design04/spec_e.html. [Accessed 04 06 2017].
- [2] S. B. Soufiane Oukili, "High throughput FPGA implementation of Advanced Encryption Standard algorithm," TELKOMNIKA, vol. 15, no. 1, pp. 494-503, 2017.
- [3] S. M. Xinmiao Zhang, "High-Speed VLSI Architectures for the AES Algorithm," IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, pp. 957-967, 2004.
- [4] A. B. S. K. N. A. S. Q. Vipul Joshi, "Hardware and Software Implementation Of Rijndael AES on IXP- 2XXX," [Online]. Available: http://web.csulb.edu/projects/npu/Rijndael_AES/Rinjdael_AES.htm. [Accessed 04 06 2017].
- MixColumn