Exploring .NET Core platform intrinsics: Part 2 - Accelerating AES encryption on ARMv8
Previous posts in the series:
This is part 2 in a series of posts on using .NET Core platform intrinsics. Part 1 discussed the implementation of SHA-256 using specialized instructions available in the ARMv8 Cryptography Extensions. Part 2 will describe the accelerated implementation of AES encryption using hardware instructions that are part of the same instruction set.
AES is probably the most widely used symmetric encryption algorithm today. It’s a block cipher that can encrypt 128-bit blocks of data using the secret key that can be 128, 196, or 256 bits long. In practice we almost always want to encrypt variable-length messages, which is why block ciphers are inseparable from modes of operation such as CBC, CTR, or GCM. In this post I will talk about AES-128 only as a block cipher, without implementing any particular mode of operation (don’t worry, GCM will be the topic of one of the future posts in this series).
AES block cipher
The only thing you should know about AES in most situations is that it’s pseudo-random permutation that takes a 128-bit input, a 128-bit key, and produces a 128-bit encrypted output. On the other hand, if you want to implement it using specialized hardware instructions, you have to know just a little bit more about how the AES black box works internally.
AES is defined in the NIST’s FIPS 197 publication, but we are currently interested only in high-level description of the algorithm, which can be found on Wikipedia (some people also like the Stick Figure Guide to the Advanced Encryption Standard). For those who dislike clicking the links in the middle of a blog post, here’s the pseudocode of the encryption algorithm:
var rounds = 10;
var state = input;
AddRoundKey(state, roundKeys[0]);
for (int i = 1; i < rounds; ++i)
{
SubBytes(state);
ShiftRows(state);
MixColumns(state);
AddRoundKey(state, roundKeys[i]);
}
SubBytes(state);
ShiftRows(state);
AddRoundKey(state, roundKeys[rounds]);
You can see that the core of the algorithm is actually pretty simple (the individual operations are more complicated, but at this moment we don’t care about what they are doing internally). We have 11 different round keys generated from the main key, and four basic operations performed over 10 rounds. First 9 rounds are the same; the last one is slightly different as it does not include the MixColumns step.
Decryption works in a similar fashion. Each transformation has its inverse, so all the decryption algorithm has to do is to perform the inverse operations in the reverse order. Because encryption and decryption are very similar to each other, I’ll only talk about the encryption in the remainder of this post to avoid repeating myself (the decryption function will be available in the GitHub repository accompanying this post).
ARMv8 AES intrinsics
Now that you are familiar with the basic algorithm, let’s see which hardware instructions are available to us:
- AESE (AES single round encryption)
- AESD (AES single round decryption)
- AESMC (AES mix columns)
- AESIMC (AES inverse mix columns)
Looks like almost everything we need is here. AESE combines the AddRoundKey, SubBytes, and ShiftRows steps, and AESMC performs the MixColumns step, believe it or not! The only thing that is missing is the instruction for generating the round keys (Intel has the AESKEYGENASSIST instruction for that purpose), which I will talk about later.
If you were following closely, you might have noticed that because the instructions don’t map one-to-one to AES rounds, we don’t have the last AddRoundKey operation covered. AddRoundKey is just a plain old bitwise XOR operation, which is why we can implement it with the simple VEOR instruction. All of this is best shown in this gorgeous diagram that I personally created:
As I mentioned previously, we are not much interested in the instructions themselves. What we care about are the C language intrinsics for using them, and they are defined like this:
// Performs AES single round encryption
uint8x16_t vaeseq_u8(uint8x16_t data, uint8x16_t key);
// Performs AES single round decryption
uint8x16_t vaesdq_u8(uint8x16_t data, uint8x16_t key);
// Performs AES mix columns
uint8x16_t vaesmcq_u8(uint8x16_t data);
// Performs AES inverse mix columns
uint8x16_t vaesimcq_u8(uint8x16_t data);
The C# API looks pretty much the same:
public static class Aes
{
public static bool IsSupported { get; }
public static Vector128<byte> Decrypt(Vector128<byte> value, Vector128<byte> roundKey);
public static Vector128<byte> Encrypt(Vector128<byte> value, Vector128<byte> roundKey);
public static Vector128<byte> MixColumns(Vector128<byte> value);
public static Vector128<byte> InverseMixColumns(Vector128<byte> value);
}
C# implementation
The first thing we need to do is to expand the secret key into 11 round keys. We don’t have the intrinsics that can help us with that, so we have to implement the key expansion from scratch. The key expansion is not particularly interesting, so I will not talk about it here—I will just use the pre-expanded round keys instead. If you ever need to implement it in C#, the easiest way to do so would be to just copy the definition of GenerateKeyExpansion function from the RijndaelManagedTransform class, which is a part of the .NET Framework standard library.
The code implementing the encryption is relatively short and very easy to follow, so I’ll just show you the entire Encrypt function here:
public void Encrypt(byte[] input, byte[] output)
{
int position = 0;
int left = input.Length;
var key0 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[0 * BlockSize]);
var key1 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[1 * BlockSize]);
var key2 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[2 * BlockSize]);
var key3 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[3 * BlockSize]);
var key4 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[4 * BlockSize]);
var key5 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[5 * BlockSize]);
var key6 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[6 * BlockSize]);
var key7 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[7 * BlockSize]);
var key8 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[8 * BlockSize]);
var key9 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[9 * BlockSize]);
var key10 = Unsafe.ReadUnaligned<Vector128<byte>>(ref enc[10 * BlockSize]);
while (left >= BlockSize)
{
var block = Unsafe.ReadUnaligned<Vector128<byte>>(ref input[position]);
block = Aes.Encrypt(block, key0);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key1);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key2);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key3);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key4);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key5);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key6);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key7);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key8);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key9);
block = Simd.Xor(block, key10);
Unsafe.WriteUnaligned(ref output[position], block);
position += BlockSize;
left -= BlockSize;
}
}
Everything looks pretty ordinary here. We are using the fact that we have plenty of registers available to load all the round keys prior to the actual encryption, so we can process multiple blocks of data much faster, in a single invocation of the function. Each round is just one Encrypt/MixColumn combo, except the last one, which replaces the MixColumns step with the AddRoundKey (implemented with the simple call to Xor function from the Simd class in the Arm64 namespace). Now let’s see how fast we are going!
Performance
The setup for measuring performance was the same as the one from the previous post: Scaleway cloud instance for running the code, BenchmarkDotNet for running the benchmarks, AES implementation from the .NET standard library (which is again using OpenSSL behind the scenes), and BouncyCastle’s managed implementation. This time, though, two different AES classes were available in BouncyCastle: one was slow and safe, and the other one was fast, but not resistant to cache-timing attacks (side-channel attack resistance is one of the main reasons for the existence of AES hardware instructions, because safely implementing AES by hand is incredibly tricky). Without further ado, here are the numbers:
Method | Mean | Error | StdDev |
---|---|---|---|
OpenSsl16B | 845.92 ns | 0.3094 ns | 0.2894 ns |
Intrinsics16B | 69.18 ns | 0.0064 ns | 0.0060 ns |
BouncyCastleSlow16B | 1,236.72 ns | 0.2859 ns | 0.2534 ns |
BouncyCastleFast16B | 1,130.88 ns | 0.3798 ns | 0.3552 ns |
OpenSsl1K | 3,970.90 ns | 0.2851 ns | 0.2527 ns |
Intrinsics1K | 2,560.72 ns | 0.2241 ns | 0.2097 ns |
BouncyCastleSlow1K | 67,591.19 ns | 4.9099 ns | 4.5927 ns |
BouncyCastleFast1K | 59,282.02 ns | 2.9021 ns | 2.7146 ns |
OpenSsl1M | 3,521,992.27 ns | 165.1864 ns | 154.5154 ns |
Intrinsics1M | 2,899,086.42 ns | 176.2553 ns | 164.8693 ns |
BouncyCastleSlow1M | 69,124,217.92 ns | 2,713.7717 ns | 2,538.4635 ns |
BouncyCastleFast1M | 60,573,568.95 ns | 2,857.6861 ns | 2,673.0811 ns |
The results are much more impressive this time. BouncyCastle is orders of magnitude slower, as expected. But the intrinsics implementation is also noticeably better than OpenSSL in each category of inputs: 16B, 1K, and 1M.
Single block comparison may look slightly unfair, though, as it includes big overhead of several function calls and whatnot. Nevertheless, it’s very important when implementing some non-standard modes of operation which use AES block cipher as a basic building block—if we can’t encrypt a single block efficiently, the performance of the whole construction suffers greatly.
Now that the victory is officially declared, are we done here? Not just yet—we can do even better!
Pipelining
Modern processors are complex beasts that can juggle multiple instructions at the same time. Thanks to pipelining, we can have several instructions that take multiple cycles to execute running at the same time, each of them being in the different phase of its execution. How can we take advantage of that in our case? Let’s go back to this snippet of code from our encryption function:
var block = Unsafe.ReadUnaligned<Vector128<byte>>(ref input[position]);
block = Aes.Encrypt(block, key0);
block = Aes.MixColumns(block);
block = Aes.Encrypt(block, key1);
block = Aes.MixColumns(block);
Each Encrypt/MixColumns function call depends on the output of the previous operation, which means the instructions cannot be executed in parallel. We can, however, process multiple input blocks at the same time! For example, if we want to handle four blocks in one loop iteration, we can rearrange the code to look like this:
var block0 = Unsafe.ReadUnaligned<Vector128<byte>>(ref input[position + 0 * BlockSize]);
var block1 = Unsafe.ReadUnaligned<Vector128<byte>>(ref input[position + 1 * BlockSize]);
var block2 = Unsafe.ReadUnaligned<Vector128<byte>>(ref input[position + 2 * BlockSize]);
var block3 = Unsafe.ReadUnaligned<Vector128<byte>>(ref input[position + 3 * BlockSize]);
// Round 1
block0 = Aes.Encrypt(block0, key0);
block1 = Aes.Encrypt(block1, key0);
block2 = Aes.Encrypt(block2, key0);
block3 = Aes.Encrypt(block3, key0);
block0 = Aes.MixColumns(block0);
block1 = Aes.MixColumns(block1);
block2 = Aes.MixColumns(block2);
block3 = Aes.MixColumns(block3);
We are now issuing four independent AESE/AESMC instructions sequentially, which should be much more efficient from the pipelining perspective. The numbers are confirming that hypothesis:
Method | Mean | Error | StdDev |
---|---|---|---|
Intrinsics1K | 2,560.72 ns | 0.2241 ns | 0.2097 ns |
IntrinsicsPipelined1K | 1,919.83 ns | 0.1546 ns | 0.1447 ns |
Intrinsics1M | 2,899,086.42 ns | 176.2553 ns | 164.8693 ns |
IntrinsicsPipelined1M | 2,094,697.83 ns | 149.5739 ns | 139.9115 ns |
The performance boost is clearly visible (although not as much as I have expected). This trick is not always applicable, though. Remember that we always have to combine some mode of operation with the actual block cipher? Well, not all modes can take advantage of encrypting multiple blocks at the same time. However, some modes of operation such as CTR and GCM are inherently parallelizable, which is where this optimization really shines.
How did I come up with the number four for the number of instructions issued at the same time? Performance of each instruction is defined by its latency (number of cycles needed to complete the instruction) and throughput (number of cycles before another instruction of the same type can be issued). You can read the Cortex-A57 Software Optimization Guide to get the rough idea how some instruction will perform on the ARMv8 platform, but ultimately you have to measure and see what works best for you in practice.
Conclusion
You can find the complete C# implementation of both AES encryption and decryption in my ARMv8 AES intrinsics GitHub repository (it currently serves only as a demo, not as a production-ready implementation). In the next post we will take a break from writing the code and explore the various ways to see the instructions emitted by the JIT.