Feature Spotlight: Cryptographic performance on low-end hardware

by Michael Tremer, May 22, 2015


I want to tell you a little story about processors today: those tiny pieces of silicon that are the main ingredient of our computers. Over the last couple of weeks we found out some mind-blowing facts that I personally did not expect and that I want to tell you about.

It all started with our effort to increase cryptographic performance on IPFire systems. We have made many improvements on really high-end hardware. The article about that was a bit tough for most readers, and I got some emails from people who did not make it all the way through, but please bear with me: this post is similar to the last one, but mostly covers the low-end segment of hardware that people very commonly use. Those systems are interesting for us because they are often used as VPN endpoints in branch offices or at home. Traditionally there has not been enough bandwidth to saturate the processors when transferring data through a VPN connection, but as bandwidth has been increasing massively, it is becoming more and more important that those systems can cope with that, too.

It should never be an option to trade security for performance. Therefore we are looking at two cheap and versatile systems here. They are representative of what is commonly used for that task, and we will evaluate where they perform well and where they may have deficits.

The first one is the Fountain Networks – IPFire Prime Box – a tiny system that is based on an Intel Atom processor. The second one is the PC Engines APU1C, which is based on an AMD APU processor.

Excursion: Encryption and Integrity

But first I have to insert a brief excursion into what happens when you send packets through a VPN connection. A VPN does not just encrypt the data, it also cryptographically ensures the integrity of the data so that the recipient can trust that the data has not been changed on its way by an attacker or by a transmission error. For that, so-called hashing algorithms are used. They take a block of data and compute a sort of checksum, called a hash, which is then sent along with the encrypted block of data. The recipient will decrypt the block of data, compute the hash again and compare it with the hash that was sent. If they match, the data has not been changed.

So in summary each piece of data that is sent over the VPN tunnel is cut into blocks of a fixed size. Those are then encrypted and the hash is computed. Those two operations are the most expensive ones. The overhead of the VPN is neglected here as it is not computationally expensive to perform. The receiving end performs the same process in reverse so that the computational power that is required on both ends is always exactly the same.
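
To make the hash-and-compare step a little more tangible, here is a minimal sketch in shell. It assumes that the sha256sum tool from coreutils is available, uses a plain string in place of a real packet, and skips the encryption step entirely. The function name verify is made up for this illustration. A real VPN uses a keyed variant of the hash (an HMAC) so that an attacker cannot simply recompute it, which this sketch does not attempt.

```shell
#!/bin/bash
# Sender side: compute the hash of one block of data.
# A plain string stands in for the payload of a packet (illustration only).
block="payload of one VPN packet"
sent_hash=$(printf '%s' "$block" | sha256sum | cut -d' ' -f1)

# Receiver side: recompute the hash over what arrived and compare it
# with the hash that was sent along. Returns 0 if both match.
verify() {
    actual=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

if verify "$block" "$sent_hash"; then
    echo "block intact"
fi

# One changed byte is enough to produce a different hash.
if ! verify "${block}X" "$sent_hash"; then
    echo "block was modified"
fi
```

The comparison only tells the recipient that something changed, not what changed, which is all a VPN needs in order to drop the packet.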

Benchmarks

The most common cipher is the Advanced Encryption Standard (AES) with key sizes of 256 and 128 bits. I have decided to benchmark the Camellia cipher as well, which is very similar to AES but not used as often.

The most commonly used hash algorithm is SHA1, although it is expected to become unsafe. It is recommended to change to the SHA-2 family, which includes SHA256 and SHA512. For comparison I added MD5, too. MD5 is considered to be broken and should not be used any more at all. However, many proprietary VPN gateways still do not support anything better than MD5.

Without any further ado I would like to present the results of the benchmark. I used IPFire 2.17 – Core Update 90 with openssl 1.0.2a so that both systems could use the SSE2 improvements of this release. I am looking at the results for a block size of 1024 bytes.

Results for a block size of 1024 bytes (M = MByte/s)

                    IPFire Prime Box              APU1C
Processor           Intel Atom N2600 – 1.6 GHz    AMD APU G-T40E – 1 GHz
Remarks                                           SSSE3 disabled

Ciphers
AES-256-GCM         15.02M                        11.11M
AES-128-GCM         19.49M                        14.46M
AES-256-CBC         18.72M                        31.24M
AES-128-CBC         26.22M                        42.65M
Camellia-256-CBC    21.29M                        35.02M
Camellia-128-CBC    28.10M                        45.56M

Hash Algorithms
SHA512              28.50M                        32.02M
SHA256              63.95M                        37.67M
SHA1                134.74M                       81.40M
MD5                 192.83M                       146.06M
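
The numbers were collected with the following loop:

# Benchmark every cipher and hash listed in the table.
# Camellia is only benchmarked in CBC mode, matching the table.
for algo in aes-{256,128}-{gcm,cbc} camellia-{256,128}-cbc md5 sha{1,256,512}; do
    openssl speed -elapsed -evp "${algo}"
done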

So what are we seeing here? The results are all over the place at first glance. I was expecting that one or the other processor would be faster for all ciphers, but that is not at all what we are seeing in the data. The IPFire Prime Box is much faster for GCM, which combines encryption and integrity in one step (that is also why GCM is much slower than CBC). The PC Engines APU system is faster for CBC. For the hashing algorithms it is the other way around again: the IPFire Prime Box is significantly faster here, with the exception of SHA512.

So that makes it very hard to say which one is the fastest.

Where do you get the most bang for the buck?

We can say this for sure for GCM, because this encryption mode already includes the integrity check, so the result of the benchmark is pretty close to the actual performance in a real-world scenario. On top of that, GCM can use both cores of the processor simultaneously so that the throughput doubles. For AES-128-GCM on the IPFire Prime Box, that is 19.49 MByte/s x 2 = 38.98 MByte/s, or a bit over 300 MBit/s. Therefore it would be possible to saturate a symmetric 100M link and still have some resources left for other tasks.
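
The arithmetic behind that estimate can be spelled out as a small shell sketch. The variable names are only for illustration, and reading a symmetric 100M link as 100 MBit/s in each direction (200 MBit/s in total) is an assumption.

```shell
#!/bin/bash
per_core=19.49   # AES-128-GCM on the IPFire Prime Box in MByte/s (from the table)
cores=2          # assumes that throughput scales linearly with both cores

# MByte/s -> MBit/s: multiply by 8
mbit=$(awk -v r="$per_core" -v n="$cores" 'BEGIN { printf "%.2f", r * n * 8 }')
echo "AES-128-GCM on both cores: ${mbit} MBit/s"

# Assumption: a symmetric 100M link carries 100 MBit/s in each direction
link=200
headroom=$(awk -v t="$mbit" -v l="$link" 'BEGIN { printf "%.2f", t - l }')
echo "Headroom over the link: ${headroom} MBit/s"
```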

For CBC it is way more complicated to tell the actual throughput from this data. One of the main reasons is that CBC cannot use a second processor, because encrypting the next block of data requires the previous one as an input. The APU would perform encryption faster but is really slow for integrity. The IPFire Prime Box is significantly faster when computing the hash but not as fast when encrypting the data block. I would say that this is a tie then. Let’s see…
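
One rough way to put a number on that is to assume that encryption and hashing run one after the other on the same core, so that the time spent per MByte is the sum of both. The sketch below applies this simplified model to AES-128-CBC combined with SHA256 using the figures from the table. The model and the function name combined are assumptions made for this illustration, not something the benchmark measured.

```shell
#!/bin/bash
# Combined rate when cipher and hash run back to back on one core:
#   time per MByte = 1/cipher + 1/hash
#   => rate = (cipher * hash) / (cipher + hash)
combined() {
    awk -v c="$1" -v h="$2" 'BEGIN { printf "%.2f", (c * h) / (c + h) }'
}

# AES-128-CBC and SHA256 in MByte/s, taken from the table
prime_box=$(combined 26.22 63.95)
apu=$(combined 42.65 37.67)

echo "IPFire Prime Box: ${prime_box} MByte/s"
echo "APU1C:            ${apu} MByte/s"
```

Under this model both systems end up fairly close to each other, because the slower of the two operations dominates the result on each of them.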

AMD’s little tricks

With SSSE3 enabled, the APU is actually slower when encrypting and decrypting data with AES. Arne found out about this, and it seems that no one has really noticed that this is true for most of the latest generation of AMD processors. Those are significantly slower than expected when they use a certain instruction of the SSSE3 instruction set. For those who are more interested in that: it is an instruction that performs the shuffle operation in AES. This operation is slow when implemented in software, but can be implemented very well in hardware. It appears to us that AMD has not implemented that one in hardware but emulates it in the micro code of the processor. So it is software again. With the openssl update in Core Update 90, we decided to disable SSSE3 on all AMD processors to get better results. The PC Engines APU system benefits massively from that, and so do the high-end processors like the AMD FX series based on the Bulldozer micro architecture.
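
For anyone who wants to compare both code paths on their own AMD system: OpenSSL on x86 reads the environment variable OPENSSL_ia32cap, which can mask out processor capabilities at run time. According to the OpenSSL documentation, SSSE3 is bit 41 of that capability vector, and a leading ~ clears the given bits. The sketch below builds the mask and wraps the benchmark call in a helper whose name, speed_without_ssse3, is made up for this illustration. It is only a way to compare by hand and is not meant to describe how the change in Core Update 90 is implemented.

```shell
#!/bin/bash
# SSSE3 is bit 41 in OpenSSL's x86 capability vector (see OPENSSL_ia32cap).
ssse3_bit=41
mask=$(printf '0x%x' $((1 << ssse3_bit)))
echo "SSSE3 mask: ${mask}"

# Hypothetical helper: benchmark one algorithm with SSSE3 hidden from OpenSSL.
# The leading ~ tells OpenSSL to clear the bits given in the mask.
speed_without_ssse3() {
    OPENSSL_ia32cap="~${mask}" openssl speed -elapsed -evp "$1"
}

# Run both variants and compare the 1024 bytes column:
#   openssl speed -elapsed -evp aes-128-cbc
#   speed_without_ssse3 aes-128-cbc
```

The benchmark calls are left as comments because each run takes a while. On a processor that executes the shuffle instruction in hardware, the run with the mask applied should come out slower, not faster.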

There are a couple of other special instructions like this involved in each and every one of the ciphers and hashes. We have not noticed anything similar for other processors, although it may well be possible that similar cases exist. What must have happened here is that someone implemented a highly optimised version of the AES cipher that uses the SSSE3 instructions. When you write such code you will always benchmark every step it makes. Of course nobody can test it on all processors that there are in the world. Maybe he or she did test it on AMD processors. Either AMD decided right from the beginning to emulate the SSSE3 instruction set in the micro code, or they scrapped the hardware implementation later to make the processors smaller and not waste any space on the die for instructions that are not needed very often. That does not only save space. It also reduces the energy consumption of the processor at the cost of making it slower.

The non-optimised version of the same algorithm which has been benchmarked here is performing faster than the one which is theoretically faster because it uses SSSE3. At that point I would have found it better to not implement SSSE3 at all so that programs that come with several implementations of an algorithm can then pick the actual fastest on that list. If an implementation is theoretically faster because it uses optimised instructions, it must be faster in the real world, too. There is no point in emulating something in the micro code except being able to put a label on that box that says that this processor is “SSSE3-ready”.

Conclusion

So in the end: The Intel Atom processor is – from a network appliance point of view – the faster processor compared to the AMD APU processor. Although it is the most stripped-down processor that Intel has to offer, it implements instruction sets like SSSE3 properly and is able to execute those complex instructions in a decent time.

Interestingly, the much higher clock speed does not make a huge difference in the execution speed of general, non-optimised code – the Intel processor is in fact a bit slower here – but the faster access to memory gives it a great advantage for many operations, such as bus transfers, which are important for high network throughput. The AMD APU processor has pretty huge deficits here that we can see perfectly in the benchmarks of the hashing algorithms, although the pure computation is executed faster. Surprisingly, the processor also consumes much more power and emits much more heat. That is pretty bad for a clock speed of only one GHz.

It seems to be very close to a tie here, but for our application, the winner is clear: A network appliance rarely has tasks to do where the processor hits 100% load over a long time. There are peaks where many packets need to be transferred and where lots of cryptographic tasks need to be done in a short amount of time. The proper implementation of instruction sets that help us with that and the high bus transfer rates clearly put the Intel Atom processor forward as the winner here. The tasks IPFire does are far away from general-purpose computing, and therefore the hardware should be tailored for that application, too.

One last side note

The Atom series is certainly designed to be slow. An Intel Core2Duo processor with the same clock speed is three or four times as fast when executing AES. Differences as huge as in this benchmark cannot be found everywhere; the execution speed of cryptographic operations is capped so that Intel can sell their more expensive products. Up-selling is basically the strategy that Intel pursues most prominently. So the Intel Atom series is not the way to go when high VPN throughput is needed. It has to be a more powerful processor then. This is just a look at the low-price segment, which is perfect for the tasks mentioned above: branch offices, small companies and homes. In my view, with Intel you get what you pay for.

This feature will be included in IPFire 2.17 – Core Update 90