VIA PadLock—Wicked Fast Encryption

by Michal Ludvig

Probably everyone who has used encryption soon realised that the demand for processor power grew instantly. On older systems, the trade-off for using encrypted filesystems is slower file operations; on newer systems, the trade-off is, at minimum, significantly higher CPU loads. Encrypting network traffic with the IPsec protocol also slows things down, and sometimes you may encounter performance problems even on the standard 100Mbps network.

Options exist, however, for working around these encryption/performance trade-offs:

  • Don't encrypt: apparently the cheapest solution, but this can become very expensive in the long run.

  • Accept the slowdown: the typical approach.

  • Use a standalone cryptography accelerator, such as a PCI card: this doesn't help as much as you might expect, however, because the data must traverse the PCI bus more often than necessary.

  • Use a CPU with VIA PadLock technology. What's VIA PadLock? Read on.

VIA PadLock

A while back, VIA introduced a simple but slightly controversial approach: select some cryptographic algorithms and wire them directly into the CPU. The result was the introduction of an i686 class processor that understands some new instructions dedicated to cryptographic functions. This technology is called VIA PadLock, and the processor is fully compatible with AMD Athlons and Intel Pentiums.

The PadLock features available on your machine's processor are determined by its version. Processor versions usually are written as a family-model-stepping (F/M/S) triplet. Family is always 6 for i686 class CPUs. If the model is 9, your CPU has a Nehemiah core; if the model is 10, it has an Esther core. The stepping denotes a revision of each model. You can find your processor's version in /proc/cpuinfo.
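As a rough illustration of the F/M/S triplet, the sketch below decodes the processor signature that CPUID leaf 1 returns in EAX. The names decode_fms and core_name are hypothetical helpers, and the code assumes that only the base family, model and stepping fields matter, which is enough for the model 9 and model 10 values discussed here.

```c
#include <assert.h>
#include <stdint.h>

/* Family/model/stepping triplet as described in the text. */
struct fms {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* Decode the base fields of the CPUID leaf 1 EAX signature.
 * Assumption: the extended family/model bits are ignored. */
static struct fms decode_fms(uint32_t signature)
{
    struct fms v;
    v.stepping = signature & 0xf;
    v.model    = (signature >> 4) & 0xf;
    v.family   = (signature >> 8) & 0xf;
    return v;
}

/* Map the model number to the core names used in this article. */
static const char *core_name(struct fms v)
{
    assert(v.family <= 0xf);
    if (v.family != 6)
        return "unknown";
    if (v.model == 9)
        return "Nehemiah";
    if (v.model == 10)
        return "Esther";
    return "unknown";
}
```

A signature of 0x698, for example, decodes to family 6, model 9, stepping 8, which is a Nehemiah core with the ACE unit.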

Nehemiah stepping 3 and higher offers an electrical noise-based random number generator (RNG) that produces good random numbers for different purposes. The instruction for accessing the RNG is called xstore. As in Intel and AMD processors, the random number generator in VIA processors is supported by the hw_random device driver.

Nehemiah stepping 8 and higher contains two independent RNGs and the Advanced Cryptography Engine (ACE). The ACE can encrypt and decrypt data using the Advanced Encryption Standard (AES) algorithm with three standard key lengths (128, 192 and 256 bits) in four different modes of operation: electronic codebook (ECB), cipher block chaining (CBC), cipher feedback (CFB) and output feedback (OFB) modes (see the on-line Resources). The appropriate instructions are called xcryptecb, xcryptcbc and so on. Later in this article, I predominantly use their common group name, xcrypt, instead of the mode-specific instruction names.

Esther stepping 0 and higher inherited two RNG units from Nehemiah. ACE was extended with counter (CTR) mode support and MAC (Message Authentication Code) computation. And there are two new acronyms, PHE and PMM. PadLock Hash Engine (PHE) is used for computing a cryptographic hash, also known as a digest, of a given input block, using the SHA1 or SHA256 algorithm. The proposed instruction name is xsha.

The PadLock Montgomery Multiplier (PMM) is responsible for speeding up one of the most time-consuming computations used in asymmetric, or public-key, cryptography: AB mod M, where A, B and M are huge numbers, usually 1,024 or 2,048 bits. This instruction is called montmul.
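To see why A*B mod M dominates public-key work, consider modular exponentiation, which is built entirely from such steps. The sketch below is plain software square-and-multiply on small numbers, not Montgomery multiplication and not a model of montmul itself; mulmod, modpow and the call counter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of modular multiplications performed so far (illustration only). */
static unsigned mulmod_calls;

/* One A*B mod M step. Operands must fit in 32 bits so that the
 * 64-bit product cannot overflow; real keys need multiprecision math. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    assert(m != 0 && a <= UINT32_MAX && b <= UINT32_MAX);
    mulmod_calls++;
    return (a * b) % m;
}

/* Square-and-multiply exponentiation: base^exp mod m. */
static uint64_t modpow(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)
            result = mulmod(result, base, m);   /* multiply step */
        base = mulmod(base, base, m);           /* squaring step */
        exp >>= 1;
    }
    return result;
}
```

Even a 4-bit exponent such as 13 costs seven modular multiplications here; with exponents and moduli of 1,024 or 2,048 bits, the count and the cost of each step grow accordingly.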

As noted above, in the rest of this article I mostly speak about the xcrypt instruction. Principles described further mostly are valid for other units as well, and xcrypt serves only as an example. Also, the terms and concepts covered in this encryption discussion apply to decryption as well.
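A program can check at runtime which of these units are usable. The sketch below decodes the EDX value returned by CPUID leaf 0xC0000001; the bit positions follow the feature flags defined in the Linux kernel sources and should be verified against the VIA documentation. The enum and the padlock_usable helper are hypothetical names, and obtaining the EDX value itself (for instance with the compiler's CPUID helpers) is left out so the code stays portable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in EDX of CPUID leaf 0xC0000001. Each unit has a
 * "present" bit and an "enabled" bit directly above it. */
enum padlock_unit {
    PADLOCK_RNG  = 2,   /* xstore */
    PADLOCK_ACE  = 6,   /* xcrypt */
    PADLOCK_ACE2 = 8,   /* ACE extensions on Esther */
    PADLOCK_PHE  = 10,  /* xsha */
    PADLOCK_PMM  = 12   /* montmul */
};

/* A unit is usable only when both its present and enabled bits are set. */
static bool padlock_usable(uint32_t edx, enum padlock_unit unit)
{
    uint32_t mask = (uint32_t)3 << unit;
    assert(unit <= PADLOCK_PMM);
    return (edx & mask) == mask;
}
```

An EDX value of 0xCC, for example, reports a usable RNG and ACE but no hash engine or Montgomery multiplier.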

How to Use PadLock

In contrast to the external cryptography accelerators usually plugged into PCI slots, the PadLock engine is an integral part of the CPU. This fact significantly simplifies its use, because it is not necessary to bother with accessing the bus or with interrupts, asynchronous operations and so on. Encrypting a block of memory with xcrypt is as easy as copying it over with the movs instruction.

At this point, encryption is almost an atomic operation. Before executing the instruction, the buffer contains plain-text input data; a few clock cycles later, when the execution finishes, we have ciphertext. If a task requested processing of a single block, which is 16 bytes in the case of the AES algorithm, the operation is fully atomic. That is, the CPU doesn't interrupt it in the middle and doesn't do anything else until the encryption is finished.

But what if the buffer contains a gigabyte of plain text to be processed? It isn't good to stop all other operations and wait for the encryption to finish when it's this large. In such a case, the CPU can interrupt the encryption after every single block of 16 bytes. The current state is saved, and whatever else can be done is done—interrupts can be handled and processes switched. As soon as the encrypting process is restarted, the instruction continues from the point at which it was suspended. That's why I say this is almost an atomic operation: for the calling process it looks atomic, but it can be interrupted by a higher-priority event. The current processing state then is saved into the memory and registers of the running process, which enables multiple tasks to do encryption simultaneously, without the risk of mixing their data. Again, it is an analogous situation to copying memory blocks with the movs instruction.

How Fast Is It?

According to VIA, the maximum throughput on 1.2GHz processors exceeds 15Gb/s, which is almost 1.9GB/s. The benchmarks I have run confirm that such a speed could be achieved in real-world applications and not only in VIA marketing papers, which definitely was a nice surprise.

The actual encryption speed depends on two factors, cipher mode and data alignment. ECB is the fastest, while the most widely used CBC mode runs at about half of the ECB speed. PadLock requires the data to be aligned at 16-byte boundaries, so unaligned data must be moved to proper addresses first, which takes some time. In some cases, the Esther CPU can realign the data automatically, but this still causes some slowdown.
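One way to satisfy the alignment requirement is to copy unaligned input into an aligned bounce buffer before encrypting. This is a minimal sketch using C11 aligned_alloc; the helper names is_aligned and aligned_copy are hypothetical, and the extra memcpy is exactly the cost the paragraph above refers to.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PADLOCK_ALIGN 16

/* True if the pointer sits on a 16-byte boundary. */
static int is_aligned(const void *p)
{
    return ((uintptr_t)p % PADLOCK_ALIGN) == 0;
}

/* Return a 16-byte aligned copy of src. len must be a nonzero multiple
 * of 16. The caller frees the result; NULL means allocation failed. */
static void *aligned_copy(const void *src, size_t len)
{
    void *dst;

    assert(len > 0 && len % PADLOCK_ALIGN == 0);
    dst = aligned_alloc(PADLOCK_ALIGN, len);
    if (dst != NULL)
        memcpy(dst, src, len);
    return dst;
}
```

Code that controls its own buffers can avoid the copy entirely by allocating them aligned in the first place.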

Table 1 shows some numbers from my testing. The OpenSSL benchmark for VIA Nehemiah 1.2GHz produced the following results in kB/s.

Table 1. The OpenSSL Benchmark for VIA Nehemiah 1.2GHz, in kB/s

Type                     16 bytes     64 bytes    256 bytes   1,024 bytes   8,192 bytes
aes-128-ecb (software)  11,274.53    14,327.79    14,608.64     14,672.55     14,693.72
aes-128-ecb (PadLock)   66,892.82   346,583.52   910,704.21  1,489,932.59  1,832,151.72
aes-128-cbc (software)   8,276.27    12,915.75    13,264.13     13,313.02     13,322.92
aes-128-cbc (PadLock)   48,542.30   241,898.79   523,706.28    745,157.61    846,402.90

The bigger the blocks are, the better the results are, because the fixed overhead of the OpenSSL library itself is spread over more data. Encryption of 8kB blocks in ECB mode can run at about 1.7GB/s; in CBC mode, we get about 800MB/s. In comparison to software encryption, PadLock in ECB mode is 120 times faster on the same processor, and CBC mode is 60 times faster.

Thanks to this speedup, IPsec on a 100Mbps network runs at almost full speed, somewhere around 11MB/s. Similar speedups can be seen on encrypted filesystems. The Bonnie benchmark running on a Seagate Barracuda in UDMA100 mode produced plain-text throughput at a rate of 61,543kB/s; with PadLock, it was 49,961kB/s, and pure software encryption ran at only 10,005kB/s. In other words, PadLock was only about 20% slower, while the pure software was almost 85% slower than the non-encrypted run. See Resources for a link to my benchmark page with more details and more numbers.
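The ratios quoted for Table 1 and for the Bonnie runs can be reproduced with two small helpers. This is only a sketch of the arithmetic; speedup and slowdown_percent are hypothetical names, and the inputs are the figures given in the text.

```c
#include <assert.h>

/* How many times faster `fast` is than `slow` (same units for both). */
static double speedup(double fast, double slow)
{
    assert(slow > 0.0);
    return fast / slow;
}

/* Percentage of throughput lost relative to the unencrypted baseline. */
static double slowdown_percent(double baseline, double measured)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - measured) / baseline;
}
```

With the 8,192-byte column of Table 1, speedup(1832151.72, 14693.72) comes to roughly 125 for ECB and speedup(846402.90, 13322.92) to roughly 64 for CBC. For the Bonnie figures, slowdown_percent(61543, 49961) is about 19 and slowdown_percent(61543, 10005) is about 84.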

Linux Support

So far, the Linux support I have developed for the following packages covers only the AES algorithm provided by the xcrypt instruction, because I haven't used the Esther CPU yet. As soon as I get the new processor, I will add support for the other algorithms where appropriate.

Kernel

When the kernel needs the AES algorithm, it loads by default the aes.ko module, which provides its software implementation. To use PadLock for AES, you must load the padlock.ko module instead. You can do this either by hand with modprobe or by adding a single line to /etc/modprobe.conf:

alias aes padlock

Now, every time the kernel requires AES, it automatically loads padlock.ko too.

Patches are available for kernel version 2.6.5 and above; see the PadLock in Linux home page in Resources. Also, the basic driver will be available in the vanilla 2.6.11 kernel without any patching.

OpenSSL

Those amongst us who are brave enough to use recent CVS versions of OpenSSL already have PadLock support. Users of OpenSSL 0.9.7 have to patch and rebuild the library, or they can use a Linux distribution that has the patch already included in its packages, such as SUSE Linux 9.2.

To see if your OpenSSL build has PadLock support, run this simple command:

$ openssl engine padlock
(padlock) VIA PadLock (RNG, ACE)

If instead of (RNG, ACE) you see (no-RNG, no-ACE), it means that your OpenSSL installation is PadLock-ready, but your processor is not. You also could see an ugly error message saying that there is no such engine. In that case, you should upgrade or patch your OpenSSL library.

Programs that use OpenSSL for their cryptography needs can enjoy PadLock support only if they use the so-called EVP interface and initialize hardware accelerator support somewhere at the beginning of their runs:


#include <openssl/engine.h>
int main ()
{
    [...]
    ENGINE_load_builtin_engines();
    ENGINE_register_all_complete();
    [...]
}

See the evp(3) man page from the OpenSSL documentation for details.

In SUSE Linux 9.2, for example, OpenSSH has a similar patch to let you experience much faster scp transfers over the network.

Binutils

To use PadLock in your own programs, you either can call the instruction by name—for example, xcryptcbc—or write its hexadecimal form directly:

.byte 0xf3,0x0f,0xa7,0xd0

For backward compatibility with older development tools, it is safer to use the opcode form. Binutils versions 2.15 and newer, however, already understand the symbolic names where appropriate, for example, in the gas (GNU assembler) and objdump programs. The BFD library from binutils, which among other things handles instruction-level operations, is also used by the GNU debugger, gdb. A sample instruction dump of an encryption function may be as simple as:


(gdb) x/3i $pc
0x8048392 <demo1+14>:    lea    0x80495f0,%edx
0x8048398 <demo1+20>:    repz xcryptecb
0x804839c <demo1+24>:    push   %eax

As you might have guessed, SUSE Linux 9.2 has PadLock patches in all the appropriate packages, and you can enjoy PadLock support out of the box. If your distribution does not have these patches, check out my Linux PadLock home page in Resources for the available patches.
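For toolchains that predate binutils 2.15, the opcode form can be kept in a small table. The sketch below records the byte sequences for the four xcrypt modes; the CBC bytes are the ones shown above, while the ECB, CFB and OFB values are taken from publicly available PadLock code and should be checked against the VIA documentation. The enum and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum xcrypt_mode { XCRYPT_ECB, XCRYPT_CBC, XCRYPT_CFB, XCRYPT_OFB };

/* All xcrypt instructions share the rep prefix and a two-byte opcode. */
static const uint8_t xcrypt_prefix[3] = { 0xf3, 0x0f, 0xa7 };

/* The fourth byte selects the cipher mode. */
static uint8_t xcrypt_mode_byte(enum xcrypt_mode mode)
{
    static const uint8_t last[] = { 0xc8, 0xd0, 0xe0, 0xe8 };

    assert((int)mode >= XCRYPT_ECB && (int)mode <= XCRYPT_OFB);
    return last[mode];
}
```

In an assembler source or an inline asm string, the same bytes are written with the .byte directive, as in the xcryptcbc line shown earlier.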

Programming PadLock

In the following sections, I describe some guidelines for programming PadLock, including details of xcryptcbc. I also explain how to set up PadLock for encrypting a buffer of data with the AES algorithm and a key length of 128 bits in CBC mode. All other instructions of the xcrypt group are used in exactly the same way. Other PadLock functions apply similar rules.

xcryptcbc

xcryptcbc does not have any explicit operands. Instead, every register has a given, fixed function:

  • ESI—source address.

  • EDI—destination address.

  • EAX—initialization vector address.

  • EBX—cipher key address.

  • ECX—number of blocks for processing.

  • EDX—control word address.

Unless written otherwise, all addresses must be aligned at 16-byte boundaries.

ESI/EDI—Addresses of the Source/Destination Data

Both source and destination addresses can be the same, so it is possible to encrypt in place. The size of the destination buffer must be at least the size of the source one. Both must be a multiple of the block size, 16 bytes. Under some circumstances, the Esther CPU allows processing of unaligned buffers, but the operation is slower.

EAX—Initialization Vector Address

The initialization vector (IV) is one of the parameters on which the result of the encryption depends. The size of the IV is the same as the block size, which is 16 bytes. Consult the literature for details about initialization vectors.

EBX—Cipher Key Address

Cipher keys can have one of the following sizes: 128, 192 or 256 bits. The AES algorithm internally uses a so-called expanded key, which is derived from the given cipher key. For 128-bit keys, the expanded key can be computed by PadLock. For longer keys, you must compute it yourself.

ECX—Number of Blocks to Process

The xcrypt instruction always is used with the rep prefix, which repeats its execution until the ECX register reaches zero. The value in ECX is decremented after each block is encrypted or decrypted.
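The value loaded into ECX follows directly from the buffer length. This minimal sketch, with hypothetical helper names, converts a byte length into a block count and shows how the remaining ECX value relates to the amount of data already processed.

```c
#include <assert.h>
#include <stddef.h>

#define AES_BLOCK_SIZE 16

/* Number of 16-byte blocks in len bytes, or 0 if len is not a whole
 * number of blocks and therefore cannot be handed to xcrypt as is. */
static size_t xcrypt_block_count(size_t len)
{
    if (len % AES_BLOCK_SIZE != 0)
        return 0;
    return len / AES_BLOCK_SIZE;
}

/* Bytes already processed when ECX has counted down from total_blocks
 * to ecx_now, for example at the moment the instruction is interrupted. */
static size_t xcrypt_bytes_done(size_t total_blocks, size_t ecx_now)
{
    assert(ecx_now <= total_blocks);
    return (total_blocks - ecx_now) * AES_BLOCK_SIZE;
}
```

An 8,192-byte buffer, for instance, gives a block count of 512.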

EDX—Control Word Address

To let PadLock know exactly how to process the data, we must fill a structure called the control word with the following items:

  • Algorithm—you can choose only AES.

  • Key size—one of the supported sizes.

  • Enc/Dec—direction: encryption or decryption.

  • Keygen—did we prepare the expanded key or should PadLock compute it itself?

  • Rounds—internal value of the algorithm; see the explanation later in the text and in PadLock documentation.

In C, we can use a union to allocate the appropriate space for the structure and a bit field to describe and access its items easily:

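#include <stdint.h>

/* PadLock control word: 16 bytes, with bit fields for the items above.
 * The fields are unsigned so that rounds can hold values above 7. */
union cword {
    uint8_t cword[16];
    struct {
        unsigned int rounds:4;  /* number of AES rounds */
        unsigned int algo:3;    /* algorithm; zero selects AES */
        unsigned int keygen:1;  /* one if we supply the expanded key */
        unsigned int interm:1;  /* store intermediate results */
        unsigned int encdec:1;  /* zero encrypts, one decrypts */
        unsigned int ksize:2;   /* key size selector */
    } b;
};

Assembler Example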

Now that we know all the theory, it's time for a real example. To begin, here are some lines of pure assembler:

.comm   iv,16,16
.comm   key,16,16
.comm   data,16,16
.comm   cword,16,16

.text
cryptcbc:
    movl    $data, %esi  #; Source address
    movl    %esi, %edi   #; Destination
    movl    $iv, %eax    #; IV
    movl    $key, %ebx   #; Cipher key
    movl    $cword, %edx #; Control word
    movl    $1, %ecx     #; Block count
    rep xcryptcbc
    ret

This piece of code encrypts one block of data with a cipher key and an initialization vector, following the parameters set in control word cword. Notice that we use the same address for both source and destination data, therefore we encrypt in-place. Because the field data has a size of only a single block, we set the ECX register to one.

C Language Example

To use PadLock directly in a C program, we can write the PadLock routines in a separate assembler source file, compile it to a standalone module and finally link it to our binary. It often is easier, though, to use the GCC inline assembler and write the instructions directly in the C code. See Resources for a link to a tutorial on the inline assembler.

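static inline void *
padlock_xcryptcbc(char *input, char *output,
    void *key, void *iv, void *control_word,
    int count)
{
    /* The instruction modifies ESI, EDI, EAX and ECX (ECX is
     * counted down), so all four are read-write operands; the
     * "memory" clobber tells GCC that the buffers change. */
    asm volatile ("xcryptcbc"
       : "+S"(input), "+D"(output), "+a"(iv), "+c"(count)
       : "d"(control_word), "b"(key)
       : "memory");
    return iv;
}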

This code instructs the compiler to load the given values of input, count and other parameters into the appropriate registers. It then is told to issue the xcryptcbc instruction and, finally, to return the value found in the EAX register as a pointer to the new initialization vector.

To be successful here, we also must fill in the control word structure correctly. First of all, it is a good idea to clear the union to avoid using any irrelevant values that might be in the memory:


memset(&cword, 0, sizeof(cword));

Now let's fill in the fields one by one. The first one in the list is rounds. This item specifies how many times AES processing should be run with the input block, each round using a unique part of the expanded key. To comply with the FIPS AES standard, use 10 rounds for 128-bit keys, 12 rounds for 192 bits and 14 rounds for 256 bits. Should the key_size variable contain the length of the cipher key in bytes, this is how we get the rounds value:

cword.b.rounds = 10 + (key_size - 16) / 4;

The next field is algo. This is reserved to let you choose future encryption algorithms instead of AES, although AES is the only option at the moment. Therefore, leave zero here.

The keygen field must be set to one if we prepare the expanded key ourselves. Zero means that PadLock should generate it instead, but that is possible only for 128-bit keys:

cword.b.keygen = (key_size > 16);

The item interm enables the storing of intermediate results after each round of the algorithm is run. I suspect the CPU architects used this field for debugging their core, and I don't see much sense in setting this in the program.

Encryption is distinguished from decryption by the bit encdec. Zero is encryption; one is decryption.

Finally, we must set the key size in the two bits of ksize:

cword.b.ksize = (key_size - 16) / 8;

That's it. With this prepared control word structure and properly aligned buffers, we can call padlock_xcryptcbc(). If the electrons are on our side, in a short while we receive the encrypted data.
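The individual assignments above can be collected into one helper. This is a sketch under the assumptions of this article: cword_init is a hypothetical name, key_size is given in bytes, and the union is repeated here with unsigned bit fields so that the block compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Control word layout as described in the text; unsigned bit fields
 * let the 4-bit rounds field hold the values 10, 12 and 14. */
union cword {
    uint8_t cword[16];
    struct {
        unsigned int rounds:4;
        unsigned int algo:3;
        unsigned int keygen:1;
        unsigned int interm:1;
        unsigned int encdec:1;
        unsigned int ksize:2;
    } b;
};

/* Fill the control word for AES with a 16-, 24- or 32-byte key.
 * decrypt is nonzero for decryption, zero for encryption. */
static void cword_init(union cword *cw, unsigned key_size, int decrypt)
{
    assert(key_size == 16 || key_size == 24 || key_size == 32);

    memset(cw, 0, sizeof(*cw));              /* clear stale memory */
    cw->b.rounds = 10 + (key_size - 16) / 4; /* 10, 12 or 14 rounds */
    cw->b.algo   = 0;                        /* AES is the only choice */
    cw->b.keygen = (key_size > 16);          /* we expand longer keys */
    cw->b.encdec = (decrypt != 0);
    cw->b.ksize  = (key_size - 16) / 8;      /* 0, 1 or 2 */
}
```

When the control word is later passed to xcrypt, the variable itself also must sit on a 16-byte boundary, for example by declaring it with the C11 _Alignas(16) specifier.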

Conclusion

PadLock documentation is available publicly on the VIA Web site; there you can find further information about PadLock programming caveats. The complete example program for encrypting one block of data, including verification of the result, can be found on my PadLock in Linux home page. See Resources for additional links.

Resources for this article: /article/8137.

Michal Ludvig (michal@logix.cz) recently moved from Prague in the Czech Republic to Auckland on the other side of the world to work as a senior engineer for Asterisk Ltd. He enjoys exploring the secrets of New Zealand with his wife and daughter.
