Boris gave an intro to crypto which I will not summarize here. See also https://youtu.be/dnGbhvweNb8
Crypto = transforming input data into something else. The implementation is the algorithm; the object is an instance that you use to execute the algorithm and that contains state. It is called a tfm (transform). Algorithms: cipher, hash, AEAD (called authenc in the kernel), HMAC and compression. Algorithms can be combined, e.g. hmac(sha1) or authenc(hmac(sha1),cbc(aes)). How to code it: allocate the algorithm tfm with crypto_alloc_<algtype>, allocate a request and set its callback, set the context (e.g. key, flags), pass in the data with <type>_request_set_crypt, and execute it with crypto_<type>_<operation>. Finally, free the request and the algorithm tfm. The API is asynchronous: typically the encrypt operation returns -EINPROGRESS or -EBUSY and you wait for a completion that is signalled from the callback set earlier.
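The asynchronous pattern can be modelled in plain userspace C. The sketch below is not kernel code: toy_request, toy_engine_model, toy_wait and the toy_* functions are hypothetical stand-ins, the XOR is a placeholder for a real cipher, and -EBUSY is simplified to mean "rejected", whereas the kernel can also return it for a request that was put on a backlog. It only illustrates the control flow: submit returns -EINPROGRESS immediately and the result arrives later through the callback set on the request.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a crypto request (not a kernel type). */
struct toy_request {
    unsigned char *buf;          /* data, transformed in place */
    size_t len;
    void (*complete)(struct toy_request *req, int err);
    void *data;                  /* caller context handed back to the callback */
};

/* Hypothetical one-slot engine: at most one request in flight. */
struct toy_engine_model {
    struct toy_request *pending;
    unsigned char key;           /* the "context" set before submitting */
};

/* What the caller waits on; plays the role of a completion. */
struct toy_wait {
    int done;
    int err;
};

/* Completion callback: record the final status for the waiter. */
static void toy_req_done(struct toy_request *req, int err)
{
    struct toy_wait *w = req->data;

    w->err = err;
    w->done = 1;
}

/* Submit a request and return at once, like an asynchronous encrypt. */
static int toy_encrypt(struct toy_engine_model *eng, struct toy_request *req)
{
    if (eng->pending)
        return -EBUSY;           /* simplification: rejected, not backlogged */
    eng->pending = req;
    return -EINPROGRESS;         /* accepted; the result comes via callback */
}

/* Stands in for the engine's interrupt: finish the work, then call back. */
static void toy_engine_irq(struct toy_engine_model *eng)
{
    struct toy_request *req = eng->pending;

    if (!req)
        return;
    for (size_t i = 0; i < req->len; i++)
        req->buf[i] ^= eng->key; /* placeholder transform, not a cipher */
    eng->pending = NULL;
    req->complete(req, 0);
}
```

A caller checks for -EINPROGRESS after toy_encrypt() and then waits until the callback has set done. In this model that happens when toy_engine_irq() is called; a real user would sleep on a completion instead.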
To use kernel crypto from userspace, there are two competing solutions: the out-of-tree cryptodev and the mainlined AF_ALG. cryptodev is taken from OpenBSD. It creates a device node that is accessed with ioctls. OpenSSL supports it. AF_ALG uses a socket interface (its own address family) and can be added to OpenSSL with an out-of-tree OpenSSL module. Most userspace programs don’t use AF_ALG. Boris did speed experiments with the Marvell CESA driver he implemented: for small blocks, the two are more or less equal; for larger blocks, cryptodev is slightly faster. However, a software implementation is even faster and doesn’t take that much more CPU power. With 128 threads in parallel, AF_ALG is a bit faster. If energy consumption is important, that could change the conclusion again. But the conclusion is: if you need to choose between cryptodev and AF_ALG, perhaps it’s better not to use either of them at all. Better run some benchmarks.
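For reference, a minimal sketch of what AF_ALG looks like from userspace, assuming a Linux system with AF_ALG and a sha256 implementation enabled in the kernel. The helper name afalg_sha256 is made up for this example; the socket calls and struct sockaddr_alg come from the kernel's userspace interface. The helper returns -1 instead of aborting when the interface is not available.

```c
#include <linux/if_alg.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef AF_ALG
#define AF_ALG 38                /* fallback for very old libc headers */
#endif

/* Hash len bytes with the kernel's sha256 through AF_ALG.
 * Returns 0 on success, -1 if AF_ALG or the algorithm is unavailable. */
static int afalg_sha256(const void *data, size_t len, unsigned char out[32])
{
    struct sockaddr_alg sa;
    int tfmfd, opfd, ret = -1;

    memset(&sa, 0, sizeof(sa));
    sa.salg_family = AF_ALG;
    strcpy((char *)sa.salg_type, "hash");    /* algorithm class */
    strcpy((char *)sa.salg_name, "sha256");  /* same name string as in the kernel */

    /* The bound socket represents the tfm. */
    tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfmfd < 0)
        return -1;
    if (bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        goto out_tfm;

    /* accept() yields a second descriptor used for the actual operation. */
    opfd = accept(tfmfd, NULL, NULL);
    if (opfd < 0)
        goto out_tfm;

    /* Write the message, then read back the 32-byte digest. */
    if (write(opfd, data, len) == (ssize_t)len && read(opfd, out, 32) == 32)
        ret = 0;

    close(opfd);
out_tfm:
    close(tfmfd);
    return ret;
}
```

Each call costs several system calls, which is one reason to benchmark small requests against a plain software implementation before committing to this path.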
The crypto API doesn’t distinguish between hardware and software implementations. So you register the crypto_alg subclass with the types of algorithms that are supported. Each algorithm that the engine supports is registered separately with a different name, e.g. “cbc(aes)” and “ecb(aes)”. There is also a driver name that allows selecting one specific implementation of the same algorithm. A priority constant is used for automatic selection of the implementation. Various flags can be set, e.g. that it’s asynchronous.
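How the name, driver name and priority interact can be shown with a small userspace model. This is a sketch, not the kernel's lookup code: struct toy_alg, toy_lookup and the driver names used with it are hypothetical. It assumes a lookup string may be either the generic algorithm name or a driver name, and that the highest priority wins among implementations of the same algorithm.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced view of a registered algorithm. */
struct toy_alg {
    const char *name;         /* generic name, e.g. "cbc(aes)" */
    const char *driver_name;  /* unique per implementation */
    int priority;             /* higher is preferred */
};

/* Resolve a requested string to one registered implementation.
 * A driver name selects exactly that implementation; a generic name
 * selects the highest-priority implementation of that algorithm. */
static const struct toy_alg *toy_lookup(const struct toy_alg *algs, size_t n,
                                        const char *wanted)
{
    const struct toy_alg *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(algs[i].driver_name, wanted) == 0)
            return &algs[i];             /* explicit choice by driver name */
        if (strcmp(algs[i].name, wanted) != 0)
            continue;
        /* Strict '>' keeps the first entry among equal priorities. */
        if (!best || algs[i].priority > best->priority)
            best = &algs[i];
    }
    return best;                         /* NULL if nothing matches */
}
```

With a software "cbc(aes)" at priority 100 and a hardware one at priority 300, asking for "cbc(aes)" returns the hardware entry, while asking for the software entry's driver name still returns the software one.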
When the crypto engine allocates a new tfm, the driver-specific buffer is also allocated by it and passed to the init function. The implementation must also implement setkey, encrypt and decrypt functions.
Because the algorithm is passed as a string, it is quite easy to add a new algorithm to the framework. But that makes the framework complex. Fortunately there is an extensive test suite that can be used to test a new driver. However, often there are several ways to implement the same thing (by composing in a different way). The way that subclassing is done is not consistent. The framework evolves and old drivers don’t use the new features, which makes it difficult to find the current best practices. Important details are sometimes hard to discover, e.g. that the completion callback should be called with softirqs disabled.
There is no way to do NAPI-style polling under heavy load; a driver that is async will always have to be based on interrupts. So using this for network encryption defeats the purpose of NAPI. Boris proposes to add a NAPI-like driver interface to the crypto subsystem.
The priority-based automatic selection will always select the same driver, so if you have two hardware crypto engines, only one of them will be used: the one with the highest priority, or the first one of equal priority. There should be load balancing, but the framework is not designed for it at all. To do that, we’d need a way to define how busy a crypto engine is and an estimate of the load of each request (e.g. its length). When switching engines, the context also has to migrate. Boris proposes to do the load balancing at the driver level, i.e. you register all the engines that can be used interchangeably in a common load balancer, which itself exposes the crypto API.
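A rough userspace model of such a load balancer, to make the idea concrete. This is only an illustration of the proposal, not an existing kernel interface: struct toy_engine_load and the toy_* functions are hypothetical, the load estimate is assumed to be the number of bytes submitted but not yet completed, and context migration is not modelled.

```c
#include <stddef.h>

/* Hypothetical per-engine bookkeeping kept by the balancer. */
struct toy_engine_load {
    const char *name;
    size_t queued_bytes;   /* load estimate: submitted, not yet completed */
};

/* Pick the engine with the least outstanding work; ties go to the first. */
static size_t toy_pick(const struct toy_engine_load *e, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (e[i].queued_bytes < e[best].queued_bytes)
            best = i;
    return best;
}

/* Account for a new request of len bytes and return the chosen engine. */
static size_t toy_submit(struct toy_engine_load *e, size_t n, size_t len)
{
    size_t i = toy_pick(e, n);

    e[i].queued_bytes += len;
    return i;
}

/* Called when engine i finishes a request of len bytes. */
static void toy_complete(struct toy_engine_load *e, size_t i, size_t len)
{
    e[i].queued_bytes -= len;
}
```

With two idle engines, a 4096-byte request goes to the first one and the following small requests go to the second until the first completes, which is the behaviour that pure priority-based selection cannot provide.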
Question from the audience: shouldn’t there be an interface that the crypto user can use to allocate memory, so that it can allocate the buffers in a way that lets the driver access them directly? Some hardware has specific restrictions on the buffer layout (e.g. no scatter-gather), and a memcpy is needed if the buffers don’t meet them.