revision 1.3
date: 23-Nov-2025;  author: riastradh
aes(9): Rewrite x86 SSE2 implementation.
This computes eight AES_k instances simultaneously: it uses the bitsliced 32-bit aes_ct logic, which computes two blocks at a time in uint32_t arithmetic, and vectorizes it four ways across the 32-bit lanes of a 128-bit SSE2 register.
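
To illustrate the idea (a hedged sketch, not the committed aes_sse2 code): aes_ct keeps the bitsliced state of two blocks in eight uint32_t words, so replacing each uint32_t with a four-lane __m128i runs four of those two-block computations in lockstep, eight blocks in all.  Under bitslicing, AddRoundKey stays a plain XOR:

#include <emmintrin.h>	/* SSE2 intrinsics */

/*
 * Illustrative only: q[0..7] hold the bitsliced state of eight blocks
 * (two per 32-bit lane, four lanes per register); sk[0..7] hold the
 * round key expanded into the same layout.
 */
static inline void
add_round_key(__m128i q[8], const __m128i sk[8])
{
	for (unsigned i = 0; i < 8; i++)
		q[i] = _mm_xor_si128(q[i], sk[i]);
}
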
Previously, the SSE2 code was a very naive adaptation of aes_ct64, which computes four blocks at a time in uint64_t arithmetic, without any 2x vectorization.  I did it that way at the time because (a) it was easier to get working, and (b) it only affects really old hardware with neither AES-NI nor SSSE3, both of which enable much, much faster implementations.
But it was bugging me that this was a kind of dumb use of SSE2.
Substantially reduces stack usage (from ~1200 bytes to ~800 bytes) and should approximately double throughput for CBC decryption and for XTS encryption/decryption.
I also tried a 2x64 version but cursory performance measurements didn't reveal much benefit over 4x32. (If anyone is interested in doing more serious performance measurements, on ancient hardware for which it might matter, I also have the 2x64 code around.)
Prompted by:
PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized version in kernel

revision 1.1
date: 29-Jun-2020;  author: riastradh
New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses the 64-bit words q[0],...,q[7], aes_sse2 uses the pairs (q[0], q[4]), ..., (q[3], q[7]), each pair stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which still spread the 64-bit quantities across separate registers and ignore the upper halves.
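
As a rough sketch of that layout (assuming standard SSE2 intrinsics; this is not the actual aes_sse2 code), pairing q[i] with q[i + 4] in one 128-bit register could look like:

#include <emmintrin.h>	/* SSE2 */
#include <stdint.h>

/*
 * Illustrative only: stack aes_ct64's eight 64-bit words q[0..7] into
 * four 128-bit registers, q[i] in the low half and q[i + 4] in the
 * high half, i.e. the pairs (q[0], q[4]), ..., (q[3], q[7]).
 */
static inline void
pack_pairs(__m128i r[4], const uint64_t q[8])
{
	for (unsigned i = 0; i < 4; i++)
		r[i] = _mm_set_epi64x((int64_t)q[i + 4], (int64_t)q[i]);
}
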
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
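
For reference (a sketch; _mm_shuffle_epi8 is the SSSE3 intrinsic for PSHUFB, not part of SSE2), the operation vpaes builds on is a 16-entry byte table lookup across all sixteen lanes in a single instruction:

#include <tmmintrin.h>	/* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/*
 * Illustrative only: each byte of idx selects a byte of table by its
 * low four bits, or zeroes that lane if its high bit is set.  SSE2
 * alone has no single-instruction equivalent, so vpaes-style S-box
 * evaluation via table lookups is not available there.
 */
static inline __m128i
lookup16(__m128i table, __m128i idx)
{
	return _mm_shuffle_epi8(table, idx);
}
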
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly, but the present approach is expedient.
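
A minimal sketch of the kind of local definition meant here (assuming GCC/Clang vector extensions; the actual in-tree header may differ):

/* 128-bit integer vector type, as GCC and Clang define it. */
typedef long long __m128i __attribute__((__vector_size__(16), __may_alias__));

/* XOR of two 128-bit vectors; vector extensions permit elementwise ^. */
static inline __m128i
_mm_xor_si128(__m128i a, __m128i b)
{
	return a ^ b;
}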