revision 1.3
date: 23-Nov-2025;  author: riastradh
aes(9): Rewrite x86 SSE2 implementation.
This computes eight AES_k instances simultaneously: it uses the bitsliced 32-bit aes_ct logic, which computes two blocks at a time in uint32_t arithmetic, and vectorizes it four ways across the 32-bit lanes of a 128-bit SSE2 register.
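
To illustrate the idea (a hedged sketch, not the committed aes_sse2 code): aes_ct keeps the bitsliced state of two blocks in eight uint32_t words, so replacing each uint32_t with a four-lane __m128i runs four of those two-block computations in lockstep, eight blocks in all.  Under bitslicing, AddRoundKey stays a plain XOR:

#include <emmintrin.h>	/* SSE2 intrinsics */

/*
 * Illustrative only: q[0..7] hold the bitsliced state of eight blocks
 * (two per 32-bit lane, four lanes per register); sk[0..7] hold the
 * round key expanded into the same layout.
 */
static inline void
add_round_key(__m128i q[8], const __m128i sk[8])
{
	for (unsigned i = 0; i < 8; i++)
		q[i] = _mm_xor_si128(q[i], sk[i]);
}
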
Previously, the SSE2 code was a very naive adaptation of aes_ct64, which computes four blocks at a time in uint64_t arithmetic, without any 2x vectorization.  I did it that way at the time because (a) it was easier to get working, and (b) it only affects really old hardware with neither AES-NI nor SSSE3, both of which enable much, much faster implementations.
But it was bugging me that this was a kind of dumb use of SSE2.
Substantially reduces stack usage (from ~1200 bytes to ~800 bytes) and should approximately double throughput for CBC decryption and for XTS encryption/decryption.
I also tried a 2x64 version but cursory performance measurements didn't reveal much benefit over 4x32. (If anyone is interested in doing more serious performance measurements, on ancient hardware for which it might matter, I also have the 2x64 code around.)
Prompted by:
PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized version in kernel

revision 1.1
date: 29-Jun-2020;  author: riastradh
New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses the 64-bit words q[0],...,q[7], aes_sse2 uses the pairs (q[0], q[4]), ..., (q[3], q[7]), each pair stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which still spread the 64-bit quantities across separate registers and ignore the upper halves.
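
As a rough sketch of that layout (assuming standard SSE2 intrinsics; this is not the actual aes_sse2 code), pairing q[i] with q[i + 4] in one 128-bit register could look like:

#include <emmintrin.h>	/* SSE2 */
#include <stdint.h>

/*
 * Illustrative only: stack aes_ct64's eight 64-bit words q[0..7] into
 * four 128-bit registers, q[i] in the low half and q[i + 4] in the
 * high half, i.e. the pairs (q[0], q[4]), ..., (q[3], q[7]).
 */
static inline void
pack_pairs(__m128i r[4], const uint64_t q[8])
{
	for (unsigned i = 0; i < 4; i++)
		r[i] = _mm_set_epi64x((int64_t)q[i + 4], (int64_t)q[i]);
}
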
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
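
For reference (a sketch; _mm_shuffle_epi8 is the SSSE3 intrinsic for PSHUFB, not part of SSE2), the operation vpaes builds on is a 16-entry byte table lookup across all sixteen lanes in a single instruction:

#include <tmmintrin.h>	/* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/*
 * Illustrative only: each byte of idx selects a byte of table by its
 * low four bits, or zeroes that lane if its high bit is set.  SSE2
 * alone has no single-instruction equivalent, so vpaes-style S-box
 * evaluation via table lookups is not available there.
 */
static inline __m128i
lookup16(__m128i table, __m128i idx)
{
	return _mm_shuffle_epi8(table, idx);
}
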
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly, but the present approach is expedient.
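
A minimal sketch of the kind of local definition meant here (assuming GCC/Clang vector extensions; the actual in-tree header may differ):

/* 128-bit integer vector type, as GCC and Clang define it. */
typedef long long __m128i __attribute__((__vector_size__(16), __may_alias__));

/* XOR of two 128-bit vectors; vector extensions permit elementwise ^. */
static inline __m128i
_mm_xor_si128(__m128i a, __m128i b)
{
	return a ^ b;
}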