History log of /src/sys/crypto/aes/arch/x86/
Revision Date Author Comments
Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.5 05-Sep-2020 maxv

x86: fix several CPUID flags

- Rename: CPUID_PN -> CPUID_PSN
          CPUID_CFLUSH -> CPUID_CLFSH
          CPUID_SBF -> CPUID_PBE
          CPUID_LZCNT -> CPUID_ABM
          CPUID_P1GB -> CPUID_PAGE1GB
          CPUID2_PCLMUL -> CPUID2_PCLMULQDQ
          CPUID2_CID -> CPUID2_CNXTID
          CPUID2_xTPR -> CPUID2_XTPR
          CPUID2_AES -> CPUID2_AESNI
To match the x86 specification and the other OSes.

- Remove: CPUID_B10, CPUID_B20, CPUID_IA64. They do not exist.
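
[Editor's note: for illustration, a hedged sketch of how the renamed
flag reads at a use site. Assumptions: NetBSD's cpu_feature[] array
with the CPUID leaf 1 %ecx bits in element 1, and CPUID2_AESNI from
the x86 specialreg.h header.]

	#include <sys/types.h>
	#include <machine/specialreg.h>

	extern uint32_t cpu_feature[];	/* assumed MD declaration */

	static bool
	cpu_has_aesni(void)
	{
		/* element 1 carries the CPUID leaf 1 %ecx feature bits */
		return (cpu_feature[1] & CPUID2_AESNI) != 0;
	}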


1.4 25-Jul-2020 riastradh

Implement AES-CCM with x86 AES-NI.


1.3 25-Jul-2020 riastradh

Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
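
[Editor's note: a hedged sketch of the shape of the split-out header;
the field list is abridged and illustrative, not authoritative.]

	#include <sys/types.h>

	struct aesenc;

	/* aes_impl.h: dispatch table describing one AES implementation */
	struct aes_impl {
		const char *ai_name;
		int (*ai_probe)(void);
		void (*ai_setenckey)(struct aesenc *, const uint8_t *,
		    uint32_t);
		void (*ai_enc)(const struct aesenc *, const uint8_t[16],
		    uint8_t[16], uint32_t);
		/* ... decrypt, CBC, and XTS entry points elided ... */
	};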


1.2 30-Jun-2020 riastradh

New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.


1.1 29-Jun-2020 riastradh

Add x86 AES-NI support.

Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.
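
[Editor's note: for context, a hedged sketch of what AES-NI provides
-- one full round per instruction -- shown via the userland intrinsic;
the kernel code itself is assembly, and the wrapper is illustrative.]

	#include <wmmintrin.h>	/* compile with -maes */

	static inline __m128i
	aesni_round(__m128i state, __m128i roundkey)
	{
		/* ShiftRows+SubBytes+MixColumns+AddRoundKey in one shot */
		return _mm_aesenc_si128(state, roundkey);
	}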


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.3 25-Jul-2020 riastradh

Implement AES-CCM with x86 AES-NI.


1.2 25-Jul-2020 riastradh

Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.


1.1 29-Jun-2020 riastradh

Add x86 AES-NI support.

Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.6 27-Jul-2020 riastradh

Add RCSIDs to the AES and ChaCha .S sources.


1.5 27-Jul-2020 riastradh

Align critical-path loops in AES and ChaCha.


1.4 25-Jul-2020 riastradh

Implement AES-CCM with x86 AES-NI.


1.3 25-Jul-2020 riastradh

Invert some loops to save a jmp instruction on each iteration.

No semantic change intended.
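
[Editor's note: a hedged C analogue of the transformation, with a
hypothetical process16 helper. A top-tested loop naively compiles to
a conditional branch plus a back-edge jmp per iteration; the
inverted, bottom-tested form needs only the conditional branch.]

	#include <stddef.h>
	#include <stdint.h>

	extern void process16(const uint8_t *);

	/* before: test at the top of the loop */
	static void
	loop_top_tested(const uint8_t *p, size_t n)
	{
		while (n >= 16) {
			process16(p);
			p += 16;
			n -= 16;
		}
	}

	/* after: inverted, test at the bottom */
	static void
	loop_inverted(const uint8_t *p, size_t n)
	{
		if (n >= 16) {
			do {
				process16(p);
				p += 16;
				n -= 16;
			} while (n >= 16);
		}
	}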


1.2 30-Jun-2020 riastradh

Use .p2align rather than .align.

Apparently on arm, .align is actually an alias for .p2align, taking a
power of two rather than a number of bytes, so aes_armv8_64.o was
bloated to 32KB with obscene alignment when it only needed to be
barely past 4KB.

Do the same for the x86 aes_ni_64.S -- even though .align takes a
number of bytes rather than a power of two on x86, let's just stay
away from the temptations of the evil .align directive.
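
[Editor's note: the directive semantics at issue, as a minimal
sketch; behavior per the GNU assembler documentation.]

	/* x86:  .align 16   requests 16-byte alignment (a byte count) */
	/* arm:  .align 16   requests 2^16-byte alignment (a power)    */
	/* everywhere: .p2align 4 requests 2^4 = 16-byte alignment     */
	__asm__(".p2align 4");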


1.1 29-Jun-2020 riastradh

Add x86 AES-NI support.

Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.


1.3 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel
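
[Editor's note: a minimal sketch of one step under this layout,
assuming q[0..7] each hold one bit plane of the interleaved blocks
and sk[] is the bitsliced round key. AddRoundKey degenerates to
eight vector XORs.]

	#include <emmintrin.h>

	static inline void
	sliced_addroundkey(__m128i q[8], const __m128i sk[8])
	{
		for (unsigned i = 0; i < 8; i++)
			q[i] = _mm_xor_si128(q[i], sk[i]);
	}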


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.2 30-Jun-2020 riastradh

New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.


1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
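
[Editor's note: a hedged sketch of that kludge -- a local stand-in
for one intrinsic, written with the GCC/Clang vector extensions that
are available in the kernel build; names hypothetical.]

	typedef long long ks_m128i
	    __attribute__((__vector_size__(16), __may_alias__));

	static inline ks_m128i
	ks_mm_xor_si128(ks_m128i a, ks_m128i b)
	{
		return a ^ b;	/* vector XOR; lowers to PXOR under SSE2 */
	}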


1.5 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.4 25-Jul-2020 riastradh

Implement AES-CCM with SSE2.


1.3 25-Jul-2020 riastradh

Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.


1.2 29-Jun-2020 riastradh

Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.
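
[Editor's note: a hedged sketch of the resulting call pattern.
fpu_kern_enter and fpu_kern_leave are the real NetBSD primitives;
the inner function name and prototype are illustrative.]

	void
	aes_sse2_enc(const struct aesenc *enc, const uint8_t in[16],
	    uint8_t out[16], uint32_t nrounds)
	{
		fpu_kern_enter();
		/* only this callee lives in a unit built with -msse2 */
		aes_sse2_enc_impl(enc, in, out, nrounds);
		fpu_kern_leave();
	}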


1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.


1.1 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel


1.2 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.


1.2 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.


1.6 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.5 25-Jul-2020 riastradh

Implement AES-CCM with SSE2.


1.4 25-Jul-2020 riastradh

Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.


1.3 30-Jun-2020 riastradh

New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.


1.2 29-Jun-2020 riastradh

Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.


1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.


1.4 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base perseant-exfatfs-base-20240630 perseant-exfatfs-base thorpej-ifq-base thorpej-altq-separation-base
1.3 07-Aug-2023 rin

sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.


Revision tags: netbsd-10-1-RELEASE netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.2 29-Jun-2020 riastradh

Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.


1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.


1.5 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.4 08-Sep-2020 riastradh

aes(9): Fix edge case in bitsliced SSE2 AES-CBC decryption.

Make sure self-tests exercise this edge case.

Discovered by confusion over code inspection of jak's adaptation of
aes_armv8_64.S for big-endian.


1.3 25-Jul-2020 riastradh

Implement AES-CCM with SSE2.


1.2 30-Jun-2020 riastradh

New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.


1.1 29-Jun-2020 riastradh

Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.2 30-Jun-2020 riastradh

New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.


1.1 29-Jun-2020 riastradh

New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
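
[Editor's note: for context, a hedged sketch of the SSSE3 building
block vpaes leans on -- PSHUFB as a 16-entry byte lookup table. The
intrinsic is real; the wrapper is illustrative.]

	#include <tmmintrin.h>	/* compile with -mssse3 */

	static inline __m128i
	nibble_lookup(__m128i table, __m128i nibbles)
	{
		/* each byte of nibbles (0..15) selects a byte of table */
		return _mm_shuffle_epi8(table, nibbles);
	}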


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.3 25-Jul-2020 riastradh

Implement AES-CCM with SSSE3.


1.2 25-Jul-2020 riastradh

Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.


1.1 29-Jun-2020 riastradh

New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.4 25-Jul-2020 riastradh

Implement AES-CCM with SSSE3.


1.3 25-Jul-2020 riastradh

Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.


1.2 30-Jun-2020 riastradh

New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.


1.1 29-Jun-2020 riastradh

New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base perseant-exfatfs-base-20240630 perseant-exfatfs-base thorpej-ifq-base thorpej-altq-separation-base
1.2 07-Aug-2023 rin

sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.


Revision tags: netbsd-10-1-RELEASE netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.1 29-Jun-2020 riastradh

New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.3 25-Jul-2020 riastradh

Implement AES-CCM with SSSE3.


1.2 30-Jun-2020 riastradh

New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.


1.1 29-Jun-2020 riastradh

New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.


1.10 22-Nov-2025 riastradh

aes(9): New aes_keysched_enc/dec.

These implement the standard key schedule. They are named
independently of any particular AES implementation, so that:

(a) we can swap between the BearSSL aes_ct and aes_ct64 code without
changing all the callers who don't care which one they get, and

(b) we could push it into the aes_impl abstraction if we wanted.

This eliminates all br_aes_* references outside aes_bear.c, aes_ct*.c,
and the new aes_keysched.c wrappers.

Preparation for:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel
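
[Editor's note: a hedged sketch of the wrapper shape described here.
BearSSL's br_aes_ct_keysched is real; the struct field and return
convention are assumptions for illustration.]

	u_int
	aes_keysched_enc(struct aesenc *enc, const uint8_t *key,
	    size_t keylen)
	{
		/* standard AES key schedule, implementation-neutral name */
		return br_aes_ct_keysched(enc->aese_aes.aes_rk, key, keylen);
	}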


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base perseant-exfatfs-base-20240630 perseant-exfatfs-base
1.9 16-Jun-2024 rillig

sys/aes_via: fix broken link in comment


1.8 16-Jun-2024 christos

revert previous, probably a gcc bug?


1.7 16-Jun-2024 christos

try to fix the overflow gcc pointed out.


Revision tags: netbsd-10-1-RELEASE netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.6 28-Jul-2020 riastradh

Initialize authctr in both branches.

I guess I didn't test the unaligned case, weird.


1.5 25-Jul-2020 riastradh

Implement AES-CCM with VIA ACE.


1.4 25-Jul-2020 riastradh

Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.


1.3 30-Jun-2020 riastradh

New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.


1.2 29-Jun-2020 riastradh

VIA AES: Batch AES-XTS computation into eight blocks at a time.

Experimental -- performance improvement is not clearly worth the
complexity.


1.1 29-Jun-2020 riastradh

Add AES implementation with VIA ACE.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.2 25-Jul-2020 riastradh

Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.


1.1 29-Jun-2020 riastradh

Add AES implementation with VIA ACE.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.1 29-Jun-2020 riastradh

Add x86 AES-NI support.

Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.


1.3 23-Nov-2025 riastradh

aes(9): Rewrite x86 SSE2 implementation.

This computes eight AES_k instances simultaneously, using the
bitsliced 32-bit aes_ct logic which computes two blocks at a time in
uint32_t arithmetic, vectorized four ways.

Previously, the SSE2 code was a very naive adaptation of aes_ct64,
which computes four blocks at a time in uint64_t arithmetic, without
any 2x vectorization -- I did it at the time because:

(a) it was easier to get working,
(b) it only affects really old hardware with neither AES-NI nor SSSE3,
which are both much, much faster.

But it was bugging me that this was a kind of dumb use of SSE2.

Substantially reduces stack usage (from ~1200 bytes to ~800 bytes)
and should approximately double throughput for CBC decryption and for
XTS encryption/decryption.

I also tried a 2x64 version but cursory performance measurements
didn't reveal much benefit over 4x32. (If anyone is interested in
doing more serious performance measurements, on ancient hardware for
which it might matter, I also have the 2x64 code around.)

Prompted by:

PR kern/59774: bearssl 32-bit AES is too slow, want 64-bit optimized
version in kernel


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.2 29-Jun-2020 riastradh

Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.


1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.1 29-Jun-2020 riastradh

New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.


Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.1 29-Jun-2020 riastradh

Add AES implementation with VIA ACE.


Revision tags: perseant-exfatfs-base-20250801 perseant-exfatfs-base-20240630 perseant-exfatfs-base
1.6 07-Aug-2023 rin

sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.


Revision tags: netbsd-10-1-RELEASE netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.5 25-Jul-2020 riastradh

Add some Intel intrinsics for ChaCha.

_mm_load1_ps
_mm_loadu_si128
_mm_movelh_ps
_mm_slli_epi32
_mm_storeu_si128
_mm_unpackhi_epi32
_mm_unpacklo_epi32
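
[Editor's note: a hedged sketch of one entry from the list above, in
the local-builtin style these headers use. GCC's
__builtin_ia32_loaddqu is real; the typedef and wrapper name are
illustrative.]

	typedef long long ks_m128i
	    __attribute__((__vector_size__(16), __may_alias__));

	static inline ks_m128i
	ks_mm_loadu_si128(const ks_m128i *p)
	{
		return (ks_m128i)__builtin_ia32_loaddqu((const char *)p);
	}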


1.4 25-Jul-2020 riastradh

Fix target attribute on _mm_movehl_ps, fix clang _mm_unpacklo_epi64.

- _mm_movehl_ps is available in SSE2, no need for SSSE3.
- _mm_unpacklo_epi64 operates on v2di, not v4si; fix.
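
[Editor's note: a hedged sketch of the corrected shape -- the v2di
operand type is the point of the fix. Builtin per GCC; the typedef
and wrapper name are illustrative.]

	typedef long long ks_v2di __attribute__((__vector_size__(16)));

	static inline ks_v2di __attribute__((__target__("sse2")))
	ks_mm_unpacklo_epi64(ks_v2di a, ks_v2di b)
	{
		/* interleave the low 64-bit halves of a and b */
		return __builtin_ia32_punpcklqdq128(a, b);
	}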


1.3 25-Jul-2020 riastradh

Implement AES-CCM with SSSE3.


1.2 29-Jun-2020 riastradh

New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.


1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.


Revision tags: perseant-exfatfs-base-20250801 perseant-exfatfs-base-20240630 perseant-exfatfs-base
1.2 07-Aug-2023 rin

sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.


Revision tags: netbsd-10-1-RELEASE netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.1 29-Jun-2020 riastradh

New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.