History log of /src/sys/crypto/aes
Revision   Date         Author      Comments
 1.4 25-Jul-2020  riastradh Split aes_cbc_* and aes_xts_* into their own header files.

aes.h will remain just for key setup; any particular construction using
AES can have its own header file so we can have many of them without
rebuilding everything AES-related whenever one of them changes.

(Planning to add AES-CCM and AES-GCM too.)
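For illustration, a CBC consumer now pulls in only the pieces it uses; a
sketch assuming the usual <crypto/aes/...> paths and the key-setup names
from aes.h:

    /* aes.h keeps only key setup; each construction gets its own header. */
    #include <crypto/aes/aes.h>      /* struct aesenc, aes_setenckey128, &c. */
    #include <crypto/aes/aes_cbc.h>  /* aes_cbc_enc, aes_cbc_dec */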
 1.3 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
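For orientation, the struct being split out looks roughly like this (an
abridged sketch, not the verbatim header):

    struct aes_impl {
        const char *ai_name;
        int  (*ai_probe)(void);
        void (*ai_setenckey)(struct aesenc *, const uint8_t *, uint32_t);
        void (*ai_setdeckey)(struct aesdec *, const uint8_t *, uint32_t);
        void (*ai_enc)(const struct aesenc *, const uint8_t in[16],
                uint8_t out[16], uint32_t nrounds);
        void (*ai_dec)(const struct aesdec *, const uint8_t in[16],
                uint8_t out[16], uint32_t nrounds);
        /* ... CBC/XTS batch operations, later CBC-MAC/CCM too ... */
    };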
 1.2 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
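The flavour of those local definitions, as a minimal sketch (one
hypothetical wrapper; the real file covers many more intrinsics, and the
exact builtins differ between gcc and clang):

    /* Stand-in for the unavailable emmintrin.h: a 128-bit vector type
     * and one Intel-style intrinsic via GCC/Clang vector extensions. */
    typedef long long __m128i __attribute__((__vector_size__(16), __may_alias__));
    typedef long long __v2di  __attribute__((__vector_size__(16)));

    static inline __m128i
    _mm_xor_si128(__m128i a, __m128i b)
    {
        return (__m128i)((__v2di)a ^ (__v2di)b);
    }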
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
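The selection policy boils down to probing a preference-ordered table and
keeping the first candidate that passes its self-test; a minimal kernel
sketch with illustrative names (the real table and policy live in the aes
glue code, and an MD override can pre-empt it):

    /* Illustrative boot-time selection: hardware AES first, then
     * bitsliced SSE2, then the portable constant-time C code. */
    static const struct aes_impl *const aes_impl_list[] = {
        &aes_ni_impl, &aes_sse2_impl, &aes_bear_impl,
    };

    static const struct aes_impl *aes_impl;

    static void
    aes_select(void)
    {
        for (size_t i = 0; i < __arraycount(aes_impl_list); i++) {
            const struct aes_impl *impl = aes_impl_list[i];
            if (impl->ai_probe() == 0 && aes_selftest(impl) == 0) {
                aes_impl = impl;    /* first probe + self-test win */
                return;
            }
        }
        panic("AES: no implementation passed self-test");
    }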
 1.4 25-Jul-2020  riastradh Implement AES-CCM with BearSSL's bitsliced 32-bit aes_ct.
 1.3 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
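A minimal ATF sketch of the idea; the aes_impls[] table and the
aes_selftest() prototype are assumptions standing in for whatever kernel
symbols the test actually links against:

    #include <atf-c.h>
    #include <stddef.h>
    #include <crypto/aes/aes_impl.h>        /* struct aes_impl */

    extern const struct aes_impl *const aes_impls[];    /* NULL-terminated */
    int aes_selftest(const struct aes_impl *);

    ATF_TC_WITHOUT_HEAD(aes_selftest_all);
    ATF_TC_BODY(aes_selftest_all, tc)
    {
        for (size_t i = 0; aes_impls[i] != NULL; i++) {
            if (aes_impls[i]->ai_probe() != 0)
                continue;       /* not supported on this hardware */
            ATF_CHECK_EQ_MSG(0, aes_selftest(aes_impls[i]),
                "%s failed self-test", aes_impls[i]->ai_name);
        }
    }

    ATF_TP_ADD_TCS(tp)
    {
        ATF_TP_ADD_TC(tp, aes_selftest_all);
        return atf_no_error();
    }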
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.2 29-Jun-2020  riastradh Provide the standard AES key schedule.

Different AES implementations prefer different variations on it, but
some of them -- notably VIA -- require the standard key schedule to
be available and don't provide hardware support for computing it
themselves. So adapt BearSSL's logic to generate the standard key
schedule (and decryption keys, with InvMixColumns), rather than the
bitsliced key schedule that BearSSL uses natively.
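Concretely, the "standard key schedule" is FIPS-197 key expansion; an
illustrative from-scratch AES-128 version (not the BearSSL-derived kernel
code; the 256-entry S-box is abridged, and the decryption schedule would
additionally apply InvMixColumns):

    #include <stdint.h>

    /* FIPS-197 S-box; first 16 entries shown, remaining 240 elided. */
    static const uint8_t sbox[256] = {
        0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5,
        0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
        /* ... */
    };

    static uint32_t
    subword(uint32_t w)
    {
        return (uint32_t)sbox[(w >> 24) & 0xff] << 24 |
               (uint32_t)sbox[(w >> 16) & 0xff] << 16 |
               (uint32_t)sbox[(w >>  8) & 0xff] <<  8 |
               (uint32_t)sbox[w & 0xff];
    }

    /* Expand a 16-byte key into 44 round-key words (FIPS-197 sec. 5.2). */
    static void
    aes128_expand(uint32_t rk[44], const uint8_t key[16])
    {
        static const uint32_t rcon[10] = {
            0x01000000, 0x02000000, 0x04000000, 0x08000000, 0x10000000,
            0x20000000, 0x40000000, 0x80000000, 0x1b000000, 0x36000000,
        };

        for (unsigned i = 0; i < 4; i++)
            rk[i] = (uint32_t)key[4*i] << 24 | (uint32_t)key[4*i + 1] << 16 |
                (uint32_t)key[4*i + 2] << 8 | key[4*i + 3];
        for (unsigned i = 4; i < 44; i++) {
            uint32_t t = rk[i - 1];
            if (i % 4 == 0)     /* RotWord, SubWord, round constant */
                t = subword(t << 8 | t >> 24) ^ rcon[i/4 - 1];
            rk[i] = rk[i - 4] ^ t;
        }
    }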
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.1 25-Jul-2020  riastradh Split aes_cbc_* and aes_xts_* into their own header files.

aes.h will remain just for key setup; any particular construction using
AES can have its own header file so we can have many of them without
rebuilding everything AES-related whenever one of them changes.

(Planning to add AES-CCM and AES-GCM too.)
 1.6 17-Oct-2021  jmcneill Upgrade self-test passed messages from verbose to debug.
 1.5 10-Aug-2020  rin Add hack to compile aes_ccm_tag() with -O0 on m68k for GCC 8.

GCC 8 miscompiles aes_ccm_tag() for m68k with optimization level -O[12],
which results in failure in aes_ccm_selftest():

| aes_ccm_selftest: tag 0: 8 bytes @ 0x4d3e38
| 03 80 5f 08 22 6f cb fe | .._."o..
| aes_ccm_selftest: verify 0 failed
| ...
| WARNING: module error: built-in module aes_ccm failed its MODULE_CMD_INIT, error 5

This is observed for amiga (A1200, 68060), mac68k (Quadra 840AV, 68040),
and luna68k (nono, 68030 emulator), but not for sun3 (TME, 68020 emulator)
or sun2 (TME, 68010 emulator). At the moment, it is unclear whether this
is due to differences between 68010-20 and 68030-60, or to something
wrong with TME.
 1.4 27-Jul-2020  riastradh Gather auth[16] and ctr[16] into one authctr[32].

Should appease clang.
 1.3 26-Jul-2020  riastradh Ensure aes_ccm module init runs after aes module init.

Otherwise the AES implementation might not be selected early enough.
 1.2 25-Jul-2020  riastradh Push CBC-MAC and CCM block updates into the aes_impl API.

This should help reduce the setup and teardown overhead (enabling and
disabling fpu, or expanding bitsliced keys) for CCM, as used in
802.11 WPA2 CCMP. But all the fiddly formatting details remain in
aes_ccm.c to reduce the effort of implementing it -- at the cost of a
handful of additional setups and teardowns per message.

Not yet implemented by any of the aes_impls, so leave a fallback that
just calls aes_enc for now. This should be removed when all of the
aes_impls provide CBC-MAC and CCM block updates.
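The interim fallback amounts to building the CBC-MAC update from plain
aes_enc; a sketch, assuming the aes_enc signature from aes.h (the helper
name here is illustrative):

    #include <crypto/aes/aes.h>     /* struct aesenc, aes_enc */

    /* CBC-MAC block update via the plain block cipher:
     * MAC = E_K(MAC ^ m) for each 16-byte block m. */
    static void
    aes_cbcmac_update1_fallback(const struct aesenc *enc, const uint8_t *p,
        size_t nbytes, uint8_t auth[16], uint32_t nrounds)
    {
        for (; nbytes >= 16; p += 16, nbytes -= 16) {
            for (unsigned i = 0; i < 16; i++)
                auth[i] ^= p[i];
            aes_enc(enc, auth, auth, nrounds);
        }
    }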
 1.1 25-Jul-2020  riastradh New aes_ccm API.

Intended for use in net80211 for WPA2 CCMP.
 1.2 27-Jul-2020  riastradh Gather auth[16] and ctr[16] into one authctr[32].

Should appease clang.
 1.1 25-Jul-2020  riastradh New aes_ccm API.

Intended for use in net80211 for WPA2 CCMP.
 1.1 25-Jul-2020  riastradh New aes_ccm API.

Intended for use in net80211 for WPA2 CCMP.
 1.1 25-Jul-2020  riastradh New aes_ccm API.

Intended for use in net80211 for WPA2 CCMP.
 1.3 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.2 29-Jun-2020  riastradh Provide the standard AES key schedule.

Different AES implementations prefer different variations on it, but
some of them -- notably VIA -- require the standard key schedule to
be available and don't provide hardware support for computing it
themselves. So adapt BearSSL's logic to generate the standard key
schedule (and decryption keys, with InvMixColumns), rather than the
bitsliced key schedule that BearSSL uses natively.
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.2 29-Jun-2020  riastradh Provide the standard AES key schedule.

Different AES implementations prefer different variations on it, but
some of them -- notably VIA -- require the standard key schedule to
be available and don't provide hardware support for computing it
themselves. So adapt BearSSL's logic to generate the standard key
schedule (and decryption keys, with InvMixColumns), rather than the
bitsliced key schedule that BearSSL uses natively.
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.10 05-Nov-2022  jmcneill Make aes and chacha prints debug only.
 1.9 27-Jul-2020  riastradh New sysctl subtree kern.crypto.

kern.crypto.aes.selected (formerly hw.aes_impl)
kern.crypto.chacha.selected (formerly hw.chacha_impl)

XXX Should maybe deduplicate creation of kern.crypto.
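Illustrative userland query of the renamed node, assuming it is a
string-valued sysctl like its predecessor:

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        char name[128];
        size_t len = sizeof(name);

        /* kern.crypto.aes.selected was formerly hw.aes_impl. */
        if (sysctlbyname("kern.crypto.aes.selected", name, &len,
            NULL, 0) == -1) {
            perror("sysctlbyname");
            return 1;
        }
        printf("%s\n", name);
        return 0;
    }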
 1.8 25-Jul-2020  riastradh Make aes boot message verbose-only.
 1.7 25-Jul-2020  riastradh Remove now-needless AES-CCM fallback logic.

These paths are no longer exercised because all of the aes_impls now
do the AES-CCM operations.
 1.6 25-Jul-2020  riastradh Push CBC-MAC and CCM block updates into the aes_impl API.

This should help reduce the setup and teardown overhead (enabling and
disabling fpu, or expanding bitsliced keys) for CCM, as used in
802.11 WPA2 CCMP. But all the fiddly formatting details remain in
aes_ccm.c to reduce the effort of implementing it -- at the cost of a
handful of additional setups and teardowns per message.

Not yet implemented by any of the aes_impls, so leave a fallback that
just calls aes_enc for now. This should be removed when all of the
aes_impls provide CBC-MAC and CCM block updates.
 1.5 25-Jul-2020  riastradh Split aes_cbc_* and aes_xts_* into their own header files.

aes.h will remain just for key setup; any particular construction using
AES can have its own header file so we can have many of them without
rebuilding everything AES-related whenever one of them changes.

(Planning to add AES-CCM and AES-GCM too.)
 1.4 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.3 30-Jun-2020  riastradh New sysctl node hw.aes_impl for selected AES implementation.
 1.2 29-Jun-2020  riastradh Provide the standard AES key schedule.

Different AES implementations prefer different variations on it, but
some of them -- notably VIA -- require the standard key schedule to
be available and don't provide hardware support for computing it
themselves. So adapt BearSSL's logic to generate the standard key
schedule (and decryption keys, with InvMixColumns), rather than the
bitsliced key schedule that BearSSL uses natively.
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.2 25-Jul-2020  riastradh Push CBC-MAC and CCM block updates into the aes_impl API.

This should help reduce the setup and teardown overhead (enabling and
disabling fpu, or expanding bitsliced keys) for CCM, as used in
802.11 WPA2 CCMP. But all the fiddly formatting details remain in
aes_ccm.c to reduce the effort of implementing it -- at the cost of a
handful of additional setups and teardowns per message.

Not yet implemented by any of the aes_impls, so leave a fallback that
just calls aes_enc for now. This should be removed when all of the
aes_impls provide CBC-MAC and CCM block updates.
 1.1 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.3 25-Jul-2020  riastradh Remove now-unused legacy rijndael API.
 1.2 25-Jul-2020  riastradh Split aes_cbc_* and aes_xts_* into their own header files.

aes.h will remain just for key setup; any particular construction using
AES can have its own header file so we can have many of them without
rebuilding everything AES-related whenever one of them changes.

(Planning to add AES-CCM and AES-GCM too.)
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.7 05-Dec-2021  msaitoh s/folllowing/following/
 1.6 08-Sep-2020  riastradh aes(9): Fix edge case in bitsliced SSE2 AES-CBC decryption.

Make sure self-tests exercise this edge case.

Discovered by confusion over code inspection of jak's adaptation of
aes_armv8_64.S for big-endian.
 1.5 25-Jul-2020  riastradh Remove now-needless AES-CCM fallback logic.

These paths are no longer exercised because all of the aes_impls now
do the AES-CCM operations.
 1.4 25-Jul-2020  riastradh Push CBC-MAC and CCM block updates into the aes_impl API.

This should help reduce the setup and teardown overhead (enabling and
disabling fpu, or expanding bitsliced keys) for CCM, as used in
802.11 WPA2 CCMP. But all the fiddly formatting details remain in
aes_ccm.c to reduce the effort of implementing it -- at the cost of a
handful of additional setups and teardowns per message.

Not yet implemented by any of the aes_impls, so leave a fallback that
just calls aes_enc for now. This should be removed when all of the
aes_impls provide CBC-MAC and CCM block updates.
 1.3 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.1 25-Jul-2020  riastradh Split aes_cbc_* and aes_xts_* into their own header files.

aes.h will remain just for key setup; any particular construction using
AES can have its own header file so we can have many of them without
rebuilding everything AES-related whenever one of them changes.

(Planning to add AES-CCM and AES-GCM too.)
 1.3 25-Jul-2020  riastradh Remove now-unused legacy rijndael API.
 1.2 25-Jul-2020  riastradh New aes_ccm API.

Intended for use in net80211 for WPA2 CCMP.
 1.1 29-Jun-2020  riastradh Rework AES in kernel to finally address CVE-2005-1797.

1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.

This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES takes 2-3x the
CPU time. But that's what hardware support will be coming for.

Rudimentary measurement of performance impact done by:

mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k

The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.

Discussed on tech-kern:

https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
 1.5 25-Jul-2020  riastradh Implement AES-CCM with ARMv8.5-AES.
 1.4 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.3 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.2 29-Jun-2020  riastradh Move aarch64/fpu.h to arm/fpu.h.
 1.1 29-Jun-2020  riastradh Implement AES in kernel using ARMv8.0-AES on aarch64.
 1.3 25-Jul-2020  riastradh Implement AES-CCM with ARMv8.5-AES.
 1.2 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.1 29-Jun-2020  riastradh Implement AES in kernel using ARMv8.0-AES on aarch64.
 1.15 08-Sep-2020  riastradh aesarmv8: Reallocate registers to shave off unnecessary MOV.
 1.14 08-Sep-2020  riastradh aesarmv8: Issue two 4-register ld/st, not four 2-register ld/st.
 1.13 08-Sep-2020  riastradh aesarmv8: Adapt aes_armv8_64.S to big-endian.

Patch mainly from (and tested by) jakllsch@ with minor tweaks by me.
 1.12 08-Aug-2020  riastradh Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.

New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

static const uint8_t x8[16] = {...};

uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
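The shape of the mechanism, as a purely hypothetical sketch -- one
initializer macro whose lane order is chosen per compiler and endianness;
the concrete orderings below are placeholders, not the audited ones:

    /* Hypothetical: pick the lane order so VQ_N_U32(a,b,c,d) denotes the
     * same architectural vector under every compiler/endianness combo. */
    #if defined(__GNUC__) && !defined(__clang__) && defined(__ARM_BIG_ENDIAN)
    #define VQ_N_U32(a, b, c, d)    { (d), (c), (b), (a) }
    #else
    #define VQ_N_U32(a, b, c, d)    { (a), (b), (c), (d) }
    #endif

    /* Usage: static const uint32x4_t c = VQ_N_U32(0x63,0x7c,0x77,0x7b); */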
 1.11 27-Jul-2020  riastradh Add RCSIDs to the AES and ChaCha .S sources.
 1.10 27-Jul-2020  riastradh Issue aese/aesmc and aesd/aesimc in pairs.

Advised by the aarch64 optimization guide; increases cgd throughput
by about 10%.
 1.9 27-Jul-2020  riastradh Align critical-path loops in AES and ChaCha.
 1.8 25-Jul-2020  riastradh Implement AES-CCM with ARMv8.5-AES.
 1.7 25-Jul-2020  riastradh Invert some loops to save a branch instruction on every iteration.
 1.6 22-Jul-2020  riastradh Fix register name in comment.

Some time ago I reallocated the registers to avoid inadvertently
clobbering the callee-saves v9, but neglected to update the comment.
 1.5 19-Jul-2020  ryo fix build with clang/llvm.

clang's aarch64 assembler doesn't accept an optional number of lanes on a
vector register (but the ARM ARM says that an assembler must accept it).
 1.4 30-Jun-2020  riastradh Reallocate registers to avoid abusing callee-saves registers, v8-v15.

Forgot to consult the AAPCS before committing this before -- oops!

While here, take advantage of the 32 aarch64 simd registers to avoid
all stack spills.
 1.3 30-Jun-2020  riastradh Use `.arch_extension aes' for aese/aesmc/aesd/aesimc.

Unlike `.arch_extension crypto', this works with clang; both work
with gas, so we'll go with this.

Clang still can't handle aes_armv8_64.S yet -- it gets confused by
dup and mov on lanes, but this makes progress.
 1.2 30-Jun-2020  riastradh Use .p2align rather than .align.

Apparently on arm, .align is actually an alias for .p2align, taking a
power of two rather than a number of bytes, so aes_armv8_64.o was
bloated to 32KB with obscene alignment when it only needed to be
barely past 4KB.

Do the same for the x86 aes_ni_64.S -- even though .align takes a
number of bytes rather than a power of two on x86, let's just stay
away from the temptations of the evil .align directive.
 1.1 29-Jun-2020  riastradh Implement AES in kernel using ARMv8.0-AES on aarch64.
 1.6 21-Nov-2020  rin Fix build with clang for earmv7hf; loadroundkey() is used only for __aarch64__.
 1.5 08-Aug-2020  riastradh branches: 1.5.2;
Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.

New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

static const uint8_t x8[16] = {...};

uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
 1.4 28-Jul-2020  riastradh Draft 2x vectorized neon vpaes for aarch64.

Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in
cgd tests, for parallelizable operations like CBC decryption; same
improvement should probably carry over to the rpi4 CPU, which lacks
ARMv8.0-AES.
 1.3 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.2 29-Jun-2020  riastradh Provide hand-written AES NEON assembly for arm32.

gcc does a lousy job at compiling 128-bit NEON intrinsics on arm32;
hand-writing it made it about 12x faster, by avoiding a zillion loads
and stores to spill everything and the kitchen sink onto the stack.
(But gcc does fine on aarch64, presumably because it has twice as
many registers and doesn't have to deal with q2=d4/d5 overlapping.)
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using ARM NEON.

Also derived from Mike Hamburg's public-domain vpaes code.
 1.5.2.1 14-Dec-2020  thorpej Sync w/ HEAD.
 1.3 25-Jul-2020  riastradh Implement AES-CCM with NEON.
 1.2 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using ARM NEON.

Also derived from Mike Hamburg's public-domain vpaes code.
 1.11 10-Sep-2020  riastradh aes neon: Gather mc_forward/backward so we can load 256 bits at once.
 1.10 10-Sep-2020  riastradh aes neon: Hoist dsbd/dsbe address calculation out of loop.
 1.9 10-Sep-2020  riastradh aes neon: Tweak register usage.

- Call r12 by its usual name, ip.
- No need for r7 or r11=fp at the moment.
 1.8 10-Sep-2020  riastradh aes neon: Write vtbl with {qN} rather than {d(2N)-d(2N+1)}.

Cosmetic; no functional change.
 1.7 10-Sep-2020  riastradh aes neon: Issue 256-bit loads rather than pairs of 128-bit loads.

Not sure why I didn't realize you could do this before!

Saves some temporary registers that can now be allocated to shave off
a few cycles.
 1.6 16-Aug-2020  riastradh Fix AES NEON code for big-endian softfp ARM.

...which is how the kernel runs. Switch to using __SOFTFP__ for
consistency with how it gets exposed to C, although I'm not sure how
to get it defined automagically in the toolchain for .S files so
that's set manually in files.aesneon for now.
 1.5 08-Aug-2020  riastradh Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.

New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

static const uint8_t x8[16] = {...};

uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
 1.4 27-Jul-2020  riastradh Add RCSIDs to the AES and ChaCha .S sources.
 1.3 27-Jul-2020  riastradh Align critical-path loops in AES and ChaCha.
 1.2 27-Jul-2020  riastradh PIC for aes_neon_32.S.

Without this, tests/sys/crypto/aes/t_aes fails to start on armv7
because of R_ARM_ABS32 relocations in a nonwritable text segment for
a PIE -- which atf quietly ignores in the final report! Yikes.
 1.1 29-Jun-2020  riastradh Provide hand-written AES NEON assembly for arm32.

gcc does a lousy job at compiling 128-bit NEON intrinsics on arm32;
hand-writing it made it about 12x faster, by avoiding a zillion loads
and stores to spill everything and the kitchen sink onto the stack.
(But gcc does fine on aarch64, presumably because it has twice as
many registers and doesn't have to deal with q2=d4/d5 overlapping.)
 1.5 10-Oct-2020  jmcneill Fix detection of NEON features. ID_AA64PFR0_EL1_ADV_SIMD_NONE means SIMD
is not available, and any other value means it is.
 1.4 25-Jul-2020  riastradh Implement AES-CCM with NEON.
 1.3 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using ARM NEON.

Also derived from Mike Hamburg's public-domain vpaes code.
 1.4 07-Aug-2023  rin sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.
 1.3 08-Aug-2020  riastradh Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.

New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

static const uint8_t x8[16] = {...};

uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
 1.2 28-Jul-2020  riastradh Draft 2x vectorized neon vpaes for aarch64.

Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in
cgd tests, for parallelizable operations like CBC decryption; same
improvement should probably carry over to the rpi4 CPU, which lacks
ARMv8.0-AES.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using ARM NEON.

Also derived from Mike Hamburg's public-domain vpaes code.
 1.8 26-Jun-2022  riastradh arm/aes_neon: Fix formatting of self-test failure message.

Discovered by code inspection. Remarkably, a combination of errors
made this fail to be a stack buffer overrun. Verified by booting
with ARMv8.0-AES disabled and with the self-test artificially made to
fail.
 1.7 09-Aug-2020  riastradh Use vshlq_n_s32 rather than vsliq_n_s32 with zero destination.

Not sure why I reached for vsliq_n_s32 at first -- probably so I
wouldn't have to deal with a new intrinsic in arm_neon.h!
 1.6 09-Aug-2020  riastradh Nix outdated comment.

I implemented this parallelism a couple weeks ago.
 1.5 08-Aug-2020  riastradh Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.

New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

static const uint8_t x8[16] = {...};

uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
 1.4 28-Jul-2020  riastradh Draft 2x vectorized neon vpaes for aarch64.

Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in
cgd tests, for parallelizable operations like CBC decryption; same
improvement should probably carry over to the rpi4 CPU, which lacks
ARMv8.0-AES.
 1.3 25-Jul-2020  riastradh Implement AES-CCM with NEON.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using ARM NEON.

Also derived from Mike Hamburg's public-domain vpaes code.
 1.13 07-Aug-2023  rin sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.
 1.12 07-Aug-2023  rin sys/crypto/{aes,chacha}/arch/arm/arm_neon.h: Sync (whitespace fix)

No binary changes.
 1.11 07-Sep-2020  jakllsch Fix vgetq_lane_u32 for aarch64eb with GCC

Fixes NEON AES on aarch64eb
 1.10 09-Aug-2020  riastradh Fix some clang neon intrinsics.

Compile-tested only, with -Wno-nonportable-vector-initializers. Need
to address -- and test -- this stuff properly but this is progress.
 1.9 09-Aug-2020  riastradh Use vshlq_n_s32 rather than vsliq_n_s32 with zero destination.

Not sure why I reached for vsliq_n_s32 at first -- probably so I
wouldn't have to deal with a new intrinsic in arm_neon.h!
 1.8 08-Aug-2020  riastradh Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.

New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

static const uint8_t x8[16] = {...};

uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
 1.7 28-Jul-2020  riastradh Draft 2x vectorized neon vpaes for aarch64.

Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in
cgd tests, for parallelizable operations like CBC decryption; same
improvement should probably carry over to the rpi4 CPU, which lacks
ARMv8.0-AES.
 1.6 25-Jul-2020  riastradh Add 32-bit load, store, and shift intrinsics.

vld1q_u32
vst1q_u32
vshlq_n_u32
vshrq_n_u32
 1.5 25-Jul-2020  riastradh Fix missing clang big-endian case.
 1.4 25-Jul-2020  riastradh Implement AES-CCM with NEON.
 1.3 23-Jul-2020  ryo fix build with llvm/clang.
 1.2 30-Jun-2020  riastradh Tweak clang neon intrinsics so they build.

(this file is still a kludge)
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using ARM NEON.

Also derived from Mike Hamburg's public-domain vpaes code.
 1.3 07-Aug-2023  rin sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.
 1.2 09-Aug-2020  riastradh Fix mistake in big-endian arm clang.

Swapped the two halves (only gcc does that, I think) and wrote j,i
backwards, oops.

(I don't have a big-endian arm clang build handy to test; hoping this
works.)
 1.1 08-Aug-2020  riastradh Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.

New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

static const uint8_t x8[16] = {...};

uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
 1.1 29-Jun-2020  riastradh Implement AES in kernel using ARMv8.0-AES on aarch64.
 1.5 08-Sep-2020  jakllsch Acknowledge clang warning for NEON cipher code on aarch64eb

We've already made the nonportable vector initializations portable; the
code works on aarch64eb.
 1.4 16-Aug-2020  riastradh Fix AES NEON code for big-endian softfp ARM.

...which is how the kernel runs. Switch to using __SOFTFP__ for
consistency with how it gets exposed to C, although I'm not sure how
to get it defined automagically in the toolchain for .S files so
that's set manually in files.aesneon for now.
 1.3 30-Jun-2020  riastradh Limit aes_neon to cpu_cortex | aarch64.

We won't use it on any other systems, and it doesn't build without
NEON anyway. Verified earmv7hf GENERIC, aarch64 GENERIC64, and
earmv6 RPI2 all build with this.
 1.2 29-Jun-2020  riastradh Provide hand-written AES NEON assembly for arm32.

gcc does a lousy job at compiling 128-bit NEON intrinsics on arm32;
hand-writing it made it about 12x faster, by avoiding a zillion loads
and stores to spill everything and the kitchen sink onto the stack.
(But gcc does fine on aarch64, presumably because it has twice as
many registers and doesn't have to deal with q2=d4/d5 overlapping.)
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using ARM NEON.

Also derived from Mike Hamburg's public-domain vpaes code.
 1.5 05-Sep-2020  maxv x86: fix several CPUID flags

- Rename: CPUID_PN      -> CPUID_PSN
          CPUID_CFLUSH  -> CPUID_CLFSH
          CPUID_SBF     -> CPUID_PBE
          CPUID_LZCNT   -> CPUID_ABM
          CPUID_P1GB    -> CPUID_PAGE1GB
          CPUID2_PCLMUL -> CPUID2_PCLMULQDQ
          CPUID2_CID    -> CPUID2_CNXTID
          CPUID2_xTPR   -> CPUID2_XTPR
          CPUID2_AES    -> CPUID2_AESNI
  To match the x86 specification and the other OSes.

- Remove: CPUID_B10, CPUID_B20, CPUID_IA64. They do not exist.
 1.4 25-Jul-2020  riastradh Implement AES-CCM with x86 AES-NI.
 1.3 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh Add x86 AES-NI support.

Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.
 1.3 25-Jul-2020  riastradh Implement AES-CCM with x86 AES-NI.
 1.2 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.1 29-Jun-2020  riastradh Add x86 AES-NI support.

Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.
 1.6 27-Jul-2020  riastradh Add RCSIDs to the AES and ChaCha .S sources.
 1.5 27-Jul-2020  riastradh Align critical-path loops in AES and ChaCha.
 1.4 25-Jul-2020  riastradh Implement AES-CCM with x86 AES-NI.
 1.3 25-Jul-2020  riastradh Invert some loops to save a jmp instruction on each iteration.

No semantic change intended.
 1.2 30-Jun-2020  riastradh Use .p2align rather than .align.

Apparently on arm, .align is actually an alias for .p2align, taking a
power of two rather than a number of bytes, so aes_armv8_64.o was
bloated to 32KB with obscene alignment when it only needed to be
barely past 4KB.

Do the same for the x86 aes_ni_64.S -- even though .align takes a
number of bytes rather than a power of two on x86, let's just stay
away from the temptations of the evil .align directive.
 1.1 29-Jun-2020  riastradh Add x86 AES-NI support.

Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
 1.4 25-Jul-2020  riastradh Implement AES-CCM with SSE2.
 1.3 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.2 29-Jun-2020  riastradh Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.
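The resulting call discipline, sketched with illustrative names: every
entry point into the -msse -msse2 objects is bracketed by the FPU
kernel-mode hooks, so no SSE2 instruction can execute outside them.

    static void
    aes_sse2_enc(const struct aesenc *enc, const uint8_t in[16],
        uint8_t out[16], uint32_t nrounds)
    {
        fpu_kern_enter();
        aes_sse2_enc_impl(enc, in, out, nrounds);  /* built with -msse2 */
        fpu_kern_leave();
    }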
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
 1.5 25-Jul-2020  riastradh Implement AES-CCM with SSE2.
 1.4 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.3 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.2 29-Jun-2020  riastradh Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
 1.3 07-Aug-2023  rin sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.
 1.2 29-Jun-2020  riastradh Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
 1.4 08-Sep-2020  riastradh aes(9): Fix edge case in bitsliced SSE2 AES-CBC decryption.

Make sure self-tests exercise this edge case.

Discovered by confusion over code inspection of jak's adaptation of
aes_armv8_64.S for big-endian.
 1.3 25-Jul-2020  riastradh Implement AES-CCM with SSE2.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
 1.3 25-Jul-2020  riastradh Implement AES-CCM with SSSE3.
 1.2 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
 1.4 25-Jul-2020  riastradh Implement AES-CCM with SSSE3.
 1.3 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
 1.2 07-Aug-2023  rin sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
 1.3 25-Jul-2020  riastradh Implement AES-CCM with SSSE3.
 1.2 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
 1.9 16-Jun-2024  rillig sys/aes_via: fix broken link in comment
 1.8 16-Jun-2024  christos Revert previous; probably a gcc bug?
 1.7 16-Jun-2024  christos Try to fix the overflow gcc pointed out.
 1.6 28-Jul-2020  riastradh Initialize authctr in both branches.

I guess I didn't test the unaligned case, weird.
 1.5 25-Jul-2020  riastradh Implement AES-CCM with VIA ACE.
 1.4 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.3 30-Jun-2020  riastradh New test sys/crypto/aes/t_aes.

Runs aes_selftest on all kernel AES implementations supported on the
current hardware, not just the preferred one.
 1.2 29-Jun-2020  riastradh VIA AES: Batch AES-XTS computation into eight blocks at a time.

Experimental -- performance improvement is not clearly worth the
complexity.
 1.1 29-Jun-2020  riastradh Add AES implementation with VIA ACE.
 1.2 25-Jul-2020  riastradh Split aes_impl declarations out into aes_impl.h.

This will make it less painful to add more operations to struct
aes_impl without having to recompile everything that just uses the
block cipher directly or similar.
 1.1 29-Jun-2020  riastradh Add AES implementation with VIA ACE.
 1.1 29-Jun-2020  riastradh Add x86 AES-NI support.

Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.
 1.2 29-Jun-2020  riastradh Split SSE2 logic into separate units.

Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.

I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
 1.1 29-Jun-2020  riastradh New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
 1.1 29-Jun-2020  riastradh Add AES implementation with VIA ACE.
 1.6 07-Aug-2023  rin sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.
 1.5 25-Jul-2020  riastradh Add some Intel intrinsics for ChaCha.

_mm_load1_ps
_mm_loadu_si128
_mm_movelh_ps
_mm_slli_epi32
_mm_storeu_si128
_mm_unpackhi_epi32
_mm_unpacklo_epi32
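
As an example of the shape of these additions, _mm_loadu_si128 can be
defined without any __builtin_ia32_* call at all, via an underaligned
vector type (a sketch in the usual gcc/clang style; the in-tree
spelling may differ):

    typedef long long __m128i __attribute__((__vector_size__(16)));
    typedef long long __m128i_u
        __attribute__((__vector_size__(16), __aligned__(1)));

    /* MOVDQU: unaligned 16-byte load. */
    static __inline __m128i
    _mm_loadu_si128(const void *p)
    {
            return *(const __m128i_u *)p;
    }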
 1.4 25-Jul-2020  riastradh Fix target attribute on _mm_movehl_ps, fix clang _mm_unpacklo_epi64.

- _mm_movehl_ps is available in SSE2, no need for SSSE3.
- _mm_unpacklo_epi64 operates on v2di, not v4si; fix.
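
A sketch of the corrected gcc-style definitions (clang's version uses
__builtin_shufflevector instead; the exact in-tree spelling may
differ):

    typedef float __m128 __attribute__((__vector_size__(16)));
    typedef float __v4sf __attribute__((__vector_size__(16)));
    typedef long long __m128i __attribute__((__vector_size__(16)));
    typedef long long __v2di __attribute__((__vector_size__(16)));

    /* MOVHLPS needs only SSE -- no SSSE3 required. */
    static __inline __m128 __attribute__((__target__("sse2")))
    _mm_movehl_ps(__m128 a, __m128 b)
    {
            return (__m128)__builtin_ia32_movhlps((__v4sf)a, (__v4sf)b);
    }

    /* PUNPCKLQDQ interleaves 64-bit (v2di) lanes, not 32-bit (v4si). */
    static __inline __m128i __attribute__((__target__("sse2")))
    _mm_unpacklo_epi64(__m128i a, __m128i b)
    {
            return (__m128i)__builtin_ia32_punpcklqdq128((__v2di)a, (__v2di)b);
    }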
 1.3 25-Jul-2020  riastradh Implement AES-CCM with SSSE3.
 1.2 29-Jun-2020  riastradh New permutation-based AES implementation using SSSE3.

This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.

Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
 1.2 07-Aug-2023  rin sys/crypto: Introduce arch/{arm,x86} to share common MD headers

Dedup between aes and chacha. No binary changes.
 1.1 29-Jun-2020  riastradh New SSE2-based bitsliced AES implementation.

This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by

(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.

Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.

Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.

Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- we should fix it properly; the
present approach is expedient but not ideal.
