| History log of /src/sys/crypto/aes/arch |
| Revision | Date | Author | Comments |
| 1.5 | 25-Jul-2020 |
riastradh | Implement AES-CCM with ARMv8.0-AES.
|
| 1.4 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.3 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.2 | 29-Jun-2020 |
riastradh | Move aarch64/fpu.h to arm/fpu.h.
|
| 1.1 | 29-Jun-2020 |
riastradh | Implement AES in kernel using ARMv8.0-AES on aarch64.
|
| 1.3 | 25-Jul-2020 |
riastradh | Implement AES-CCM with ARMv8.0-AES.
|
| 1.2 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.1 | 29-Jun-2020 |
riastradh | Implement AES in kernel using ARMv8.0-AES on aarch64.
|
| 1.15 | 08-Sep-2020 |
riastradh | aesarmv8: Reallocate registers to shave off unnecessary MOV.
|
| 1.14 | 08-Sep-2020 |
riastradh | aesarmv8: Issue two 4-register ld/st, not four 2-register ld/st.
|
| 1.13 | 08-Sep-2020 |
riastradh | aesarmv8: Adapt aes_armv8_64.S to big-endian.
Patch mainly from (and tested by) jakllsch@ with minor tweaks by me.
|
| 1.12 | 08-Aug-2020 |
riastradh | Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
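As an editorial illustration of the approach described above, here is a minimal sketch of such an initializer macro. It is not the committed VQ_N_U32 definition: the placeholder condition V_LANES_REVERSED and the helper chacha_const_example() are hypothetical, and the real header distinguishes several compiler/endianness combinations rather than just two.

#include <arm_neon.h>

/*
 * Hypothetical sketch: hide the compiler- and endian-dependent
 * initializer order behind one macro so callers can always write
 * lanes in architectural order.  V_LANES_REVERSED is a placeholder,
 * not a real feature-test macro.
 */
#if defined(V_LANES_REVERSED)
#define	VQ_N_U32(a, b, c, d)	(uint32x4_t){ (d), (c), (b), (a) }
#else
#define	VQ_N_U32(a, b, c, d)	(uint32x4_t){ (a), (b), (c), (d) }
#endif

/* Usage: lane 0 reads back as 0x61707865 on every configuration. */
static inline uint32x4_t
chacha_const_example(void)
{
	return VQ_N_U32(0x61707865, 0x3320646e, 0x79622d32, 0x6b206574);
}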
|
| 1.11 | 27-Jul-2020 |
riastradh | Add RCSIDs to the AES and ChaCha .S sources.
|
| 1.10 | 27-Jul-2020 |
riastradh | Issue aese/aesmc and aesd/aesimc in pairs.
Advised by the aarch64 optimization guide; increases cgd throughput by about 10%.
|
| 1.9 | 27-Jul-2020 |
riastradh | Align critical-path loops in AES and ChaCha.
|
| 1.8 | 25-Jul-2020 |
riastradh | Implement AES-CCM with ARMv8.0-AES.
|
| 1.7 | 25-Jul-2020 |
riastradh | Invert some loops to save a branch instruction on every iteration.
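For illustration (the committed change is in hand-written assembly, and a C compiler typically performs this rotation itself), the inversion in C looks like the sketch below; process_block() is a hypothetical stand-in for the per-block work.

#include <stddef.h>
#include <stdio.h>

static void
process_block(size_t i)
{
	/* stand-in for the real per-block work */
	printf("block %zu\n", i);
}

/* Top-tested loop: a straightforward assembly rendering needs a
 * conditional branch at the top plus an unconditional branch back to
 * the top on every iteration. */
void
run_top_tested(size_t n)
{
	for (size_t i = 0; i < n; i++)
		process_block(i);
}

/* Inverted (bottom-tested) loop: one conditional branch per iteration
 * once the zero-trip case has been handled up front. */
void
run_inverted(size_t n)
{
	size_t i = 0;

	if (n == 0)
		return;
	do {
		process_block(i);
	} while (++i < n);
}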
|
| 1.6 | 22-Jul-2020 |
riastradh | Fix register name in comment.
Some time ago I reallocated the registers to avoid inadvertently clobbering the callee-saves v9, but neglected to update the comment.
|
| 1.5 | 19-Jul-2020 |
ryo | fix build with clang/llvm.
The clang aarch64 assembler doesn't accept an optional number of lanes on a vector register (but the ARM ARM says an assembler must accept it).
|
| 1.4 | 30-Jun-2020 |
riastradh | Reallocate registers to avoid abusing callee-saves registers, v8-v15.
Forgot to consult the AAPCS before committing this -- oops!
While here, take advantage of the 32 aarch64 simd registers to avoid all stack spills.
|
| 1.3 | 30-Jun-2020 |
riastradh | Use `.arch_extension aes' for aese/aesmc/aesd/aesimc.
Unlike `.arch_extension crypto', this works with clang; both work with gas, so we'll go with this.
Clang still can't handle aes_armv8_64.S -- it gets confused by dup and mov on lanes -- but this makes progress.
|
| 1.2 | 30-Jun-2020 |
riastradh | Use .p2align rather than .align.
Apparently on arm, .align is actually an alias for .p2align, taking a power of two rather than a number of bytes, so aes_armv8_64.o was bloated to 32KB with obscene alignment when it only needed to be barely past 4KB.
Do the same for the x86 aes_ni_64.S -- even though .align takes a number of bytes rather than a power of two on x86, let's just stay away from the temptations of the evil .align directive.
|
| 1.1 | 29-Jun-2020 |
riastradh | Implement AES in kernel using ARMv8.0-AES on aarch64.
|
| 1.6 | 21-Nov-2020 |
rin | Fix build with clang for earmv7hf; loadroundkey() is used only for __aarch64__.
|
| 1.5 | 08-Aug-2020 |
riastradh | branches: 1.5.2; Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
|
| 1.4 | 28-Jul-2020 |
riastradh | Draft 2x vectorized neon vpaes for aarch64.
Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in cgd tests, for parallelizable operations like CBC decryption; the same improvement should probably carry over to the rpi4 CPU, which lacks ARMv8.0-AES.
|
| 1.3 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.2 | 29-Jun-2020 |
riastradh | Provide hand-written AES NEON assembly for arm32.
gcc does a lousy job at compiling 128-bit NEON intrinsics on arm32; hand-writing it made it about 12x faster, by avoiding a zillion loads and stores to spill everything and the kitchen sink onto the stack. (But gcc does fine on aarch64, presumably because it has twice as many registers and doesn't have to deal with q2=d4/d5 overlapping.)
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using ARM NEON.
Also derived from Mike Hamburg's public-domain vpaes code.
|
| 1.5.2.1 | 14-Dec-2020 |
thorpej | Sync w/ HEAD.
|
| 1.3 | 25-Jul-2020 |
riastradh | Implement AES-CCM with NEON.
|
| 1.2 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using ARM NEON.
Also derived from Mike Hamburg's public-domain vpaes code.
|
| 1.11 | 10-Sep-2020 |
riastradh | aes neon: Gather mc_forward/backward so we can load 256 bits at once.
|
| 1.10 | 10-Sep-2020 |
riastradh | aes neon: Hoist dsbd/dsbe address calculation out of loop.
|
| 1.9 | 10-Sep-2020 |
riastradh | aes neon: Tweak register usage.
- Call r12 by its usual name, ip.
- No need for r7 or r11=fp at the moment.
|
| 1.8 | 10-Sep-2020 |
riastradh | aes neon: Write vtbl with {qN} rather than {d(2N)-d(2N+1)}.
Cosmetic; no functional change.
|
| 1.7 | 10-Sep-2020 |
riastradh | aes neon: Issue 256-bit loads rather than pairs of 128-bit loads.
Not sure why I didn't realize you could do this before!
Saves some temporary registers that can now be allocated to shave off a few cycles.
|
| 1.6 | 16-Aug-2020 |
riastradh | Fix AES NEON code for big-endian softfp ARM.
...which is how the kernel runs. Switch to using __SOFTFP__ for consistency with how it gets exposed to C, although I'm not sure how to get it defined automagically in the toolchain for .S files so that's set manually in files.aesneon for now.
|
| 1.5 | 08-Aug-2020 |
riastradh | Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
|
| 1.4 | 27-Jul-2020 |
riastradh | Add RCSIDs to the AES and ChaCha .S sources.
|
| 1.3 | 27-Jul-2020 |
riastradh | Align critical-path loops in AES and ChaCha.
|
| 1.2 | 27-Jul-2020 |
riastradh | PIC for aes_neon_32.S.
Without this, tests/sys/crypto/aes/t_aes fails to start on armv7 because of R_ARM_ABS32 relocations in a nonwritable text segment for a PIE -- which atf quietly ignores in the final report! Yikes.
|
| 1.1 | 29-Jun-2020 |
riastradh | Provide hand-written AES NEON assembly for arm32.
gcc does a lousy job at compiling 128-bit NEON intrinsics on arm32; hand-writing it made it about 12x faster, by avoiding a zillion loads and stores to spill everything and the kitchen sink onto the stack. (But gcc does fine on aarch64, presumably because it has twice as many registers and doesn't have to deal with q2=d4/d5 overlapping.)
|
| 1.5 | 10-Oct-2020 |
jmcneill | Fix detection of NEON features. ID_AA64PFR0_EL1_ADV_SIMD_NONE means SIMD is not available, and any other value means it is.
|
| 1.4 | 25-Jul-2020 |
riastradh | Implement AES-CCM with NEON.
|
| 1.3 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.2 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using ARM NEON.
Also derived from Mike Hamburg's public-domain vpaes code.
|
| 1.4 | 07-Aug-2023 |
rin | sys/crypto: Introduce arch/{arm,x86} to share common MD headers
Dedup between aes and chacha. No binary changes.
|
| 1.3 | 08-Aug-2020 |
riastradh | Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
|
| 1.2 | 28-Jul-2020 |
riastradh | Draft 2x vectorized neon vpaes for aarch64.
Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in cgd tests, for parallelizable operations like CBC decryption; the same improvement should probably carry over to the rpi4 CPU, which lacks ARMv8.0-AES.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using ARM NEON.
Also derived from Mike Hamburg's public-domain vpaes code.
|
| 1.8 | 26-Jun-2022 |
riastradh | arm/aes_neon: Fix formatting of self-test failure message.
Discovered by code inspection. Remarkably, a combination of errors made this fail to be a stack buffer overrun. Verified by booting with ARMv8.0-AES disabled and with the self-test artificially made to fail.
|
| 1.7 | 09-Aug-2020 |
riastradh | Use vshlq_n_s32 rather than vsliq_n_s32 with zero destination.
Not sure why I reached for vsliq_n_s32 at first -- probably so I wouldn't have to deal with a new intrinsic in arm_neon.h!
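To illustrate the equivalence behind this change (the shift amount 13 and the function names below are arbitrary examples, not the committed code): a shift-left-insert whose destination is all zeros is just a plain left shift.

#include <arm_neon.h>

/* Before: shift x left by 13 and insert into an all-zero destination. */
static inline int32x4_t
shift13_with_vsli(int32x4_t x)
{
	return vsliq_n_s32(vdupq_n_s32(0), x, 13);
}

/* After: the plain shift produces the same result without the extra
 * zero operand. */
static inline int32x4_t
shift13_with_vshl(int32x4_t x)
{
	return vshlq_n_s32(x, 13);
}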
|
| 1.6 | 09-Aug-2020 |
riastradh | Nix outdated comment.
I implemented this parallelism a couple weeks ago.
|
| 1.5 | 08-Aug-2020 |
riastradh | Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
|
| 1.4 | 28-Jul-2020 |
riastradh | Draft 2x vectorized neon vpaes for aarch64.
Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in cgd tests, for parallelizable operations like CBC decryption; the same improvement should probably carry over to the rpi4 CPU, which lacks ARMv8.0-AES.
|
| 1.3 | 25-Jul-2020 |
riastradh | Implement AES-CCM with NEON.
|
| 1.2 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using ARM NEON.
Also derived from Mike Hamburg's public-domain vpaes code.
|
| 1.13 | 07-Aug-2023 |
rin | sys/crypto: Introduce arch/{arm,x86} to share common MD headers
Dedup between aes and chacha. No binary changes.
|
| 1.12 | 07-Aug-2023 |
rin | sys/crypto/{aes,chacha}/arch/arm/arm_neon.h: Sync (whitespace fix)
No binary changes.
|
| 1.11 | 07-Sep-2020 |
jakllsch | Fix vgetq_lane_u32 for aarch64eb with GCC
Fixes NEON AES on aarch64eb
|
| 1.10 | 09-Aug-2020 |
riastradh | Fix some clang neon intrinsics.
Compile-tested only, with -Wno-nonportable-vector-initializers. Need to address -- and test -- this stuff properly but this is progress.
|
| 1.9 | 09-Aug-2020 |
riastradh | Use vshlq_n_s32 rather than vsliq_n_s32 with zero destination.
Not sure why I reached for vsliq_n_s32 at first -- probably so I wouldn't have to deal with a new intrinsic in arm_neon.h!
|
| 1.8 | 08-Aug-2020 |
riastradh | Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
|
| 1.7 | 28-Jul-2020 |
riastradh | Draft 2x vectorized neon vpaes for aarch64.
Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in cgd tests, for parallelizable operations like CBC decryption; the same improvement should probably carry over to the rpi4 CPU, which lacks ARMv8.0-AES.
|
| 1.6 | 25-Jul-2020 |
riastradh | Add 32-bit load, store, and shift intrinsics.
vld1q_u32 vst1q_u32 vshlq_n_u32 vshrq_n_u32
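A small caller's-eye usage sketch of the intrinsics listed above (the rotate-by-7 helper is a hypothetical example, not code from this commit; it also uses vorrq_u32, which was already available):

#include <arm_neon.h>
#include <stdint.h>

/* Rotate four 32-bit words left by 7 in place. */
static inline void
rotl7_u32x4(uint32_t p[4])
{
	uint32x4_t v = vld1q_u32(p);		/* 128-bit load */

	v = vorrq_u32(vshlq_n_u32(v, 7),	/* shift left by 7 */
	    vshrq_n_u32(v, 25));		/* shift right by 25 */
	vst1q_u32(p, v);			/* 128-bit store */
}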
|
| 1.5 | 25-Jul-2020 |
riastradh | Fix missing clang big-endian case.
|
| 1.4 | 25-Jul-2020 |
riastradh | Implement AES-CCM with NEON.
|
| 1.3 | 23-Jul-2020 |
ryo | fix build with llvm/clang.
|
| 1.2 | 30-Jun-2020 |
riastradh | Tweak clang neon intrinsics so they build.
(this file is still a kludge)
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using ARM NEON.
Also derived from Mike Hamburg's public-domain vpaes code.
|
| 1.3 | 07-Aug-2023 |
rin | sys/crypto: Introduce arch/{arm,x86} to share common MD headers
Dedup between aes and chacha. No binary changes.
|
| 1.2 | 09-Aug-2020 |
riastradh | Fix mistake in big-endian arm clang.
Swapped the two halves (only gcc does that, I think) and wrote j,i backwards, oops.
(I don't have a big-endian arm clang build handy to test; hoping this works.)
|
| 1.1 | 08-Aug-2020 |
riastradh | Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
|
| 1.1 | 29-Jun-2020 |
riastradh | Implement AES in kernel using ARMv8.0-AES on aarch64.
|
| 1.5 | 08-Sep-2020 |
jakllsch | Acknowledge clang warning for NEON cipher code on aarch64eb
We've already made the nonportable vector initializations portable; the code works on aarch64eb.
|
| 1.4 | 16-Aug-2020 |
riastradh | Fix AES NEON code for big-endian softfp ARM.
...which is how the kernel runs. Switch to using __SOFTFP__ for consistency with how it gets exposed to C, although I'm not sure how to get it defined automagically in the toolchain for .S files so that's set manually in files.aesneon for now.
|
| 1.3 | 30-Jun-2020 |
riastradh | Limit aes_neon to cpu_cortex | aarch64.
We won't use it on any other systems, and it doesn't build without NEON anyway. Verified earmv7hf GENERIC, aarch64 GENERIC64, and earmv6 RPI2 all build with this.
|
| 1.2 | 29-Jun-2020 |
riastradh | Provide hand-written AES NEON assembly for arm32.
gcc does a lousy job at compiling 128-bit NEON intrinsics on arm32; hand-writing it made it about 12x faster, by avoiding a zillion loads and stores to spill everything and the kitchen sink onto the stack. (But gcc does fine on aarch64, presumably because it has twice as many registers and doesn't have to deal with q2=d4/d5 overlapping.)
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using ARM NEON.
Also derived from Mike Hamburg's public-domain vpaes code.
|
| 1.5 | 05-Sep-2020 |
maxv | x86: fix several CPUID flags
- Rename:
  CPUID_PN      -> CPUID_PSN
  CPUID_CFLUSH  -> CPUID_CLFSH
  CPUID_SBF     -> CPUID_PBE
  CPUID_LZCNT   -> CPUID_ABM
  CPUID_P1GB    -> CPUID_PAGE1GB
  CPUID2_PCLMUL -> CPUID2_PCLMULQDQ
  CPUID2_CID    -> CPUID2_CNXTID
  CPUID2_xTPR   -> CPUID2_XTPR
  CPUID2_AES    -> CPUID2_AESNI
  To match the x86 specification and the other OSes.
- Remove: CPUID_B10, CPUID_B20, CPUID_IA64. They do not exist.
|
| 1.4 | 25-Jul-2020 |
riastradh | Implement AES-CCM with x86 AES-NI.
|
| 1.3 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.2 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.1 | 29-Jun-2020 |
riastradh | Add x86 AES-NI support.
Limited to amd64 for now. In principle, AES-NI should work in 32-bit mode, and there may even be some 32-bit-only CPUs that support AES-NI, but that requires work to adapt the assembly.
|
| 1.3 | 25-Jul-2020 |
riastradh | Implement AES-CCM with x86 AES-NI.
|
| 1.2 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.1 | 29-Jun-2020 |
riastradh | Add x86 AES-NI support.
Limited to amd64 for now. In principle, AES-NI should work in 32-bit mode, and there may even be some 32-bit-only CPUs that support AES-NI, but that requires work to adapt the assembly.
|
| 1.6 | 27-Jul-2020 |
riastradh | Add RCSIDs to the AES and ChaCha .S sources.
|
| 1.5 | 27-Jul-2020 |
riastradh | Align critical-path loops in AES and ChaCha.
|
| 1.4 | 25-Jul-2020 |
riastradh | Implement AES-CCM with x86 AES-NI.
|
| 1.3 | 25-Jul-2020 |
riastradh | Invert some loops to save a jmp instruction on each iteration.
No semantic change intended.
|
| 1.2 | 30-Jun-2020 |
riastradh | Use .p2align rather than .align.
Apparently on arm, .align is actually an alias for .p2align, taking a power of two rather than a number of bytes, so aes_armv8_64.o was bloated to 32KB with obscene alignment when it only needed to be barely past 4KB.
Do the same for the x86 aes_ni_64.S -- even though .align takes a number of bytes rather than a power of two on x86, let's just stay away from the temptations of the evil .align directive.
|
| 1.1 | 29-Jun-2020 |
riastradh | Add x86 AES-NI support.
Limited to amd64 for now. In principle, AES-NI should work in 32-bit mode, and there may even be some 32-bit-only CPUs that support AES-NI, but that requires work to adapt the assembly.
|
| 1.2 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
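As an illustration of the register layout described above (a sketch only: the helper name is hypothetical, which 64-bit half lands where is an assumption, and the kernel code defines the intrinsics locally instead of including the Intel headers):

#include <emmintrin.h>
#include <stdint.h>

/* Pack the eight 64-bit bitsliced words q[0..7] into four 128-bit
 * registers, pairing q[i] with q[i+4] as (low half, high half). */
static inline void
aes_sse2_pack_example(__m128i r[4], const uint64_t q[8])
{
	int i;

	for (i = 0; i < 4; i++)
		r[i] = _mm_set_epi64x((long long)q[i + 4],	/* high */
		    (long long)q[i]);				/* low */
}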
|
| 1.4 | 25-Jul-2020 |
riastradh | Implement AES-CCM with SSE2.
|
| 1.3 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.2 | 29-Jun-2020 |
riastradh | Split SSE2 logic into separate units.
Ensure that there are no paths into files compiled with -msse -msse2 at all except via fpu_kern_enter.
I didn't run into a practical problem with this, but let's not leave a ticking time bomb for subsequent toolchain changes in case the mere declaration of local __m128i variables causes trouble.
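The call pattern this enforces looks roughly like the sketch below (the function names and header path are assumptions for illustration; only fpu_kern_enter()/fpu_kern_leave() and the -msse -msse2 split come from the commit itself):

#include <sys/types.h>
#include <x86/fpu.h>	/* fpu_kern_enter(), fpu_kern_leave(); path assumed */

/* Defined in a separate unit compiled with -msse -msse2; no __m128i
 * ever appears outside that unit. */
void aes_sse2_enc_impl(const void *rk, const uint8_t in[16],
    uint8_t out[16], unsigned nrounds);

/* Ordinary (non-SSE2) unit: every path into the SSE2 code brackets the
 * call with the kernel FPU enter/leave pair. */
void
aes_sse2_enc_example(const void *rk, const uint8_t in[16], uint8_t out[16],
    unsigned nrounds)
{
	fpu_kern_enter();
	aes_sse2_enc_impl(rk, in, out, nrounds);
	fpu_kern_leave();
}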
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
|
| 1.5 | 25-Jul-2020 |
riastradh | Implement AES-CCM with SSE2.
|
| 1.4 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.3 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.2 | 29-Jun-2020 |
riastradh | Split SSE2 logic into separate units.
Ensure that there are no paths into files compiled with -msse -msse2 at all except via fpu_kern_enter.
I didn't run into a practical problem with this, but let's not leave a ticking time bomb for subsequent toolchain changes in case the mere declaration of local __m128i variables causes trouble.
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
|
| 1.3 | 07-Aug-2023 |
rin | sys/crypto: Introduce arch/{arm,x86} to share common MD headers
Dedup between aes and chacha. No binary changes.
|
| 1.2 | 29-Jun-2020 |
riastradh | Split SSE2 logic into separate units.
Ensure that there are no paths into files compiled with -msse -msse2 at all except via fpu_kern_enter.
I didn't run into a practical problem with this, but let's not leave a ticking time bomb for subsequent toolchain changes in case the mere declaration of local __m128i variables causes trouble.
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
|
| 1.4 | 08-Sep-2020 |
riastradh | aes(9): Fix edge case in bitsliced SSE2 AES-CBC decryption.
Make sure self-tests exercise this edge case.
Discovered by confusion over code inspection of jak's adaptation of aes_armv8_64.S for big-endian.
|
| 1.3 | 25-Jul-2020 |
riastradh | Implement AES-CCM with SSE2.
|
| 1.2 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.1 | 29-Jun-2020 |
riastradh | Split SSE2 logic into separate units.
Ensure that there are no paths into files compiled with -msse -msse2 at all except via fpu_kern_enter.
I didn't run into a practical problem with this, but let's not leave a ticking time bomb for subsequent toolchain changes in case the mere declaration of local __m128i variables causes trouble.
|
| 1.2 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using SSSE3.
This covers a lot of CPUs -- particularly lower-end CPUs over the past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see <https://crypto.stanford.edu/vpaes/> for details.
|
| 1.3 | 25-Jul-2020 |
riastradh | Implement AES-CCM with SSSE3.
|
| 1.2 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using SSSE3.
This covers a lot of CPUs -- particularly lower-end CPUs over the past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see <https://crypto.stanford.edu/vpaes/> for details.
|
| 1.4 | 25-Jul-2020 |
riastradh | Implement AES-CCM with SSSE3.
|
| 1.3 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.2 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using SSSE3.
This covers a lot of CPUs -- particularly lower-end CPUs over the past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see <https://crypto.stanford.edu/vpaes/> for details.
|
| 1.2 | 07-Aug-2023 |
rin | sys/crypto: Introduce arch/{arm,x86} to share common MD headers
Dedup between aes and chacha. No binary changes.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using SSSE3.
This covers a lot of CPUs -- particularly lower-end CPUs over the past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see <https://crypto.stanford.edu/vpaes/> for details.
|
| 1.3 | 25-Jul-2020 |
riastradh | Implement AES-CCM with SSSE3.
|
| 1.2 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using SSSE3.
This covers a lot of CPUs -- particularly lower-end CPUs over the past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see <https://crypto.stanford.edu/vpaes/> for details.
|
| 1.9 | 16-Jun-2024 |
rillig | sys/aes_via: fix broken link in comment
|
| 1.8 | 16-Jun-2024 |
christos | revert previous, probably a gcc bug?
|
| 1.7 | 16-Jun-2024 |
christos | try to fix the overflow gcc pointed out.
|
| 1.6 | 28-Jul-2020 |
riastradh | Initialize authctr in both branches.
I guess I didn't test the unaligned case, weird.
|
| 1.5 | 25-Jul-2020 |
riastradh | Implement AES-CCM with VIA ACE.
|
| 1.4 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.3 | 30-Jun-2020 |
riastradh | New test sys/crypto/aes/t_aes.
Runs aes_selftest on all kernel AES implementations supported on the current hardware, not just the preferred one.
|
| 1.2 | 29-Jun-2020 |
riastradh | VIA AES: Batch AES-XTS computation into eight blocks at a time.
Experimental -- performance improvement is not clearly worth the complexity.
|
| 1.1 | 29-Jun-2020 |
riastradh | Add AES implementation with VIA ACE.
|
| 1.2 | 25-Jul-2020 |
riastradh | Split aes_impl declarations out into aes_impl.h.
This will make it less painful to add more operations to struct aes_impl without having to recompile everything that just uses the block cipher directly or similar.
|
| 1.1 | 29-Jun-2020 |
riastradh | Add AES implementation with VIA ACE.
|
| 1.1 | 29-Jun-2020 |
riastradh | Add x86 AES-NI support.
Limited to amd64 for now. In principle, AES-NI should work in 32-bit mode, and there may even be some 32-bit-only CPUs that support AES-NI, but that requires work to adapt the assembly.
|
| 1.2 | 29-Jun-2020 |
riastradh | Split SSE2 logic into separate units.
Ensure that there are no paths into files compiled with -msse -msse2 at all except via fpu_kern_enter.
I didn't run into a practical problem with this, but let's not leave a ticking time bomb for subsequent toolchain changes in case the mere declaration of local __m128i variables causes trouble.
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
|
| 1.1 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using SSSE3.
This covers a lot of CPUs -- particularly lower-end CPUs over the past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see <https://crypto.stanford.edu/vpaes/> for details.
|
| 1.1 | 29-Jun-2020 |
riastradh | Add AES implementation with VIA ACE.
|
| 1.6 | 07-Aug-2023 |
rin | sys/crypto: Introduce arch/{arm,x86} to share common MD headers
Dedup between aes and chacha. No binary changes.
|
| 1.5 | 25-Jul-2020 |
riastradh | Add some Intel intrinsics for ChaCha.
_mm_load1_ps _mm_loadu_si128 _mm_movelh_ps _mm_slli_epi32 _mm_storeu_si128 _mm_unpackhi_epi32 _mm_unpacklo_epi32
|
| 1.4 | 25-Jul-2020 |
riastradh | Fix target attribute on _mm_movehl_ps, fix clang _mm_unpacklo_epi64.
- _mm_movehl_ps is available in SSE2, no need for SSSE3.
- _mm_unpacklo_epi64 operates on v2di, not v4si; fix.
|
| 1.3 | 25-Jul-2020 |
riastradh | Implement AES-CCM with SSSE3.
|
| 1.2 | 29-Jun-2020 |
riastradh | New permutation-based AES implementation using SSSE3.
This covers a lot of CPUs -- particularly lower-end CPUs over the past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see <https://crypto.stanford.edu/vpaes/> for details.
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
|
| 1.2 | 07-Aug-2023 |
rin | sys/crypto: Introduce arch/{arm,x86} to share common MD headers
Dedup between aes and chacha. No binary changes.
|
| 1.1 | 29-Jun-2020 |
riastradh | New SSE2-based bitsliced AES implementation.
This should work on essentially all x86 CPUs of the last two decades, and may improve throughput over the portable C aes_ct implementation from BearSSL by
(a) reducing the number of vector operations in sequence, and (b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ..., (q[3], q[7]), each tuple representing a pair of 64-bit quantities stacked in a single 128-bit register. This translation was done very naively, and mostly reduces the cost of ShiftRows and data movement without doing anything to address the S-box or (Inv)MixColumns, which spread all 64-bit quantities across separate registers and ignore the upper halves.
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64 CPUs -- doesn't have PSHUFB, which would help out a lot more. For example, vpaes relies on that. Perhaps there are enough CPUs out there with PSHUFB but not AES-NI to make it worthwhile to import or adapt vpaes too.
Note: This includes local definitions of various Intel compiler intrinsics for gcc and clang in terms of their __builtin_* &c., because the necessary header files are not available during the kernel build. This is a kludge -- we should fix it properly; the present approach is expedient but not ideal.
|