History log of /src/sys/crypto/chacha/arch/arm/
Revision	Date	Author	Comments
Revision tags: perseant-exfatfs-base-20250801 perseant-exfatfs-base-20240630 perseant-exfatfs-base
1.8	07-Aug-2023	rin	sys/crypto: Introduce arch/{arm,x86} to share common MD headers. Dedup between aes and chacha. No binary changes.
Revision tags: netbsd-10-1-RELEASE netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.7	07-Sep-2020	jakllsch	Fix vgetq_lane_u32 for aarch64eb with GCC. Fixes NEON AES on aarch64eb.
1.6	09-Aug-2020	riastradh	Fix some clang neon intrinsics. Compile-tested only, with -Wno-nonportable-vector-initializers. Need to address -- and test -- this stuff properly but this is progress.
1.5	09-Aug-2020	riastradh	Use vshlq_n_s32 rather than vsliq_n_s32 with zero destination. Not sure why I reached for vsliq_n_s32 at first -- probably so I wouldn't have to deal with a new intrinsic in arm_neon.h!
1.4	08-Aug-2020	riastradh	Fix ARM NEON implementations of AES and ChaCha on big-endian ARM. New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes. Experimented with using static const uint8_t x8[16] = {...}; uint8x16_t x = vld1q_u8(x8); which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8. Fix some clang mistakes while here too.
1.3	27-Jul-2020	riastradh	Note that VSRI seems to hurt here.
1.2	27-Jul-2020	riastradh	Take advantage of REV32 and TBL for 16-bit and 8-bit rotations. However, disable use of (V)TBL on armv7/aarch32 for now, because for some reason GCC spills things to the stack despite having plenty of free registers, which hurts performance more than it helps at least on ARM Cortex-A8.
1.1	25-Jul-2020	riastradh	Implement ChaCha with NEON on ARM. XXX Needs performance measurement. XXX Needs adaptation to arm32 neon which has half the registers.
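The "Take advantage of REV32 and TBL for 16-bit and 8-bit rotations" entry above relies on the fact that rotating a 32-bit word by 16 or 8 bits only moves whole bytes around. The following is a minimal portable sketch of that idea, not the NEON code from the tree: swap_halfwords and byte_table_lookup are hypothetical plain-C stand-ins for what REV32 on 16-bit lanes and TBL do, the index table assumes little-endian byte order within each word, and the four input words are arbitrary test values.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: rotate a 32-bit word left by n bits, 0 < n < 32. */
static uint32_t
rol32(uint32_t x, unsigned n)
{
	assert(n > 0 && n < 32);
	return (x << n) | (x >> (32 - n));
}

/* Store four 32-bit words as 16 little-endian bytes. */
static void
store_words(uint8_t v[16], const uint32_t w[4])
{
	for (unsigned i = 0; i < 4; i++) {
		v[4*i + 0] = (uint8_t)(w[i] >> 0);
		v[4*i + 1] = (uint8_t)(w[i] >> 8);
		v[4*i + 2] = (uint8_t)(w[i] >> 16);
		v[4*i + 3] = (uint8_t)(w[i] >> 24);
	}
}

/* Load four 32-bit words from 16 little-endian bytes. */
static void
load_words(uint32_t w[4], const uint8_t v[16])
{
	for (unsigned i = 0; i < 4; i++) {
		w[i] = (uint32_t)v[4*i + 0] |
		    ((uint32_t)v[4*i + 1] << 8) |
		    ((uint32_t)v[4*i + 2] << 16) |
		    ((uint32_t)v[4*i + 3] << 24);
	}
}

/* Hypothetical model of REV32 on 16-bit lanes: swap halves of each word. */
static void
swap_halfwords(uint8_t out[16], const uint8_t in[16])
{
	for (unsigned i = 0; i < 16; i += 4) {
		out[i + 0] = in[i + 2];
		out[i + 1] = in[i + 3];
		out[i + 2] = in[i + 0];
		out[i + 3] = in[i + 1];
	}
}

/* Hypothetical model of a one-register TBL: out[i] = in[idx[i]]. */
static void
byte_table_lookup(uint8_t out[16], const uint8_t in[16],
    const uint8_t idx[16])
{
	for (unsigned i = 0; i < 16; i++)
		out[i] = in[idx[i]];
}

/* Byte permutation that rotates each little-endian word left by 8. */
static const uint8_t rol8_idx[16] = {
	3, 0, 1, 2,  7, 4, 5, 6,  11, 8, 9, 10,  15, 12, 13, 14,
};

/* Returns 1 if both byte-shuffling rotations agree with rol32. */
static int
rotation_selftest(void)
{
	const uint32_t w[4] =
	    { 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 };
	uint8_t in[16], out[16];
	uint32_t r[4];
	int ok = 1;

	store_words(in, w);

	swap_halfwords(out, in);
	load_words(r, out);
	for (unsigned i = 0; i < 4; i++)
		ok &= (r[i] == rol32(w[i], 16));

	byte_table_lookup(out, in, rol8_idx);
	load_words(r, out);
	for (unsigned i = 0; i < 4; i++)
		ok &= (r[i] == rol32(w[i], 8));

	return ok;
}
```

Calling rotation_selftest() from a test program should return 1. The other two ChaCha rotation amounts (12 and 7 bits) are not whole-byte moves, so this trick does not apply to them and they still need shifts.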
Revision tags: perseant-exfatfs-base-20250801 perseant-exfatfs-base-20240630 perseant-exfatfs-base
1.3	07-Aug-2023	rin	sys/crypto: Introduce arch/{arm,x86} to share common MD headers. Dedup between aes and chacha. No binary changes.
Revision tags: netbsd-10-1-RELEASE netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.2	09-Aug-2020	riastradh	Fix mistake in big-endian arm clang. Swapped the two halves (only gcc does that, I think) and wrote j,i backwards, oops. (I don't have a big-endian arm clang build handy to test; hoping this works.)
1.1	08-Aug-2020	riastradh	Fix ARM NEON implementations of AES and ChaCha on big-endian ARM. New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes. Experimented with using static const uint8_t x8[16] = {...}; uint8x16_t x = vld1q_u8(x8); which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8. Fix some clang mistakes while here too.
Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base perseant-exfatfs-base-20240630 perseant-exfatfs-base thorpej-ifq-base thorpej-altq-separation-base
1.9	07-Aug-2023	rin	sys/crypto: Introduce arch/{arm,x86} to share common MD headers. Dedup between aes and chacha. No binary changes.
Revision tags: netbsd-10-1-RELEASE netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.8	08-Aug-2020	riastradh	Fix ARM NEON implementations of AES and ChaCha on big-endian ARM. New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes. Experimented with using static const uint8_t x8[16] = {...}; uint8x16_t x = vld1q_u8(x8); which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8. Fix some clang mistakes while here too.
1.7	28-Jul-2020	riastradh	Implement 4-way vectorization of ChaCha for armv7 NEON. cgd performance is not as good as I was hoping (~4% improvement over chacha_ref.c) but it should improve substantially more if we let the cgd worker thread keep fpu state so we don't have to pay the cost of isb and zero-the-fpu on every 512-byte cgd block.
1.6	28-Jul-2020	riastradh	Fix big-endian build with appropriate casts around vrev32q_u8.
1.5	27-Jul-2020	riastradh	Note that VSRI seems to hurt here.
1.4	27-Jul-2020	riastradh	Take advantage of REV32 and TBL for 16-bit and 8-bit rotations. However, disable use of (V)TBL on armv7/aarch32 for now, because for some reason GCC spills things to the stack despite having plenty of free registers, which hurts performance more than it helps at least on ARM Cortex-A8.
1.3	27-Jul-2020	riastradh	Enable ChaCha NEON code on armv7 too. The 4-blocks-at-a-time assembly helper is disabled for now; adapting it to armv7 is going to be a little annoying with only 16 128-bit vector registers. (Should also do a fifth block in the integer registers for 320 bytes at a time.)
1.2	27-Jul-2020	riastradh	Reduce some duplication. Shouldn't substantively hurt performance -- the comparison that has been moved into the loop was essentially the former loop condition -- and may improve performance by reducing code size since there's only one inline call to chacha_permute instead of two.
1.1	25-Jul-2020	riastradh	Implement ChaCha with NEON on ARM. XXX Needs performance measurement. XXX Needs adaptation to arm32 neon which has half the registers.
Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.3	28-Jul-2020	riastradh	Implement 4-way vectorization of ChaCha for armv7 NEON. cgd performance is not as good as I was hoping (~4% improvement over chacha_ref.c) but it should improve substantially more if we let the cgd worker thread keep fpu state so we don't have to pay the cost of isb and zero-the-fpu on every 512-byte cgd block.
1.2	27-Jul-2020	riastradh	Enable ChaCha NEON code on armv7 too. The 4-blocks-at-a-time assembly helper is disabled for now; adapting it to armv7 is going to be a little annoying with only 16 128-bit vector registers. (Should also do a fifth block in the integer registers for 320 bytes at a time.)
1.1	25-Jul-2020	riastradh	Implement ChaCha with NEON on ARM. XXX Needs performance measurement. XXX Needs adaptation to arm32 neon which has half the registers.
Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.4	23-Aug-2020	riastradh	Adjust sp, not fp, to allocate a 32-byte temporary. Costs another couple MOV instructions, but we can't skimp on this -- there's no red zone below sp for interrupts on arm, so we can't touch anything there. So just use fp to save sp and then adjust sp itself, rather than using fp as a temporary register to point just below sp. Should fix PR port-arm/55598 -- previously the ChaCha self-test failed 33/10000 trials triggered by sysctl during running system; with the patch it has failed 0/10000 trials. (Presumably it happened more often at boot time, leading to 5/26 failures in the test bed, because we just enabled interrupts and some devices are starting to deliver interrupts.)
1.3	08-Aug-2020	riastradh	Fix ARM NEON implementations of AES and ChaCha on big-endian ARM. New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes. Experimented with using static const uint8_t x8[16] = {...}; uint8x16_t x = vld1q_u8(x8); which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8. Fix some clang mistakes while here too.
1.2	29-Jul-2020	riastradh	Issue three more swaps to save eight stores. Reduces code size and yields a small (~2%) cgd throughput boost. Remove duplicate comment while here.
1.1	28-Jul-2020	riastradh	Implement 4-way vectorization of ChaCha for armv7 NEON. cgd performance is not as good as I was hoping (~4% improvement over chacha_ref.c) but it should improve substantially more if we let the cgd worker thread keep fpu state so we don't have to pay the cost of isb and zero-the-fpu on every 512-byte cgd block.
Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.7	07-Sep-2020	jakllsch	Use a working macro to detect big endian aarch64. Fixes aarch64eb NEON ChaCha.
1.6	08-Aug-2020	riastradh	Fix ARM NEON implementations of AES and ChaCha on big-endian ARM. New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes. Experimented with using static const uint8_t x8[16] = {...}; uint8x16_t x = vld1q_u8(x8); which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8. Fix some clang mistakes while here too.
1.5	28-Jul-2020	riastradh	Fix typo in comment.
1.4	27-Jul-2020	riastradh	Add RCSIDs to the AES and ChaCha .S sources.
1.3	27-Jul-2020	riastradh	Align critical-path loops in AES and ChaCha.
1.2	27-Jul-2020	riastradh	Use <aarch64/asm.h> rather than copying things from it here. Vestige from userland build on netbsd-9 during development.
1.1	25-Jul-2020	riastradh	Implement ChaCha with NEON on ARM. XXX Needs performance measurement. XXX Needs adaptation to arm32 neon which has half the registers.
Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.2	10-Oct-2020	jmcneill	Fix detection of NEON features. ID_AA64PFR0_EL1_ADV_SIMD_NONE means SIMD is not available, and any other value means it is.
1.1	25-Jul-2020	riastradh	Implement ChaCha with NEON on ARM. XXX Needs performance measurement. XXX Needs adaptation to arm32 neon which has half the registers.
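The "Fix detection of NEON features" entry above describes a check of the form "anything except the NONE value means SIMD is present". The following is a minimal sketch of that logic in portable C, not the kernel's code: the macro and function names are hypothetical, and it assumes the AdvSIMD field occupies bits [23:20] of ID_AA64PFR0_EL1 with 0xf meaning "not implemented", as laid out in the Arm architecture reference. Reading the register itself requires privileged code and is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; field layout assumed from the Arm ARM. */
#define ADVSIMD_SHIFT	20	/* AdvSIMD field starts at bit 20 */
#define ADVSIMD_MASK	0xfu	/* field is four bits wide */
#define ADVSIMD_NONE	0xfu	/* Advanced SIMD not implemented */

/* Extract the AdvSIMD field from a raw ID_AA64PFR0_EL1 value. */
static unsigned
advsimd_field(uint64_t id_aa64pfr0)
{
	return (unsigned)((id_aa64pfr0 >> ADVSIMD_SHIFT) & ADVSIMD_MASK);
}

/*
 * SIMD is available unless the field holds the NONE value.  Testing
 * for one specific "implemented" value instead would wrongly reject
 * CPUs that report a different implemented encoding.
 */
static bool
advsimd_available(uint64_t id_aa64pfr0)
{
	return advsimd_field(id_aa64pfr0) != ADVSIMD_NONE;
}
```

With this shape, a field value of 0 or 1 both count as available, and only 0xf is rejected.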
Revision tags: perseant-exfatfs-base-20250801 netbsd-11-base netbsd-10-1-RELEASE perseant-exfatfs-base-20240630 perseant-exfatfs-base netbsd-10-0-RELEASE netbsd-10-0-RC6 netbsd-10-0-RC5 netbsd-10-0-RC4 netbsd-10-0-RC3 netbsd-10-0-RC2 thorpej-ifq-base thorpej-altq-separation-base netbsd-10-0-RC1 netbsd-10-base bouyer-sunxi-drm-base thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
1.5	08-Sep-2020	jakllsch	Acknowledge clang warning for NEON cipher code on aarch64eb. We've already made the nonportable vector initializations portable; the code works on aarch64eb.
1.4	08-Sep-2020	jakllsch	use correct condition
1.3	28-Jul-2020	riastradh	Implement 4-way vectorization of ChaCha for armv7 NEON. cgd performance is not as good as I was hoping (~4% improvement over chacha_ref.c) but it should improve substantially more if we let the cgd worker thread keep fpu state so we don't have to pay the cost of isb and zero-the-fpu on every 512-byte cgd block.
1.2	27-Jul-2020	riastradh	Enable ChaCha NEON code on armv7 too. The 4-blocks-at-a-time assembly helper is disabled for now; adapting it to armv7 is going to be a little annoying with only 16 128-bit vector registers. (Should also do a fifth block in the integer registers for 320 bytes at a time.)
1.1	25-Jul-2020	riastradh	Implement ChaCha with NEON on ARM. XXX Needs performance measurement. XXX Needs adaptation to arm32 neon which has half the registers.