History log of /src/sys/crypto/chacha/arch/arm/chacha_neon.c |
Revision | | Date | Author | Comments |
1.9 |
| 07-Aug-2023 |
rin | sys/crypto: Introduce arch/{arm,x86} to share common MD headers
Dedup between aes and chacha. No binary changes.
|
1.8 |
| 08-Aug-2020 |
riastradh | Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
|
1.7 |
| 28-Jul-2020 |
riastradh | Implement 4-way vectorization of ChaCha for armv7 NEON.
cgd performance is not as good as I was hoping (~4% improvement over chacha_ref.c) but it should improve substantially more if we let the cgd worker thread keep fpu state so we don't have to pay the cost of isb and zero-the-fpu on every 512-byte cgd block.
|
1.6 |
| 28-Jul-2020 |
riastradh | Fix big-endian build with appropriate casts around vrev32q_u8.
|
1.5 |
| 27-Jul-2020 |
riastradh | Note that VSRI seems to hurt here.
|
1.4 |
| 27-Jul-2020 |
riastradh | Take advantage of REV32 and TBL for 16-bit and 8-bit rotations.
However, disable use of (V)TBL on armv7/aarch32 for now, because for some reason GCC spills things to the stack despite having plenty of free registers, which hurts performance more than it helps at least on ARM Cortex-A8.
|
1.3 |
| 27-Jul-2020 |
riastradh | Enable ChaCha NEON code on armv7 too.
The 4-blocks-at-a-time assembly helper is disabled for now; adapting it to armv7 is going to be a little annoying with only 16 128-bit vector registers.
(Should also do a fifth block in the integer registers for 320 bytes at a time.)
|
1.2 |
| 27-Jul-2020 |
riastradh | Reduce some duplication.
Shouldn't substantively hurt performance -- the comparison that has been moved into the loop was essentially the former loop condition -- and may improve performance by reducing code size since there's only one inline call to chacha_permute instead of two.
|
1.1 |
| 25-Jul-2020 |
riastradh | Implement ChaCha with NEON on ARM.
XXX Needs performance measurement. XXX Needs adaptation to arm32 neon which has half the registers.
|