| History log of /src/sys/crypto/chacha/arch/arm/chacha_neon_32.S |
| Revision | | Date | Author | Comments |
| 1.4 |
| 23-Aug-2020 |
riastradh | Adjust sp, not fp, to allocate a 32-byte temporary.
Costs another couple MOV instructions, but we can't skimp on this -- there's no red zone below sp for interrupts on arm, so we can't touch anything there. So just use fp to save sp and then adjust sp itself, rather than using fp as a temporary register to point just below sp.
Should fix PR port-arm/55598 -- previously the ChaCha self-test failed 33/10000 trials triggered by sysctl during running system; with the patch it has failed 0/10000 trials.
(Presumably it happened more often at boot time, leading to 5/26 failures in the test bed, because we just enabled interrupts and some devices are starting to deliver interrupts.)
|
| 1.3 |
| 08-Aug-2020 |
riastradh | Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers. Needed because GCC and Clang disagree on the ordering of lanes, depending on whether it's 64-bit big-endian, 32-bit big-endian, or little-endian -- and, bizarrely, both of them disagree with the architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes, but this generates considerably worse code and apparently confuses GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
|
| 1.2 |
| 29-Jul-2020 |
riastradh | Issue three more swaps to save eight stores.
Reduces code size and yields a small (~2%) cgd throughput boost.
Remove duplicate comment while here.
|
| 1.1 |
| 28-Jul-2020 |
riastradh | Implement 4-way vectorization of ChaCha for armv7 NEON.
cgd performance is not as good as I was hoping (~4% improvement over chacha_ref.c) but it should improve substantially more if we let the cgd worker thread keep fpu state so we don't have to pay the cost of isb and zero-the-fpu on every 512-byte cgd block.
|