Home | History | Annotate | Download | only in arm
History log of /src/sys/crypto/aes/arch/arm/aes_neon_32.S
RevisionDateAuthorComments
 1.11  10-Sep-2020  riastradh aes neon: Gather mc_forward/backward so we can load 256 bits at once.
 1.10  10-Sep-2020  riastradh aes neon: Hoist dsbd/dsbe address calculation out of loop.
 1.9  10-Sep-2020  riastradh aes neon: Tweak register usage.

- Call r12 by its usual name, ip.
- No need for r7 or r11=fp at the moment.
 1.8  10-Sep-2020  riastradh aes neon: Write vtbl with {qN} rather than {d(2N)-d(2N+1)}.

Cosmetic; no functional change.
 1.7  10-Sep-2020  riastradh aes neon: Issue 256-bit loads rather than pairs of 128-bit loads.

Not sure why I didn't realize you could do this before!

Saves some temporary registers that can now be allocated to shave off
a few cycles.
 1.6  16-Aug-2020  riastradh Fix AES NEON code for big-endian softfp ARM.

...which is how the kernel runs. Switch to using __SOFTFP__ for
consistency with how it gets exposed to C, although I'm not sure how
to get it defined automagically in the toolchain for .S files so
that's set manually in files.aesneon for now.
 1.5  08-Aug-2020  riastradh Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.

New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

static const uint8_t x8[16] = {...};

uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
 1.4  27-Jul-2020  riastradh Add RCSIDs to the AES and ChaCha .S sources.
 1.3  27-Jul-2020  riastradh Align critical-path loops in AES and ChaCha.
 1.2  27-Jul-2020  riastradh PIC for aes_neon_32.S.

Without this, tests/sys/crypto/aes/t_aes fails to start on armv7
because of R_ARM_ABS32 relocations in a nonwritable text segment for
a PIE -- which atf quietly ignores in the final report! Yikes.
 1.1  29-Jun-2020  riastradh Provide hand-written AES NEON assembly for arm32.

gcc does a lousy job at compiling 128-bit NEON intrinsics on arm32;
hand-writing it made it about 12x faster, by avoiding a zillion loads
and stores to spill everything and the kitchen sink onto the stack.
(But gcc does fine on aarch64, presumably because it has twice as
many registers and doesn't have to deal with q2=d4/d5 overlapping.)

RSS XML Feed