| $NetBSD: oc_cksum.s,v 1.2 2000/11/30 22:26:27 scw Exp $

| Copyright (c) 1988 Regents of the University of California.
| All rights reserved.
|
| Redistribution and use in source and binary forms, with or without
| modification, are permitted provided that the following conditions
| are met:
| 1. Redistributions of source code must retain the above copyright
|    notice, this list of conditions and the following disclaimer.
| 2. Redistributions in binary form must reproduce the above copyright
|    notice, this list of conditions and the following disclaimer in the
|    documentation and/or other materials provided with the distribution.
| 3. All advertising materials mentioning features or use of this software
|    must display the following acknowledgement:
|	This product includes software developed by the University of
|	California, Berkeley and its contributors.
| 4. Neither the name of the University nor the names of its contributors
|    may be used to endorse or promote products derived from this software
|    without specific prior written permission.
|
| THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
| ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
| ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
| FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
| OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
| HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
| LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
| OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
| SUCH DAMAGE.
|
|	@(#)oc_cksum.s	7.2 (Berkeley) 11/3/90
|
|
| oc_cksum: one's complement 16 bit checksum for MC68020.
|
| oc_cksum (buffer, count, strtval)
|
| Do a 16 bit one's complement sum of 'count' bytes from 'buffer'.
| 'strtval' is the starting value of the sum (usually zero).
|
| It simplifies life in in_cksum if strtval can be >= 2^16.
| This routine will work as long as strtval is < 2^31.
|
| Performance
| -----------
| This routine is intended for MC 68020s but should also work
| for 68030s.  It (deliberately) doesn't worry about the alignment
| of the buffer so will only work on a 68010 if the buffer is
| aligned on an even address.  (Also, a routine written to use
| 68010 "loop mode" would almost certainly be faster than this
| code on a 68010).
|
| We don't worry about alignment because this routine is frequently
| called with small counts: 20 bytes for IP header checksums and 40
| bytes for TCP ack checksums.  For these small counts, testing for
| bad alignment adds ~10% to the per-call cost.  Since, by the nature
| of the kernel's allocator, the data we're called with is almost
| always longword aligned, there is no benefit to this added cost
| and we're better off letting the loop take a big performance hit
| in the rare cases where we're handed an unaligned buffer.
|
| Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
| tested on random data on four different types of processors (see
| list below -- 64 was the largest unrolling because anything more
| overflows the 68020 Icache).  On all the processors, the
| throughput asymptote was located between 8 and 16 (closer to 8).
| However, 16 was substantially better than 8 for small counts.
| (It's clear why this happens for a count of 40: unroll-8 pays a
| loop branch cost and unroll-16 doesn't.  But the tests also showed
| that 16 was better than 8 for a count of 20.  It's not obvious to
| me why.)  So, since 16 was good for both large and small counts,
| the loop below is unrolled 16 times.
|
| The processors tested and their average time to checksum 1024 bytes
| of random data were:
|	Sun 3/50 (15MHz)	190 us/KB
|	Sun 3/180 (16.6MHz)	175 us/KB
|	Sun 3/60 (20MHz)	134 us/KB
|	Sun 3/280 (25MHz)	 95 us/KB
|
| The cost of calling this routine was typically 10% of the per-
| kilobyte cost.  E.g., checksumming zero bytes on a 3/60 cost 9us
| and each additional byte cost 125ns.  With the high fixed cost,
| it would clearly be a gain to "inline" this routine -- the
| subroutine call adds 400% overhead to an IP header checksum.
| However, in absolute terms, inlining would only gain 10us per
| packet -- a 1% effect for a 1ms ethernet packet.  This is not
| enough gain to be worth the effort.

#include <m68k/asm.h>

	.text
	.even

ENTRY_NOPROFILE(oc_cksum)
	movl	%sp@(4),%a0	| get buffer ptr
	movl	%sp@(8),%d1	| get byte count
	movl	%sp@(12),%d0	| get starting value
	movl	%d2,%sp@-	| free a reg

	| test for possible 1, 2 or 3 bytes of excess at end
	| of buffer.  The usual case is no excess (the usual
	| case is header checksums) so we give that the faster
	| 'not taken' leg of the compare.  (We do the excess
	| first because we're about to trash the low order
	| bits of the count in d1.)

	btst	#0,%d1
	jne	L5		| if one or three bytes excess
	btst	#1,%d1
	jne	L7		| if two bytes excess
L1:
	movl	%d1,%d2
	lsrl	#6,%d1		| make cnt into # of 64 byte chunks
	andl	#0x3c,%d2	| then find fractions of a chunk
	negl	%d2
	andb	#0xf,%ccr	| clear X
	jmp	%pc@(L3-.-2:b,%d2)
L2:
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
	movl	%a0@+,%d2
	addxl	%d2,%d0
L3:
	dbra	%d1,L2		| (NB- dbra doesn't affect X)

	movl	%d0,%d1		| fold 32 bit sum to 16 bits
	swap	%d1		| (NB- swap doesn't affect X)
	addxw	%d1,%d0
	jcc	L4
	addw	#1,%d0
L4:
	andl	#0xffff,%d0
	movl	%sp@+,%d2
	rts

L5:	| deal with 1 or 3 excess bytes at the end of the buffer.
	btst	#1,%d1
	jeq	L6		| if 1 excess

	| 3 bytes excess
	clrl	%d2
	movw	%a0@(-3,%d1:l),%d2	| add in last full word then drop
	addl	%d2,%d0			|  through to pick up last byte

L6:	| 1 byte excess
	clrl	%d2
	movb	%a0@(-1,%d1:l),%d2
	lsll	#8,%d2
	addl	%d2,%d0
	jra	L1

L7:	| 2 bytes excess
	clrl	%d2
	movw	%a0@(-2,%d1:l),%d2
	addl	%d2,%d0
	jra	L1
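
For readers who do not read 68k assembly, the arithmetic performed by oc_cksum can be sketched in portable C. This is only an illustrative reference, not part of the original file: the name `oc_cksum_ref` is hypothetical, and it adds big-endian 16-bit words one at a time instead of using the unrolled 32-bit `addxl` chain. An odd trailing byte is treated as the high-order byte of a final word, as in the L6 path above. The result should be congruent to the assembly routine's result modulo 0xffff, so the two may differ in the usual one's complement ambiguity between 0 and 0xffff; it has not been checked against the original on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical portable reference for oc_cksum (buffer, count, strtval).
 * Returns the 16-bit one's complement sum (not complemented) of 'count'
 * bytes from 'buf', starting from 'strtval'.
 */
uint16_t
oc_cksum_ref(const uint8_t *buf, size_t count, uint32_t strtval)
{
	uint64_t sum = strtval;	/* wide accumulator, carries folded at the end */
	size_t i;

	/* Add each full 16-bit word in network (big-endian) byte order. */
	for (i = 0; i + 1 < count; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

	/* An odd trailing byte counts as the high byte of a last word. */
	if (i < count)
		sum += (uint32_t)buf[i] << 8;

	/* Fold the accumulated carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

A caller computing an IP header checksum would pass the 20 header bytes with a zero starting value and then complement the returned sum itself, since this sketch, like the routine above, returns the sum rather than its complement.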