| $NetBSD: oc_cksum.s,v 1.1 1995/07/25 23:12:31 chuck Exp $

| Copyright (c) 1988 Regents of the University of California.
| All rights reserved.
|
| Redistribution and use in source and binary forms, with or without
| modification, are permitted provided that the following conditions
| are met:
| 1. Redistributions of source code must retain the above copyright
|    notice, this list of conditions and the following disclaimer.
| 2. Redistributions in binary form must reproduce the above copyright
|    notice, this list of conditions and the following disclaimer in the
|    documentation and/or other materials provided with the distribution.
| 3. All advertising materials mentioning features or use of this software
|    must display the following acknowledgement:
|	This product includes software developed by the University of
|	California, Berkeley and its contributors.
| 4. Neither the name of the University nor the names of its contributors
|    may be used to endorse or promote products derived from this software
|    without specific prior written permission.
|
| THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
| ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
| ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
| FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
| OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
| HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
| LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
| OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
| SUCH DAMAGE.
|
|	@(#)oc_cksum.s	7.2 (Berkeley) 11/3/90
|
|
| oc_cksum: one's complement 16 bit checksum for MC68020.
|
| oc_cksum (buffer, count, strtval)
|
| Do a 16 bit one's complement sum of 'count' bytes from 'buffer'.
| 'strtval' is the starting value of the sum (usually zero).
|
| It simplifies life in in_cksum if strtval can be >= 2^16.
| This routine will work as long as strtval is < 2^31.
|
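| For reference, the value computed here corresponds roughly to the
| C sketch below.  It is an illustration only and not part of the
| original source: the name oc_cksum_ref and the byte-at-a-time
| structure are assumptions made for readability, and the result may
| differ from the assembly in which of the two one's complement
| zeros (0 or 0xffff) comes back.
|
|	#include <stddef.h>
|	#include <stdint.h>
|
|	/* hypothetical reference version; words are big-endian (68k) */
|	uint32_t
|	oc_cksum_ref(const uint8_t *buffer, size_t count, uint32_t strtval)
|	{
|		uint64_t sum = strtval;
|		size_t i;
|
|		for (i = 0; i + 1 < count; i += 2)	/* full 16 bit words */
|			sum += ((uint32_t)buffer[i] << 8) | buffer[i + 1];
|		if (i < count)				/* odd trailing byte */
|			sum += (uint32_t)buffer[i] << 8;
|		while (sum >> 16)			/* end-around carry */
|			sum = (sum & 0xffff) + (sum >> 16);
|		return (uint32_t)sum;
|	}
|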
| Performance
| -----------
| This routine is intended for MC68020s but should also work
| for 68030s.  It (deliberately) doesn't worry about the alignment
| of the buffer, so it will only work on a 68010 if the buffer is
| aligned on an even address.  (Also, a routine written to use
| 68010 "loop mode" would almost certainly be faster than this
| code on a 68010.)
|
| We don't worry about alignment because this routine is frequently
| called with small counts: 20 bytes for IP header checksums and 40
| bytes for TCP ack checksums.  For these small counts, testing for
| bad alignment adds ~10% to the per-call cost.  Since, by the nature
| of the kernel's allocator, the data we're called with is almost
| always longword aligned, there is no benefit to this added cost
| and we're better off letting the loop take a big performance hit
| in the rare cases where we're handed an unaligned buffer.
|
| Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
| tested on random data on four different types of processors (see
| list below -- 64 was the largest unrolling because anything more
| overflows the 68020 Icache).  On all the processors, the
| throughput asymptote was located between 8 and 16 (closer to 8).
| However, 16 was substantially better than 8 for small counts.
| (It's clear why this happens for a count of 40: unroll-8 pays a
| loop branch cost and unroll-16 doesn't.  But the tests also showed
| that 16 was better than 8 for a count of 20.  It's not obvious to
| me why.)  So, since 16 was good for both large and small counts,
| the loop below is unrolled 16 times.
|
| The processors tested and their average time to checksum 1024 bytes
| of random data were:
|	Sun 3/50 (15MHz)	190 us/KB
|	Sun 3/180 (16.6MHz)	175 us/KB
|	Sun 3/60 (20MHz)	134 us/KB
|	Sun 3/280 (25MHz)	 95 us/KB
|
| The cost of calling this routine was typically 10% of the
| per-kilobyte cost.  E.g., checksumming zero bytes on a 3/60 cost 9us
| and each additional byte cost 125ns.  With the high fixed cost,
| it would clearly be a gain to "inline" this routine -- the
| subroutine call adds 400% overhead to an IP header checksum.
| However, in absolute terms, inlining would only gain 10us per
| packet -- a 1% effect for a 1ms Ethernet packet.  This is not
| enough gain to be worth the effort.
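|
| As a rough check of these figures, using only the 3/60 numbers
| quoted above (9us fixed cost, 125ns per byte):
|	1024 bytes:	9us + 1024 * 0.125us = 137us, close to the
|			134 us/KB measured
|	  20 bytes:	20 * 0.125us = 2.5us of summing against 9us of
|			call overhead, i.e. about 360%, in line with
|			the ~400% quoted for an IP header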

#include <m68k/asm.h>

	.text

.text; .even; .globl _oc_cksum; _oc_cksum:
	movl	sp@(4),a0	| get buffer ptr
	movl	sp@(8),d1	| get byte count
	movl	sp@(12),d0	| get starting value
	movl	d2,sp@-		| free a reg

	| test for possible 1, 2 or 3 bytes of excess at end
	| of buffer.  The usual case is no excess (the usual
	| case is header checksums) so we give that the faster
	| 'not taken' leg of the compare.  (We do the excess
	| first because we're about to trash the low order
	| bits of the count in d1.)

	btst	#0,d1
	jne	L5		| if one or three bytes excess
	btst	#1,d1
	jne	L7		| if two bytes excess
L1:
	movl	d1,d2
	lsrl	#6,d1		| make cnt into # of 64 byte chunks
	andl	#0x3c,d2	| then find fractions of a chunk
	negl	d2
	andb	#0xf,cc		| clear X
	jmp	pc@(L3-.-2:b,d2)
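	| Worked example of the computed jump above (a reading of the
	| code, not a note from the original author): each movl/addxl
	| pair below occupies 4 bytes of code and sums 4 bytes of data,
	| so the negated byte remainder in d2 is also the code offset
	| back from L3.
	|   count = 40:  d1 = 0, d2 = -40 -> enter 10 pairs before L3;
	|                dbra then falls through (one partial pass).
	|   count = 100: d1 = 1, d2 = -36 -> 9 pairs, then dbra loops
	|                once for a full pass of 16 pairs (25 longwords).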
121 L2:
122 movl a0@+,d2
123 addxl d2,d0
124 movl a0@+,d2
125 addxl d2,d0
126 movl a0@+,d2
127 addxl d2,d0
128 movl a0@+,d2
129 addxl d2,d0
130 movl a0@+,d2
131 addxl d2,d0
132 movl a0@+,d2
133 addxl d2,d0
134 movl a0@+,d2
135 addxl d2,d0
136 movl a0@+,d2
137 addxl d2,d0
138 movl a0@+,d2
139 addxl d2,d0
140 movl a0@+,d2
141 addxl d2,d0
142 movl a0@+,d2
143 addxl d2,d0
144 movl a0@+,d2
145 addxl d2,d0
146 movl a0@+,d2
147 addxl d2,d0
148 movl a0@+,d2
149 addxl d2,d0
150 movl a0@+,d2
151 addxl d2,d0
152 movl a0@+,d2
153 addxl d2,d0
154 L3:
155 dbra d1,L2 | (NB- dbra doesn't affect X)
156
	movl	d0,d1		| fold 32 bit sum to 16 bits
	swap	d1		| (NB- swap doesn't affect X)
	addxw	d1,d0
	jcc	L4
	addw	#1,d0
L4:
	andl	#0xffff,d0
	movl	sp@+,d2
	rts

L5:	| deal with 1 or 3 excess bytes at the end of the buffer.
	btst	#1,d1
	jeq	L6		| if 1 excess

	| 3 bytes excess
	clrl	d2
	movw	a0@(-3,d1:l),d2	| add in last full word then drop
	addl	d2,d0		|  through to pick up last byte

L6:	| 1 byte excess
	clrl	d2
	movb	a0@(-1,d1:l),d2
	lsll	#8,d2
	addl	d2,d0
	jra	L1

L7:	| 2 bytes excess
	clrl	d2
	movw	a0@(-2,d1:l),d2
	addl	d2,d0
	jra	L1
