|	$NetBSD: oc_cksum.s,v 1.2 2000/11/30 22:26:27 scw Exp $

| Copyright (c) 1988 Regents of the University of California.
| All rights reserved.
|
| Redistribution and use in source and binary forms, with or without
| modification, are permitted provided that the following conditions
| are met:
| 1. Redistributions of source code must retain the above copyright
|    notice, this list of conditions and the following disclaimer.
| 2. Redistributions in binary form must reproduce the above copyright
|    notice, this list of conditions and the following disclaimer in the
|    documentation and/or other materials provided with the distribution.
| 3. All advertising materials mentioning features or use of this software
|    must display the following acknowledgement:
|	This product includes software developed by the University of
|	California, Berkeley and its contributors.
| 4. Neither the name of the University nor the names of its contributors
|    may be used to endorse or promote products derived from this software
|    without specific prior written permission.
|
| THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
| ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
| ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
| FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
| OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
| HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
| LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
| OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
| SUCH DAMAGE.
|
|	@(#)oc_cksum.s	7.2 (Berkeley) 11/3/90
|
|
| oc_cksum: one's complement 16-bit checksum for MC68020.
|
| oc_cksum (buffer, count, strtval)
|
| Do a 16-bit one's complement sum of 'count' bytes from 'buffer'.
| 'strtval' is the starting value of the sum (usually zero).
|
| It simplifies life in in_cksum if strtval can be >= 2^16.
| This routine will work as long as strtval is < 2^31.
|
| Performance
| -----------
| This routine is intended for MC 68020s but should also work
| for 68030s.  It (deliberately) doesn't worry about the alignment
| of the buffer, so it will only work on a 68010 if the buffer is
| aligned on an even address.  (Also, a routine written to use
| 68010 "loop mode" would almost certainly be faster than this
| code on a 68010.)
|
| We don't worry about alignment because this routine is frequently
| called with small counts: 20 bytes for IP header checksums and 40
| bytes for TCP ack checksums.  For these small counts, testing for
| bad alignment adds ~10% to the per-call cost.  Since, by the nature
| of the kernel's allocator, the data we're called with is almost
| always longword aligned, there is no benefit to this added cost
| and we're better off letting the loop take a big performance hit
| in the rare cases where we're handed an unaligned buffer.
|
| Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
| tested on random data on four different types of processors (see
| list below -- 64 was the largest unrolling because anything more
| overflows the 68020 Icache).  On all the processors, the
| throughput asymptote was located between 8 and 16 (closer to 8).
| However, 16 was substantially better than 8 for small counts.
| (It's clear why this happens for a count of 40: unroll-8 pays a
| loop branch cost and unroll-16 doesn't.  But the tests also showed
| that 16 was better than 8 for a count of 20.  It's not obvious to
| me why.)  So, since 16 was good for both large and small counts,
| the loop below is unrolled 16 times.
|
| The processors tested and their average time to checksum 1024 bytes
| of random data were:
| 	Sun 3/50 (15MHz)	190 us/KB
| 	Sun 3/180 (16.6MHz)	175 us/KB
| 	Sun 3/60 (20MHz)	134 us/KB
| 	Sun 3/280 (25MHz)	 95 us/KB
|
| The cost of calling this routine was typically 10% of the per-
| kilobyte cost.  E.g., checksumming zero bytes on a 3/60 cost 9us
| and each additional byte cost 125ns.  With the high fixed cost,
| it would clearly be a gain to "inline" this routine -- the
| subroutine call adds 400% overhead to an IP header checksum.
| However, in absolute terms, inlining would only gain 10us per
| packet -- a 1% effect for a 1ms ethernet packet.  This is not
| enough gain to be worth the effort.
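|
| For comparison, the semantics described above can be modelled in
| portable C.  This is only a sketch and not part of the original
| source: the name oc_cksum_ref and its C types are hypothetical, and
| it assumes the buffer holds big-endian 16-bit words, with an odd
| trailing byte taken as the high byte of a final word, as the 68020
| code does.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference model: 16-bit one's complement sum of
 * 'count' bytes from 'buf', starting from 'start'.
 */
uint16_t oc_cksum_ref(const uint8_t *buf, size_t count, uint32_t start)
{
    uint64_t sum = start;   /* wide accumulator, so carries are not lost */
    size_t i;

    /* Add each complete big-endian 16-bit word. */
    for (i = 0; i + 1 < count; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

    /* An odd trailing byte counts as the high byte of one more word. */
    if (i < count)
        sum += (uint32_t)buf[i] << 8;

    /* Fold the carries back in (end-around carry) until 16 bits remain. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```

| Like the assembly routine, the sketch returns the folded sum itself;
| complementing it, if needed, is left to the caller.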

#include <m68k/asm.h>

	.text
	.even

ENTRY_NOPROFILE(oc_cksum)
	movl	%sp@(4),%a0	| get buffer ptr
	movl	%sp@(8),%d1	| get byte count
	movl	%sp@(12),%d0	| get starting value
	movl	%d2,%sp@-	| free a reg

	| test for possible 1, 2 or 3 bytes of excess at end
	| of buffer.  The usual case is no excess (the usual
	| case is header checksums) so we give that the faster
	| 'not taken' leg of the compare.  (We do the excess
	| first because we're about to trash the low order
	| bits of the count in d1.)

	btst	#0,%d1
	jne	L5		| if one or three bytes excess
	btst	#1,%d1
	jne	L7		| if two bytes excess
L1:
	movl	%d1,%d2
	lsrl	#6,%d1		| make cnt into # of 64 byte chunks
	andl	#0x3c,%d2	| then find fractions of a chunk
	negl	%d2		| each movl/addxl pair below is 4 bytes of
				|  code per 4 bytes of data, so index
				|  backwards from L3 by that many bytes
	andb	#0xf,%ccr		| clear X
	jmp	%pc@(L3-.-2:b,%d2)	| enter the unrolled loop part way in
    122  1.1  chuck L2:
    123  1.2    scw 	movl	%a0@+,%d2
    124  1.2    scw 	addxl	%d2,%d0
    125  1.2    scw 	movl	%a0@+,%d2
    126  1.2    scw 	addxl	%d2,%d0
    127  1.2    scw 	movl	%a0@+,%d2
    128  1.2    scw 	addxl	%d2,%d0
    129  1.2    scw 	movl	%a0@+,%d2
    130  1.2    scw 	addxl	%d2,%d0
    131  1.2    scw 	movl	%a0@+,%d2
    132  1.2    scw 	addxl	%d2,%d0
    133  1.2    scw 	movl	%a0@+,%d2
    134  1.2    scw 	addxl	%d2,%d0
    135  1.2    scw 	movl	%a0@+,%d2
    136  1.2    scw 	addxl	%d2,%d0
    137  1.2    scw 	movl	%a0@+,%d2
    138  1.2    scw 	addxl	%d2,%d0
    139  1.2    scw 	movl	%a0@+,%d2
    140  1.2    scw 	addxl	%d2,%d0
    141  1.2    scw 	movl	%a0@+,%d2
    142  1.2    scw 	addxl	%d2,%d0
    143  1.2    scw 	movl	%a0@+,%d2
    144  1.2    scw 	addxl	%d2,%d0
    145  1.2    scw 	movl	%a0@+,%d2
    146  1.2    scw 	addxl	%d2,%d0
    147  1.2    scw 	movl	%a0@+,%d2
    148  1.2    scw 	addxl	%d2,%d0
    149  1.2    scw 	movl	%a0@+,%d2
    150  1.2    scw 	addxl	%d2,%d0
    151  1.2    scw 	movl	%a0@+,%d2
    152  1.2    scw 	addxl	%d2,%d0
    153  1.2    scw 	movl	%a0@+,%d2
    154  1.2    scw 	addxl	%d2,%d0
    155  1.1  chuck L3:
    156  1.2    scw 	dbra	%d1,L2		| (NB- dbra doesn't affect X)

	movl	%d0,%d1		| fold 32 bit sum to 16 bits
	swap	%d1		| (NB- swap doesn't affect X)
	addxw	%d1,%d0
	jcc	L4
	addw	#1,%d0
L4:
	andl	#0xffff,%d0
	movl	%sp@+,%d2
	rts

L5:	| deal with 1 or 3 excess bytes at the end of the buffer.
	btst	#1,%d1
	jeq	L6		| if 1 excess

	| 3 bytes excess
	clrl	%d2
	movw	%a0@(-3,%d1:l),%d2	| add in last full word then drop
	addl	%d2,%d0		|  through to pick up last byte

L6:	| 1 byte excess
	clrl	%d2
	movb	%a0@(-1,%d1:l),%d2
	lsll	#8,%d2
	addl	%d2,%d0
	jra	L1

L7:	| 2 bytes excess
	clrl	%d2
	movw	%a0@(-2,%d1:l),%d2
	addl	%d2,%d0
	jra	L1
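
| The control flow of the routine can also be sketched in C.  This is
| an illustrative model under stated assumptions, not a translation
| verified against hardware: oc_cksum_model and load_be32 are
| hypothetical names, the variable x stands in for the X (extend)
| flag, and a plain loop replaces the computed jump into the 16
| unrolled movl/addxl pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian longword, as a 68020 "movl" from memory would. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical C model of the assembly's three phases. */
uint16_t oc_cksum_model(const uint8_t *buf, size_t count, uint32_t sum)
{
    unsigned x = 0;             /* models the X flag, cleared before the loop */
    size_t nlong = count / 4;   /* whole longwords handled by the main loop */
    size_t i;
    uint32_t t;

    /* Phase 1 (L5/L6/L7): add the 1, 2 or 3 excess bytes first. */
    if (count & 1) {
        if (count & 2)          /* 3 excess: last full word, then fall through */
            sum += ((uint32_t)buf[count - 3] << 8) | buf[count - 2];
        sum += (uint32_t)buf[count - 1] << 8;   /* odd byte is the high byte */
    } else if (count & 2) {     /* 2 excess: one trailing word */
        sum += ((uint32_t)buf[count - 2] << 8) | buf[count - 1];
    }

    /* Phase 2 (L2/L3): 32-bit adds, each carry fed into the next add. */
    for (i = 0; i < nlong; i++) {
        uint64_t wide = (uint64_t)sum + load_be32(buf + 4 * i) + x;
        sum = (uint32_t)wide;
        x = (unsigned)(wide >> 32);
    }

    /* Phase 3: fold to 16 bits (addxw), then add back the final carry. */
    t = (sum & 0xffff) + (sum >> 16) + x;
    t = (t & 0xffff) + (t >> 16);
    return (uint16_t)t;
}
```

| Phase 1 uses plain adds with no carry handling, which is why the
| header requires strtval < 2^31: the excess bytes add at most
| 0xffff + 0xff00, so the 32-bit sum cannot overflow before the loop.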