Home | History | Annotate | Line # | Download | only in k6
README revision 1.1
      1 Copyright 2000, 2001 Free Software Foundation, Inc.
      2 
      3 This file is part of the GNU MP Library.
      4 
      5 The GNU MP Library is free software; you can redistribute it and/or modify
      6 it under the terms of the GNU Lesser General Public License as published by
      7 the Free Software Foundation; either version 3 of the License, or (at your
      8 option) any later version.
      9 
     10 The GNU MP Library is distributed in the hope that it will be useful, but
     11 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
     12 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
     13 License for more details.
     14 
     15 You should have received a copy of the GNU Lesser General Public License
     16 along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
     17 
     18 
     19 
     20 
     21 			AMD K6 MPN SUBROUTINES
     22 
     23 
     24 
     25 This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
     26 K6-3.
     27 
     28 The mmx subdirectory has MMX code suiting plain K6, the k62mmx subdirectory
     29 has MMX code suiting K6-2 and K6-3.  All chips in the K6 family have MMX,
     30 the separate directories are just so that ./configure can omit them if the
     31 assembler doesn't support MMX.
     32 
     33 
     34 
     35 
     36 STATUS
     37 
     38 Times for the loops, with all code and data in L1 cache, are as follows.
     39 
     40                                  cycles/limb
     41 
     42 	mpn_add_n/sub_n            3.25 normal, 2.75 in-place
     43 
     44 	mpn_mul_1                  6.25
     45 	mpn_add/submul_1           7.65-8.4  (varying with data values)
     46 
     47 	mpn_mul_basecase           9.25 cycles/crossproduct (approx)
     48 	mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
     49                                    or 9.2 cycles/triangleproduct (approx)
     50 
     51 	mpn_l/rshift               3.0
     52 
     53 	mpn_divrem_1              20.0
     54 	mpn_mod_1                 20.0
     55 	mpn_divexact_by3          11.0
     56 
     57 	mpn_copyi                  1.0
     58 	mpn_copyd                  1.0
     59 
     60 
     61 K6-2 and K6-3 have dual-issue MMX and get the following improvements.
     62 
     63 	mpn_l/rshift               1.75
     64 
     65 
     66 Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
     67 instruction, code seems to run slower, and with just "mov" loads it doesn't
     68 seem faster.  Results so far are inconsistent.  The K6 does a hardware
     69 prefetch of the second cache line in a sector, so the penalty for not
     70 prefetching in software is reduced.
     71 
     72 
     73 
     74 
     75 NOTES
     76 
     77 All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.
     78 
     79 Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
     80 execute them in both X and Y (and in both together).
     81 
     82 Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
     83 chapter 6 table 12).
     84 
     85 Write-allocate L1 data cache means prefetching of destinations is unnecessary.
     86 Store queue is 7 entries of 64 bits each.
     87 
     88 Floating point multiplications can be done in parallel with integer
     89 multiplications, but there doesn't seem to be any way to make use of this.
     90 
     91 
     92 
     93 OPTIMIZATIONS
     94 
     95 Unrolled loops are used to reduce looping overhead.  The unrolling is
     96 configurable up to 32 limbs/loop for most routines, up to 64 for some.
     97 
     98 Sometimes computed jumps into the unrolling are used to handle sizes not a
     99 multiple of the unrolling.  An attractive feature of this is that times
    100 smoothly increase with operand size, but an indirect jump is about 6 cycles
    101 and the setups about another 6, so it depends on how much the unrolled code
    102 is faster than a simple loop as to whether a computed jump ought to be used.
    103 
    104 Position independent code is implemented using a call to get eip for
    105 computed jumps and a ret is always done, rather than an addl $4,%esp or a
    106 popl, so the CPU return address branch prediction stack stays synchronised
    107 with the actual stack in memory.  Such a call however still costs 4 to 7
    108 cycles.
    109 
    110 Branch prediction, in absence of any history, will guess forward jumps are
    111 not taken and backward jumps are taken.  Where possible it's arranged that
    112 the less likely or less important case is under a taken forward jump.
    113 
    114 
    115 
    116 MMX
    117 
    118 Putting emms or femms as late as possible in a routine seems to be fastest.
    119 Perhaps an emms or femms stalls until all outstanding MMX instructions have
    120 completed, so putting it later gives them a chance to complete on their own,
    121 in parallel with other operations (like register popping).
    122 
    123 The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
    124 at the start of a routine, in case it's been preceded by x87 floating point
    125 operations.  This isn't done because in gmp programs it's expected that x87
    126 floating point won't be much used and that chances are an mpn routine won't
    127 have been preceded by any x87 code.
    128 
    129 
    130 
    131 CODING
    132 
    133 Instructions in general code are shown paired if they can decode and execute
    134 together, meaning two short decode instructions with the second not
    135 depending on the first, only the first using the shifter, no more than one
    136 load, and no more than one store.
    137 
    138 K6 does some out of order execution so the pairings aren't essential, they
    139 just show what slots might be available.  When decoding is the limiting
    140 factor things can be scheduled that might not execute until later.
    141 
    142 
    143 
    144 NOTES
    145 
    146 Code alignment
    147 
    148 - if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
    149   short decode is inhibited.  The cross.pl script detects this.
    150 
    151 - loops and branch targets should be aligned to 16 bytes, or ensure at least
    152   2 instructions before a 32 byte boundary.  This makes use of the 16 byte
    153   cache in the BTB.
    154 
    155 Addressing modes
    156 
    157 - (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
    158   problem, and can be used as an equivalent, or easier is just to use a
    159   different register, like %ebx.
    160 
    161 - K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
    162   have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).
    163 
    164   If more than 3 bytes are needed to determine instruction length then
    165   decoding degrades from direct to long, or from long to vector.  This
    166   happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
    167   with mod=00 the sib determines whether there's a displacement.
    168 
    169   This affects all MMX and 3DNow instructions, and others with an 0F prefix,
    170   like movzbl.  The modes affected are anything with an index and no
    171   displacement, or an index but no base, and this includes (%esp) which is
    172   really (,%esp,1).
    173 
    174   The cross.pl script detects problem cases.  The workaround is to always
    175   use a displacement, and to do this with Zdisp if it's zero so the
    176   assembler doesn't discard it.
    177 
    178   See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
    179   13-14 and 36-37.
    180 
    181 Calls
    182 
    183 - indirect jumps and calls are not branch predicted, they measure about 6
    184   cycles.
    185 
    186 Various
    187 
    188 - adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
    189 - bsf       12-27 cycles
    190 - emms      5 cycles
    191 - femms     3 cycles
    192 - jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
    193 - divl      20 cycles back-to-back
    194 - imull     2 decode, 3 execute
    195 - mull      2 decode, 3 execute (optimization manual decoding sample)
    196 - prefetch  2 cycles
    197 - rcll/rcrl implicit by one bit: 2 cycles
    198             immediate or %cl count: 11 + 2 per bit for dword
    199                                     13 + 4 per bit for byte
    200 - setCC	    2 cycles
    201 - xchgl	%eax,reg  1.5 cycles, back-to-back (strange)
    202         reg,reg   2 cycles, back-to-back
    203 
    204 
    205 
    206 
    207 REFERENCES
    208 
    209 "AMD-K6 Processor Code Optimization Application Note", AMD publication
    210 number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
    211 K6-3.  Available on-line,
    212 
    213 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf
    214 
    215 "AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
    216 publication number 21828, revision A amendment 0, August 1997.  This is an
    217 older edition of the above document, describing plain K6.  Available
    218 on-line,
    219 
    220 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf
    221 
    222 "3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
    223 This describes the femms and prefetch instructions, but nothing else from
    224 3DNow has been used.  Available on-line,
    225 
    226 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf
    227 
    228 "3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
    229 August 1999.  This has some notes on general K6 optimizations as well as
    230 3DNow.  Available on-line,
    231 
    232 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf
    233 
    234 
    235 
    236 ----------------
    237 Local variables:
    238 mode: text
    239 fill-column: 76
    240 End:
    241