Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                        AMD K6 MPN SUBROUTINES



This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx subdirectory has MMX code suiting plain K6, the k62mmx subdirectory
has MMX code suiting K6-2 and K6-3.  All chips in the K6 family have MMX;
the separate directories are just so that ./configure can omit them if the
assembler doesn't support MMX.




STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

        mpn_add_n/sub_n            3.25 normal, 2.75 in-place

        mpn_mul_1                  6.25
        mpn_add/submul_1           7.65-8.4  (varying with data values)

        mpn_mul_basecase           9.25 cycles/crossproduct (approx)
        mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                   or 9.2 cycles/triangleproduct (approx)

        mpn_l/rshift               3.0

        mpn_divrem_1              20.0
        mpn_mod_1                 20.0
        mpn_divexact_by3          11.0

        mpn_copyi                  1.0
        mpn_copyd                  1.0


K6-2 and K6-3 have dual-issue MMX and get the following improvements.

        mpn_l/rshift               1.75

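As a rough illustration of how the figures above translate into loop
times, here is a minimal C sketch.  The function and macro names are
hypothetical and not part of GMP, the constants are simply copied from the
plain K6 table, and call and setup overheads are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from the table above (plain K6, everything in L1).  */
#define K6_MUL_1_CYCLES_PER_LIMB        6.25
#define K6_MUL_BASECASE_CYCLES_PER_XP   9.25   /* per crossproduct, approx */

/* Hypothetical helper: estimated loop cycles for mpn_mul_1 on n limbs.  */
double
k6_est_mul_1_cycles (size_t n)
{
  return K6_MUL_1_CYCLES_PER_LIMB * (double) n;
}

/* Hypothetical helper: an xn by yn basecase multiply forms xn*yn
   crossproducts.  xn >= yn >= 1 is assumed here.  */
double
k6_est_mul_basecase_cycles (size_t xn, size_t yn)
{
  assert (xn >= yn && yn >= 1);
  return K6_MUL_BASECASE_CYCLES_PER_XP * (double) xn * (double) yn;
}
```

For example, k6_est_mul_1_cycles (100) gives 625 cycles.  These are
estimates for the inner loops only.
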

Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.




NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and in both together).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
increase smoothly with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so whether a computed jump ought to be used
depends on how much faster the unrolled code is than a simple loop.

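The trade-off for computed jumps can be put as a small break-even model.
The following C sketch is only an illustration: the function name is
hypothetical, the 12 cycle overhead is the sum of the two approximate
figures quoted above, and the cycles/limb arguments are assumed to have
been measured by the caller.

```c
#include <assert.h>

/* About 6 cycles for the indirect jump plus about 6 for the setups,
   as quoted in the text.  */
#define COMPUTED_JUMP_OVERHEAD 12.0

/* Hypothetical model: nonzero if entering the unrolled code through a
   computed jump is estimated to beat a simple loop for n limbs.  simple
   and unrolled are cycles/limb for the two loops.  */
int
computed_jump_wins (double simple, double unrolled, unsigned long n)
{
  assert (simple > 0.0 && unrolled > 0.0);
  return COMPUTED_JUMP_OVERHEAD + unrolled * (double) n
         < simple * (double) n;
}
```

With a simple loop at 4.0 cycles/limb and unrolled code at 3.0, the model
says the computed jump only pays off above 12 limbs.
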
Position independent code for computed jumps is implemented using a call to
get eip.  The call is always matched by a ret, rather than an addl $4,%esp
or a popl, so the CPU return address branch prediction stack stays
synchronised with the actual stack in memory.  Such a call however still
costs 4 to 7 cycles.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken.  Where possible it's arranged that
the less likely or less important case is under a taken forward jump.



MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done because in gmp programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.

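The pairing conditions just listed can be written down as a predicate.
This is a minimal C sketch for illustration only; the struct and its
fields are hypothetical, and a real check would have to derive them from
the instruction encodings and the register usage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one instruction.  */
struct k6_insn
{
  bool short_decode;     /* handled by a short decoder */
  bool uses_shifter;
  bool loads;            /* does a memory load */
  bool stores;           /* does a memory store */
  bool depends_on_prev;  /* needs a result of the preceding instruction */
};

/* True if first and second meet the pairing conditions from the text.  */
bool
k6_can_pair (const struct k6_insn *first, const struct k6_insn *second)
{
  assert (first != NULL && second != NULL);
  return first->short_decode && second->short_decode
    && ! second->depends_on_prev           /* second independent of first */
    && ! second->uses_shifter              /* only the first may shift */
    && ! (first->loads && second->loads)   /* no more than one load */
    && ! (first->stores && second->stores); /* no more than one store */
}
```

Swapping the two arguments can change the answer, since only the first
instruction of a pair may use the shifter.
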
K6 does some out of order execution, so the pairings aren't essential; they
just show what slots might be available.  When decoding is the limiting
factor, things can be scheduled that might not execute until later.



NOTES

Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited.  The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary.  This makes use of the 16 byte
  cache in the BTB.

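The cache line crossing condition in the first item above amounts to
simple address arithmetic.  The following C sketch is not taken from
cross.pl; the function name is hypothetical and a 32 byte cache line is
assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define K6_CACHE_LINE 32u   /* assumed L1 line size in bytes */

/* True if the len bytes starting at addr straddle a cache line boundary.
   len would be 2 for opcode/modrm and 3 for 0Fh/opcode/modrm.  Addresses
   close enough to UINT32_MAX for addr+len to wrap are not handled.  */
bool
crosses_cache_line (uint32_t addr, unsigned len)
{
  assert (len >= 1 && len <= K6_CACHE_LINE);
  return addr / K6_CACHE_LINE != (addr + len - 1) / K6_CACHE_LINE;
}
```

An opcode/modrm pair at offset 31 crosses a boundary, whereas one at
offset 30 does not.
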
Addressing modes

- (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
  problem, and can be used as an equivalent, or easier is just to use a
  different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length then
  decoding degrades from direct to long, or from long to vector.  This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
  with mod=00 the sib determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix,
  like movzbl.  The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp) which is
  really (,%esp,1).

  The cross.pl script detects problem cases.  The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero so the
  assembler doesn't discard it.

  See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
  13-14 and 36-37.

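The mod/rm=00-xxx-100 pattern above can be tested with a mask.  This is a
minimal C sketch of such a test, not the actual cross.pl logic; the
function names are hypothetical and 32-bit addressing is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True for mod/rm = 00-xxx-100: mod (bits 7-6) is 00 and r/m (bits 2-0)
   is 100, so a sib byte follows and must be examined to find out whether
   the instruction has a displacement.  */
bool
modrm_needs_sib_for_length (uint8_t modrm)
{
  return (modrm & 0xC7) == 0x04;
}

/* For such a mod/rm, a sib base field (bits 2-0) of 101 means there is
   no base register and a 32-bit displacement follows.  */
bool
sib_implies_disp32 (uint8_t modrm, uint8_t sib)
{
  assert (modrm_needs_sib_for_length (modrm));
  return (sib & 0x07) == 0x05;
}
```

A form with a one byte displacement has mod=01 and so does not match the
mask, which is why always using a displacement avoids the problem.
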
Calls

- indirect jumps and calls are not branch predicted, they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 3 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC     2 cycles
- xchgl %eax,reg  1.5 cycles, back-to-back (strange)
        reg,reg   2 cycles, back-to-back

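The rcll/rcrl figures for an immediate or %cl count are linear in the
count.  As a small worked form, here is a C sketch using exactly the
constants listed above; the function names are hypothetical, and the count
ranges in the assertions are assumptions about which counts are sensible.

```c
#include <assert.h>

/* Cycles for rcll/rcrl on a dword with an immediate or %cl count,
   per the figures above: 11 + 2 per bit.  */
unsigned
k6_rcl_dword_cycles (unsigned count)
{
  assert (count >= 1 && count <= 31);
  return 11 + 2 * count;
}

/* The same for a byte operand: 13 + 4 per bit.  */
unsigned
k6_rcl_byte_cycles (unsigned count)
{
  assert (count >= 1 && count <= 8);
  return 13 + 4 * count;
}
```

By this measure a dword rotate by 4 bits costs 19 cycles, against 8 cycles
for four separate one bit forms at 2 cycles each.
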



REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
K6-3.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of the above document, describing plain K6.  Available
on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general K6 optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End: