Directory: /src/external/lgpl3/gmp/dist/mpn/x86/k7
Name                Date         Size
addlsh1_n.asm       27-Sep-2020  4.6K
aors_n.asm          22-Aug-2017  5.7K
aorsmul_1.asm       22-Aug-2017  3.4K
bdiv_q_1.asm        22-Aug-2017  5.1K
dive_1.asm          22-Aug-2017  4.3K
gcd_11.asm          27-Sep-2020  2.4K
gmp-mparam.h        27-Sep-2020  12.8K
invert_limb.asm     22-Aug-2017  7.1K
mmx/                25-Feb-2026
mod_1_1.asm         22-Aug-2017  4.2K
mod_1_4.asm         22-Aug-2017  4.8K
mod_34lsub1.asm     22-Aug-2017  3.7K
mode1o.asm          22-Aug-2017  4K
mul_1.asm           22-Aug-2017  4.3K
mul_basecase.asm    22-Aug-2017  11.6K
README              22-Aug-2017  5.7K
sqr_basecase.asm    22-Aug-2017  12K
sublsh1_n.asm       27-Sep-2020  3.9K

README

Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX; the separate directory exists only so that configure can omit it if
the assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
                                  or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.
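The two mpn_sqr_basecase figures above describe the same loop counted two
ways.  As a rough sketch in C, and assuming (the README does not define the
terms) that a "crossproduct" is one of the n*n products a plain multiply
would form and a "triangle product" is one of the n*(n+1)/2 distinct
products u[i]*u[j] with i <= j, the conversion can be pictured as below.
The function names are invented for illustration.

```c
#include <assert.h>

/* Products formed by a schoolbook n x n multiply: every u[i]*u[j]. */
static unsigned long crossproducts(unsigned long n)
{
    return n * n;
}

/* ASSUMPTION: a "triangle product" is one of the distinct products
   u[i]*u[j] with i <= j, so there are n*(n+1)/2 of them. */
static unsigned long triangleproducts(unsigned long n)
{
    return n * (n + 1) / 2;
}

/* Estimated cost of an n-limb square at the quoted 4.55
   cycles/triangleproduct, re-expressed per crossproduct. */
static double sqr_cycles_per_crossproduct(unsigned long n)
{
    assert(n > 0);
    return 4.55 * (double) triangleproducts(n)
                / (double) crossproducts(n);
}
```

Under that assumption, n = 64 gives about 2.31 cycles/crossproduct, close
to the 2.3 quoted in the table.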



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 cycles per
multiply is a lower bound for the multiplication routines.  The
documentation shows mul executing in IEU0 (or maybe in IEU0 and IEU1
together), so it might be that, to get near 3 cycles, code has to be
arranged so that nothing else is issued to IEU0.  A busy IEU0 could explain
why some code takes 4 cycles and other apparently equivalent code takes 5.
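To see where the figure of 3 comes from, here is a plain C sketch of what
mpn_mul_1 computes, assuming 32-bit limbs as on x86-32.  It is a reference
loop, not the K7 assembly, and limb_t and ref_mul_1 are names invented
here.  Each limb costs exactly one widening multiply, so if a mul can only
be issued every 3 cycles the loop cannot go below 3 cycles/limb; the 3.4
measured for mpn_mul_1 sits just above that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; 32-bit limbs assumed */

/* rp[0..n-1] = low limbs of up[0..n-1] * v; returns the high limb. */
static limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    size_t i;

    assert(n >= 1);
    for (i = 0; i < n; i++) {
        /* One widening 32x32->64 multiply per limb ("mull" on x86).
           The sum cannot overflow 64 bits. */
        uint64_t p = (uint64_t) up[i] * v + carry;
        rp[i] = (limb_t) p;             /* low word is the result limb */
        carry = (limb_t) (p >> 32);     /* high word feeds the next limb */
    }
    return carry;
}
```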



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes that are not a
multiple of the unrolling.  An attractive feature of this is that times
increase smoothly with operand size, but it may be that some routines
should just have simple loops to finish up, especially when PIC adds
between 2 and 16 cycles to get %eip.
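The idea of jumping into the middle of an unrolled loop can be pictured in
C with the construct known as Duff's device.  This is only an analogy: the
assembly computes a code address and jumps to it, whereas here a switch
picks the entry point, the unrolling is 4 rather than 32, and the carry is
kept in a variable instead of the carry flag.  limb_t, ADD_STEP and
sketch_add_n are names invented for this sketch, which assumes 32-bit
limbs and requires n >= 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; 32-bit limbs assumed */

/* Add one limb pair plus carry, then step all three pointers. */
#define ADD_STEP()                                      \
    do {                                                \
        uint64_t s = (uint64_t) *up++ + *vp++ + cy;     \
        *rp++ = (limb_t) s;                             \
        cy = (limb_t) (s >> 32);                        \
    } while (0)

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry out. */
static limb_t sketch_add_n(limb_t *rp, const limb_t *up,
                           const limb_t *vp, size_t n)
{
    limb_t cy = 0;
    size_t passes = (n + 3) / 4;   /* trips through the 4-way unroll */

    assert(n >= 1);
    switch (n % 4) {               /* stands in for the computed jump */
    case 0: do { ADD_STEP();
    case 3:      ADD_STEP();
    case 2:      ADD_STEP();
    case 1:      ADD_STEP();
            } while (--passes > 0);
    }
    return cy;
}
```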

Position independent code is implemented using a call to get %eip for the
computed jumps.  A ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

When there are vector-path instructions an effort is made to get triplets
of direct-path instructions in between them, even if there are
dependencies, since this maximizes decoding throughput and might save a
cycle or two if decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: