Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                   INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions, and the sse2
subdirectory has routines using SSE2 instructions.  All P4s have both; the
separate directories are just so configure can omit that code if the
assembler doesn't support it.


STATUS

                                cycles/limb

        mpn_add_n/sub_n            4 normal, 6 in-place

        mpn_mul_1                  4 normal, 6 in-place
        mpn_addmul_1               6
        mpn_submul_1               7

        mpn_mul_basecase           6 cycles/crossproduct (approx)

        mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

        mpn_lshift/rshift          1.75


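As a rough illustration of how the basecase figures relate, the following
C sketch turns them into cycle estimates.  The function and macro names are
hypothetical.  It assumes an n x n limb basecase multiply forms n*n
crossproducts and that a square needs about n*n/2 triangle products, which
would be why 3.5 cycles/crossproduct and 7.0 cycles/triangleproduct
describe the same speed.

```c
/* Approximate figures copied from the STATUS table. */
#define P4_MUL_BASECASE_CPC 6.0   /* cycles per crossproduct */
#define P4_SQR_BASECASE_CPC 3.5   /* cycles per crossproduct */
#define P4_SQR_BASECASE_CPT 7.0   /* cycles per triangle product */

/* Estimated cycles for an n x n limb basecase multiply,
   assuming n*n crossproducts and ignoring call overhead. */
double p4_est_mul_basecase(unsigned long n)
{
    return P4_MUL_BASECASE_CPC * (double) n * (double) n;
}

/* Estimated cycles for an n limb basecase square, counted per
   crossproduct of the equivalent n x n multiply. */
double p4_est_sqr_basecase(unsigned long n)
{
    return P4_SQR_BASECASE_CPC * (double) n * (double) n;
}

/* The same estimate counted per triangle product, under the
   assumption that there are about n*n/2 of them. */
double p4_est_sqr_basecase_tri(unsigned long n)
{
    return P4_SQR_BASECASE_CPT * ((double) n * (double) n / 2.0);
}
```

For n = 10 this gives 600 cycles for the multiply and 350 for the square
by either count, before any per-call overhead, which the table does not
cover.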

The shifts ought to be able to go at 1.5 cycles/limb, but not much effort
has been applied to them yet.

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations.  The movq
instructions do not trigger the same hardware problems.  Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help.  Perhaps future chip steppings will be better.



NOTES

The Pentium-4 pipeline, "Netburst", holds quite a number of surprises.
Many traditional x86 instructions run very slowly, requiring the use of
alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32-bit
values within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry still has a 4 cycle latency.

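The carry scheme described above can be sketched in portable C.  This is
only an illustration of the idea, not the assembly in this directory: the
name add_n_sketch is hypothetical, and a uint64_t accumulator stands in
for the 64-bit mmx register.  Each 32-bit limb pair is summed in 64 bits
(the paddq step), the low half is stored, and a right shift by 32 (the
psrlq step) leaves the carry in bit 0 for the next limb, so no adcl is
needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the n-limb numbers {up,n} and {vp,n} of 32-bit limbs, store the
   sum at {rp,n}, and return the carry out (0 or 1).  rp may equal up
   or vp, since each limb is read before it is written. */
uint32_t add_n_sketch(uint32_t *rp, const uint32_t *up,
                      const uint32_t *vp, size_t n)
{
    uint64_t acc = 0;   /* stands in for the 64-bit mmx register */
    size_t i;

    for (i = 0; i < n; i++) {
        acc += (uint64_t) up[i] + vp[i];  /* like paddq: 64-bit sum */
        rp[i] = (uint32_t) acc;           /* like movd: low 32 bits */
        acc >>= 32;                       /* like psrlq $32: carry */
    }
    return (uint32_t) acc;
}
```

The accumulator never exceeds 2*(2^32-1) + 1, so the 64-bit sum cannot
overflow, and the add and shift on acc form the dependent chain whose
latency is discussed above.
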
incl and decl should be avoided; use add $1 and sub $1 instead.  Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more).  shldl and
shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.

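One way to do without a double-shift instruction, sketched here in
portable C, is to combine two adjacent 32-bit limbs into one 64-bit value
and shift that.  The name lshift_sketch is hypothetical and this is not
the code in this directory; it assumes n >= 1 and 1 <= cnt <= 31, and
returns the bits shifted out of the top limb.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift {up,n} left by cnt bits, store the result at {rp,n}, and return
   the bits shifted out at the top.  Assumes n >= 1 and 1 <= cnt <= 31.
   Limbs are processed from high to low, so rp may equal up. */
uint32_t lshift_sketch(uint32_t *rp, const uint32_t *up,
                       size_t n, unsigned cnt)
{
    uint32_t out = up[n - 1] >> (32 - cnt);
    size_t i;

    for (i = n - 1; i > 0; i--) {
        /* Two adjacent limbs as one 64-bit value, higher limb on top. */
        uint64_t pair = ((uint64_t) up[i] << 32) | up[i - 1];
        /* After the 64-bit shift the upper half is the result limb. */
        rp[i] = (uint32_t) ((pair << cnt) >> 32);
    }
    rp[0] = up[0] << cnt;
    return out;
}
```

The 64-bit shift takes the place of a shldl on the limb pair; in assembly
the same effect could be sought with 64-bit mmx shifts.
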
movq mmx -> mmx does have a 6 cycle latency, as noted in the documentation.
A pxor/por or similar combination at 2 cycles latency can be used instead.
The movq, however, executes in the float unit, thereby saving MMX execution
resources.  With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining sounds like it does enough
that explicit destination prefetching is not required.

The xmm registers haven't found a use so far, but not much effort has been
expended.  A configure test for whether the operating system knows
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

        http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966.  Available on-line:

        http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: