Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                   INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions, and the sse2
subdirectory has routines using SSE2 instructions.  All P4s have these; the
separate directories exist only so configure can omit that code if the
assembler doesn't support it.


STATUS

                                cycles/limb

	mpn_add_n/sub_n            4 normal, 6 in-place

	mpn_mul_1                  4 normal, 6 in-place
	mpn_addmul_1               6
	mpn_submul_1               7

	mpn_mul_basecase           6 cycles/crossproduct (approx)

	mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

	mpn_l/rshift               1.75



The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.

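The two mpn_sqr_basecase figures in the table above describe the same speed
counted two ways.  The following C sketch shows the arithmetic.  It assumes
that "triangleproduct" means a product u[i]*u[j] with i <= j; the README
does not define the term, and the function names are made up for
illustration.

```c
#include <stddef.h>

/* All n*n limb products, as a plain n x n multiply would count them. */
size_t crossproducts(size_t n) { return n * n; }

/* Products u[i]*u[j] with i <= j.  This reading of "triangleproduct"
   is an assumption, not something the README states. */
size_t triangleproducts(size_t n) { return n * (n + 1) / 2; }

/* Estimated cycles under each counting, using the approximate figures
   from the STATUS table. */
double cycles_by_cross(size_t n)
{
    return 3.5 * (double)crossproducts(n);
}

double cycles_by_triangle(size_t n)
{
    return 7.0 * (double)triangleproducts(n);
}
```

Under that assumption the two estimates differ by a factor of (n+1)/n, so
they agree closely once n is more than a handful of limbs.
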
In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations.  The movq
instructions do not trigger the same hardware problems.  Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help.  Perhaps future chip steppings will be better.



NOTES

The Pentium-4 pipeline, "Netburst", provides quite a number of surprises.
Many traditional x86 instructions run very slowly, requiring use of
alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32-bit
values within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry is still a 4 cycle latency.

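The paddq/psrlq idea can be modelled in portable C as below.  This is only
a sketch of the technique, assuming 32-bit limbs; it is not the assembly
code in this directory, and the names limb_t and add_n_sketch are made up
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; assumes 32-bit limbs */

/* Add {up,n} and {vp,n}, store the low n limbs at rp, return the carry.
   The carry lives in the upper half of a 64-bit accumulator rather than
   in the flags, so no adcl is needed. */
limb_t add_n_sketch(limb_t *rp, const limb_t *up, const limb_t *vp,
                    size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += up[i];          /* 64-bit add, the role paddq plays */
        acc += vp[i];          /* second 64-bit add */
        rp[i] = (limb_t)acc;   /* store the low 32 bits */
        acc >>= 32;            /* the role psrlq $32 plays: 0 or 1 left */
    }
    return (limb_t)acc;
}
```

The shift at the end of each iteration feeds the next iteration's add,
which is the carry-propagation chain the 4 cycle figure above refers to.
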
incl and decl should be avoided, instead use add $1 and sub $1.  Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more).  shldl and
shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.

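One way to shift a multi-limb number without shldl is to form each result
limb from a single 64-bit shift of two adjacent limbs.  The C sketch below
models that idea, assuming 32-bit limbs; it is an illustration only, not
the code in this directory, and lshift_sketch is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift {up,n} left by cnt bits, store at rp, return the bits shifted
   out of the top limb.  Requires n >= 1 and 1 <= cnt <= 31.  Works from
   the high end down, so rp == up is allowed. */
uint32_t lshift_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                       unsigned cnt)
{
    uint32_t out = up[n - 1] >> (32 - cnt);
    for (size_t i = n - 1; i > 0; i--) {
        /* Pair two adjacent limbs in one 64-bit value and shift once;
           the high half of the result is the new limb i. */
        uint64_t pair = ((uint64_t)up[i] << 32) | up[i - 1];
        rp[i] = (uint32_t)((pair << cnt) >> 32);
    }
    rp[0] = up[0] << cnt;
    return out;
}
```
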
movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
pxor/por or similar combination at 2 cycles latency can be used instead.
The movq, however, executes in the float unit, thereby saving MMX execution
resources.  With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining sounds like it does enough to
not require explicit destination prefetching.

No use has been found for the xmm registers so far, but not much effort has
been expended.  A configure test for whether the operating system knows
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

	http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966.  Available on-line:

	http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: