Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                   INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions, the sse2
subdirectory has routines using SSE2 instructions.  All P4s have these, the
separate directories are just so configure can omit that code if the
assembler doesn't support it.


STATUS

                                cycles/limb

        mpn_add_n/sub_n            4 normal, 6 in-place

        mpn_mul_1                  4 normal, 6 in-place
        mpn_addmul_1               6
        mpn_submul_1               7

        mpn_mul_basecase           6 cycles/crossproduct (approx)

        mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

        mpn_l/rshift               1.75



The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations.  The movq
instructions do not trigger the same hardware problems.  Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help.  Perhaps future chip steppings will be better.



NOTES

The Pentium-4 pipeline, "Netburst", provides for quite a number of
surprises.  Many traditional x86 instructions run very slowly, requiring
use of alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32 bits
within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry still has a 4 cycle latency.

incl and decl should be avoided, instead use add $1 and sub $1.  Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.
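
The paddq/psrlq carry propagation mentioned above can be modelled in
portable C.  This is only a sketch of the idea, not the code in this
directory: add_n_sketch is a hypothetical name, and the 64-bit variable acc
stands in for an mmx register.  Each 32-bit limb is added into the 64-bit
accumulator, the low half is stored, and a right shift by 32 leaves the
carry (0 or 1) ready for the next limb, so adcl is never needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not GMP's implementation: add two n-limb
   numbers made of 32-bit limbs, keeping the carry in a 64-bit accumulator
   the way paddq/psrlq would in an mmx register.  Returns the carry out. */
uint32_t add_n_sketch(uint32_t *rp, const uint32_t *up,
                      const uint32_t *vp, size_t n)
{
    uint64_t acc = 0;              /* models the 64-bit mmx register */
    size_t i;

    assert(n >= 1);                /* assumed precondition for this sketch */

    for (i = 0; i < n; i++) {
        acc += (uint64_t) up[i];   /* like paddq: 64-bit add, no flags */
        acc += (uint64_t) vp[i];   /* like paddq */
        rp[i] = (uint32_t) acc;    /* like movd: store the low 32 bits */
        acc >>= 32;                /* like psrlq $32: keep only the carry */
    }
    return (uint32_t) acc;         /* carry out of the top limb, 0 or 1 */
}
```

Each limb is read before its result is written, so the sketch also works
when rp is the same pointer as up or vp.  The dependent chain through acc
is what the 4 cycle paddq/psrlq latency above refers to.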

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more).  shldl and
shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.

movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
pxor/por or similar combination at 2 cycles latency can be used instead.
The movq however executes in the float unit, thereby saving MMX execution
resources.  With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining sounds like it does enough to
not require explicit destination prefetching.

xmm registers so far haven't found a use, but not much effort has been
expended.  A configure test for whether the operating system knows
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

	http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966.  Available on-line:

	http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: