Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library. If not,
see https://www.gnu.org/licenses/.




                   INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions, the sse2
subdirectory has routines using SSE2 instructions. All P4s have both; the
separate directories exist only so that configure can omit that code if
the assembler doesn't support it.


STATUS

                                   cycles/limb

        mpn_add_n/sub_n            4 normal, 6 in-place

        mpn_mul_1                  4 normal, 6 in-place
        mpn_addmul_1               6
        mpn_submul_1               7

        mpn_mul_basecase           6 cycles/crossproduct (approx)

        mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

        mpn_l/rshift               1.75

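For readers unfamiliar with the units: a "crossproduct" is one limb-by-limb
multiply-accumulate step inside mul_basecase. The following is a minimal
portable C sketch of the operations being timed, assuming 32-bit limbs. The
names limb_t, ref_addmul_1 and ref_mul_basecase are hypothetical and exist
only for illustration; this is not the assembly code in this directory.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One crossproduct: 32x32->64 multiply plus two 32-bit addends.
           The sum is at most 2^64 - 1, so it cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = t >> 32;
    }
    return (limb_t)carry;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1], using un*vn crossproducts.
   rp must not overlap up or vp. */
static void ref_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                             const limb_t *vp, size_t vn)
{
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;
    for (size_t j = 0; j < vn; j++)
        rp[j + un] = ref_addmul_1(rp + j, up, un, vp[j]);
}
```

Under these assumptions an un-by-vn multiply performs un*vn crossproducts,
so the table's figure suggests roughly 6*un*vn cycles for mul_basecase.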


The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations. The movq
instructions do not trigger the same hardware problems. Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help. Perhaps future chip steppings will be better.



NOTES

The Pentium-4 pipeline, "Netburst", provides quite a number of surprises.
Many traditional x86 instructions run very slowly, requiring use of
alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg. paddq of 32-bit
values within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry is still a 4 cycle latency.

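The paddq/psrlq approach can be modelled in portable C: each 32-bit limb is
zero-extended into a 64-bit value, the additions are done at 64 bits, and
the carry is recovered with a 32-bit right shift instead of through the
flags. This is only a sketch under stated assumptions: the limbs are 32
bits, a plain uint64_t stands in for an mmx register, and the name
ref_add_n is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry-out (0 or 1). */
static uint32_t ref_add_n(uint32_t *rp, const uint32_t *up,
                          const uint32_t *vp, size_t n)
{
    uint64_t acc = 0;              /* models one 64-bit mmx register */
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)up[i];    /* paddq with a zero-extended limb */
        acc += (uint64_t)vp[i];    /* paddq */
        rp[i] = (uint32_t)acc;     /* movd: store the low 32 bits */
        acc >>= 32;                /* psrlq $32: the carry moves to bit 0 */
    }
    return (uint32_t)acc;
}
```
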
incl and decl should be avoided; use add $1 and sub $1 instead. Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more). shldl and
shrdl seem to have 13 and 15 cycles latency, respectively. Bizarre.

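One way to avoid shldl is to load two adjacent limbs as a single 64-bit
value and shift that, which is what a 64-bit mmx shift such as psllq would
do. The sketch below models the idea in portable C. It is an assumption made
for illustration, not a description of the routines in this directory; the
name ref_lshift is hypothetical, limbs are taken to be 32 bits, and it
requires n >= 1 and 1 <= cnt <= 31.

```c
#include <stdint.h>
#include <stddef.h>

/* rp[0..n-1] = up[0..n-1] << cnt; returns the bits shifted out at the top.
   Works from the high limb down, so rp == up (in-place) is allowed. */
static uint32_t ref_lshift(uint32_t *rp, const uint32_t *up, size_t n,
                           unsigned cnt)
{
    uint32_t out = up[n - 1] >> (32 - cnt);
    for (size_t i = n - 1; i > 0; i--) {
        /* Two adjacent limbs viewed as one 64-bit quantity. */
        uint64_t pair = ((uint64_t)up[i] << 32) | up[i - 1];
        /* 64-bit left shift, then keep the high half. */
        rp[i] = (uint32_t)((pair << cnt) >> 32);
    }
    rp[0] = up[0] << cnt;
    return out;
}
```
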
movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
pxor/por or similar combination at 2 cycles latency can be used instead.
The movq however executes in the float unit, thereby saving MMX execution
resources. With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining sounds like it does enough to
not require explicit destination prefetching.

xmm registers so far haven't found a use, but not much effort has been
expended. A configure test for whether the operating system knows
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

        http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966. Available on-line:

        http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: