Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library. If not, see http://www.gnu.org/licenses/.



This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift. If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle. Taken branches
  should not be in slot 4, since that makes the delay insn come from a
  separate bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1   iany    iany    mem     iany    iany    mem     iany    iany    mem
insn2   iaddlog mem     iany    mem     iaddlog iany    mem     iaddlog iany
insn3   mem     iaddlog iaddlog fa      fa      fa      fm      fm      fm
insn4   fa/fm   fa/fm   fa/fm   fm      fm      fm      fa      fa      fa
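
The issue restrictions can also be checked mechanically. The C sketch below is
an illustration only: insn_kind, struct insn and bundle_ok are hypothetical
names, IANY is modelled as either a shift or an IADDLOG instruction, and the
advisory rule about taken branches in slot 4 is not modelled.

```c
#include <stdbool.h>

/* Hypothetical instruction classes; IANY is K_SHIFT or K_IADDLOG. */
enum insn_kind { K_SHIFT, K_IADDLOG, K_MEM, K_FA, K_FM, K_BRANCH };

struct insn {
    enum insn_kind kind;
    bool sets_cc;               /* true if it sets the condition codes */
};

/* Return true if the n instructions (1 <= n <= 4) may issue in one cycle. */
bool bundle_ok(const struct insn *b, int n)
{
    int n_int = 0, n_mem = 0, n_fa = 0, n_fm = 0, n_branch = 0;
    bool first_int_sets_cc = false;

    if (n < 1 || n > 4)
        return false;

    for (int i = 0; i < n; i++) {
        enum insn_kind k = b[i].kind;
        bool is_int = (k == K_SHIFT || k == K_IADDLOG);

        /* Integer and memory instructions must be insn 1, 2, or 3. */
        if ((is_int || k == K_MEM) && i == 3)
            return false;

        if (is_int) {
            /* A shift must come first; this also rules out two shifts. */
            if (k == K_SHIFT && n_int > 0)
                return false;
            /* A condition-code setter must be the second integer insn. */
            if (n_int == 1 && first_int_sets_cc)
                return false;
            if (n_int == 0)
                first_int_sets_cc = b[i].sets_cc;
            if (++n_int > 2)
                return false;
        } else if (k == K_MEM) {
            if (++n_mem > 1)
                return false;
        } else if (k == K_FA) {
            if (++n_fa > 1)
                return false;
        } else if (k == K_FM) {
            if (++n_fm > 1)
                return false;
        } else {
            if (++n_branch > 1)
                return false;
        }
    }
    return true;
}
```

Under these assumptions, a bundle such as shift, iaddlog, mem, fa is accepted,
while the same bundle with the shift placed after the iaddlog is rejected.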

The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand. When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing. We stay away from that
instruction, and instead use floating-point operations.
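
One way to multiply exactly with floating-point hardware, sketched here in C
rather than SPARC assembly, is to split the operands into pieces small enough
that every partial product fits in the 53-bit significand of an IEEE double.
The 32-bit by 16-bit split below is an assumption, chosen so that each product
stays below 2^48; u128_sketch, add_shifted and mul_64x64_fp are hypothetical
names, and the assembly in this directory may split and sum differently.

```c
#include <stdint.h>

/* Hypothetical 128-bit result type for the sketch. */
typedef struct { uint64_t hi, lo; } u128_sketch;

/* Add p * 2^shift to *acc, for p < 2^48 and shift in 0, 16, ..., 80. */
static void add_shifted(u128_sketch *acc, uint64_t p, unsigned shift)
{
    uint64_t lo, hi;

    if (shift == 0) {
        lo = p;
        hi = 0;
    } else if (shift < 64) {
        lo = p << shift;
        hi = p >> (64 - shift);
    } else {
        lo = 0;
        hi = p << (shift - 64);
    }
    acc->lo += lo;
    acc->hi += hi + (acc->lo < lo);     /* propagate carry out of lo */
}

/* 64x64->128 multiply using only double-precision multiplies. */
u128_sketch mul_64x64_fp(uint64_t u, uint64_t v)
{
    u128_sketch r = {0, 0};

    for (unsigned i = 0; i < 2; i++) {
        /* 32-bit piece of u, exactly representable as a double */
        double ud = (double)(uint32_t)(u >> (32 * i));
        for (unsigned j = 0; j < 4; j++) {
            /* 16-bit piece of v */
            double vd = (double)(uint16_t)(v >> (16 * j));
            /* product < 2^48, so the double result is exact */
            uint64_t p = (uint64_t)(ud * vd);
            add_shifted(&r, p, 32 * i + 16 * j);
        }
    }
    return r;
}
```

The sketch issues eight independent multiplies per limb pair, which is the
kind of work a fully pipelined multiplier can overlap; the integer side is then
left with converting and summing the partial products.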

Floating-point add and multiply units are fully pipelined. The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions. No conditional move can issue 1-5 cycles after a load. (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower. Branches execute slower, and there may be other new stalls.
On the other hand, integer multiply doesn't stall the entire CPU and has a
much lower latency. It is still not pipelined, though, and thus useless for
our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 cycles/limb on UltraSPARC-3. For UltraSPARC-1/2,
  the IEU0 functional unit is saturated with shifts.
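
Each result limb of a multi-limb shift combines two shifted source limbs, so a
straightforward loop needs two shifts per limb. With only one shift issued per
cycle, that is consistent with the 2.0 cycles/limb figure. A C sketch of the
operation follows; lshift_sketch is a hypothetical name, and 64-bit limbs,
n >= 1 and 1 <= cnt <= 63 are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift {up, n} left by cnt bits into {rp, n}; return the bits shifted out. */
uint64_t lshift_sketch(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    unsigned tnc = 64 - cnt;
    uint64_t out = up[n - 1] >> tnc;

    /* Work from the most significant limb down so rp may equal up. */
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = (up[i] << cnt) | (up[i - 1] >> tnc);   /* two shifts, one or */
    rp[0] = up[0] << cnt;
    return out;
}
```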

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3. The 4-instruction
  recurrence is the speed limiter.
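
The recurrence is the carry chain: the carry into each limb depends on the
carry out of the previous one. The C sketch below computes the carry with
comparisons (add_n_sketch is a hypothetical name, 64-bit limbs assumed). It
shows the loop-carried dependence, not the exact four instructions used in the
assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* {rp, n} = {up, n} + {vp, n}; return the carry out (0 or 1). */
uint64_t add_n_sketch(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                      size_t n)
{
    uint64_t cy = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t s = up[i] + vp[i];     /* independent of cy */
        uint64_t c1 = s < up[i];        /* independent of cy */
        uint64_t r = s + cy;            /* depends on previous carry */
        uint64_t c2 = r < s;            /* depends on r */
        rp[i] = r;
        cy = c1 | c2;                   /* feeds the next iteration */
    }
    return cy;
}
```

Only the last three operations in the loop body sit on the carry chain; the
rest can be overlapped with neighbouring iterations.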

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3. On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle. It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load-use of the u operand is not scheduled far enough apart for good L2
  cache performance. The UltraSPARC-1/2 L1 cache is direct mapped, and since
  we use temporary stack slots that will conflict with the u and r operands,
  we miss to L2 very often. The load-use of the std/ldx pairs via the stack is
  perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were less scheduled. (2) The ldx of the r operand could be
  split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code. It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Simple-minded code that just calls mpn_mul_1 followed by
  mpn_sub_n. This means that it runs at 18 cycles/limb on UltraSPARC-1/2 and
  23 cycles/limb on UltraSPARC-3. It would be possible to either match the
  mpn_addmul_1 performance, or in the worst case use one more instruction
  group.
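
The mul_1 + sub_n composition can be sketched in portable C as follows. This
is an illustration, not the code in this directory: the *_sketch names are
hypothetical, the scratch area tp is passed in by the caller, and 32-bit limbs
are used so that a 64-bit integer can hold each product.

```c
#include <stddef.h>
#include <stdint.h>

/* {rp, n} = {up, n} * v; return the most significant (carry-out) limb. */
uint32_t mul_1_sketch(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    uint32_t carry = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v + carry;
        rp[i] = (uint32_t)p;
        carry = (uint32_t)(p >> 32);
    }
    return carry;
}

/* {rp, n} = {up, n} - {vp, n}; return the borrow out (0 or 1). */
uint32_t sub_n_sketch(uint32_t *rp, const uint32_t *up, const uint32_t *vp,
                      size_t n)
{
    uint32_t borrow = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t d = (uint64_t)up[i] - vp[i] - borrow;
        rp[i] = (uint32_t)d;
        borrow = (uint32_t)(d >> 63);   /* top bit set if the subtract wrapped */
    }
    return borrow;
}

/* {rp, n} -= {up, n} * v, using {tp, n} as scratch; return the limb
   borrowed from above rp. */
uint32_t submul_1_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                         uint32_t v, uint32_t *tp)
{
    uint32_t hi = mul_1_sketch(tp, up, n, v);        /* first pass over n limbs */
    uint32_t borrow = sub_n_sketch(rp, rp, tp, n);   /* second pass */
    return hi + borrow;
}
```

The two passes are why the cost is roughly the sum of the mul_1 and sub_n
figures; a fused loop would touch each limb once.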

* US1/US2 cache conflict resolution: The direct mapped L1 data cache of
  US1/US2 is a problem for mul_1, addmul_1 (and a prospective submul_1). We
  should allocate a larger cache area, and put the stack temp area in a place
  that doesn't cause cache conflicts.
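
In a direct-mapped cache, two addresses evict each other when they lie in
different memory lines that map to the same cache line index. The C sketch
below tests for that; cache_conflict is a hypothetical name, and the cache and
line sizes are parameters because this README does not state them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if addresses a and b fall in different memory lines that
   share one line of a direct-mapped cache. */
bool cache_conflict(uintptr_t a, uintptr_t b,
                    uintptr_t cache_bytes, uintptr_t line_bytes)
{
    uintptr_t nlines = cache_bytes / line_bytes;
    uintptr_t la = a / line_bytes;      /* memory line number of a */
    uintptr_t lb = b / line_bytes;      /* memory line number of b */

    return la != lb && (la % nlines) == (lb % nlines);
}
```

As an illustrative assumption, with a 16384-byte cache and 32-byte lines,
addresses 0x10000 and 0x14000 conflict, while 0x10000 and 0x10020 do not. A
check of this kind could guide where the stack temp area is placed relative to
the u and r operands.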
115