Copyright 2000-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

  mpn/ia64      itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.



CHIP NOTES

The IA-64 ISA groups instructions three at a time into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' where there are WAW or
RAW dependencies, with some notable exceptions.  Such "breaks" are typically
at the end of a bundle, but can be put between operations within some bundle
types too.

The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles, that is 6 instructions, per cycle.  The Itanium 1 allows
4 of these instructions to be integer operations, while the Itanium 2 allows
all 6 to be integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since many
issue slots are taken up by instructions with zero predicates, and since
many extra instructions are needed to set things up.  These features are
clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
  * The getf.sig instruction can only execute in M0.
  * At most four integer instructions/cycle.
  * Nops take up resources like any plain instructions.

Misc pipeline limitations (Itanium 2):
  * The getf.sig instruction can only execute in M0.
  * Nops take up resources like any plain instructions.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (eg. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever the padding might be executed across.  That
macro suppresses any .align if the problem is detected by configure.  Lack
of alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generates the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ..."; the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of the HP assembler had seemed ok.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
  r0:  constant 0
  r1:  global pointer (gp)
  r8:  return value
  r12: stack pointer (sp)
  r13: thread pointer (tp)
Caller-saves:  r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating:  r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l on Itanium 2.

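For reference, the operation being timed can be sketched in portable C.
This is only a model of the arithmetic, assuming 64-bit limbs; the name
ref_add_n is hypothetical and is not a GMP entry point.  Each limb costs
an add, a compare to detect the carry, an add of the carry-in and a second
compare.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not GMP's code: rp[] = up[] + vp[]
   over n limbs, returning the carry out (0 or 1).  */
static uint64_t
ref_add_n (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t s = up[i] + vp[i];   /* add */
      uint64_t c1 = s < up[i];      /* cmp: carry out of the limb add */
      uint64_t r = s + cy;          /* add the carry-in */
      uint64_t c2 = r < s;          /* cmp: carry out of the carry-in add */
      rp[i] = r;
      cy = c1 | c2;                 /* c1 and c2 are never both set */
    }
  return cy;
}
```

The carry from one limb feeds the next, so the loop carries a recurrence
through cy.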
================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working off of 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

        ldf8 [up]
        xma.l
        xma.hu
        stf8 [wrp]

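The xma accumulation mentioned above can be modelled in C as follows.
This is a sketch only: xma_l, xma_hu and ref_mul_1 are hypothetical helper
names, and the code assumes a compiler providing unsigned __int128 (a
GCC/Clang extension).  xma.l and xma.hu yield the low and high 64 bits of
a*b + c, so the high limb of one product can be fed directly into the next
as the addend.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: unsigned __int128 is available (GCC/Clang extension).  */
__extension__ typedef unsigned __int128 u128;

/* Models of xma.l and xma.hu: low and high 64 bits of a*b + c.
   The sum never overflows 128 bits.  */
static uint64_t
xma_l (uint64_t a, uint64_t b, uint64_t c)
{
  return (uint64_t) ((u128) a * b + c);
}

static uint64_t
xma_hu (uint64_t a, uint64_t b, uint64_t c)
{
  return (uint64_t) (((u128) a * b + c) >> 64);
}

/* Hypothetical reference routine: rp[] = up[] * v over n limbs,
   returning the most significant limb of the product.  */
static uint64_t
ref_mul_1 (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
  uint64_t hi = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t lo = xma_l (up[i], v, hi);   /* low limb, previous high added */
      hi = xma_hu (up[i], v, hi);           /* high limb for the next step */
      rp[i] = lo;
    }
  return hi;
}
```

In this single-chain form every xma pair depends on the previous high limb,
which is the dependency a blocked approach would spread over several
independent chains.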
================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

        ld8  [rp]
        ldf8 [up]
        xma.l
        xma.hu
        getf.sig
        add+add+cmp+cmp
        st8  [wrp]

These 10 instructions per limb, at 6 instructions per cycle, can be
scheduled to approach 10/6 = 1.667 c/l, and with the 4 cycle latency of xma,
this means we need at least 3 blocks.  Using ldfp8, which loads two limbs
per instruction (9.5 instructions per limb), we could approach 1.583 c/l.

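The data flow of the instruction list above can be written out in C.  This
is a sketch under assumptions, not the library's code: ref_addmul_1 is a
hypothetical name, limbs are taken to be 64 bits, and unsigned __int128
(a GCC/Clang extension) stands in for the xma.l/xma.hu pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: unsigned __int128 is available (GCC/Clang extension).  */
__extension__ typedef unsigned __int128 u128;

/* Hypothetical reference routine: rp[] += up[] * v over n limbs,
   returning the limb carried out of the top.  */
static uint64_t
ref_addmul_1 (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
  uint64_t hi = 0;   /* high product limb, chained through the multiply */
  uint64_t cy = 0;   /* carry bit, chained through the integer adds */
  for (size_t i = 0; i < n; i++)
    {
      u128 p = (u128) up[i] * v + hi;   /* xma.l and xma.hu */
      uint64_t lo = (uint64_t) p;       /* getf.sig moves this to an int reg */
      hi = (uint64_t) (p >> 64);
      uint64_t s = rp[i] + lo;          /* add */
      uint64_t r = s + cy;              /* add */
      cy = (s < lo) | (r < s);          /* cmp, cmp */
      rp[i] = r;
    }
  return hi + cy;   /* cannot overflow, the full result fits n+1 limbs */
}
```

Two recurrences run through the loop, one in hi and one in cy.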
================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8, with all the alignment headaches that implies.

================================================================
mpn_addmul_N

For best speed, we need to give up using mpn_addmul_2 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use in the other code is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.

        /* First load the 8 values from v */
                ldfp8           v0, v1 = [r35], 16;;
                ldfp8           v2, v3 = [r35], 16;;
                ldfp8           v4, v5 = [r35], 16;;
                ldfp8           v6, v7 = [r35], 16;;

        /* In the inner loop, get a new U limb and store a result limb. */
                mov             lc = un
        Loop:   ldf8            u0 = [r33], 8
                ld8             r0 = [r32]
                xma.l           lp0 = v0, u0, hp0
                xma.hu          hp0 = v0, u0, hp0
                xma.l           lp1 = v1, u0, hp1
                xma.hu          hp1 = v1, u0, hp1
                xma.l           lp2 = v2, u0, hp2
                xma.hu          hp2 = v2, u0, hp2
                xma.l           lp3 = v3, u0, hp3
                xma.hu          hp3 = v3, u0, hp3
                xma.l           lp4 = v4, u0, hp4
                xma.hu          hp4 = v4, u0, hp4
                xma.l           lp5 = v5, u0, hp5
                xma.hu          hp5 = v5, u0, hp5
                xma.l           lp6 = v6, u0, hp6
                xma.hu          hp6 = v6, u0, hp6
                xma.l           lp7 = v7, u0, hp7
                xma.hu          hp7 = v7, u0, hp7
                getf.sig        l0 = lp0
                getf.sig        l1 = lp1
                getf.sig        l2 = lp2
                getf.sig        l3 = lp3
                getf.sig        l4 = lp4
                getf.sig        l5 = lp5
                getf.sig        l6 = lp6
                add+cmp+add     xx, l0, r0
                add+cmp+add     acc0, acc1, l1
                add+cmp+add     acc1, acc2, l2
                add+cmp+add     acc2, acc3, l3
                add+cmp+add     acc3, acc4, l4
                add+cmp+add     acc4, acc5, l5
                add+cmp+add     acc5, acc6, l6
                getf.sig        acc6 = lp7
                st8             [r32] = xx, 8
                br.cloop Loop

        49 insn at max 6 insn/cycle:            8.167 cycles/limb8
        11 memops at max 2 memops/cycle:        5.5 cycles/limb8
        16 fpops at max 2 fpops/cycle:          8 cycles/limb8
        21 intops at max 4 intops/cycle:        5.25 cycles/limb8
        11+21 memops+intops at max 4/cycle      8 cycles/limb8

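The table above can be read as a set of independent resource limits, where
the loop can run no faster than the tightest one allows.  The following C
sketch recomputes that bound from the counts in the table.  The struct and
function names are hypothetical, and treating the limits as a simple
maximum is an assumption of this sketch, not a statement about how the
real scheduler behaves.

```c
#include <stddef.h>

/* One resource limit: how many operations of some kind the loop body
   contains, and how many of that kind can issue per cycle.  */
struct bound
{
  const char *what;
  int ops;
  int per_cycle;
};

/* Counts copied from the table above (one loop iteration).  */
static const struct bound addmul8_bounds[] = {
  { "all insn",      49, 6 },
  { "memops",        11, 2 },
  { "fpops",         16, 2 },
  { "intops",        21, 4 },
  { "memops+intops", 32, 4 },
};

/* Lower bound on cycles per iteration: the largest ops/per_cycle ratio.  */
static double
min_cycles (const struct bound *b, size_t n)
{
  double worst = 0.0;
  for (size_t i = 0; i < n; i++)
    {
      double c = (double) b[i].ops / b[i].per_cycle;
      if (c > worst)
        worst = c;
    }
  return worst;
}
```

With these counts the overall issue rate (49 instructions at 6 per cycle)
is the binding limit, slightly ahead of the fpops and memops+intops limits
at 8 cycles each.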
================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction only accepts an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).

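What a shrp-based loop would compute can be sketched in C.  The names
shrp_model and ref_lshift are hypothetical, and 64-bit limbs are assumed.
Unlike the instruction, the model takes its count at run time, which is
exactly what the real instruction cannot do and why one loop per count
would be needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a double-word shift right: the low 64 bits of the 128-bit
   value hi:lo shifted right by cnt, for 1 <= cnt <= 63.  */
static uint64_t
shrp_model (uint64_t hi, uint64_t lo, unsigned cnt)
{
  return (hi << (64 - cnt)) | (lo >> cnt);
}

/* Hypothetical reference routine: rp[] = up[] << cnt over n >= 1 limbs,
   1 <= cnt <= 63, returning the bits shifted out of the top limb.
   Works from the high end down.  */
static uint64_t
ref_lshift (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
  uint64_t out = up[n - 1] >> (64 - cnt);
  for (size_t i = n - 1; i > 0; i--)
    rp[i] = shrp_model (up[i], up[i - 1], 64 - cnt);  /* one op per limb */
  rp[0] = up[0] << cnt;
  return out;
}
```

Each result limb other than the lowest comes from a single double-word
shift of two adjacent source limbs, with no separate shl, shr and or.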
================================================================
mpn_copyi, mpn_copyd

The current code runs at 0.5 c/l on Itanium 2.  But that is just for L1
cache hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel document 245317-004, 245318-004, 245319-004 October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm
