dnl  SPARC v9 64-bit mpn_mul_1 -- Multiply a limb vector with a limb and store
dnl  the result in a second limb vector.

dnl  Copyright 1998, 2000-2003 Free Software Foundation, Inc.

dnl  This file is part of the GNU MP Library.
dnl
dnl  The GNU MP Library is free software; you can redistribute it and/or modify
dnl  it under the terms of either:
dnl
dnl    * the GNU Lesser General Public License as published by the Free
dnl      Software Foundation; either version 3 of the License, or (at your
dnl      option) any later version.
dnl
dnl  or
dnl
dnl    * the GNU General Public License as published by the Free Software
dnl      Foundation; either version 2 of the License, or (at your option) any
dnl      later version.
dnl
dnl  or both in parallel, as here.
dnl
dnl  The GNU MP Library is distributed in the hope that it will be useful, but
dnl  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnl  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
dnl  for more details.
dnl
dnl  You should have received copies of the GNU General Public License and the
dnl  GNU Lesser General Public License along with the GNU MP Library.  If not,
dnl  see https://www.gnu.org/licenses/.

include(`../config.m4')

C		    cycles/limb
C UltraSPARC 1&2:      14
C UltraSPARC 3:	       18.5

C Algorithm: We use eight floating-point multiplies per limb product, with the
C invariant v operand split into four 16-bit pieces, and the s1 operand split
C into 32-bit pieces.  We sum pairs of 48-bit partial products using
C floating-point add, then convert the four 49-bit product-sums and transfer
C them to the integer unit.

C Possible optimizations:
C   1. Align the stack area where we transfer the four 49-bit product-sums
C      to a 32-byte boundary.  That would minimize cache collisions.
C      (UltraSPARC-1/2 use a direct-mapped cache.)  (Perhaps even better would
C      be to align the area to map to the area immediately before s1?)
C   2. Sum the 4 49-bit quantities using 32-bit operations, as in the
C      development mpn_addmul_2.  This would save many integer instructions.
C   3. Unrolling.  Questionable if it is worth the code expansion, given that
C      it could only save 1 cycle/limb.
C   4. Specialize for particular v values.  If its upper 32 bits are zero, we
C      could save many operations, in the FPU (fmuld), but more so in the IEU
C      since we'll be summing 48-bit quantities, which might be simpler.
C   5. Ideally, we should schedule the f2/f3 and f4/f5 RAW further apart, and
C      the i00,i16,i32,i48 RAW less apart.  The latter apart-scheduling should
C      not be greater than needed for L2 cache latency, and also not so great
C      that i16 needs to be copied.
C   6. Avoid performing mem+fa+fm in the same cycle, at least not when we want
C      to get high IEU bandwidth.  (12 of the 14 cycles will be free for 2 IEU
C      ops.)

C Instruction classification (as per UltraSPARC-1/2 functional units):
C    8 FM
C   10 FA
C   11 MEM
C    9 ISHIFT + 10? IADDLOG
C    1 BRANCH
C   49 insns in total (plus three mov insns that should be optimized out)

C The loop executes 53 instructions in 14 cycles on UltraSPARC-1/2, i.e. we
C sustain 3.79 instructions/cycle.

C INPUT PARAMETERS
C rp	i0
C up	i1
C n	i2
C v	i3

ASM_START()
	REGISTER(%g2,#scratch)
	REGISTER(%g3,#scratch)

define(`p00',`%f8')  define(`p16',`%f10') define(`p32',`%f12') define(`p48',`%f14')
define(`r32',`%f16') define(`r48',`%f18') define(`r64',`%f20') define(`r80',`%f22')
define(`v00',`%f24') define(`v16',`%f26') define(`v32',`%f28') define(`v48',`%f30')
define(`u00',`%f32') define(`u32',`%f34')
define(`a00',`%f36') define(`a16',`%f38') define(`a32',`%f40') define(`a48',`%f42')
define(`cy',`%g1')
define(`rlimb',`%g3')
define(`i00',`%l0') define(`i16',`%l1') define(`i32',`%l2') define(`i48',`%l3')
define(`xffffffff',`%l7')
define(`xffff',`%o0')

PROLOGUE(mpn_mul_1)

C Initialization.  (1) Split v operand into four 16-bit chunks and store them
C as IEEE double in fp registers.  (2) Clear upper 32 bits of fp register pairs
C f2 and f4.  (3) Store masks in registers aliased to `xffff' and `xffffffff'.

	save	%sp, -256, %sp
	mov	-1, %g4
	srlx	%g4, 48, xffff		C store mask in register `xffff'
	and	%i3, xffff, %g2
	stx	%g2, [%sp+2223+0]
	srlx	%i3, 16, %g3
	and	%g3, xffff, %g3
	stx	%g3, [%sp+2223+8]
	srlx	%i3, 32, %g2
	and	%g2, xffff, %g2
	stx	%g2, [%sp+2223+16]
	srlx	%i3, 48, %g3
	stx	%g3, [%sp+2223+24]
	srlx	%g4, 32, xffffffff	C store mask in register `xffffffff'

	sllx	%i2, 3, %i2
	mov	0, cy			C clear cy
	add	%i0, %i2, %i0
	add	%i1, %i2, %i1
	neg	%i2
	add	%i1, 4, %i5
	add	%i0, -32, %i4
	add	%i0, -16, %i0

	ldd	[%sp+2223+0], v00
	ldd	[%sp+2223+8], v16
	ldd	[%sp+2223+16], v32
	ldd	[%sp+2223+24], v48
	ld	[%sp+2223+0], %f2	C zero f2
	ld	[%sp+2223+0], %f4	C zero f4
	ld	[%i5+%i2], %f3		C read low 32 bits of up[i]
	ld	[%i1+%i2], %f5		C read high 32 bits of up[i]
	fxtod	v00, v00
	fxtod	v16, v16
	fxtod	v32, v32
	fxtod	v48, v48

C Start real work.  (We sneakily read f3 and f5 above...)
C The software pipeline is very deep, requiring 4 feed-in stages.

	fxtod	%f2, u00
	fxtod	%f4, u32
	fmuld	u00, v00, a00
	fmuld	u00, v16, a16
	fmuld	u00, v32, p32
	fmuld	u32, v00, r32
	fmuld	u00, v48, p48
	addcc	%i2, 8, %i2
	bnz,pt	%xcc, .L_two_or_more
	fmuld	u32, v16, r48

.L_one:
	fmuld	u32, v32, r64	C FIXME not urgent
	faddd	p32, r32, a32
	fdtox	a00, a00
	faddd	p48, r48, a48
	fmuld	u32, v48, r80	C FIXME not urgent
	fdtox	a16, a16
	fdtox	a32, a32
	fdtox	a48, a48
	std	a00, [%sp+2223+0]
	std	a16, [%sp+2223+8]
	std	a32, [%sp+2223+16]
	std	a48, [%sp+2223+24]
	add	%i2, 8, %i2

	fdtox	r64, a00
	fdtox	r80, a16
	ldx	[%sp+2223+0], i00
	ldx	[%sp+2223+8], i16
	ldx	[%sp+2223+16], i32
	ldx	[%sp+2223+24], i48
	std	a00, [%sp+2223+0]
	std	a16, [%sp+2223+8]
	add	%i2, 8, %i2

	mov	i00, %g5		C i00+ now in g5
	ldx	[%sp+2223+0], i00
	srlx	i16, 48, %l4		C (i16 >> 48)
	mov	i16, %g2
	ldx	[%sp+2223+8], i16
	srlx	i48, 16, %l5		C (i48 >> 16)
	mov	i32, %g4		C i32+ now in g4
	sllx	i48, 32, %l6		C (i48 << 32)
	srlx	%g4, 32, %o3		C (i32 >> 32)
	add	%l5, %l4, %o1		C hi64- in %o1
	std	a00, [%sp+2223+0]
	sllx	%g4, 16, %o2		C (i32 << 16)
	add	%o3, %o1, %o1		C hi64 in %o1   1st ASSIGNMENT
	std	a16, [%sp+2223+8]
	sllx	%o1, 48, %o3		C (hi64 << 48)
	add	%g2, %o2, %o2		C mi64- in %o2
	add	%l6, %o2, %o2		C mi64- in %o2
	sub	%o2, %o3, %o2		C mi64 in %o2   1st ASSIGNMENT
	add	cy, %g5, %o4		C x = prev(i00) + cy
	b	.L_out_1
	add	%i2, 8, %i2

.L_two_or_more:
	ld	[%i5+%i2], %f3		C read low 32 bits of up[i]
	fmuld	u32, v32, r64	C FIXME not urgent
	faddd	p32, r32, a32
	ld	[%i1+%i2], %f5		C read high 32 bits of up[i]
	fdtox	a00, a00
	faddd	p48, r48, a48
	fmuld	u32, v48, r80	C FIXME not urgent
	fdtox	a16, a16
	fdtox	a32, a32
	fxtod	%f2, u00
	fxtod	%f4, u32
	fdtox	a48, a48
	std	a00, [%sp+2223+0]
	fmuld	u00, v00, p00
	std	a16, [%sp+2223+8]
	fmuld	u00, v16, p16
	std	a32, [%sp+2223+16]
	fmuld	u00, v32, p32
	std	a48, [%sp+2223+24]
	faddd	p00, r64, a00
	fmuld	u32, v00, r32
	faddd	p16, r80, a16
	fmuld	u00, v48, p48
	addcc	%i2, 8, %i2
	bnz,pt	%xcc, .L_three_or_more
	fmuld	u32, v16, r48

.L_two:
	fmuld	u32, v32, r64	C FIXME not urgent
	faddd	p32, r32, a32
	fdtox	a00, a00
	faddd	p48, r48, a48
	fmuld	u32, v48, r80	C FIXME not urgent
	fdtox	a16, a16
	ldx	[%sp+2223+0], i00
	fdtox	a32, a32
	ldx	[%sp+2223+8], i16
	ldx	[%sp+2223+16], i32
	ldx	[%sp+2223+24], i48
	fdtox	a48, a48
	std	a00, [%sp+2223+0]
	std	a16, [%sp+2223+8]
	std	a32, [%sp+2223+16]
	std	a48, [%sp+2223+24]
	add	%i2, 8, %i2

	fdtox	r64, a00
	mov	i00, %g5		C i00+ now in g5
	fdtox	r80, a16
	ldx	[%sp+2223+0], i00
	srlx	i16, 48, %l4		C (i16 >> 48)
	mov	i16, %g2
	ldx	[%sp+2223+8], i16
	srlx	i48, 16, %l5		C (i48 >> 16)
	mov	i32, %g4		C i32+ now in g4
	ldx	[%sp+2223+16], i32
	sllx	i48, 32, %l6		C (i48 << 32)
	ldx	[%sp+2223+24], i48
	srlx	%g4, 32, %o3		C (i32 >> 32)
	add	%l5, %l4, %o1		C hi64- in %o1
	std	a00, [%sp+2223+0]
	sllx	%g4, 16, %o2		C (i32 << 16)
	add	%o3, %o1, %o1		C hi64 in %o1   1st ASSIGNMENT
	std	a16, [%sp+2223+8]
	sllx	%o1, 48, %o3		C (hi64 << 48)
	add	%g2, %o2, %o2		C mi64- in %o2
	add	%l6, %o2, %o2		C mi64- in %o2
	sub	%o2, %o3, %o2		C mi64 in %o2   1st ASSIGNMENT
	add	cy, %g5, %o4		C x = prev(i00) + cy
	b	.L_out_2
	add	%i2, 8, %i2

.L_three_or_more:
	ld	[%i5+%i2], %f3		C read low 32 bits of up[i]
	fmuld	u32, v32, r64	C FIXME not urgent
	faddd	p32, r32, a32
	ld	[%i1+%i2], %f5		C read high 32 bits of up[i]
	fdtox	a00, a00
	faddd	p48, r48, a48
	fmuld	u32, v48, r80	C FIXME not urgent
	fdtox	a16, a16
	ldx	[%sp+2223+0], i00
	fdtox	a32, a32
	ldx	[%sp+2223+8], i16
	fxtod	%f2, u00
	ldx	[%sp+2223+16], i32
	fxtod	%f4, u32
	ldx	[%sp+2223+24], i48
	fdtox	a48, a48
	std	a00, [%sp+2223+0]
	fmuld	u00, v00, p00
	std	a16, [%sp+2223+8]
	fmuld	u00, v16, p16
	std	a32, [%sp+2223+16]
	fmuld	u00, v32, p32
	std	a48, [%sp+2223+24]
	faddd	p00, r64, a00
	fmuld	u32, v00, r32
	faddd	p16, r80, a16
	fmuld	u00, v48, p48
	addcc	%i2, 8, %i2
	bnz,pt	%xcc, .L_four_or_more
	fmuld	u32, v16, r48

.L_three:
	fmuld	u32, v32, r64	C FIXME not urgent
	faddd	p32, r32, a32
	fdtox	a00, a00
	faddd	p48, r48, a48
	mov	i00, %g5		C i00+ now in g5
	fmuld	u32, v48, r80	C FIXME not urgent
	fdtox	a16, a16
	ldx	[%sp+2223+0], i00
	fdtox	a32, a32
	srlx	i16, 48, %l4		C (i16 >> 48)
	mov	i16, %g2
	ldx	[%sp+2223+8], i16
	srlx	i48, 16, %l5		C (i48 >> 16)
	mov	i32, %g4		C i32+ now in g4
	ldx	[%sp+2223+16], i32
	sllx	i48, 32, %l6		C (i48 << 32)
	ldx	[%sp+2223+24], i48
	fdtox	a48, a48
	srlx	%g4, 32, %o3		C (i32 >> 32)
	add	%l5, %l4, %o1		C hi64- in %o1
	std	a00, [%sp+2223+0]
	sllx	%g4, 16, %o2		C (i32 << 16)
	add	%o3, %o1, %o1		C hi64 in %o1   1st ASSIGNMENT
	std	a16, [%sp+2223+8]
	sllx	%o1, 48, %o3		C (hi64 << 48)
	add	%g2, %o2, %o2		C mi64- in %o2
	std	a32, [%sp+2223+16]
	add	%l6, %o2, %o2		C mi64- in %o2
	std	a48, [%sp+2223+24]
	sub	%o2, %o3, %o2		C mi64 in %o2   1st ASSIGNMENT
	add	cy, %g5, %o4		C x = prev(i00) + cy
	b	.L_out_3
	add	%i2, 8, %i2

.L_four_or_more:
	ld	[%i5+%i2], %f3		C read low 32 bits of up[i]
	fmuld	u32, v32, r64	C FIXME not urgent
	faddd	p32, r32, a32
	ld	[%i1+%i2], %f5		C read high 32 bits of up[i]
	fdtox	a00, a00
	faddd	p48, r48, a48
	mov	i00, %g5		C i00+ now in g5
	fmuld	u32, v48, r80	C FIXME not urgent
	fdtox	a16, a16
	ldx	[%sp+2223+0], i00
	fdtox	a32, a32
	srlx	i16, 48, %l4		C (i16 >> 48)
	mov	i16, %g2
	ldx	[%sp+2223+8], i16
	fxtod	%f2, u00
	srlx	i48, 16, %l5		C (i48 >> 16)
	mov	i32, %g4		C i32+ now in g4
	ldx	[%sp+2223+16], i32
	fxtod	%f4, u32
	sllx	i48, 32, %l6		C (i48 << 32)
	ldx	[%sp+2223+24], i48
	fdtox	a48, a48
	srlx	%g4, 32, %o3		C (i32 >> 32)
	add	%l5, %l4, %o1		C hi64- in %o1
	std	a00, [%sp+2223+0]
	fmuld	u00, v00, p00
	sllx	%g4, 16, %o2		C (i32 << 16)
	add	%o3, %o1, %o1		C hi64 in %o1   1st ASSIGNMENT
	std	a16, [%sp+2223+8]
	fmuld	u00, v16, p16
	sllx	%o1, 48, %o3		C (hi64 << 48)
	add	%g2, %o2, %o2		C mi64- in %o2
	std	a32, [%sp+2223+16]
	fmuld	u00, v32, p32
	add	%l6, %o2, %o2		C mi64- in %o2
	std	a48, [%sp+2223+24]
	faddd	p00, r64, a00
	fmuld	u32, v00, r32
	sub	%o2, %o3, %o2		C mi64 in %o2   1st ASSIGNMENT
	faddd	p16, r80, a16
	fmuld	u00, v48, p48
	add	cy, %g5, %o4		C x = prev(i00) + cy
	addcc	%i2, 8, %i2
	bnz,pt	%xcc, .Loop
	fmuld	u32, v16, r48

.L_four:
	b,a	.L_out_4

C BEGIN MAIN LOOP
	.align	16
.Loop:
C 00
	srlx	%o4, 16, %o5		C (x >> 16)
	ld	[%i5+%i2], %f3		C read low 32 bits of up[i]
	fmuld	u32, v32, r64	C FIXME not urgent
	faddd	p32, r32, a32
C 01
	add	%o5, %o2, %o2		C mi64 in %o2   2nd ASSIGNMENT
	and	%o4, xffff, %o5		C (x & 0xffff)
	ld	[%i1+%i2], %f5		C read high 32 bits of up[i]
	fdtox	a00, a00
C 02
	faddd	p48, r48, a48
C 03
	srlx	%o2, 48, %o7		C (mi64 >> 48)
	mov	i00, %g5		C i00+ now in g5
	fmuld	u32, v48, r80	C FIXME not urgent
	fdtox	a16, a16
C 04
	sllx	%o2, 16, %i3		C (mi64 << 16)
	add	%o7, %o1, cy		C new cy
	ldx	[%sp+2223+0], i00
	fdtox	a32, a32
C 05
	srlx	i16, 48, %l4		C (i16 >> 48)
	mov	i16, %g2
	ldx	[%sp+2223+8], i16
	fxtod	%f2, u00
C 06
	srlx	i48, 16, %l5		C (i48 >> 16)
	mov	i32, %g4		C i32+ now in g4
	ldx	[%sp+2223+16], i32
	fxtod	%f4, u32
C 07
	sllx	i48, 32, %l6		C (i48 << 32)
	or	%i3, %o5, %o5
	ldx	[%sp+2223+24], i48
	fdtox	a48, a48
C 08
	srlx	%g4, 32, %o3		C (i32 >> 32)
	add	%l5, %l4, %o1		C hi64- in %o1
	std	a00, [%sp+2223+0]
	fmuld	u00, v00, p00
C 09
	sllx	%g4, 16, %o2		C (i32 << 16)
	add	%o3, %o1, %o1		C hi64 in %o1   1st ASSIGNMENT
	std	a16, [%sp+2223+8]
	fmuld	u00, v16, p16
C 10
	sllx	%o1, 48, %o3		C (hi64 << 48)
	add	%g2, %o2, %o2		C mi64- in %o2
	std	a32, [%sp+2223+16]
	fmuld	u00, v32, p32
C 11
	add	%l6, %o2, %o2		C mi64- in %o2
	std	a48, [%sp+2223+24]
	faddd	p00, r64, a00
	fmuld	u32, v00, r32
C 12
	sub	%o2, %o3, %o2		C mi64 in %o2   1st ASSIGNMENT
	stx	%o5, [%i4+%i2]
	faddd	p16, r80, a16
	fmuld	u00, v48, p48
C 13
	add	cy, %g5, %o4		C x = prev(i00) + cy
	addcc	%i2, 8, %i2
	bnz,pt	%xcc, .Loop
	fmuld	u32, v16, r48
C END MAIN LOOP

.L_out_4:
	srlx	%o4, 16, %o5		C (x >> 16)
	fmuld	u32, v32, r64	C FIXME not urgent
	faddd	p32, r32, a32
	add	%o5, %o2, %o2		C mi64 in %o2   2nd ASSIGNMENT
	and	%o4, xffff, %o5		C (x & 0xffff)
	fdtox	a00, a00
	faddd	p48, r48, a48
	srlx	%o2, 48, %o7		C (mi64 >> 48)
	mov	i00, %g5		C i00+ now in g5
	fmuld	u32, v48, r80	C FIXME not urgent
	fdtox	a16, a16
	sllx	%o2, 16, %i3		C (mi64 << 16)
	add	%o7, %o1, cy		C new cy
	ldx	[%sp+2223+0], i00
	fdtox	a32, a32
	srlx	i16, 48, %l4		C (i16 >> 48)
	mov	i16, %g2
	ldx	[%sp+2223+8], i16
	srlx	i48, 16, %l5		C (i48 >> 16)
	mov	i32, %g4		C i32+ now in g4
	ldx	[%sp+2223+16], i32
	sllx	i48, 32, %l6		C (i48 << 32)
	or	%i3, %o5, %o5
	ldx	[%sp+2223+24], i48
	fdtox	a48, a48
	srlx	%g4, 32, %o3		C (i32 >> 32)
	add	%l5, %l4, %o1		C hi64- in %o1
	std	a00, [%sp+2223+0]
	sllx	%g4, 16, %o2		C (i32 << 16)
	add	%o3, %o1, %o1		C hi64 in %o1   1st ASSIGNMENT
	std	a16, [%sp+2223+8]
	sllx	%o1, 48, %o3		C (hi64 << 48)
	add	%g2, %o2, %o2		C mi64- in %o2
	std	a32, [%sp+2223+16]
	add	%l6, %o2, %o2		C mi64- in %o2
	std	a48, [%sp+2223+24]
	sub	%o2, %o3, %o2		C mi64 in %o2   1st ASSIGNMENT
	stx	%o5, [%i4+%i2]
	add	cy, %g5, %o4		C x = prev(i00) + cy
	add	%i2, 8, %i2
.L_out_3:
	srlx	%o4, 16, %o5		C (x >> 16)
	add	%o5, %o2, %o2		C mi64 in %o2   2nd ASSIGNMENT
	and	%o4, xffff, %o5		C (x & 0xffff)
	fdtox	r64, a00
	srlx	%o2, 48, %o7		C (mi64 >> 48)
	mov	i00, %g5		C i00+ now in g5
	fdtox	r80, a16
	sllx	%o2, 16, %i3		C (mi64 << 16)
	add	%o7, %o1, cy		C new cy
	ldx	[%sp+2223+0], i00
	srlx	i16, 48, %l4		C (i16 >> 48)
	mov	i16, %g2
	ldx	[%sp+2223+8], i16
	srlx	i48, 16, %l5		C (i48 >> 16)
	mov	i32, %g4		C i32+ now in g4
	ldx	[%sp+2223+16], i32
	sllx	i48, 32, %l6		C (i48 << 32)
	or	%i3, %o5, %o5
	ldx	[%sp+2223+24], i48
	srlx	%g4, 32, %o3		C (i32 >> 32)
	add	%l5, %l4, %o1		C hi64- in %o1
	std	a00, [%sp+2223+0]
	sllx	%g4, 16, %o2		C (i32 << 16)
	add	%o3, %o1, %o1		C hi64 in %o1   1st ASSIGNMENT
	std	a16, [%sp+2223+8]
	sllx	%o1, 48, %o3		C (hi64 << 48)
	add	%g2, %o2, %o2		C mi64- in %o2
	add	%l6, %o2, %o2		C mi64- in %o2
	sub	%o2, %o3, %o2		C mi64 in %o2   1st ASSIGNMENT
	stx	%o5, [%i4+%i2]
	add	cy, %g5, %o4		C x = prev(i00) + cy
	add	%i2, 8, %i2
.L_out_2:
	srlx	%o4, 16, %o5		C (x >> 16)
	add	%o5, %o2, %o2		C mi64 in %o2   2nd ASSIGNMENT
	and	%o4, xffff, %o5		C (x & 0xffff)
	srlx	%o2, 48, %o7		C (mi64 >> 48)
	mov	i00, %g5		C i00+ now in g5
	sllx	%o2, 16, %i3		C (mi64 << 16)
	add	%o7, %o1, cy		C new cy
	ldx	[%sp+2223+0], i00
	srlx	i16, 48, %l4		C (i16 >> 48)
	mov	i16, %g2
	ldx	[%sp+2223+8], i16
	srlx	i48, 16, %l5		C (i48 >> 16)
	mov	i32, %g4		C i32+ now in g4
	sllx	i48, 32, %l6		C (i48 << 32)
	or	%i3, %o5, %o5
	srlx	%g4, 32, %o3		C (i32 >> 32)
	add	%l5, %l4, %o1		C hi64- in %o1
	sllx	%g4, 16, %o2		C (i32 << 16)
	add	%o3, %o1, %o1		C hi64 in %o1   1st ASSIGNMENT
	sllx	%o1, 48, %o3		C (hi64 << 48)
	add	%g2, %o2, %o2		C mi64- in %o2
	add	%l6, %o2, %o2		C mi64- in %o2
	sub	%o2, %o3, %o2		C mi64 in %o2   1st ASSIGNMENT
	stx	%o5, [%i4+%i2]
	add	cy, %g5, %o4		C x = prev(i00) + cy
	add	%i2, 8, %i2
.L_out_1:
	srlx	%o4, 16, %o5		C (x >> 16)
	add	%o5, %o2, %o2		C mi64 in %o2   2nd ASSIGNMENT
	and	%o4, xffff, %o5		C (x & 0xffff)
	srlx	%o2, 48, %o7		C (mi64 >> 48)
	sllx	%o2, 16, %i3		C (mi64 << 16)
	add	%o7, %o1, cy		C new cy
	or	%i3, %o5, %o5
	stx	%o5, [%i4+%i2]

	sllx	i00, 0, %g2
	add	%g2, cy, cy
	sllx	i16, 16, %g3
	add	%g3, cy, cy

	return	%i7+8
	mov	cy, %o0
EPILOGUE(mpn_mul_1)