Name                    Date          Size
add_n_sub_n.asm         22-Aug-2017   8.2K
addmul_1.asm            22-Aug-2017   13K
addmul_2.asm            22-Aug-2017   17.4K
aors_n.asm              22-Aug-2017   20.5K
aorsorrlsh1_n.asm       22-Aug-2017   1.5K
aorsorrlsh2_n.asm       22-Aug-2017   1.5K
aorsorrlshC_n.asm       22-Aug-2017   9.7K
bdiv_dbm1c.asm          22-Aug-2017   9.1K
cnd_aors_n.asm          22-Aug-2017   6.4K
copyd.asm               22-Aug-2017   3.5K
copyi.asm               22-Aug-2017   3.3K
dive_1.asm              22-Aug-2017   6.6K
divrem_1.asm            22-Aug-2017   10.6K
divrem_2.asm            22-Aug-2017   6.4K
gcd_11.asm              27-Sep-2020   2.6K
gmp-mparam.h            27-Sep-2020   9.8K
hamdist.asm             22-Aug-2017   8.2K
ia64-defs.m4            22-Aug-2017   4.1K
invert_limb.asm         22-Aug-2017   3K
logops_n.asm            22-Aug-2017   7K
lorrshift.asm           22-Aug-2017   6.9K
lshiftc.asm             22-Aug-2017   8.2K
mod_34lsub1.asm         22-Aug-2017   5K
mode1o.asm              22-Aug-2017   11.1K
mul_1.asm               22-Aug-2017   11.3K
mul_2.asm               22-Aug-2017   14.9K
popcount.asm            22-Aug-2017   4.4K
README                  22-Aug-2017   9.2K
rsh1aors_n.asm          22-Aug-2017   11.2K
sec_tabselect.asm       22-Aug-2017   3.3K
sqr_diag_addlsh1.asm    22-Aug-2017   4.6K
submul_1.asm            22-Aug-2017   12.3K

README

Copyright 2000-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



                      IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

	mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.



CHIP NOTES

The IA-64 ISA packs instructions in groups of three into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' when there are WAW or
RAW dependencies, with some notable exceptions.  Such "breaks" are typically
at the end of a bundle, but can be put between operations within some bundle
types too.
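
As a rough illustration of the bundle format, the sketch below packs and
unpacks a bundle in Python.  It assumes the layout described in the Intel
architecture manual: a 5-bit template in the low bits (selecting the unit
types of the three slots and where stops fall), followed by three 41-bit
instruction slots.  The helper names are made up for this example and the
slot contents are treated as opaque integers.

```python
# Sketch of the 128-bit IA-64 bundle layout: a 5-bit template in the low
# bits, followed by three 41-bit instruction slots.
# pack_bundle/unpack_bundle are hypothetical helper names.

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, slot0, slot1, slot2)."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return (template,) + slots
```

The arithmetic 5 + 3*41 = 128 is why exactly three instructions fit in a
bundle.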

The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles per cycle.  The Itanium 1 allows 4 of these instructions
to do integer operations, while the Itanium 2 allows all 6 to be integer
operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since
many issue slots are taken up by instructions with zero predicates, and
since many extra instructions are needed to set things up.  These features
are clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instructions.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instructions.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever the padding might be executed across.  That
macro suppresses any .align if the problem is detected by configure.  Lack
of alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generate the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ..."; the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of the HP assembler had seemed ok.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
   r0: constant 0
   r1: global pointer (gp)
   r8: return value
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working off of 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

	ldf8 [up]
	xma.l
	xma.hu
	stf8  [wrp]
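
To make the blocked idea concrete, here is a small Python model.  It is only
a sketch under assumptions, not the current code: xma_l and xma_hu stand in
for the low and high 64 bits of a*b+c, each block runs its own carry chain
starting from zero, and a separate fix-up pass (an assumption of this
sketch) ripples each block's carry limb into the next block.  All function
names are hypothetical.

```python
# Model of mpn_mul_1 with independent carry chains per block.
# Limbs are 64-bit values stored least significant first.

LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1


def xma_l(a, b, c):
    """Low 64 bits of a*b + c."""
    return (a * b + c) & LIMB_MASK


def xma_hu(a, b, c):
    """High 64 bits of the unsigned a*b + c (the sum always fits 128 bits)."""
    return (a * b + c) >> LIMB_BITS


def mul_1_chain(up, v, carry=0):
    """One serial carry chain: returns (result limbs, carry-out limb)."""
    out = []
    for u in up:
        out.append(xma_l(u, v, carry))
        carry = xma_hu(u, v, carry)
    return out, carry


def mul_1_blocked(up, v, blocks=4):
    """Run one chain per block, then stitch the block carries together."""
    n = len(up)
    step = max(1, -(-n // blocks))  # ceiling division, at least 1
    result = []
    carry_in = 0
    for start in range(0, n, step):
        # Each chain starts from carry 0, so the chains are independent.
        limbs, carry_out = mul_1_chain(up[start:start + step], v)
        # Fix-up pass: ripple the previous block's carry limb into this block.
        for i in range(len(limbs)):
            total = limbs[i] + carry_in
            limbs[i] = total & LIMB_MASK
            carry_in = total >> LIMB_BITS
        # block*v + carry limb < 2**(64*(len+1)), so this stays within a limb.
        carry_in += carry_out
        result.extend(limbs)
    return result, carry_in
```

In this model the chains are evaluated one after another, but nothing in one
chain depends on another until the fix-up, which is the property that would
let several xma recurrences overlap.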

================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

	ld8  [rp]
	ldf8 [up]
	xma.l
	xma.hu
	getf.sig
	add+add+cmp+cmp
	st8  [wrp]

These 10 instructions can be scheduled to approach 1.667 cycles, and with
the 4 cycle latency of xma, this means we need at least 3 blocks.  Using
ldfp8 we could approach 1.583 c/l.
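
The figures above can be reproduced with a few lines of Python.  This is a
sketch under two assumptions: the bound is pure issue width (6 instructions
per cycle), and ldfp8 saves half an instruction per limb because one paired
load replaces two ldf8.  The function names are hypothetical.

```python
# Issue-width arithmetic for the mpn_addmul_1 sketch.
import math
from fractions import Fraction

ISSUE_WIDTH = 6   # instructions per cycle on Itanium 2 (two bundles)
XMA_LATENCY = 4   # cycles, as stated in the text


def cycles_per_limb(insns_per_limb, issue_width=ISSUE_WIDTH):
    """Lower bound on cycles/limb from instruction issue alone."""
    return Fraction(insns_per_limb) / issue_width


def blocks_needed(latency, c_per_l):
    """Independent chains needed to cover a latency at a given c/l."""
    return math.ceil(latency / c_per_l)


plain = cycles_per_limb(10)                 # 10 insns/limb  -> 5/3
paired = cycles_per_limb(Fraction(19, 2))   # 9.5 insns/limb -> 19/12
blocks = blocks_needed(XMA_LATENCY, plain)  # ceil(4 / (5/3)) = ceil(2.4)
```

Here plain is 5/3 (about 1.667), paired is 19/12 (about 1.583), and blocks
is 3, matching the numbers quoted above.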

================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
mpn_addmul_N:

For best speed, we need to give up using mpn_addmul_2 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use in the other routines is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.
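
One plausible reading of the short-recurrence sequence is a carry-select
scheme: both candidate sums and both candidate carry-outs are computed from
the operands alone, and only the final selection depends on the incoming
carry.  The Python model below sketches that idea for a plain limb-wise
addition; it is an illustration under that assumption, not a transcription
of the assembly, and the function name is hypothetical.

```python
# Carry-select model of a multi-limb add (limbs least significant first).

LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1


def add_n_carry_select(up, vp, carry=0):
    """Add two equal-length limb vectors; returns (sum limbs, carry-out)."""
    out = []
    for u, v in zip(up, vp):
        # These four values depend only on u and v, not on the carry-in.
        sum0 = (u + v) & LIMB_MASK       # add, assuming carry-in 0
        sum1 = (u + v + 1) & LIMB_MASK   # add, assuming carry-in 1
        cy0 = sum0 < u                   # cmp: carry-out when carry-in is 0
        cy1 = sum1 <= u                  # cmp: carry-out when carry-in is 1
        # Only this selection sits on the carry recurrence.
        out.append(sum1 if carry else sum0)
        carry = int(cy1 if carry else cy0)
    return out, carry
```

The price is four operations per limb where a serial add, compare, add
chain would use three, which is the trade-off described above.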

/* First load the 8 values from v */
	ldfp8		v0, v1 = [r35], 16;;
	ldfp8		v2, v3 = [r35], 16;;
	ldfp8		v4, v5 = [r35], 16;;
	ldfp8		v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb. */
	mov		lc = un
Loop:	ldf8		u0 = [r33], 8
	ld8		r0 = [r32]
	xma.l		lp0 = v0, u0, hp0
	xma.hu		hp0 = v0, u0, hp0
	xma.l		lp1 = v1, u0, hp1
	xma.hu		hp1 = v1, u0, hp1
	xma.l		lp2 = v2, u0, hp2
	xma.hu		hp2 = v2, u0, hp2
	xma.l		lp3 = v3, u0, hp3
	xma.hu		hp3 = v3, u0, hp3
	xma.l		lp4 = v4, u0, hp4
	xma.hu		hp4 = v4, u0, hp4
	xma.l		lp5 = v5, u0, hp5
	xma.hu		hp5 = v5, u0, hp5
	xma.l		lp6 = v6, u0, hp6
	xma.hu		hp6 = v6, u0, hp6
	xma.l		lp7 = v7, u0, hp7
	xma.hu		hp7 = v7, u0, hp7
	getf.sig	l0 = lp0
	getf.sig	l1 = lp1
	getf.sig	l2 = lp2
	getf.sig	l3 = lp3
	getf.sig	l4 = lp4
	getf.sig	l5 = lp5
	getf.sig	l6 = lp6
	add+cmp+add	xx, l0, r0
	add+cmp+add	acc0, acc1, l1
	add+cmp+add	acc1, acc2, l2
	add+cmp+add	acc2, acc3, l3
	add+cmp+add	acc3, acc4, l4
	add+cmp+add	acc4, acc5, l5
	add+cmp+add	acc5, acc6, l6
	getf.sig	acc6 = lp7
	st8		[r32] = xx, 8
	br.cloop Loop

	49 insn at max 6 insn/cycle:		8.167 cycles/limb8
	11 memops at max 2 memops/cycle:	5.5 cycles/limb8
	16 fpops at max 2 fpops/cycle:		8 cycles/limb8
	21 intops at max 4 intops/cycle:	5.25 cycles/limb8
	11+21 memops+intops at max 4/cycle:	8 cycles/limb8
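
The resource table above is a set of independent lower bounds, and the loop
can go no faster than the largest of them.  A short Python sketch of that
calculation follows; the counts and per-cycle limits are taken from the
table, while the reading of "limb8" as one loop iteration covering 8 v limbs
is an assumption, as are the function names.

```python
# Per-resource cycle bounds for one iteration of the 8-limb loop sketch.
from fractions import Fraction

# (resource, operations per loop iteration, maximum operations per cycle)
RESOURCES = [
    ("all insns", 49, 6),
    ("memops", 11, 2),
    ("fpops", 16, 2),
    ("intops", 21, 4),
    ("memops+intops", 11 + 21, 4),
]


def cycle_bounds(resources):
    """Map each resource to its lower bound in cycles per iteration."""
    return {name: Fraction(count, limit) for name, count, limit in resources}


def binding_resource(resources):
    """Return (name, bound) of the tightest constraint (largest bound)."""
    bounds = cycle_bounds(resources)
    name = max(bounds, key=bounds.get)
    return name, bounds[name]
```

With these inputs the binding constraint is total instruction issue at 49/6
cycles per iteration, just above the 8-cycle bounds from the fp units and
from memops plus intops.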

================================================================
mpn_lshift, mpn_rshift:

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction only accepts an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).
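
The following Python model sketches how a left shift could be built from
shrp alone.  It assumes the documented shrp behaviour (concatenate two
64-bit registers, shift right by an immediate, keep the low 64 bits) and the
usual mpn_lshift convention of returning the bits shifted out of the top
limb.  The function names are hypothetical, and the model says nothing about
scheduling.

```python
# Model of a shrp-based left shift (limbs least significant first).

LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1


def shrp(high, low, count):
    """Low 64 bits of the 128-bit value high:low shifted right by count."""
    return (((high << LIMB_BITS) | low) >> count) & LIMB_MASK


def lshift(up, cnt):
    """Shift limbs left by cnt bits, 1 <= cnt <= 63.

    Returns (result limbs, bits shifted out of the most significant limb).
    """
    if not 1 <= cnt < LIMB_BITS:
        raise ValueError("cnt must be in 1..63")
    out = []
    prev = 0
    for u in up:
        # A left shift by cnt is a right shift of u:prev by 64 - cnt.
        out.append(shrp(u, prev, LIMB_BITS - cnt))
        prev = u
    return out, prev >> (LIMB_BITS - cnt)
```

Because 64 - cnt would have to be an immediate in the real instruction, each
of the 63 possible counts would need its own copy of the loop.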

================================================================
mpn_copyi, mpn_copyd:

The current code runs at 0.5 c/l on Itanium 2.  But that is only for L1
cache hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel documents 245317-004, 245318-004, 245319-004, October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm
