Copyright 1997, 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library. If not,
see https://www.gnu.org/licenses/.


This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift. If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle. Taken branches
  should not be in slot 4, since that makes the delay insn come from a
  separate bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1  iany     iany     mem      iany  iany     mem   iany  iany     mem
insn2  iaddlog  mem      iany     mem   iaddlog  iany  mem   iaddlog  iany
insn3  mem      iaddlog  iaddlog  fa    fa       fa    fm    fm       fm
insn4  fa/fm    fa/fm    fa/fm    fm    fm       fm    fa    fa       fa

The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand. When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing. We stay away from that
instruction, and instead use floating-point operations.

Floating-point add and multiply units are fully pipelined. The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions. No conditional move can issue 1-5 cycles after a load. (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower. Branches execute slower, and there may be other new stalls.
However, integer multiply doesn't stall the entire CPU and also has a much
lower latency.
It is still not pipelined, though, and thus useless for our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 cycles/limb on UltraSPARC-3. For UltraSPARC-1/2,
  the IEU0 functional unit is saturated with shifts.

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3. The 4-instruction
  recurrence is the speed limiter.

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3. On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle. It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load-use of the u operand is not scheduled far enough apart for good L2
  cache performance. The UltraSPARC-1/2 L1 cache is direct mapped, and since
  we use temporary stack slots that will conflict with the u and r operands,
  we miss to L2 very often. The load-use of the std/ldx pairs via the stack
  is perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were less scheduled. (2) The ldx of the r operand could be
  split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code. It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Simple-minded code that just calls mpn_mul_1 + mpn_sub_n.
  This means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23
  cycles/limb on UltraSPARC-3. It would be possible to either match the
  mpn_addmul_1 performance, or in the worst case use one more instruction
  group.

* US1/US2 cache conflict resolving. The direct-mapped L1 data cache of US1/US2
  is a problem for mul_1, addmul_1 (and a prospective submul_1). We should
  allocate a larger cache area, and put the stack temp area in a place that
  doesn't cause cache conflicts.
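
ILLUSTRATION: LIMB MULTIPLICATION VIA FLOATING-POINT

The notes above say that mulx is avoided in favour of floating-point
operations, and that mpn_addmul_1 sums intermediate 49-bit operands. The C
sketch below shows one way such a scheme can work. It is an illustration
only, not the code in this directory. The splitting used here (u into two
32-bit halves, v into four 16-bit pieces) is an assumption chosen so that
every product stays below 2^48 and every two-term sum stays below 2^49, both
exact in an IEEE double. The names mul_64x64_via_double and acc_add_shifted
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Add (a << shift) into the 128-bit accumulator hi:lo, for shift in 0..127.
   The caller must ensure the shifted value fits in 128 bits. */
static void acc_add_shifted(uint64_t *hi, uint64_t *lo, uint64_t a,
                            unsigned shift)
{
    uint64_t add_lo, add_hi;

    if (shift == 0) {
        add_lo = a;
        add_hi = 0;
    } else if (shift < 64) {
        add_lo = a << shift;
        add_hi = a >> (64 - shift);
    } else {
        add_lo = 0;
        add_hi = a << (shift - 64);
    }
    *lo += add_lo;
    *hi += add_hi + (*lo < add_lo);   /* carry out of the low word */
}

/* Hypothetical helper: 64x64->128 multiply using only double-precision
   multiplies and adds for the partial products. */
static void mul_64x64_via_double(uint64_t u, uint64_t v,
                                 uint64_t *hi, uint64_t *lo)
{
    /* Assumed split: u into 32-bit halves, v into 16-bit pieces. */
    double u00 = (double)(u & 0xffffffffu);
    double u32 = (double)(u >> 32);
    double v00 = (double)(v & 0xffff);
    double v16 = (double)((v >> 16) & 0xffff);
    double v32 = (double)((v >> 32) & 0xffff);
    double v48 = (double)(v >> 48);

    /* 32-bit x 16-bit products are below 2^48, hence exact in a double. */
    double p00 = u00 * v00, p16 = u00 * v16;
    double p32 = u00 * v32, p48 = u00 * v48;
    double r32 = u32 * v00, r48 = u32 * v16;
    double r64 = u32 * v32, r80 = u32 * v48;

    /* Products of equal weight are summed in floating point; each sum is
       below 2^49 and therefore still exact. */
    double a32 = p32 + r32;
    double a48 = p48 + r48;

    /* Convert back to integers and combine at bit weights 0..80. */
    *hi = 0;
    *lo = 0;
    acc_add_shifted(hi, lo, (uint64_t)p00, 0);
    acc_add_shifted(hi, lo, (uint64_t)p16, 16);
    acc_add_shifted(hi, lo, (uint64_t)a32, 32);
    acc_add_shifted(hi, lo, (uint64_t)a48, 48);
    acc_add_shifted(hi, lo, (uint64_t)r64, 64);
    acc_add_shifted(hi, lo, (uint64_t)r80, 80);
}
```

Calling mul_64x64_via_double(u, v, &hi, &lo) leaves the high and low words of
u*v in hi and lo. In this sketch the eight multiplies are independent of one
another, which is the property that lets a fully pipelined FM unit overlap
them. The final integer summation corresponds to the step that the
mpn_addmul_1 note describes as hard to shorten.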