Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
 
This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*
 
UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift.  If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
  should not be in slot 4, since that makes the delay insn come from a
  separate bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.
 
To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1   iany    iany    mem     iany    iany    mem     iany    iany    mem
insn2   iaddlog mem     iany    mem     iaddlog iany    mem     iaddlog iany
insn3   mem     iaddlog iaddlog fa      fa      fa      fm      fm      fm
insn4   fa/fm   fa/fm   fa/fm   fm      fm      fm      fa      fa      fa
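The issue restrictions can be turned into a small checker, which may help when
hand-scheduling a loop.  The following is a minimal sketch in C and is not part
of the library: the names insn_class and bundle_can_issue are made up for this
illustration, IANY is modelled as "SHIFT or IADDLOG", and the condition-code
ordering rule and the advice about taken branches are left out.

```c
/* Instruction classes, following the Notation section.
   An IANY slot is represented as either SHIFT or IADDLOG. */
enum insn_class { SHIFT, IADDLOG, MEM, FA, FM, BRANCH };

/* Return 1 if the four-instruction bundle b[0..3] satisfies the modelled
   issue restrictions, 0 otherwise. */
int bundle_can_issue(const enum insn_class b[4])
{
    int count[6] = {0};
    int shift_pos = -1;        /* slot of the shift, if any */
    int first_addlog = -1;     /* slot of the first non-shift integer insn */

    for (int i = 0; i < 4; i++) {
        count[b[i]]++;
        if (b[i] == SHIFT)
            shift_pos = i;
        else if (b[i] == IADDLOG && first_addlog < 0)
            first_addlog = i;
    }

    /* At most two integer instructions, at most one of them a shift. */
    if (count[SHIFT] > 1 || count[SHIFT] + count[IADDLOG] > 2)
        return 0;
    /* One each of MEM, FA, FM and branch. */
    if (count[MEM] > 1 || count[FA] > 1 || count[FM] > 1 || count[BRANCH] > 1)
        return 0;
    /* A shift must precede the other integer instruction. */
    if (shift_pos >= 0 && first_addlog >= 0 && shift_pos > first_addlog)
        return 0;
    /* Integer and memory instructions may not occupy slot 4. */
    if (b[3] == SHIFT || b[3] == IADDLOG || b[3] == MEM)
        return 0;
    return 1;
}
```

Under this model all nine bundles in the table are accepted, whichever way
each iany slot is filled, while a bundle that places a shift after an add is
rejected.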
 
The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction, and instead use floating-point operations.
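As a rough illustration of why floating-point multiplies can stand in for
mulx: a double has a 53-bit significand, so the product of a 32-bit integer
and a 16-bit integer (below 2^48) is exact.  The sketch below is not the code
in this directory; the function name mul_32x32_fp is hypothetical and the
16-bit split is just one possible choice.

```c
#include <stdint.h>

/* 32x32->64 multiply using only double-precision products.
   b is split into 16-bit halves so that every product is below 2^48 and
   therefore exactly representable in a double. */
uint64_t mul_32x32_fp(uint32_t a, uint32_t b)
{
    double da   = (double)a;
    double b_lo = (double)(b & 0xFFFFu);
    double b_hi = (double)(b >> 16);

    uint64_t p_lo = (uint64_t)(da * b_lo);   /* a * low half,  < 2^48 */
    uint64_t p_hi = (uint64_t)(da * b_hi);   /* a * high half, < 2^48 */

    /* Recombine on the integer side; the sum equals a*b and fits in 64 bits. */
    return p_lo + (p_hi << 16);
}
```

Both products are independent, so on a pipelined multiplier they can be in
flight at the same time.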
 
Floating-point add and multiply units are fully pipelined.  The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (This
might have been fixed for UltraSPARC-3.)
 
The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower.  Branches execute slower, and there may be other new stalls.
On the other hand, integer multiply doesn't stall the entire CPU and also has
a much lower latency.  It is still not pipelined, however, and thus useless
for our needs.
 
STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 on UltraSPARC-3.  For UltraSPARC-1/2, the IEU0
  functional unit is saturated with shifts.
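Each result limb of a multi-limb shift needs two shifts and one logical
operation, as this plain C reference version shows.  It is a sketch of the
operation only, not of the scheduled assembly, and ref_lshift is a
hypothetical name.  With at most one shift issued per cycle, two shifts per
limb is consistent with the 2.0 cycles/limb figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift {up, n} left by cnt bits (1 <= cnt <= 63), store the result at rp,
   and return the bits shifted out of the most significant limb.
   Works from the high limb downwards, so rp may equal up. */
uint64_t ref_lshift(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);

    unsigned tnc = 64 - cnt;
    uint64_t high = up[n - 1];
    uint64_t out = high >> tnc;

    for (size_t i = n - 1; i > 0; i--) {
        uint64_t low = up[i - 1];
        rp[i] = (high << cnt) | (low >> tnc);   /* two shifts, one or */
        high = low;
    }
    rp[0] = high << cnt;
    return out;
}
```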
 
* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3.  The 4-instruction
  recurrence is the speed limiter.
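The limiting dependency is presumably the carry chain: each limb needs the
carry out of the previous one before its own carry can be formed.  The C
reference below marks which operations depend on the incoming carry.  It is a
sketch of the operation, not a transcription of the four instructions used in
the assembly, and ref_add_n is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* rp[i] = up[i] + vp[i] + carry for i = 0..n-1; returns the final carry. */
uint64_t ref_add_n(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                   size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i], v = vp[i];
        uint64_t s  = u + v;      /* independent of the previous limb */
        uint64_t c1 = s < u;      /* independent of the previous limb */
        uint64_t r  = s + cy;     /* depends on the incoming carry */
        uint64_t c2 = r < s;      /* depends on the incoming carry */
        rp[i] = r;
        cy = c1 | c2;             /* carry into the next limb */
    }
    return cy;
}
```

Only the operations on the path from cy to the next cy form the loop-carried
chain; the loads and the first add can be scheduled ahead of it.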
 
* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3.  On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle.  It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load of the u operand is not scheduled far enough ahead of its use for
  good L2 cache performance.  The UltraSPARC-1/2 L1 cache is direct mapped, and
  since we use temporary stack slots that will conflict with the u and r
  operands, we miss to L2 very often.  The std/ldx pairs that go via the stack
  are perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were scheduled less aggressively.  (2) The ldx of the r
  operand could be split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.
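One way to arrive at 49-bit intermediate values, assumed here for illustration
and not checked against the assembly, is to split one 64-bit operand into two
32-bit halves and the other into four 16-bit pieces.  Each partial product is
then below 2^48, and a sum of two of them is below 2^49, so both are exact in
a double.  The sketch below forms a full 64x64->128 product that way; the
names limb_pair, add_shifted and mul_64x64_fp are hypothetical.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } limb_pair;

/* acc += x << shift, for x < 2^49 and shift in {0, 16, 32, 48, 64, 80}. */
static void add_shifted(limb_pair *acc, uint64_t x, unsigned shift)
{
    uint64_t lo, hi;
    if (shift == 0)      { lo = x;          hi = 0; }
    else if (shift < 64) { lo = x << shift; hi = x >> (64 - shift); }
    else                 { lo = 0;          hi = x << (shift - 64); }

    uint64_t old = acc->lo;
    acc->lo += lo;
    acc->hi += hi + (acc->lo < old);   /* propagate carry from the low limb */
}

/* 64x64->128 multiply where all multiplications are done in doubles. */
limb_pair mul_64x64_fp(uint64_t u, uint64_t v)
{
    double u_half[2] = { (double)(uint32_t)u, (double)(uint32_t)(u >> 32) };
    double v_piece[4];
    for (int j = 0; j < 4; j++)
        v_piece[j] = (double)((v >> (16 * j)) & 0xFFFF);

    limb_pair acc = { 0, 0 };
    /* Column k collects the partial products of weight 2^(16*k). */
    for (int k = 0; k < 6; k++) {
        double sum = 0.0;
        if (k < 4)
            sum += u_half[0] * v_piece[k];       /* 48-bit product */
        if (k >= 2)
            sum += u_half[1] * v_piece[k - 2];   /* sum stays below 2^49 */
        add_shifted(&acc, (uint64_t)sum, 16u * (unsigned)k);
    }
    return acc;
}
```

The integer-side work in add_shifted is the "summing" step; the sketch makes
no attempt to minimise it.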
 
* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code.  It would be possible to shave one or two cycles from it, with some
  labour.
 
* mpn_submul_1: Simpleminded code just calling mpn_mul_1 + mpn_sub_n.  This
  means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
  UltraSPARC-3.  It would be possible to either match the mpn_addmul_1
  performance, or in the worst case use one more instruction group.
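The two-pass structure of such a submul_1 can be sketched in portable C as
follows.  This only illustrates the composition (a multiply pass into a
temporary, then a subtract pass); the ref_* names and the umul helper are
hypothetical, and the temporary is heap-allocated here purely to keep the
example self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* 64x64->128 multiply from 32-bit pieces; returns the low limb. */
static uint64_t umul(uint64_t a, uint64_t b, uint64_t *hi)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)p00;
}

/* tp[] = up[] * v; returns the most significant (carry-out) limb. */
static uint64_t ref_mul_1(uint64_t *tp, const uint64_t *up, size_t n,
                          uint64_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t hi, lo = umul(up[i], v, &hi);
        lo += cy;
        cy = hi + (lo < cy);
        tp[i] = lo;
    }
    return cy;
}

/* rp[] = up[] - vp[]; returns the borrow out (0 or 1). */
static uint64_t ref_sub_n(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
    uint64_t bw = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i], v = vp[i];
        uint64_t d = u - v;
        uint64_t b1 = u < v;
        uint64_t r = d - bw;
        uint64_t b2 = d < bw;
        rp[i] = r;
        bw = b1 | b2;
    }
    return bw;
}

/* rp[] -= up[] * v for n >= 1 limbs; returns the limb borrowed from above. */
uint64_t ref_submul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t *tp = malloc(n * sizeof *tp);
    if (tp == NULL)
        abort();
    uint64_t cy = ref_mul_1(tp, up, n, v);   /* pass 1: multiply */
    cy += ref_sub_n(rp, rp, tp, n);          /* pass 2: subtract */
    free(tp);
    return cy;
}
```

Every limb is touched twice, once per pass, which is why the cost is roughly
the sum of the two routines rather than that of a fused loop.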
 
* US1/US2 cache conflict resolution.  The direct mapped L1 data cache of
  US1/US2 is a problem for mul_1, addmul_1 (and a prospective submul_1).  We
  should allocate a larger cache area, and put the stack temp area in a place
  that doesn't cause cache conflicts.
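In a direct-mapped cache each address maps to exactly one line slot, so two
buffers whose addresses differ by a multiple of the cache size evict each
other.  The helper below is a minimal sketch for checking where a stack
temporary would land relative to an operand.  The function names are
hypothetical, and the cache and line sizes are parameters because this file
does not state them.

```c
#include <stdint.h>

/* Slot an address maps to in a direct-mapped cache.
   cache_bytes and line_bytes are assumed to be powers of two. */
static uint64_t dm_cache_slot(uint64_t addr, uint64_t cache_bytes,
                              uint64_t line_bytes)
{
    return (addr % cache_bytes) / line_bytes;
}

/* Return 1 if a and b lie in different memory lines that map to the same
   slot, meaning that touching one evicts the other; 0 otherwise. */
int dm_cache_conflict(uint64_t a, uint64_t b, uint64_t cache_bytes,
                      uint64_t line_bytes)
{
    if (a / line_bytes == b / line_bytes)
        return 0;   /* same line, no conflict */
    return dm_cache_slot(a, cache_bytes, line_bytes)
        == dm_cache_slot(b, cache_bytes, line_bytes);
}
```

For example, assuming a 16384-byte cache with 32-byte lines (illustrative
values only), addresses 0x10000 and 0x14000 conflict, while 0x10000 and
0x10020 do not.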