README revision 1.1
Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





This directory contains mpn functions for 64-bit V9 SPARC

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift.  If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
  should not be in slot 4, since that makes the delay insn come from a
  separate bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1  iany     iany     mem      iany  iany     mem   iany  iany     mem
insn2  iaddlog  mem      iany     mem   iaddlog  iany  mem   iaddlog  iany
insn3  mem      iaddlog  iaddlog  fa    fa       fa    fm    fm       fm
insn4  fa/fm    fa/fm    fa/fm    fm    fm       fm    fa    fa       fa

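The issue restrictions listed above can be expressed as a small checker.  The
C sketch below is only a simplified model of the rules as this file states
them, not a description of the real hardware.  The names insn_class,
struct insn and bundle_ok are made up for illustration, and SHIFT is split out
of IANY so that the shift-ordering rule can be written down.  The advice about
taken branches in slot 4 is a performance hint and is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction classes; IANY means SHIFT or IADDLOG. */
enum insn_class { SHIFT, IADDLOG, MEM, FA, FM, BRANCH };

struct insn {
    enum insn_class cls;
    bool sets_cc;   /* integer insn that writes the condition codes */
};

/* Return true if b[0..n-1] may issue in one cycle under the listed rules. */
bool bundle_ok(const struct insn *b, size_t n)
{
    int n_int = 0, n_mem = 0, n_fa = 0, n_fm = 0, n_br = 0;
    bool first_int_sets_cc = false;

    if (n == 0 || n > 4)
        return false;

    for (size_t i = 0; i < n; i++) {
        switch (b[i].cls) {
        case SHIFT:
        case IADDLOG:
            if (i == 3)                          /* integer insn in slot 4 */
                return false;
            if (b[i].cls == SHIFT && n_int > 0)  /* shift must come first */
                return false;
            if (n_int == 1 && first_int_sets_cc) /* cc setter must be second */
                return false;
            if (n_int == 0)
                first_int_sets_cc = b[i].sets_cc;
            if (++n_int > 2)                     /* at most two IANY */
                return false;
            break;
        case MEM:
            if (i == 3 || ++n_mem > 1)           /* one MEM, not in slot 4 */
                return false;
            break;
        case FA:
            if (++n_fa > 1)
                return false;
            break;
        case FM:
            if (++n_fm > 1)
                return false;
            break;
        case BRANCH:
            if (++n_br > 1)
                return false;
            break;
        }
    }
    return true;
}
```

Under this model, each column of the table above is accepted, while a bundle
that places an add before a shift, or an integer instruction in slot 4, is
rejected.
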
The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction, and instead use floating-point operations.

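As an illustration of multiplying integers with floating-point operations,
the C sketch below forms a 32x32->64 product from two double-precision
multiplies.  It assumes IEEE doubles with a 53-bit significand, so that each
32x16-bit partial product (below 2^48) is represented exactly.  The function
name and the choice of a 16-bit split are assumptions made for this example;
this is not the code in this directory, which works on full 64-bit limbs in
assembly.

```c
#include <stdint.h>

/* Hypothetical helper: 32x32->64 multiply using only FP multiplies. */
uint64_t mul32_via_fp(uint32_t u, uint32_t v)
{
    double du  = (double)u;             /* exact: 32 bits fit in 53 */
    double dv0 = (double)(v & 0xffff);  /* low 16 bits of v */
    double dv1 = (double)(v >> 16);     /* high 16 bits of v */

    double p0 = du * dv0;               /* below 2^48, hence exact */
    double p1 = du * dv1;               /* below 2^48, hence exact */

    /* Convert back to integers and add the pieces at their weights. */
    return (uint64_t)p0 + ((uint64_t)p1 << 16);
}
```

Because the multiplies are independent of each other, a fully pipelined FP
unit can start one per cycle, which is the property the integer mulx
instruction lacks.
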
Floating-point add and multiply units are fully pipelined.  The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower.  Branches execute slower, and there may be other new stalls.
On the other hand, integer multiply doesn't stall the entire CPU and has a
much lower latency.  It is still not pipelined, however, and thus useless for
our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 cycles/limb on UltraSPARC-3.  For UltraSPARC-1/2,
  the IEU0 functional unit is saturated with shifts.

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3.  The 4-instruction
  recurrence is the speed limiter.

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3.  On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle.  It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load of the u operand is not scheduled far enough ahead of its use for
  good L2 cache performance.  The UltraSPARC-1/2 L1 cache is direct mapped,
  and since we use temporary stack slots that will conflict with the u and r
  operands, we miss to L2 very often.  The load-use of the std/ldx pairs via
  the stack is perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were scheduled less aggressively.  (2) The ldx of the r
  operand could be split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code.  It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Simpleminded code just calling mpn_mul_1 + mpn_sub_n.  This
  means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
  UltraSPARC-3.  It would be possible to either match the mpn_addmul_1
  performance, or in the worst case use one more instruction group.

* US1/US2 cache conflict resolution.  The direct-mapped L1 data cache of
  US1/US2 is a problem for mul_1, addmul_1 (and a prospective submul_1).  We
  should allocate a larger cache area, and put the stack temp area in a place
  that doesn't cause cache conflicts.
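
To make the cache conflict entry concrete: in a direct-mapped cache, two
different lines evict each other whenever their addresses agree in the index
bits.  The C sketch below tests this for a pair of addresses.  The cache
geometry is passed in as parameters because this file does not state it; the
16 KB size and 32-byte line used in the note after the code are assumptions
about the UltraSPARC-1/2 L1 data cache, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a and b lie in different lines that map to the same slot of a
   direct-mapped cache.  cache_bytes and line_bytes must be powers of two,
   with line_bytes <= cache_bytes. */
bool dm_cache_conflict(uintptr_t a, uintptr_t b,
                       uintptr_t cache_bytes, uintptr_t line_bytes)
{
    /* Bits that select the cache slot: below the cache size, above the
       offset within a line. */
    uintptr_t index_mask = (cache_bytes - 1) & ~(line_bytes - 1);

    bool same_index = ((a ^ b) & index_mask) == 0;
    bool same_line  = (a / line_bytes) == (b / line_bytes);

    return same_index && !same_line;
}
```

With the assumed geometry, dm_cache_conflict(s, u, 16384, 32) is true whenever
a stack temporary at s and an operand limb at u fall in lines that are a
multiple of 16384 bytes apart, which is the situation that choosing the
location of the temp area deliberately is meant to avoid.
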
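
The cycle counts given for mpn_lshift and mpn_add_n in the STATUS list follow
from the shape of their inner loops.  The C sketch below gives reference
semantics for both, assuming 64-bit limbs stored least significant first.
ref_lshift and ref_add_n are hypothetical names, not functions in this
directory, and the assembly may arrange the operations differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift {up,n} left by cnt bits (1 <= cnt <= 63, n >= 1) into {rp,n} and
   return the bits shifted out at the top.  Each result limb costs two
   shifts, so one shift per cycle bounds the loop at 2 cycles/limb. */
uint64_t ref_lshift(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    unsigned tnc = 64 - cnt;
    uint64_t high = up[n - 1];
    uint64_t out = high >> tnc;

    for (size_t i = n - 1; i > 0; i--) {
        uint64_t low = up[i - 1];
        rp[i] = (high << cnt) | (low >> tnc);   /* two shifts per limb */
        high = low;
    }
    rp[0] = high << cnt;
    return out;
}

/* Add {ap,n} and {bp,n} into {rp,n} and return the carry out.  The carry
   into limb i+1 depends on the carry into limb i; that dependency chain is
   the recurrence that limits the speed. */
uint64_t ref_add_n(uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                   size_t n)
{
    uint64_t cy = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t s  = ap[i] + bp[i];
        uint64_t c1 = s < ap[i];    /* carry from a + b */
        uint64_t r  = s + cy;
        uint64_t c2 = r < s;        /* carry from adding the carry-in */
        rp[i] = r;
        cy = c1 | c2;               /* at most one of c1, c2 is set */
    }
    return cy;
}
```

The shift loop has no dependency between iterations, so it is limited by
shift throughput alone, whereas the addition loop is limited by the latency
of the carry chain rather than by issue width.
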