Copyright 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.


The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)

The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.

Several issues reduce speed on Cray systems.  For systems with cfp
floating point, the main obstacle is forming 128-bit products.  For
IEEE systems, adding, and in particular computing the carry, is the
main issue.  There is no vectorizing unsigned-less-than instruction,
and the sequence that implements that operation is very long.
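
To illustrate the carry problem, the sketch below derives the carry out
of a 64-bit addition from the operand and sum bits using only logical
operations and one shift, with no unsigned comparison.  This is a
generic, well-known identity, not the sequence used by the code in this
directory; the name add_with_carry_out is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add a and b, store the 64-bit sum in *sum, and
   return the carry out (0 or 1) without an unsigned comparison.
   A carry leaves bit 63 when both top bits are set, or when at least
   one of them is set and the top bit of the sum is clear.  */
uint64_t
add_with_carry_out (uint64_t a, uint64_t b, uint64_t *sum)
{
  assert (sum != NULL);
  uint64_t s = a + b;           /* wraps modulo 2^64 */
  *sum = s;
  return ((a & b) | ((a | b) & ~s)) >> 63;
}
```

Whether this is cheaper than a compare-based sequence on a given Cray
model would have to be measured.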

Shifting is the only operation that is simple to make fast.  All Cray
systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be really useful.
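
A minimal C sketch of why shifting vectorizes well: each result limb
depends only on two adjacent source limbs, so there is no loop-carried
dependency.  It assumes that the double-shift instructions above
combine adjacent vector elements in this way; the name lshift_sketch
and its interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift the n-limb number at up left by cnt bits (1 <= cnt <= 63),
   store the result at rp and return the bits shifted out of the top
   limb.  The sketch assumes rp and up do not partially overlap.  */
uint64_t
lshift_sketch (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
  assert (n >= 1 && cnt >= 1 && cnt <= 63);
  unsigned tnc = 64 - cnt;
  uint64_t retval = up[n - 1] >> tnc;

  /* Every iteration is independent of the others.  */
  for (size_t i = n - 1; i >= 1; i--)
    rp[i] = (up[i] << cnt) | (up[i - 1] >> tnc);
  rp[0] = up[0] << cnt;
  return retval;
}
```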

For best speed on cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum.  Depending on the
size (vn) of the smaller of the two operands (V), we should split U and V
into chunks of different sizes:

U split into 2 32-bit parts
V split according to the table:
parts                   4       5       6       7       8
bits/part               16      13      11      10      8
max allowed vn          1       8       32      64      256
number of multiplies    8       10      12      14      16
peak cycles/limb        4       5       6       7       8

U split into 3 22-bit parts
V split according to the table:
parts                   3       4       5
bits/part               22      16      13
max allowed vn          16      1024    8192
number of multiplies    9       12      15
peak cycles/limb        4.5     6       7.5

U split into 4 16-bit parts
V split according to the table:
parts                   4
bits/part               16
max allowed vn          65536
number of multiplies    16
peak cycles/limb        8

(A T90 CPU can accumulate two products per cycle.)
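
The table entries can be reproduced with the small sketch below.  It
assumes that partial products are accumulated exactly in a 48-bit
field; that width is not stated above but is consistent with every
"max allowed vn" entry.  All function names are hypothetical.

```c
#include <assert.h>

/* ACC_BITS is an assumption inferred from the tables: a product of a
   ubits-wide part and a vbits-wide part leaves
   ACC_BITS - ubits - vbits bits of headroom for summing vn products.  */
enum { LIMB_BITS = 64, ACC_BITS = 48 };

/* Bits per part when a 64-bit limb is split into `parts' pieces.  */
int
bits_per_part (int parts)
{
  assert (parts >= 1 && parts <= LIMB_BITS);
  return (LIMB_BITS + parts - 1) / parts;       /* ceil (64 / parts) */
}

/* Largest vn for which vn accumulated products still fit.  */
long
max_vn (int uparts, int vparts)
{
  int headroom = ACC_BITS - bits_per_part (uparts) - bits_per_part (vparts);
  assert (headroom >= 0);
  return 1L << headroom;
}

/* One multiply per (U part, V part) pair.  */
int
multiplies (int uparts, int vparts)
{
  return uparts * vparts;
}

/* Two products are accumulated per cycle, as noted for the T90.  */
double
peak_cycles_per_limb (int uparts, int vparts)
{
  return multiplies (uparts, vparts) / 2.0;
}
```

For example, max_vn (3, 4) gives 1024 and peak_cycles_per_limb (3, 4)
gives 6, matching the second table.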
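
IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    /* Pass 1: add limbs independently; cy[i + 1] is the carry out of
       limb i.  */
    #pragma _CRI ivdep
      for (i = 0; i < n; i++)
        { s = up[i] + vp[i];
          rp[i] = s;
          cy[i + 1] = s < up[i]; }
    /* Pass 2: add each carry into the next limb; count the limbs that
       wrap around.  */
      more_carries = 0;
    #pragma _CRI ivdep
      for (i = 1; i < n; i++)
        { s = rp[i] + cy[i];
          rp[i] = s;
          more_carries += s < cy[i]; }
    /* Pass 3 (rare): ripple the remaining carries sequentially.  The
       comparison must include cy[i], or a wrap from pass 2 at limb i
       is lost.  */
      cys = 0;
      if (more_carries)
        {
          cys = rp[1] < cy[1];
          for (i = 2; i < n; i++)
            { c = cy[i] + cys;
              rp[i] += cys;
              cys = rp[i] < c; }
        }
      return cys + cy[n];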

* Write mpn_add3_n for adding three operands.  First add operands 1
  and 2, and generate cy[].  Then add operand 3 to the partial result,
  and accumulate carry into cy[].  Finally propagate carry just like
  in the new mpn_add_n.
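
The three-operand idea can be sketched in portable C as follows.  This
is an illustration only: the name add3_n_sketch, the caller-supplied
cy[] scratch array and the 64-bit limb type are assumptions, and the
final propagation is a plain scalar loop rather than the vectorized
pass with a rare fix-up described for the new mpn_add_n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[] = ap[] + bp[] + cp[] over n limbs; returns the carry out
   (0, 1 or 2).  cy[] must have room for n + 1 entries.  */
unsigned
add3_n_sketch (uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
               const uint64_t *cp, unsigned char *cy, size_t n)
{
  assert (n >= 1);
  cy[0] = 0;

  /* Pass 1: limb-wise sums with no dependency between iterations;
     cy[i + 1] collects the 0..2 carries produced by limb i.  */
  for (size_t i = 0; i < n; i++)
    {
      uint64_t s = ap[i] + bp[i];
      unsigned c = s < ap[i];
      uint64_t t = s + cp[i];
      c += t < s;
      rp[i] = t;
      cy[i + 1] = (unsigned char) c;
    }

  /* Pass 2: ripple the collected carries upwards.  */
  unsigned carry = 0;
  for (size_t i = 1; i < n; i++)
    {
      uint64_t add = (uint64_t) cy[i] + carry;
      rp[i] += add;
      carry = rp[i] < add;
    }
  return carry + cy[n];
}
```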

IDEA:

Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 time to 4 cycles/limb.  By
storing even fewer bits per limb, perhaps 56, it would be possible to
write an mpn_mul_basecase that would run at effectively 1 cycle/limb.
(Use VM here to better handle the rhomb-shaped multiply area, perhaps
rounding operand sizes up to the next power of 2.)
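
A minimal sketch of addition with 62 payload bits per limb, assuming
the top two bits of every word are zero on input and output.  The carry
then appears in bit 62 of the word sum and is extracted with a shift,
so no unsigned comparison is needed.  The name add_n_62 and the macros
are hypothetical, and the scalar loop only shows the representation; it
makes no claim about the cycle counts quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAYLOAD_BITS 62
#define PAYLOAD_MASK ((UINT64_C (1) << PAYLOAD_BITS) - 1)

/* rp[] = up[] + vp[] over n limbs of 62 payload bits each; returns the
   carry out of the top limb (0 or 1).  */
uint64_t
add_n_62 (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  uint64_t carry = 0;
  for (size_t i = 0; i < n; i++)
    {
      assert ((up[i] | vp[i]) <= PAYLOAD_MASK);
      uint64_t s = up[i] + vp[i] + carry;   /* at most 63 bits: no wrap */
      rp[i] = s & PAYLOAD_MASK;
      carry = s >> PAYLOAD_BITS;
    }
  return carry;
}
```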