1.1  mrg Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
1.1  mrg
1.1  mrg This file is part of the GNU MP Library.
1.1  mrg
1.1  mrg The GNU MP Library is free software; you can redistribute it and/or modify
1.1  mrg it under the terms of the GNU Lesser General Public License as published by
1.1  mrg the Free Software Foundation; either version 3 of the License, or (at your
1.1  mrg option) any later version.
1.1  mrg
1.1  mrg The GNU MP Library is distributed in the hope that it will be useful, but
1.1  mrg WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
1.1  mrg or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
1.1  mrg License for more details.
1.1  mrg
1.1  mrg You should have received a copy of the GNU Lesser General Public License
1.1  mrg along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
1.1  mrg
1.1  mrg
1.1  mrg
1.1  mrg
1.1  mrg
1.1  mrg
1.1  mrg The code in this directory works for Cray vector systems such as C90,
1.1  mrg J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
1.1  mrg the T3E and T3D systems, see the `alpha' subdirectory at the same
1.1  mrg level as the directory containing this file.)
1.1  mrg
1.1  mrg The cfp subdirectory is for systems utilizing the traditional Cray
1.1  mrg floating-point format, and the ieee subdirectory is for the newer
1.1  mrg systems that use the IEEE floating-point format.
1.1  mrg
1.1  mrg Several issues reduce speed on Cray systems.  For systems with cfp
1.1  mrg floating point, the main obstacle is forming 128-bit products.  For
1.1  mrg IEEE systems, addition, and in particular computing the carry, is
1.1  mrg the main issue.  There are no vectorizing unsigned-less-than
1.1  mrg instructions, and the sequence that implements that operation is
1.1  mrg very long.
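
To give a feel for why a synthesized unsigned-less-than is costly, here is a
minimal C sketch of one well-known branch-free formulation.  It is only an
illustration: the names ult64 and add_carry are hypothetical, and nothing
here claims to be the sequence a Cray compiler actually emits.

```c
#include <stdint.h>

/* Branch-free unsigned less-than built from bitwise operations and one
   subtraction.  Returns 1 if a < b, else 0.  The answer is the top bit
   of the combined expression.  */
static inline uint64_t
ult64 (uint64_t a, uint64_t b)
{
  return ((~a & b) | ((~a | b) & (a - b))) >> 63;
}

/* Carry out of the single-limb addition *s = a + b, expressed through
   ult64: the sum wrapped around exactly when it is smaller than a.  */
static inline uint64_t
add_carry (uint64_t a, uint64_t b, uint64_t *s)
{
  *s = a + b;
  return ult64 (*s, a);
}
```

Even in this compact form the comparison costs two complements, an AND, two
ORs, a subtraction and a shift per limb, on top of the addition itself.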
1.1  mrg
1.1  mrg Shifting is the only operation that is simple to make fast.  All Cray
1.1  mrg systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
1.1  mrg should be really useful for this.
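
The reason shifting vectorizes well is visible in a plain C rendering of a
multi-limb left shift: each result limb depends on only two source limbs and
on no earlier result.  This is a sketch under assumptions, not the code in
this directory; lshift_sketch is a hypothetical name, limbs are taken to be
64 bits, and the shift count must satisfy 1 <= cnt <= 63.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift {up,n} left by cnt bits (1 <= cnt <= 63, n >= 1) into {rp,n}.
   Returns the bits shifted out of the most significant limb.  */
uint64_t
lshift_sketch (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
  unsigned tnc = 64 - cnt;
  uint64_t retval = up[n - 1] >> tnc;
  size_t i;

  /* No loop-carried dependency: iteration i reads only up[i] and
     up[i - 1], so all iterations are independent of each other.  */
  for (i = n - 1; i > 0; i--)
    rp[i] = (up[i] << cnt) | (up[i - 1] >> tnc);
  rp[0] = up[0] << cnt;
  return retval;
}
```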
1.1  mrg
1.1  mrg For best speed on cfp systems, we need a mul_basecase, since that
1.1  mrg reduces the need for carry propagation to a minimum.  Depending on the
1.1  mrg size (vn) of the smaller of the two operands (V), we should split U and V
1.1  mrg into different chunk sizes:
1.1  mrg
1.1  mrg U split in 2 32-bit parts
1.1  mrg V split according to the table:
1.1  mrg parts			4	5	6	7	8
1.1  mrg bits/part		16	13	11	10	8
1.1  mrg max allowed vn		1	8	32	64	256
1.1  mrg number of multiplies	8	10	12	14	16
1.1  mrg peak cycles/limb	4	5	6	7	8
1.1  mrg
1.1  mrg U split in 3 22-bit parts
1.1  mrg V split according to the table:
1.1  mrg parts			3	4	5
1.1  mrg bits/part		22	16	13
1.1  mrg max allowed vn		16	1024	8192
1.1  mrg number of multiplies	9	12	15
1.1  mrg peak cycles/limb	4.5	6	7.5
1.1  mrg
1.1  mrg U split in 4 16-bit parts
1.1  mrg V split according to the table:
1.1  mrg parts			4
1.1  mrg bits/part		16
1.1  mrg max allowed vn		65536
1.1  mrg number of multiplies	16
1.1  mrg peak cycles/limb	8
1.1  mrg
1.1  mrg (A T90 CPU can accumulate two products per cycle.)
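
The table entries above can be reproduced with a few lines of C.  The
sketch assumes that partial products are summed in a 48-bit field; that
figure is an inference that happens to match every row, not something this
file states.  The function names are hypothetical.

```c
/* ASSUMPTION: accumulated partial products must fit in 48 bits.  */
#define ACC_BITS 48

/* Largest vn such that vn products of (ubits + vbits) bits each can be
   summed without exceeding ACC_BITS bits.  */
unsigned long
max_vn (unsigned ubits, unsigned vbits)
{
  return 1UL << (ACC_BITS - ubits - vbits);
}

/* Every part of U meets every part of V once per limb pair.  */
unsigned
multiplies (unsigned uparts, unsigned vparts)
{
  return uparts * vparts;
}

/* With two products accumulated per cycle (the T90 figure above).  */
double
peak_cycles_per_limb (unsigned uparts, unsigned vparts)
{
  return multiplies (uparts, vparts) / 2.0;
}
```

For example, U in 3 parts of 22 bits and V in 4 parts of 16 bits gives
max_vn (22, 16) = 1024, 12 multiplies, and 6 cycles/limb, matching the
second table.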
1.1  mrg
1.1  mrg IDEA:
1.1  mrg * Rewrite mpn_add_n:
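1.1  mrg     short cy[n + 1], cy2[n + 1];
1.1  mrg     /* Pass 1: limb-wise sums, carry out of limb i saved in cy[i + 1].  */
1.1  mrg     #pragma _CRI ivdep
1.1  mrg       for (i = 0; i < n; i++)
1.1  mrg 	{ s = up[i] + vp[i];
1.1  mrg 	  rp[i] = s;
1.1  mrg 	  cy[i + 1] = s < up[i]; }
1.1  mrg       more_carries = 0;
1.1  mrg     /* Pass 2: add the saved carries, saving any new carries in cy2[].  */
1.1  mrg     #pragma _CRI ivdep
1.1  mrg       for (i = 1; i < n; i++)
1.1  mrg 	{ s = rp[i] + cy[i];
1.1  mrg 	  rp[i] = s;
1.1  mrg 	  cy2[i + 1] = s < cy[i];
1.1  mrg 	  more_carries += cy2[i + 1]; }
1.1  mrg       cys = 0;
1.1  mrg       if (more_carries)
1.1  mrg 	{
1.1  mrg 	  /* Pass 3 (rare): ripple the pass 2 carries sequentially.  */
1.1  mrg 	  for (i = 2; i < n; i++)
1.1  mrg 	    { c = cys + cy2[i];
1.1  mrg 	      rp[i] += c;
1.1  mrg 	      cys = rp[i] < c; }
1.1  mrg 	  cys += cy2[n];
1.1  mrg 	}
1.1  mrg       return cys + cy[n];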
1.1  mrg
1.1  mrg * Write mpn_add3_n for adding three operands.  First add operands 1
1.1  mrg   and 2, and generate cy[].  Then add operand 3 to the partial result,
1.1  mrg   and accumulate carry into cy[].  Finally propagate carry just like
1.1  mrg   in the new mpn_add_n.
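
The two ideas above can be tried out in portable C.  The following is a
minimal sketch, not GMP code: limb_t stands in for mp_limb_t, the names
propagate, add_n_sketch and add3_n_sketch are hypothetical, the Cray
pragmas are omitted, and the scratch array cy2[], which records the carries
produced while the first set of carries is being added, is a choice of this
sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t limb_t;	/* stand-in for mp_limb_t */

/* Add the carry counts cy[1..n-1] into rp and return the total carry out
   of the top limb, cy[n] included.  cy2 is zeroed scratch of n + 1 bytes.  */
static limb_t
propagate (limb_t *rp, const unsigned char *cy, unsigned char *cy2, size_t n)
{
  limb_t more_carries = 0, cys = 0;
  size_t i;

  /* Vectorizable: iterations are independent of each other.  */
  for (i = 1; i < n; i++)
    {
      limb_t s = rp[i] + cy[i];
      rp[i] = s;
      cy2[i + 1] = s < cy[i];
      more_carries += cy2[i + 1];
    }
  /* Scalar and rarely needed: ripple the carries generated just above.  */
  if (more_carries)
    {
      for (i = 2; i < n; i++)
	{
	  limb_t c = cys + cy2[i];
	  rp[i] += c;
	  cys = rp[i] < c;
	}
      cys += cy2[n];
    }
  return cys + cy[n];
}

/* Two-operand add; returns the carry out (0 or 1).  */
limb_t
add_n_sketch (limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  unsigned char *cy = calloc (2 * (n + 1), 1);
  limb_t c;
  size_t i;

  if (cy == NULL)
    abort ();
  for (i = 0; i < n; i++)
    {
      limb_t s = up[i] + vp[i];
      rp[i] = s;
      cy[i + 1] = s < up[i];
    }
  c = propagate (rp, cy, cy + n + 1, n);
  free (cy);
  return c;
}

/* Three-operand add; carries from both additions accumulate in cy[], so
   each entry is 0, 1 or 2, and the returned carry out is 0, 1 or 2.  */
limb_t
add3_n_sketch (limb_t *rp, const limb_t *up, const limb_t *vp,
	       const limb_t *wp, size_t n)
{
  unsigned char *cy = calloc (2 * (n + 1), 1);
  limb_t c;
  size_t i;

  if (cy == NULL)
    abort ();
  for (i = 0; i < n; i++)
    {
      limb_t s = up[i] + vp[i];
      unsigned char k = s < up[i];
      s += wp[i];
      k += s < wp[i];
      rp[i] = s;
      cy[i + 1] = k;
    }
  c = propagate (rp, cy, cy + n + 1, n);
  free (cy);
  return c;
}
```

The sequential pass runs only when adding a saved carry itself overflows a
limb, which requires that limb to be all ones (or nearly so in the
three-operand case), so for typical data the cost stays close to the two
vectorizable loops.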
1.1  mrg
1.1  mrg IDEA:
1.1  mrg
1.1  mrg Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
1.1  mrg down to 2.5 cycles/limb and mpn_addmul_1 time to 4 cycles/limb.  By
1.1  mrg storing even fewer bits per limb, perhaps 56, it would be possible to
1.1  mrg write a mul_basecase that would run at effectively 1 cycle/limb.
1.1  mrg (Use VM here to better handle the rhomb-shaped multiply area, perhaps
1.1  mrg rounding operand sizes up to the next power of 2.)
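
A minimal sketch of what fewer bits per limb buys for addition: with 62
significant bits in a 64-bit word, a limb sum cannot overflow the word, so
the carry is simply the bits above position 62 and no unsigned comparison
is needed.  The names NUMB_BITS, NUMB_MASK and add_n_62 are choices of this
sketch, the loop is written sequentially for clarity, and it makes no
attempt to reach the cycle counts quoted above.

```c
#include <stddef.h>
#include <stdint.h>

#define NUMB_BITS 62
#define NUMB_MASK ((UINT64_C (1) << NUMB_BITS) - 1)

/* Add {up,n} and {vp,n}, where every limb holds NUMB_BITS significant
   bits in a 64-bit word.  Returns the carry out (0 or 1).  */
uint64_t
add_n_62 (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  uint64_t cy = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      /* At most 2 * (2^62 - 1) + 1, which fits in 64 bits.  */
      uint64_t s = up[i] + vp[i] + cy;
      rp[i] = s & NUMB_MASK;	/* low 62 bits are the result limb */
      cy = s >> NUMB_BITS;	/* the carry is read off with a shift */
    }
  return cy;
}
```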