README revision 1.1.1.1 1 1.1 mrg Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
2 1.1 mrg
3 1.1 mrg This file is part of the GNU MP Library.
4 1.1 mrg
5 1.1 mrg The GNU MP Library is free software; you can redistribute it and/or modify
6 1.1 mrg it under the terms of the GNU Lesser General Public License as published by
7 1.1 mrg the Free Software Foundation; either version 3 of the License, or (at your
8 1.1 mrg option) any later version.
9 1.1 mrg
10 1.1 mrg The GNU MP Library is distributed in the hope that it will be useful, but
11 1.1 mrg WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
12 1.1 mrg or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
13 1.1 mrg License for more details.
14 1.1 mrg
15 1.1 mrg You should have received a copy of the GNU Lesser General Public License
16 1.1 mrg along with the GNU MP Library. If not, see http://www.gnu.org/licenses/.
17 1.1 mrg
18 1.1 mrg
19 1.1 mrg
20 1.1 mrg
21 1.1 mrg
22 1.1 mrg
23 1.1 mrg The code in this directory works for Cray vector systems such as C90,
24 1.1 mrg J90, T90 (both the CFP variant and the IEEE variant) and SV1. (For
25 1.1 mrg the T3E and T3D systems, see the `alpha' subdirectory at the same
26 1.1 mrg level as the directory containing this file.)
27 1.1 mrg
28 1.1 mrg The cfp subdirectory is for systems utilizing the traditional Cray
29 1.1 mrg floating-point format, and the ieee subdirectory is for the newer
30 1.1 mrg systems that use the IEEE floating-point format.
31 1.1 mrg
32 1.1 mrg There are several issues that reduces speed on Cray systems. For
33 1.1 mrg systems with cfp floating point, the main obstacle is the forming of
34 1.1 mrg 128-bit products. For IEEE systems, adding, and in particular
35 1.1 mrg computing carry is the main issue. There are no vectorizing
36 1.1 mrg unsigned-less-than instructions, and the sequence that implement that
37 1.1 mrg operation is very long.
38 1.1 mrg
39 1.1 mrg Shifting is the only operation that is simple to make fast. All Cray
40 1.1 mrg systems have a bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
41 1.1 mrg should be really useful.
42 1.1 mrg
43 1.1 mrg For best speed for cfp systems, we need a mul_basecase, since that
44 1.1 mrg reduces the need for carry propagation to a minimum. Depending on the
45 1.1 mrg size (vn) of the smaller of the two operands (V), we should split U and V
46 1.1 mrg in different chunk sizes:
47 1.1 mrg
48 1.1 mrg U split in 2 32-bit parts
49 1.1 mrg V split according to the table:
50 1.1 mrg parts 4 5 6 7 8
51 1.1 mrg bits/part 16 13 11 10 8
52 1.1 mrg max allowed vn 1 8 32 64 256
53 1.1 mrg number of multiplies 8 10 12 14 16
54 1.1 mrg peak cycles/limb 4 5 6 7 8
55 1.1 mrg
56 1.1 mrg U split in 3 22-bit parts
57 1.1 mrg V split according to the table:
58 1.1 mrg parts 3 4 5
59 1.1 mrg bits/part 22 16 13
60 1.1 mrg max allowed vn 16 1024 8192
61 1.1 mrg number of multiplies 9 12 15
62 1.1 mrg peak cycles/limb 4.5 6 7.5
63 1.1 mrg
64 1.1 mrg U split in 4 16-bit parts
65 1.1 mrg V split according to the table:
66 1.1 mrg parts 4
67 1.1 mrg bits/part 16
68 1.1 mrg max allowed vn 65536
69 1.1 mrg number of multiplies 16
70 1.1 mrg peak cycles/limb 8
71 1.1 mrg
72 1.1 mrg (A T90 CPU can accumulate two products per cycle.)
73 1.1 mrg
74 1.1 mrg IDEA:
75 1.1 mrg * Rewrite mpn_add_n:
76 1.1 mrg short cy[n + 1];
77 1.1 mrg #pragma _CRI ivdep
78 1.1 mrg for (i = 0; i < n; i++)
79 1.1 mrg { s = up[i] + vp[i];
80 1.1 mrg rp[i] = s;
81 1.1 mrg cy[i + 1] = s < up[i]; }
82 1.1 mrg more_carries = 0;
83 1.1 mrg #pragma _CRI ivdep
84 1.1 mrg for (i = 1; i < n; i++)
85 1.1 mrg { s = rp[i] + cy[i];
86 1.1 mrg rp[i] = s;
87 1.1 mrg more_carries += s < cy[i]; }
88 1.1 mrg cys = 0;
89 1.1 mrg if (more_carries)
90 1.1 mrg {
91 1.1 mrg cys = rp[1] < cy[1];
92 1.1 mrg for (i = 2; i < n; i++)
93 1.1 mrg { rp[i] += cys;
94 1.1 mrg cys = rp[i] < cys; }
95 1.1 mrg }
96 1.1 mrg return cys + cy[n];
97 1.1 mrg
98 1.1 mrg * Write mpn_add3_n for adding three operands. First add operands 1
99 1.1 mrg and 2, and generate cy[]. Then add operand 3 to the partial result,
100 1.1 mrg and accumulate carry into cy[]. Finally propagate carry just like
101 1.1 mrg in the new mpn_add_n.
102 1.1 mrg
103 1.1 mrg IDEA:
104 1.1 mrg
105 1.1 mrg Store fewer bits, perhaps 62, per limb. That brings mpn_add_n time
106 1.1 mrg down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb. By
107 1.1 mrg storing even fewer bits per limb, perhaps 56, it would be possible to
108 1.1 mrg write a mul_mul_basecase that would run at effectively 1 cycle/limb.
109 1.1 mrg (Use VM here to better handle the romb-shaped multiply area, perhaps
110 1.1 mrg rouding operand sizes up to the next power of 2.)
111