setox.sa revision 1.1 1 * MOTOROLA MICROPROCESSOR & MEMORY TECHNOLOGY GROUP
2 * M68000 Hi-Performance Microprocessor Division
3 * M68040 Software Package
4 *
5 * M68040 Software Package Copyright (c) 1993, 1994 Motorola Inc.
6 * All rights reserved.
7 *
8 * THE SOFTWARE is provided on an "AS IS" basis and without warranty.
9 * To the maximum extent permitted by applicable law,
10 * MOTOROLA DISCLAIMS ALL WARRANTIES WHETHER EXPRESS OR IMPLIED,
11 * INCLUDING IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
12 * PARTICULAR PURPOSE and any warranty against infringement with
13 * regard to the SOFTWARE (INCLUDING ANY MODIFIED VERSIONS THEREOF)
14 * and any accompanying written materials.
15 *
16 * To the maximum extent permitted by applicable law,
17 * IN NO EVENT SHALL MOTOROLA BE LIABLE FOR ANY DAMAGES WHATSOEVER
18 * (INCLUDING WITHOUT LIMITATION, DAMAGES FOR LOSS OF BUSINESS
19 * PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, OR
20 * OTHER PECUNIARY LOSS) ARISING OF THE USE OR INABILITY TO USE THE
21 * SOFTWARE. Motorola assumes no responsibility for the maintenance
22 * and support of the SOFTWARE.
23 *
24 * You are hereby granted a copyright license to use, modify, and
25 * distribute the SOFTWARE so long as this entire notice is retained
26 * without alteration in any modified and/or redistributed versions,
27 * and that such modified versions are clearly identified as such.
28 * No licenses are granted by implication, estoppel or otherwise
29 * under any patents or trademarks of Motorola, Inc.
30
31 *
32 * setox.sa 3.1 12/10/90
33 *
34 * The entry point setox computes the exponential of a value.
35 * setoxd does the same except the input value is a denormalized
36 * number. setoxm1 computes exp(X)-1, and setoxm1d computes
37 * exp(X)-1 for denormalized X.
38 *
39 * INPUT
40 * -----
41 * Double-extended value in memory location pointed to by address
42 * register a0.
43 *
44 * OUTPUT
45 * ------
46 * exp(X) or exp(X)-1 returned in floating-point register fp0.
47 *
48 * ACCURACY and MONOTONICITY
49 * -------------------------
50 * The returned result is within 0.85 ulps in 64 significant bit, i.e.
51 * within 0.5001 ulp to 53 bits if the result is subsequently rounded
52 * to double precision. The result is provably monotonic in double
53 * precision.
54 *
55 * SPEED
56 * -----
57 * Two timings are measured, both in the copy-back mode. The
58 * first one is measured when the function is invoked the first time
59 * (so the instructions and data are not in cache), and the
60 * second one is measured when the function is reinvoked at the same
61 * input argument.
62 *
63 * The program setox takes approximately 210/190 cycles for input
64 * argument X whose magnitude is less than 16380 log2, which
65 * is the usual situation. For the less common arguments,
66 * depending on their values, the program may run faster or slower --
67 * but no worse than 10% slower even in the extreme cases.
68 *
69 * The program setoxm1 takes approximately ???/??? cycles for input
70 * argument X, 0.25 <= |X| < 70log2. For |X| < 0.25, it takes
71 * approximately ???/??? cycles. For the less common arguments,
72 * depending on their values, the program may run faster or slower --
73 * but no worse than 10% slower even in the extreme cases.
74 *
75 * ALGORITHM and IMPLEMENTATION NOTES
76 * ----------------------------------
77 *
78 * setoxd
79 * ------
80 * Step 1. Set ans := 1.0
81 *
82 * Step 2. Return ans := ans + sign(X)*2^(-126). Exit.
83 * Notes: This will always generate one exception -- inexact.
84 *
85 *
86 * setox
87 * -----
88 *
89 * Step 1. Filter out extreme cases of input argument.
90 * 1.1 If |X| >= 2^(-65), go to Step 1.3.
91 * 1.2 Go to Step 7.
92 * 1.3 If |X| < 16380 log(2), go to Step 2.
93 * 1.4 Go to Step 8.
94 * Notes: The usual case should take the branches 1.1 -> 1.3 -> 2.
95 * To avoid the use of floating-point comparisons, a
96 * compact representation of |X| is used. This format is a
97 * 32-bit integer, the upper (more significant) 16 bits are
98 * the sign and biased exponent field of |X|; the lower 16
99 * bits are the 16 most significant fraction (including the
100 * explicit bit) bits of |X|. Consequently, the comparisons
101 * in Steps 1.1 and 1.3 can be performed by integer comparison.
102 * Note also that the constant 16380 log(2) used in Step 1.3
103 * is also in the compact form. Thus taking the branch
104 * to Step 2 guarantees |X| < 16380 log(2). There is no harm
105 * to have a small number of cases where |X| is less than,
106 * but close to, 16380 log(2) and the branch to Step 9 is
107 * taken.
108 *
109 * Step 2. Calculate N = round-to-nearest-int( X * 64/log2 ).
110 * 2.1 Set AdjFlag := 0 (indicates the branch 1.3 -> 2 was taken)
111 * 2.2 N := round-to-nearest-integer( X * 64/log2 ).
112 * 2.3 Calculate J = N mod 64; so J = 0,1,2,..., or 63.
113 * 2.4 Calculate M = (N - J)/64; so N = 64M + J.
114 * 2.5 Calculate the address of the stored value of 2^(J/64).
115 * 2.6 Create the value Scale = 2^M.
116 * Notes: The calculation in 2.2 is really performed by
117 *
118 * Z := X * constant
119 * N := round-to-nearest-integer(Z)
120 *
121 * where
122 *
123 * constant := single-precision( 64/log 2 ).
124 *
125 * Using a single-precision constant avoids memory access.
126 * Another effect of using a single-precision "constant" is
127 * that the calculated value Z is
128 *
129 * Z = X*(64/log2)*(1+eps), |eps| <= 2^(-24).
130 *
131 * This error has to be considered later in Steps 3 and 4.
132 *
133 * Step 3. Calculate X - N*log2/64.
134 * 3.1 R := X + N*L1, where L1 := single-precision(-log2/64).
135 * 3.2 R := R + N*L2, L2 := extended-precision(-log2/64 - L1).
136 * Notes: a) The way L1 and L2 are chosen ensures L1+L2 approximate
137 * the value -log2/64 to 88 bits of accuracy.
138 * b) N*L1 is exact because N is no longer than 22 bits and
139 * L1 is no longer than 24 bits.
140 * c) The calculation X+N*L1 is also exact due to cancellation.
141 * Thus, R is practically X+N(L1+L2) to full 64 bits.
142 * d) It is important to estimate how large can |R| be after
143 * Step 3.2.
144 *
145 * N = rnd-to-int( X*64/log2 (1+eps) ), |eps|<=2^(-24)
146 * X*64/log2 (1+eps) = N + f, |f| <= 0.5
147 * X*64/log2 - N = f - eps*X 64/log2
148 * X - N*log2/64 = f*log2/64 - eps*X
149 *
150 *
151 * Now |X| <= 16446 log2, thus
152 *
153 * |X - N*log2/64| <= (0.5 + 16446/2^(18))*log2/64
154 * <= 0.57 log2/64.
155 * This bound will be used in Step 4.
156 *
157 * Step 4. Approximate exp(R)-1 by a polynomial
158 * p = R + R*R*(A1 + R*(A2 + R*(A3 + R*(A4 + R*A5))))
159 * Notes: a) In order to reduce memory access, the coefficients are
160 * made as "short" as possible: A1 (which is 1/2), A4 and A5
161 * are single precision; A2 and A3 are double precision.
162 * b) Even with the restrictions above,
163 * |p - (exp(R)-1)| < 2^(-68.8) for all |R| <= 0.0062.
164 * Note that 0.0062 is slightly bigger than 0.57 log2/64.
165 * c) To fully utilize the pipeline, p is separated into
166 * two independent pieces of roughly equal complexities
167 * p = [ R + R*S*(A2 + S*A4) ] +
168 * [ S*(A1 + S*(A3 + S*A5)) ]
169 * where S = R*R.
170 *
171 * Step 5. Compute 2^(J/64)*exp(R) = 2^(J/64)*(1+p) by
172 * ans := T + ( T*p + t)
173 * where T and t are the stored values for 2^(J/64).
174 * Notes: 2^(J/64) is stored as T and t where T+t approximates
175 * 2^(J/64) to roughly 85 bits; T is in extended precision
176 * and t is in single precision. Note also that T is rounded
177 * to 62 bits so that the last two bits of T are zero. The
178 * reason for such a special form is that T-1, T-2, and T-8
179 * will all be exact --- a property that will give much
180 * more accurate computation of the function EXPM1.
181 *
182 * Step 6. Reconstruction of exp(X)
183 * exp(X) = 2^M * 2^(J/64) * exp(R).
184 * 6.1 If AdjFlag = 0, go to 6.3
185 * 6.2 ans := ans * AdjScale
186 * 6.3 Restore the user FPCR
187 * 6.4 Return ans := ans * Scale. Exit.
188 * Notes: If AdjFlag = 0, we have X = Mlog2 + Jlog2/64 + R,
189 * |M| <= 16380, and Scale = 2^M. Moreover, exp(X) will
190 * neither overflow nor underflow. If AdjFlag = 1, that
191 * means that
192 * X = (M1+M)log2 + Jlog2/64 + R, |M1+M| >= 16380.
193 * Hence, exp(X) may overflow or underflow or neither.
194 * When that is the case, AdjScale = 2^(M1) where M1 is
195 * approximately M. Thus 6.2 will never cause over/underflow.
196 * Possible exception in 6.4 is overflow or underflow.
197 * The inexact exception is not generated in 6.4. Although
198 * one can argue that the inexact flag should always be
199 * raised, to simulate that exception cost to much than the
200 * flag is worth in practical uses.
201 *
202 * Step 7. Return 1 + X.
203 * 7.1 ans := X
204 * 7.2 Restore user FPCR.
205 * 7.3 Return ans := 1 + ans. Exit
206 * Notes: For non-zero X, the inexact exception will always be
207 * raised by 7.3. That is the only exception raised by 7.3.
208 * Note also that we use the FMOVEM instruction to move X
209 * in Step 7.1 to avoid unnecessary trapping. (Although
210 * the FMOVEM may not seem relevant since X is normalized,
211 * the precaution will be useful in the library version of
212 * this code where the separate entry for denormalized inputs
213 * will be done away with.)
214 *
215 * Step 8. Handle exp(X) where |X| >= 16380log2.
216 * 8.1 If |X| > 16480 log2, go to Step 9.
217 * (mimic 2.2 - 2.6)
218 * 8.2 N := round-to-integer( X * 64/log2 )
219 * 8.3 Calculate J = N mod 64, J = 0,1,...,63
220 * 8.4 K := (N-J)/64, M1 := truncate(K/2), M = K-M1, AdjFlag := 1.
221 * 8.5 Calculate the address of the stored value 2^(J/64).
222 * 8.6 Create the values Scale = 2^M, AdjScale = 2^M1.
223 * 8.7 Go to Step 3.
224 * Notes: Refer to notes for 2.2 - 2.6.
225 *
226 * Step 9. Handle exp(X), |X| > 16480 log2.
227 * 9.1 If X < 0, go to 9.3
228 * 9.2 ans := Huge, go to 9.4
229 * 9.3 ans := Tiny.
230 * 9.4 Restore user FPCR.
231 * 9.5 Return ans := ans * ans. Exit.
232 * Notes: Exp(X) will surely overflow or underflow, depending on
233 * X's sign. "Huge" and "Tiny" are respectively large/tiny
234 * extended-precision numbers whose square over/underflow
235 * with an inexact result. Thus, 9.5 always raises the
236 * inexact together with either overflow or underflow.
237 *
238 *
239 * setoxm1d
240 * --------
241 *
242 * Step 1. Set ans := 0
243 *
244 * Step 2. Return ans := X + ans. Exit.
245 * Notes: This will return X with the appropriate rounding
246 * precision prescribed by the user FPCR.
247 *
248 * setoxm1
249 * -------
250 *
251 * Step 1. Check |X|
252 * 1.1 If |X| >= 1/4, go to Step 1.3.
253 * 1.2 Go to Step 7.
254 * 1.3 If |X| < 70 log(2), go to Step 2.
255 * 1.4 Go to Step 10.
256 * Notes: The usual case should take the branches 1.1 -> 1.3 -> 2.
257 * However, it is conceivable |X| can be small very often
258 * because EXPM1 is intended to evaluate exp(X)-1 accurately
259 * when |X| is small. For further details on the comparisons,
260 * see the notes on Step 1 of setox.
261 *
262 * Step 2. Calculate N = round-to-nearest-int( X * 64/log2 ).
263 * 2.1 N := round-to-nearest-integer( X * 64/log2 ).
264 * 2.2 Calculate J = N mod 64; so J = 0,1,2,..., or 63.
265 * 2.3 Calculate M = (N - J)/64; so N = 64M + J.
266 * 2.4 Calculate the address of the stored value of 2^(J/64).
267 * 2.5 Create the values Sc = 2^M and OnebySc := -2^(-M).
268 * Notes: See the notes on Step 2 of setox.
269 *
270 * Step 3. Calculate X - N*log2/64.
271 * 3.1 R := X + N*L1, where L1 := single-precision(-log2/64).
272 * 3.2 R := R + N*L2, L2 := extended-precision(-log2/64 - L1).
273 * Notes: Applying the analysis of Step 3 of setox in this case
274 * shows that |R| <= 0.0055 (note that |X| <= 70 log2 in
275 * this case).
276 *
277 * Step 4. Approximate exp(R)-1 by a polynomial
278 * p = R+R*R*(A1+R*(A2+R*(A3+R*(A4+R*(A5+R*A6)))))
279 * Notes: a) In order to reduce memory access, the coefficients are
280 * made as "short" as possible: A1 (which is 1/2), A5 and A6
281 * are single precision; A2, A3 and A4 are double precision.
282 * b) Even with the restriction above,
283 * |p - (exp(R)-1)| < |R| * 2^(-72.7)
284 * for all |R| <= 0.0055.
285 * c) To fully utilize the pipeline, p is separated into
286 * two independent pieces of roughly equal complexity
287 * p = [ R*S*(A2 + S*(A4 + S*A6)) ] +
288 * [ R + S*(A1 + S*(A3 + S*A5)) ]
289 * where S = R*R.
290 *
291 * Step 5. Compute 2^(J/64)*p by
292 * p := T*p
293 * where T and t are the stored values for 2^(J/64).
294 * Notes: 2^(J/64) is stored as T and t where T+t approximates
295 * 2^(J/64) to roughly 85 bits; T is in extended precision
296 * and t is in single precision. Note also that T is rounded
297 * to 62 bits so that the last two bits of T are zero. The
298 * reason for such a special form is that T-1, T-2, and T-8
299 * will all be exact --- a property that will be exploited
300 * in Step 6 below. The total relative error in p is no
301 * bigger than 2^(-67.7) compared to the final result.
302 *
303 * Step 6. Reconstruction of exp(X)-1
304 * exp(X)-1 = 2^M * ( 2^(J/64) + p - 2^(-M) ).
305 * 6.1 If M <= 63, go to Step 6.3.
306 * 6.2 ans := T + (p + (t + OnebySc)). Go to 6.6
307 * 6.3 If M >= -3, go to 6.5.
308 * 6.4 ans := (T + (p + t)) + OnebySc. Go to 6.6
309 * 6.5 ans := (T + OnebySc) + (p + t).
310 * 6.6 Restore user FPCR.
311 * 6.7 Return ans := Sc * ans. Exit.
312 * Notes: The various arrangements of the expressions give accurate
313 * evaluations.
314 *
315 * Step 7. exp(X)-1 for |X| < 1/4.
316 * 7.1 If |X| >= 2^(-65), go to Step 9.
317 * 7.2 Go to Step 8.
318 *
319 * Step 8. Calculate exp(X)-1, |X| < 2^(-65).
320 * 8.1 If |X| < 2^(-16312), goto 8.3
321 * 8.2 Restore FPCR; return ans := X - 2^(-16382). Exit.
322 * 8.3 X := X * 2^(140).
323 * 8.4 Restore FPCR; ans := ans - 2^(-16382).
324 * Return ans := ans*2^(140). Exit
325 * Notes: The idea is to return "X - tiny" under the user
326 * precision and rounding modes. To avoid unnecessary
327 * inefficiency, we stay away from denormalized numbers the
328 * best we can. For |X| >= 2^(-16312), the straightforward
329 * 8.2 generates the inexact exception as the case warrants.
330 *
331 * Step 9. Calculate exp(X)-1, |X| < 1/4, by a polynomial
332 * p = X + X*X*(B1 + X*(B2 + ... + X*B12))
333 * Notes: a) In order to reduce memory access, the coefficients are
334 * made as "short" as possible: B1 (which is 1/2), B9 to B12
335 * are single precision; B3 to B8 are double precision; and
336 * B2 is double extended.
337 * b) Even with the restriction above,
338 * |p - (exp(X)-1)| < |X| 2^(-70.6)
339 * for all |X| <= 0.251.
340 * Note that 0.251 is slightly bigger than 1/4.
341 * c) To fully preserve accuracy, the polynomial is computed
342 * as X + ( S*B1 + Q ) where S = X*X and
343 * Q = X*S*(B2 + X*(B3 + ... + X*B12))
344 * d) To fully utilize the pipeline, Q is separated into
345 * two independent pieces of roughly equal complexity
346 * Q = [ X*S*(B2 + S*(B4 + ... + S*B12)) ] +
347 * [ S*S*(B3 + S*(B5 + ... + S*B11)) ]
348 *
349 * Step 10. Calculate exp(X)-1 for |X| >= 70 log 2.
350 * 10.1 If X >= 70log2 , exp(X) - 1 = exp(X) for all practical
351 * purposes. Therefore, go to Step 1 of setox.
352 * 10.2 If X <= -70log2, exp(X) - 1 = -1 for all practical purposes.
353 * ans := -1
354 * Restore user FPCR
355 * Return ans := ans + 2^(-126). Exit.
356 * Notes: 10.2 will always create an inexact and return -1 + tiny
357 * in the user rounding precision and mode.
358 *
359
360 setox IDNT 2,1 Motorola 040 Floating Point Software Package
361
362 section 8
363
364 include fpsp.h
365
366 L2 DC.L $3FDC0000,$82E30865,$4361C4C6,$00000000
367
368 EXPA3 DC.L $3FA55555,$55554431
369 EXPA2 DC.L $3FC55555,$55554018
370
371 HUGE DC.L $7FFE0000,$FFFFFFFF,$FFFFFFFF,$00000000
372 TINY DC.L $00010000,$FFFFFFFF,$FFFFFFFF,$00000000
373
374 EM1A4 DC.L $3F811111,$11174385
375 EM1A3 DC.L $3FA55555,$55554F5A
376
377 EM1A2 DC.L $3FC55555,$55555555,$00000000,$00000000
378
379 EM1B8 DC.L $3EC71DE3,$A5774682
380 EM1B7 DC.L $3EFA01A0,$19D7CB68
381
382 EM1B6 DC.L $3F2A01A0,$1A019DF3
383 EM1B5 DC.L $3F56C16C,$16C170E2
384
385 EM1B4 DC.L $3F811111,$11111111
386 EM1B3 DC.L $3FA55555,$55555555
387
388 EM1B2 DC.L $3FFC0000,$AAAAAAAA,$AAAAAAAB
389 DC.L $00000000
390
391 TWO140 DC.L $48B00000,$00000000
392 TWON140 DC.L $37300000,$00000000
393
394 EXPTBL
395 DC.L $3FFF0000,$80000000,$00000000,$00000000
396 DC.L $3FFF0000,$8164D1F3,$BC030774,$9F841A9B
397 DC.L $3FFF0000,$82CD8698,$AC2BA1D8,$9FC1D5B9
398 DC.L $3FFF0000,$843A28C3,$ACDE4048,$A0728369
399 DC.L $3FFF0000,$85AAC367,$CC487B14,$1FC5C95C
400 DC.L $3FFF0000,$871F6196,$9E8D1010,$1EE85C9F
401 DC.L $3FFF0000,$88980E80,$92DA8528,$9FA20729
402 DC.L $3FFF0000,$8A14D575,$496EFD9C,$A07BF9AF
403 DC.L $3FFF0000,$8B95C1E3,$EA8BD6E8,$A0020DCF
404 DC.L $3FFF0000,$8D1ADF5B,$7E5BA9E4,$205A63DA
405 DC.L $3FFF0000,$8EA4398B,$45CD53C0,$1EB70051
406 DC.L $3FFF0000,$9031DC43,$1466B1DC,$1F6EB029
407 DC.L $3FFF0000,$91C3D373,$AB11C338,$A0781494
408 DC.L $3FFF0000,$935A2B2F,$13E6E92C,$9EB319B0
409 DC.L $3FFF0000,$94F4EFA8,$FEF70960,$2017457D
410 DC.L $3FFF0000,$96942D37,$20185A00,$1F11D537
411 DC.L $3FFF0000,$9837F051,$8DB8A970,$9FB952DD
412 DC.L $3FFF0000,$99E04593,$20B7FA64,$1FE43087
413 DC.L $3FFF0000,$9B8D39B9,$D54E5538,$1FA2A818
414 DC.L $3FFF0000,$9D3ED9A7,$2CFFB750,$1FDE494D
415 DC.L $3FFF0000,$9EF53260,$91A111AC,$20504890
416 DC.L $3FFF0000,$A0B0510F,$B9714FC4,$A073691C
417 DC.L $3FFF0000,$A2704303,$0C496818,$1F9B7A05
418 DC.L $3FFF0000,$A43515AE,$09E680A0,$A0797126
419 DC.L $3FFF0000,$A5FED6A9,$B15138EC,$A071A140
420 DC.L $3FFF0000,$A7CD93B4,$E9653568,$204F62DA
421 DC.L $3FFF0000,$A9A15AB4,$EA7C0EF8,$1F283C4A
422 DC.L $3FFF0000,$AB7A39B5,$A93ED338,$9F9A7FDC
423 DC.L $3FFF0000,$AD583EEA,$42A14AC8,$A05B3FAC
424 DC.L $3FFF0000,$AF3B78AD,$690A4374,$1FDF2610
425 DC.L $3FFF0000,$B123F581,$D2AC2590,$9F705F90
426 DC.L $3FFF0000,$B311C412,$A9112488,$201F678A
427 DC.L $3FFF0000,$B504F333,$F9DE6484,$1F32FB13
428 DC.L $3FFF0000,$B6FD91E3,$28D17790,$20038B30
429 DC.L $3FFF0000,$B8FBAF47,$62FB9EE8,$200DC3CC
430 DC.L $3FFF0000,$BAFF5AB2,$133E45FC,$9F8B2AE6
431 DC.L $3FFF0000,$BD08A39F,$580C36C0,$A02BBF70
432 DC.L $3FFF0000,$BF1799B6,$7A731084,$A00BF518
433 DC.L $3FFF0000,$C12C4CCA,$66709458,$A041DD41
434 DC.L $3FFF0000,$C346CCDA,$24976408,$9FDF137B
435 DC.L $3FFF0000,$C5672A11,$5506DADC,$201F1568
436 DC.L $3FFF0000,$C78D74C8,$ABB9B15C,$1FC13A2E
437 DC.L $3FFF0000,$C9B9BD86,$6E2F27A4,$A03F8F03
438 DC.L $3FFF0000,$CBEC14FE,$F2727C5C,$1FF4907D
439 DC.L $3FFF0000,$CE248C15,$1F8480E4,$9E6E53E4
440 DC.L $3FFF0000,$D06333DA,$EF2B2594,$1FD6D45C
441 DC.L $3FFF0000,$D2A81D91,$F12AE45C,$A076EDB9
442 DC.L $3FFF0000,$D4F35AAB,$CFEDFA20,$9FA6DE21
443 DC.L $3FFF0000,$D744FCCA,$D69D6AF4,$1EE69A2F
444 DC.L $3FFF0000,$D99D15C2,$78AFD7B4,$207F439F
445 DC.L $3FFF0000,$DBFBB797,$DAF23754,$201EC207
446 DC.L $3FFF0000,$DE60F482,$5E0E9124,$9E8BE175
447 DC.L $3FFF0000,$E0CCDEEC,$2A94E110,$20032C4B
448 DC.L $3FFF0000,$E33F8972,$BE8A5A50,$2004DFF5
449 DC.L $3FFF0000,$E5B906E7,$7C8348A8,$1E72F47A
450 DC.L $3FFF0000,$E8396A50,$3C4BDC68,$1F722F22
451 DC.L $3FFF0000,$EAC0C6E7,$DD243930,$A017E945
452 DC.L $3FFF0000,$ED4F301E,$D9942B84,$1F401A5B
453 DC.L $3FFF0000,$EFE4B99B,$DCDAF5CC,$9FB9A9E3
454 DC.L $3FFF0000,$F281773C,$59FFB138,$20744C05
455 DC.L $3FFF0000,$F5257D15,$2486CC2C,$1F773A19
456 DC.L $3FFF0000,$F7D0DF73,$0AD13BB8,$1FFE90D5
457 DC.L $3FFF0000,$FA83B2DB,$722A033C,$A041ED22
458 DC.L $3FFF0000,$FD3E0C0C,$F486C174,$1F853F3A
459
460 ADJFLAG equ L_SCR2
461 SCALE equ FP_SCR1
462 ADJSCALE equ FP_SCR2
463 SC equ FP_SCR3
464 ONEBYSC equ FP_SCR4
465
466 xref t_frcinx
467 xref t_extdnrm
468 xref t_unfl
469 xref t_ovfl
470
471 xdef setoxd
472 setoxd:
473 *--entry point for EXP(X), X is denormalized
474 MOVE.L (a0),d0
475 ANDI.L #$80000000,d0
476 ORI.L #$00800000,d0 ...sign(X)*2^(-126)
477 MOVE.L d0,-(sp)
478 FMOVE.S #:3F800000,fp0
479 fmove.l d1,fpcr
480 FADD.S (sp)+,fp0
481 bra t_frcinx
482
483 xdef setox
484 setox:
485 *--entry point for EXP(X), here X is finite, non-zero, and not NaN's
486
487 *--Step 1.
488 MOVE.L (a0),d0 ...load part of input X
489 ANDI.L #$7FFF0000,d0 ...biased expo. of X
490 CMPI.L #$3FBE0000,d0 ...2^(-65)
491 BGE.B EXPC1 ...normal case
492 BRA.W EXPSM
493
494 EXPC1:
495 *--The case |X| >= 2^(-65)
496 MOVE.W 4(a0),d0 ...expo. and partial sig. of |X|
497 CMPI.L #$400CB167,d0 ...16380 log2 trunc. 16 bits
498 BLT.B EXPMAIN ...normal case
499 BRA.W EXPBIG
500
501 EXPMAIN:
502 *--Step 2.
503 *--This is the normal branch: 2^(-65) <= |X| < 16380 log2.
504 FMOVE.X (a0),fp0 ...load input from (a0)
505
506 FMOVE.X fp0,fp1
507 FMUL.S #:42B8AA3B,fp0 ...64/log2 * X
508 fmovem.x fp2/fp3,-(a7) ...save fp2
509 MOVE.L #0,ADJFLAG(a6)
510 FMOVE.L fp0,d0 ...N = int( X * 64/log2 )
511 LEA EXPTBL,a1
512 FMOVE.L d0,fp0 ...convert to floating-format
513
514 MOVE.L d0,L_SCR1(a6) ...save N temporarily
515 ANDI.L #$3F,d0 ...D0 is J = N mod 64
516 LSL.L #4,d0
517 ADDA.L d0,a1 ...address of 2^(J/64)
518 MOVE.L L_SCR1(a6),d0
519 ASR.L #6,d0 ...D0 is M
520 ADDI.W #$3FFF,d0 ...biased expo. of 2^(M)
521 MOVE.W L2,L_SCR1(a6) ...prefetch L2, no need in CB
522
523 EXPCONT1:
524 *--Step 3.
525 *--fp1,fp2 saved on the stack. fp0 is N, fp1 is X,
526 *--a0 points to 2^(J/64), D0 is biased expo. of 2^(M)
527 FMOVE.X fp0,fp2
528 FMUL.S #:BC317218,fp0 ...N * L1, L1 = lead(-log2/64)
529 FMUL.X L2,fp2 ...N * L2, L1+L2 = -log2/64
530 FADD.X fp1,fp0 ...X + N*L1
531 FADD.X fp2,fp0 ...fp0 is R, reduced arg.
532 * MOVE.W #$3FA5,EXPA3 ...load EXPA3 in cache
533
534 *--Step 4.
535 *--WE NOW COMPUTE EXP(R)-1 BY A POLYNOMIAL
536 *-- R + R*R*(A1 + R*(A2 + R*(A3 + R*(A4 + R*A5))))
537 *--TO FULLY UTILIZE THE PIPELINE, WE COMPUTE S = R*R
538 *--[R+R*S*(A2+S*A4)] + [S*(A1+S*(A3+S*A5))]
539
540 FMOVE.X fp0,fp1
541 FMUL.X fp1,fp1 ...fp1 IS S = R*R
542
543 FMOVE.S #:3AB60B70,fp2 ...fp2 IS A5
544 * MOVE.W #0,2(a1) ...load 2^(J/64) in cache
545
546 FMUL.X fp1,fp2 ...fp2 IS S*A5
547 FMOVE.X fp1,fp3
548 FMUL.S #:3C088895,fp3 ...fp3 IS S*A4
549
550 FADD.D EXPA3,fp2 ...fp2 IS A3+S*A5
551 FADD.D EXPA2,fp3 ...fp3 IS A2+S*A4
552
553 FMUL.X fp1,fp2 ...fp2 IS S*(A3+S*A5)
554 MOVE.W d0,SCALE(a6) ...SCALE is 2^(M) in extended
555 clr.w SCALE+2(a6)
556 move.l #$80000000,SCALE+4(a6)
557 clr.l SCALE+8(a6)
558
559 FMUL.X fp1,fp3 ...fp3 IS S*(A2+S*A4)
560
561 FADD.S #:3F000000,fp2 ...fp2 IS A1+S*(A3+S*A5)
562 FMUL.X fp0,fp3 ...fp3 IS R*S*(A2+S*A4)
563
564 FMUL.X fp1,fp2 ...fp2 IS S*(A1+S*(A3+S*A5))
565 FADD.X fp3,fp0 ...fp0 IS R+R*S*(A2+S*A4),
566 * ...fp3 released
567
568 FMOVE.X (a1)+,fp1 ...fp1 is lead. pt. of 2^(J/64)
569 FADD.X fp2,fp0 ...fp0 is EXP(R) - 1
570 * ...fp2 released
571
572 *--Step 5
573 *--final reconstruction process
574 *--EXP(X) = 2^M * ( 2^(J/64) + 2^(J/64)*(EXP(R)-1) )
575
576 FMUL.X fp1,fp0 ...2^(J/64)*(Exp(R)-1)
577 fmovem.x (a7)+,fp2/fp3 ...fp2 restored
578 FADD.S (a1),fp0 ...accurate 2^(J/64)
579
580 FADD.X fp1,fp0 ...2^(J/64) + 2^(J/64)*...
581 MOVE.L ADJFLAG(a6),d0
582
583 *--Step 6
584 TST.L D0
585 BEQ.B NORMAL
586 ADJUST:
587 FMUL.X ADJSCALE(a6),fp0
588 NORMAL:
589 FMOVE.L d1,FPCR ...restore user FPCR
590 FMUL.X SCALE(a6),fp0 ...multiply 2^(M)
591 bra t_frcinx
592
593 EXPSM:
594 *--Step 7
595 FMOVEM.X (a0),fp0 ...in case X is denormalized
596 FMOVE.L d1,FPCR
597 FADD.S #:3F800000,fp0 ...1+X in user mode
598 bra t_frcinx
599
600 EXPBIG:
601 *--Step 8
602 CMPI.L #$400CB27C,d0 ...16480 log2
603 BGT.B EXP2BIG
604 *--Steps 8.2 -- 8.6
605 FMOVE.X (a0),fp0 ...load input from (a0)
606
607 FMOVE.X fp0,fp1
608 FMUL.S #:42B8AA3B,fp0 ...64/log2 * X
609 fmovem.x fp2/fp3,-(a7) ...save fp2
610 MOVE.L #1,ADJFLAG(a6)
611 FMOVE.L fp0,d0 ...N = int( X * 64/log2 )
612 LEA EXPTBL,a1
613 FMOVE.L d0,fp0 ...convert to floating-format
614 MOVE.L d0,L_SCR1(a6) ...save N temporarily
615 ANDI.L #$3F,d0 ...D0 is J = N mod 64
616 LSL.L #4,d0
617 ADDA.L d0,a1 ...address of 2^(J/64)
618 MOVE.L L_SCR1(a6),d0
619 ASR.L #6,d0 ...D0 is K
620 MOVE.L d0,L_SCR1(a6) ...save K temporarily
621 ASR.L #1,d0 ...D0 is M1
622 SUB.L d0,L_SCR1(a6) ...a1 is M
623 ADDI.W #$3FFF,d0 ...biased expo. of 2^(M1)
624 MOVE.W d0,ADJSCALE(a6) ...ADJSCALE := 2^(M1)
625 clr.w ADJSCALE+2(a6)
626 move.l #$80000000,ADJSCALE+4(a6)
627 clr.l ADJSCALE+8(a6)
628 MOVE.L L_SCR1(a6),d0 ...D0 is M
629 ADDI.W #$3FFF,d0 ...biased expo. of 2^(M)
630 BRA.W EXPCONT1 ...go back to Step 3
631
632 EXP2BIG:
633 *--Step 9
634 FMOVE.L d1,FPCR
635 MOVE.L (a0),d0
636 bclr.b #sign_bit,(a0) ...setox always returns positive
637 CMPI.L #0,d0
638 BLT t_unfl
639 BRA t_ovfl
640
641 xdef setoxm1d
642 setoxm1d:
643 *--entry point for EXPM1(X), here X is denormalized
644 *--Step 0.
645 bra t_extdnrm
646
647
648 xdef setoxm1
649 setoxm1:
650 *--entry point for EXPM1(X), here X is finite, non-zero, non-NaN
651
652 *--Step 1.
653 *--Step 1.1
654 MOVE.L (a0),d0 ...load part of input X
655 ANDI.L #$7FFF0000,d0 ...biased expo. of X
656 CMPI.L #$3FFD0000,d0 ...1/4
657 BGE.B EM1CON1 ...|X| >= 1/4
658 BRA.W EM1SM
659
660 EM1CON1:
661 *--Step 1.3
662 *--The case |X| >= 1/4
663 MOVE.W 4(a0),d0 ...expo. and partial sig. of |X|
664 CMPI.L #$4004C215,d0 ...70log2 rounded up to 16 bits
665 BLE.B EM1MAIN ...1/4 <= |X| <= 70log2
666 BRA.W EM1BIG
667
668 EM1MAIN:
669 *--Step 2.
670 *--This is the case: 1/4 <= |X| <= 70 log2.
671 FMOVE.X (a0),fp0 ...load input from (a0)
672
673 FMOVE.X fp0,fp1
674 FMUL.S #:42B8AA3B,fp0 ...64/log2 * X
675 fmovem.x fp2/fp3,-(a7) ...save fp2
676 * MOVE.W #$3F81,EM1A4 ...prefetch in CB mode
677 FMOVE.L fp0,d0 ...N = int( X * 64/log2 )
678 LEA EXPTBL,a1
679 FMOVE.L d0,fp0 ...convert to floating-format
680
681 MOVE.L d0,L_SCR1(a6) ...save N temporarily
682 ANDI.L #$3F,d0 ...D0 is J = N mod 64
683 LSL.L #4,d0
684 ADDA.L d0,a1 ...address of 2^(J/64)
685 MOVE.L L_SCR1(a6),d0
686 ASR.L #6,d0 ...D0 is M
687 MOVE.L d0,L_SCR1(a6) ...save a copy of M
688 * MOVE.W #$3FDC,L2 ...prefetch L2 in CB mode
689
690 *--Step 3.
691 *--fp1,fp2 saved on the stack. fp0 is N, fp1 is X,
692 *--a0 points to 2^(J/64), D0 and a1 both contain M
693 FMOVE.X fp0,fp2
694 FMUL.S #:BC317218,fp0 ...N * L1, L1 = lead(-log2/64)
695 FMUL.X L2,fp2 ...N * L2, L1+L2 = -log2/64
696 FADD.X fp1,fp0 ...X + N*L1
697 FADD.X fp2,fp0 ...fp0 is R, reduced arg.
698 * MOVE.W #$3FC5,EM1A2 ...load EM1A2 in cache
699 ADDI.W #$3FFF,d0 ...D0 is biased expo. of 2^M
700
701 *--Step 4.
702 *--WE NOW COMPUTE EXP(R)-1 BY A POLYNOMIAL
703 *-- R + R*R*(A1 + R*(A2 + R*(A3 + R*(A4 + R*(A5 + R*A6)))))
704 *--TO FULLY UTILIZE THE PIPELINE, WE COMPUTE S = R*R
705 *--[R*S*(A2+S*(A4+S*A6))] + [R+S*(A1+S*(A3+S*A5))]
706
707 FMOVE.X fp0,fp1
708 FMUL.X fp1,fp1 ...fp1 IS S = R*R
709
710 FMOVE.S #:3950097B,fp2 ...fp2 IS a6
711 * MOVE.W #0,2(a1) ...load 2^(J/64) in cache
712
713 FMUL.X fp1,fp2 ...fp2 IS S*A6
714 FMOVE.X fp1,fp3
715 FMUL.S #:3AB60B6A,fp3 ...fp3 IS S*A5
716
717 FADD.D EM1A4,fp2 ...fp2 IS A4+S*A6
718 FADD.D EM1A3,fp3 ...fp3 IS A3+S*A5
719 MOVE.W d0,SC(a6) ...SC is 2^(M) in extended
720 clr.w SC+2(a6)
721 move.l #$80000000,SC+4(a6)
722 clr.l SC+8(a6)
723
724 FMUL.X fp1,fp2 ...fp2 IS S*(A4+S*A6)
725 MOVE.L L_SCR1(a6),d0 ...D0 is M
726 NEG.W D0 ...D0 is -M
727 FMUL.X fp1,fp3 ...fp3 IS S*(A3+S*A5)
728 ADDI.W #$3FFF,d0 ...biased expo. of 2^(-M)
729 FADD.D EM1A2,fp2 ...fp2 IS A2+S*(A4+S*A6)
730 FADD.S #:3F000000,fp3 ...fp3 IS A1+S*(A3+S*A5)
731
732 FMUL.X fp1,fp2 ...fp2 IS S*(A2+S*(A4+S*A6))
733 ORI.W #$8000,d0 ...signed/expo. of -2^(-M)
734 MOVE.W d0,ONEBYSC(a6) ...OnebySc is -2^(-M)
735 clr.w ONEBYSC+2(a6)
736 move.l #$80000000,ONEBYSC+4(a6)
737 clr.l ONEBYSC+8(a6)
738 FMUL.X fp3,fp1 ...fp1 IS S*(A1+S*(A3+S*A5))
739 * ...fp3 released
740
741 FMUL.X fp0,fp2 ...fp2 IS R*S*(A2+S*(A4+S*A6))
742 FADD.X fp1,fp0 ...fp0 IS R+S*(A1+S*(A3+S*A5))
743 * ...fp1 released
744
745 FADD.X fp2,fp0 ...fp0 IS EXP(R)-1
746 * ...fp2 released
747 fmovem.x (a7)+,fp2/fp3 ...fp2 restored
748
749 *--Step 5
750 *--Compute 2^(J/64)*p
751
752 FMUL.X (a1),fp0 ...2^(J/64)*(Exp(R)-1)
753
754 *--Step 6
755 *--Step 6.1
756 MOVE.L L_SCR1(a6),d0 ...retrieve M
757 CMPI.L #63,d0
758 BLE.B MLE63
759 *--Step 6.2 M >= 64
760 FMOVE.S 12(a1),fp1 ...fp1 is t
761 FADD.X ONEBYSC(a6),fp1 ...fp1 is t+OnebySc
762 FADD.X fp1,fp0 ...p+(t+OnebySc), fp1 released
763 FADD.X (a1),fp0 ...T+(p+(t+OnebySc))
764 BRA.B EM1SCALE
765 MLE63:
766 *--Step 6.3 M <= 63
767 CMPI.L #-3,d0
768 BGE.B MGEN3
769 MLTN3:
770 *--Step 6.4 M <= -4
771 FADD.S 12(a1),fp0 ...p+t
772 FADD.X (a1),fp0 ...T+(p+t)
773 FADD.X ONEBYSC(a6),fp0 ...OnebySc + (T+(p+t))
774 BRA.B EM1SCALE
775 MGEN3:
776 *--Step 6.5 -3 <= M <= 63
777 FMOVE.X (a1)+,fp1 ...fp1 is T
778 FADD.S (a1),fp0 ...fp0 is p+t
779 FADD.X ONEBYSC(a6),fp1 ...fp1 is T+OnebySc
780 FADD.X fp1,fp0 ...(T+OnebySc)+(p+t)
781
782 EM1SCALE:
783 *--Step 6.6
784 FMOVE.L d1,FPCR
785 FMUL.X SC(a6),fp0
786
787 bra t_frcinx
788
789 EM1SM:
790 *--Step 7 |X| < 1/4.
791 CMPI.L #$3FBE0000,d0 ...2^(-65)
792 BGE.B EM1POLY
793
794 EM1TINY:
795 *--Step 8 |X| < 2^(-65)
796 CMPI.L #$00330000,d0 ...2^(-16312)
797 BLT.B EM12TINY
798 *--Step 8.2
799 MOVE.L #$80010000,SC(a6) ...SC is -2^(-16382)
800 move.l #$80000000,SC+4(a6)
801 clr.l SC+8(a6)
802 FMOVE.X (a0),fp0
803 FMOVE.L d1,FPCR
804 FADD.X SC(a6),fp0
805
806 bra t_frcinx
807
808 EM12TINY:
809 *--Step 8.3
810 FMOVE.X (a0),fp0
811 FMUL.D TWO140,fp0
812 MOVE.L #$80010000,SC(a6)
813 move.l #$80000000,SC+4(a6)
814 clr.l SC+8(a6)
815 FADD.X SC(a6),fp0
816 FMOVE.L d1,FPCR
817 FMUL.D TWON140,fp0
818
819 bra t_frcinx
820
821 EM1POLY:
822 *--Step 9 exp(X)-1 by a simple polynomial
823 FMOVE.X (a0),fp0 ...fp0 is X
824 FMUL.X fp0,fp0 ...fp0 is S := X*X
825 fmovem.x fp2/fp3,-(a7) ...save fp2
826 FMOVE.S #:2F30CAA8,fp1 ...fp1 is B12
827 FMUL.X fp0,fp1 ...fp1 is S*B12
828 FMOVE.S #:310F8290,fp2 ...fp2 is B11
829 FADD.S #:32D73220,fp1 ...fp1 is B10+S*B12
830
831 FMUL.X fp0,fp2 ...fp2 is S*B11
832 FMUL.X fp0,fp1 ...fp1 is S*(B10 + ...
833
834 FADD.S #:3493F281,fp2 ...fp2 is B9+S*...
835 FADD.D EM1B8,fp1 ...fp1 is B8+S*...
836
837 FMUL.X fp0,fp2 ...fp2 is S*(B9+...
838 FMUL.X fp0,fp1 ...fp1 is S*(B8+...
839
840 FADD.D EM1B7,fp2 ...fp2 is B7+S*...
841 FADD.D EM1B6,fp1 ...fp1 is B6+S*...
842
843 FMUL.X fp0,fp2 ...fp2 is S*(B7+...
844 FMUL.X fp0,fp1 ...fp1 is S*(B6+...
845
846 FADD.D EM1B5,fp2 ...fp2 is B5+S*...
847 FADD.D EM1B4,fp1 ...fp1 is B4+S*...
848
849 FMUL.X fp0,fp2 ...fp2 is S*(B5+...
850 FMUL.X fp0,fp1 ...fp1 is S*(B4+...
851
852 FADD.D EM1B3,fp2 ...fp2 is B3+S*...
853 FADD.X EM1B2,fp1 ...fp1 is B2+S*...
854
855 FMUL.X fp0,fp2 ...fp2 is S*(B3+...
856 FMUL.X fp0,fp1 ...fp1 is S*(B2+...
857
858 FMUL.X fp0,fp2 ...fp2 is S*S*(B3+...)
859 FMUL.X (a0),fp1 ...fp1 is X*S*(B2...
860
861 FMUL.S #:3F000000,fp0 ...fp0 is S*B1
862 FADD.X fp2,fp1 ...fp1 is Q
863 * ...fp2 released
864
865 fmovem.x (a7)+,fp2/fp3 ...fp2 restored
866
867 FADD.X fp1,fp0 ...fp0 is S*B1+Q
868 * ...fp1 released
869
870 FMOVE.L d1,FPCR
871 FADD.X (a0),fp0
872
873 bra t_frcinx
874
875 EM1BIG:
876 *--Step 10 |X| > 70 log2
877 MOVE.L (a0),d0
878 CMPI.L #0,d0
879 BGT.W EXPC1
880 *--Step 10.2
881 FMOVE.S #:BF800000,fp0 ...fp0 is -1
882 FMOVE.L d1,FPCR
883 FADD.S #:00800000,fp0 ...-1 + 2^(-126)
884
885 bra t_frcinx
886
887 end
888