$NetBSD: softfloat.txt,v 1.1 2000/06/06 08:15:10 bjh21 Exp $

SoftFloat Release 2a General Documentation

John R. Hauser
1998 December 13


-------------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic. As many as four
formats are supported: single precision, double precision, extended double
precision, and quadruple precision. All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat. It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard. Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code. The
SoftFloat header files assume an ISO/ANSI-style C compiler. No attempt
has been made to accommodate compilers that are not ISO-conformant. In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic. If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to single and double precision only. When that is the case, all
references in this document to extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



-------------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser. This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704. Funding was partially
provided by the National Science Foundation under grant MIP-9311980. The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types: `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision). The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used. The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types: `float32' and `float64'. Because
ISO/ANSI C guarantees at least one built-in integer type of 32 bits,
the `float32' type is identified with an appropriate integer type. The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types. (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)
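
The in-memory equivalence described above can be illustrated with a small
sketch. It is not taken from the SoftFloat sources: `demo_float32' is a
hypothetical stand-in for `float32' (assumed here to be a plain 32-bit
unsigned integer), the helper names are invented for the example, and the
bit patterns hold only if the native `float' follows the IEC/IEEE
single-precision format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SoftFloat's `float32'; assumed to be a plain
   32-bit unsigned integer. */
typedef uint32_t demo_float32;

/* Copy the bytes of a native float into the integer-based type. */
static demo_float32 demo_from_native(float f)
{
    demo_float32 bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Copy the bytes of the integer-based type back into a native float. */
static float demo_to_native(demo_float32 bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

On a system with IEC/IEEE native floats, `demo_from_native(1.0f)' yields the
pattern 0x3F800000. Using `memcpy' rather than a pointer cast keeps the
sketch within the aliasing rules of C.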

SoftFloat implements the following arithmetic operations:

    -- Conversions among all the floating-point formats, and also between
       integers (32-bit and 64-bit) and any of the floating-point formats.

    -- The usual add, subtract, multiply, divide, and square root operations
       for all floating-point formats.

    -- For each format, the floating-point remainder operation defined by the
       IEC/IEEE Standard.

    -- For each floating-point format, a ``round to integer'' operation that
       rounds to the nearest integer value in the same format. (The
       floating-point formats can hold integer values, of course.)

    -- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.


-------------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding. The rounding mode is selected
by the global variable `float_rounding_mode'. This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'. The rounding mode is initialized
to nearest/even.


-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'. The operations affected are:

    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format. Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively. When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified. Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.
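
To make ``bits beyond the rounding point are set to zero'' concrete, the
following sketch rounds a 64-bit significand to a smaller number of leading
bits. It is an illustration only, not SoftFloat's code. The function name
`demo_round_significand' is hypothetical, the sketch handles only the
nearest/even mode, and it assumes the usual significand widths: 64 bits for
extended double precision, 53 for double precision, and 24 for single
precision.

```c
#include <stdint.h>

/* Round the 64-bit significand `sig' to its `keep' leading bits using
   nearest/even, zeroing all discarded low bits.  Assumes 1 <= keep <= 64.
   *carry is set to 1 when rounding up overflows past bit 63; the result is
   then renormalized to 0x8000000000000000 and the caller would have to
   increment the exponent. */
static uint64_t demo_round_significand(uint64_t sig, int keep, int *carry)
{
    int drop = 64 - keep;             /* number of low bits to discard */
    uint64_t mask, unit, half, low, result;

    *carry = 0;
    if (drop <= 0) return sig;        /* full precision: nothing to do */

    mask = (UINT64_C(1) << drop) - 1; /* selects the discarded bits */
    unit = mask + 1;                  /* weight of the last kept bit */
    half = UINT64_C(1) << (drop - 1); /* exactly half of `unit' */
    low = sig & mask;
    result = sig & ~mask;             /* discarded bits become zero */

    /* Round up when above half, or exactly half with an odd kept part. */
    if (low > half || (low == half && (result & unit) != 0)) {
        uint64_t bumped = result + unit;
        if (bumped < result) {        /* wrapped around: carry out */
            *carry = 1;
            bumped = UINT64_C(1) << 63;
        }
        result = bumped;
    }
    return result;
}
```

With `keep' equal to 53, the low 11 bits of the returned significand are
always zero, which corresponds to the reduced-precision behavior described
for a rounding precision of 64.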


-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented. Each flag is stored as a unique bit in the global variable
`float_exception_flags'. The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'. The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name. To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).
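
The flag handling described above follows the ordinary bit-mask pattern,
sketched here in a self-contained form. Every name beginning with `demo_'
is a hypothetical stand-in, and the mask values are invented for the
example; in real code the variable and masks come from `softfloat.h', and
raising should go through `float_raise', which may do more than set bits.

```c
/* Hypothetical stand-ins for the SoftFloat flag masks.  The values are
   made up for this sketch; the real ones are defined in `softfloat.h'. */
enum {
    demo_flag_inexact   = 1,
    demo_flag_underflow = 2,
    demo_flag_overflow  = 4,
    demo_flag_divbyzero = 8,
    demo_flag_invalid   = 16
};

/* Stand-in for `float_exception_flags'; all 0 means no exceptions. */
static int demo_exception_flags = 0;

/* Stand-in for raising exceptions: this sketch only sets the bits. */
static void demo_raise(int mask)
{
    demo_exception_flags |= mask;
}

/* Return 1 if any flag in `mask' is currently set, 0 otherwise. */
static int demo_test(int mask)
{
    return (demo_exception_flags & mask) != 0;
}

/* Clear the flags in `mask', using the idiom given in the text. */
static void demo_clear(int mask)
{
    demo_exception_flags &= ~ mask;
}
```

Because the flags are sticky, a typical use is to clear the flags of
interest, perform a sequence of operations, and then test the flags once at
the end.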

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding. The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals. The other option is provided for compatibility
with some systems. Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


-------------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers. The complete set of conversion functions is:

    int32_to_float32     int64_to_float32
    int32_to_float64     int64_to_float64
    int32_to_floatx80    int64_to_floatx80
    int32_to_float128    int64_to_float128

    float32_to_int32     float32_to_int64
    float64_to_int32     float64_to_int64
    floatx80_to_int32    floatx80_to_int64
    float128_to_int32    float128_to_int64

    float32_to_float64     float32_to_floatx80     float32_to_float128
    float64_to_float32     float64_to_floatx80     float64_to_float128
    floatx80_to_float32    floatx80_to_float64     floatx80_to_float128
    float128_to_float32    float128_to_float64     float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result. Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding. Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits). If the floating-point operand is a NaN, the largest
positive integer is returned. Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'. Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

    float32_to_int32_round_to_zero     float32_to_int64_round_to_zero
    float64_to_int32_round_to_zero     float64_to_int64_round_to_zero
    floatx80_to_int32_round_to_zero    floatx80_to_int64_round_to_zero
    float128_to_int32_round_to_zero    float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.
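
The rules above for out-of-range and NaN operands, combined with rounding
toward zero, can be sketched on a native `double'. This is an illustration
under stated assumptions, not SoftFloat's implementation: the function
`demo_to_int32_round_to_zero' and its `invalid' output parameter are
hypothetical, the native `double' is assumed to follow the IEC/IEEE format,
and the inexact flag is not modeled.

```c
#include <stdint.h>

/* Convert `x' to a 32-bit signed integer, rounding toward zero.
   *invalid is set to 1 where the text says the invalid exception is
   raised, and to 0 otherwise. */
static int32_t demo_to_int32_round_to_zero(double x, int *invalid)
{
    *invalid = 0;
    if (x != x) {                   /* only a NaN is unequal to itself */
        *invalid = 1;
        return INT32_MAX;           /* NaN: largest positive integer */
    }
    if (x >= 2147483648.0) {        /* truncated value exceeds INT32_MAX */
        *invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {       /* truncated value is below INT32_MIN */
        *invalid = 1;
        return INT32_MIN;           /* same sign as the operand */
    }
    return (int32_t) x;             /* a C cast truncates toward zero */
}
```

Note that a value such as -2147483648.5 is still in range, because rounding
toward zero brings it to -2147483648, which is representable.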

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard. The remainder functions are:

    float32_rem
    float64_rem
    floatx80_rem
    float128_rem

Each remainder function takes two operands. The operands and result are all
of the same type. Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y. If x/y is exactly
halfway between two integers, n is the even integer closest to x/y. The
remainder functions are always exact and so require no rounding.
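
The definition x - n*y can be tried out with the following sketch on native
doubles. It is meant only to illustrate the choice of n and is not how
SoftFloat computes the remainder: the names are hypothetical, the quotient
x/y is itself rounded here, and the code is valid only for finite operands
of modest magnitude with y nonzero.

```c
/* Round `q' to the nearest integer, with ties going to the even integer.
   Valid only when `q' fits comfortably in a long long. */
static long long demo_nearest_even(double q)
{
    long long n = (long long) q;    /* truncates toward zero */
    double diff = q - (double) n;   /* fractional part, same sign as q */

    if (diff > 0.5 || (diff == 0.5 && n % 2 != 0)) {
        n += 1;
    } else if (diff < -0.5 || (diff == -0.5 && n % 2 != 0)) {
        n -= 1;
    }
    return n;
}

/* Return x - n*y, where n is the integer closest to x/y (ties to even). */
static double demo_rem(double x, double y)
{
    long long n = demo_nearest_even(x / y);
    return x - (double) n * y;
}
```

For example, with x = 5 and y = 2 the quotient 2.5 is a tie, so n = 2 and
the remainder is 1. With x = 7 and y = 2 the quotient 3.5 is also a tie,
so n = 4 and the remainder is -1. Unlike the C `%' operator, the result can
be negative even when both operands are positive.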

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions. This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard. The functions are:

    float32_round_to_int
    float64_round_to_int
    floatx80_round_to_int
    float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type. (Note that the result is not an integer type.) The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.
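
How the four rounding modes affect a round-to-integer operation can be seen
in the sketch below, which works on native doubles. The enumeration and
function names are hypothetical stand-ins chosen for the example, and the
code is not SoftFloat's: it ignores NaNs, infinities, signed zeros, and
magnitudes too large for a long long.

```c
/* Hypothetical stand-ins for the four rounding modes. */
enum demo_rounding_mode {
    demo_round_nearest_even,
    demo_round_to_zero,
    demo_round_down,
    demo_round_up
};

/* Round `x' to an integer value, returned in floating-point format. */
static double demo_round_to_int(double x, enum demo_rounding_mode mode)
{
    long long t = (long long) x;    /* truncates toward zero */
    double r = (double) t;
    double diff = x - r;            /* fractional part, same sign as x */

    switch (mode) {
    case demo_round_to_zero:
        break;                      /* truncation is already the answer */
    case demo_round_down:
        if (diff < 0.0) r -= 1.0;   /* toward minus infinity */
        break;
    case demo_round_up:
        if (diff > 0.0) r += 1.0;   /* toward plus infinity */
        break;
    case demo_round_nearest_even:
        if (diff > 0.5 || (diff == 0.5 && t % 2 != 0)) {
            r += 1.0;
        } else if (diff < -0.5 || (diff == -0.5 && t % 2 != 0)) {
            r -= 1.0;
        }
        break;
    }
    return r;
}
```

For the operand 2.5, the nearest/even, to-zero, and down modes all give 2,
while the up mode gives 3. For -2.5, the down mode gives -3 and the other
three modes give -2.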

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

    float32_eq    float32_le    float32_lt
    float64_eq    float64_le    float64_lt
    floatx80_eq   floatx80_le   floatx80_lt
    float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_. The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided. The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.
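
The derivations just described can be written out as follows. To keep the
sketch self-contained, the three primitives are hypothetical stand-ins
defined on native doubles; in real code they would be replaced by the
SoftFloat functions for the format in use, such as `float64_eq',
`float64_le', and `float64_lt'.

```c
/* Hypothetical stand-ins for the provided eq, le, and lt functions. */
static int demo_eq(double a, double b) { return a == b; }
static int demo_le(double a, double b) { return a <= b; }
static int demo_lt(double a, double b) { return a < b; }

/* Not-equal is the logical complement of equal. */
static int demo_ne(double a, double b) { return ! demo_eq(a, b); }

/* Greater-than-or-equal is less-than-or-equal with operands reversed. */
static int demo_ge(double a, double b) { return demo_le(b, a); }

/* Greater-than is less-than with operands reversed. */
static int demo_gt(double a, double b) { return demo_lt(b, a); }
```

Reversing the operands, rather than complementing the result, matters when
a NaN is involved: every ordered comparison with a NaN is false, so
greater-than is not the complement of less-than-or-equal.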

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs. For completeness, SoftFloat provides the following
additional functions:

    float32_eq_signaling    float32_le_quiet    float32_lt_quiet
    float64_eq_signaling    float64_le_quiet    float64_lt_quiet
    floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
    float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input. Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

    float32_is_signaling_nan
    float64_is_signaling_nan
    floatx80_is_signaling_nan
    float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.
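
As an illustration of what such a test involves for single precision, the
sketch below examines the bit pattern directly. It is not SoftFloat's code
and the function name is hypothetical. It assumes the common convention in
which a NaN has an all-ones exponent and a nonzero fraction, and the most
significant fraction bit is 1 for a quiet NaN and 0 for a signaling NaN.
Some machines use the opposite convention for that bit, so the actual test
is target-specific.

```c
#include <stdint.h>

/* Return 1 if the single-precision bit pattern `a' is a signaling NaN
   under the convention assumed above, and 0 otherwise. */
static int demo_float32_is_signaling_nan(uint32_t a)
{
    uint32_t exponent  = (a >> 23) & 0xFF;  /* 8-bit exponent field */
    uint32_t fraction  = a & 0x007FFFFF;    /* 23-bit fraction field */
    uint32_t quiet_bit = a & 0x00400000;    /* leading fraction bit */

    return exponent == 0xFF && fraction != 0 && quiet_bit == 0;
}
```

Under this convention the pattern 0x7FA00000 is a signaling NaN, 0x7FC00000
is a quiet NaN, and 0x7F800000 is an infinity (zero fraction), so only the
first makes the function return 1.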

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise. No
result is returned. In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.
373