Decimal Arithmetic Specification, version 1.70
Copyright (c) IBM Corporation, 2009. All rights reserved. ©
7 Apr 2009
The full specification in the body of this document defines a
decimal floating-point arithmetic which gives exact results and
preserves exponents where possible. If insufficient precision is
available for this, then numbers are handled according to the rules of
IEEE 854. The use of IEEE 854 rules implies that special values
(infinities and NaNs) are allowed, as are subnormal values and the value
–0.
For some applications and programming languages (especially those
intended for use by people who are not mathematically sophisticated),
it may be appropriate to provide an arithmetic where infinite, NaN, or
subnormal results are always treated as errors, –0 results are
hidden, and other (largely cosmetic) changes are provided to aid
acceptance of results.
The arithmetic defined in ANSI X3.274 is such an arithmetic; this
appendix describes the differences between this and the full specification.
Implementations which support this subset explicitly might provide the
subset behavior under the control of a parameter in the context[1]
or might provide a different interface (additional or parameterized
methods, for example).
Simplified number set
In the subset arithmetic, a reduced set of number values is supported
and (where appropriate) numbers with positive exponents have their
exponent reduced to zero.
Specifically:
-
In the to-number conversion, if the coefficient for a
finite number has the value zero, then the sign and the
exponent are both set to 0.
-
If the coefficient in a result has the value zero,
then the sign is set to 0 and (unless the operation is
quantize) the exponent is set to 0.[2]
-
In the to-number conversion, strings which represent special
values are not permitted. (That is, only finite numbers are accepted.)
-
Subnormal numbers are not permitted. If the result from a conversion or
operation would be subnormal then an Underflow error results (see
below).
-
After any operation and the rounding of its result (unless the operation
is quantize), a result with a positive exponent is converted to
an integer provided that the resulting coefficient would have
no more than precision digits.
In other words, in this case a positive exponent is reduced to 0 by
multiplying the coefficient by 10^exponent
(which has the effect of suffixing exponent zeros).[3]
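The zero rule and the positive-exponent rule above can be sketched as follows. This is a minimal illustration, assuming a number is held as a (sign, coefficient, exponent) triple of plain integers; the function name subset_normalize is hypothetical and not part of the specification, and the sketch does not apply to the quantize operation.

```python
def subset_normalize(sign, coefficient, exponent, precision):
    """Apply two subset result rules to a (sign, coefficient, exponent) triple.

    Hypothetical helper for illustration only; not valid for quantize.
    """
    # A zero coefficient loses both its sign and its exponent.
    if coefficient == 0:
        return 0, 0, 0
    # A positive exponent is reduced to 0 by suffixing zeros, but only
    # if the widened coefficient still fits in `precision` digits.
    if exponent > 0:
        widened = coefficient * 10 ** exponent
        if len(str(widened)) <= precision:
            return sign, widened, 0
    return sign, coefficient, exponent


print(subset_normalize(1, 0, -4, 9))   # a negative zero with an exponent
print(subset_normalize(0, 12, 3, 9))   # 12E+3 fits in 9 digits
print(subset_normalize(0, 12, 3, 4))   # 12000 needs 5 digits, so unchanged
```

With precision 9, the triple [0,12,3] becomes [0,12000,0]; with precision 4 it is left alone because 12000 would need five digits.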
Operation differences
In the subset arithmetic, operands are rounded before use if necessary
(as in Numerical Turing[4]
and Rexx),
the Lost digits condition is added to the context,
the results of some operations are trimmed,
the rounding rule after a subtraction is less conservative, and raising
0 to the power 0 is not treated as an error.
Specifically:
-
If the number of decimal digits in the coefficient of an
operand to an operation is greater than the current precision
in the context then the operand is rounded to precision
significant digits using the rounding algorithm described by
the context before being used in the computation.
In other words, an automatic ‘convert to shorter’ is applied before
the operation.
-
During an add or subtract operation,
if either number is zero then the other number, rounded to
precision if necessary, is used as the result (with sign
adjustment as appropriate).[5]
-
The Lost digits condition is added to the abstract context; it
should be set to 0 in default contexts.
This condition is raised when non-zero digits are discarded before an
operation. This can occur when an operand which has more leading
significant digits in its coefficient than the
precision setting is rounded to precision digits
before use.
Note that the lost digits test does not treat trailing decimal zeros in
the coefficient as significant. For example, if
precision had the value 5, then the operands
[0,12345,-5]
[0,12345,-2]
[0,12345,0]
[1,12345,0]
[0,123450000,-4]
[0,1234500000,0]
would not cause an exception (whereas [0,123451,-1]
or [0,1234500001,0] would).
-
After a divide or power operation is complete and the
result has been rounded, any insignificant trailing zeros are removed.
That is, if the exponent is not zero and the
coefficient is a multiple of a positive power of ten then the
coefficient is divided by that power of ten and the
exponent increased accordingly. If the exponent was
negative it will not be increased above zero.
-
After an addition operation, the result is rounded to precision
digits if necessary, taking into account any extra (carry) digit on the
left after an addition, but otherwise counting from the position
corresponding to the most significant digit of the operands being added
or subtracted (rather than the most significant digit of the result).
-
For the max and min operations, the first
(left-hand) operand is chosen if the operands are numerically equal.
-
If both operands to a power operation are zero then the result
is 1 (instead of being an error); however, if the left-hand operand
is zero the right-hand operand must not be negative.
-
The right-hand operand to a power operation may be an
integer, and subset implementations are only required to provide the
power function for integer powers.
In this case the algorithm described below may be used for
calculating the result.
-
The fused-multiply-add operation is not defined for subset
implementations, because the rounding of operands rule conflicts with
the requirement for fused-multiply-add to deliver a result with only
one rounding.
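Two of the rules in the list above, operand rounding with the Lost digits test and the removal of trailing zeros after divide or power, can be sketched as follows. This is an illustration under stated assumptions: numbers are held as non-negative integer coefficients with an integer exponent (the sign is omitted), rounding is round-half-up, and the names round_operand and trim_trailing_zeros are hypothetical rather than taken from the specification.

```python
def round_operand(coefficient, exponent, precision):
    """Round an operand to `precision` digits (half-up) before use.

    Returns (coefficient, exponent, lost_digits), where lost_digits is
    True only if non-zero digits were discarded. Hypothetical helper.
    """
    excess = len(str(coefficient)) - precision
    if excess <= 0:
        return coefficient, exponent, False
    scale = 10 ** excess
    kept, discarded = divmod(coefficient, scale)
    lost = discarded != 0            # trailing zeros are not significant
    if 2 * discarded >= scale:       # round half up
        kept += 1
        if len(str(kept)) > precision:   # carry, e.g. 99999 -> 100000
            kept //= 10
            excess += 1
    return kept, exponent + excess, lost


def trim_trailing_zeros(coefficient, exponent):
    """Remove insignificant trailing zeros after divide or power.

    A negative exponent is never increased above zero. Hypothetical helper.
    """
    if coefficient == 0:
        return coefficient, exponent     # zero is covered by its own rule
    while exponent != 0 and coefficient % 10 == 0:
        coefficient //= 10
        exponent += 1
    return coefficient, exponent


# Operands taken from the Lost digits examples, with precision 5.
print(round_operand(123450000, -4, 5))   # only zeros discarded
print(round_operand(123451, -1, 5))      # a non-zero digit is discarded
print(trim_trailing_zeros(2000, -2))     # stops when the exponent reaches 0
```

The loop in trim_trailing_zeros stops as soon as the exponent reaches zero, which is what keeps a negative exponent from rising above zero; an exponent that starts at zero is never touched.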
Exceptional condition and rounding mode rules
In the subset arithmetic, exceptional conditions other than the
informational conditions (Lost digits, Inexact, Rounded, and Subnormal)
must be treated as errors, and results after these errors are undefined.
Special values and subnormal numbers, therefore, are not part of the
arithmetic.
In the subset, only the Lost digits trap enabler is required. Inexact,
Rounded, and Subnormal trap enablers are optional, and the others are
(in effect) always set. Similarly, the status bits in the
context are optional.
Only the round-half-up rounding mode is required.
Calculating an integer power
Subset implementations are only required to provide the power
function for integer powers. To calculate this, the number
(left-hand operand) is in theory multiplied by itself for the number
of times expressed by the power.
If the right-hand operand is negative, the left-hand operand is used
as-is, the absolute value of the right-hand operand is used, and the
final result is inverted (divided into 1).[6]
In practice (see the note below for the reasons), the power is often
calculated by the process of left-to-right binary reduction.
For power(x, n): ‘n’ is converted to
binary, and a temporary accumulator is set to 1.
If ‘n’ has the value 0 then the initial calculation is
complete.
Otherwise each bit (starting at the first non-zero bit) is inspected
from left to right.
If the current bit is 1 then the accumulator is multiplied by
‘x’.
If all bits have now been inspected then the initial calculation is
complete, otherwise the accumulator is squared by multiplication and the
next bit is inspected.
The multiplications and any final division are done under the normal
arithmetic operation rules, using the precision supplied for the
operation, except that the multiplications (and the division, if
needed) are carried out using an increased precision of
precision+elength+1 digits.
Here, elength is the length in decimal digits of the integer
part (coefficient) of the whole number ‘n’ (i.e., excluding
any sign, decimal part, decimal point, or insignificant leading zeros).[7]
If, when raising to a negative power, an underflow occurs during the
division into 1, the operation is not halted at that point but
continues.[8]
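The binary reduction described above can be sketched as follows. This is an illustration only: it borrows Python's decimal module for the multiplications and the final division, and that module implements the full arithmetic rather than the subset, so operand rounding, trailing-zero removal, and the underflow behavior during division are not modeled. The function name subset_power and its parameters are hypothetical.

```python
from decimal import Decimal, Context, ROUND_HALF_UP


def subset_power(x, n, precision):
    """Raise Decimal x to the integer power n by left-to-right binary reduction.

    Hypothetical sketch; not a complete subset implementation.
    """
    # elength: number of decimal digits in the whole number n, sign excluded.
    elength = len(str(abs(n)))
    # Intermediate steps use the increased precision; the result is then
    # rounded once to the requested precision.
    work = Context(prec=precision + elength + 1, rounding=ROUND_HALF_UP)
    final = Context(prec=precision, rounding=ROUND_HALF_UP)

    acc = Decimal(1)
    if n != 0:
        bits = bin(abs(n))[2:]           # binary digits, first bit is 1
        for i, bit in enumerate(bits):
            if bit == "1":
                acc = work.multiply(acc, x)
            if i < len(bits) - 1:        # more bits remain: square
                acc = work.multiply(acc, acc)
        if n < 0:
            acc = work.divide(Decimal(1), acc)   # invert for negative powers
    return final.plus(acc)               # round to the requested precision


print(subset_power(Decimal(2), 10, 9))
print(subset_power(Decimal(2), -2, 9))
```

For n = 10 (binary 1010) the accumulator takes the values 2, 4, 16, 32, 1024, using five multiplications where repeated multiplication would need more.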
Notes:
-
A particular algorithm for calculating integer powers is described,
since it is efficient (though not optimal) and considerably reduces
the number of actual multiplications performed.
It therefore gives better performance than the simpler definition of
repeated multiplication.
Since results can occasionally differ from those of repeated
multiplication, the algorithm is defined here so that different
implementations which use it will give identical results for the
power operation on the same values, and may therefore use the same
testcases. Other algorithms for the power operation may also be
used, so long as the result is within 1 ulp (unit in last place) of
the correct result.
-
Implementations are encouraged to provide a power operator which will
accept a non-integral right-hand operand when the left-hand operand
is non-negative, as described in the body of this specification.
Footnotes:
[1] The decNumber package, for example, provides the subset behavior if the
extended bit is set to 0.
[2] This rule, together with the to-number definition, ensures that
numbers with values such as -0 or 0.0000 will not result from
general operations in the subset arithmetic.
This allows a concrete representation for the subset to comprise simply
two integers in two’s complement form.
[3] The underlying intent here is that positive exponents in the operands
are reduced to zero before the operation, so that all
operations take place on numbers that could be expressed as
‘plain’ decimal numbers with no exponent. The rule is expressed
as a constraint on the result because it is often more convenient or
efficient to implement it in this way.
The rule also preserves integers as specified by ANSI X3.274, and in
particular ensures that the results of the divide and
divide-integer operations are identical when the result is an
exact integer.
[4] See: T. E. Hull, A. Abrham, M. S. Cohen, A. F. X. Curley, C. B. Hall,
D. A. Penny, and J. T. M. Sawchuk,
Numerical Turing,
SIGNUM Newsletter, vol. 20 #3, pp26-34, ACM, May 1985.
[5] In the subset arithmetic, zeros have no exponent.
[6] This rule is slightly more complicated than inverting before the
calculation, in that it requires special treatment of overflow and
underflow conditions (which were not an issue in X3.274).
[7] The precision specified for the intermediate calculations ensures
that the final result will differ by at most 1, in the least
significant position, from the ‘true’ result
(given that the operands are expressed precisely under the current
setting of digits).
Half of this maximum error comes from the intermediate calculation, and
half from the final rounding.
[8] It can only be halted early if the result becomes zero.