Decimal Arithmetic Specification, version 1.70
Copyright (c) IBM Corporation, 2009. All rights reserved.
7 Apr 2009
This section describes the arithmetic operations on, and some other
functions of, numbers, including subnormal numbers, negative zeros,
and special values (see also IEEE 754 §5 and §6).
The operations described are:
Arithmetic operation notation
In this section, a simplified notation is used to illustrate arithmetic
operations: a number is shown as the string that would result from using
the to-scientific-string operation.
Single quotes are used to indicate that a number converted from an
abstract representation is implied.
Also, operations are indicated as functions (taking up to three
operands), and the sequence ==> means ‘results in’.
Hence:
add(’12’, ’7.00’) ==> ’19.00’
means that the result of the add operation with the
operands [0,12,0] and [0,700,-2] is [0,1900,-2].
Finally, in this example and in the examples below, the context is
assumed to have precision set to 9, rounding set to
round-half-up, and all trap-enablers set to 0.
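As an illustration only, the notation can be tried out with Python's standard decimal module, whose documentation describes it as based on this specification. The sketch below assumes that module as a stand-in for an implementation; the names ctx and total are hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Context assumed by the examples: precision 9, round-half-up, no traps enabled.
ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

total = ctx.add(Decimal("12"), Decimal("7.00"))
print(total)             # 19.00
print(total.as_tuple())  # sign 0, digits (1, 9, 0, 0), exponent -2
```

The tuple printed on the last line corresponds to the abstract representation [0,1900,-2].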
Arithmetic operation rules
The following general rules apply to all arithmetic operations except
where stated below.
-
Every operation on finite numbers is carried out (as described under the
individual operations below) as though an exact mathematical result is
computed, using integer arithmetic on the coefficient where possible.
If the coefficient of the theoretical exact result has no more than
precision digits, then (unless there is an underflow or
overflow) it is used for the result without change.
Otherwise (it has more than precision digits) it is rounded
(shortened) to exactly precision digits, using the current
rounding algorithm, and the exponent is increased by
the number of digits removed.
If the value of the adjusted exponent of
the result is less than Emin (that is, the result is zero
or subnormal), the calculated coefficient and exponent form the
result, unless the value of the exponent is less than
Etiny, in which case the exponent will be set to
Etiny, the coefficient will be rounded (if necessary, and
possibly to zero) to match the adjustment of the exponent, and
the sign is unchanged.
If the result (before rounding) was non-zero and subnormal then the
Subnormal exceptional condition is raised.
If rounding of a subnormal result leads to an inexact result then the
Underflow exceptional condition is also raised.
If the value of the adjusted exponent of a non-zero result
is larger than Emax, then an
exceptional condition (overflow) results. In this case, the result
is as defined under the Overflow exceptional
condition, and may be infinite.
It will have the same sign as the theoretical result.[1]
-
Arithmetic using the special value infinity follows the usual
rules, where [1,inf] is less than every finite number
and [0,inf] is greater than every finite number. Under these
rules, an infinite result is always exact. Certain uses of infinity
raise an Invalid operation condition.
-
Signaling NaNs always raise the Invalid operation condition
when used as an operand to an arithmetic operation.
-
The Invalid operation condition may also be raised when an operand to
an operation is invalid (for example, if it exceeds the bounds that
an implementation can handle, or the operation is a logarithm and the
operand is negative).
-
The result of any arithmetic operation which has an operand which is a NaN
(a quiet NaN or a signaling NaN) is [s,qNaN]
or [s,qNaN,d]. The sign and any diagnostic information are copied
from the first operand which is a signaling NaN, or if no operand is
signaling then from the first operand which is a NaN.
Whenever a result is a NaN, the sign of the result depends only on the
copied operand (the following rules do not apply).
-
The sign of the result of a multiplication or division will be 1
only if the operands have different signs.
-
The sign of the result of an addition or subtraction will be 1
only if the result is less than zero, except for the special cases below
where the result is a negative 0.
- A result which is a negative zero ([1,0,n]) can occur
only when:
-
a result is rounded to zero, and the value before rounding had a
sign of 1.
-
the operation is an addition of [1,0,n]
to [1,0,n], or a subtraction of [0,0,n]
from [1,0,n]
-
the operation is an addition of operands with opposite signs (or is a
subtraction of operands with the same sign), the result has a
coefficient of 0, and the rounding is
round-floor.
-
the operation is a multiplication or division and the result has a
coefficient of 0 and the signs of the operands are different.
-
the operation is power, the left-hand operand
is [1,0,n], and the right-hand operand is positive,
integral, and odd.
-
the operation is power, the left-hand operand
is [1,inf], and the right-hand operand is negative,
integral, and odd.
-
the operation is quantize or a round-to-integral,
the left-hand operand is negative, and the magnitude of the result is
zero. In either case the final exponent may be non-zero.
-
the operation is square-root and the operand
is [1,0,n].
-
the operation is one of the operations max,
max-magnitude, min, min-magnitude,
next-plus, next-toward, reduce, or is a
copy operation.
Examples involving special values:
add(’Infinity’, ’1’) ==> ’Infinity’
add(’NaN’, ’1’) ==> ’NaN’
add(’NaN’, ’Infinity’) ==> ’NaN’
subtract(’1’, ’Infinity’) ==> ’-Infinity’
multiply(’-1’, ’Infinity’) ==> ’-Infinity’
subtract(’-0’, ’0’) ==> ’-0’
multiply(’-1’, ’0’) ==> ’-0’
divide(’1’, ’0’) ==> ’Infinity’
divide(’1’, ’-0’) ==> ’-Infinity’
divide(’-1’, ’0’) ==> ’-Infinity’
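Several of these special-value cases can be reproduced with a short sketch. It assumes Python's decimal module as the implementation and a context with no traps enabled, so that conditions set flags rather than raising exceptions; the variable names are illustrative.

```python
from decimal import Context, Decimal, DivisionByZero, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
inf, zero, neg_zero = Decimal("Infinity"), Decimal("0"), Decimal("-0")

print(ctx.add(Decimal("NaN"), inf))        # NaN
print(ctx.subtract(Decimal("1"), inf))     # -Infinity
print(ctx.subtract(neg_zero, zero))        # -0
print(ctx.multiply(Decimal("-1"), zero))   # -0
print(ctx.divide(Decimal("1"), neg_zero))  # -Infinity
print(bool(ctx.flags[DivisionByZero]))     # True
```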
Notes:
-
Operands may have more than precision digits and are not
rounded before use.
-
The context (precision and rounding, etc.) for an operation
might be wholly implied, or be a global or local setting, or be
passed to operations individually – depending on the
implementation of the specification (for example, in a programming
language).
-
NaNs propagate any associated diagnostic information as described in
IEEE 754 §6.2. The meaning of any such diagnostic information is
outside the scope of this specification, but typically indicates the
origin of the NaN.
In IEEE 754-2008, this information is only held in the coefficient of
decimal numbers and does not use the first digit of the coefficient.
-
The rules above imply that the compare operation can return
a quiet NaN as a result, which indicates an ‘unordered’
comparison (see IEEE 754 §5.11).
-
An implementation may use the compare operation ‘under the
covers’ to implement a closed set of comparison operations (greater
than, equal, etc.) if desired. In this case, the additional constraints
detailed in IEEE 754 §5.11 will apply; that is, a comparison (such as
‘greater than’) which does not explicitly allow for an
‘unordered’ result yet would require an unordered result will give
rise to an Invalid operation condition.
-
If a result is rounded, remains finite, and is not subnormal,
its coefficient will have exactly precision digits (except
after the quantize or round-to-integral operations,
as described below). That is, only unrounded or subnormal
coefficients can have fewer than precision digits.
-
Trailing zeros are not removed after operations.
The reduce operation may be used to remove trailing zeros if
desired.
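The rounding, subnormal, and overflow rules, together with the notes on unrounded operands and trailing zeros, might be exercised as follows. This is a sketch assuming Python's decimal module, with Emax and Emin set to +999 and -999 purely for illustration; in that module the reduce operation is exposed under the name normalize.

```python
from decimal import (Context, Decimal, Overflow, Subnormal, Underflow,
                     ROUND_HALF_UP)

ctx = Context(prec=9, rounding=ROUND_HALF_UP, Emin=-999, Emax=999, traps=[])

# Operands are used unrounded; only the result is shortened to 9 digits.
rounded = ctx.add(Decimal("1.23456789012"), Decimal("0"))
print(rounded)  # 1.23456789

# The exact product has exponent -1013, below Etiny (-1007), so the
# coefficient is rounded to fit and Subnormal and Underflow are raised.
tiny = ctx.multiply(Decimal("1E-600"), Decimal("1.23456789E-405"))
print(tiny, bool(ctx.flags[Subnormal]), bool(ctx.flags[Underflow]))
# 1.23E-1005 True True

# Adjusted exponent 1200 exceeds Emax, so the Overflow condition applies.
huge = ctx.multiply(Decimal("1E+600"), Decimal("1E+600"))
print(huge, bool(ctx.flags[Overflow]))  # Infinity True

# Trailing zeros are kept unless explicitly removed.
kept = ctx.multiply(Decimal("1.20"), Decimal("3"))
print(kept, ctx.normalize(kept))  # 3.60 3.6
```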
abs takes one operand.
If the operand is negative, the result is the same as using the
minus operation on the operand.
Otherwise, the result is the same as using the
plus operation on the operand.
Examples:
abs(’2.1’) ==> ’2.1’
abs(’-100’) ==> ’100’
abs(’101.5’) ==> ’101.5’
abs(’-101.5’) ==> ’101.5’
Note that the result of this operation is affected by context and may
set flags. The copy-abs
operation may be used if this is not desired.
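The difference between abs and copy-abs can be seen with an operand longer than precision. The sketch assumes Python's decimal module, where copy-abs is available as the copy_abs method.

```python
from decimal import Context, Decimal, Inexact, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
x = Decimal("-1.23456789012")

print(ctx.abs(x))                # 1.23456789 (rounded to 9 digits)
print(bool(ctx.flags[Inexact]))  # True
print(x.copy_abs())              # 1.23456789012 (context not used)
```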
add and subtract both take two operands.
If either operand is a special value then the general rules
apply.
Otherwise, the operands are added (after inverting the sign
used for the second operand if the operation is a subtraction), as
follows:
-
The coefficient of the result is computed by adding or
subtracting the aligned coefficients of the two operands. The aligned
coefficients are computed by comparing the exponents of the operands:
- If they have the same exponent, the aligned coefficients are the
same as the original coefficients.
-
Otherwise the aligned coefficient of the number with the larger exponent
is its original coefficient multiplied by 10^n,
where n is the absolute difference between the exponents, and the
aligned coefficient of the other operand is the same as its original
coefficient.
If the signs of the operands differ then the smaller aligned coefficient
is subtracted from the larger; otherwise they are added.
-
The exponent of the result is the minimum of the exponents of
the two operands.
-
The sign of the result is determined as follows:
-
If the result is non-zero then the sign of the result is the sign of the
operand having the larger absolute value.
-
Otherwise, the sign of a zero result is 0 unless either both
operands were negative or the signs of the operands were different and
the rounding is round-floor.
The result is then rounded to precision digits if necessary,
counting from the most significant digit of the result.
Examples:
add(’12’, ’7.00’) ==> ’19.00’
add(’1E+2’, ’1E+4’) ==> ’1.01E+4’
subtract(’1.3’, ’1.07’) ==> ’0.23’
subtract(’1.3’, ’1.30’) ==> ’0.00’
subtract(’1.3’, ’2.07’) ==> ’-0.77’
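The alignment and sign rules above can be sketched on [sign, coefficient, exponent] triples. This is an illustrative model, not a definitive implementation: the helper names add_unrounded and to_decimal are hypothetical, the final rounding to precision is omitted, and Python's Decimal is used only to render the resulting triple as a string.

```python
from decimal import Decimal

def add_unrounded(a, b, subtract=False, round_floor=False):
    """Add (or subtract) finite numbers given as (sign, coefficient, exponent)."""
    (sa, ca, ea), (sb, cb, eb) = a, b
    if subtract:
        sb = 1 - sb  # invert the sign used for the second operand
    exp = min(ea, eb)
    # Align: scale the coefficient that belongs to the larger exponent.
    ca *= 10 ** (ea - exp)
    cb *= 10 ** (eb - exp)
    if sa == sb:
        coeff, sign = ca + cb, sa
    else:
        coeff = abs(ca - cb)
        sign = sa if ca > cb else sb
    if coeff == 0:
        # Zero result: sign 0 unless both operands are negative, or the
        # signs differ and the rounding is round-floor.
        sign = 1 if (sa == sb == 1) or (sa != sb and round_floor) else 0
    return (sign, coeff, exp)

def to_decimal(triple):
    sign, coeff, exp = triple
    return Decimal((sign, tuple(int(d) for d in str(coeff)), exp))

print(to_decimal(add_unrounded((0, 12, 0), (0, 700, -2))))                  # 19.00
print(to_decimal(add_unrounded((0, 1, 2), (0, 1, 4))))                      # 1.01E+4
print(to_decimal(add_unrounded((0, 13, -1), (0, 130, -2), subtract=True)))  # 0.00
print(to_decimal(add_unrounded((0, 13, -1), (0, 207, -2), subtract=True)))  # -0.77
```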
compare takes two operands and compares their values numerically.
If either operand is a special value then the general rules
apply. No flags are set unless an operand is a signaling NaN.
Otherwise, the operands are compared as follows.
If the signs of the operands differ, a value representing each operand
(’-1’ if the operand is less than zero, ’0’ if the operand
is zero or negative zero, or ’1’ if the operand is greater than
zero) is used in place of that operand for the comparison instead of the
actual operand.[2]
The comparison is then effected by subtracting the second operand from
the first
and then returning a value according to the result of the
subtraction: ’-1’ if the result is less than zero, ’0’ if
the result is zero or negative zero, or ’1’ if the result is
greater than zero.
An implementation may use this operation ‘under the covers’ to
implement a closed set of comparison operations (greater than, equal,
etc.) if desired. It need not, in this case, expose the
compare operation itself.
Examples:
compare(’2.1’, ’3’) ==> ’-1’
compare(’2.1’, ’2.1’) ==> ’0’
compare(’2.1’, ’2.10’) ==> ’0’
compare(’3’, ’2.1’) ==> ’1’
compare(’2.1’, ’-3’) ==> ’1’
compare(’-3’, ’2.1’) ==> ’-1’
Notes:
-
The result of compare is always exact and unrounded, and may
be a NaN.
-
The compare-total operation can be used
for a non-numerical comparison which provides a total ordering over
the abstract representation of values.
compare-signal takes two operands and compares their values
numerically. This operation is identical to compare, except
that if neither operand is a signaling NaN then any quiet NaN operand
is treated as though it were a signaling NaN. (That is, all NaNs
signal, with signaling NaNs taking precedence over quiet NaNs.)
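The contrast between compare and compare-signal for a quiet NaN operand can be sketched as below, assuming Python's decimal module with traps disabled so that the Invalid operation condition shows up as a flag.

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
nan = Decimal("NaN")

print(ctx.compare(Decimal("2.1"), Decimal("2.10")))  # 0
print(ctx.compare(Decimal("-3"), Decimal("2.1")))    # -1

quiet = ctx.compare(Decimal("2.1"), nan)
quiet_flag = bool(ctx.flags[InvalidOperation])
print(quiet, quiet_flag)       # NaN False

signaled = ctx.compare_signal(Decimal("2.1"), nan)
signal_flag = bool(ctx.flags[InvalidOperation])
print(signaled, signal_flag)   # NaN True
```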
divide takes two operands.
If either operand is a special value then the general rules
apply.
Otherwise, if the divisor is zero then either the Division undefined
condition is raised (if the dividend is zero) and the result is NaN, or
the Division by zero condition is raised and the result is an Infinity
with a sign which is the exclusive or of the signs of the
operands.
Otherwise, a ‘long division’ is effected,
as follows:
-
An integer variable, adjust, is initialized to 0.
-
If the dividend is non-zero, the coefficient of the result is
computed as follows (using working copies of the operand coefficients,
as necessary):
-
The operand coefficients are adjusted so that the coefficient of the
dividend is greater than or equal to the coefficient of the divisor and
is also less than ten times the coefficient of the divisor, thus:
-
While the coefficient of the dividend is less than the coefficient of
the divisor it is multiplied by 10 and adjust is incremented by
1.
-
While the coefficient of the dividend is greater than or equal to ten
times the coefficient of the divisor the coefficient of the divisor is
multiplied by 10 and adjust is decremented by 1.
-
The result coefficient is initialized to 0.
-
The following steps are then repeated until the division is complete:
-
While the coefficient of the divisor is smaller than or equal to the
coefficient of the dividend the former is subtracted from the latter and
the coefficient of the result is incremented by 1.
-
If the coefficient of the dividend is now 0 and adjust is
greater than or equal to 0, or if the coefficient of the result has
precision digits, the division is complete.
Otherwise, the coefficients of the result and the dividend are
multiplied by 10 and adjust is incremented by 1.
-
Any remainder (the final coefficient of the dividend) is recorded and
taken into account for rounding.[3]
Otherwise (the dividend is zero), the coefficient of the result is zero
and adjust is unchanged (is 0).
-
The exponent of the result is computed by subtracting the sum
of the original exponent of the divisor and the value of adjust
at the end of the coefficient calculation from the original exponent of
the dividend.
-
The sign of the result is the exclusive or of the signs
of the operands.
The result is then rounded to precision digits, if necessary,
according to the rounding algorithm and taking into account the
remainder from the division.
Examples:
divide(’1’, ’3’ ) ==> ’0.333333333’
divide(’2’, ’3’ ) ==> ’0.666666667’
divide(’5’, ’2’ ) ==> ’2.5’
divide(’1’, ’10’ ) ==> ’0.1’
divide(’12’, ’12’) ==> ’1’
divide(’8.00’, ’2’) ==> ’4.00’
divide(’2.400’, ’2.0’) ==> ’1.20’
divide(’1000’, ’100’) ==> ’10’
divide(’1000’, ’1’) ==> ’1000’
divide(’2.40E+6’, ’2’) ==> ’1.20E+6’
Note that the results as described above can alternatively be expressed
as follows:
-
The ideal (simplest) exponent for the result of a division is
the exponent of the dividend less the exponent of the divisor.
-
After the division, if the result is exact then the coefficient and
exponent giving the correct value and with the exponent closest to the
ideal exponent is returned. If the result is inexact, the coefficient
will have exactly precision digits (unless the result is
subnormal), and the exponent will be set appropriately.
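The long division steps can be modelled directly on integer coefficients. The following is a minimal sketch under stated assumptions: long_divide is a hypothetical helper, operands are (coefficient, exponent) pairs with signs ignored, the divisor is non-zero, and the final rounding step that uses the remainder is left out.

```python
def long_divide(dividend, divisor, precision=9):
    """Return (coefficient, exponent, remainder) for finite operands."""
    (c1, e1), (c2, e2) = dividend, divisor
    adjust = 0
    result = 0
    if c1 != 0:
        # Scale until divisor <= dividend < 10 * divisor.
        while c1 < c2:
            c1 *= 10
            adjust += 1
        while c1 >= c2 * 10:
            c2 *= 10
            adjust -= 1
        while True:
            # Repeated subtraction yields the next digit of the result.
            while c2 <= c1:
                c1 -= c2
                result += 1
            if (c1 == 0 and adjust >= 0) or len(str(result)) == precision:
                break
            result *= 10
            c1 *= 10
            adjust += 1
    return result, e1 - (e2 + adjust), c1

print(long_divide((1, 0), (3, 0)))        # (333333333, -9, 1)
print(long_divide((800, -2), (2, 0)))     # (400, -2, 0)
print(long_divide((2400, -3), (20, -1)))  # (120, -2, 0)
print(long_divide((240, 4), (2, 0)))      # (120, 4, 0)
```

The four calls correspond to the examples divide(’1’, ’3’), divide(’8.00’, ’2’), divide(’2.400’, ’2.0’), and divide(’2.40E+6’, ’2’); in the exact cases the exponent ends up at the ideal exponent.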
divide-integer takes two operands; it divides two numbers and
returns the integer part of the result.
If either operand is a special value then the general rules
apply.
Otherwise, the result returned is defined to be that which would result
from repeatedly subtracting the divisor from the dividend while the
dividend is larger than or equal to the divisor. During this
subtraction, the absolute values of both the dividend and the divisor
are used: the sign of the final result is the same as that which
would result if normal division were used.
In other words, if the operands x and y were given to the
divide-integer and remainder operations, resulting in
i and r respectively, then the identity
x = (i × y) + r
holds.
The exponent of the result must be 0. Hence, if the result
cannot be expressed exactly within precision digits, the
operation is in error and will fail – that is, the result cannot
have more digits than the value of precision in effect for the
operation, and will not be rounded.
For example, divide-integer(’10000000000’, ’3’) requires ten
digits to express the result exactly (’3333333333’) and would
therefore fail if precision were in the range 1
through 9.
Notes:
-
The divide-integer operation may not give the same result as truncating
normal division (which could be affected by rounding and might be
Inexact).
-
The divide-integer and remainder operations are defined so that they
may be calculated as a by-product of the standard division operation
(described above). The division process is ended as soon as the
integer result is available; the residue of the dividend is the
remainder.
-
The divide and divide-integer operations on the same operands give
results of the same numerical value if no error occurs and there is no
residue from the divide-integer operation.
Examples:
divide-integer(’2’, ’3’) ==> ’0’
divide-integer(’10’, ’3’) ==> ’3’
divide-integer(’1’, ’0.3’) ==> ’3’
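The identity relating divide-integer and remainder, and the failure case for an over-long result, can be checked with a short sketch. It assumes Python's decimal module, where the failure is reported as a NaN result with the InvalidOperation flag set when traps are disabled; that reporting detail belongs to the module rather than to the text above.

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
x, y = Decimal("1"), Decimal("0.3")

i = ctx.divide_int(x, y)
r = ctx.remainder(x, y)
print(i, r)                                  # 3 0.1
print(ctx.add(ctx.multiply(i, y), r) == x)   # True

# Ten digits would be needed, one more than precision allows.
too_long = ctx.divide_int(Decimal("10000000000"), Decimal("3"))
print(too_long, bool(ctx.flags[InvalidOperation]))  # NaN True
```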
exp takes one operand.
If the operand is a NaN then the general rules for special values
apply.
Otherwise, the result is e raised to the power of the operand,
with the following cases:
-
If the operand is –Infinity, the result is 0 and exact.
-
If the operand is a zero, the result is 1 and exact.
-
If the operand is +Infinity, the result is +Infinity and exact.
-
Otherwise the result is inexact and will be rounded using the
round-half-even algorithm. The coefficient will have exactly
precision digits (unless the result is subnormal).
These inexact results should be correctly rounded, but may be up to 1
ulp (unit in last place) in error.
Examples:
exp(’-Infinity’) ==> ’0’
exp(’-1’) ==> ’0.367879441’
exp(’0’) ==> ’1’
exp(’1’) ==> ’2.71828183’
exp(’0.693147181’) ==> ’2.00000000’
exp(’+Infinity’) ==> ’Infinity’
Notes:
-
The rounding setting in the context is not used; this means
that the algorithm described in Variable Precision Exponential
Function by T. E. Hull and A. Abrham (ACM Transactions on
Mathematical Software, Vol 12 #2, pp79–91, ACM, June 1986) may be
used for this operation.
-
When the result is inexact, the cost of exp at precision
d is likely to be at least
13×log2(d) times the cost of an inexact
multiplication at the same precision (see Multiple-precision
zero-finding methods and the complexity of elementary function
evaluation by R. P. Brent, in Analytic Computational Complexity
pp151–176, Academic Press, New York, 1976, and Fast
Multiple-Precision Evaluation of Elementary Functions by the same
author, in Journal of the ACM (JACM), Vol 23 # 2, pp242–251,
ACM, April 1976).
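That the context rounding setting is not used by exp can be illustrated by choosing a different rounding and comparing with an ordinary division. This sketch assumes Python's decimal module, whose documentation states that exp is rounded with round-half-even.

```python
from decimal import Context, Decimal, ROUND_DOWN

# Deliberately use round-down in the context.
ctx = Context(prec=9, rounding=ROUND_DOWN, traps=[])

print(ctx.exp(Decimal("-Infinity")))            # 0
print(ctx.exp(Decimal("0")))                    # 1
print(ctx.exp(Decimal("1")))                    # 2.71828183 (not truncated to ...182)
print(ctx.divide(Decimal("2"), Decimal("3")))   # 0.666666666 (round-down applies)
```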
fused-multiply-add takes three operands; the first two are
multiplied together, using multiply, with sufficient
precision and exponent range that the result is exact and unrounded.[4]
No flags are set by the multiplication unless one of the
first two operands is a signaling NaN or one is a zero and the other
is an infinity.
Unless the multiplication failed, the third operand is then added to
the result of that multiplication, using add, under the
current context.
In other words, fused-multiply-add(x, y, z) delivers a result
which is (x × y) + z with only the one, final,
rounding.
Examples:
fused-multiply-add(’3’, ’5’, ’7’) ==> ’22’
fused-multiply-add(’3’, ’-5’, ’7’) ==> ’-8’
fused-multiply-add(’888565290’, ’1557.96930’,
’-86087.7578’) ==> ’1.38435736E+12’
Note that the last example would have given the
result ’1.38435735E+12’ if the operation had been carried out
as a separate multiply followed by an add.
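The single-rounding behaviour in the last example can be reproduced as follows, assuming Python's decimal module, which provides the operation as Context.fma.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
x, y, z = Decimal("888565290"), Decimal("1557.96930"), Decimal("-86087.7578")

fused = ctx.fma(x, y, z)                    # one rounding, at the end
separate = ctx.add(ctx.multiply(x, y), z)   # product is rounded first

print(fused)     # 1.38435736E+12
print(separate)  # 1.38435735E+12
```

The exact product is 1384357442865.597; rounding it to nine digits before the addition discards the digits that decide the final rounding.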
ln takes one operand.
If the operand is a NaN then the general rules for special values
apply.
Otherwise, the operand must be a zero or positive, and the result is
the natural (base e) logarithm of the operand, with the
following cases:
-
If the operand is a zero, the result is –Infinity and exact.
-
If the operand is +Infinity, the result is +Infinity and exact.
-
If the operand equals one, the result is 0 and exact.
-
Otherwise the result is inexact and will be rounded using the
round-half-even algorithm. The coefficient will have exactly
precision digits (unless the result is subnormal).
These inexact results should be correctly rounded, but may be up to 1
ulp (unit in last place) in error.
Examples:
ln(’0’) ==> ’-Infinity’
ln(’1.000’) ==> ’0’
ln(’2.71828183’) ==> ’1.00000000’
ln(’10’) ==> ’2.30258509’
ln(’+Infinity’) ==> ’Infinity’
Notes:
-
The rounding setting in the context is not used.
-
When the result is inexact, the cost of ln at a given
precision is likely to be similar to, or greater than, that of the
exp function (see notes under that function).
log10 takes one operand.
If the operand is a NaN then the general rules for special values
apply.
Otherwise, the operand must be a zero or positive, and the result is
the base 10 logarithm of the operand, with the following cases:
-
If the operand is a zero, the result is –Infinity and exact.
-
If the operand is +Infinity, the result is +Infinity and exact.
-
If the operand equals an integral power of ten (including
10^0 and negative powers) and there is sufficient
precision to hold the integral part of the result, the result
is an integer (with an exponent of 0) and exact.
-
Otherwise the result is inexact and will be rounded using the
round-half-even algorithm. The coefficient will have exactly
precision digits (unless the result is subnormal).
These inexact results should be correctly rounded, but may be up to 1
ulp (unit in last place) in error.
Examples:
log10(’0’) ==> ’-Infinity’
log10(’0.001’) ==> ’-3’
log10(’1.000’) ==> ’0’
log10(’2’) ==> ’0.301029996’
log10(’10’) ==> ’1’
log10(’70’) ==> ’1.84509804’
log10(’+Infinity’) ==> ’Infinity’
Notes:
-
The rounding setting in the context is not used.
-
When the result is inexact, the cost of log10 at a given
precision is likely to be similar to, or greater than, that of the
exp function (see notes under that function).
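The exact and inexact cases of ln and log10 can be tried with a short sketch, again assuming Python's decimal module with traps disabled. A negative operand, which the text above excludes, is shown setting the InvalidOperation flag.

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

print(ctx.ln(Decimal("0")))         # -Infinity
print(ctx.ln(Decimal("1.000")))     # 0
print(ctx.ln(Decimal("10")))        # 2.30258509
print(ctx.log10(Decimal("0.001")))  # -3
print(ctx.log10(Decimal("70")))     # 1.84509804

negative = ctx.ln(Decimal("-1"))
print(negative, bool(ctx.flags[InvalidOperation]))  # NaN True
```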
max takes two operands, compares their values numerically, and
returns the maximum.[5]
If either operand is a NaN then the general rules apply, unless one
is a quiet NaN and the other is numeric, in which case the numeric
operand is returned.[6]
Otherwise, the operands are compared as though by the
compare operation. If they are not
numerically equal then the maximum (closer to positive infinity) of
the two operands is chosen as the result.
Otherwise (they are numerically equal):
- if the operand signs differ the operand with sign 0 is chosen
- if the signs and exponents are equal the operands are
identical so either can be chosen
- if the signs are both positive, the operand with the maximum
exponent is chosen
- if the signs are both negative, the operand with the minimum
exponent is chosen.
For numerical results, the result is the same as using the
plus operation on the chosen operand,
except that the sign of a zero does not change.
Examples:
max(’3’, ’2’) ==> ’3’
max(’-10’, ’3’) ==> ’3’
max(’1.0’, ’1’) ==> ’1’
max(’7’, ’NaN’) ==> ’7’
max-magnitude takes two operands and compares their values
numerically with their sign ignored and assumed to be 0.
If, without signs, the first operand is the larger then the
original first operand is returned (that is, with the original sign).
If, without signs, the second operand is the larger then the original
second operand is returned.
Otherwise the result is the same as from the max operation.
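A brief sketch of max and max-magnitude, assuming Python's decimal module (where max-magnitude is named max_mag), shows the quiet NaN exception and the effect of ignoring signs.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

print(ctx.max(Decimal("1.0"), Decimal("1")))      # 1
print(ctx.max(Decimal("7"), Decimal("NaN")))      # 7
print(ctx.max(Decimal("-10"), Decimal("3")))      # 3
print(ctx.max_mag(Decimal("-10"), Decimal("3")))  # -10
```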
min takes two operands, compares their values numerically, and
returns the minimum.[7]
If either operand is a NaN then the general rules apply, unless one
is a quiet NaN and the other is numeric, in which case the numeric
operand is returned.
Otherwise, the operands are compared as though by the
compare operation. If they are not
numerically equal then the minimum (closer to negative infinity) of
the two operands is chosen as the result.
Otherwise (they are numerically equal):
- if the operand signs differ the operand with sign 1 is chosen
- if the signs and exponents are equal the operands are
identical so either can be chosen
- if the signs are both positive, the operand with the minimum
exponent is chosen
- if the signs are both negative, the operand with the maximum
exponent is chosen.
For numerical results, the result is the same as using the
plus operation on the chosen operand,
except that the sign of a zero does not change.
Examples:
min(’3’, ’2’) ==> ’2’
min(’-10’, ’3’) ==> ’-10’
min(’1.0’, ’1’) ==> ’1.0’
min(’7’, ’NaN’) ==> ’7’
min-magnitude takes two operands and compares their values
numerically with their sign ignored and assumed to be 0.
If, without signs, the first operand is the smaller then the
original first operand is returned (that is, with the original sign).
If, without signs, the second operand is the smaller then the original
second operand is returned.
Otherwise the result is the same as from the min operation.
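The tie-breaking rules that max and min apply to numerically equal operands can be written out as a small helper. This is a sketch only: pick_equal is a hypothetical function covering just the equal-value case for finite operands, and Python's decimal module is assumed as the reference for comparison.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

def pick_equal(a, b, want_max):
    """Choose between two numerically equal finite operands."""
    neg_a, neg_b = a.is_signed(), b.is_signed()
    if neg_a != neg_b:
        # max takes the operand with sign 0, min the one with sign 1.
        return a if neg_a != want_max else b
    exp_a, exp_b = a.as_tuple().exponent, b.as_tuple().exponent
    # Positive operands: max wants the larger exponent, min the smaller.
    # Negative operands reverse this.
    prefer_larger = want_max != neg_a
    return a if (exp_a >= exp_b) == prefer_larger else b

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
for left, right in [("1.0", "1"), ("-1.0", "-1"), ("0", "-0")]:
    a, b = Decimal(left), Decimal(right)
    print(pick_equal(a, b, True), pick_equal(a, b, False))
# 1 1.0
# -1.0 -1
# 0 -0
```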
minus and plus both take one operand, and correspond
to the prefix minus and plus operators in programming languages.
The operations are evaluated using the same rules as add and
subtract; the operations plus(a) and minus(b)
(where a and b refer to any numbers) are calculated
as the operations add(’0’, a) and subtract(’0’, b)
respectively, where the ’0’ has the same exponent as the operand.
Examples:
plus(’1.3’) ==> ’1.3’
plus(’-1.3’) ==> ’-1.3’
minus(’1.3’) ==> ’-1.3’
minus(’-1.3’) ==> ’1.3’
Note that the result of these operations is affected by context and may
set flags. The copy-negate
operation may be used instead of minus if this is not
desired.
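Because plus and minus are defined through add and subtract, they round their operand and follow the zero-sign rules, unlike copy-negate. A sketch assuming Python's decimal module:

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
x = Decimal("1.23456789012")

print(ctx.plus(x))                  # 1.23456789 (rounded like add)
print(ctx.minus(x))                 # -1.23456789
print(x.copy_negate())              # -1.23456789012 (context not used)
print(ctx.minus(Decimal("0")))      # 0 (subtract of zeros gives sign 0)
print(Decimal("0").copy_negate())   # -0
print(ctx.plus(Decimal("-0")))      # 0
```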
multiply takes two operands.
If either operand is a special value then the general rules
apply.
Otherwise, the operands are multiplied together (‘long
multiplication’), resulting in a number which may be as long as the
sum of the lengths of the two operands, as follows:
-
The coefficient of the result, before rounding, is computed by
multiplying together the coefficients of the operands.
-
The exponent of the result, before rounding, is the sum of
the exponents of the two operands.
-
The sign of the result is the exclusive or of the signs
of the operands.
The result is then rounded to precision digits if necessary,
counting from the most significant digit of the result.
Examples:
multiply(’1.20’, ’3’) ==> ’3.60’
multiply(’7’, ’3’) ==> ’21’
multiply(’0.9’, ’0.8’) ==> ’0.72’
multiply(’0.9’, ’-0’) ==> ’-0.0’
multiply(’654321’, ’654321’) ==> ’4.28135971E+11’
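The last example shows the exponent being increased by the number of digits removed in rounding. A sketch assuming Python's decimal module makes the intermediate values visible:

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

# Exact coefficient before rounding: 12 digits, exponent 0 + 0 = 0.
print(654321 * 654321)   # 428135971041

product = ctx.multiply(Decimal("654321"), Decimal("654321"))
print(product)                      # 4.28135971E+11
print(product.as_tuple().exponent)  # 3 (three digits were removed)

print(ctx.multiply(Decimal("0.9"), Decimal("-0")))  # -0.0
```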
next-minus takes one operand; if the operand is a NaN then
the general rules apply.
Otherwise the result is the largest representable number that is
smaller than the operand unless the operand is –Infinity, in
which case the result is –Infinity. If the result is zero its
sign will be 0 and its exponent will be the
smallest possible.
No flags will be set when the operand is numeric.
In the following examples, Emax and Emin are
assumed to be +999 and –999 respectively.
Examples:
next-minus(’1’) ==> ’0.999999999’
next-minus(’1E-1007’) ==> ’0E-1007’
next-minus(’-1.00000003’) ==> ’-1.00000004’
next-minus(’Infinity’) ==> ’9.99999999E+999’
next-plus takes one operand; if the operand is a NaN then
the general rules apply.
Otherwise the result is the smallest representable number that is
larger than the operand unless the operand is +Infinity, in
which case the result is +Infinity. If the result is zero its
sign will be 1 and its exponent will be the
smallest possible.
No flags will be set when the operand is numeric.
In the following examples, Emax and Emin are
assumed to be +999 and –999 respectively.
Examples:
next-plus(’1’) ==> ’1.00000001’
next-plus(’-1E-1007’) ==> ’-0E-1007’
next-plus(’-1.00000003’) ==> ’-1.00000002’
next-plus(’-Infinity’) ==> ’-9.99999999E+999’
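The next-minus and next-plus examples can be reproduced with a context whose Emax and Emin match the values assumed above. The sketch relies on Python's decimal module; the final line checks that no flags were set.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, Emin=-999, Emax=999, traps=[])

print(ctx.next_minus(Decimal("1")))          # 0.999999999
print(ctx.next_minus(Decimal("1E-1007")))    # 0E-1007
print(ctx.next_plus(Decimal("-1E-1007")))    # -0E-1007
print(ctx.next_minus(Decimal("Infinity")))   # 9.99999999E+999
print(any(ctx.flags[s] for s in ctx.flags))  # False
```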
next-toward takes two operands; if either operand is a NaN then
the general rules apply.
Otherwise the result is the representable number closest to
the first operand (but not the first operand) that is in the
direction towards the second operand, unless the operands have the same
value. Specifically:
- If the second operand is larger than the first operand then the
result is the result of the operation next-plus on the first operand
- If the second operand is smaller than the first operand then the
result is the result of the operation next-minus on the first operand
- If the two operands are numerically equal, then the result is a
copy of the first operand with the sign set to be the same
as the sign of the second operand; in this case no
flags are set.
In the first two cases, flags are set as though the
operation had been computed by adding (in the first case) or
subtracting (in the second) an infinitesimally small positive value
to or from the first operand with the rounding mode set to be
round-ceiling or round-floor respectively.[8]
In the following examples, Emax and Emin are
assumed to be +999 and –999 respectively.
Examples:
next-toward(’1’, ’2’) ==> ’1.00000001’
next-toward(’-1E-1007’, ’1’) ==> ’-0E-1007’
next-toward(’-1.00000003’, ’0’) ==> ’-1.00000002’
next-toward(’1’, ’0’) ==> ’0.999999999’
next-toward(’1E-1007’, ’-100’) ==> ’0E-1007’
next-toward(’-1.00000003’, ’-10’) ==> ’-1.00000004’
next-toward(’0.00’, ’-0.0000’) ==> ’-0.00’
This operation derives its anomalous rules for flags
from the IEEE 754-1985 operation nextAfter; the operation was
dropped from the IEEE 754-2008 standard.
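The flag behaviour of next-toward can be observed in a sketch that assumes Python's decimal module and the same Emax and Emin as the examples. The equal-operands case sets no flags, while a step into the subnormal range does.

```python
from decimal import Context, Decimal, Underflow, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, Emin=-999, Emax=999, traps=[])

print(ctx.next_toward(Decimal("1"), Decimal("2")))           # 1.00000001
print(ctx.next_toward(Decimal("0.00"), Decimal("-0.0000")))  # -0.00
flags_before = any(ctx.flags[s] for s in ctx.flags)
print(flags_before)                                          # False

print(ctx.next_toward(Decimal("1E-1007"), Decimal("-100")))  # 0E-1007
print(bool(ctx.flags[Underflow]))                            # True
```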
power takes two operands, and raises a number (the
left-hand operand) to a power (the right-hand operand).
If either operand is a special value then the general rules
apply, except as stated below.
The following rules apply:
-
If both operands are zero, or if the left-hand operand is less than
zero and the right-hand operand does not have an integral value[9]
or is infinite, an Invalid operation condition is
raised, the result is [0,qNaN], and the following rules do not
apply.
-
If the left-hand operand is infinite, the result will be exact and
will be infinite if the right-hand side is positive, 1 if the
right-hand side is a zero, and 0 if the right-hand side is negative.
-
If the left-hand operand is a zero, the result will be exact and will
be infinite if the right-hand side is negative or 0 if the right-hand
side is positive.
-
If the right-hand operand is a zero, the result will be 1 and exact.
-
In cases not covered above, the result will be inexact unless the
right-hand side has an integral value and the result is finite and
can be expressed exactly within precision digits. In this
latter case, if the result is unrounded then its exponent will be
that which would result if the operation were calculated by repeated
multiplication (if the second operand is negative then the reciprocal
of the first operand is used, with the absolute value of the second
operand determining the multiplications).
-
Inexact finite results should be correctly rounded, but may be up to
1 ulp (unit in last place) in error.
-
The sign of the result will be 1 only if the
right-hand side has an integral value and is odd (and is not
infinite) and also the sign of the left-hand side is 1.
In all other cases, the sign of the result will be 0.
Examples:
power(’2’, ’3’) ==> ’8’
power(’-2’, ’3’) ==> ’-8’
power(’2’, ’-3’) ==> ’0.125’
power(’1.7’, ’8’) ==> ’69.7575744’
power(’10’, ’0.301029996’) ==> ’2.00000000’
power(’Infinity’, ’-1’) ==> ’0’
power(’Infinity’, ’0’) ==> ’1’
power(’Infinity’, ’1’) ==> ’Infinity’
power(’-Infinity’, ’-1’) ==> ’-0’
power(’-Infinity’, ’0’) ==> ’1’
power(’-Infinity’, ’1’) ==> ’-Infinity’
power(’-Infinity’, ’2’) ==> ’Infinity’
power(’0’, ’0’) ==> ’NaN’
Notes:
-
When the result is inexact, the cost of power at a given
precision is likely to be at least twice that of the
exp function (see notes under that function).
-
An infinite result is always exact, as described in the general rules.
-
Versions of this specification prior to version 1.48 defined a
simpler power operation which only required support for
integer powers.
-
It can be argued that the special cases where one operand is zero and
the other is infinite (such as power(’0’, ’Infinity’)
and power(’Infinity’, ’0’)) should return a NaN, whereas the
specification above leads to results of 0 and 1 respectively for the
two examples (for compatibility with the earlier version of this
operation).
If NaN results are desired instead, then these special cases should
be tested for before calling the power operation.
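A few of the power cases, including the Invalid operation case for two zero operands, can be sketched with Python's decimal module (assumed here as the implementation, with traps disabled):

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

print(ctx.power(Decimal("2"), Decimal("-3")))          # 0.125
print(ctx.power(Decimal("1.7"), Decimal("8")))         # 69.7575744
print(ctx.power(Decimal("-Infinity"), Decimal("-1")))  # -0
print(ctx.power(Decimal("Infinity"), Decimal("0")))    # 1

undefined = ctx.power(Decimal("0"), Decimal("0"))
print(undefined, bool(ctx.flags[InvalidOperation]))    # NaN True
```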
quantize takes two operands.
If either operand is a special value then the general rules
apply, except that if either operand is infinite and the other is finite
an Invalid operation condition is raised and the
result is [0,qNaN], or if both are infinite then the result is
the first operand.
Otherwise (both operands are finite), quantize returns the
number which is equal in value (except for any rounding) and sign to the
first (left-hand) operand and which has an exponent set to be
equal to the exponent of the second (right-hand) operand.
The coefficient of the result is derived from that of the
left-hand operand. It may be rounded using the current
rounding setting (if the exponent is being
increased), multiplied by a positive power of ten (if the
exponent is being decreased), or is unchanged (if the
exponent is already equal to that of the right-hand operand).
Unlike other operations, if the length of the coefficient after
the quantize operation would be greater than precision then
an Invalid operation condition is raised.
This guarantees that, unless there is an error condition, the
exponent of the result of a quantize is always equal to that
of the right-hand operand.
Also unlike other operations, quantize will never raise Underflow, even
if the result is subnormal and inexact.
Examples:
quantize(’2.17’, ’0.001’) ==> ’2.170’
quantize(’2.17’, ’0.01’) ==> ’2.17’
quantize(’2.17’, ’0.1’) ==> ’2.2’
quantize(’2.17’, ’1e+0’) ==> ’2’
quantize(’2.17’, ’1e+1’) ==> ’0E+1’
quantize(’-Inf’, ’Infinity’) ==> ’-Infinity’
quantize(’2’, ’Infinity’) ==> ’NaN’
quantize(’-0.1’, ’1’) ==> ’-0’
quantize(’-0’, ’1e+5’) ==> ’-0E+5’
quantize(’+35236450.6’, ’1e-2’) ==> ’NaN’
quantize(’-35236450.6’, ’1e-2’) ==> ’NaN’
quantize(’217’, ’1e-1’) ==> ’217.0’
quantize(’217’, ’1e+0’) ==> ’217’
quantize(’217’, ’1e+1’) ==> ’2.2E+2’
quantize(’217’, ’1e+2’) ==> ’2E+2’
Notes:
-
In the penultimate example the result is [0,22,1],
leading to the string in scientific notation as shown.
-
This operation was previously called rescale, which had
identical semantics except that the second operand specified the power
of ten of the quantum. The quantize semantics specifies the
desired quantum by example, which allows a faster implementation in most
cases.
-
The sign and coefficient of the second operand are ignored; this allows
a ‘match the quantum of a variable’ operation to be effected
directly.
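Some of the quantize examples above can be reproduced with the sketch
below. It assumes that Python's decimal module follows the rules given
here; the helper name quantize_str is hypothetical.

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

# Precision 9, round-half-up, all trap-enablers set to 0.
CTX = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])


def quantize_str(left, right, ctx=CTX):
    """Hypothetical helper: quantize the left-hand operand to the
    exponent of the right-hand operand and return the result string."""
    return str(ctx.quantize(Decimal(left), Decimal(right)))


# The exponent is decreased, so the coefficient is multiplied by ten.
padded = quantize_str("2.17", "0.001")

# The exponent is increased, so the coefficient is rounded.
rounded = quantize_str("2.17", "0.1")

# The sign and coefficient of the right-hand operand are ignored.
same_quantum = quantize_str("2.17", "-99.9")

# Ten coefficient digits would be needed, which exceeds precision.
too_long = quantize_str("35236450.6", "1e-2")
```

After the last call, CTX.flags[InvalidOperation] is set and the result
string is ’NaN’, because the trap for that condition is not enabled.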
reduce takes one operand.
It has the same semantics as the plus operation, except that
if the final result is finite it is reduced to its simplest form, with
all trailing zeros removed and its sign preserved.
That is, while the coefficient is non-zero and a multiple of ten
the coefficient is divided by ten and the exponent is
incremented by 1.
Otherwise (the coefficient is zero) the exponent
is set to 0.
In all cases the sign is unchanged.
Examples:
reduce(’2.1’) ==> ’2.1’
reduce(’-2.0’) ==> ’-2’
reduce(’1.200’) ==> ’1.2’
reduce(’-120’) ==> ’-1.2E+2’
reduce(’120.00’) ==> ’1.2E+2’
reduce(’0.00’) ==> ’0’
This operation was called normalize prior to version
1.68 of this specification.
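The reduction step can be sketched directly on the sign, coefficient,
and exponent of a finite number. This is a minimal illustration: it
omits the rounding that the plus operation would apply first, and the
names reduce_triple and reduce_decimal are hypothetical. Python's decimal
module is used only to convert to and from strings.

```python
from decimal import Decimal


def reduce_triple(sign, coefficient, exponent):
    """Remove trailing zeros from a finite [sign, coefficient, exponent]."""
    if coefficient == 0:
        # A zero coefficient: the exponent is set to 0.
        return sign, 0, 0
    while coefficient % 10 == 0:
        # Non-zero multiple of ten: divide by ten, increment the exponent.
        coefficient //= 10
        exponent += 1
    return sign, coefficient, exponent


def reduce_decimal(value):
    """Apply reduce_triple to a finite Decimal (no rounding is done)."""
    sign, digits, exponent = value.as_tuple()
    coefficient = int("".join(map(str, digits)))
    s, c, e = reduce_triple(sign, coefficient, exponent)
    return Decimal((s, tuple(int(ch) for ch in str(c)), e))
```

For example, reduce_triple(1, 120, 0) gives (1, 12, 1), which
corresponds to the string ’-1.2E+2’ shown above.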
remainder takes two operands; it returns the remainder from
integer division.
If either operand is a special value then the general rules
apply.
Otherwise, the result is the residue of the dividend after the operation
of calculating integer division as described for
divide-integer, rounded to precision digits if
necessary.
The sign of the result, if non-zero, is the same as that of the
original dividend.
This operation will fail under the same conditions as integer
division (that is, if integer division on the same two operands would
fail, the remainder cannot be calculated).
Examples:
remainder(’2.1’, ’3’) ==> ’2.1’
remainder(’10’, ’3’) ==> ’1’
remainder(’-10’, ’3’) ==> ’-1’
remainder(’10.2’, ’1’) ==> ’0.2’
remainder(’10’, ’0.3’) ==> ’0.1’
remainder(’3.6’, ’1.3’) ==> ’1.0’
Notes:
-
The divide-integer and remainder operations are defined so that they
may be calculated as a by-product of the standard division operation
(described above). The division process is ended as soon as the
integer result is available; the residue of the dividend is the
remainder.
-
The sign of the result will always be the sign of the dividend.
-
The remainder operation differs from the remainder operation defined in
IEEE 754 (the remainder-near operator), in that it gives the
same results for numbers whose values are equal to integers as would the
usual remainder operator on integers.
For example, the result of the operation remainder(’10’, ’6’) as
defined here is ’4’, and remainder(’10.0’, ’6’) would
give ’4.0’ (as would remainder(’10’, ’6.0’)
or remainder(’10.0’, ’6.0’)). The IEEE 754 remainder operation
would, however, give the result ’-2’ because its integer division
step chooses the closest integer, not the one nearer zero.
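The relationship between divide-integer and remainder can be sketched
as below, assuming Python's decimal module as a stand-in implementation.
The function name remainder_via_divide_int is hypothetical. The sketch
is not a full implementation: it assumes that the intermediate product
fits within precision, and it does not reproduce the sign of a zero
result.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Precision 9, round-half-up, all trap-enablers set to 0.
CTX = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])


def remainder_via_divide_int(x, y, ctx=CTX):
    """Residue of the dividend: x - (divide-integer(x, y) * y)."""
    q = ctx.divide_int(x, y)  # integer part of x / y, truncated towards zero
    return ctx.subtract(x, ctx.multiply(q, y))
```

For the operands ’10’ and ’6’ this gives ’4’, and for ’10.0’ and ’6’ it
gives ’4.0’, matching the results described in the note above.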
remainder-near takes two operands.
If either operand is a special value then the general rules
apply.
Otherwise, if the operands are given by x and y, then the
result is defined to be x – y × n,
where n is the integer nearest the exact value of
x ÷ y (if two integers are equally near then the
even one is chosen). If the result is equal to 0 then its sign will be
the sign of x.
(See IEEE 754 §5.3.1.)
This operation will fail under the same conditions as integer
division (that is, if integer division on the same two operands would
fail, the remainder cannot be calculated), except when the quotient
is very close to 10 raised to the power of the precision.[10]
Examples:
remainder-near(’2.1’, ’3’) ==> ’-0.9’
remainder-near(’10’, ’6’) ==> ’-2’
remainder-near(’10’, ’3’) ==> ’1’
remainder-near(’-10’, ’3’) ==> ’-1’
remainder-near(’10.2’, ’1’) ==> ’0.2’
remainder-near(’10’, ’0.3’) ==> ’0.1’
remainder-near(’3.6’, ’1.3’) ==> ’-0.3’
Notes:
-
The remainder-near operation differs from the remainder
operation in that it does not give the same results for numbers whose
values are equal to integers as would the usual remainder operator on
integers.
For example, the operation remainder(’10’, ’6’) gives the
result ’4’, and remainder(’10.0’, ’6’) gives ’4.0’
(as would the operations remainder(’10’, ’6.0’)
or remainder(’10.0’, ’6.0’)).
However, remainder-near(’10’, ’6’) gives the result ’-2’
because its integer division step chooses the closest integer, not the
one nearer zero.
-
The result of this operation is always exact.
-
This operation is sometimes known as ‘IEEE remainder’.
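The definition x – y × n can be sketched as follows. The exact quotient
is formed with Python's fractions module so that the choice of n is
made on the exact value of x ÷ y; the function name
remainder_near_sketch is hypothetical. The sketch handles finite
operands only and does not reproduce the sign of a zero result.

```python
from decimal import Context, Decimal, ROUND_HALF_UP
from fractions import Fraction

# Precision 9, round-half-up, all trap-enablers set to 0.
CTX = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])


def remainder_near_sketch(x, y, ctx=CTX):
    """Compute x - y * n, where n is the integer nearest to x / y."""
    # round() on a Fraction picks the even integer when two are equally near.
    n = round(Fraction(x) / Fraction(y))
    return ctx.subtract(x, ctx.multiply(y, Decimal(n)))
```

For the operands ’25’ and ’10’ the exact quotient is 2.5, so n is 2 (the
even neighbour) and the result is ’5’; for ’35’ and ’10’, n is 4 and the
result is ’-5’.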
round-to-integral-exact takes one operand.
If the operand is a special value, or the exponent of the
operand is non-negative, then the result is the same as the operand
(unless the operand is a signaling NaN, as usual).
Otherwise (the operand has a negative exponent) the result is the same
as that of the quantize operation with the given operand as the
left-hand operand, 1E+0 as the right-hand operand, and the precision of
the operand as the precision setting.
The rounding mode is taken from the context, as usual.
Examples:
round-to-integral-exact(’2.1’) ==> ’2’
round-to-integral-exact(’100’) ==> ’100’
round-to-integral-exact(’100.0’) ==> ’100’
round-to-integral-exact(’101.5’) ==> ’102’
round-to-integral-exact(’-101.5’) ==> ’-102’
round-to-integral-exact(’10E+5’) ==> ’1.0E+6’
round-to-integral-exact(’7.89E+77’) ==> ’7.89E+77’
round-to-integral-exact(’-Inf’) ==> ’-Infinity’
round-to-integral-value takes one operand.
It is identical to the round-to-integral-exact operation
except that the Inexact and Rounded flags are never set even if the
operand is rounded (that is, the operation is quiet unless the
operand is a signaling NaN).
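The difference between the two operations lies only in the flags, as
the sketch below illustrates. It assumes that the to_integral_exact and
to_integral_value methods of Python's decimal module correspond to
round-to-integral-exact and round-to-integral-value; the helper name
integral_with_flags is hypothetical.

```python
from decimal import Context, Decimal, Inexact, Rounded, ROUND_HALF_UP


def integral_with_flags(text, exact):
    """Return (result string, Inexact flag, Rounded flag) for one call."""
    # A fresh context per call, so that flags start cleared.
    ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
    x = Decimal(text)
    result = ctx.to_integral_exact(x) if exact else ctx.to_integral_value(x)
    return str(result), bool(ctx.flags[Inexact]), bool(ctx.flags[Rounded])
```

For the operand ’2.1’ both forms give ’2’, but only the exact form
leaves the Inexact and Rounded flags set.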
square-root takes one operand.
If the operand is a special value then the general rules
apply.
Otherwise, the ideal exponent of the result is defined to be half the
exponent of the operand (rounded to an integer, towards –Infinity,[11]
if necessary) and then:
-
If the operand is less than zero an Invalid operation condition is
raised.
-
If the operand is greater than zero, the result is the square root
of the operand. If no rounding is necessary (the exact result requires
precision digits or fewer) then the coefficient and
exponent giving the correct value and with the exponent closest to the
ideal exponent is used.
If the result must be inexact, it is rounded using the
round-half-even algorithm and the coefficient will have exactly
precision digits (unless the result is subnormal), and the
exponent will be set to maintain the correct value.
-
Otherwise (the operand is equal to zero), the result will be the zero
with the same sign as the operand and with the ideal exponent.
Examples:
square-root(’0’) ==> ’0’
square-root(’-0’) ==> ’-0’
square-root(’0.39’) ==> ’0.624499800’
square-root(’100’) ==> ’10’
square-root(’1’) ==> ’1’
square-root(’1.0’) ==> ’1.0’
square-root(’1.00’) ==> ’1.0’
square-root(’7’) ==> ’2.64575131’
square-root(’10’) ==> ’3.16227766’
Notes:
-
The rounding setting in the context is not used; this means
that the algorithm described in
Properly Rounded Variable Precision Square Root by T. E. Hull
and A. Abrham (ACM Transactions on Mathematical Software, Vol 11 #3,
pp229–237, ACM, September 1985) may be used for this operation.
-
A subnormal result is only possible if the working precision is greater
than Emax+1.
-
The rules for setting the exponent of the result apply to many
operations; they can be used for any operation for which an ideal
exponent can be defined.
-
A negative zero is allowed as an operand as per IEEE 754 §5.4.1.
-
Square-root can also be calculated by using the
power operation (with a second
operand of 0.5). The result in that case will not be exact in most
cases, and may not be correctly rounded.[12]
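The ideal exponent rule can be sketched as follows, assuming Python's
decimal module as a stand-in implementation of square-root. The
function name ideal_sqrt_exponent is hypothetical, and the sketch
applies to finite operands only.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Precision 9; the rounding setting is not used by square-root.
CTX = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])


def ideal_sqrt_exponent(operand):
    """Half the exponent of a finite operand, rounded towards -Infinity."""
    # Floor division by 2 gives the same value as an arithmetic
    # right shift of one bit.
    return operand.as_tuple().exponent // 2
```

For example, ’1.0’ has exponent -1, so its ideal exponent is -1 and the
exact result is ’1.0’; ’1.00’ has exponent -2, giving the same ideal
exponent and the same result string.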
Footnotes:
[1]
In practice, it is only necessary to work with intermediate results of
up to twice the current precision. Some rounding settings may require
some inspection of possible remainders or additional digits (for
example, to determine whether a result is exactly 0.5 in the next
position), though their actual values would not be required.
For round-half-up, rounding can be effected by truncating the
result to precision (and adding the count of truncated digits
to the exponent).
The first truncated digit is then inspected, and if it has the value 5
through 9 the result is incremented by 1. This could cause the result
to again exceed precision digits, in which case it is divided
by 10 and the exponent is incremented by 1.
[2]
This rule removes the possibility of an arithmetic overflow
during a numeric comparison.
[3]
In practice, only two bits need to be noted, indicating whether the
remainder was 0, or was exactly half of the final coefficient of the
divisor, or was in one of the two ranges above or below the half-way
point.
[4]
This requires up to twice the current exponent range and a
precision which is the sum of the lengths of the two operands’
coefficients.
[5]
This is the IEEE 754 maxnum operation, with an explicit
result for equal operands.
[6]
This permits a useful ordering of data in which NaNs are used to
indicate ‘unknown’ values.
[7]
This is the IEEE 754 minnum operation, with an explicit
result for equal operands.
[8]
The result can in fact be computed by an appropriate addition, with
one infinite value having a special case result and the sign of a
zero result being set appropriately.
[9]
That is, any fractional part (after the decimal point) is non-zero.
[10]
This is a deviation from IEEE 754, necessary to assure realistic
execution times when the operands have a wide range of exponents.
[11]
This rule matches the typical implementations. For example, the
square-root of either [0,10,-1] or [0,11,-1] is often
calculated by first multiplying the coefficient by ten and reducing
the exponent by 1 and then determining the square root.
If the exponent is held as a two’s complement binary number, the
ideal exponent is trivially calculated by applying an arithmetic
right shift of one bit.
[12]
This is because a typical implementation of power(x,y) will calculate
its result using exp(ln(x)*y), and few results of the exp function
are exact.
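The round-half-up procedure described in footnote [1] can be sketched
on a non-negative integer coefficient and an exponent. This is an
illustration only; the function name round_half_up and its parameters
are hypothetical.

```python
def round_half_up(coefficient, exponent, precision):
    """Round a non-negative coefficient to at most precision digits."""
    digits = str(coefficient)
    excess = len(digits) - precision
    if excess <= 0:
        # No more than precision digits: used without change.
        return coefficient, exponent
    # Truncate to precision digits and add the count of truncated
    # digits to the exponent.
    kept = int(digits[:precision])
    first_truncated = int(digits[precision])
    exponent += excess
    if first_truncated >= 5:
        kept += 1
        if len(str(kept)) > precision:
            # The increment carried into an extra digit: divide by 10
            # and increment the exponent.
            kept //= 10
            exponent += 1
    return kept, exponent
```

For example, with precision 9 the coefficient 9999999995 and exponent
-2 are truncated to 999999999 with exponent -1, incremented to
1000000000, and then adjusted to 100000000 with exponent 0.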