Decimal Arithmetic Specification, version 1.70
Copyright (c) IBM Corporation, 2009. All rights reserved.
7 Apr 2009

Arithmetic operations

This section describes the arithmetic operations on, and some other functions of, numbers, including subnormal numbers, negative zeros, and special values (see also IEEE 754 §5 and §6). The operations described are:

Arithmetic operation notation

In this section, a simplified notation is used to illustrate arithmetic operations: a number is shown as the string that would result from using the to-scientific-string operation. Single quotes are used to indicate that a number converted from an abstract representation is implied.

Also, operations are indicated as functions (taking up to three operands), and the sequence ==> means ‘results in’. Hence:

  add(’12’, ’7.00’) ==> ’19.00’
means that the result of the add operation with the operands [0,12,0] and [0,700,-2] is [0,1900,-2].

Finally, in this example and in the examples below, the context is assumed to have precision set to 9, rounding set to round-half-up, and all trap-enablers set to 0.
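As an informal illustration only, the assumed context can be reproduced with Python's standard decimal module, whose documentation states that it is based on this specification. The Python names used here (Context, Decimal, as_tuple) belong to that module and are not defined by this document.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Context assumed by the examples: precision 9, round-half-up,
# and no trap-enablers set (traps=[] enables no traps).
ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

total = ctx.add(Decimal('12'), Decimal('7.00'))

# str() gives the scientific-string form used by the notation.
print(total)             # 19.00
# as_tuple() exposes the sign, coefficient digits, and exponent.
print(total.as_tuple())
```

Here the tuple corresponds to the abstract form [0,1900,-2] quoted above.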

Arithmetic operation rules

The following general rules apply to all arithmetic operations except where stated below.

Examples involving special values:

  add(’Infinity’, ’1’)        ==>  ’Infinity’
  add(’NaN’, ’1’)             ==>  ’NaN’
  add(’NaN’, ’Infinity’)      ==>  ’NaN’
  subtract(’1’, ’Infinity’)   ==>  ’-Infinity’
  multiply(’-1’, ’Infinity’)  ==>  ’-Infinity’
  subtract(’-0’, ’0’)         ==>  ’-0’
  multiply(’-1’, ’0’)         ==>  ’-0’
  divide(’1’, ’0’)            ==>  ’Infinity’
  divide(’1’, ’-0’)           ==>  ’-Infinity’
  divide(’-1’, ’0’)           ==>  ’-Infinity’
Notes:
  1. Operands may have more than precision digits and are not rounded before use.
  2. The context (precision and rounding, etc.) for an operation might be wholly implied, or be a global or local setting, or be passed to operations individually – depending on the implementation of the specification (for example, in a programming language).
  3. NaNs propagate any associated diagnostic information as described in IEEE 854 §6.2. The meaning of any such diagnostic information is outside the scope of this specification, but typically indicates the origin of the NaN. In IEEE 754-2008, this information is only held in the coefficient of decimal numbers and does not use the first digit of the coefficient.
  4. The rules above imply that the compare operation can return a quiet NaN as a result, which indicates an ‘unordered’ comparison (see IEEE 754 §5.11).
  5. An implementation may use the compare operation ‘under the covers’ to implement a closed set of comparison operations (greater than, equal, etc.) if desired. In this case, the additional constraints detailed in IEEE 754 §5.11 will apply; that is, a comparison (such as ‘greater than’) which does not explicitly allow for an ‘unordered’ result yet would require an unordered result will give rise to an Invalid operation condition.
  6. If a result is rounded, remains finite, and is not subnormal, its coefficient will have exactly precision digits (except after the quantize or round-to-integral operations, as described below). That is, only unrounded or subnormal coefficients can have fewer than precision digits.
  7. Trailing zeros are not removed after operations. The reduce operation may be used to remove trailing zeros if desired.
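The special-value examples and note 1 can be tried with the following sketch, which assumes Python's decimal module as the implementation. With all traps disabled, an exceptional condition sets a flag and returns a special value instead of raising a Python exception.

```python
from decimal import Context, Decimal, DivisionByZero, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

quotient = ctx.divide(Decimal('1'), Decimal('-0'))      # -Infinity
div_by_zero_flag = bool(ctx.flags[DivisionByZero])      # flag was set

nan_sum = ctx.add(Decimal('NaN'), Decimal('Infinity'))  # NaN
neg_zero = ctx.multiply(Decimal('-1'), Decimal('0'))    # -0

# Note 1: both operands have 13 digits, more than precision.
# They are used unrounded, so the difference is exactly 123.
diff = ctx.subtract(Decimal('1234567890123'), Decimal('1234567890000'))

print(quotient, nan_sum, neg_zero, diff)
```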


abs

abs takes one operand. If the operand is negative, the result is the same as using the minus operation on the operand. Otherwise, the result is the same as using the plus operation on the operand.

Examples:

  abs(’2.1’)    ==>  ’2.1’
  abs(’-100’)   ==>  ’100’
  abs(’101.5’)  ==>  ’101.5’
  abs(’-101.5’) ==>  ’101.5’
Note that the result of this operation is affected by context and may set flags. The copy-abs operation may be used if this is not desired.
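A minimal sketch of the difference between abs and copy-abs, assuming Python's decimal module (where the operations appear as Context.abs and Decimal.copy_abs):

```python
from decimal import Context, Decimal, Inexact, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
operand = Decimal('-1234567890123')     # 13 digits, more than precision

rounded_abs = ctx.abs(operand)          # rounded to 9 digits
abs_set_inexact = bool(ctx.flags[Inexact])

ctx.clear_flags()
exact_abs = operand.copy_abs()          # only the sign is cleared
copy_set_inexact = bool(ctx.flags[Inexact])

print(rounded_abs)   # 1.23456789E+12
print(exact_abs)     # 1234567890123
```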


add and subtract

add and subtract both take two operands. If either operand is a special value then the general rules apply.

Otherwise, the operands are added (after inverting the sign used for the second operand if the operation is a subtraction), as follows:

The result is then rounded to precision digits if necessary, counting from the most significant digit of the result.

Examples:

  add(’12’, ’7.00’)        ==>  ’19.00’
  add(’1E+2’, ’1E+4’)      ==>  ’1.01E+4’
  subtract(’1.3’, ’1.07’)  ==>  ’0.23’
  subtract(’1.3’, ’1.30’)  ==>  ’0.00’
  subtract(’1.3’, ’2.07’)  ==>  ’-0.77’
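The examples above can be reproduced with the sketch below, which assumes Python's decimal module. The last call is not one of the examples above; it is added to show the final rounding step when the sum needs ten digits.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

mixed_exponents = ctx.add(Decimal('1E+2'), Decimal('1E+4'))   # 1.01E+4
zero_diff = ctx.subtract(Decimal('1.3'), Decimal('1.30'))     # 0.00
negative = ctx.subtract(Decimal('1.3'), Decimal('2.07'))      # -0.77

# 999999999 + 1 has ten digits, so it is rounded to nine.
carried = ctx.add(Decimal('999999999'), Decimal('1'))         # 1.00000000E+9

print(mixed_exponents, zero_diff, negative, carried)
```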


compare

compare takes two operands and compares their values numerically. If either operand is a special value then the general rules apply. No flags are set unless an operand is a signaling NaN.

Otherwise, the operands are compared as follows.

If the signs of the operands differ, a value representing each operand (’-1’ if the operand is less than zero, ’0’ if the operand is zero or negative zero, or ’1’ if the operand is greater than zero) is used in place of that operand for the comparison instead of the actual operand.[2] 

The comparison is then effected by subtracting the second operand from the first and then returning a value according to the result of the subtraction: ’-1’ if the result is less than zero, ’0’ if the result is zero or negative zero, or ’1’ if the result is greater than zero.

An implementation may use this operation ‘under the covers’ to implement a closed set of comparison operations (greater than, equal, etc.) if desired. It need not, in this case, expose the compare operation itself.

Examples:

  compare(’2.1’, ’3’)     ==>  ’-1’
  compare(’2.1’, ’2.1’)   ==>  ’0’
  compare(’2.1’, ’2.10’)  ==>  ’0’
  compare(’3’, ’2.1’)     ==>  ’1’
  compare(’2.1’, ’-3’)    ==>  ’1’
  compare(’-3’, ’2.1’)    ==>  ’-1’
Notes:
  1. The result of compare is always exact and unrounded, and may be a NaN.
  2. The compare-total operation can be used for a non-numerical comparison which provides a total ordering over the abstract representation of values.
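As a sketch of the two notes above, assuming Python's decimal module: compare treats 2.1 and 2.10 as equal and can return a NaN, whereas compare-total (Decimal.compare_total in that module) distinguishes the two representations.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

equal = ctx.compare(Decimal('2.1'), Decimal('2.10'))       # 0
greater = ctx.compare(Decimal('2.1'), Decimal('-3'))       # 1
unordered = ctx.compare(Decimal('NaN'), Decimal('1'))      # NaN

# Total ordering over representations: 2.1 sorts above 2.10.
total_order = Decimal('2.1').compare_total(Decimal('2.10'))  # 1

print(equal, greater, unordered, total_order)
```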


compare-signal

compare-signal takes two operands and compares their values numerically. This operation is identical to compare, except that if neither operand is a signaling NaN then any quiet NaN operand is treated as though it were a signaling NaN. (That is, all NaNs signal, with signaling NaNs taking precedence over quiet NaNs.)
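The difference from compare shows up only in the flags. A minimal sketch, assuming Python's decimal module and a context with traps disabled:

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
quiet_nan, one = Decimal('NaN'), Decimal('1')

quiet_result = ctx.compare(quiet_nan, one)
quiet_flag = bool(ctx.flags[InvalidOperation])      # not set

ctx.clear_flags()
signal_result = ctx.compare_signal(quiet_nan, one)
signal_flag = bool(ctx.flags[InvalidOperation])     # set

# Both results are NaN; only compare_signal records the condition.
print(quiet_result, quiet_flag, signal_result, signal_flag)
```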


divide

divide takes two operands. If either operand is a special value then the general rules apply.

Otherwise, if the divisor is zero then either the Division undefined condition is raised (if the dividend is zero) and the result is NaN, or the Division by zero condition is raised and the result is an Infinity with a sign which is the exclusive or of the signs of the operands.

Otherwise, a ‘long division’ is effected, as follows:

The result is then rounded to precision digits, if necessary, according to the rounding algorithm and taking into account the remainder from the division.

Examples:

  divide(’1’, ’3’  )      ==>  ’0.333333333’
  divide(’2’, ’3’  )      ==>  ’0.666666667’
  divide(’5’, ’2’  )      ==>  ’2.5’
  divide(’1’, ’10’ )      ==>  ’0.1’
  divide(’12’, ’12’)      ==>  ’1’
  divide(’8.00’, ’2’)     ==>  ’4.00’
  divide(’2.400’, ’2.0’)  ==>  ’1.20’
  divide(’1000’, ’100’)   ==>  ’10’
  divide(’1000’, ’1’)     ==>  ’1000’
  divide(’2.40E+6’, ’2’)  ==>  ’1.20E+6’
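The following sketch, assuming Python's decimal module, reproduces some of these results and the two zero-divisor cases. In that module the Division undefined condition is reported through the InvalidOperation flag; that mapping is a property of the module rather than of this specification.

```python
from decimal import (Context, Decimal, DivisionByZero,
                     InvalidOperation, ROUND_HALF_UP)

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

two_thirds = ctx.divide(Decimal('2'), Decimal('3'))     # 0.666666667
exact = ctx.divide(Decimal('2.400'), Decimal('2.0'))    # 1.20

undefined = ctx.divide(Decimal('0'), Decimal('0'))      # NaN
undefined_flag = bool(ctx.flags[InvalidOperation])

ctx.clear_flags()
infinite = ctx.divide(Decimal('-1'), Decimal('0'))      # -Infinity
by_zero_flag = bool(ctx.flags[DivisionByZero])

print(two_thirds, exact, undefined, infinite)
```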
Note that the results as described above can alternatively be expressed as follows:


divide-integer

divide-integer takes two operands; it divides two numbers and returns the integer part of the result. If either operand is a special value then the general rules apply.

Otherwise, the result returned is defined to be that which would result from repeatedly subtracting the divisor from the dividend while the dividend is larger than or equal to the divisor. During this subtraction, the absolute values of both the dividend and the divisor are used: the sign of the final result is the same as that which would result if normal division were used.

In other words, if the operands x and y were given to the divide-integer and remainder operations, resulting in i and r respectively, then the identity

  x = i × y + r

holds.

    The exponent of the result must be 0. Hence, if the result cannot be expressed exactly within precision digits, the operation is in error and will fail – that is, the result cannot have more digits than the value of precision in effect for the operation, and will not be rounded. For example, divide-integer(’10000000000’, ’3’) requires ten digits to express the result exactly (’3333333333’) and would therefore fail if precision were in the range 1 through 9.

    Notes:

    1. The divide-integer operation may not give the same result as truncating normal division (which could be affected by rounding and might be Inexact).
    2. The divide-integer and remainder operations are defined so that they may be calculated as a by-product of the standard division operation (described above). The division process is ended as soon as the integer result is available; the residue of the dividend is the remainder.
    3. The divide and divide-integer operations on the same operands give results of the same numerical value if no error occurs and there is no residue from the divide-integer operation.

    Examples:

      divide-integer(’2’, ’3’)    ==>  ’0’
      divide-integer(’10’, ’3’)   ==>  ’3’
      divide-integer(’1’, ’0.3’)  ==>  ’3’
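The identity x = i × y + r and the failure case can be checked with this sketch, which assumes Python's decimal module (divide-integer appears there as Context.divide_int). With traps disabled, the failure is reported as a NaN result and an InvalidOperation flag.

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
x, y = Decimal('1'), Decimal('0.3')

i = ctx.divide_int(x, y)     # 3
r = ctx.remainder(x, y)      # 0.1
identity_holds = ctx.add(ctx.multiply(i, y), r) == x

# The quotient 3333333333 needs ten digits, so the operation fails.
too_long = ctx.divide_int(Decimal('10000000000'), Decimal('3'))
failed = too_long.is_nan() and bool(ctx.flags[InvalidOperation])

print(i, r, identity_holds, failed)
```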
    


    exp

    exp takes one operand. If the operand is a NaN then the general rules for special values apply.

    Otherwise, the result is e raised to the power of the operand, with the following cases:

    Examples:

      exp(’-Infinity’)    ==> ’0’
      exp(’-1’)           ==> ’0.367879441’
      exp(’0’)            ==> ’1’
      exp(’1’)            ==> ’2.71828183’
      exp(’0.693147181’)  ==> ’2.00000000’
      exp(’+Infinity’)    ==> ’Infinity’
    
    Notes:
    1. The rounding setting in the context is not used; this means that the algorithm described in Variable Precision Exponential Function by T. E. Hull and A. Abrham (ACM Transactions on Mathematical Software, Vol 12 #2, pp79–91, ACM, June 1986) may be used for this operation.
    2. When the result is inexact, the cost of exp at precision d is likely to be at least 13×log2(d) times the cost of an inexact multiplication at the same precision (see Multiple-precision zero-finding methods and the complexity of elementary function evaluation by R. P. Brent, in Analytic Computational Complexity pp151–176, Academic Press, New York, 1976, and Fast Multiple-Precision Evaluation of Elementary Functions by the same author, in Journal of the ACM (JACM), Vol 23 #2, pp242–251, ACM, April 1976).
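The examples above can be reproduced as follows. This is a sketch assuming Python's decimal module, not a statement about how exp must be implemented.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

operands = ['-Infinity', '-1', '0', '1', '0.693147181', '+Infinity']
results = [str(ctx.exp(Decimal(text))) for text in operands]

for text, value in zip(operands, results):
    print(f"exp('{text}') ==> '{value}'")
```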


    fused-multiply-add

    fused-multiply-add takes three operands; the first two are multiplied together, using multiply, with sufficient precision and exponent range that the result is exact and unrounded.[4]  No flags are set by the multiplication unless one of the first two operands is a signaling NaN or one is a zero and the other is an infinity.

    Unless the multiplication failed, the third operand is then added to the result of that multiplication, using add, under the current context.

    In other words, fused-multiply-add(x, y, z) delivers a result which is (x × y) + z with only the one, final, rounding.

    Examples:

      fused-multiply-add(’3’, ’5’, ’7’)               ==>  ’22’
      fused-multiply-add(’3’, ’-5’, ’7’)              ==>  ’-8’
      fused-multiply-add(’888565290’, ’1557.96930’,
                                      ’-86087.7578’)  ==>  ’1.38435736E+12’
    
    Note that the last example would have given the result ’1.38435735E+12’ if the operation had been carried out as a separate multiply followed by an add.
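The effect of the single final rounding can be seen by computing the last example both ways. The sketch assumes Python's decimal module, where the operation is Context.fma.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
x = Decimal('888565290')
y = Decimal('1557.96930')
z = Decimal('-86087.7578')

fused = ctx.fma(x, y, z)                      # one rounding
separate = ctx.add(ctx.multiply(x, y), z)     # two roundings

print(fused)      # 1.38435736E+12
print(separate)   # 1.38435735E+12
```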


    ln

    ln takes one operand. If the operand is a NaN then the general rules for special values apply.

    Otherwise, the operand must be a zero or positive, and the result is the natural (base e) logarithm of the operand, with the following cases:

    Examples:

      ln(’0’)           ==> ’-Infinity’
      ln(’1.000’)       ==> ’0’
      ln(’2.71828183’)  ==> ’1.00000000’
      ln(’10’)          ==> ’2.30258509’
      ln(’+Infinity’)   ==> ’Infinity’
    
    Notes:
    1. The rounding setting in the context is not used.
    2. When the result is inexact, the cost of ln at a given precision is likely to be similar to, or more expensive than, the exp function (see notes under that function).
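A short sketch of the ln examples, assuming Python's decimal module. The negative operand is an addition to the examples above; with traps disabled it yields a NaN and sets the InvalidOperation flag.

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

ln_zero = ctx.ln(Decimal('0'))            # -Infinity
ln_one = ctx.ln(Decimal('1.000'))         # 0
ln_e = ctx.ln(Decimal('2.71828183'))      # 1.00000000
ln_ten = ctx.ln(Decimal('10'))            # 2.30258509

ln_negative = ctx.ln(Decimal('-1'))       # NaN: operand must not be negative
invalid_flag = bool(ctx.flags[InvalidOperation])

print(ln_zero, ln_one, ln_e, ln_ten, ln_negative)
```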


    log10

    log10 takes one operand. If the operand is a NaN then the general rules for special values apply.

    Otherwise, the operand must be a zero or positive, and the result is the base 10 logarithm of the operand, with the following cases:

    Examples:

      log10(’0’)          ==> ’-Infinity’
      log10(’0.001’)      ==> ’-3’
      log10(’1.000’)      ==> ’0’
      log10(’2’)          ==> ’0.301029996’
      log10(’10’)         ==> ’1’
      log10(’70’)         ==> ’1.84509804’
      log10(’+Infinity’)  ==> ’Infinity’
    
    Notes:
    1. The rounding setting in the context is not used.
    2. When the result is inexact, the cost of log10 at a given precision is likely to be similar to, or more expensive than, the exp function (see notes under that function).
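The log10 examples can be tried with the sketch below, assuming Python's decimal module. It also records whether the Inexact flag is set after taking the logarithm of an exact power of ten.

```python
from decimal import Context, Decimal, Inexact, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

exact_power = ctx.log10(Decimal('0.001'))     # -3
exact_flag = bool(ctx.flags[Inexact])         # not set for 0.001

log_two = ctx.log10(Decimal('2'))             # 0.301029996
log_seventy = ctx.log10(Decimal('70'))        # 1.84509804
log_one = ctx.log10(Decimal('1.000'))         # 0

print(exact_power, log_two, log_seventy, log_one)
```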


    max

    max takes two operands, compares their values numerically, and returns the maximum.[5]  If either operand is a NaN then the general rules apply, unless one is a quiet NaN and the other is numeric, in which case the numeric operand is returned.[6] 

    Otherwise, the operands are compared as though by the compare operation. If they are not numerically equal then the maximum (closer to positive infinity) of the two operands is chosen as the result. Otherwise (they are numerically equal):

    For numerical results, the result is the same as using the plus operation on the chosen operand, except that the sign of a zero does not change.

    Examples:

      max(’3’, ’2’)    ==>  ’3’
      max(’-10’, ’3’)  ==>  ’3’
      max(’1.0’, ’1’)  ==>  ’1’
      max(’7’, ’NaN’)  ==>  ’7’
    


    max-magnitude

    max-magnitude takes two operands and compares their values numerically with their sign ignored and assumed to be 0.

    If, without signs, the first operand is the larger then the original first operand is returned (that is, with the original sign). If, without signs, the second operand is the larger then the original second operand is returned. Otherwise the result is the same as from the max operation.
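A sketch contrasting max and max-magnitude, assuming Python's decimal module (Context.max and Context.max_mag):

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

largest = ctx.max(Decimal('-10'), Decimal('3'))          # 3
equal_values = ctx.max(Decimal('1.0'), Decimal('1'))     # 1
with_nan = ctx.max(Decimal('7'), Decimal('NaN'))         # 7

# Signs are ignored for the comparison but kept in the result.
largest_mag = ctx.max_mag(Decimal('-10'), Decimal('3'))  # -10
# Equal magnitudes fall back to the max operation.
tie_mag = ctx.max_mag(Decimal('-3'), Decimal('3'))       # 3

print(largest, equal_values, with_nan, largest_mag, tie_mag)
```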


    min

    min takes two operands, compares their values numerically, and returns the minimum.[7]  If either operand is a NaN then the general rules apply, unless one is a quiet NaN and the other is numeric, in which case the numeric operand is returned.

    Otherwise, the operands are compared as though by the compare operation. If they are not numerically equal then the minimum (closer to negative infinity) of the two operands is chosen as the result. Otherwise (they are numerically equal):

    For numerical results, the result is the same as using the plus operation on the chosen operand, except that the sign of a zero does not change.

    Examples:

      min(’3’, ’2’)    ==>  ’2’
      min(’-10’, ’3’)  ==>  ’-10’
      min(’1.0’, ’1’)  ==>  ’1.0’
      min(’7’, ’NaN’)  ==>  ’7’
    


    min-magnitude

    min-magnitude takes two operands and compares their values numerically with their sign ignored and assumed to be 0.

    If, without signs, the first operand is the smaller then the original first operand is returned (that is, with the original sign). If, without signs, the second operand is the smaller then the original second operand is returned. Otherwise the result is the same as from the min operation.
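The corresponding sketch for min and min-magnitude, again assuming Python's decimal module (Context.min and Context.min_mag):

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

smallest = ctx.min(Decimal('-10'), Decimal('3'))          # -10
equal_values = ctx.min(Decimal('1.0'), Decimal('1'))      # 1.0
with_nan = ctx.min(Decimal('7'), Decimal('NaN'))          # 7

# Signs are ignored for the comparison but kept in the result.
smallest_mag = ctx.min_mag(Decimal('-10'), Decimal('3'))  # 3
# Equal magnitudes fall back to the min operation.
tie_mag = ctx.min_mag(Decimal('-3'), Decimal('3'))        # -3

print(smallest, equal_values, with_nan, smallest_mag, tie_mag)
```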


    minus and plus

    minus and plus both take one operand, and correspond to the prefix minus and plus operators in programming languages.

    The operations are evaluated using the same rules as add and subtract; the operations plus(a) and minus(b) (where a and b refer to any numbers) are calculated as the operations add(’0’, a) and subtract(’0’, b) respectively, where the ’0’ has the same exponent as the operand.

    Examples:

      plus(’1.3’)    ==>  ’1.3’
      plus(’-1.3’)   ==>  ’-1.3’
      minus(’1.3’)   ==>  ’-1.3’
      minus(’-1.3’)  ==>  ’1.3’
    
    Note that the result of these operations is affected by context and may set flags. The copy-negate operation may be used instead of minus if this is not desired.
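Because plus and minus are defined through add and subtract, they round their operand and treat zero as an addition would. A small sketch, assuming Python's decimal module:

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

rounded = ctx.plus(Decimal('1234567890123'))   # 1.23456789E+12
negated = ctx.minus(Decimal('1.3'))            # -1.3

# subtract('0', '0') gives '0' under round-half-up, so minus does too,
# whereas copy-negate simply inverts the sign.
minus_zero = ctx.minus(Decimal('0'))           # 0
copied_zero = Decimal('0').copy_negate()       # -0

print(rounded, negated, minus_zero, copied_zero)
```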


    multiply

    multiply takes two operands. If either operand is a special value then the general rules apply.

    Otherwise, the operands are multiplied together (‘long multiplication’), resulting in a number which may be as long as the sum of the lengths of the two operands, as follows:

    The result is then rounded to precision digits if necessary, counting from the most significant digit of the result.

    Examples:

      multiply(’1.20’, ’3’)         ==>  ’3.60’
      multiply(’7’, ’3’)            ==>  ’21’
      multiply(’0.9’, ’0.8’)        ==>  ’0.72’
      multiply(’0.9’, ’-0’)         ==>  ’-0.0’
      multiply(’654321’, ’654321’)  ==>  ’4.28135971E+11’
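The last example can be examined more closely by comparing the rounded result with the exact product, computed under a wider precision. The sketch assumes Python's decimal module; the 12-digit context is chosen here purely for illustration.

```python
from decimal import Context, Decimal, Inexact, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
wide = Context(prec=12, rounding=ROUND_HALF_UP, traps=[])
n = Decimal('654321')

exact = wide.multiply(n, n)        # 428135971041: 6 + 6 = 12 digits
rounded = ctx.multiply(n, n)       # 4.28135971E+11
was_inexact = bool(ctx.flags[Inexact])

signed_zero = ctx.multiply(Decimal('0.9'), Decimal('-0'))   # -0.0

print(exact, rounded, was_inexact, signed_zero)
```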
    


    next-minus

    next-minus takes one operand; if the operand is a NaN then the general rules apply. Otherwise the result is the largest representable number that is smaller than the operand unless the operand is –Infinity, in which case the result is –Infinity. If the result is zero its sign will be 0 and its exponent will be the smallest possible. No flags will be set when the operand is numeric.

    In the following examples, Emax and Emin are assumed to be +999 and –999 respectively.

    Examples:

      next-minus(’1’)            ==>  ’0.999999999’
      next-minus(’1E-1007’)      ==>  ’0E-1007’
      next-minus(’-1.00000003’)  ==>  ’-1.00000004’
      next-minus(’Infinity’)     ==>  ’9.99999999E+999’
    


    next-plus

    next-plus takes one operand; if the operand is a NaN then the general rules apply. Otherwise the result is the smallest representable number that is larger than the operand unless the operand is +Infinity, in which case the result is +Infinity. If the result is zero its sign will be 1 and its exponent will be the smallest possible. No flags will be set when the operand is numeric.

    In the following examples, Emax and Emin are assumed to be +999 and –999 respectively.

    Examples:

      next-plus(’1’)            ==>  ’1.00000001’
      next-plus(’-1E-1007’)     ==>  ’-0E-1007’
      next-plus(’-1.00000003’)  ==>  ’-1.00000002’
      next-plus(’-Infinity’)    ==>  ’-9.99999999E+999’
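The next-minus and next-plus examples depend on the exponent limits, so this sketch sets Emax and Emin explicitly. It assumes Python's decimal module (Context.next_minus and Context.next_plus).

```python
from decimal import Context, Decimal, Inexact, Underflow, ROUND_HALF_UP

ctx = Context(prec=9, Emax=999, Emin=-999,
              rounding=ROUND_HALF_UP, traps=[])

below_one = ctx.next_minus(Decimal('1'))            # 0.999999999
to_zero = ctx.next_minus(Decimal('1E-1007'))        # 0E-1007
below_inf = ctx.next_minus(Decimal('Infinity'))     # 9.99999999E+999

above_one = ctx.next_plus(Decimal('1'))             # 1.00000001
to_neg_zero = ctx.next_plus(Decimal('-1E-1007'))    # -0E-1007

# No flags are expected for numeric operands.
flags_clear = not (ctx.flags[Inexact] or ctx.flags[Underflow])

print(below_one, to_zero, below_inf, above_one, to_neg_zero)
```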
    


    next-toward

    next-toward takes two operands; if either operand is a NaN then the general rules apply. Otherwise the result is the representable number closest to the first operand (but not the first operand) that is in the direction towards the second operand, unless the operands have the same value. Specifically: In the first two cases, flags are set as though the operation had been computed by adding (in the first case) or subtracting (in the second) an infinitesimally small positive value to or from the first operand with the rounding mode set to be round-ceiling or round-floor respectively.[8] 

    In the following examples, Emax and Emin are assumed to be +999 and –999 respectively.

    Examples:

      next-toward(’1’, ’2’)              ==>  ’1.00000001’
      next-toward(’-1E-1007’, ’1’)       ==>  ’-0E-1007’
      next-toward(’-1.00000003’, ’0’)    ==>  ’-1.00000002’
      next-toward(’1’, ’0’)              ==>  ’0.999999999’
      next-toward(’1E-1007’, ’-100’)     ==>  ’0E-1007’
      next-toward(’-1.00000003’, ’-10’)  ==>  ’-1.00000004’
      next-toward(’0.00’, ’-0.0000’)     ==>  ’-0.00’
    
    This operation derives its anomalous rules for flags from the IEEE 754-1985 operation nextAfter; the operation was dropped from the IEEE 754-2008 standard.
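A sketch of the next-toward examples, assuming Python's decimal module (Context.next_toward) and the same exponent limits as above:

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, Emax=999, Emin=-999,
              rounding=ROUND_HALF_UP, traps=[])

upward = ctx.next_toward(Decimal('1'), Decimal('2'))            # 1.00000001
downward = ctx.next_toward(Decimal('1'), Decimal('0'))          # 0.999999999
to_neg_zero = ctx.next_toward(Decimal('-1E-1007'), Decimal('1'))  # -0E-1007

# Operands of equal value: the first operand takes the second's sign.
same_value = ctx.next_toward(Decimal('0.00'), Decimal('-0.0000'))  # -0.00

print(upward, downward, to_neg_zero, same_value)
```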


    power

    power takes two operands, and raises a number (the left-hand operand) to a power (the right-hand operand). If either operand is a special value then the general rules apply, except as stated below.

    The following rules apply:

    Examples:

      power(’2’, ’3’)             ==>  ’8’
      power(’-2’, ’3’)            ==>  ’-8’
      power(’2’, ’-3’)            ==>  ’0.125’
      power(’1.7’, ’8’)           ==>  ’69.7575744’
      power(’10’, ’0.301029996’)  ==>  ’2.00000000’
      power(’Infinity’, ’-1’)     ==>  ’0’
      power(’Infinity’, ’0’)      ==>  ’1’
      power(’Infinity’, ’1’)      ==>  ’Infinity’
      power(’-Infinity’, ’-1’)    ==>  ’-0’
      power(’-Infinity’, ’0’)     ==>  ’1’
      power(’-Infinity’, ’1’)     ==>  ’-Infinity’
      power(’-Infinity’, ’2’)     ==>  ’Infinity’
      power(’0’, ’0’)             ==>  ’NaN’
    
    Notes:
    1. When the result is inexact, the cost of power at a given precision is likely to be at least twice as expensive as the exp function (see notes under that function).
    2. An infinite result is always exact, as described in the general rules.
    3. Versions of this specification prior to version 1.48 defined a simpler power operation which only required support for integer powers.
    4. It can be argued that the special cases where one operand is zero and the other is infinite (such as power(’0’, ’Infinity’) and power(’Infinity’, ’0’)) should return a NaN, whereas the specification above leads to results of 0 and 1 respectively for the two examples (for compatibility with the earlier version of this operation). If NaN results are desired instead, then these special cases should be tested for before calling the power operation.
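Note 4 suggests testing for the zero and infinity cases before calling power. One possible shape for such a test is sketched below, assuming Python's decimal module. The helper name power_nan_for_zero_inf is hypothetical and is not part of this specification or of that module.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

cube = ctx.power(Decimal('2'), Decimal('3'))                    # 8
inverse = ctx.power(Decimal('2'), Decimal('-3'))                # 0.125
fractional = ctx.power(Decimal('10'), Decimal('0.301029996'))   # 2.00000000
zero_zero = ctx.power(Decimal('0'), Decimal('0'))               # NaN

def power_nan_for_zero_inf(context, x, y):
    """Hypothetical wrapper: NaN when one operand is zero and the other infinite."""
    if (x.is_zero() and y.is_infinite()) or (x.is_infinite() and y.is_zero()):
        return Decimal('NaN')
    return context.power(x, y)

as_specified = ctx.power(Decimal('Infinity'), Decimal('0'))     # 1
pre_tested = power_nan_for_zero_inf(ctx, Decimal('Infinity'), Decimal('0'))

print(cube, inverse, fractional, zero_zero, as_specified, pre_tested)
```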


    quantize

    quantize takes two operands. If either operand is a special value then the general rules apply, except that if either operand is infinite and the other is finite an Invalid operation condition is raised and the result is [0,qNaN], or if both are infinite then the result is the first operand.

    Otherwise (both operands are finite), quantize returns the number which is equal in value (except for any rounding) and sign to the first (left-hand) operand and which has an exponent set to be equal to the exponent of the second (right-hand) operand.

    The coefficient of the result is derived from that of the left-hand operand. It may be rounded using the current rounding setting (if the exponent is being increased), multiplied by a positive power of ten (if the exponent is being decreased), or is unchanged (if the exponent is already equal to that of the right-hand operand).

    Unlike other operations, if the length of the coefficient after the quantize operation would be greater than precision then an Invalid operation condition is raised. This guarantees that, unless there is an error condition, the exponent of the result of a quantize is always equal to that of the right-hand operand.

    Also unlike other operations, quantize will never raise Underflow, even if the result is subnormal and inexact.

    Examples:

      quantize(’2.17’, ’0.001’)        ==>  ’2.170’
      quantize(’2.17’, ’0.01’)         ==>  ’2.17’
      quantize(’2.17’, ’0.1’)          ==>  ’2.2’
      quantize(’2.17’, ’1e+0’)         ==>  ’2’
      quantize(’2.17’, ’1e+1’)         ==>  ’0E+1’
      quantize(’-Inf’, ’Infinity’)     ==>  ’-Infinity’
      quantize(’2’,    ’Infinity’)     ==>  ’NaN’
      quantize(’-0.1’, ’1’  )          ==>  ’-0’
      quantize(’-0’,   ’1e+5’)         ==>  ’-0E+5’
      quantize(’+35236450.6’, ’1e-2’)  ==>  ’NaN’
      quantize(’-35236450.6’, ’1e-2’)  ==>  ’NaN’
      quantize(’217’,  ’1e-1’)         ==>  ’217.0’
      quantize(’217’,  ’1e+0’)         ==>  ’217’
      quantize(’217’,  ’1e+1’)         ==>  ’2.2E+2’
      quantize(’217’,  ’1e+2’)         ==>  ’2E+2’
    
    Notes:
    1. In the penultimate example the result is [0,22,1], leading to the string in scientific notation as shown.
    2. This operation was previously called rescale, which had identical semantics except that the second operand specified the power of ten of the quantum. The quantize semantics specifies the desired quantum by example, which allows a faster implementation in most cases.
    3. The sign and coefficient of the second operand are ignored; this allows a ‘match the quantum of a variable’ operation to be effected directly.
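The sketch below, assuming Python's decimal module, shows rounding when the exponent is increased, the Invalid operation case when the coefficient would exceed precision, and note 3 (only the exponent of the second operand matters). The operand '-999.9' is an arbitrary value chosen for illustration.

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

tenths = ctx.quantize(Decimal('2.17'), Decimal('0.1'))     # 2.2
tens = ctx.quantize(Decimal('217'), Decimal('1e+1'))       # 2.2E+2
all_gone = ctx.quantize(Decimal('2.17'), Decimal('1e+1'))  # 0E+1

# Sign and coefficient of the right-hand operand are ignored.
by_example = ctx.quantize(Decimal('2.17'), Decimal('-999.9'))   # 2.2

# 3523645060 would need ten digits, so the operation is invalid.
too_long = ctx.quantize(Decimal('35236450.6'), Decimal('1e-2'))
invalid_flag = bool(ctx.flags[InvalidOperation])

print(tenths, tens, all_gone, by_example, too_long)
```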


    reduce

    reduce takes one operand. It has the same semantics as the plus operation, except that if the final result is finite it is reduced to its simplest form, with all trailing zeros removed and its sign preserved.

    That is, while the coefficient is non-zero and a multiple of ten the coefficient is divided by ten and the exponent is incremented by 1. Otherwise (the coefficient is zero) the exponent is set to 0. In all cases the sign is unchanged.

    Examples:

      reduce(’2.1’)     ==>  ’2.1’
      reduce(’-2.0’)    ==>  ’-2’
      reduce(’1.200’)   ==>  ’1.2’
      reduce(’-120’)    ==>  ’-1.2E+2’
      reduce(’120.00’)  ==>  ’1.2E+2’
      reduce(’0.00’)    ==>  ’0’
    
    This operation was called normalize prior to version 1.68 of this specification.
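The trailing-zero rule can be written out directly on a [sign, coefficient, exponent] triple. The function reduce_triple below is a hypothetical helper for illustration; it omits the initial plus step, so it applies only to operands that need no rounding. It is compared with Python's decimal module, where the operation is still named normalize.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

def reduce_triple(sign, coefficient, exponent):
    """Hypothetical helper: strip trailing zeros from an unrounded triple."""
    if coefficient == 0:
        return sign, 0, 0              # zero: exponent is set to 0
    while coefficient % 10 == 0:       # non-zero multiple of ten
        coefficient //= 10
        exponent += 1
    return sign, coefficient, exponent

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

by_hand = reduce_triple(1, 120, 0)                 # (1, 12, 1)
by_library = ctx.normalize(Decimal('-120'))        # -1.2E+2
zero = ctx.normalize(Decimal('0.00'))              # 0

print(by_hand, by_library, zero)
```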


    remainder

    remainder takes two operands; it returns the remainder from integer division. If either operand is a special value then the general rules apply.

    Otherwise, the result is the residue of the dividend after the operation of calculating integer division as described for divide-integer, rounded to precision digits if necessary. The sign of the result, if non-zero, is the same as that of the original dividend.

    This operation will fail under the same conditions as integer division (that is, if integer division on the same two operands would fail, the remainder cannot be calculated).

    Examples:

      remainder(’2.1’, ’3’)    ==>  ’2.1’
      remainder(’10’, ’3’)     ==>  ’1’
      remainder(’-10’, ’3’)    ==>  ’-1’
      remainder(’10.2’, ’1’)   ==>  ’0.2’
      remainder(’10’, ’0.3’)   ==>  ’0.1’
      remainder(’3.6’, ’1.3’)  ==>  ’1.0’
    
    Notes:
    1. The divide-integer and remainder operations are defined so that they may be calculated as a by-product of the standard division operation (described above). The division process is ended as soon as the integer result is available; the residue of the dividend is the remainder.
    2. The sign of the result will always be the sign of the dividend.
    3. The remainder operation differs from the remainder operation defined in IEEE 754 (the remainder-near operator), in that it gives the same results for numbers whose values are equal to integers as would the usual remainder operator on integers.
      For example, the result of the operation remainder(’10’, ’6’) as defined here is ’4’, and remainder(’10.0’, ’6’) would give ’4.0’ (as would remainder(’10’, ’6.0’) or remainder(’10.0’, ’6.0’)). The IEEE 754 remainder operation would, however, give the result ’-2’ because its integer division step chooses the closest integer, not the one nearer zero.


    remainder-near

    remainder-near takes two operands. If either operand is a special value then the general rules apply.

    Otherwise, if the operands are given by x and y, then the result is defined to be x – y × n, where n is the integer nearest the exact value of x ÷ y (if two integers are equally near then the even one is chosen). If the result is equal to 0 then its sign will be the sign of x. (See IEEE 754 §5.3.1.)

    This operation will fail under the same conditions as integer division (that is, if integer division on the same two operands would fail, the remainder cannot be calculated), except when the quotient is very close to 10 raised to the power of the precision.[10] 

    Examples:

      remainder-near(’2.1’, ’3’)    ==>  ’-0.9’
      remainder-near(’10’, ’6’)     ==>  ’-2’
      remainder-near(’10’, ’3’)     ==>  ’1’
      remainder-near(’-10’, ’3’)    ==>  ’-1’
      remainder-near(’10.2’, ’1’)   ==>  ’0.2’
      remainder-near(’10’, ’0.3’)   ==>  ’0.1’
      remainder-near(’3.6’, ’1.3’)  ==>  ’-0.3’
    
    Notes:
    1. The remainder-near operation differs from the remainder operation in that it does not give the same results for numbers whose values are equal to integers as would the usual remainder operator on integers. For example, the operation remainder(’10’, ’6’) gives the result ’4’, and remainder(’10.0’, ’6’) gives ’4.0’ (as would the operations remainder(’10’, ’6.0’) or remainder(’10.0’, ’6.0’)). However, remainder-near(’10’, ’6’) gives the result ’-2’ because its integer division step chooses the closest integer, not the one nearer zero.
    2. The result of this operation is always exact.
    3. This operation is sometimes known as ‘IEEE remainder’.
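The contrast drawn in note 1 can be seen side by side in this sketch, which assumes Python's decimal module (Context.remainder and Context.remainder_near):

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

pairs = [('10', '6'), ('10.0', '6'), ('3.6', '1.3'), ('-10', '3'), ('2.1', '3')]

# remainder truncates the quotient towards zero.
truncating = [str(ctx.remainder(Decimal(x), Decimal(y))) for x, y in pairs]
# remainder-near uses the nearest integer quotient.
nearest = [str(ctx.remainder_near(Decimal(x), Decimal(y))) for x, y in pairs]

for (x, y), t, n in zip(pairs, truncating, nearest):
    print(f"{x} rem {y}: remainder={t}, remainder-near={n}")
```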


    round-to-integral-exact

    round-to-integral-exact takes one operand. If the operand is a special value, or the exponent of the operand is non-negative, then the result is the same as the operand (unless the operand is a signaling NaN, as usual).

    Otherwise (the operand has a negative exponent) the result is the same as using the quantize operation using the given operand as the left-hand-operand, 1E+0 as the right-hand-operand, and the precision of the operand as the precision setting. The rounding mode is taken from the context, as usual.

    Examples:

      round-to-integral-exact(’2.1’)      ==>  ’2’
      round-to-integral-exact(’100’)      ==>  ’100’
      round-to-integral-exact(’100.0’)    ==>  ’100’
      round-to-integral-exact(’101.5’)    ==>  ’102’
      round-to-integral-exact(’-101.5’)   ==>  ’-102’
      round-to-integral-exact(’10E+5’)    ==>  ’1.0E+6’
      round-to-integral-exact(’7.89E+77’) ==>  ’7.89E+77’
      round-to-integral-exact(’-Inf’)     ==>  ’-Infinity’
    


    round-to-integral-value

    round-to-integral-value takes one operand. It is identical to the round-to-integral-exact operation except that the Inexact and Rounded flags are never set even if the operand is rounded (that is, the operation is quiet unless the operand is a signaling NaN).
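The two operations differ only in the flags they set. A minimal sketch, assuming Python's decimal module (Context.to_integral_exact and Context.to_integral_value):

```python
from decimal import Context, Decimal, Inexact, Rounded, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])
operand = Decimal('101.5')

exact_variant = ctx.to_integral_exact(operand)            # 102
exact_flags = bool(ctx.flags[Inexact]) and bool(ctx.flags[Rounded])

ctx.clear_flags()
quiet_variant = ctx.to_integral_value(operand)            # 102
quiet_flags = bool(ctx.flags[Inexact]) or bool(ctx.flags[Rounded])

# A non-negative exponent leaves the operand unchanged.
unchanged = ctx.to_integral_exact(Decimal('10E+5'))       # 1.0E+6

print(exact_variant, exact_flags, quiet_variant, quiet_flags, unchanged)
```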


    square-root

    square-root takes one operand. If the operand is a special value then the general rules apply.

    Otherwise, the ideal exponent of the result is defined to be half the exponent of the operand (rounded to an integer, towards –Infinity,[11]  if necessary) and then:

    Examples:

      square-root(’0’)     ==> ’0’
      square-root(’-0’)    ==> ’-0’
      square-root(’0.39’)  ==> ’0.624499800’
      square-root(’100’)   ==> ’10’
      square-root(’1’)     ==> ’1’
      square-root(’1.0’)   ==> ’1.0’
      square-root(’1.00’)  ==> ’1.0’
      square-root(’7’)     ==> ’2.64575131’
      square-root(’10’)    ==> ’3.16227766’
    
    Notes:
    1. The rounding setting in the context is not used; this means that the algorithm described in Properly Rounded Variable Precision Square Root by T. E. Hull and A. Abrham (ACM Transactions on Mathematical Software, Vol 11 #3, pp229–237, ACM, September 1985) may be used for this operation.
    2. A subnormal result is only possible if the working precision is greater than Emax+1.
    3. The rules for setting the exponent of the result apply to many operations; they can be used for any operation for which an ideal exponent can be defined.
    4. A negative zero is allowed as an operand as per IEEE 754 §5.4.1.
    5. Square-root can also be calculated by using the power operation (with a second operand of 0.5). The result in that case will not be exact in most cases, and may not be correctly rounded.[12] 
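The ideal exponent (half the operand's exponent, rounded towards –Infinity) can be computed with an arithmetic right shift, as footnote 11 observes. The sketch below, assuming Python's decimal module, compares that value with the exponent of exact results. The helper name ideal_exponent is hypothetical.

```python
from decimal import Context, Decimal, InvalidOperation, ROUND_HALF_UP

ctx = Context(prec=9, rounding=ROUND_HALF_UP, traps=[])

def ideal_exponent(operand):
    """Hypothetical helper: floor(exponent / 2) via arithmetic right shift."""
    return operand.as_tuple().exponent >> 1

exact_cases = ['100', '1', '1.0', '1.00']
roots = [ctx.sqrt(Decimal(text)) for text in exact_cases]
exponents_match = all(
    root.as_tuple().exponent == ideal_exponent(Decimal(text))
    for text, root in zip(exact_cases, roots)
)

inexact_root = ctx.sqrt(Decimal('7'))        # 2.64575131
negative_zero = ctx.sqrt(Decimal('-0'))      # -0
negative = ctx.sqrt(Decimal('-1'))           # NaN, Invalid operation
invalid_flag = bool(ctx.flags[InvalidOperation])

print([str(r) for r in roots], inexact_root, negative_zero, negative)
```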

    Footnotes:
    [1] In practice, it is only necessary to work with intermediate results of up to twice the current precision. Some rounding settings may require some inspection of possible remainders or additional digits (for example, to determine whether a result is exactly 0.5 in the next position), though their actual values would not be required.
    For round-half-up, rounding can be effected by truncating the result to precision (and adding the count of truncated digits to the exponent). The first truncated digit is then inspected, and if it has the value 5 through 9 the result is incremented by 1. This could cause the result to again exceed precision digits, in which case it is divided by 10 and the exponent is incremented by 1.
    [2] This rule removes the possibility of an arithmetic overflow during a numeric comparison.
    [3] In practice, only two bits need to be noted, indicating whether the remainder was 0, or was exactly half of the final coefficient of the divisor, or was in one of the two ranges above or below the half-way point.
    [4] This requires up to twice the current exponent range and a precision which is the sum of the lengths of the two operands’ coefficients.
    [5] This is the IEEE 754 maxnum operation, with an explicit result for equal operands.
    [6] This permits a useful ordering of data in which NaNs are used to indicate ‘unknown’ values.
    [7] This is the IEEE 754 minnum operation, with an explicit result for equal operands.
    [8] The result can in fact be computed by an appropriate addition, with one infinite value having a special case result and the sign of a zero result being set appropriately.
    [9] That is, any fractional part (after the decimal point) is non-zero.
    [10] This is a deviation from IEEE 754, necessary to assure realistic execution times when the operands have a wide range of exponents.
    [11] This rule matches the typical implementations. For example, the square-root of either [0,10,-1] or [0,11,-1] is often calculated by first multiplying the coefficient by ten and reducing the exponent by 1 and then determining the square root. If the exponent is held as a two’s complement binary number, the ideal exponent is trivially calculated by applying an arithmetic right shift of one bit.
    [12] This is because a typical implementation of power(x,y) will calculate its result using exp(ln(x)*y), and few results of the exp function are exact.
