# The Arithmetic Model

This specification is based on a model of decimal arithmetic which is a formalization of the decimal system of numeration (Algorism) as further defined and constrained by the relevant standards (IEEE 854, ANSI X3.274, and IEEE 754-2008).

There are three components to the model:

1. numbers – which represent the values which can be manipulated by, or be the results of, the core operations defined in this specification
2. operations – the core operations (such as addition, multiplication, etc.) which can be carried out on numbers
3. context – which represents the user-selectable parameters and rules which govern the results of arithmetic operations (for example, the precision to be used).

This specification defines these components in the abstract. It defines neither the way in which operations are expressed (which might vary depending on the computer language or other interface being used) nor the concrete representation (specific layout in storage, or in a processor’s register, for example) of numbers or context.

The remainder of this section describes the abstract model for each component.

### Abstract representation of numbers

Numbers represent the values which can be manipulated by, or be the results of, the core operations defined in this specification. Numbers may be finite numbers (numbers whose value can be represented exactly) or they may be special values (infinities and other values which are not finite numbers).

#### Finite numbers

Finite numbers are defined by three integer parameters:
1. sign – a value which must be either 0 or 1, where 1 indicates that the number is negative or is the negative zero and 0 indicates that the number is zero or positive.
2. coefficient – an integer which is zero or positive.

   In the abstract, there is no upper limit on the maximum size of the coefficient. In practice, an implementation may need to define a specific upper limit (for example, the length of the maximum coefficient supported by the concrete representation). This limit must be expressed as an integral number of decimal digits.
3. exponent – a signed integer which indicates the power of ten by which the coefficient is multiplied.

   In the abstract, there is no upper limit on the absolute value of the exponent. In practice there may be some upper limit, Elimit, on the absolute value of the exponent.

   If the coefficient has a maximum length then it is required that Elimit be greater than 5 × mlength, where mlength is the maximum length of the coefficient in decimal digits. It is recommended that Elimit be greater than 10 × mlength.

   The adjusted exponent is the value of the exponent of a number when that number is expressed as though in scientific notation with one digit (non-zero unless the coefficient is 0) before any decimal point. This is given by the value of the exponent+(clength–1), where clength is the length of the coefficient in decimal digits.

   When a limit to the exponent applies, it must result in a balanced range of positive or negative numbers, taking into account the magnitude of the coefficient. To achieve this balanced range, the minimum and maximum values of the adjusted exponent (Emin and Emax respectively) must have magnitudes which differ by no more than one, so Emin will be –Emax±1. IEEE 754 further constrains this so that Emin = 1–Emax.

   Therefore, if the length of the coefficient is clength digits, the exponent may take any of the values –Elimit–(clength–1)+1 through Elimit–(clength–1).

   For example, if the coefficient had the value 123456789 (9 digits) and the exponent had an Elimit of 999 (3 digits), then the exponent could range from –1006 through +991. This would allow positive values of the number to range from 1.23456789E–998 through 1.23456789E+999.

   It is recommended that Emax be expressed as an integral number of decimal digits or be one of the numbers 1, 5, or 25, multiplied by a positive integral power of ten and optionally reduced by one (for example, 49 or 50).

The numerical value of a finite number is given by: $(-1)^{\mathit{sign}} \times \mathit{coefficient} \times 10^{\mathit{exponent}}$.

The quantum of a finite number is given by: $1 \times 10^{\mathit{exponent}}$. This is the value of a unit in the least significant position of the coefficient of a finite number.
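
The definitions above can be sketched in a few lines of Python. This is only an illustration: the class `FiniteNumber` and the helper `exponent_range` are hypothetical names introduced here, not part of this specification, and `Fraction` is used simply to keep the values exact.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class FiniteNumber:
    """Abstract finite number: the triple (sign, coefficient, exponent)."""
    sign: int         # 0 or 1
    coefficient: int  # zero or positive
    exponent: int     # signed power of ten

    def clength(self):
        # Length of the coefficient in decimal digits (a zero coefficient has length 1).
        return len(str(self.coefficient))

    def value(self):
        # (-1)^sign * coefficient * 10^exponent, computed exactly.
        return (-1) ** self.sign * self.coefficient * Fraction(10) ** self.exponent

    def adjusted_exponent(self):
        # exponent + (clength - 1)
        return self.exponent + (self.clength() - 1)

    def quantum(self):
        # 1 * 10^exponent: one unit in the last place of the coefficient.
        return Fraction(10) ** self.exponent


def exponent_range(elimit, clength):
    """Permitted exponents for a coefficient of clength digits, given Elimit."""
    return (-elimit - (clength - 1) + 1, elimit - (clength - 1))


x = FiniteNumber(0, 2708, -2)
print(x.value(), x.adjusted_exponent(), x.quantum())  # 677/25 1 1/100
print(exponent_range(999, 9))                         # (-1006, 991)
```

The last line reproduces the worked example above: with Elimit = 999 and a 9-digit coefficient, the exponent runs from –1006 through +991.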

This abstract definition deliberately allows for multiple representations of values which are numerically equal but are visually distinct (such as 1 and 1.00). However, there is a one-to-one mapping between the abstract representation and the result of the primary conversion to string using to-scientific-string on that abstract representation. In other words, if one number has a different abstract representation to another, then the primary string conversion will also be different.
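
As an illustration, Python's `decimal` module (whose documentation cites this specification as its basis) shows the distinction between numerical equality and abstract representation. The choice of that module is an assumption made for this example only; the specification does not mandate any particular implementation.

```python
from decimal import Decimal

one = Decimal("1")
one_hundredths = Decimal("1.00")

print(one == one_hundredths)          # True: numerically equal
print(one.as_tuple())                 # DecimalTuple(sign=0, digits=(1,), exponent=0)
print(one_hundredths.as_tuple())      # DecimalTuple(sign=0, digits=(1, 0, 0), exponent=-2)
print(str(one), str(one_hundredths))  # 1 1.00

# Converting to a string and back recovers the same abstract representation.
round_trip = Decimal(str(one_hundredths))
print(round_trip.as_tuple() == one_hundredths.as_tuple())  # True
```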

Notes:

1. Many concrete representations for finite numbers have been used successfully. Typically, the coefficient is represented in some form of binary coded or packed decimal, or is encoded using a base which is a higher power of ten. It may also be expressed as a binary integer. The exponent is typically represented by a two’s complement or biased binary integer. The IEEE 754 decimal-encoded concrete representations are described in detail at: http://speleotrove.com/decimal/decbits.html
2. The one-to-one mapping between the abstract representation and the result of the primary conversion to string is required, as described above. However, no such constraint applies to a concrete representation (that is, there may be multiple concrete representations of a single abstract representation).
3. A number with a coefficient of 0 is permitted to have a non-zero sign. This negative zero is accepted as an operand for all operations (see IEEE 754 §3.2).
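
A short sketch of note 3, again using Python's `decimal` module purely as an example implementation (an assumption of this illustration):

```python
from decimal import Decimal

neg_zero = Decimal("-0")

print(neg_zero.as_tuple())       # DecimalTuple(sign=1, digits=(0,), exponent=0)
print(neg_zero == Decimal("0"))  # True: both zeros compare equal
print(neg_zero + Decimal("-0"))  # -0
print(neg_zero + Decimal("0"))   # 0
```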

#### Special values

In addition to the finite numbers, numbers must also be able to represent one of three named special values:
1. infinity – a value representing a number whose magnitude is infinitely large (∞, see IEEE 754 §3.2 and §6.1)
2. quiet NaN – a value representing undefined results (‘Not a Number’) which does not cause an Invalid operation condition. IEEE 754 recommends that additional diagnostic information be associated with NaNs (see IEEE 754 §3.2 and §6.2)
3. signaling NaN – a value representing undefined results (‘Not a Number’) which will usually cause an Invalid operation condition if used in any operation defined in this specification (see IEEE 754 §3.2 and §6.2).

When a number has one of these special values, its coefficient and exponent are undefined. A NaN, however, may have associated diagnostic information, also known as its payload. This is treated as though it can be encoded as a positive integer (greater than zero) which must be no larger than can be represented by the coefficient less one digit.

All special values may have a sign, as for finite numbers. The sign of an infinity is significant (that is, it is possible to have both positive and negative infinity), and the sign of a NaN has no meaning, although it may be considered part of the diagnostic information.
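
The three special values can be explored with Python's `decimal` module, used here only as one example implementation (an assumption of this sketch). The string forms `"NaN123"` and `"sNaN"` and the tuple exponents `'F'` and `'n'` are conventions of that module, not notation defined by this specification.

```python
from decimal import Decimal, InvalidOperation, localcontext

inf = Decimal("-Infinity")
qnan = Decimal("NaN123")  # quiet NaN carrying the diagnostic value 123
snan = Decimal("sNaN")    # signaling NaN

print(inf.as_tuple())     # DecimalTuple(sign=1, digits=(0,), exponent='F')
print(qnan.as_tuple())    # DecimalTuple(sign=0, digits=(1, 2, 3), exponent='n')
print(snan.is_snan())     # True

with localcontext() as ctx:
    ctx.traps[InvalidOperation] = False  # continue with a defined result
    ctx.clear_flags()
    r1 = qnan + 1                        # a quiet NaN propagates silently
    quiet_raised = bool(ctx.flags[InvalidOperation])
    r2 = snan + 1                        # a signaling NaN raises the condition
    signaling_raised = bool(ctx.flags[InvalidOperation])

print(r1, quiet_raised)      # NaN123 False
print(r2, signaling_raised)  # NaN True
```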

#### Normal numbers, subnormal numbers and Underflow

In any context where exponents are bounded, most finite numbers are normal. Non-zero finite numbers whose adjusted exponents are greater than or equal to Emin are called normal numbers; those non-zero numbers whose adjusted exponents are less than Emin are called subnormal numbers. Like other numbers, subnormal numbers are accepted as operands for all operations, and may result from any operation. If a result is subnormal, before any rounding, then the Subnormal condition is raised.

For a subnormal result, the minimum value of the exponent becomes Emin–(precision–1), called Etiny, where precision is the working precision, as described below. The result will be rounded, if necessary, to ensure that the exponent is no smaller than Etiny. If, during this rounding, the result becomes inexact, then the Underflow condition is raised. A subnormal result does not necessarily raise Underflow, therefore, but is always indicated by the Subnormal condition (even if, after rounding, its value is 0 or ten to the power of Emin).

When a number underflows to zero during a calculation, its exponent will be Etiny. The maximum value of the exponent is unaffected.

Note that the minimum value of the exponent for subnormal numbers is the same as the minimum value of exponent which can arise during operations which do not result in subnormal numbers, which occurs in the case where clength = precision.
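
The difference between an exact subnormal result (Subnormal only) and an inexact one (Subnormal and Underflow) can be seen in a small experiment. The sketch below assumes Python's `decimal` module as the implementation and picks deliberately tiny bounds (precision 3, Emin –5, Emax 5) for readability; the helper `raised` is a hypothetical name introduced for this example.

```python
from decimal import Context, Decimal, Inexact, Rounded, Subnormal, Underflow

WATCHED = (Subnormal, Underflow, Inexact, Rounded)


def raised(ctx):
    """Names of the watched signals whose flags are currently set."""
    return {s.__name__ for s in WATCHED if ctx.flags[s]}


ctx = Context(prec=3, Emin=-5, Emax=5, traps=[])  # small bounds, for illustration
etiny = ctx.Etiny()                               # Emin - (precision - 1)

# 12E-7: adjusted exponent -6 < Emin, exponent equals Etiny, so no rounding is needed.
exact = ctx.multiply(Decimal("1.2E-4"), Decimal("1E-2"))
exact_signals = raised(ctx)

ctx.clear_flags()
# 123E-8: must be rounded to exponent -7, which discards the non-zero digit 3.
inexact = ctx.multiply(Decimal("1.23E-4"), Decimal("1E-2"))
inexact_signals = raised(ctx)

ctx.clear_flags()
# 1E-10 underflows to zero; the exponent of the zero result is Etiny.
zero = ctx.multiply(Decimal("1E-5"), Decimal("1E-5"))

print(etiny)                                       # -7
print(exact.as_tuple(), sorted(exact_signals))     # exponent -7, ['Subnormal']
print(inexact.as_tuple(), sorted(inexact_signals)) # exponent -7, all four signals
print(zero.as_tuple())                             # DecimalTuple(sign=0, digits=(0,), exponent=-7)
```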

#### Notation

In later sections of this document, a specific finite number is described by its abstract representation, using the triad notation: [sign, coefficient, exponent], where each value is an integer. Only the exponent can be negative.

Similarly, pairs or triads are used to indicate the special values. These have the form [sign, special-value] or the form [sign, special-value, diagnostic], where the sign is indicated as before, and the special-value is one of inf, qNaN, or sNaN, representing infinity, quiet NaN, or signaling NaN, respectively, and diagnostic is a positive integer.

So, for example, the triad [0,2708,-2] represents the number 27.08, the triad [1,1953,0] represents the integer -1953, the pair [1,inf] represents the number –∞, and the pair [0,qNaN] represents a quiet NaN.
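
For readers who want to experiment, the notation maps directly onto the tuple constructor of Python's `decimal.Decimal`. The helper `from_notation` below is a hypothetical convenience written for this example; the codes `'F'`, `'n'` and `'N'` are that module's markers for infinity, quiet NaN and signaling NaN.

```python
from decimal import Decimal

_SPECIAL = {"inf": "F", "qNaN": "n", "sNaN": "N"}


def from_notation(sign, second, third=None):
    """Build a Decimal from [sign, coefficient, exponent],
    [sign, special-value] or [sign, special-value, diagnostic]."""
    if isinstance(second, str):
        if second == "inf":
            digits = (0,)
        elif third is None:
            digits = ()
        else:
            digits = tuple(int(d) for d in str(third))
        return Decimal((sign, digits, _SPECIAL[second]))
    digits = tuple(int(d) for d in str(second))
    return Decimal((sign, digits, third))


print(from_notation(0, 2708, -2))  # 27.08
print(from_notation(1, 1953, 0))   # -1953
print(from_notation(1, "inf"))     # -Infinity
print(from_notation(0, "qNaN"))    # NaN
```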

### Abstract representation of operations

The core operations which must be provided by an implementation are described in later sections which define Conversions and Arithmetic Operations. Each operation is given an abstract name (for example, ‘add’), and its semantics are strictly defined. However, the implementation of each operation and the manner by which each is effected is not defined by this specification.

For example, in an object-oriented language, the addition operation might be effected by a method called add, whereas in a calculator application it might be effected by clicking on a button icon. In other uses, an infix ‘+’ symbol might be used to indicate addition. And in all cases, the operation might be carried out in software, hardware, or some combination of these.

Similarly, operations which are distinct in the specification need not be mapped one-to-one to distinct operations in the implementation – it is only necessary that all the core operations are available. For example, conversions to a string could be handled by a single method, with variations determined from context or additional arguments.
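
As one concrete illustration (assuming Python's `decimal` module, which is just one possible mapping), the abstract add operation is reachable both as a method on a context object and through the infix ‘+’ operator:

```python
from decimal import Context, Decimal

ctx = Context(prec=9)
a, b = Decimal("1.3"), Decimal("1.07")

via_method = ctx.add(a, b)  # operation named explicitly, context passed explicitly
via_infix = a + b           # infix '+', using the thread's current context

print(via_method, via_infix)  # 2.37 2.37
```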

### Abstract representation of context

The context represents the user-selectable parameters and rules which govern the results of arithmetic operations (for example, the precision to be used). This context might be implied in some way, or be a global or local setting, or be passed to operations – depending on the implementation of the specification (for example, in a programming language).

The context is defined by the following parameters:

precision

An integer which must be positive (greater than 0). This sets the maximum number of significant digits that can result from an arithmetic operation.

In the abstract, there is no upper bound on the precision (although a specific precision must always be provided). In practice there may need to be some upper limit to it (for example, the length of the maximum coefficient supported by a concrete representation). This limit must be expressed as an integral number of decimal digits.

Similarly, there may be a lower bound on the setting on precision, which may be the same as the upper bound (for example, if it is implied by the length of the maximum coefficient supported by a concrete representation). This limit must also be expressed as an integral number of decimal digits.

rounding

A named value which indicates the algorithm to be used when rounding is necessary. Rounding is applied when a result coefficient has more significant digits than the value of precision; in this case the result coefficient is shortened to precision digits and may then be incremented by one (which may require a further shortening), depending on the rounding algorithm selected and the remaining digits of the original coefficient. The exponent is adjusted to compensate for any shortening.

The five following rounding algorithms are defined and must be supported:

round-down

(Round toward 0; truncate.) The discarded digits are ignored; the result is unchanged.

round-half-up

If the discarded digits represent greater than or equal to half (0.5) of the value of a one in the next left position then the result coefficient should be incremented by 1 (rounded up). Otherwise the discarded digits are ignored.

round-half-even

If the discarded digits represent greater than half (0.5) the value of a one in the next left position then the result coefficient should be incremented by 1 (rounded up). If they represent less than half, then the result coefficient is not adjusted (that is, the discarded digits are ignored).

Otherwise (they represent exactly half) the result coefficient is unaltered if its rightmost digit is even, or incremented by 1 (rounded up) if its rightmost digit is odd (to make an even digit).

round-ceiling

(Round toward +∞.) If all of the discarded digits are zero or if the sign is 1 the result is unchanged. Otherwise, the result coefficient should be incremented by 1 (rounded up).

round-floor

(Round toward –∞.) If all of the discarded digits are zero or if the sign is 0 the result is unchanged. Otherwise, the sign is 1 and the result coefficient should be incremented by 1.

Three further rounding algorithms are defined; these are optional:

round-half-down

If the discarded digits represent greater than half (0.5) of the value of a one in the next left position then the result coefficient should be incremented by 1 (rounded up). Otherwise (the discarded digits are 0.5 or less) the discarded digits are ignored.

round-up

(Round away from 0.) If all of the discarded digits are zero the result is unchanged. Otherwise, the result coefficient should be incremented by 1 (rounded up).

round-05up

(Round zero or five away from 0.) The same as round-up, except that rounding up only occurs if the digit to be rounded up is 0 or 5, and after overflow the result is the same as for round-down.

When a result is rounded, the coefficient may become longer than the current precision (for example, 999 incremented at a precision of 3 becomes 1000). In this case the least significant digit of the coefficient (it will be a zero) is removed (reducing the length of the coefficient by one, back to precision digits), and the exponent is incremented by one. This in turn may give rise to an overflow condition, which then determines the result.
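
The rounding rules above can be written out directly on the digits of a coefficient. The following is a minimal sketch, not a reference implementation: `round_coefficient` and `reference` are hypothetical names, overflow is not modelled, and `reference` relies on Python's `decimal` module (assumed here to follow these rules) purely as a cross-check.

```python
import decimal

MODES = {
    "round-down": decimal.ROUND_DOWN,
    "round-half-up": decimal.ROUND_HALF_UP,
    "round-half-even": decimal.ROUND_HALF_EVEN,
    "round-ceiling": decimal.ROUND_CEILING,
    "round-floor": decimal.ROUND_FLOOR,
    "round-half-down": decimal.ROUND_HALF_DOWN,
    "round-up": decimal.ROUND_UP,
    "round-05up": decimal.ROUND_05UP,
}


def round_coefficient(sign, coefficient, precision, mode):
    """Return (coefficient, exponent_increase) after rounding to precision digits."""
    if mode not in MODES:
        raise ValueError(mode)
    digits = str(coefficient)
    drop = len(digits) - precision
    if drop <= 0:
        return coefficient, 0                # nothing needs to be discarded
    kept = int(digits[:precision])
    discarded = digits[precision:]
    half = "5" + "0" * (drop - 1)            # same length as discarded
    nonzero = discarded.strip("0") != ""
    if mode == "round-down":
        increment = False
    elif mode == "round-half-up":
        increment = discarded >= half
    elif mode == "round-half-even":
        increment = discarded > half or (discarded == half and kept % 2 == 1)
    elif mode == "round-half-down":
        increment = discarded > half
    elif mode == "round-ceiling":
        increment = nonzero and sign == 0
    elif mode == "round-floor":
        increment = nonzero and sign == 1
    elif mode == "round-up":
        increment = nonzero
    else:                                    # round-05up
        increment = nonzero and kept % 10 in (0, 5)
    if increment:
        kept += 1
    if len(str(kept)) > precision:           # e.g. 999 -> 1000: remove the trailing zero
        kept //= 10
        drop += 1
    return kept, drop


def reference(sign, coefficient, precision, mode):
    """The same rounding carried out by Python's decimal module, for comparison."""
    ctx = decimal.Context(prec=precision, rounding=MODES[mode])
    digits = tuple(int(d) for d in str(coefficient))
    t = ctx.plus(decimal.Decimal((sign, digits, 0))).as_tuple()
    return int("".join(map(str, t.digits))), t.exponent


print(round_coefficient(0, 12350, 3, "round-half-even"))  # (124, 2)
print(round_coefficient(0, 12250, 3, "round-half-even"))  # (122, 2)
print(round_coefficient(0, 99961, 3, "round-half-up"))    # (100, 3)
```

Comparing equal-length digit strings keeps the half-way test exact without any floating-point arithmetic; the last call shows the case where the incremented coefficient grows by a digit and the exponent is raised once more.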

flags and trap-enablers

The exceptional conditions are grouped into signals, which can be controlled individually. The context contains a flag (which is either 0 or 1) and a trap-enabler (which also is either 0 or 1) for each signal.

For each of the signals, the corresponding flag is set to 1 when the signal occurs. It is only reset to 0 by explicit user action.

For each of the signals, the corresponding trap-enabler indicates which action is to be taken when the signal occurs (see IEEE 754 §7). If 0, a defined result is supplied, and execution continues (for example, an overflow is perhaps converted to a positive or negative infinity). If 1, then execution of the operation is ended or paused and control passes to a ‘trap handler’, which will have access to the defined result.

The signals are:

clamped

raised when the exponent of a result has been altered or constrained in order to fit the constraints of a specific concrete representation

division-by-zero

raised when a non-zero dividend is divided by zero

inexact

raised when a result is not exact (one or more non-zero coefficient digits were discarded during rounding)

invalid-operation

raised when a result would be undefined or impossible

overflow

raised when the exponent of a result is too large to be represented

rounded

raised when a result has been rounded (that is, some zero or non-zero coefficient digits were discarded)

subnormal

raised when a result is subnormal (its adjusted exponent is less than Emin), before any rounding

underflow

raised when a result is both subnormal and inexact.

This specification does not define the means by which flags and traps are reset or altered, respectively, or the means by which traps are effected.
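
The interplay of flags and trap-enablers can be observed with Python's `decimal` module, taken here as an assumed example implementation. In that module a trap is delivered as a Python exception, which is one possible way of effecting traps and is not something this specification requires.

```python
from decimal import Context, Decimal, DivisionByZero, Inexact, InvalidOperation, Rounded

ctx = Context(prec=9, traps=[])            # every trap-enabler is 0

q = ctx.divide(Decimal(1), Decimal(0))     # a defined result is supplied
print(q, bool(ctx.flags[DivisionByZero]))  # Infinity True

third = ctx.divide(Decimal(1), Decimal(3)) # 0.333333333: rounded and inexact
print(bool(ctx.flags[Rounded]), bool(ctx.flags[Inexact]))  # True True
print(bool(ctx.flags[DivisionByZero]))     # True: the flag stays set until cleared

undefined = ctx.divide(Decimal(0), Decimal(0))  # zero dividend: invalid-operation instead
print(undefined, bool(ctx.flags[InvalidOperation]))  # NaN True

ctx.clear_flags()                          # explicit user action resets the flags
ctx.traps[DivisionByZero] = True           # trap-enabler set to 1
try:
    ctx.divide(Decimal(1), Decimal(0))
    trapped = False
except DivisionByZero:
    trapped = True
print(trapped)                             # True
```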

The context might also specify further variables, such as Emax where a variable exponent bound is required.

Notes:

1. The setting of precision may be used to reduce a result to a narrower precision, using the plus operation.
2. IEEE 854 and IEEE 754 were designed under the assumption that some small number of known precisions would be available to users. This specification extends this concept to allow (but not require) variable precisions, as specified by ANSI X3.274. This generalization allows improved interoperation between software arbitrary-precision decimal packages and hardware implementations (which are expected to have relatively low maximum precision limits, typically just tens of digits).
3. precision can be set to positive values lower than nine. Small values, however, should be used with care – the loss of precision and rounding thus requested will affect all computations affected by the context, including comparisons. To conform to IEEE 854, this value should not be set less than 6; the smallest IEEE 754 interchange format supports 7.
4. The concrete representation of rounding is often a series of integer constants, or an enumeration, held in an object or control register.
5. It has been proposed that each exceptional condition should have its own, distinct, signal and trap-enabler. This specification may change to this approach.
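
Note 1 can be tried out as follows, assuming Python's `decimal` module, where the plus operation is exposed as `Context.plus`. In that module, constructing a `Decimal` from a string is exact, so rounding only happens when the operation is applied.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

narrow = Context(prec=4, rounding=ROUND_HALF_UP)
x = Decimal("3.14159265")  # construction keeps all nine digits

reduced = narrow.plus(x)   # the plus operation rounds to the context precision
print(reduced)             # 3.142
print(x)                   # 3.14159265
```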

### Default contexts

This specification defines optional default contexts, which define suitable settings for basic arithmetic and for the extended arithmetic defined by IEEE 854 and IEEE 754. It is recommended that the default contexts be easily selectable by the user.

#### Basic default context

In the basic default context, the parameters are set as follows:
• flags – all set to 0
• trap-enablers – inexact, rounded, and subnormal are set to 0; all others are set to 1 (that is, the other conditions are treated as errors)
• precision – is set to 9
• rounding – is set to round-half-up

#### Extended default contexts

In the extended default contexts, the parameters are set as follows:
• flags – all set to 0
• trap-enablers – all set to 0 (IEEE 854 §7)
• precision – is set to the appropriate precision for a given numerical format (for the IEEE 754 smallest and basic formats, the precisions are 7, 16, or 34 digits).
• rounding – is set to round-half-even (IEEE 754 §4.3.3)
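
The two sets of defaults can be written down as context objects. The sketch below assumes Python's `decimal` module; the functions `basic_default_context` and `extended_default_context` are hypothetical helpers for this example. That module also ships predefined `BasicContext` and `ExtendedContext` objects, the latter with a precision of 9.

```python
from decimal import (BasicContext, Clamped, Context, Decimal, DivisionByZero,
                     ExtendedContext, Inexact, InvalidOperation, Overflow,
                     Rounded, Subnormal, Underflow, ROUND_HALF_EVEN, ROUND_HALF_UP)

SIGNALS = [Clamped, DivisionByZero, Inexact, InvalidOperation,
           Overflow, Rounded, Subnormal, Underflow]


def basic_default_context():
    """Precision 9, round-half-up, traps on all but inexact, rounded, subnormal."""
    untrapped = {Inexact, Rounded, Subnormal}
    trapped = [s for s in SIGNALS if s not in untrapped]
    return Context(prec=9, rounding=ROUND_HALF_UP, traps=trapped, flags=[])


def extended_default_context(precision):
    """Given precision (for example 7, 16 or 34), round-half-even, no traps."""
    return Context(prec=precision, rounding=ROUND_HALF_EVEN, traps=[], flags=[])


basic = basic_default_context()
extended = extended_default_context(16)

print(basic.divide(Decimal(2), Decimal(3)))     # 0.666666667
print(extended.divide(Decimal(2), Decimal(3)))  # 0.6666666666666667
print(extended.divide(Decimal(1), Decimal(0)))  # Infinity (no trap is taken)
```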

Footnotes:
• Indeed, some variations of operations could be selected by using context settings outside the scope of this specification.
• That is, the maximum value of the coefficient will be an integral power of ten, less one – for example, 99999999999999999999.
• See IEEE 854 §3.1.
• This rule, a requirement for both ANSI X3.274 and IEEE 854, constrains the number of values which would overflow or underflow when inverted (divided into 1).
• This is slightly different from an ulp (unit in last position), which is defined such that ulp(x) is the difference between the two nearest bracketing representable values to x, and which if x is exactly representable and is an exact power of the base gives the ‘ulp below’.
• Typically, in a concrete representation, certain out-of-range values of the exponent are used to indicate the special values, and the coefficient is used to carry additional diagnostic information for quiet NaNs. In the case of the proposed IEEE 754 decimal formats, the exponent is 0, the coefficient (excluding the first digit) may hold a decimal value which is the diagnostic information, and the special value is indicated by the combination field and exponent continuation bits.
• This restriction allows the abstract coefficient in IEEE 754 encodings to be used to hold the diagnostic information for a NaN.
• That is, numbers whose absolute value is non-zero and is closer to zero than ten to the power of Emin.
• The term ‘round to nearest’ is not used because it is ambiguous. round-half-up is the usual round-to-nearest algorithm used in European countries, in international financial dealings, and in the USA for tax calculations. round-half-even is often used for other applications in the USA, where it is usually called ‘round to nearest’ and is sometimes called ‘banker’s rounding’.
• The rounding mode round-05up permits arithmetic at shorter lengths to be emulated in a fixed-precision environment without double rounding. For example, a multiplication at a precision of 9 can be effected by carrying out the multiplication at (say) 16 digits using round-05up and then rounding to the required length using the desired rounding algorithm.
• IEEE 754 suggests that there be a mechanism allowing traps to return a substitute result to the operation that raised the exception, but this may not be possible in some environments (including some object-oriented environments).
