Decimal Arithmetic FAQ
Part 4 – Arithmetic Specification Questions
Copyright © IBM Corporation, 2000, 2007. All rights reserved.

Why is decimal arithmetic unnormalized?

It is quite possible for decimal arithmetic to be normalized, even if the encoding (representation) of numbers is not. A floating-point unit can easily offer both normalized and unnormalized arithmetic, either under the control of a context flag or by a specific ‘normalize’ instruction (the latter being especially appropriate in a RISC design).

However, for decimal arithmetic, intended as a tool for human use, the choice of unnormalized arithmetic is dictated by the need to mirror manual calculations and other human conventions. A normalized (non-redundant) arithmetic is suitable for purely mathematical calculations, but is inadequate for many other applications.

Note that the unnormalized arithmetic gives exactly the same mathematical value for every result as normalized arithmetic would, but has the following additional advantages.

  1. Unnormalized arithmetic is compatible with existing languages and applications.
    Decimal arithmetic in computing almost invariably uses scaled integer calculations. For example, the languages COBOL, PL/I, Java, C#, Rexx, Visual Basic and the databases DB2, Oracle, MS SQL Server, and Informix all use this form of computation, as do decimal arithmetic libraries, including decNumber for C, bignum for Perl 6, Decimal in Python 2.4, EDA for Eiffel, ArciMath and IBM's BigDecimal classes for Java, ADAR for Ada, and the X/Open ISAM decimal type.

    A normalized arithmetic cannot duplicate the results of unnormalized arithmetic, and so these existing software decimal calculations cannot be replaced by hardware which provides only normalized arithmetic. In up to 27% of cases the resulting coefficient and exponent will be different. This would require that all applications and their testcases be rewritten; an effort comparable to but significantly larger than the ‘Year 2000’ problem.

    Normalized arithmetic is close to useless for most applications which use decimal data, because so many operations would have to be recalculated in software in order to give the expected results.

  2. The arithmetic of all existing decimal datatypes can be derived by constraining the unnormalized arithmetic.
    Decimal arithmetic has a number of flavors:
    • Integer arithmetic: Addition and multiplication are the standard arithmetic, unchanged. Integer division is the general division followed by a truncating rescale(0) (in practice, an early-finish division).
    • Fixed-point arithmetic: addition is unchanged; multiplication and division are the general operations followed by rescale(n).
    • Variable precision floating-point: all operations are the general arithmetic, either with precision control or with a following re-round operation.
    • Normalized floating-point: all operations are the general operations, followed by a normalize operation (this simply removes trailing zeros while increasing the exponent appropriately; no errors are possible from this operation).
    • Unnormalized floating point: all operations are the general arithmetic.

    All of these types are a subset of the standard type; in a RISC-like implementation it is especially appropriate for the rescale, re-round, and normalize operations to be independent of the general arithmetic.

    There is therefore no need for fixed-point and integer decimal datatypes in addition to the floating-point type, and no conversions are needed when mixed integer, fixed-point, and floating-point types are used in a calculation. (For example, when calculating the product 17 × 19.99, as in 17 items at $19.99.)

    If a normalized arithmetic were used, a separate unnormalized floating-point processing unit would still be needed, and conversions between the two units would be necessary.

  3. Unnormalized arithmetic often permits performance improvements.
    Common addition and subtraction tasks (e.g., summing a column of prices or costs) need no alignment shift or normalization step. Compare the following sums (unnormalized on the left, normalized on the right):
              1.23        1.23
              2.50         2.5
            + 1.27      + 1.27
           -------     -------
              5.00           5

    With a normalized arithmetic, alignment of the operands is needed more often and an extra step is required at the end of every calculation to determine whether normalization is required and then effect it if it is. These unnecessary steps require extra code (in software) or extra circuitry (in hardware). These both increase the costs of calculations, testing, and validation.

    Further, in the unnormalized arithmetic, the calculations of a result coefficient and exponent are independent (except when the result has to be rounded). This independence reduces the complexity of the calculation and permits the two parts of the calculation to be done entirely in parallel.

  4. Gradual underflow is ‘free’.
    Subnormal numbers are simply the low end of the range of unnormalized numbers; they arise naturally out of calculations when required and require no special treatment. This helps performance, testing, etc.

    For example, 10000×10^Etiny (the smallest normal number if precision=5) ÷ 2 gives 5000×10^Etiny, a subnormal number, with no special processing.

    In contrast, normalized arithmetic necessarily must treat subnormal values as special cases, which adds complexity and complicates implementation and testing.

  5. Zeros are not special cases.
    Similarly, results and operands which are zero are treated in exactly the same manner as any other values.

    In a normalized arithmetic, zero must be encoded differently from all other values, so every result or use of zero requires special-case treatment. This requires extra code or circuitry, with the associated testing and validation burden.

  6. Results are easier to predict and to test.
    Compared to a normalized arithmetic with fractional coefficients, the integer-coefficient arithmetic is simpler. Fractional arithmetic has rules which are different from integer arithmetic (even though the values represented are equivalent). With integer coefficients one set of rules suffices.

    Also, although theoretically equivalent, integer arithmetic has a longer history and is more familiar than left-aligned fractional arithmetic. This may make it easier to find existing proofs for results, etc. In general, testing and validation are simplified: all computers already have integer arithmetic, so testing methods can be made consistent with the testing of the integer arithmetic unit (and are well-known and understood).

  7. Rounded numbers are full precision.
    A number which has been rounded will always have the full working precision (except in the case of a subnormal result).

    In particular, this means that if the result of an operation has less than full precision then it was not rounded by that operation; this is particularly useful for checking that adequate precision has been allowed for calculations which must not be rounded.

  8. Conversions to and from existing decimal datatypes are faster.
    Existing decimal data are invariably held in an unnormalized two-integer format. The standard decimal type is also an unnormalized two-integer format, so conversions are simple and fast, with no exponent adjustment or shifting needed.

  9. Flexibility.
    An optional normalization step can be omitted or added to suit the application. This could be a normalize instruction or control register bit, depending on the architecture (or, in software, an extra subroutine or method call). When the normalization step is separated in this way, the penalties of normalization can be avoided when desired.

  10. Application design is simpler.
    If the exponent is implicit in the data or calculations then presentation decisions are often simpler or unnecessary. Verification and testing are often easier, too. For example, currency regulations state that certain rates must be expressed to exactly 6 digits. This is easier to verify if at all times 6 digits are actually stored.

  11. Unnormalized arithmetic results match human expectations.
    The results from the unnormalized arithmetic defined in the specification are exactly those of Algorism. These are the results which humans are taught to compute, and therefore expect. For example, 1.23 + 1.27 gives 2.50 (2.5 is a surprise to calculator users).

    In general exponents must be preserved; they form part of the type of a number. If two numbers which have the same exponent are added or subtracted, the result is expected to have that same exponent. If normalization is applied, however, the result exponent varies depending on the values of the operands.

    When results are different from those expected, users are surprised, frustrated, and lose confidence in the application.

    Further, when results match the results computed by hand exactly, application test case generation and validation are easier. Testing is often simpler, too, because the operations are essentially integer operations.
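
Several of the points above can be tried out in software. The following is a minimal sketch using Python's decimal module (one of the libraries listed in item 1); the variable names, the 0.0825 rate, and the small context (precision 5, exponents ±99) are assumptions chosen only for illustration.

```python
from decimal import Decimal, Context, Subnormal, ROUND_DOWN

# Summing a column of prices: the common exponent (-2) is kept.
total = Decimal("1.23") + Decimal("2.50") + Decimal("1.27")
print(total)              # 5.00
print(total.normalize())  # 5   (normalization as a separate, optional step)

# Mixed integer and fixed-point operands need no conversion: 17 items at $19.99.
cost = Decimal("17") * Decimal("19.99")
print(cost)               # 339.83

# Fixed-point flavor: general multiplication followed by a rescale to 2 places.
# (0.0825 is an arbitrary illustrative rate, not taken from the text.)
tax = (cost * Decimal("0.0825")).quantize(Decimal("0.01"))
print(tax)                # 28.04

# Integer flavor: general division followed by a truncating rescale(0).
quotient = (Decimal("17") / Decimal("5")).quantize(Decimal("1"), rounding=ROUND_DOWN)
print(quotient)           # 3

# Gradual underflow: halving the smallest normal number of a precision-5
# context gives a subnormal result with no special handling by the caller.
ctx = Context(prec=5, Emax=99, Emin=-99)
smallest_normal = Decimal((0, (1, 0, 0, 0, 0), ctx.Etiny()))  # 10000 x 10^Etiny
half = ctx.divide(smallest_normal, Decimal(2))                # 5000 x 10^Etiny
print(half.as_tuple(), bool(ctx.flags[Subnormal]))
```

Here quantize plays the role of the rescale operation and normalize the role of the normalize operation described in item 2.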

What are the advantages of normalization?

  1. Normalization guarantees that the minimum number of digits are involved in a calculation.
    This can speed multiplies and divides, especially in software.

    However, if operands are random, then only one in ten will have trailing zeros, and only one in a hundred dyadic operations will have both operands with trailing zeros. Any performance advantage of normalizing these is quite possibly negated by the extra alignments and shifts required for normalization. (The result of 27% of multiplications would need an otherwise unnecessary normalization shift, for example.)

  2. Normalization allows more values to be encoded.
    Potentially 11% more values can be encoded, or a wider exponent range supported, because one digit of the coefficient would only need to take the values 1 through 9 (instead of 0 through 9).

    With the current layouts being discussed, it is not practical to increase the coefficient length, so the benefit is limited to increasing the exponent range. In a 64-bit layout, for example, using a normalized representation would allow increasing the exponent range from ±384 (already larger than the range available in a binary double) to ±456.

    This slight increase in exponent range is unlikely to enable a new class of applications or otherwise significantly improve the usefulness of the arithmetic.

  3. Testing for equality can be a byte-array compare.
    This can speed up localized processing in databases, etc.

    This advantage also applies to unnormalized numbers, provided that they are normalized before storing. For example, this can be achieved using a normalize (or ‘store normalized’) instruction.
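
As a hedged illustration of the ‘normalize before storing’ idea, the sketch below uses Python's decimal module. The (sign, digits, exponent) tuple stands in for the stored bytes; that stand-in is an assumption of this example, not an encoding defined by the specification.

```python
from decimal import Decimal

a = Decimal("7.5")
b = Decimal("7.500")

# Same numeric value, but different coefficient and exponent.
print(a == b)                        # True
print(a.as_tuple() == b.as_tuple())  # False

# After a normalize step the two representations are identical,
# so a field-by-field (or byte-by-byte) comparison is sufficient.
stored_a = a.normalize().as_tuple()
stored_b = b.normalize().as_tuple()
print(stored_a == stored_b)          # True
```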

How are zeros with exponents handled?

In scaled integer arithmetic, zero need not be treated as a special case; just like some other numbers, redundant encodings are allowed. All numbers with a coefficient of zero and with any exponent are valid, and (of course) they all have the value zero. (0 × 10^5 is zero.)

These permitted redundant encodings of zeros mean that, very importantly, the exponent is independent of the coefficient in all calculations which are not rounded.

For instance, consider subtraction. The rule here is simply that the exponent of the result is the lesser of the exponents of the operands, and the coefficient of the result is the difference of the operands' coefficients after (if necessary) aligning one coefficient so that both operands have the result exponent.

For example:

      123 - 122 => 1
(123 - 122 gives the result 1), and this can also be written (showing the integer coefficient before the E and the exponent after it) as:
      123E+0 - 122E+0 => 1E+0
Now consider the similar calculation, but with exponent -2 on the two operands instead. The coefficients are used in the same way and give exactly the same result coefficient, and again the exponent of the result is the same as the exponents of the operands:
      1.23 - 1.22  => 0.01   or:  123E-2 - 122E-2 => 1E-2
We follow exactly the same process of calculation even if the result happens to be zero:
       123 - 123   => 0      or:  123E+0 - 123E+0 => 0E+0
And with exponent -2 on the two operands, again the process is the same:
      1.23 - 1.23  => 0.00   or:  123E-2 - 123E-2 => 0E-2
Note that we do not have to inspect the result coefficient to determine the exponent. The exponent can be calculated entirely in parallel with, or even in advance of, the calculation of the coefficient. This simplifies a hardware design or speeds up a software implementation.
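
The subtraction rule can be sketched directly on (coefficient, exponent) pairs. This is an illustrative toy for the unrounded case only; the function name and the pair representation are assumptions of this example.

```python
def subtract(a, b):
    """Subtract two scaled integers given as (coefficient, exponent) pairs.

    No rounding is modelled, so the result is always exact.
    """
    (coeff_a, exp_a), (coeff_b, exp_b) = a, b
    # The result exponent depends only on the operand exponents ...
    exp = min(exp_a, exp_b)
    # ... and the coefficient only on the aligned coefficients.
    coeff_a *= 10 ** (exp_a - exp)
    coeff_b *= 10 ** (exp_b - exp)
    return (coeff_a - coeff_b, exp)


print(subtract((123, 0), (122, 0)))    # (1, 0)    123 - 122   => 1
print(subtract((123, -2), (122, -2)))  # (1, -2)   1.23 - 1.22 => 0.01
print(subtract((123, 0), (123, 0)))    # (0, 0)    123 - 123   => 0
print(subtract((123, -2), (123, -2)))  # (0, -2)   1.23 - 1.23 => 0.00
```

Note that the zero results pass through exactly the same two statements as the non-zero ones.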

Similarly, we don't have to inspect the exponent in order to determine if a value of a number is zero; the coefficient determines this.

If a zero result were to be treated as a special case (perhaps forcing the exponent to zero), there would be different paths for addition and subtraction depending on the value of the result.

Further, the exponent of the result might then have to depend on the coefficient of the operands, too. For example, in the sum:

      1E+3 + yE+0
one could argue that if y=0 then its exponent should be ignored, and the answer should therefore be 1E+3. However, if y=1 then the answer would be 1001E+0. With the consistent rule for zero, the exponent of the result would be +0 in both cases (1000E+0 and 1001E+0), and again the same process is used whatever the operands.
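
The same sum can be checked with Python's decimal module, used here only as one convenient implementation of this arithmetic; the variable names are illustrative.

```python
from decimal import Decimal

base = Decimal("1E+3")
results = []
for y in (Decimal("0E+0"), Decimal("1E+0")):
    result = base + y
    results.append(result)
    # The result exponent is 0 whether or not y is zero.
    print(result, result.as_tuple().exponent)
# prints:
# 1000 0
# 1001 0
```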

The only complication is when the exponent of a zero result would be too large or small for a given representation (after a multiplication or division). In this case it is set (clamped) to the largest or smallest exponent available, respectively (a result of 0 cannot cause an overflow or underflow).

The preservation of exponents on zeros, therefore, simplifies the basic arithmetic operations. It also gives a helpful indication that an underflow may have occurred (after an underflow which rounds to zero, the exponent of the zero will always be the smallest possible exponent, whereas zeros which are the results of normal calculations will have exponents appropriately related to the operands).
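
A rough sketch of both effects, using a Python decimal context set up with decimal64-like parameters (precision 16, Emax 384, Emin −383, and clamp=1 so that exponents are limited to what that format can encode). These parameter choices and the operand values are assumptions of the example.

```python
from decimal import Decimal, Context, Clamped, Underflow

ctx = Context(prec=16, Emax=384, Emin=-383, clamp=1)

# Zero results whose ideal exponents (+600 and -600) are out of range:
# the exponent is clamped, and no overflow or underflow is signalled.
big_zero = ctx.multiply(Decimal("0E+300"), Decimal("0E+300"))
small_zero = ctx.multiply(Decimal("0E-300"), Decimal("0E-300"))
print(big_zero, small_zero)
was_clamped = bool(ctx.flags[Clamped])
underflow_on_zero = bool(ctx.flags[Underflow])

# A non-zero result that underflows to zero ends up with the smallest
# possible exponent, whereas an ordinary zero keeps an exponent
# related to its operands.
ctx.clear_flags()
underflowed = ctx.multiply(Decimal("1E-300"), Decimal("1E-300"))
ordinary = ctx.subtract(Decimal("1.23"), Decimal("1.23"))
print(underflowed.as_tuple().exponent, ordinary.as_tuple().exponent)
```

With clamp=1 the largest exponent available is Emax − precision + 1 (that is, +369 here), and the smallest is Etiny (−398).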

Is the decimal arithmetic ‘significance’ arithmetic?

Absolutely not. Every operation is carried out as though an infinitely precise result were possible, and is only rounded if the destination for the result does not have enough space for the coefficient. This means that results are exact where possible, and can be used for converging algorithms (such as Newton-Raphson approximation), too.

To see how this differs from significance arithmetic, and why it is different, here’s a mini-tutorial...

First, it is quite legitimate (and common practice) for people to record measurements in a form in which, by convention, the number of digits used – and, in particular, the units of the least-significant-digit – indicate in some way the likely error in a measurement.

With this scheme, 1.23m might indicate a measurement which is believed to be 1.230000000... meters, plus or minus 0.005m (or “to the nearest 0.01m”), and 1.234m might indicate one that is ±0.0005m (“to the nearest millimeter”), and so on.

(Of course, the error will sometimes be outside those bounds, due to human error, random events, and so on, but let us assume that is rare.)

This is one of the reasons why a representation of a number needs to be able to ‘preserve trailing zeros’ (that is, record the original power of ten associated with the number). When a number is a record of a measurement, the two numbers 1.000 and 1 are in some sense different. The first perhaps records a measurement within ±0.0005, and the other could be ±0.5 (exactly what they mean depends on the context, of course).

This is the distinction that is lost in a normalized representation (which would record both of these as 1).

Now let us consider arithmetic on these two numbers, in a scientific or other context, and in particular let’s consider adding 0.001 to each.

For the first case (1.000+0.001) we would expect the answer 1.001.

For the second case (1 + 0.001, where the 1 is understood to be ±0.5) it can be argued that the error bounds associated with the 1 means that the 0.001 is irrelevant (‘swamped by the noise’). So what happens when we attempt this addition?

Note that the first three of these actions only make sense if we know we are dealing with measurements (with their implied error bounds). We must not take these actions in the analogous case of adding one thousand dollars ($1E+3) to a bank account with a balance of one dollar and five cents (which must give the result $1001.05).

Given that an addition operator does not know whether a number is a measurement or an exact quantity then it has no option other than to return the exact answer (the fourth case). How this result is interpreted is up to the application, and that application is at liberty to apply some rounding rule or to treat the result as exact, as appropriate.
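
For instance, in Python's decimal module (shown as one example implementation; the names and the final rounding choice are illustrative assumptions) the operator returns the exact sum in every case, and any measurement-style interpretation is a separate, explicit step taken by the application:

```python
from decimal import Decimal

first = Decimal("1.000") + Decimal("0.001")
second = Decimal("1") + Decimal("0.001")
balance = Decimal("1.05") + Decimal("1E+3")
print(first, second, balance)   # 1.001 1.001 1001.05

# An application that knows the 1 was a measurement to the nearest unit
# may choose to round the exact result afterwards.
as_measurement = second.quantize(Decimal("1"))
print(as_measurement)           # 1
```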

So, default operations must treat calculations as exact. But could we not have a context setting of some kind and do another kind of arithmetic where the precision (significance) of the result depends on the precision of the operands?

Let us consider the most popular of such arithmetics. It’s called significance arithmetic and is sometimes advocated in experimental science (particularly in the field of Chemistry).

Significance arithmetic is essentially a set of ‘rules of thumb’. The primary rule is that a result of a calculation must not be given to more digits than the less precise of its operands. This and its other rules ‘work’ in the sense that they alert a student to a situation in which he or she must take care, but they are not a valid arithmetic on measurements.

Consider a simple example in which two measurements are added together, such as 2.3 + 7.2 (and let’s assume the true values really are in the ranges 2.25 through 2.35 and 7.15 through 7.25). From the rules of ‘significance arithmetic’, a result should have the same precision as the least precise of the operands, so in this case the result should be 9.5.

But, if we continue to assume the rule that the last digit indicates the error bounds, this would suggest that the result is 9.5 ±0.05. However, both the original measurements could err in the same direction, so it is clear that in fact the final total could be anywhere in the range 9.4 through 9.6 (that is, 9.5 ±0.1, not ±0.05). So this rule is over-optimistic after the very first calculation (and compounds with each subsequent calculation), and so we cannot apply the rule for more than the first calculation in a sequence.
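
The widening of the bounds can be worked through with a simple interval calculation. This sketch assumes, as the text does for the sake of argument, that each reading is good to plus or minus half a unit in its last digit; the helper name is illustrative.

```python
from decimal import Decimal

HALF_UNIT = Decimal("0.05")  # half of the last recorded digit (0.1)


def bounds(reading):
    """Return the (low, high) range implied by a one-decimal-place reading."""
    value = Decimal(reading)
    return value - HALF_UNIT, value + HALF_UNIT


a_low, a_high = bounds("2.3")   # 2.25 .. 2.35
b_low, b_high = bounds("7.2")   # 7.15 .. 7.25

sum_low = a_low + b_low
sum_high = a_high + b_high
half_width = (sum_high - sum_low) / Decimal(2)
print(sum_low, sum_high, half_width)   # 9.40 9.60 0.10
```

The half-width of the sum is 0.10, twice the ±0.05 that the last digit of 9.5 would suggest.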

But the problems are worse than this, because with measurements the likely error is not uniformly distributed between two bounds. As Delury pointed out (In Computation with Approximate Numbers, Daniel B. Delury, The Mathematics Teacher 51, pp521-530, November 1958):
The statement that a length of something or other is 11.3 inches, to the nearest tenth of an inch, is either trivial or not true. The statement that a length of something or other is 11.3 inches, with a standard deviation of 0.2 inch, is a meaningful statement.
(This is because the distribution of errors is not a rectangle but a normal curve – consider a measurement of something about half way between 11.3 and 11.4.)

He then goes on to quote a practical example, of a real experiment where 10 measurements are made of the breaking strain of a wire; the 10 measurements vary from 568 through 596, all to three figures. The sum of the 10 measurements is 5752, giving an average (to three figures) of 575 which, according to the rules of significance arithmetic, means that the result is 575 ±0.5.

But he also calculates the standard deviation of the average of the measurements (charitably assuming each individual measurement is infinitely accurate), which is 8.26, meaning that “with near certainty” the true mean lies in the range 575 ±9. In other words, the significance arithmetic has given a grossly false estimate of the bounds in which the true mean will lie while at the same time losing a digit which indicates the center point of any result distribution.

The underlying problem here is that a convention for recording measurements does not follow through to an arithmetic on measurements, however much one might wish it to. One has to apply the theory of errors, or use some other technique (such as interval arithmetic).

Delury gives some specific advice:
Two questions require answers. “How shall [students] carry out their arithmetic and how shall they present the results of their calculations?”
    The answers are easily given. In their arithmetic, all numbers are to be treated as exact, with the proviso that if the number of decimal places becomes unduly large, some of them may be eliminated by rounding off in the course of the calculation. The final answer should be rounded off to a reasonable number of decimals.
    I am sure we would, all of us, like to have something more definite than this, but the fact is that there are no grounds for definiteness.

So, in summary:

And that is what IEEE 754-1985, IEEE 854, and the arithmetic now described in the revised IEEE 754, provide.

Which is larger? 7.5 or 7.500?

Numerically, the decimal numbers 7.5 and 7.500 compare equal, but sometimes it is useful to have a defined sorting order for all decimal numbers. The IEEE 754-2008 standard defines a suitable total ordering for decimal numbers (including the special values). In brief, when two finite numbers have the same numeric value but different exponents, the one with the larger exponent is treated as though it were ‘larger’ than the other.

The following table lists a sequence of sample decimal64 values, in the order defined in the IEEE 754-2008 standard. Here, Nmax is the largest positive normal number, Nmin is the smallest positive normal number, and Ntiny is the smallest positive subnormal number (the tiniest non-zero). NaNs and signalling NaNs (sNaN) are sorted by ‘payload’ (e.g., NaN10 – a NaN with a payload of 10 – sorts higher than a NaN with a payload of 0 (shown as simply ‘NaN’ in the table)).

decimal64 value
Nmax = 9.999999999999999E+384
Nmin as 1E−383
Nmin as 1.000000000000000E−383 (which is 1000000000000000E−398)
Ntiny = 1E−398
−Ntiny = −1E−398
−Nmin as −1.000000000000000E−383 (which is −1000000000000000E−398)
−Nmin as −1E−383
−Nmax = −9.999999999999999E+384
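
Python's decimal module exposes this ordering as the compare_total method. The sketch below sorts a few sample values with it; the sample list is an assumption of this example, and the output is in ascending order (the reverse of the table above).

```python
from decimal import Decimal
from functools import cmp_to_key

print(Decimal("7.5") == Decimal("7.500"))              # True
print(Decimal("7.5").compare_total(Decimal("7.500")))  # 1

samples = [Decimal(s) for s in
           ("7.500", "NaN10", "-7.5", "Infinity", "7.5", "NaN", "-7.500")]

# compare_total returns Decimal -1, 0, or 1, so it can drive a sort.
ordered = sorted(samples, key=cmp_to_key(lambda x, y: int(x.compare_total(y))))
print([str(d) for d in ordered])
# ['-7.5', '-7.500', '7.500', '7.5', 'Infinity', 'NaN', 'NaN10']
```

For negative numbers the rule is mirrored, so −7.5 sorts below −7.500.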

When (and why) is exponential notation used in strings?

The specification describes conversions from decimal numbers (which may be in some internal format) to strings which preserve the sign, coefficient, and exponent of finite numbers. Since the exponents of numbers can be very large, exponential notation is used to keep the lengths of the result strings reasonably short. For example, the value of 1 multiplied by 10^−20 is converted to 1E-20 rather than 0.00000000000000000001.

If a negative exponent is small enough, however, a number is converted to a string without using exponential notation. The switch point is defined to be such that at most five zeros will appear between the decimal point and the first digit of the coefficient. This definition is an arbitrary choice, in a sense, but the choice of five means that measurements down to microns (for example) avoid using exponential notation.

Measurements are often quoted using a power of ten that is a multiple of three, so other possible choices included two leading zeros and eight; however, experiments carried out when the arithmetic in the Rexx programming language was designed in 1981 showed that eight zeros was too many to count ‘at a glance’, and two was too few for many applications.

Exponential notation is also used whenever the exponent is positive, to preserve the distinction between (say) 12300 and 123E+2.
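
These rules can be observed with Python's decimal module, whose str conversion follows this style of number-to-string conversion. The sample inputs below are chosen for illustration.

```python
from decimal import Decimal

inputs = ("1E-20", "0.000001", "0.0000001", "12300", "123E+2")
converted = {text: str(Decimal(text)) for text in inputs}
for text, string in converted.items():
    print(text, "->", string)
# 1E-20 -> 1E-20
# 0.000001 -> 0.000001    (five zeros after the point: no exponent needed)
# 0.0000001 -> 1E-7       (six zeros would be needed, so exponential)
# 12300 -> 12300
# 123E+2 -> 1.23E+4       (positive exponent: coefficient 123 is preserved)
```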

Please send any comments or corrections to Mike Cowlishaw,