Decimal Arithmetic FAQ
Part 4 – Arithmetic Specification Questions 
It is quite possible for decimal arithmetic to be normalized, even if the encoding (representation) of numbers is not. A floating-point unit can easily offer both normalized and unnormalized arithmetic, either under the control of a context flag or by a specific ‘normalize’ instruction (the latter being especially appropriate in a RISC design).
However, for decimal arithmetic, intended as a tool for human use, the choice of unnormalized arithmetic is dictated by the need to mirror manual calculations and other human conventions. A normalized (non-redundant) arithmetic is suitable for purely mathematical calculations, but is inadequate for many other applications.
Note that the unnormalized arithmetic gives exactly the same mathematical value for every result as normalized arithmetic would, but has the following additional advantages.
A normalized arithmetic cannot duplicate the results of unnormalized arithmetic, so existing software decimal calculations cannot be replaced by hardware which provides only normalized arithmetic. In up to 27% of cases the resulting coefficient and exponent will be different. This would require that all applications and their test cases be rewritten; an effort comparable to, but significantly larger than, the ‘Year 2000’ problem.
Normalized arithmetic is close to useless for most applications which use decimal data, because so many operations would have to be recalculated in software in order to give the expected results.
All of these types are a subset of the standard type; in a RISC-like implementation it is especially appropriate for the rescale, reround, and normalize operations to be independent of the general arithmetic.
There is therefore no need for fixed-point and integer decimal datatypes in addition to the floating-point type, and no conversions are needed when mixed integer, fixed-point, and floating-point types are used in a calculation. (For example, when calculating the product 17 × 19.99, as in 17 items at $19.99.)
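Python’s decimal module implements this unnormalized arithmetic (it follows the same General Decimal Arithmetic specification), so a mixed integer/fixed-point calculation like the one above needs no conversions; a small illustrative sketch:

```python
from decimal import Decimal

# 17 items at $19.99: an 'integer' multiplied by a 'fixed-point' value.
# Both are ordinary decimal numbers; no conversion step is needed.
items = Decimal('17')       # coefficient 17, exponent 0
price = Decimal('19.99')    # coefficient 1999, exponent -2
total = items * price
print(total)                # 339.83 -- exact, with exponent -2
```

The result’s exponent (−2) is simply the sum of the operand exponents, independent of the coefficient multiplication.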
If a normalized arithmetic were used, a separate unnormalized floating-point processing unit would still be needed, and conversions between the two units would be necessary.
  unnormalized      normalized

      1.23              1.23
      2.50              2.5
    + 1.27            + 1.27
    ------            ------
      5.00              5
With a normalized arithmetic, alignment of the operands is needed more often, and an extra step is required at the end of every calculation to determine whether normalization is required and then effect it if it is. These unnecessary steps require extra code (in software) or extra circuitry (in hardware), increasing the costs of calculation, testing, and validation.
Further, in the unnormalized arithmetic, the calculations of a result coefficient and exponent are independent (except when the result has to be rounded). This independence reduces the complexity of the calculation and permits the two parts of the calculation to be done entirely in parallel.
For example, 10000×10^{Etiny} (the smallest normal number if precision=5) ÷2 gives 5000×10^{Etiny}, a subnormal number, with no special processing.
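Using Python’s decimal module, gradual underflow can be observed directly. The precision of 5 matches the example; the Emin/Emax limits below are arbitrary illustrative choices (here Etiny = Emin − (prec − 1) = −1003):

```python
from decimal import Context, Decimal, Subnormal

# Illustrative context: precision 5, exponent range chosen arbitrarily.
ctx = Context(prec=5, Emin=-999, Emax=999)
ctx.clear_flags()

nmin = Decimal('1.0000E-999')    # smallest normal number: 10000 x 10^-1003
half = ctx.divide(nmin, Decimal(2))
print(half)                      # a subnormal result (5000 x 10^-1003),
                                 # produced with no special processing
assert ctx.flags[Subnormal]      # the context merely notes it is subnormal
```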
In contrast, normalized arithmetic must treat subnormal values as special cases, which adds complexity to both implementation and testing.
In a normalized arithmetic, zero must be encoded differently from all other values, so every result or use of zero requires specialcase treatment. This requires extra code or circuitry, with the associated testing and validation burden.
Also, although theoretically equivalent, integer arithmetic has a longer history and is more familiar than left-aligned fractional arithmetic. This may make it easier to find existing proofs for results, etc. In general, testing and validation are simplified: all computers already have integer arithmetic, so testing methods can be made consistent with the testing of the integer arithmetic unit (and are well-known and understood).
In particular, this means that if the result of an operation has less than full precision then it was not rounded by that operation; this is particularly useful for checking that adequate precision has been allowed for calculations which must not be rounded.
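This check is easy to make in Python’s decimal module by inspecting the context’s Inexact flag (the precision of 9 here is an arbitrary illustrative choice):

```python
from decimal import Context, Decimal, Inexact

ctx = Context(prec=9)
ctx.clear_flags()

r = ctx.add(Decimal('1.23'), Decimal('1.27'))
assert not ctx.flags[Inexact]    # 2.50 was exact: no rounding occurred

ctx.divide(Decimal(1), Decimal(3))
assert ctx.flags[Inexact]        # 1/3 had to be rounded to 9 digits
```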
In general exponents must be preserved; they form part of the type of a number. If two numbers which have the same exponent are added or subtracted, the result is expected to have that same exponent. If normalization is applied, however, the result exponent varies depending on the values of the operands.
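For example, in Python’s decimal module (which follows this rule) two operands with exponent −2 give a sum with exponent −2:

```python
from decimal import Decimal

# Both operands have exponent -2; the unnormalized sum keeps it.
s = Decimal('1.23') + Decimal('1.27')
print(s)   # 2.50 (not 2.5: the trailing zero is preserved)
```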
When results are different from those expected, users are surprised, frustrated, and lose confidence in the application.
Further, when results match the results computed by hand exactly, application test case generation and validation are easier. Testing is often simpler, too, because the operations are essentially integer operations.
However, if operands are random, then only one in ten will have trailing zeros, and only one in a hundred dyadic operations will have both operands with trailing zeros. Any performance advantage of normalizing these is quite possibly negated by the extra alignments and shifts required for normalization. (The result of 27% of multiplications would need an otherwise unnecessary normalization shift, for example.)
With the current layouts being discussed, it is not practical to increase the coefficient length, so the benefit is limited to increasing the exponent range. In a 64-bit layout, for example, using a normalized representation would allow only a small increase in the exponent range beyond ±384 (already larger than the range available in a binary format of the same size).
This slight increase in exponent range is unlikely to enable a new class of applications or otherwise significantly improve the usefulness of the arithmetic.
This advantage also applies to unnormalized numbers, provided that they are normalized before storing. For example, this can be achieved using a normalize (or ‘store normalized’) instruction.
In scaled integer arithmetic, zero need not be treated as a special case; as with other numbers, redundant encodings are allowed. All numbers with a coefficient of zero, with any exponent, are valid, and (of course) they all have the value zero (0 × 10^{5} is zero).
These permitted redundant encodings of zeros mean that, very importantly, the exponent is independent of the coefficient in all calculations which are not rounded.
For instance, consider subtraction. The rule here is simply that the exponent of the result is the lesser of the exponents of the operands, and the coefficient of the result is obtained by subtracting the coefficients of the operands after (if necessary) aligning one coefficient so that it is expressed with the result exponent.
For example:
  123 - 122  =>  1

(123 minus 122 gives the result 1), and this can also be written (showing the integer coefficient before the E and the exponent after it) as:

  123E+0 - 122E+0  =>  1E+0

Now consider the similar calculation, but with exponent −2 on the two operands instead. The coefficients are used in the same way and give exactly the same result coefficient, and again the exponent of the result is the same as the exponents of the operands:

  1.23 - 1.22  =>  0.01       or:   123E-2 - 122E-2  =>  1E-2

We follow exactly the same process of calculation even if the result happens to be zero:

  123 - 123  =>  0            or:   123E+0 - 123E+0  =>  0E+0

And with exponent −2 on the two operands, again the process is the same:

  1.23 - 1.23  =>  0.00       or:   123E-2 - 123E-2  =>  0E-2

Note that we do not have to inspect the result coefficient to determine the exponent. The exponent can be calculated entirely in parallel with, or even in advance of, the calculation of the coefficient. This simplifies a hardware design or speeds up a software implementation.
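These subtraction rules can be checked with Python’s decimal module, which implements the same specification:

```python
from decimal import Decimal

assert str(Decimal('123')  - Decimal('122'))  == '1'     # 123E+0 - 122E+0 => 1E+0
assert str(Decimal('1.23') - Decimal('1.22')) == '0.01'  # 123E-2 - 122E-2 => 1E-2
assert str(Decimal('123')  - Decimal('123'))  == '0'     # 123E+0 - 123E+0 => 0E+0
assert str(Decimal('1.23') - Decimal('1.23')) == '0.00'  # 123E-2 - 123E-2 => 0E-2
```

Note that the zero result of the last line keeps the operands’ exponent (−2), displayed as 0.00.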
Similarly, we don't have to inspect the exponent in order to determine if a value of a number is zero; the coefficient determines this.
If a zero result were to be treated as a special case (perhaps forcing the exponent to zero), there would be different paths for addition and subtraction depending on the value of the result.
Further, the exponent of the result might then have to depend on the coefficients of the operands, too. For example, in the sum:

  1E+3 + yE+0

one could argue that if y=0 then its exponent should be ignored, and the answer should therefore be 1E+3. However, if y=1 then the answer would be 1001E+0. With the consistent rule for zero, the exponent of the result would be +0 in both cases (1000E+0 and 1001E+0), and again the same process is used whatever the operands.
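The same two sums, written with Python’s decimal module, show the consistent rule in action:

```python
from decimal import Decimal

# y = 0: the zero's exponent participates like any other exponent.
assert str(Decimal('1E+3') + Decimal('0E+0')) == '1000'   # 1000E+0
# y = 1: exactly the same process, no special case.
assert str(Decimal('1E+3') + Decimal('1E+0')) == '1001'   # 1001E+0
```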
The only complication is when the exponent of a zero result would be too large or small for a given representation (after a multiplication or division). In this case it is set (clamped) to the largest or smallest exponent available, respectively (a result of 0 cannot cause an overflow or underflow).
The preservation of exponents on zeros, therefore, simplifies the basic arithmetic operations. It also gives a helpful indication that an underflow may have occurred (after an underflow which rounds to zero, the exponent of the zero will always be the smallest possible exponent, whereas zeros which are the results of normal calculations will have exponents appropriately related to the operands).
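This underflow clue can be seen in Python’s decimal module; the exponent limits below are arbitrary illustrative choices (here Etiny = Emin − (prec − 1) = −1003):

```python
from decimal import Context, Decimal, Underflow

ctx = Context(prec=5, Emin=-999, Emax=999)
ctx.clear_flags()

# An underflow that rounds to zero leaves the smallest possible exponent
# (Etiny) as a telltale sign, rather than a 'plain' zero.
z = ctx.divide(Decimal('1E-999'), Decimal('1E+999'))
print(z)                      # 0E-1003
assert ctx.flags[Underflow]
```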
Absolutely not. Every operation is carried out as though an infinitely precise result were possible, and is only rounded if the destination for the result does not have enough space for the coefficient. This means that results are exact where possible, and can be used for converging algorithms (such as Newton-Raphson approximation), too.
To see how this differs from significance arithmetic, and why it is different, here’s a mini-tutorial...
First, it is quite legitimate (and common practice) for people to record measurements in a form in which, by convention, the number of digits used – and, in particular, the units of the least-significant digit – indicate in some way the likely error in a measurement.
With this scheme, 1.23m might indicate a measurement which is believed to be 1.230000000... meters, plus or minus 0.005m (or “to the nearest 0.01m”), and 1.234m might indicate one that is ±0.0005m (“to the nearest millimeter”), and so on.
(Of course, the error will sometimes be outside those bounds, due to human error, random events, and so on, but let us assume that is rare.)
This is one of the reasons why a representation of a number needs to be able to ‘preserve trailing zeros’ (that is, record the original power of ten associated with the number). When a number is a record of a measurement, the two numbers: 1.000 and 1 are in some sense different. The first perhaps records a measurement within ±0.0005, and the other could be ±0.5 (exactly what they mean depends on the context, of course).
This is the distinction that is lost in a normalized representation (which would record both of these as 1).
Now let us consider arithmetic on these two numbers, in a scientific or other context, and in particular let’s consider adding 0.001 to each.
For the first case (1.000+0.001) we would expect the answer 1.001.
For the second case (1 + 0.001, where the 1 is understood to be ±0.5) it can be argued that the error bounds associated with the 1 mean that the 0.001 is irrelevant (‘swamped by the noise’). So what happens when we attempt this addition?
Note that the first three of these actions only make sense if we know we are dealing with measurements (with their implied error bounds). We must not take these actions in the analogous case of adding one thousand dollars ($1E+3) to a bank account with a balance of one dollar and five cents (which must give the result $1001.05).
Given that an addition operator does not know whether a number is a measurement or an exact quantity, it has no option other than to return the exact answer (the fourth case). How this result is interpreted is up to the application, and that application is at liberty to apply some rounding rule or to treat the result as exact, as appropriate.
So, default operations must treat calculations as exact. But could we not have a context setting of some kind and do another kind of arithmetic where the precision (significance) of the result depends on the precision of the operands?
Let us consider the most popular of such arithmetics. It’s called significance arithmetic and is sometimes advocated in experimental science (particularly in the field of Chemistry).
Significance arithmetic is essentially a set of ‘rules of thumb’. The primary rule is that a result of a calculation must not be given to more digits than the less precise of its operands. This and its other rules ‘work’ in the sense that they alert a student to a situation in which he or she must take care, but they are not a valid arithmetic on measurements.
Consider a simple example in which two measurements are added together, such as 2.3 + 7.2 (and let’s assume the true values really are in the ranges 2.25 through 2.35 and 7.15 through 7.25). From the rules of ‘significance arithmetic’, a result should have the same precision as the least precise of the operands, so in this case the result should be 9.5.
But, if we continue to assume the rule that the last digit indicates the error bounds, this would suggest that the result is 9.5 ±0.05. However, both the original measurements could err in the same direction, so it is clear that in fact the final total could be anywhere in the range 9.4 through 9.6 (that is, 9.5 ±0.1, not ±0.05). So this rule is overoptimistic after the very first calculation (and compounds with each subsequent calculation), and so we cannot apply the rule for more than the first calculation in a sequence.
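A quick check of the true bounds (the interval endpoints come from the assumed ±0.05 ranges above):

```python
from decimal import Decimal

# Worst cases, where both measurements err in the same direction:
low  = Decimal('2.25') + Decimal('7.15')   # 9.40
high = Decimal('2.35') + Decimal('7.25')   # 9.60
# So the true sum lies anywhere in 9.4 through 9.6: that is +/-0.1,
# not the +/-0.05 that the significance-arithmetic result 9.5 implies.
print(low, high)
```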
But the problems are worse than this, because with measurements the likely error is not uniformly distributed between two bounds. As Delury pointed out (in ‘Computation with Approximate Numbers’, Daniel B. Delury, The Mathematics Teacher 51, pp. 521–530, November 1958):
The statement that a length of something or other is 11.3 inches, to the nearest tenth of an inch, is either trivial or not true. The statement that a length of something or other is 11.3 inches, with a standard deviation of 0.2 inch, is a meaningful statement. 
He then goes on to quote a practical example, of a real experiment where 10 measurements are made of the breaking strain of a wire; the 10 measurements vary from 568 through 596, all to three figures. The sum of the 10 measurements is 5752, giving an average (to three figures) of 575 which, according to the rules of significance arithmetic, means that the result is 575 ±0.5.
But he also calculates the standard deviation of the average of the measurements (charitably assuming each individual measurement is infinitely accurate), which is 8.26, meaning that “with near certainty” the true mean lies in the range 575 ±9. In other words, the significance arithmetic has given a grossly false estimate of the bounds in which the true mean will lie while at the same time losing a digit which indicates the center point of any result distribution.
The underlying problem here is that a convention for recording measurements does not follow through to an arithmetic on measurements, however much one might wish it to. One has to apply the theory of errors, or use some other technique (such as interval arithmetic).
Delury gives some specific advice:
Two questions require answers. “How shall [students] carry out their arithmetic and how shall they present the results of their calculations?”
The answers are easily given. In their arithmetic, all numbers are to be treated as exact, with the proviso that if the number of decimal places becomes unduly large, some of them may be eliminated by rounding off in the course of the calculation. The final answer should be rounded off to a reasonable number of decimals. I am sure we would, all of us, like to have something more definite than this, but the fact is that there are no grounds for definiteness. 
So, in summary: numbers must be treated as exact, and arithmetic on them must be exact, with results rounded only to a chosen working precision where necessary.
And that is what IEEE 754-1985, IEEE 854, and the arithmetic now described in the revised IEEE 754, provide.
Numerically, the decimal numbers 7.5 and 7.500 compare equal, but sometimes it is useful to have a defined sorting order for all decimal numbers. The IEEE 754-2008 standard defines a suitable total ordering for decimal numbers (including the special values). In brief, when two finite numbers have the same numeric value but different exponents, the one with the larger exponent is treated as though it were ‘larger’ than the other.
The following table lists a sequence of sample decimal64 values, in the order defined in the IEEE 754-2008 standard. Here, Nmax is the largest positive normal number, Nmin is the smallest positive normal number, and Ntiny is the smallest positive subnormal number (the tiniest nonzero). NaNs and signalling NaNs (sNaN) are sorted by ‘payload’; e.g., NaN10 – a NaN with a payload of 10 – sorts higher than a NaN with a payload of 0 (shown as simply ‘NaN’ in the table).
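Python’s decimal module exposes this total ordering as compare_total, so the rule for equal values with different exponents can be checked directly:

```python
from decimal import Decimal

a, b = Decimal('7.5'), Decimal('7.500')
assert a == b                    # numerically equal...
assert a.compare_total(b) == 1   # ...but 7.5 (exponent -1) sorts higher
                                 # than 7.500 (exponent -3)
```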

The specification describes conversions from decimal numbers (which may be in some internal format) to strings which preserve the sign, coefficient, and exponent of finite numbers. Since the exponents of numbers can be very large, exponential notation is used to keep the lengths of the result strings reasonably short. For example, the value of 1 multiplied by 10^{−20} is converted to 1E−20 rather than 0.00000000000000000001.
If a negative exponent is small enough, however, a number is converted to a string without using exponential notation. The switch point is defined to be such that at most five zeros will appear between the decimal point and the first digit of the coefficient. This definition is an arbitrary choice, in a sense, but the choice of five means that measurements down to microns (for example) avoid using exponential notation.
Measurements are often quoted using a power of ten that is a multiple of three, so other possible choices included two leading zeros and eight; however, experiments carried out when the arithmetic in the Rexx programming language was designed in 1981 showed that eight zeros was too many to count ‘at a glance’, and two was too few for many applications.
Exponential notation is also used whenever the exponent is positive, to preserve the distinction between (say) 12300 and 123E+2.
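These are the conversion rules implemented by Python’s decimal module, so the switch points described above can be checked directly:

```python
from decimal import Decimal

assert str(Decimal('1E-20'))     == '1E-20'     # large negative exponent
assert str(Decimal('0.000001'))  == '0.000001'  # five zeros: plain notation
assert str(Decimal('0.0000001')) == '1E-7'      # six zeros: exponential
assert str(Decimal('12300'))     == '12300'     # exponent 0: plain notation
assert str(Decimal('123E+2'))    == '1.23E+4'   # positive exponent: exponential
```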
Please send any comments or corrections to Mike Cowlishaw, mfc@speleotrove.com 
Copyright © IBM Corporation 2000, 2007. All rights reserved.
