The decNumber Library, version 3.68. Copyright (c) IBM Corporation, 2010. All rights reserved. 23 Jan 2010

Appendix A – Library performance

The decNumber module implements arbitrary-precision arithmetic with fully tailorable parameters (rounding precision, exponent range, and other factors can all be changed at run time). All decNumber operations can accept arbitrary-length operands. Further, decNumber uses a general-purpose internal format (tunable at compile time) which therefore requires conversions to and from any external format (such as strings, BCD, or the IEEE 754 fixed-size decimal encodings).

As a result, the module has significant overheads compared to the dedicated decFloats modules which work directly on the fixed-size encodings. This appendix compares the performance of the decNumber module with the decDouble and decQuad implementations of the same operations. As the tables below show, there is a significant performance advantage in using the decFloats modules when arbitrary-precision operations are not required.

Description of the tables

In the following tables, timings for each operation are given in processor clock cycles. Cycle counts are generally a more useful indicator of comparative performance than ‘wall clock’ times, but they vary considerably with processor architecture. For example, the times below are cycles measured on an Intel Pentium M processor in an IBM X41T Thinkpad[1]; on a Pentium 4 or RISC processor most of the tests would show significantly higher cycle counts. The compiler used also makes a measurable difference. Details of the tests and compiler are given in the notes at the end of this appendix.

Throughout the tables, worst-case cycle times are shown for the main operations in the decDouble and decQuad modules, compared with the same operations using the decNumber module (which requires conversion of operands and results).

Worst-case timings are quoted because best-case timings are generally trivial special cases (such as NaN arguments) and ‘typical’ instruction mixes are very application-dependent.

For each operation, the name of the operation is given, along with a brief description of the worst-case form of the operation. The form described is the worst case for the decFloats module; in some cases the worst case for the decNumber module is a different form of the operation.

decDouble performance tables

decDouble (64-bit) conversions
Operation	Worst case	decDouble (cycles)	decNumber (cycles)
Encoding to BCD (with exponent)	16-digit finite	39	481
BCD to encoding (with exponent)	16-digit finite	46	327
Encoding to string	16-digit, with exponent	84	133
Exact string to encoding (unrounded)	16-digit, with exponent	229	196
String to encoding (rounded)	16-digit, rounded, with exponent	266	548
Widen to decQuad	16-digit, with exponent	30	209
int32 to encoding	From most negative int	39	199
Encoded integer to int32	To most negative int32	32	136

decDouble (64-bit) miscellaneous operations
Operation	Worst case	decDouble (cycles)	decNumber (cycles)
Class (classify datum)	Negative small subnormal	37	113
Copies (Abs/Negate/Sign)	CopySign, copy needed	25	338
Count significant digits	Single digit	24	122
Logical And/Or/Xor/Invert (digitwise)	16-digit	23	510
Shift/Rotate	Rotate 15 digits	154	583

decDouble (64-bit) computations
Operation	Worst case	decDouble (cycles)	decNumber (cycles)
Add (same-sign addition)	16-digit, unaligned, rounded	248	848
Subtract (different-signs addition)	16-digit, unaligned, rounded, borrow	288	848
Compare	16-digit, unaligned, mismatch at end	126	442
CompareTotal	16-digit, unaligned, mismatch at end	149	594
Divide	16- by 16-digit (rounded)	828	1576
FMA (fused multiply-add)	16-digit, subtraction, rounded	785	1683
LogB (returns a decDouble)	Negative result	48	279
MaxNum/MinNum	16-digit, unaligned, mismatch at end	155	656
Multiply	16×16-digit, round needed	362	1305
Quantize	16-digit, round all-nines	112	422
ScaleB (from decDoubles)	Underflow	212	513
To integral value	16-digit, round all-nines	135	709

decQuad performance tables

decQuad (128-bit) conversions
Operation	Worst case	decQuad (cycles)	decNumber (cycles)
Encoding to BCD (with exponent)	34-digit finite	53	460
BCD to encoding (with exponent)	34-digit finite	74	307
Encoding to string	34-digit, with exponent	183	239
Exact string to encoding (unrounded)	34-digit, with exponent	297	597
String to encoding (rounded)	34-digit, rounded, with exponent	451	956
Narrow to decDouble	34-digit, all nines	140	612
int32 to encoding	From most negative int	44	199
Encoded integer to int32	To most negative int32	32	156

decQuad (128-bit) miscellaneous operations
Operation	Worst case	decQuad (cycles)	decNumber (cycles)
Class (classify number)	Negative small subnormal	53	133
Copies (Abs/Negate/Sign)	CopySign, copy needed	27	380
Count significant digits	Single digit	27	138
Logical And/Or/Xor/Invert (digitwise)	34-digit	27	622
Shift/Rotate	Rotate 33 digits	222	812

decQuad (128-bit) computations
Operation	Worst case	decQuad (cycles)	decNumber (cycles)
Add (same-sign addition)	34-digit, aligned	433	1180
Subtract (different-signs addition)	34-digit, unaligned, rounded, borrow	457	1180
Compare	34-digit, unaligned, mismatch at end	187	1125
CompareTotal	34-digit, unaligned, mismatch at end	238	778
Divide	34- by 34-digit (rounded)	2018	3172
FMA (fused multiply-add)	34-digit, subtraction, rounded	1622	2707
LogB (returns a decQuad)	Negative result	58	299
MaxNum/MinNum	34-digit, unaligned, mismatch at end	241	857
Multiply	34×34-digit, round needed	821	2235
Quantize	34-digit, round all-nines	209	670
ScaleB (from decQuads)	Underflow	263	553
To integral value	34-digit, round all-nines	233	886

Notes

The following notes apply to all the tables in this appendix.

All timings were made on an IBM X41T Tablet PC (Pentium M, 1.5GHz, 1.5GB RAM) under Windows XP Tablet Edition with SP2; the modules were compiled using GCC version 3.4.4 with optimization settings -O3 -march=i686.
The default tuning parameters were used (DECUSE64=1, DECDPUN=3, etc.); some of these only affect decNumber.
Timings include call/return overhead, and for the decNumber module also include the costs of converting operand(s) to decNumbers and results back to the appropriate format using the decimal64 or decimal128 module.
‘BCD’ for decNumber is Packed BCD, using the decPacked module; for decFloats it is 8-bit BCD.
The worst case for each operation is not always obvious from the code and is implementation-dependent (for example, in the decFloats modules, an unaligned add is sometimes faster than an aligned add). There may be unusual cases which are slower than the decFloats counts listed above, although a wide variety of micro-benchmarks have been tried.
A string-to-number conversion can theoretically have an arbitrarily large worst case as the string could contain any number of leading, trailing, or embedded zeros; the timings above measured cases where the input string’s coefficient had up to eight more digits than the precision of the destination format.

Footnotes:

[1]	‘Intel’ and ‘Pentium’ are trade marks of the Intel Corporation. ‘Thinkpad’ is a trade mark of Lenovo.
