Decimal Performance, version 1.12
Copyright (c) IBM Corporation, 2009. All rights reserved. ©
Draft of 7 Apr 2009
[contents | next]

Introduction

This document describes some performance measurements of three implementations (libraries) of decimal operations. These libraries implement various subsets of the operations defined in the Decimal Arithmetic Specification,[1]  which describes a superset of the arithmetic operations in the 2008 IEEE 754 Standard for Floating-Point Arithmetic (‘754r’).[2] 

IEEE 754 specifies two variants of the encoding for decimal data; one with a decimal significand and the other with a binary significand. Each of the libraries measured supports one of these encodings (in various ways), and the performance measurements here use the encoding best suited to each library.

Comments on this document are welcome. Please send any comments, suggestions, and corrections to the author, Mike Cowlishaw (mfc@speleotrove.com).


The libraries

The tables later in this document give measurements for operations (where available) for three decimal implementations:

decNumber module

The decNumber module is part of the IBM decNumber package;[3]  it implements arbitrary-precision arithmetic with fully tailorable parameters (rounding precision, exponent range, and other factors can all be changed at run time). All decNumber operations always accept arbitrary-length operands. The decNumber module uses a general-purpose internal format (tunable at compile time) which requires conversions to and from any external format. When working with 754r encodings all parameters and results require conversions (each about 100 cycles).

decFloats modules

The decFloats modules are also part of the decNumber package; they work directly on the fixed-size encodings with decimal significand. This document gives results for the decDouble and decQuad modules (64-bit and 128-bit formats).

Intel Decimal Floating-Point Library

The Intel[4]  Decimal Floating-Point Library (IDFPL) is an Intel Software Development Product.[5]  The functions in the library work directly on the fixed-size encodings with binary significand (64-bit and 128-bit formats).

All three implementations are open source and written in C.

The decNumber and decFloats implementations require 32-bit binary integer types only, conform to strict aliasing and alignment rules, and are tested for use on both little-endian and big-endian architectures. They support string conversions for both ASCII/UTF8 and EBCDIC, BCD conversions, and decimal integer operations (integer divide, shift, rotate, logical and, or, xor, etc.).

The IDFPL implementation requires 64-bit binary integer and floating-point types, and is assumed to be little-endian and ASCII/UTF8 only (the README files do not refer to big-endian[6]  or EBCDIC support). BCD conversions and decimal integer operations are not supported by the IDFPL implementation.


Description of the tables

In the tables in the later sections, timings for each operation are given in processor clock cycles. Cycle counts are generally a more useful indicator of comparative performance than ‘wall clock’ times, but vary considerably with processor architecture.

For example, the times below are cycles measured on an Intel Pentium M processor in an IBM X41T Thinkpad[7]  – on a Pentium 4 or RISC processor most of the tests would show significantly higher cycle counts. The compiler used also makes a measurable difference, so all the cycle counts were measured using the same hardware, compiler, and compiler options (detailed in the notes in the next section).

In the tables, worst-case cycle times are shown for each operation for the decFloats modules (in the column headed decDouble or decQuad), the IDFPL library (headed idfpl64 or idfpl128), and the decNumber module (headed decNum).

Worst-case timings are quoted because best-case timings are generally trivial special cases (such as NaN arguments) and ‘typical’ instruction mixes are too application-dependent to be generally applicable.

For each operation, the name of the operation is given, along with a brief description of the worst-case form of the operation. This is the worst case for the decFloats modules (in some cases the worst case is different for the other modules).


Notes

The following notes apply to all the tables in this document.
  1. All timings were made on an IBM X41T Tablet PC (Pentium M, 1.5GHz, 1.5GB RAM) under Windows XP Tablet Edition with SP2.
  2. All modules were compiled using GCC version 3.4.4 with optimization settings -O3 -march=i686 (earlier experiments have indicated that these settings are the best compromise for this hardware and version of GCC).
  3. The default tuning parameters were used for decNumber and decFloats (DECUSE64=1, DECDPUN=3, etc.); most of these only affect decNumber.
  4. The options used for compiling and measuring the IDFPL functions were DECIMAL_CALL_BY_REFERENCE=1, DECIMAL_GLOBAL_ROUNDING=0, and DECIMAL_GLOBAL_EXCEPTION_FLAGS=0; these were chosen as the other two implementations also pass parameters and context by reference.
  5. Timings include call/return overhead, and for the decNumber module also include the costs of converting operand(s) to decNumbers and results back to the appropriate format using the decimal64 or decimal128 proxy modules.
  6. ‘n/s’ indicates an operation that is not supported.
  7. ‘BCD’ for decNumber is Packed BCD, using the decPacked module; for decFloats it is 8-bit BCD.[8]  The IDFPL implementation does not provide BCD conversions.
  8. The worst case for each operation is not always obvious from the code and is implementation-dependent (for example, in the decFloats modules, an unaligned add is sometimes faster than an aligned add). It is possible that there may be unusual cases which are slower than the counts listed in the tables, for all the modules, although a wide variety of micro-benchmarks have been tried.
  9. A string-to-number conversion can theoretically have an arbitrarily large worst case as the string could contain any number of leading, trailing, or embedded zeros; the timings shown in the tables measured cases where the input string’s coefficient had up to eight more digits than the precision of the destination format.
  10. Since the performance measurements shown in the tables were made (in October 2007), the common case of aligned additions on relatively short numbers (6–9 digits) has been measured informally with the same compiler on similar hardware. For these, decNumber and IDFPL are close to the same speed, and decFloats requires about 65% of the cycles (and is about 2.5× as fast as the worst-case addition, for both formats).

Footnotes:
[1] See http://speleotrove.com/decimal/decarith.html
[2] IEEE Std 754-1985 – IEEE Standard for Floating-Point Arithmetic, The Institute of Electrical and Electronics Engineers, Inc., New York, 1985.
[3] See http://speleotrove.com/decimal/#decNumber
[4] ‘Intel’ is a trade mark of the Intel Corporation.
[5] See http://www.intel.com/cd/software/products/asmo-na/eng/219861.htm
[6] In version 1.0 there are said to be references in the code to ENDIAN values, so some support may be present.
[7] ‘Pentium’ is a trade mark of the Intel Corporation. ‘Thinkpad’ is a trade mark of Lenovo.
[8] The most recent decFloats modules support Packed BCD directly, however these conversions have not yet been benchmarked.

[contents | next]