Bibliography of material on Decimal Arithmetic

Decimal Arithmetic: Performance

bhat2007
Performance Characterization of Decimal Arithmetic in Commercial Java Workloads, M. Bhat, J. Crawford, R. Morin, and K. Shiv, IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS 2007), pp54–61, IEEE, April 2007.
Abstract: Binary floating-point numbers with finite precision cannot represent all decimal numbers with complete accuracy. This can often lead to errors while performing calculations involving floating point numbers. For this reason many commercial applications use special decimal representations for performing these calculations, but their use carries performance costs such as bi-directional conversion. The purpose of this study was to understand the total application performance impact of using these decimal representations in commercial workloads, and provide a foundation of data to justify pursuing optimized hardware support for decimal math. In Java, a popular development environment for commercial applications, the BigDecimal class is used for performing accurate decimal computations. BigDecimal provides operations for arithmetic, scale manipulation, rounding, comparison, hashing, and format conversion. We studied the impact of BigDecimal usage on the performance of server-side Java applications by analyzing its usage on two standard enterprise benchmarks, SPECjbb2005 and SPECjAppServer2004 as well as a real-life mission-critical financial workload, Morgan Stanley’s Trade Completion. In this paper, we present detailed performance characteristics and we conclude that, relative to total application performance, the overhead of using software decimal implementations is low, and at least from the point of view of these workloads, there is insufficient performance justification to pursue hardware solutions.
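Note (illustrative, not from the paper): the representation problem described in the first sentences of the abstract can be seen in a few lines. The sketch below uses Python's standard `decimal` module as a stand-in for Java's BigDecimal; that substitution is an assumption made for brevity, not something the paper does.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact binary representation,
# so their sum is not exactly the double nearest to 0.3.
binary_sum = 0.1 + 0.2

# Decimal arithmetic on the same literals is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(binary_sum == 0.3)               # False
print(decimal_sum == Decimal("0.3"))   # True
```

The software decimal type gives the expected answer at the cost of extra work per operation, which is the overhead the study sets out to measure.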

buch1959
Fingers or Fists? (The Choice of Decimal or Binary representation), Werner Buchholz, Communications of the ACM, Vol. 2 #12, pp3–11, ACM Press, December 1959.
Abstract: The binary number system offers many advantages over a decimal representation for a high-performance, general-purpose computer. The greater simplicity of a binary arithmetic unit and the greater compactness of binary numbers both contribute directly to arithmetic speed. Less obvious and perhaps more important is the way binary addressing and instruction formats can increase the overall performance. Binary addresses are also essential to certain powerful operations which are not practical with decimal instruction formats.
    On the other hand, decimal numbers are essential for communicating between man and the computer. In applications requiring the processing of a large volume of inherently decimal input and output data, the time for decimal-binary conversion needed by a purely binary computer may be significant. A slower decimal adder may take less time than a fast binary adder doing an addition and two conversions.
    A careful review of the significance of decimal and binary number systems led to the adoption in the IBM STRETCH computer of binary addressing and both binary and decimal data arithmetic, supplemented by efficient conversion instructions.
Note: Letters to the editor in response to this paper were published in CACM, Vol. 3, #3, March 1960.

cowlis2002b
The ‘telco’ benchmark, M. F. Cowlishaw, URL: http://speleotrove.com/decimal, 3pp, IBM Hursley Laboratory, May 2002.
Abstract: This benchmark was devised in order to investigate the balance between Input and Output (I/O) time and calculation time in a simple program which realistically captures the essence of a telephone company billing application.
    In summary, the application reads a large input file containing a suitably distributed list of telephone call durations (each in seconds). For each call, a charging rate is chosen and the price calculated and rounded to hundredths. One or two taxes are applied (depending on the type of call) and the total cost is converted to a character string and written to an output file. Running totals of the total cost and taxes are kept; these are displayed at the end of the benchmark for verification.
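Note (illustrative sketch, not the benchmark's code): the processing loop summarized above might look roughly like this in Python. The rates, tax percentages, rounding modes, and call-type names are assumptions chosen for illustration, since the abstract does not state them, and an in-memory list stands in for the input and output files.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

CENT = Decimal("0.01")

# Illustrative values only; the abstract does not give rates or tax percentages.
RATES = {"local": Decimal("0.0013"), "distance": Decimal("0.00894")}
BASIC_TAX = Decimal("0.0675")      # applied to every call
DISTANCE_TAX = Decimal("0.0341")   # second tax, applied to distance calls only


def bill_calls(calls):
    """Price (call_type, seconds) pairs; return output strings and running totals."""
    total_cost = Decimal("0.00")
    total_tax = Decimal("0.00")
    output = []
    for call_type, seconds in calls:
        # Price = duration * rate, rounded to hundredths.
        price = (Decimal(seconds) * RATES[call_type]).quantize(
            CENT, rounding=ROUND_HALF_EVEN)
        # One or two taxes depending on the call type (truncation is an assumption).
        tax = (price * BASIC_TAX).quantize(CENT, rounding=ROUND_DOWN)
        if call_type == "distance":
            tax += (price * DISTANCE_TAX).quantize(CENT, rounding=ROUND_DOWN)
        cost = price + tax
        total_cost += cost
        total_tax += tax
        output.append(str(cost))   # converted to a character string for output
    return output, total_cost, total_tax


lines, total_cost, total_tax = bill_calls([("local", 308), ("distance", 1200)])
print(lines, total_cost, total_tax)
```

Timing the string conversion and the arithmetic separately in a loop like this is one way to see the I/O versus calculation balance the benchmark is meant to expose.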

erle2002
Potential Speedup with Decimal Floating-Point Hardware, Mark A Erle, Michael J Schulte, and J G Linebarger, Proceedings of the Thirty Sixth Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, California, pp1073–1077, IEEE Press, November 2002.
Abstract: This paper addresses the potential speedup achieved by using decimal floating-point hardware, instead of software routines, on a high-performance super-scalar architecture. Software routines were written to perform decimal addition, subtraction, multiplication, and division. Cycle counts were then measured for each instruction using the SimpleScalar simulator. After this, new hardware algorithms were developed, existing algorithms were analyzed, and cycle counts were estimated for the same set of instructions using specialized decimal floating-point hardware. This data was then used to show the potential speedup obtained for programs with different instruction mixes and a recently developed benchmark.
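Note (illustrative, not the paper's method): one common way to reason about how per-operation speedups translate into whole-program speedups for different instruction mixes is Amdahl's law. The sketch below applies it with hypothetical fractions and a hypothetical 10x operation speedup; none of these figures come from the paper.

```python
def overall_speedup(decimal_fraction, decimal_speedup):
    """Amdahl's-law estimate of whole-program speedup.

    decimal_fraction: share of run time spent in decimal operations (0..1).
    decimal_speedup:  factor by which hardware accelerates those operations.
    """
    return 1.0 / ((1.0 - decimal_fraction) + decimal_fraction / decimal_speedup)


# Hypothetical instruction mixes, not figures from the paper.
for fraction in (0.05, 0.50, 0.90):
    gain = overall_speedup(fraction, 10.0)
    print(f"{fraction:.0%} decimal time -> {gain:.2f}x overall")
```

The overall gain depends far more on the share of time spent in decimal operations than on the raw speed of the hardware unit.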

hickmann2007
A Parallel IEEE P754 Decimal Floating-Point Multiplier, Brian J. Hickmann, Andrew Krioukov, Michael J. Schulte, and Mark A. Erle, Proceedings of the IEEE International Conference on Computer Design 2007, pp296–303, IEEE, October 2007.
Abstract: Decimal floating-point multiplication is important in many commercial applications including banking, tax calculation, currency conversion, and other financial areas. This paper presents a fully parallel decimal floating-point multiplier compliant with the recent draft of the IEEE P754 Standard for Floating-point Arithmetic (IEEE P754). The novelty of the design is that it is the first parallel decimal floating-point multiplier offering low latency and high throughput. This design is based on a previously published parallel fixed-point decimal multiplier which uses alternate decimal digit encodings to reduce area and delay. The fixed-point design is extended to support floating-point multiplication by adding several components including exponent generation, rounding, shifting, and exception handling. Area and delay estimates are presented that show a significant latency and throughput improvement with a substantial increase in area as compared to the only published IEEE P754 compliant sequential floating-point multiplier. To the best of our knowledge, this is the first publication to present a fully parallel decimal floating-point multiplier that complies with IEEE P754.

kenney2004a
Multioperand Decimal Addition (extended version), Robert D Kenney and Michael J Schulte, Proceedings of the IEEE Computer Society Annual Symposium on VLSI, Lafayette, LA, 10pp, IEEE, February 2004.
Abstract: This paper introduces and analyzes four techniques for performing fast decimal addition on multiple binary coded decimal (BCD) operands. Three of the techniques speculate BCD correction values and use chaining to correct intermediate results. The first speculates over one addition. The second speculates over two additions. The third employs multiple instances of the second technique in parallel and then merges the results. The fourth technique uses a binary carry-save adder tree and produces a binary sum. Combinational logic is then used to correct the sum and determine the carry into the next digit. Multioperand adder designs are constructed and synthesized for four to sixteen input operands. Analyses are performed on the synthesis results and the merits of each technique are discussed. Finally, these techniques are compared to previous attempts made at speeding up decimal addition.
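Note (illustrative, not the paper's design): the "BCD correction value" mentioned above refers to the adjustment needed when a 4-bit binary adder is used on decimal digits. The sketch below shows the textbook non-speculative form for two operands: add the digits in binary, and if the result exceeds 9, add 6 and propagate a carry. The function name and digit-list layout are assumptions for this example.

```python
def bcd_add(a_digits, b_digits):
    """Add two equal-length BCD operands given as digit lists,
    least significant digit first. Returns (sum_digits, carry_out)."""
    result = []
    carry = 0
    for a, b in zip(a_digits, b_digits):
        s = a + b + carry          # plain binary add, range 0..19
        if s > 9:                  # not a valid BCD digit
            s += 6                 # correction: skip the six unused 4-bit codes
            carry = 1
        else:
            carry = 0
        result.append(s & 0xF)     # keep the low 4 bits as the BCD digit
    return result, carry


# 478 + 256 = 734, digits listed least significant first.
print(bcd_add([8, 7, 4], [6, 5, 2]))
```

The techniques in the paper differ in that they guess the correction ahead of time across one or two additions, or defer it until after a binary carry-save tree.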

kenney2004b
High-Frequency Decimal Multiplier, Robert D Kenney, Michael J Schulte, and Mark A. Erle, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, ISBN 0 7695 2231 9, pp26–29, IEEE, October 2004.
Abstract: Decimal arithmetic is regaining popularity in the computing community due to the growing importance of commercial, financial, and Internet-based applications, which process decimal data. This paper presents an iterative decimal multiplier, which operates at high clock frequencies and scales well to large operand sizes. The multiplier uses a new decimal representation for intermediate products, which allows for a very fast two-stage iterative multiplier design. Decimal multipliers, which are synthesized using a 0.11 micron CMOS standard cell library, operate at clock frequencies close to 2 GHz. The latency of the proposed design to multiply two n-digit BCD operands is (n + 8) cycles with a new multiplication able to begin every (n + 1) cycles.
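Note (illustrative arithmetic): the cycle counts quoted at the end of the abstract can be turned into concrete numbers for a given operand size. The sketch assumes a clock of exactly 2 GHz, whereas the abstract only says "close to 2 GHz", so the nanosecond figure is an approximation.

```python
def multiplier_timing(n_digits, clock_ghz=2.0):
    """Return (latency_cycles, issue_interval_cycles, latency_ns)
    for n-digit BCD operands, using the formulas in the abstract."""
    latency_cycles = n_digits + 8      # cycles until a product is available
    issue_interval = n_digits + 1      # cycles between successive multiplications
    latency_ns = latency_cycles / clock_ghz   # assumes the stated clock rate
    return latency_cycles, issue_interval, latency_ns


# 16-digit operands: 24-cycle latency, a new multiply every 17 cycles.
print(multiplier_timing(16))
```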

kenney2005
High-speed multioperand decimal adders, R. D. Kenney and M. J. Schulte, IEEE Transactions on Computers, Vol. 54 #8, ISSN 0018-9340, pp953–963, IEEE, August 2005.
Abstract: There is increasing interest in hardware support for decimal arithmetic as a result of recent growth in commercial, financial, and Internet-based applications. Consequently, new specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic. This paper introduces and analyzes three techniques for performing fast decimal addition on multiple binary coded decimal (BCD) operands. Two of the techniques speculate BCD correction values and correct intermediate results while adding the input operands. The first speculates over one addition. The second speculates over two additions. The third technique uses a binary carry-save adder tree and produces a binary sum. Combinational logic is then used to correct the sum and determine the carry into the next more significant digit. Multioperand adder designs are constructed and synthesized for four to 16 input operands. Analyses are performed on the synthesis results and the merits of each technique are discussed. Finally, these techniques are compared to several previous techniques for high-speed decimal addition.

kim2006
A Hybrid Decimal Division Algorithm Reducing Computational Iterations, Yong-Dae Kim, Soon-Youl Kwon, Seon-Kyoung Han, Kyoung-Rok Cho, and Younggap You, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences Vol. E89-A #6, pp1807–1812, The Institute of Electronics, Information and Communication Engineers, 2006.
Abstract: This paper presents a hybrid decimal division algorithm to improve division speed. The proposed hybrid algorithm employs either a non-restoring or a restoring algorithm on each digit to reduce iterative computations. The selection of the algorithm is based on the value of the remainder relative to half of the divisor. The proposed algorithm requires a maximum of 7n+4 add/subtract operations for an n-digit quotient, whereas other restoring or non-restoring schemes require more than 10n+1 operations.
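Note (illustrative arithmetic): the two operation-count bounds in the abstract can be compared directly for particular quotient lengths. The digit counts below (7, 16, and 34, the precisions of the IEEE 754-2008 decimal formats) are chosen for illustration and are not taken from the paper.

```python
def hybrid_max_ops(n_digits):
    """Upper bound on add/subtract operations for the hybrid scheme."""
    return 7 * n_digits + 4


def conventional_ops_bound(n_digits):
    """Bound the abstract says other schemes exceed."""
    return 10 * n_digits + 1


for n in (7, 16, 34):
    print(n, hybrid_max_ops(n), conventional_ops_bound(n))
```

For a 16-digit quotient this gives at most 116 operations against more than 161.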

nikmehr2004
A decimal carry-free adder, Hooman Nikmehr, Braden Phillips, and Cheng-Chew Lim, SPIE Symposium Smart Materials, Nano-, and Micro-Smart Systems, Proceedings of SPIE Vol. 5649, 12pp, SPIE International Society for Optical Engineering, December 2004.
Abstract: Recently, decimal arithmetic has become attractive in the financial and commercial world including banking, tax calculation, currency conversion, insurance and accounting. Although computers are still carrying out decimal calculation using software libraries and binary floating-point numbers, it is likely that in the near future, all processors will be equipped with units performing decimal operations directly on decimal operands. One critical building block for some complex decimal operations is the decimal carry-free adder. This paper discusses the mathematical framework of the addition, introduces a new signed-digit format for representing decimal numbers and presents an efficient architectural implementation. Delay estimation analysis shows that the adder offers improved performance over earlier designs.

peuto1977
An instruction timing model of CPU performance, Bernard L. Peuto and Leonard J. Shustek, Proceedings of the 4th annual symposium on Computer architecture, pp165–178, ACM Press, 1977.
Abstract: A model of high-performance computers is derived from instruction timing formulas, with compensation for pipeline and cache memory effects. The model is used to predict the performance of the IBM 370/168 and the Amdahl 470 V/6 on specific programs, and the results are verified by comparison with actual performance. Data collected about program behavior is combined with the performance analysis to highlight some of the problems with high-performance implementations of such architectures.

peuto1998
An Instruction Timing Model of CPU Performance, Bernard L. Peuto and Leonard J. Shustek, International Conference on Computer Architecture: 25 years of the International Symposia on Computer architecture, pp152–165, ACM Press, 1998.
Abstract: A model of high-performance computers is derived from instruction timing formulas, with compensation for pipeline and cache memory effects. The model is used to predict the performance of the IBM 370/168 and the Amdahl 470 V/6 on specific programs, and the results are verified by comparison with actual performance. Data collected about program behavior is combined with the performance analysis to highlight some of the problems with high-performance implementations of such architectures.
Note: Original reference: ISCA 1977: pp165-178.

schulte2005
Performance Evaluation of Decimal Floating-Point Arithmetic, Michael J. Schulte, Nick Lindberg, and Anitha Laxminarain, Proceedings of the 6th IBM Austin Center for Advanced Studies Conference, Austin, TX, 8pp, IBM, February 2005.
Abstract: The prominence of decimal data in commercial and financial applications has led researchers to pursue efficient techniques for performing decimal floating-point arithmetic. While several software implementations of decimal floating-point arithmetic have been implemented, there is a growing need to provide hardware support for decimal floating-point arithmetic to keep up with the processing demands of emerging commercial and financial applications. This paper evaluates and compares the performance of decimal floating-point arithmetic operations when implemented on superscalar processors using either software libraries or specialized hardware designs. Our comparisons show that hardware implementations of decimal floating-point arithmetic operations are one to two orders of magnitude faster than software implementations.

wang2004
Decimal Floating-Point Division Using Newton-Raphson Iteration, Liang-Kai Wang and Michael J Schulte, Proceedings of the 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP’04), pp84–95, IEEE Computer Society Press, September 2004.
Abstract: Decreasing feature sizes allow additional functionality to be added to future microprocessors to improve the performance of important application domains. As a result of rapid growth in financial, commercial, and Internet-based applications, hardware support for decimal floating-point arithmetic is now being considered by various computer manufacturers and specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic (IEEE-754R). This paper presents an efficient arithmetic algorithm and hardware design for decimal floating-point division. The design uses an optimized piecewise linear approximation, a modified Newton-Raphson iteration, a specialized rounding technique, and a simplified combined decimal incrementer/decrementer. Synthesis results show that a 64-bit (16-digit) implementation of the decimal divider, which is compliant with IEEE-754R, has an estimated critical path delay of 0.69 ns when implemented using LSI Logic’s 0.11 micron Gflx-P standard cell library.
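Note (illustrative software sketch, not the paper's hardware design): Newton-Raphson division computes the reciprocal of the divisor with the iteration x_{k+1} = x_k (2 - d x_k) and then multiplies by the dividend. The version below uses Python's `decimal` module, a fixed seed of 0.1 in place of the paper's piecewise linear approximation, a fixed iteration count, and simple final rounding; all of these are simplifying assumptions, and it handles positive divisors only.

```python
from decimal import Decimal, localcontext


def nr_divide(numerator, divisor, digits=16, iterations=10):
    """Approximate numerator / divisor (divisor > 0) by Newton-Raphson
    iteration on the reciprocal, rounded to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits + 6                  # working precision with guard digits
        n, d = Decimal(numerator), Decimal(divisor)
        # Normalize the divisor into [1, 10) so one seed works for all inputs.
        shift = d.adjusted()
        d_norm = d.scaleb(-shift)
        x = Decimal("0.1")                     # crude seed; valid since 0 < x < 2/d_norm
        for _ in range(iterations):
            x = x * (2 - d_norm * x)           # error is squared on every pass
        q = n * x.scaleb(-shift)               # undo the normalization, then multiply
        ctx.prec = digits
        return +q                              # unary plus rounds to the target precision


print(nr_divide(1, 3))
print(nr_divide(22, 7))
```

Because the error roughly squares at each step, a better starting approximation, as used in the paper, reduces the number of iterations needed.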

wang2005
Decimal Floating-Point Square Root Using Newton-Raphson Iteration, Liang-Kai Wang and Michael J Schulte, Proceedings of the 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP’05), pp309–315, IEEE Computer Society Press, July 2005.
Abstract: With continued reductions in feature size, additional functionality may be added to future microprocessors to boost the performance of important application domains. Due to growth in commercial, financial, and Internet-based applications, decimal floating point arithmetic is now attracting more attention, and hardware support for decimal operations is being considered by various computer manufacturers. In order to standardize decimal number formats and operations, specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic (IEEE-754R). This paper presents an efficient arithmetic algorithm and hardware design for decimal floating-point square root. This design uses an optimized piecewise linear approximation, a modified Newton-Raphson iteration, a specialized rounding technique, and a modified decimal multiplier. Synthesis results show that a 64-bit (16-digit) implementation of the decimal square root, which is compliant with the IEEE-754R, has an estimated critical path delay of 0.95 ns and maximum latency of 210 clock cycles when implemented using LSI Logic’s 0.11 micron Gflx-P Standard Cell library.

wang2007b
Benchmarks and Performance Analysis of Decimal Floating-Point Applications, Liang-Kai Wang, Charles Tsen, Michael J. Schulte, and Divya Jhalani, Proceedings of the IEEE International Conference on Computer Design 2007, pp164–170, IEEE, October 2007.
Abstract: The IEEE P754 Draft Standard for Floating-point Arithmetic provides specifications for Decimal Floating-Point (DFP) formats and operations. Based on this standard, many developers will provide support for DFP calculations. We present a benchmark suite for DFP applications and use this suite to evaluate the performance of hardware and software DFP solutions. Our benchmarks include banking, commerce, risk-management, tax, and telephone billing applications organized into a suite of five macro benchmarks. In addition to developing our own applications, we leverage open-source projects and academic financial analysis applications. The benchmarks are modular, making them easy to adapt for different DFP solutions. We use the benchmarks to evaluate the performance of the decNumber DFP library and an extended version of the SimpleScalar PISA architecture with hardware and instruction set support for DFP operations. Our analysis shows that providing processor support for high-speed DFP operations significantly improves the performance of DFP applications.

you2006
Dynamic decimal adder circuit design by using the carry look ahead, Younggap You, Yong Dae Kim, and Jong Hwa Choi, IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems, 3pp, IEEE Computer Society, April 2006.
Abstract: This paper presents a carry look ahead (CLA) circuitry design based on dynamic circuit aiming at delay reduction in addition of BCD coded decimal numbers. The performance of the proposed dynamic decimal adder is analyzed demonstrating its speed improvement. Timing simulation on the proposed decimal addition circuit employing 0.25µm CMOS technology yields the worst case delay of 622 ns.

The 17 references listed on this page are selected from the bibliography on Decimal Arithmetic collected by Mike Cowlishaw. Please see the index page for more details and other categories.

bhat2007 ¿Web?	Performance Characterization of Decimal Arithmetic in Commercial Java Workloads, M. Bhat, J. Crawford, R. Morin, and K. Shiv, IEEE International Symposium on Performance Analysis of Systems & Software, 2007 (ISPASS 2007) IEEE, pp54–61, April 2007. Abstract: Binary floating-point numbers with finite precision cannot represent all decimal numbers with complete accuracy. This can often lead to errors while performing calculations involving floating point numbers. For this reason many commercial applications use special decimal representations for performing these calculations, but their use carries performance costs such as bi-directional conversion. The purpose of this study was to understand the total application performance impact of using these decimal representations in commercial workloads, and provide a foundation of data to justify pursuing optimized hardware support for decimal math. In Java, a popular development environment for commercial applications, the BigDecimal class is used for performing accurate decimal computations. BigDecimal provides operations for arithmetic, scale manipulation, rounding, comparison, hashing, and format conversion. We studied the impact of BigDecimal usage on the performance of server-side Java applications by analyzing its usage on two standard enterprise benchmarks, SPECjbb2005 and SPECjAppServer2004 as well as a real-life mission-critical financial workload, Morgan Stanley’s Trade Completion. In this paper, we present detailed performance characteristics and we conclude that, relative to total application performance, the overhead of using software decimal implementations is low, and at least from the point of view of these workloads, there is insufficient performance justification to pursue hardware solutions
buch1959 ¿Web?	Fingers or Fists? (The Choice of Decimal or Binary representation), Werner Buchholz, Communications of the ACM, Vol. 2 #12, pp3–11, ACM Press, December 1959. Abstract: The binary number system offers many advantages over a decimal representation for a high-perfornmnee, general-purpose computer. The greater simplicity of a binary arithmetic unit and the greater compactness of binary numbers both contribute directly to arithmetic speed. Less obvious and perhaps more important is the way binary addressing and instruction formats can increase the overall performance. Binary addresses are also essential to certain powerful operations which are not practical with decimal instruction formats. On the other hand, decimal numbers are essential for communicating between man and the computer. In applications requiring the processing of a large volume of inherently decimal input and output data, the time for decimal-binary conversion needed by a purely binary computer may be significant. A slower decimal adder may take less time than a fast binary adder doing an addition and two conversions. A careful review ef the significance of decimal and binary number systems led to the adoption in the IBM STRETCH computer of binary addressing and both binary and decimal data arithmetic, supplemented by efficient conversion instructions. Note: Letters to the edtor in response to this paper were published in CACM, Vol. 3, #3, March 1960.
cowlis2002b URL ¿Web?	The ‘telco’ benchmark, M. F. Cowlishaw, URL: `http://speleotrove.com/decimal`, 3pp, IBM Hursley Laboratory, May 2002. Abstract: This benchmark was devised in order to investigate the balance between Input and Output (I/O) time and calculation time in a simple program which realistically captures the essence of a telephone company billing application. In summary, the application reads a large input file containing a suitably distributed list of telephone call durations (each in seconds). For each call, a charging rate is chosen and the price calculated and rounded to hundreths. One or two taxes are applied (depending on the type of call) and the total cost is converted to a character string and written to an output file. Running totals of the total cost and taxes are kept; these are displayed at the end of the benchmark for verification.
erle2002 ¿Web?	Potential Speedup with Decimal Floating-Point Hardware, Mark A Erle, Michael J Schulte, and J G Linebarger, Proceedings of the Thirty Sixth Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, California, pp1073–1077, IEEE Press, November 2002. Abstract: This paper address the potential speedup achieved by using decimal floating-point hardware, instead of software routines, on a high-performance super-scalar architecture. Software routines were written to performag decimal addition, subtraction, multiplication, and division. Cycle counts were then measured for each instruction using the Simplescalar simulator. After this, new hardware algorithms were developed, existing algorithms were analyzed, and cycle counts were estimated for the same set of instructions using specialized decimal floating-point hardware. This data was then used to show the potential speedup obtained for programs with different instruction mixes and a recently developed benchmark.
hickmann2007 ¿Web?	A Parallel IEEE P754 Decimal Floating-Point Multiplier, Brian J. Hickmann, Andrew Krioukov, Michael J. Schulte, and Mark A. Erle, Proceedings of the IEEE International Conference on Computer Design 2007, pp296–303, IEEE, October 2007. Abstract: Decimal floating-point multiplication is important in many commercial applications including banking, tax calculation, currency conversion, and other financial areas. This paper presents a fully parallel decimal floating-point multiplier compliant with the recent draft of the IEEE P754 Standard for Floating-point Arithmetic (IEEE P754). The novelty of the design is that it is the first parallel decimal floating-point multiplier offering low latency and high throughput. This design is based on a previously published parallel fixed-point decimal multiplier which uses alternate decimal digit encodings to reduce area and delay. The fixed-point design is extended to support floating-point multiplication by adding several components including exponent generation, rounding, shifting, and exception handling. Area and delay estimates are presented that show a significant latency and throughput improvement with a substantial increase in area as compared to the only published IEEE P754 compliant sequential floating-point multiplier. To the best of our knowledge, this is the first publication to present a fully parallel decimal floating-point multiplier that complies with IEEE P754.
kenney2004a ¿Web?	Multioperand Decimal Addition (extended version), Robert D Kenney and Michael J Schulte, Proceedings of the IEEE Computer Society Annual Symposium on VLSI, Lafayette, LA, February, 2004., 10pp, IEEE, February 2004. Abstract: This paper introduces and analyzes four techniques for performing fast decimal addition on multiple binary coded decimal (BCD) operands. Three of the techniques speculate BCD correction values and use chaining to correct intermediate results. The first speculates over one addition. The second speculates over two additions. The third employs multiple instances of the second technique in parallel and then merges the results. The fourth technique uses a binary carry-save adder tree and produces a binary sum. Combinational logic is then used to correct the sum and determine the carry into the next digit. Multioperand adder designs are constructed and synthesized for four to sixteen input operands. Analyses are performed on the synthesis results and the merits of each technique are discussed. Finally, these techniques are compared to previous attempts made at speeding up decimal addition.
kenney2004b ¿Web?	High-Frequency Decimal Multiplier, Robert D Kenney, Michael J Schulte, and Mark A. Erle, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, ISBN 0 7695 2231 9, pp26–29, IEEE, October 2004. Abstract: Decimal arithmetic is regaining popularity in the computing community due to the growing importance of commercial, financial, and Internet-based applications, which process decimal data. This paper presents an iterative decimal multiplier, which operates at high clock frequencies and scales well to large operand sizes. The multiplier uses a new decimal representation for intermediate products, which allows for a very fast two- stage iterative multiplier design. Decimal multipliers, which are synthesized using a 0.11 micron CMOS standard cell library, operate at clock frequencies close to 2 GHz. The latency of the proposed design to multiply two n-digit BCD operands is (n + 8) cycles with a new multiplication able to begin every (n + 1) cycles.
kenney2005 ¿Web?	High-speed multioperand decimal adders, R.D. Kenney and M. J. Schulte, IEEE Transactions on Computers, Vol. 54 #8, ISSN 0018-9340, pp953–963, IEEE, August 2005. Abstract: There is increasing interest in hardware support for decimal arithmetic as a result of recent growth in commercial, financial, and Internet-based applications. Consequently, new specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic. This paper introduces and analyzes three techniques for performing fast decimal addition on multiple binary coded decimal (BCD) operands. Two of the techniques speculate BCD correction values and correct intermediate results while adding the input operands. The first speculates over one addition. The second speculates over two additions. The third technique uses a binary carry-save adder tree and produces a binary sum. Combinational logic is then used to correct the sum and determine the carry into the next more significant digit. Multioperand adder designs are constructed and synthesized for four to 16 input operands. Analyses are performed on the synthesis results and the merits of each technique are discussed. Finally, these techniques are compared to several previous techniques for high-speed decimal addition.
kim2006 ¿Web?	A Hybrid Decimal Division Algorithm Reducing Computational Iterations, Yong-Dae Kim, Soon-Youl Kwon, Seon-Kyoung Han, Kyoung-Rok Cho, and Younggap You, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences Vol. E89-A #6, pp1807–1812, The Institute of Electronics, Information and Communication Engineers, 2006. Abstract: This paper presents a hybrid decimal division algorithm to improve division speed. The proposed hybrid algorithm employs either non-restoring or restoring algorithm on each digit to reduce iterative computations. The selection of the algorithm is based on the relative remainder values with respect to the half of its divisor. The proposed algorithm requires maximum 7n+4 add/subtract operations for an n-digit quotient, whereas other restoring or non-restoring schemes comprise more than 10n+1 operations.
nikmehr2004 ¿Web?	A decimal carry-free adder, Hooman Nikmehr, Braden Phillips, and Cheng-Chew Lim, SPIE Symposium Smart Materials, Nano-, and Micro-Smart Systems, Proceedings of SPIE Vol. 5649, 12pp, SPIE International Society for Optical Engineering, December 2004. Abstract: Recently, decimal arithmetic has become attractive in the financial and commercial world including banking, tax calculation, currency conversion, insurance and accounting. Although computers are still carrying out decimal calculation using software libraries and binary floating-point numbers, it is likely that in the near future, all processors will be equipped with units performing decimal operations directly on decimal operands. One critical building block for some complex decimal operations is the decimal carry-free adder. This paper discusses the mathematical framework of the addition, introduces a new signed-digit format for representing decimal numbers and presents an efficient architectural implementation. Delay estimation analysis shows that the adder offers improved performance over earlier designs.
peuto1977 ¿Web?	An instruction timing model of CPU performance, Bernard L. Peuto and Leonard J. Shustek, Proceedings of the 4th annual symposium on Computer architecture, pp165–178, ACM Press, 1977. Abstract: A model of high-performance computers is derived from instruction timing formulas, with compensation for pipeline and cache memory effects. The model is used to predict the performance of the IBM 370/168 and the Amdahl 470 V/6 on specific programs, and the results are verified by comparison with actual performance. Data collected about program behavior is combined with the performance analysis to highlight some of the problems with high-performance implementations of such architectures.
peuto1998	An Instruction Timing Model of CPU Performance, Bernard L. Peuto and Leonard J. Shustek, International Conference on Computer Architecture: 25 years of the International Symposia on Computer architecture, pp152–165, ACM Press, 1998. Abstract: A model of high-performance computers is derived from instruction timing formulas, with compensation for pipeline and cache memory effects. The model is used to predict the performance of the IBM 370/168 and the Amdahl 470 V/6 on specific programs, and the results are verified by comparison with actual performance. Data collected about program behavior is combined with the performance analysis to highlight some of the problems with high-performance implementations of such architectures. Note: Original reference: ISCA 1977: pp165–178.
schulte2005	Performance Evaluation of Decimal Floating-Point Arithmetic, Michael J. Schulte, Nick Lindberg, and Anitha Laxminarain, Proceedings of the 6th IBM Austin Center for Advanced Studies Conference, Austin, TX, 8pp, IBM, February 2005. Abstract: The prominence of decimal data in commercial and financial applications has led researchers to pursue efficient techniques for performing decimal floating-point arithmetic. While several software implementations of decimal floating-point arithmetic have been implemented, there is a growing need to provide hardware support for decimal floating-point arithmetic to keep up with the processing demands of emerging commercial and financial applications. This paper evaluates and compares the performance of decimal floating-point arithmetic operations when implemented on superscalar processors using either software libraries or specialized hardware designs. Our comparisons show that hardware implementations of decimal floating-point arithmetic operations are one to two orders of magnitude faster than software implementations.
wang2004	Decimal Floating-Point Division Using Newton-Raphson Iteration, Liang-Kai Wang and Michael J Schulte, Proceedings of the 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP’04), pp84–95, IEEE Computer Society Press, September 2004. Abstract: Decreasing feature sizes allow additional functionality to be added to future microprocessors to improve the performance of important application domains. As a result of rapid growth in financial, commercial, and Internet-based applications, hardware support for decimal floating-point arithmetic is now being considered by various computer manufacturers and specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic (IEEE-754R). This paper presents an efficient arithmetic algorithm and hardware design for decimal floating-point division. The design uses an optimized piecewise linear approximation, a modified Newton-Raphson iteration, a specialized rounding technique, and a simplified combined decimal incrementer/decrementer. Synthesis results show that a 64-bit (16-digit) implementation of the decimal divider, which is compliant with IEEE-754R, has an estimated critical path delay of 0.69 ns when implemented using LSI Logic’s 0.11 micron Gflx-P standard cell library.
wang2005	Decimal Floating-Point Square Root Using Newton-Raphson Iteration, Liang-Kai Wang and Michael J Schulte, Proceedings of the 16th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP’05), pp309–315, IEEE Computer Society Press, July 2005. Abstract: With continued reductions in feature size, additional functionality may be added to future microprocessors to boost the performance of important application domains. Due to growth in commercial, financial, and Internet-based applications, decimal floating point arithmetic is now attracting more attention, and hardware support for decimal operations is being considered by various computer manufacturers. In order to standardize decimal number formats and operations, specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic (IEEE-754R). This paper presents an efficient arithmetic algorithm and hardware design for decimal floating-point square root. This design uses an optimized piecewise linear approximation, a modified Newton-Raphson iteration, a specialized rounding technique, and a modified decimal multiplier. Synthesis results show that a 64-bit (16-digit) implementation of the decimal square root, which is compliant with the IEEE-754R, has an estimated critical path delay of 0.95 ns and maximum latency of 210 clock cycles when implemented using LSI Logic’s 0.11 micron Gflx-P Standard Cell library.
wang2007b	Benchmarks and Performance Analysis of Decimal Floating-Point Applications, Liang-Kai Wang, Charles Tsen, Michael J. Schulte, and Divya Jhalani, Proceedings of the IEEE International Conference on Computer Design 2007, pp164–170, IEEE, October 2007. Abstract: The IEEE P754 Draft Standard for Floating-point Arithmetic provides specifications for Decimal Floating-Point (DFP) formats and operations. Based on this standard, many developers will provide support for DFP calculations. We present a benchmark suite for DFP applications and use this suite to evaluate the performance of hardware and software DFP solutions. Our benchmarks include banking, commerce, risk-management, tax, and telephone billing applications organized into a suite of five macro benchmarks. In addition to developing our own applications, we leverage open-source projects and academic financial analysis applications. The benchmarks are modular, making them easy to adapt for different DFP solutions. We use the benchmarks to evaluate the performance of the decNumber DFP library and an extended version of the SimpleScalar PISA architecture with hardware and instruction set support for DFP operations. Our analysis shows that providing processor support for high-speed DFP operations significantly improves the performance of DFP applications.
you2006	Dynamic decimal adder circuit design by using the carry look ahead, Younggap You, Yong Dae Kim, and Jong Hwa Choi, IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems, 3pp, IEEE Computer Society, April 2006. Abstract: This paper presents a carry look ahead (CLA) circuitry design based on dynamic circuit aiming at delay reduction in addition of BCD coded decimal numbers. The performance of the proposed dynamic decimal adder is analyzed demonstrating its speed improvement. Timing simulation on the proposed decimal addition circuit employing 0.25µm CMOS technology yields the worst case delay of 622 ns.