Title: Optimizing Big Number Arithmetic Without SSE

Author: Martin Moene

Date: 02 February 2014 07:53:06 +00:00 or Sun, 02 February 2014 07:53:06 +00:00

Summary: Addition and multiplication can be slow for big numbers. Sergey Ignatchenko and Dmytro Ivanchykhin try to speed things up.

Body:

Disclaimer: as usual, the opinions within this article are those of â€˜No Bugsâ€™ Bunny, and do not necessarily coincide with the opinions of the translator or the Overload editor. Please also keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented providing an exact translation. In addition, both the translators and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

There is one thing which has puzzled me for a while, and it is the performance of programs written in C when it comes to big numbers. It may or may not help with the decades-long â€˜C vs Fortranâ€™ performance debate, but letâ€™s concentrate on one single and reasonably well-defined thing â€“ big number arithmetic in C and see if it can be improved.

In fact, there are very few things which gain from being rewritten in assembler (compared to C), but big number arithmetic is one of them, with relatively little progress in this direction over the years. Letâ€™s take a look at OpenSSL (a library which is among the most concerned about big number performance: 99% of SSL connections use RSA these days, and RSA_performance == Big_Number_Performance, and RSA is notoriously sslloooowww). Run:

  openssl speed rsa

(if youâ€™re on Windows, you may need to install OpenSSL first) and see what it shows. In most cases, at the end of the benchmark report it will show something like

compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i686 -mtune=atom -fasynchronous-unwind-tables -Wa,--noexecstack -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DRMD160_ASM -DAES_ASM -DWHIRLPOOL_ASM

Note those -DOPENSSL_BN_ASM in the output â€“ it means that OpenSSL prefers to use assembler for big number calculations. It was the case back in 1998, and it is still the case now (last time I checked, the difference between C and asm implementations was 2x, but it was long ago, so things may have easily changed since). But why should this be the case? Why with all the optimizations compilers are doing now, should such a common and trivial thing still need to be written in asm (which has its own drawbacks â€“ from the need to write it manually for each and every platform, to sub-optimality of generic asm when it comes to pipeline optimizations â€“ and hand-rewriting asm for each new CPU -march/-mtune is not realistic)? If it can perform in asm but cannot perform in C, it means that all hardware support is present, but the performance is lost somewhere in between C developer and generated binary code; in short â€“ the compiler cannot produce good code. This article will try to perform some analysis of this phenomenon.

The problem

For the purposes of this article, letâ€™s restrict our analysis to the two most basic operations in big-number arithmetic: addition and multiplication. Also in this article we wonâ€™t go into SSE/SSE2, concentrating on much simpler things which still can provide substantial performance gains. The reason why I decided to stay away from SSE/SSE2/... is because while the whole area of vectorization has been extensively studied recently, it seems that people have forgotten about the good old plain instruction set, which is still capable of delivering good performance; combined with the fact that SSE has its own issues which make it not so optimal in certain scenarios, and another suggestion that the most optimal implementation would probably parallelise execution between the SSE engine and the good old plain instruction set.

Mind the gap

When Iâ€™m speaking about addition: sure, C/C++ has operator +. However, to add, say, two 128-bit numbers, we need to do more than just repeating twice an addition of two uint64_tâ€™s in the C/C++ sense, we also need to obtain carry (65th bit of result) which may easily occur when adding lower 64-bit parts of our operands.

  uint128_t c = (uint128_t)a + (uint128_t)b;

can be rewritten as

  pair<bool,uint64_t> cLo =
    add65(lower64(a) + lower64(b));
  uint64_t cHi = higher64(a)+higher64(b)+cLo.first;

(assuming that lower64() returns low 64 bits, higher64() returns high 64 bits, and add65() returns a pair representing <carry flag,64_bits_of_sum>)

It is this carry flag which is causing all the trouble with additions. In x64 assembler, this whole thing can be written as something like

  add reg_holding_lower64_of_a,
    reg_holding_lower64_of_b
  adc reg_holding_higher64_of_a,
    reg_holding_higher64_of_b

â€“ and thatâ€™s it. However, as in C there is no such thing as a â€˜carry flagâ€™, developers are doing all kinds of stuff to simulate it, usually with pretty bad results for performance.

With multiplication it is also not that easy, and for a similar reason. In mathematics, when multiplying 64 bits by 64 bits, the result is 128-bit long. In assembler (at least in x64, and also probably in ARM) the result is also 128-bit long. However, in C, the result of 64bit-by-64-bit multiplication is 64-bit long, which, when working with big number arithmetic, leads to four-fold (!) increase in number of multiplications. Ouch.

It should be noted that at least on x64 there is an instruction which multiplies 64 bits by 64 bits and returns lower 64 bits of the result, and this instruction is somewhat faster than full-scale 64-bit-by-64-bit-returning-128-bit multiplication. Still, as weâ€™ll see, it is not enough to compensate for the four-fold increase in number of multiplications described above.

Overall, it looks that there is a substantial gap between the capabilities of the hardware and the capabilities of the C. The whole thing seems to be one incarnation of the Over-Generic Use of Abstractions [NoBugs11] which, as it often happens, leads to a waste of resources.¹ However, the question is not exactly â€œWho is to blame?â€ [Herzen45], but â€œWhat Is to Be Done?â€ [Chernyshevsky63].

So far, so bad

So, what can be done in this regard? Iâ€™ve tried to perform some very basic experiments (based on things which were rather well-known some time ago, but apparently forgotten now) trying to implement uint256_t keeping in mind considerations outlined above, and to compare its performance with the performance of boost::uint256_t under recent versions of both MSVC and GCC (see the â€˜Benchmarkâ€™ section for details). The results Iâ€™ve managed to obtain show between 1.8x and 6.3x improvements over boost::uint256_t (both for addition and for multiplication), which, as I feel, indicate a step in the right direction.

Addition

To simplify the description, Iâ€™ll use uint128_t as an example; however, the same approach can be easily generalized to any other types (including uint256_t which Iâ€™ve used for benchmarks).

One thing which Iâ€™ve tried to do to deal with addition-with-carry, was:

struct myuint128_t { uint64_t lo; uint64_t hi; };
inline my_uint128_t add128(my_uint128_t a,
                           my_uint128_t b) {
  my_uint128_t ret;
  ret.cLo = a.lo + b.lo;
  ret.cHi = a.hi + b.hi + (ret.cLo<b.lo);
  return ret;
}

This is mathematically equivalent to the 128-bit addition, and it (with subtle differences depending on compiler, etc.) indeed compiles into something like:

  mov rax, a.lo
  
  add rax, b.lo //got cLo in rax
  cmp rax, b.lo //setting carry flag to
                //â€œret.cLo < b.loâ€ mov rbx, a.hi
  sbb rdx,rdx   //!
  neg rdx       //!
  add rbx, rdx
  add rbx, b.hi

Compilers could do significantly better if they would remove two lines marked with (!), and replace the add in the next line with adc (lines marked with '!' move carry to rdx, which is then added to addition of rbx, rdx â€“ ad will do the same thing and faster). Actually, compilers can go further and optimize away cmp rax, b.lo too (on the grounds that carry flag has already been set in the previous line). After these optimizations, the code would look as follows:

  mov rax, a.lo
  add rax, b.lo //got cLo in rax
  mov rbx, a.hi
  adc rbx, b.hi

This looks as a perfect asm for our 128-bit addition, doesnâ€™t it? And from this point, compiler can optimize it from a pipelining optimization point of view to its heartâ€™s content.

It is interesting to note that despite all the unnecessary stuff, our C still performs reasonably well compared to boost::.

boost:: uses an alternative approach, splitting uint128_t into four uint32_t â€™s and performing four additions; as my tests show, it seems to be slower (and I also feel that optimizing my code should be easier for compiler writers â€“ as outlined above, only two rather relatively minor optimizations are necessary to speed multi-word addition to the â€˜idealâ€™ level).

A word of caution: while this â€œa.hi + b.hi + (cLo<b.lo)â€ technique seems to be handled reasonably optimally by most modern compilers, when trying to compile my code which uses uint64_t for 32-bit platforms, with both GCC and MSVC, they tend to generate very ugly code (with conditional jumps, pipeline stalls etc.); while I didnâ€™t see such behaviour in any other scenario â€“ I cannot be 100% sure that it is the only case when the compiler goes crazy. When compiling for 32-bit platforms, one may use similar technique, but using natively supported uint32_tâ€™s instead of simulated uint64_tâ€™s.

Multiplication

With multiplication, things tend to get significantly more complicated. To write 64-by-64 multiplication in bare C (and without uint128_t which is not natively supported by most of the compilers), one would need to write something like Listing 1.

inline my_uint128_t mul64x64to128(uint64_t a,
                                  uint64_t b) {
  my_uint128_t ret;
  uint64_t t0 = ((uint64_t)(uint32_t)a) *
                ((uint64_t)(uint32_t)b);
  uint64_t t1 = (a>>32)*((uint64_t)(uint32_t)b) +
                (t0>>32);
  uint64_t t2 = (b>>32)*((uint64_t)(uint32_t)a) +
                ((uint32_t)t1);
  ret.lo = (((uint64_t)((uint32_t)t2))<<32) +
                ((uint32_t)t0);
  ret.hi = (a>>32) * (b>>32);
  ret.hi += (t2>>32) + (t1>>32);
  return ret;
}

Listing 1

However, in most cases we can access 64-bit-by-64-bit-returning-128-bit multiplication instruction (which, as Iâ€™ve seen, is available on most CPUs in modern use). In particular, on MSVC there is an â€˜intrinsicâ€™ _umul128, which does exactly what we need. Then, our multiplication of uint64_tâ€™s can be rewritten as:

  inline my_uint128_t mul64x64to128(uint64_t a,
                                    uint64_t b) {
    my_uint128_t ret;
    ret.lo = _umul128(a, b, &(ret.hi));
    return ret;
  }

In fact, a pure-C/C++ (without intrinsics) implementation uses 6 â€˜type Aâ€™ (64-bit-by-64-bit-returning-64-bit) multiplications, and an intrinsic-based one uses 3 multiplications, two being â€˜type Aâ€™ and one being â€˜type Bâ€™ (64-bit-by-64-bit-returning-128-bit). However, the difference in speed between the two is less than 2x, and I tend to attribute it to â€˜type Aâ€™ multiplications being faster than â€˜type Bâ€™ ones, at least on x64 CPUs Iâ€™ve tested my programs on. Still, intrinsic-based version looks significantly faster (than both my own pure-C implementation, and than boost:: pure-C implementation).

Under GCC, there is no such intrinsic, but it can be simulated using the following 5-line GCC inline asm:

  void multiply(uint64_t& rhi, uint64_t& rlo,
                uint64_t a, uint64_t b)
  {
    __asm__(
    "    mulq  %[b]\n"
    :"=d"(rhi),"=a"(rlo)
    :"1"(a),[b]"rm"(b));
  }

This one (from [Crypto++]) works, too:

  #define MultiplyWordsLoHi(p0, p1, a, b)  asm  \
  ("mulq %3" : "=a"(p0), "=d"(p1) : "a"(a),     \
  "r"(b) : "cc");

Benchmarks

Benchmark results (benchmarking was performed for 256-bit integers, but the same approach can be generalized to big integers of any size; also, where applicable, boost version 1.55.0 has been used):

	Visual Studio 2013	GCC 4.8.1
Addition (`boost::uint256_t`)	0.0202 op/clock	0.0202 op/clock
Addition (`uint256_t` according to present article)	0.0366 op/clock	0.1282 op/clock
Advantage over `boost::uint256_t`	1.83x	6.34x
Multiplication (`boost::uint256_t`)	0.0160 op/clock	0.0197op/clock
Multiplication (`uint256_t` according to present article)	0.0285 op/clock	0.0549 op/clock
Advantage over `boost::uint256_t`	1.78x	2.79x

It should be noted that absolute â€˜op/clockâ€™ numbers should not be used for comparison between different computers (let alone comparison between different architectures/compilers).

Conclusion

Weâ€™ve taken a look at problems with big number arithmetic in C, and described some optimizations which may speed things up significantly, providing somewhere around 2x to 6x gains over the current boost:: implementation.

Also note that while all examples were provided for x64 asm, very similar problems (and very similar solutions) seem to apply also to x86 and to ARM (both these platforms have both â€œadd with carryâ€ and â€œmultiply_with_double_width_resultâ€ asm-level operations, and these two operations are the only things necessary to deal with the big number arithmetic reasonably efficiently).

References

[Chernyshevsky63] Nikolai Chernyshevsky, What Is to Be Done?, 1863

[Crypto++] http://www.cryptopp.com/docs/ref/integer_8cpp_source.html

[Herzen45] Alexander Herzen, Who is to Blame?, 1845â€“1846

[Loganberry04] David â€˜Loganberryâ€™, Frithaes! â€“ an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[NoBugs11] Sergey Ignatchenko. â€˜Over-generic use of abstractions as a major cause of wasting resourcesâ€™. Overload, August 2011

Acknowledgement

Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Here an abstraction, which I see as being over-generic, is an abstraction of C operators, which often return the result of the same size as the size of its arguments

Notes:

More fields may be available via dynamicdata ..

Journal Articles